A Text Categorization Algorithm Based on Sense Group

Giving further consideration to linguistic features, this study proposes a Chinese text categorization algorithm based on sense groups. The algorithm extracts sense groups by analyzing the syntactic and semantic properties of Chinese texts and builds a category sense-group library. A Support Vector Machine (SVM) is used in the text categorization experiments. The experimental results show that the precision and recall of the new sense-group-based algorithm are better than those of traditional algorithms.


INTRODUCTION
Text categorization is an automatic process that assigns a free text to one or more predefined classes or categories based on its content. Compared with English text categorization, the study of Chinese text categorization started later, and Chinese text categorization mostly makes use of the algorithms developed for English text categorization. As far as the internal structure of language is concerned, English is a hypotactic language, whereas Chinese is a paratactic language. However, the current approaches to Chinese text categorization do not involve syntactic and semantic analysis and often perform extraction and matching at the word level, which results in low categorization accuracy (Dai et al., 2004).
A sense group, in the narrow sense, refers to a meaningful unit of words that are structurally and semantically related to each other. In the wider sense, a sense group is a combination of associated concepts. More accurate identification of sense groups means easier subject identification (Zhou et al., 2004).
To overcome the defects of the existing Chinese text categorization algorithms, this study considers the features of the Chinese language and, using the semantic dependency parsing put forward by Li et al. (2004), proposes a text categorization algorithm based on sense groups. The sense-group-based algorithm trains on the corpus according to syntactic and semantic features and builds a category sense-group library. With the sense group as the unit of categorization, the sense groups of the text to be categorized are extracted, and the categorization attributes of the text are then obtained using a Support Vector Machine (SVM). Because sense-group extraction considers the syntactic and semantic features of the Chinese language, the text is represented in a way that is better adapted to the human mode of thinking. Thus, the meaning of the text is better grasped by the computer, which can perform text categorization with higher precision.

SENSE GROUP AND TEXT CATEGORIZATION ALGORITHM
A sense group is a meaningful unit of words that are structurally and semantically related to each other. By combining a group of concepts according to a certain association, a sense group represents the intended meaning through a cluster of such interrelated units of words (Chen and Martha, 2004). How to acquire the concepts, represent the sense groups and build the category sense-group library are the key processes in the sense-group-based text categorization algorithm. The flowchart of the establishment of the category sense-group library is shown in Fig. 1.
First, ICTCLAS30 (Zhang and Liu, 2002) is used to process the training texts in the corpus for word segmentation and part-of-speech tagging. The obtained results are then subjected to syntactic analysis. Based on the rules of Chinese syntactic understanding, the weights of the words in the clauses are assigned depending on the importance of the clauses to the text. Suppose the training text is $T_i$; after syntactic analysis, the text is represented by its words $d_i$ together with their weights $w_i$.
Semantic hierarchy analysis is then performed on the weighted results of syntactic analysis. Considering semantic dependency in Chinese sentences (Li et al., 2004), the text hierarchy is divided using semantic and structural analysis. Each hierarchy represents a sense group, so the text becomes a set of sense groups, $T_i = \{C_1, C_2, \ldots, C_k\}$ (Formula (1)), where $C_i$ is a sense group. Each sense group contains words and the corresponding weights of the words.
Part-of-speech choice is performed on the words according to their contribution to text categorization. For the effective words that remain after part-of-speech choice, the concepts are extracted based on HowNet semantics and the words are mapped into the concept space. The sense group is then a set of weighted concepts, $C_i = \{(t_1, w_1), (t_2, w_2), \ldots\}$, and the text is a set of such sense groups (Formulas (2) and (3)), where $w_i$ is the weight of concept $t_i$. The first 20 concepts, ranked by weight, are selected as effective concepts. After this dimensionality reduction of the sense groups, the texts are stored by text category and the category sense-group library is obtained. Suppose the category sense-group library is SGC; then $SGC = \{C_1, C_2, \ldots, C_r\}$, where $C_i$ is a sense group and each sense group $C_i$ has $n$ characteristic values indexed by $j = 1, 2, \ldots$, as shown in Formula (4).
With the category sense-group library obtained, the same procedures of sense-group extraction are repeated for the text $T_i$ to be categorized, and its sense groups are represented by vectors. An appropriate categorization algorithm is then applied between the text to be categorized and the category sense-group library. A quantitative relation that can be recognized by the computer is used to determine the subject category. That is, a mapping relation $f$ is identified so that every text $T_i$ to be categorized is assigned a category. In this way, the text is categorized automatically, which saves much of the time required by manual categorization and facilitates information processing and collection.
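The library-building step described above (keep the 20 highest-weighted concepts of each sense group, then store the sense groups by category) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `top_concepts` and `build_library`, the dictionary representation of a sense group and the tie-breaking rule are all assumptions made for the example.

```python
from collections import defaultdict


def top_concepts(concept_weights, k=20):
    """Keep the k highest-weighted concepts of one sense group.

    concept_weights maps a concept to its weight. Ties are broken
    alphabetically, which is an assumption of this sketch.
    """
    ranked = sorted(concept_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def build_library(training_texts, k=20):
    """Group the reduced sense groups by text category.

    training_texts is an iterable of (category, sense_groups) pairs,
    where sense_groups is a list of concept -> weight dictionaries.
    """
    library = defaultdict(list)
    for category, sense_groups in training_texts:
        for group in sense_groups:
            library[category].append(top_concepts(group, k))
    return dict(library)
```

For example, `build_library([("sports", [{"ball": 3.0, "team": 2.0, "the": 0.5}])], k=2)` keeps only `ball` and `team` for the single sense group of the `sports` category.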

Syntactic analysis module:
It is theoretically believed that the varieties of sentences are infinite, but the types of sentences are finite. Any sentence can be classified as a basic sentence type or a combination of basic types. The major task of the syntactic analysis module is to analyze the structure of each sentence and to identify its sentence type. Weights are assigned to sentences according to the position of the sentence in the text, the degree of influence of its syntax on the general idea of the text and the key points for understanding the clauses contained in a complex sentence. This process is crucial for the selection of concept features when establishing the category sense-group library.
Sentence type classification: Automatic chunk segmentation is used to classify sentence types as well as to extract and identify the syntactic structure and the functional structure on the higher layer. The existing automatic chunk segmentation defines the chunk category from the perspective of syntactic concepts (Li et al., 2003). By incorporating semantic concepts into the definition of syntactic chunks, the grammatical rules are refined and the collocation of the structures is constrained. In this way, grammar and semantics are closely related. In this study, the chunk category is divided into two layers, namely phrase elements, which are grammatical, and functional elements, which are semantic. Phrase elements include the common phrase types, such as Adjective Phrase (ADJP), Adverb Phrase (ADVP) and Location phrase.
The algorithm for sentence type identification adopts a strategy based on rule matching. Guided by the syntactic rules of the predicate knowledge base and by linguistic statistics, sentence type matching is performed on sentence stems. The difficulty in sentence type classification is the identification of complex sentences. We use automatic chunk segmentation to divide Chinese complex sentences into 9 categories: coordinate, explanatory, successive, progressive, selective, transitional, conditional, cause-and-effect and purposive complex sentences (Wen et al., 2008). Depending on the contribution of complex sentences to text understanding, we assign variable weights to complex sentences.
Weight assignment: Generally speaking, the title of the text best reflects the text category, so the highest position weight is assigned to the title.
After sentence type classification, the text is composed of complex sentences and simple sentences. In light of the degree of influence of sentence structure and the rules for understanding Chinese complex sentences, we assign different weights to the complex sentences (Table 1), where $\lambda_0$ to $\lambda_3$ denote weight parameters. The weight of a word is the weight of the clause in which the word is located. For words that occur repeatedly, the weight is increased by 1. The relevant parameters can be configured in the experiment.

| Sentence type | Weight assignment rule | Weights |
| --- | --- | --- |
| Coordinate complex sentence | For a contrastive complex sentence, increased weight is assigned to the second clause and reduced weight to the first clause; the weight difference should not exceed 0.5. For a compound sentence, the weights of the two clauses are equal. | Contrastive: first clause $\lambda_0$; second clause $\lambda_0 + 0.5$ |
| Progressive, transitional, conditional, cause-and-effect, purposive complex sentence | Increased weight is assigned to the second clause, whereas reduced weight is assigned to the first clause. The weight difference is 0.5. | First clause $\lambda_1$; second clause $\lambda_1 + 0.5$ |
| Explanatory complex sentence | Increased weight is assigned to the explanatory clause; the weight of the remaining part is assigned according to the rules for ordinary sentences. | Explanatory clause $\lambda_2$ ($\lambda_2 > 1$) |
| Successive complex sentence | Increased weight is assigned to the last clause; the weight of the remaining part is assigned according to the rules for ordinary sentences. | Last clause $\lambda_3$ ($\lambda_3 > 1$) |
| Selective complex sentence | Equal weights are assigned to the first and the second clauses. | No weight adjustment |
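The weighting rules for complex sentences can be illustrated with a short sketch. It is only one possible reading of the rules: the function names, the default parameter values (`base=1.0`, `boost=0.5`, `emphasis=1.5`) and the choice to leave the first clause at the base weight are assumptions made for the example, and the paper notes that the actual parameters are configured in the experiment.

```python
# Sentence types whose second clause receives the larger weight.
SECOND_CLAUSE_HEAVIER = {
    "contrastive", "progressive", "transitional",
    "conditional", "cause-and-effect", "purposive",
}


def clause_weights(sentence_type, n_clauses, base=1.0, boost=0.5,
                   emphasis=1.5, explanatory_index=None):
    """Return one weight per clause of a complex sentence.

    base, boost and emphasis are hypothetical parameter values.
    """
    weights = [base] * n_clauses
    if n_clauses < 2:
        return weights
    if sentence_type in SECOND_CLAUSE_HEAVIER:
        weights[1] = base + boost          # weight difference of 0.5
    elif sentence_type == "successive":
        weights[-1] = emphasis             # last clause is emphasized
    elif sentence_type == "explanatory" and explanatory_index is not None:
        weights[explanatory_index] = emphasis
    # Selective and plain coordinate sentences keep equal weights.
    return weights


def word_weights(clauses, weights):
    """Give each word the weight of its clause; a repeated word gains 1.

    clauses is a list of word lists, aligned with weights.
    """
    result = {}
    for words, clause_weight in zip(clauses, weights):
        for word in words:
            if word in result:
                result[word] += 1.0        # repeated occurrence
            else:
                result[word] = clause_weight
    return result
```

With these defaults, a two-clause transitional sentence receives the clause weights `[1.0, 1.5]`, and a word that appears in both clauses ends with its first clause weight plus 1.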

Semantic analysis module:
The semantic analysis module is composed of three steps: semantic hierarchy analysis, part-of-speech choice and concept mapping.
Semantic hierarchy analysis uses a statistical semantic analyzer to identify semantic dependency in Chinese sentences (Li et al., 2004). The text hierarchies are divided based on semantic dependency, with each hierarchy forming a sense group, so that the text is composed of the divided sense groups. The elements of a sense group are words and the corresponding weights of the words (as shown in Formula (1)).
The theoretical basis for part-of-speech choice comes from analyzing the results of a large number of text categorization tasks. The general idea of a text is represented by notional words such as verbs, nouns and adjectives; function words, together with high-frequency words that appear in all kinds of texts, are of no use in categorization. Thus, the function words and the high-frequency stop words are filtered out of the sense groups. The dimensionality of the vector of characteristic values is thereby reduced, which saves computation time (Xu et al., 2008).
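Part-of-speech choice can be sketched as a simple filter over tagged words. The sketch assumes that the tagger marks nouns, verbs and adjectives with tags beginning with `n`, `v` and `a`; the function name `filter_sense_group`, the triple format of the input and the stop-word set are hypothetical and only serve the illustration.

```python
# Tag prefixes treated as notional words (nouns, verbs, adjectives).
# The prefixes are an assumption about the tag set in use.
NOTIONAL_PREFIXES = ("n", "v", "a")


def filter_sense_group(tagged_words, stop_words=frozenset()):
    """Keep notional words that are not high-frequency stop words.

    tagged_words is a list of (word, pos_tag, weight) triples.
    Returns a word -> weight dictionary for the effective words.
    """
    return {
        word: weight
        for word, tag, weight in tagged_words
        if tag.startswith(NOTIONAL_PREFIXES) and word not in stop_words
    }
```

A word tagged as an auxiliary or a preposition is dropped by its tag, while a very frequent verb can still be removed through `stop_words`.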
After the first two steps are completed, the sense groups comprising effective words are obtained. Concept mapping summarizes the semantic information of words as concepts, which effectively removes the adverse impact of synonyms and near-synonyms on categorization. Concept mapping, in combination with HowNet semantics, extracts the DEF description information of words and represents the words as concepts. The sense group then takes the form of Formula (2). Thus, concept extraction and the representation of sense groups are accomplished.
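The effect of concept mapping on synonyms can be illustrated with a small sketch. A plain dictionary stands in for the HowNet DEF lookup, the concept labels are invented for the example, and summing the weights of words that share a concept is an assumption, since the combination rule is not stated here.

```python
def map_to_concepts(word_weights, concept_of):
    """Map weighted words to weighted concepts.

    concept_of is a hypothetical word -> concept dictionary standing in
    for a HowNet DEF lookup. Words without an entry are kept as their own
    concept. Weights of words sharing a concept are summed (an assumption).
    """
    concepts = {}
    for word, weight in word_weights.items():
        concept = concept_of.get(word, word)
        concepts[concept] = concepts.get(concept, 0.0) + weight
    return concepts
```

For instance, if `doctor` and `physician` both map to the same concept, they contribute to a single feature instead of two separate ones.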

EXPERIMENTAL RESULT ANALYSIS
Experimental assessment approach: In the study of text categorization based on sense groups, the categorization is assessed mainly from three aspects: precision (accuracy rate), recall rate and the F1 test value.
Precision is the ratio of the number of texts correctly categorized by the categorization approach to the total number of categorized texts, as shown in Formula (5):
$$\text{Precision} = \frac{\text{number of correctly categorized texts}}{\text{total number of categorized texts}} \tag{5}$$
Recall rate is the ratio of the number of texts correctly categorized by the categorization method to the total number of texts that should be categorized, as shown in Formula (6):
$$\text{Recall} = \frac{\text{number of correctly categorized texts}}{\text{total number of texts that should be categorized}} \tag{6}$$
The F1 test value comprehensively considers both the accuracy rate and the recall rate, as shown in Formula (7):
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7}$$
In the experiment, the position parameters of sentences are set as follows: the weight of the first sentence is 2 and the weights of the remaining sentences are 1. The experimental results are listed in Table 2.
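The three assessment measures can be computed per category as in the following sketch. The function name and the list-based input format are assumptions made for illustration; F1 is taken in its standard form as the harmonic mean of precision and recall.

```python
def precision_recall_f1(predicted, actual, category):
    """Compute precision, recall and F1 for one category.

    predicted and actual are equal-length lists of category labels,
    one entry per test text.
    """
    assigned = sum(1 for p in predicted if p == category)
    relevant = sum(1 for a in actual if a == category)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a == category)

    precision = correct / assigned if assigned else 0.0
    recall = correct / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

If three texts are assigned to a category, two of them correctly, and exactly two texts belong to that category, the precision is 2/3, the recall is 1 and F1 is 0.8.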
From the comparison in Table 2, we can see that the Chinese text categorization algorithm based on sense groups has higher test values. The new algorithm focuses on the features of the Chinese language and combines syntactic and semantic analysis, which yields an increased accuracy rate and recall rate of text categorization. However, the accuracy rate is affected by the lack of distinctiveness in some categories. Generally, the categories with distinctive contents have a higher accuracy rate.

CONCLUSION
The Chinese text categorization algorithm based on sense groups considers the structural difference between the Chinese and English languages. According to the rules of Chinese grammar and semantics, we analyze the training texts and extract their sense groups to build the category sense-group library. SVM is used in the text categorization experiments. The sense-group-based algorithm is better adapted to the process of understanding natural language and represents texts more accurately, so the computer can better understand the contents of the texts. Compared with the conventional categorization approach, the experimental results show that the new algorithm based on sense groups has higher precision and therefore higher application value.

Fig. 1: Flowchart of the establishment of category sense-group library ($d_i$: a word; $w_i$: the weight of the word)

Table 1: Weight assignment to complex sentences

Table 2: Categorization result comparison