A Dempster-Shafer Model for Feature Selection in Text Categorization

In this study, we propose a feature selection method for text categorization based on an evidence-theoretic model. The proposed model is formally expressed within the Dempster-Shafer Theory of Evidence. We discuss how the theory is used to retrieve highly informative and relevant features from the document collection. The formal retrieval function is derived from this model, and the proposed model is compared with several conventional feature selection methods. Experimental evaluation on standard benchmark datasets shows the effectiveness of the proposed method.


INTRODUCTION
Feature selection methods address the problem of retrieving relevant features from a document collection in order to represent documents for categorization (Sebastiani, 2002). In this study, we concentrate on developing a novel feature selection method for text-based categorization systems. Although traditional feature selection methods retrieve features from a document collection to some extent, they are not capable of retrieving all potentially useful features. Hence, the improvement they bring to classifier effectiveness is limited.
Combinations of traditional feature selection techniques used in text categorization (TC) attempt to overcome such shortcomings (Del Castillo and Serrano, 2004; Doan and Horiguchi, 2004). These combination techniques have proven successful in improving classifier performance substantially. They aim to extract potentially useful features, which are then used to represent documents, where features can take on various linguistic forms.
In this study, we use some of the most widely used feature selection techniques as sources of evidence from which our proposed model retrieves highly relevant features. Our model is based on the Dempster-Shafer (D-S) Theory of Evidence (Shafer, 1976), a mathematical theory that deals with the uncertainty associated with available evidence (a set of hypotheses and their associated beliefs). The evidence here is the set of features generated by the conventional feature selection methods.

DEMPSTER-SHAFER'S THEORY OF EVIDENCE
Dempster-Shafer theory, also known as the theory of evidence, is a flexible framework for representing and reasoning with imprecise and uncertain data (Wang and David, 2004). We first describe the measures used in our proposed model. Let Ω be a finite non-empty set of mutually exclusive and exhaustive events. The set Ω is called a frame of discernment. Let 2^Ω be the set of all subsets of Ω, including the empty set Ø and Ω itself. Given a frame of discernment Ω, a function m: 2^Ω → [0, 1] is called a basic probability assignment (bpa) if it satisfies the following:

$$m(\emptyset) = 0, \qquad \sum_{A \subseteq \Omega} m(A) = 1 \tag{1}$$

The bpa represents a source of evidence supporting various subsets A in 2^Ω with value, or "degree of support", m(A). The subsets A of Ω such that m(A) > 0 are called focal elements. Given a bpa m: 2^Ω → [0, 1], the belief function Bel: 2^Ω → [0, 1] over Ω is defined as:

$$Bel(A) = \sum_{B \subseteq A} m(B) \tag{2}$$

The measure Bel(A) quantifies the total belief committed to A, that is, the mass of A together with the mass of all of its subsets. In contrast, m(A) quantifies the belief committed exactly to A and to none of its proper subsets. Unlike in probability theory, a salient characteristic of evidence theory is that belief in a particular hypothesis does not imply that the remaining belief is assigned to the negation of that hypothesis. Hence, when no further evidence is available regarding the negation of the hypothesis, the remaining belief is assigned to the entire frame of discernment (all the possible hypotheses), which represents the uncommitted belief or total ignorance (Smets and Kennes, 1994).
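The two definitions above can be illustrated with a minimal Python sketch. It is not part of the proposed model: the frame, the masses and the function names are assumptions chosen for illustration. A bpa is stored as a dictionary from subsets of Ω to masses, and Bel(A) is obtained by summing the masses of the focal elements contained in A.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]

def is_valid_bpa(m: Mass, tol: float = 1e-9) -> bool:
    """Check m(empty set) = 0, non-negative masses, and total mass of one."""
    if m.get(frozenset(), 0.0) != 0.0:
        return False
    if any(v < 0.0 for v in m.values()):
        return False
    return abs(sum(m.values()) - 1.0) <= tol

def belief(m: Mass, a: FrozenSet[str]) -> float:
    """Bel(A): sum of m(B) over all focal elements B that are subsets of A."""
    return sum(v for b, v in m.items() if v > 0.0 and b <= a)

# Hypothetical frame and masses, chosen only for illustration.
omega = frozenset({"x", "y", "z"})
m = {
    frozenset({"x"}): 0.5,
    frozenset({"x", "y"}): 0.25,
    omega: 0.25,  # uncommitted belief (total ignorance)
}

bel_xy = belief(m, frozenset({"x", "y"}))  # masses of {x} and {x, y}
bel_z = belief(m, frozenset({"z"}))        # no focal element lies inside {z}
```

In this toy case Bel({x, y}) = 0.75 while Bel({z}) = 0, so the belief in a hypothesis and the belief in its negation do not sum to one; the remaining 0.25 stays on Ω as uncommitted belief.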

DESCRIPTION OF THE MODEL
This section presents the basic intuition on which our proposed model is built. We first describe the indexing features, which act as the basic building blocks of text representation, and then illustrate how the document collection is represented as a frame of discernment. Next, we describe how features are represented within the frame. Finally, the feature retrieval rule is derived.

Indexing features:
Before a document can be processed by the classifier, it has to be converted into a meaningful representation of its content (Sebastiani, 2002). Here the conversion is performed using the standard information retrieval technique known as the bag-of-words approach, in which every document is represented as the group of words retrieved from that document.
We construct this representation based on word content only. Therefore, our approach ignores word order and syntactic phrases in documents, thus treating every word equally. The purpose of our study is to improve classifier effectiveness by extracting highly informative linguistic structures, and to use them to construct a more meaningful representation of documents.
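A bag-of-words representation of this kind can be sketched in a few lines of Python. The tokenization rule (lowercasing and keeping alphabetic runs) and the sample sentence are assumptions made for illustration; the pre-processing actually used in the study is not specified here.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Invented sample sentence.
doc = "Profit and loss: the merger raised profit."
bow = bag_of_words(doc)
```

Because only counts are kept, two documents containing the same words in a different order receive the same representation.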

Frame of discernment:
To give an insight into this model, we define some terminology regarding the document collection and its associated features.
Definition 1: Let C = {D_1, D_2, …, D_N} be the document collection, where N is the number of documents. Let S = {s_1, s_2, …, s_M} be the set of terms that remains after all pre-processing tasks have been applied to the document collection C, where M is the total number of single terms in the document collection.
Given a document collection C, we take the frame of discernment to be the set S itself. The elements of the frame are then defined as mutually exclusive hypotheses derived from the power set of S.
Definition 2: For the set of single terms S = {s_1, s_2, …, s_M} of a document collection C, all 2^M subsets of S can be obtained from the terms s ∈ S. These subsets represent the elementary hypotheses of the constructed frame, so the number of constructed elementary hypotheses is 2^M.
Conventional feature selection metrics (Rogati and Yang, 2002) include information gain, chi-square and odds ratio. The resemblance of a feature group to the feature set generated by such a metric is modeled as evidence through which our model selects highly relevant features.
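The subset construction in Definition 2 can be sketched as follows. The three terms and the function name are illustrative assumptions; the snippet only demonstrates that a set of M terms yields 2^M subsets.

```python
from itertools import chain, combinations
from typing import FrozenSet, Iterable, List

def power_set(terms: Iterable[str]) -> List[FrozenSet[str]]:
    """Return every subset of `terms`, from the empty set up to the full set."""
    items = sorted(set(terms))
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]

# Three illustrative terms; any small set would do.
s = {"Acquire", "Loss", "Stake"}
hypotheses = power_set(s)
```

Since the count doubles with every added term, enumerating the full power set is feasible only for very small S; for a realistic vocabulary only the subsets that actually receive evidence can be stored.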

Focal and informative elements:
In the D-S theory of evidence, an element with associated positive evidence is considered a focal element. Hence, sets of focal elements can be grouped together as feature groups modeling the informative representation of a document collection. Given a document D_i ∈ C, these focal elements are defined upon the set S_i.
Definition 3: Every subset S_i of S of a document collection C defines a focal element, e.g., the hypothesis h_j. Furthermore, every super group S_g ⊇ S_i also defines a focal element, the hypothesis h_k = ∪ h_l, where each h_l is the hypothesis associated with a single subset S_i of S. Θ_i is defined as the set that includes all the feature groups representing subsets and super groups of the document D_i.
Example 2: Let D_1 be the document with S_1 = {Acquire, Loss, Stake, Merger, Share, Profit}. The feature groups of D_1 each contain a partial subset of the features of S_1, and these features may belong to the feature sets generated by conventional feature selection methods such as Information Gain (IG), Chi-Square (CHI) or Odds Ratio (OR). The feature groups modeling informative elements must be defined in terms of the elementary subgroups defining the frame of discernment.
Definition 4: A feature group f is represented as a super group, the union of the elementary subgroups e_j it contains:

$$f = \bigcup_{e_j \subseteq f} e_j \tag{3}$$

Example 3: The feature group f_IG+OR in Example 2 is defined in terms of the elementary subgroups defined in Example 1 as e_3 ∪ e_4. Similarly, the hypothesis f_CHI is defined as e_1 ∪ e_2.
The frame of discernment, along with the feature groups modeling the informative elements of the document D_1, is shown schematically in Fig. 1. These feature groups overlap in some regions, and the features in an overlapping region play an important role in representing the semantics of the document. Some feature groups have stronger evidence than others; in D-S theory this is represented by a bpa.

Basic probability assignment:
A bpa must be defined for every feature in the document collection C to capture the exact belief that the various feature groups (focal elements) provide for a good representation of the document content. We compute the bpa values from the statistical characteristics of terms in documents. The bpa formula considered has two cases: the first (h_j ∈ S) assigns a positive bpa value to each hypothesis representing an indexing element of S_i, and the second (h_j ∉ S) assigns 0 to all remaining hypotheses. The logarithmic value used in the formula ensures that the total bpa value is always one.
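The following is only a sketch under stated assumptions, not the authors' formula. It weights each term of a document with a common tf-idf scheme (term frequency times the logarithm of training-set size over document frequency) and then divides by the total so that the masses sum to one. The explicit normalization step, the function name and the toy data are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def tfidf_bpa(doc: List[str], training_docs: List[List[str]]) -> Dict[str, float]:
    """Assign each term of `doc` a mass proportional to its tf-idf weight.

    Assumption: masses are normalized explicitly so that they sum to one.
    Terms that do not occur in `doc` implicitly receive mass 0.
    """
    n_train = len(training_docs)
    tf = Counter(doc)
    weights = {}
    for term, count in tf.items():
        # Number of training documents containing the term at least once.
        df = sum(1 for d in training_docs if term in d)
        if df == 0:
            continue  # term unseen in the training data: no evidence
        weights[term] = count * math.log(n_train / df)
    total = sum(weights.values())
    if total == 0.0:
        return {}
    return {term: w / total for term, w in weights.items()}

# Toy tokenized documents, invented for illustration only.
train = [
    ["profit", "loss", "share"],
    ["merger", "stake", "share"],
    ["profit", "share", "merger"],
]
masses = tfidf_bpa(["profit", "profit", "stake", "share"], train)
```

In this toy collection "share" occurs in every training document, so its logarithmic factor is zero and it receives no mass, whereas the rarer term "stake" receives more mass than "profit" despite occurring only once.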

Feature retrieval:
To estimate the degree of relevance of each feature term to the semantics of the document, we use the belief function of the D-S theory. Each feature f_i with bpa m_i has an associated belief function Bel_i defined upon m_i. The degree of relevance of the feature term to a document, represented by the hypothesis q, is formulated as:

$$Bel_i(q) = \sum_{h_j \subseteq q} m_i(h_j) \tag{5}$$

This measure encapsulates the evidence of all the feature groups used to describe the document content that imply the hypothesis q. If Bel_i(q) = 0, the feature has no relevance to the semantics of the document. For a document collection, we use the belief values Bel_i(q) to rank the features according to their estimated relevance to the semantics of the document.
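Ranking by belief can be sketched as below. This is an illustration rather than the authors' implementation: the masses, the candidate hypotheses and the function names are hypothetical, and how a hypothesis q is formed for a given feature is left open here.

```python
from typing import Dict, FrozenSet, List, Tuple

def belief(m: Dict[FrozenSet[str], float], q: FrozenSet[str]) -> float:
    """Bel(q): total mass of the focal elements that are subsets of q."""
    return sum(v for h, v in m.items() if v > 0.0 and h <= q)

def rank_hypotheses(
    m: Dict[FrozenSet[str], float], candidates: List[FrozenSet[str]]
) -> List[Tuple[FrozenSet[str], float]]:
    """Sort candidate hypotheses by decreasing belief."""
    scored = [(q, belief(m, q)) for q in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical masses over the feature groups of one document.
m = {
    frozenset({"merger"}): 0.5,
    frozenset({"merger", "stake"}): 0.25,
    frozenset({"profit", "loss"}): 0.25,
}
candidates = [
    frozenset({"profit"}),
    frozenset({"merger"}),
    frozenset({"merger", "stake"}),
]
ranking = rank_hypotheses(m, candidates)
```

Here {merger, stake} ranks first because two feature groups imply it, while {profit} ranks last with zero belief: the only group mentioning "profit" also contains "loss" and therefore does not imply {profit} on its own.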

EXPERIMENTS AND EVALUATION
Three benchmark datasets were chosen for evaluating the effectiveness of the proposed feature selection method: Reuters-21578 (Lewis, 1997), WebKB and 20 News Groups, which are among the most widely used text corpora in text classification research. The details of these datasets are given in Table 2. Because these datasets contain documents on various topics, they were chosen deliberately to show the effect of our proposed feature retrieval method on different domains. As for the text classification algorithms, we chose two of the most promising in the domain: the SVM and kNN text classifiers. SVM is the most common one, as it was shown to be more effective than other text classifiers such as naïve Bayes, kNN, C4.5 and Rocchio (Joachims, 1998). The kNN algorithm was chosen because of its simplicity and its efficiency relative to other algorithms (Yang and Pedersen, 1997; Denoeux, 1995).

Evaluation measures:
To evaluate the effectiveness of our approach and compare it with state-of-the-art feature selection results, we use the common evaluation metrics precision, recall and the F1 measure. Precision is defined as the ratio of correct classifications of documents into categories to the total number of attempted classifications. Recall is defined as the ratio of correct classifications of documents into categories to the total number of labeled data in the testing set. The F1 measure is defined as the harmonic mean of precision and recall. Hence, a good classifier is assumed to have a high F1 measure, which indicates that the classifier performs well with respect to both precision and recall. We present the micro-averaged results for precision, recall and the F1 measure. Micro-averaging considers the sum of all the true positives, false positives and false negatives over the categories (Forman, 2003).
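Micro-averaging as described above can be sketched as follows. The per-category counts are invented for illustration and the function name is an assumption; the counts are pooled over categories before the ratios are taken.

```python
from typing import Dict, Tuple

def micro_average(
    counts: Dict[str, Tuple[int, int, int]]
) -> Tuple[float, float, float]:
    """Compute micro-averaged precision, recall and F1.

    `counts` maps each category to (true positives, false positives,
    false negatives). Counts are pooled over categories before dividing.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented per-category counts, for illustration only.
counts = {"earn": (8, 2, 0), "acq": (1, 0, 4)}
p, r, f1 = micro_average(counts)
```

With these counts the pooled totals are 9 true positives, 2 false positives and 4 false negatives, giving precision 9/11, recall 9/13 and F1 = 0.75. Because the counts are pooled, large categories dominate the micro-averaged scores.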

RESULTS AND DISCUSSION
We conducted several experiments using our model with various learning algorithms. The idea of each experiment is to generate potential features using the derived measure Bel(q). We sort the list of features by the computed scores and take the k terms with the highest scores. To evaluate the quality of each retrieved list of features, the k terms are given to the learning algorithm, evaluated on measures such as precision and recall, and compared with previously reported work. We repeated this experiment for each classifier over a wide range of k values, from 50 to 1000. The results are summarized in Tables 3 and 4. The experimental results suggest that the proposed feature selection model, referred to as COM, performs better in terms of precision than conventional feature selection methods such as IG, CHI and Odds Ratio. This improvement in effectiveness results from the combination of evidence represented by different feature selection methods. However, in applications where scalability permits only a limited number of features, IG is the method that outperforms the others.
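The selection step described above (sort by score, keep the k best) can be sketched as follows. The scores are invented and the function name is an assumption; tiny values of k are used because the toy vocabulary has only four terms, whereas the experiments sweep k from 50 to 1000.

```python
import heapq
from typing import Dict, List

def top_k_features(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k terms with the highest scores, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

# Invented belief scores for a handful of terms.
scores = {"merger": 0.75, "stake": 0.5, "profit": 0.25, "the": 0.0}

# One feature list per value of k, mirroring the sweep over k.
selected = {k: top_k_features(scores, k) for k in (1, 2, 3)}
```

Each list in `selected` would then be used to build the document representation for one training and evaluation run of the classifier.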
We presented the classification results for the SVM and kNN algorithms using our proposed feature retrieval model on the Reuters-21578, WebKB and 20 News Groups datasets. This series of experiments suggests that, if precision is the central goal, the proposed model outperforms the traditional methods by a small but significant margin.

CONCLUSION
We constructed a Dempster-Shafer model for feature selection in text categorization and observed its performance with two text classification algorithms, namely SVM and kNN. With the enormous growth of digital documents on the World Wide Web, existing traditional feature selection techniques are found to be inadequate in capturing the potential features from a document collection. It has been shown that the proposed Dempster-Shafer model can capture the relevant and potential features from the collection and thereby improve the effectiveness of the classifier. We performed experiments on three standard benchmark datasets: Reuters-21578, WebKB and 20 News Groups. We showed that our proposed model performs significantly better than the conventional feature selection methods with SVM and kNN.

Fig. 1: An example of overlapped elementary hypotheses in a frame of discernment

#(t_k, d_j) = the number of times t_k occurs in document d_j. #T_r(t_k) = the number of documents in T_r in which t_k occurs at least once. |T_r| = the total number of training documents in the collection C.

Table 1: Sample of elementary hypotheses of the frame S

Feature group representation: To retrieve highly relevant features from the document collection, we selectively model each subset of the frame of discernment as a feature group. Each such feature group may consist of one or more features, and a combination of such feature groups may resemble the set of features generated by conventional feature selection metrics.

Table 2: Summary of the benchmark datasets used in our research

Table 3: Performance of kNN classifier on Reuters, Web KB and