A Hybrid Feature Subset Selection using Metrics and Forward Selection

The aim of this study is to design a feature subset selection technique that speeds up feature selection (FS) in high-dimensional datasets while reducing computational cost. FS has become the focus of much research on decision support systems, where data with a tremendous number of variables are analyzed. Filters and wrappers are the established techniques for the feature subset selection process: filters use an association-based approach, whereas wrappers adopt classification algorithms to identify important features. The filter method lacks the ability to minimize generalization error, while the wrapper method demands heavy computational resources. To overcome these difficulties, a hybrid approach is proposed that combines both: a filter stage using a combination of ranker search methods, and a wrapper stage that improves learning accuracy while reducing memory requirements and execution time. Data from the UCI machine learning repository were chosen to evaluate the approach. The classification accuracy obtained with our approach proves to be higher.


INTRODUCTION
With sufficient data, it is possible to use all the features, including the irrelevant ones. In practice, however, irrelevant features involved in the learning process cause two problems:
• Irrelevant input features induce greater computational cost.
• Irrelevant features may lead to overfitting.
Example: If the identification number of the patient is mistakenly taken as an input feature, the conclusion may be that the disease is determined by this feature.
Feature selection estimates the underlying function between the input and the output; it is reasonable and important to ignore those input features with little effect on the output, so as to keep the approximation model small.
Feature selection is a dimensionality reduction technique that consists of detecting the relevant features and discarding the irrelevant ones; it has been studied for many years. An accurate selection can improve the learning speed (Guyon et al., 2006). Feature selection has succeeded in many diverse real-world applications (Yu and Liu, 2004; Bolon-Canedo et al., 2011; Forman et al., 2003; Saari et al., 2011).
Filters and wrappers are the two evaluation strategies. In filters, individual features are evaluated independently of the learning algorithm, whereas wrappers use the learning algorithm itself to assess feature subsets. In this study, we introduce a hybrid technique that selects features more accurately by combining ranker search methods with a wrapper algorithm that applies a distributed learning method to multiple subsets of data processed concurrently. The learning is parallelized by distributing the subsets of data to multiple processors and then combining the partial results into a single subset of relevant features. In this way, the computational cost and the time required are appreciably reduced. Experiments study the importance of the feature selection process on the lung cancer data set collected from the UCI machine learning repository.
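The distribution pattern described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `select_from_subset` is a hypothetical stand-in for the wrapper evaluation of one feature group, and a thread pool stands in for the multiple processors.

```python
from concurrent.futures import ThreadPoolExecutor

def select_from_subset(subset):
    """Hypothetical wrapper step: keep the features of this group whose
    toy 'relevance score' exceeds a threshold. In the real method this
    would be a learner-driven subset evaluation."""
    return [name for name, score in subset if score > 0.5]

def hybrid_select(ranked_features, k):
    """Split the ranked (name, score) list into disjoint groups of k
    features, process the groups concurrently, and merge the partial
    selections into a single final selection."""
    groups = [ranked_features[i:i + k]
              for i in range(0, len(ranked_features), k)]
    with ThreadPoolExecutor() as pool:
        partial = pool.map(select_from_subset, groups)
    return [name for part in partial for name in part]
```

Because `map` preserves group order, the merged selection keeps the original ranking order of the surviving features.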

LITERATURE REVIEW
Wrappers for Feature Subset Selection (Kohavi and John, 1997) searches for an optimal feature subset tailored to a particular algorithm and domain. Significant improvements in accuracy were obtained using decision trees and naïve Bayes.
Rough-Set Based Hybrid Feature Selection Method for Topic-Specific Text Filtering (Li et al., 2004) selects features using the χ² statistic and information gain, and then by means of rough sets. Naïve Bayes was used to assess the method.
Evaluating feature selection methods for learning in data mining applications (Piramuthu, 2004) evaluates several probabilistic distance-based feature selection methods for inducing decision trees, using five real-world data sets.
Euclidean Based Feature Selection for Network Intrusion Detection (Suebsing and Hiransakolwong, 2009) applies the Euclidean distance to select a subset of robust features, requiring smaller storage space and achieving higher intrusion detection performance. Three different test data sets are used to evaluate the proposed technique.

PROPOSED METHODOLOGY
In our proposed methodology, we use a hybrid approach. A first step is added in order to rank the features; after obtaining this ranking, the wrapper model becomes the focus of our attention.

Ranker search methods:
• Apply each feature selection method in the list below, with ranker search, on the given dataset. For two n-dimensional vectors i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn):
  o The Euclidean distance:

      d(i, j) = sqrt( Σ_{k=1}^{n} (x_ik - x_jk)^2 )         (1)

  o The Manhattan (or city block) distance:

      d(i, j) = Σ_{k=1}^{n} |x_ik - x_jk|                   (2)

  o The Minkowski distance between two objects, using p = 3:

      d(i, j) = ( Σ_{k=1}^{n} |x_ik - x_jk|^p )^(1/p)       (3)

• To calculate the rank of a feature, compute its midrange:

      mr = (largest value - smallest value) / 2             (4)

• The output lists the features in descending order
• Assign weights to the features in the list, from n down to 1
• Add the weights obtained from all the methods and store the totals in descending order; then grade the features from n down to 1

Wrapper model: The idea of the wrapper approach is to select a feature subset using a learning algorithm.
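The ranking stage above can be sketched in Python as follows. The function names, and the behavior of `combine_rankings` on tied weights, are our own illustration under the steps listed, not the paper's code; the midrange follows the text's definition of half the range of a feature's values.

```python
def euclidean(i, j):
    # Equation (1): straight-line distance between two vectors.
    return sum((a - b) ** 2 for a, b in zip(i, j)) ** 0.5

def manhattan(i, j):
    # Equation (2): city-block distance.
    return sum(abs(a - b) for a, b in zip(i, j))

def minkowski(i, j, p=3):
    # Equation (3): generalized distance; the paper uses p = 3.
    return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1.0 / p)

def midrange(values):
    # Equation (4), as defined in the text:
    # (largest value - smallest value) / 2.
    return (max(values) - min(values)) / 2

def combine_rankings(rankings):
    """Each ranking is a feature list sorted best-first. Assign weights
    n down to 1 within each ranking, sum the weights per feature, and
    return the features graded best-first by combined weight."""
    n = len(rankings[0])
    total = {}
    for ranking in rankings:
        for pos, feat in enumerate(ranking):
            total[feat] = total.get(feat, 0) + (n - pos)
    return sorted(total, key=total.get, reverse=True)
```

For example, a feature ranked first by two metrics and second by a third accumulates more weight than any feature it outranks, so it heads the combined list.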
The searching procedure involves two basic strategies. Forward selection begins with an empty set and adds features successively; backward elimination begins with the full set and removes features successively (Das, 2001). Forward selection is far less computationally expensive, so it is used in our experimental research.
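A minimal sketch of greedy forward selection, assuming `evaluate` returns a quality score (e.g. classification accuracy of the learner) for a candidate feature subset; the stopping rule shown, halting when no single addition improves the score, is one common choice rather than the paper's stated criterion.

```python
def forward_selection(features, evaluate):
    """Greedy forward selection: start from the empty set and, at each
    step, add the feature whose inclusion most improves evaluate();
    stop when no remaining feature improves the current score."""
    selected = []
    best_score = evaluate([])
    remaining = list(features)
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, best = max(scored)
        if score <= best_score:
            break  # no addition helps; stop searching
        selected.append(best)
        remaining.remove(best)
        best_score = score
    return selected
```

Backward elimination would mirror this loop, starting from the full set and removing the least useful feature at each step.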
The data is split into groups, where each group consists of k features taken successively from the grading.

RESULTS AND DISCUSSION
The lung cancer data set is used for the experiments. 50% of the data set is taken as the training set and the remainder as the test set. The univariate filters Euclidean, Manhattan and Minkowski provide an ordered ranking of all the features. Table 1 shows the ranking of features and Table 2 shows the combined ranking of features.
To show the adequacy of the proposed wrapper, it is compared with the performance obtained when applying the wrapper over the whole set of features directly. Table 3 shows the classification accuracy, the number of features and the execution time required on the dataset. Figure 1 shows the increase in accuracy when the data set was split into subsets.
Screenshots for ranking of features using the different metrics: Experiments are carried out using .NET technology for ranking of features with the different metrics Euclidean (Fig. 2), Manhattan (Fig. 3) and Minkowski (Fig. 4) on test and validation data.

CONCLUSION
To sum up, this study obtains an ordered ranking of all features using a combination of ranker search methods and divides each dataset DS into several small disjoint datasets DS_i, vertically by features. The wrapper algorithm is applied to each of these subsets, and a selection SL_i is generated for each subset of data. After all the small datasets DS_i have been processed, the combination method constructs the final selection SL as the result of the feature selection process. The experiments showed that our method reduced the running time as well as the storage requirements while accuracy did not drop.

Fig. 1: Increase in accuracy when the dataset was split into subsets

Pseudo-code for the proposed algorithm:

DS(p × r) = training dataset with p samples and r features
q = number of subsets of k features

1. Apply the ranker search methods over DS and obtain a ranking R1 of the features
2. for i = 1 to q
   (a) R1_i = first k features in R1
   (b) R1 = R1 \ R1_i
   (c) DS_i = DS(p × R1_i)
3. for i = 1 to q
   (a) SL_i = subset of features obtained after applying the wrapper over DS_i
4. SL = SL_1
5. support = accuracy classifying subset DS(p × SL) with classifier CL
6. for i = 2 to q
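Under our reading of the pseudo-code above, the overall procedure might be sketched as follows. Since steps 4-6 are truncated in the source, the final combination is shown here as a plain union of the per-group selections rather than the paper's accuracy-guarded merge; `wrapper` stands for any subset-selection routine applied to one group.

```python
def proposed_selection(ranking, k, wrapper):
    """Sketch of the proposed algorithm: partition the ranked feature
    list into disjoint groups of k features (the vertical splits DS_i),
    apply the wrapper to every group to obtain SL_i, and merge the
    partial selections into the final selection SL."""
    # Step 2: successive groups of k features, in ranking order.
    groups = [ranking[i:i + k] for i in range(0, len(ranking), k)]
    # Step 3: wrapper selection on each group independently.
    selections = [wrapper(group) for group in groups]
    # Steps 4-6 (simplified): combine the partial selections.
    final = []
    for sel in selections:
        final.extend(sel)
    return final
```

Because the groups are disjoint, each wrapper call sees only k features, which is what keeps the per-subset running time and memory low.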

Fig. 2: Rank calculation using Euclidean metric on test and validation data

Fig. 3: Rank calculation using Manhattan metric on test and validation data

Fig. 4: Rank calculation using Minkowski metric on test and validation data

Table 1: Ranking of features using the different metrics

Table 2: Combined weights of the different metrics

Table 3: Classification results for implementations of the wrapper