Swarm Intelligence Approach Based on Adaptive ELM Classifier with ICGA Selection for Microarray Gene Expression and Cancer Classification



INTRODUCTION
Cancer detection and classification for diagnostic and prognostic purposes is usually based on pathological investigation of tissue sections, which results in subjective interpretation of the data (Eisen and Brown, 1999). The limited information gained from morphological analysis is often insufficient to aid cancer diagnosis and may result in expensive but ineffective treatment.
With the emergence and rapid development of DNA microarray technologies (Eisen and Brown, 1999; Lipshutz et al., 1999), classification of cancer by identifying the corresponding gene expression profiles has attracted considerable effort from a wide range of research communities. Cancer classification has thus become central to diagnosis and treatment. Without accurate identification of the cancer type, it is rarely possible to give effective therapy and achieve the expected outcome. Conventional classification methods depend mainly on the morphological appearance of tumors, parameters derived from clinical observations and additional biochemical techniques. Their application is limited by the uncertainties involved, and their prediction accuracy needs further improvement (Golub et al., 1999). DNA microarray technologies offer researchers a new way to analyze cancer: they make it possible to examine the pathology of cancer from a molecular angle within a systematic framework and, further, to make more accurate predictions for prognosis and treatment.
In order to identify cancer subtypes precisely, numerous recent studies have been carried out to find genes that might cause cancer (Peng et al., 2003; Saeys et al., 2007; Koller and Sahami, 1996). Advances in microarray technology and improved methods for processing and transforming biological data have supported these studies. For example, the analysis of Microarray Gene Expression Data (MGED) (Piatetsky-Shapiro and Tamayo, 2003) enables molecular cancer classification through the selection of genes that might serve as markers for different types of cancer. However, selecting optimal sets of gene features is complicated by the very large number of genes and by the availability of very few, unevenly distributed samples per class (sparse and imbalanced data). This significantly affects classification performance. Computational molecular classification combined with machine learning techniques may offer a more reliable and cost-effective method for identifying different types of cancer, which might lead to better treatment and prognosis. Identifying the gene profiles that give better cancer classification may also help identify what each cancer type needs to survive and thrive. This kind of information will provide a better basis for developing suitable drugs to treat specific cancers. Some of the major issues in the classification of microarray data are the robustness of gene selection and gene ranking, the understanding of issues associated with feature selection and the assessment of the selected genes (Ein-Dor et al., 2006; Stolovitzky, 2003). The classification, possible uses and diversity of feature selection techniques are discussed in Saeys et al. (2007).
In this study, an improved gene selection and cancer classification technique is proposed for microarray data characterized by sample sparseness and imbalance. An Integer-Coded Genetic Algorithm (ICGA) (Saraswathi et al., 2011) is used for robust gene selection from the microarray data. Next, an Artificial Bee Colony (ABC) algorithm (Karaboga and Basturk, 2007) driven Adaptive Extreme Learning Machine (AELM) (Jia and Hao, 2013) is proposed to handle the sparse and imbalanced classification problem that occurs in microarray data analysis.

LITERATURE REVIEW
Alba et al. (2007) compared two optimization techniques, Particle Swarm Optimization (PSO) and the Genetic Algorithm (GA), for gene selection, with cancer classification performed by an SVM classifier on high-dimensional microarray data sets. The SVM classification results are validated and evaluated with 10-fold cross-validation. The first contribution is to show that PSOSVM is able to find interesting genes and to provide improved classification results. An enhanced version of Geometric PSO is evaluated for the first time in that work, with a binary representation (0 and 1) in Hamming space. The PSOSVM results are compared with a new GASVM and with other existing gene selection methods. A second contribution is the actual discovery of new and challenging results on six public data sets, identifying genes that are significant in the development of a variety of cancers (leukemia, breast, colon, ovarian, prostate and lung).
Liao et al. (2006) presented a novel gene selection procedure based on the Wilcoxon rank sum test, with classification performed by a Support Vector Machine (SVM). The best features are selected using the Wilcoxon rank sum test, after which an SVM classifier with a linear kernel is trained and tested. Leave-One-Out Cross Validation (LOOCV) results on two data sets, breast cancer and ALL/AML leukemia, show that the technique can achieve a 100% success rate with the final reduced subset. The selected genes are listed and their expression levels are plotted, showing that the selected genes clearly separate the positive and negative classes.

PROPOSED METHODOLOGY FOR GENE SELECTION AND CLASSIFICATION
The proposed work combines the Integer-Coded Genetic Algorithm (ICGA) with an Artificial Bee Colony (ABC) based Adaptive Extreme Learning Machine (AELM) classifier.

ABC based adaptive extreme learning machine classifier for accurate classification with ICGA based gene selection: An ABC based Adaptive Extreme Learning Machine (ABC based AELM) classifier with ICGA based gene selection is proposed, which selects the optimal features/genes and uses them for accurate classification of a sparse and imbalanced data set. The basic ELM classifier can quickly discriminate the cancer classes from the data representing the chosen features, but its performance depends on the distribution of the input data. For sparse and highly imbalanced data sets, the random selection of input weights in the ELM classifier degrades classification performance considerably (Suresh et al., 2010). An ABC based AELM classifier is therefore proposed in this study, where the ABC algorithm is used to identify the optimal input weights so that the AELM classifier discriminates the cancer classes well, i.e., the performance of the AELM classifier is improved. In this study, the original data are separated into training and prediction data sets. Once the input and output weight values have been derived on the training data with ABC, the cancer class can be predicted directly by the AELM.
The block diagram of the proposed methodology, covering classification and optimal gene selection, is shown in Fig. 1. The performance of the AELM classifier depends mostly on the input genes selected by the ABC and ICGA methods. In order to reduce the computational load, ICGA is used to select and minimize the number of genes that can efficiently distinguish the cancer classes. Based on the selected genes, the AELM algorithm builds the classifier. Initially, ICGA chooses n independent genes/features from the available gene set. For these genes, ABC identifies the optimal parameters (number of hidden nodes and input weights) such that the accuracy of the AELM multiclass classifier is improved. The validation performance of the AELM classifier (η) is used in ABC for the selection of the AELM parameters, and the best validation performance (η+) is used as the fitness for the ICGA evolution.

Adaptive Extreme Learning Machine model for forecasting (AELM):
In general, the ELM algorithm may suffer from either under-fitting or over-fitting. Of the two, over-fitting is the more significant when the original data are sufficient and the network is sufficiently complex. An over-fitted ELM model generally has degraded predictive performance. To overcome this problem, the Adaptive Extreme Learning Machine (AELM) model is used; it reduces the chance of over-fitting and improves the prediction performance of cancer classification. In this model, the observed data are used to adjust the inputs of the ELM during prediction, bringing the inputs closer to the learning data. In this study, the network has a single output, which is the predicted cancer classification result. Thus, the model is discussed for one output only.
Firstly, a strategy is used to initialize the input data. The strategy adopts an adaptive distance metric, similar to the adaptive k-nearest-neighbor method presented by Liao et al. (2006). The ordinary distance between two patterns assesses their pointwise difference, but differences in trend and amplitude are not represented. In time-series forecasting, the information on trends and amplitudes is the crucial factor.
Because this distance does not measure closeness exactly, an adaptive metric is introduced to solve the problem. It is defined through a minimization problem; when the equivalent linear system is solved, the solution of the minimization problem can be obtained analytically in terms of four sums over the elements of the two patterns.
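The idea of an adaptive metric can be illustrated with a minimal Python sketch. It assumes, as an illustration only, that the metric is the least-squares distance between one pattern and an affine transform (scale `h`, offset `l`) of the other, which gives a closed-form solution in terms of four sums. The function name `adaptive_distance` and the symbols `h`, `l`, `z1` to `z4` are hypothetical and may not match the notation of the original equations.

```python
import numpy as np

def adaptive_distance(x, y):
    """Distance between x and the best affine transform h*y + l of y.

    Assumed form: min over (h, l) of sqrt(mean((x - h*y - l)**2)).
    Returns (distance, h, l).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    # The four sums that appear in the closed-form least-squares solution.
    z1, z2, z3, z4 = x.sum(), y.sum(), (x * y).sum(), (y * y).sum()
    denom = z2 ** 2 - n * z4
    if np.isclose(denom, 0.0):
        # y is constant, so only the offset can be fitted.
        h, l = 0.0, z1 / n
    else:
        h = (z1 * z2 - n * z3) / denom
        l = (z1 - h * z2) / n
    residual = x - h * y - l
    return float(np.sqrt(np.mean(residual ** 2))), float(h), float(l)

# Two patterns with the same shape but different amplitude and offset.
y = np.array([1.0, 2.0, 4.0, 3.0])
x = 2.0 * y + 5.0
dist, h, l = adaptive_distance(x, y)
euclid = float(np.linalg.norm(x - y))  # the plain distance is large
```

In this example the plain Euclidean distance is large although both patterns follow the same trend, whereas the adaptive distance is essentially zero once amplitude and offset are fitted.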
From this, the input vector of the first network can be defined. With this method, most input values are brought close to the historical data; without it, the forecasting error increases dramatically because of the large difference between the training data and the input data. In order to get more accurate results, several sets of inputs are used, each giving its own output.
The mechanism for mixing the outputs takes into account the distance between each nearest pattern and the input. From Eq. (7), the forecasting result is calculated from the individual outputs with different weighting coefficients, and a larger coefficient is assigned to the output whose input data are closer. Based on the methodology proposed above, the forecasting scheme can be formulated as shown in Fig. 2.
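The mixing of outputs can be sketched as a distance-weighted average. The sketch below assumes inverse-distance weights normalized to sum to one; this is one common choice that gives closer inputs a larger coefficient, and the exact coefficients of Eq. (7) may differ. The function name `combine_outputs` and the small constant `eps` are illustrative.

```python
import numpy as np

def combine_outputs(outputs, distances, eps=1e-12):
    """Weighted average of several network outputs; closer patterns weigh more.

    Inverse-distance weights are an assumption made for this sketch.
    """
    outputs = np.asarray(outputs, dtype=float)
    distances = np.asarray(distances, dtype=float)
    weights = 1.0 / (distances + eps)   # eps avoids division by zero
    weights /= weights.sum()            # coefficients sum to one
    return float(np.dot(weights, outputs))

# The output whose input pattern is closer (distance 1.0) dominates.
pred = combine_outputs(outputs=[1.0, 3.0], distances=[1.0, 3.0])
```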
In the cancer prediction schema, the focus is on one-step-ahead point forecasting. Let $x_1, x_2, \ldots, x_t$ be a time series for classification. At time $t$, for $t \geq 1$, the next value $x_{t+1}$ is predicted from the observed training results $x_1, x_2, \ldots, x_t$. For the ELM model, the result generally differs from run to run because the input weights and hidden biases are randomly selected. It is well recognized that the mean of several forecasts is more reliable than a single forecast. So, a regression based integration schema is proposed in this study to obtain higher prediction accuracy. The final predicted classification result is simply the mean of the S predicted classification series:

$$\hat{x}_{t+1} = \frac{1}{S}\sum_{s=1}^{S}\hat{x}_{t+1}^{(s)}$$

where $\hat{x}_{t+1}^{(s)}$ is the prediction of the s-th ELM run.
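A minimal sketch of this averaging step, using a basic ELM (random input weights and biases, output weights from the Moore-Penrose pseudo-inverse). It illustrates the basic ELM only, not the adaptive variant; the function names, the sigmoid activation, the hidden-layer size and the toy data are all assumptions made for illustration.

```python
import numpy as np

def train_elm(X, y, n_hidden, rng):
    """Basic ELM: random input weights and biases, least-squares output weights."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer outputs (sigmoid)
    beta = np.linalg.pinv(H) @ y             # Moore-Penrose solution
    return W, b, beta

def predict_elm(model, X):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

def averaged_prediction(X_train, y_train, X_test, n_hidden=20, S=10, seed=0):
    """Train S independently initialized ELMs and average their predictions."""
    rng = np.random.default_rng(seed)
    runs = np.asarray([
        predict_elm(train_elm(X_train, y_train, n_hidden, rng), X_test)
        for _ in range(S)
    ])
    return runs.mean(axis=0), runs

# Toy regression data, used only to exercise the code.
data_rng = np.random.default_rng(1)
X_train = data_rng.uniform(-1, 1, size=(80, 1))
y_train = np.sin(3 * X_train[:, 0])
X_test = np.linspace(-1, 1, 25).reshape(-1, 1)
y_test = np.sin(3 * X_test[:, 0])
mean_pred, runs = averaged_prediction(X_train, y_train, X_test)
```

Because the squared error is convex, the error of the averaged prediction can never exceed the average error of the individual runs, which is the motivation for averaging over random initializations.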
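To compare the method with other methods on all the classified data series, an adequate error measure must be selected. The Normalized Mean Squared Error (NMSE) is used as the error criterion; it is the ratio of the mean squared error to the variance of the time series. For a time series $x_t$ with predictions $\hat{x}_t$, it is defined by:

$$\mathrm{NMSE} = \frac{1}{\sigma^2 N}\sum_{t=1}^{N}\left(x_t - \hat{x}_t\right)^2$$

where $N$ is the number of points and $\sigma^2$ is the variance of the time series.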
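The NMSE as described above (mean squared error divided by the variance of the series) can be computed as in the following sketch. The function name `nmse` is illustrative, and the population variance is assumed.

```python
import numpy as np

def nmse(actual, predicted):
    """Mean squared error divided by the (population) variance of the series."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mse = np.mean((actual - predicted) ** 2)
    return float(mse / np.var(actual))

series = np.array([1.0, 2.0, 3.0, 4.0])
perfect = nmse(series, series)                            # exact prediction
baseline = nmse(series, np.full(4, series.mean()))        # always predict the mean
```

A value of 0 corresponds to a perfect prediction, and a value of 1 corresponds to always predicting the mean of the series, so useful predictors should fall well below 1.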
Artificial Bee Colony (ABC) algorithm: In general, the ABC algorithm involves three groups of bees: employed bees, onlookers and scouts. A bee waiting to decide which food source to choose is called an onlooker bee, and a bee that goes to a food source it has previously visited is called an employed bee. A bee that searches for new food sources at random is called a scout bee. The position of a food source represents a possible solution of the optimization problem, and the nectar amount of the food source corresponds to the fitness of that solution, defined as:

$$fit_i = \frac{1}{1 + f_i} \quad (13)$$

where $f_i$ is the cost function of solution $i$, calculated according to Kavipriya and Gomathy (2013).
An artificial onlooker bee chooses a food source based on the probability value associated with that food source, computed by Eq. (14):

$$p_i = \frac{fit_i}{\sum_{n=1}^{SN} fit_n} \quad (14)$$

where SN is the total number of food sources, which is equal to the number of employed bees, and $fit_i$ is the fitness of solution $i$ given in Eq. (13), which is inversely proportional to the cost of that solution.
In order to produce a new candidate food location from the old solution in memory, ABC uses Eq. (15):

$$v_{ij} = z_{ij} + \phi_{ij}\left(z_{ij} - z_{kj}\right) \quad (15)$$

where $k \in \{1, 2, \ldots, SN\}$ and $j \in \{1, 2, \ldots, D\}$ are randomly selected indexes. Although $k$ is chosen randomly, it has to be different from $i$. $\phi_{ij}$ denotes a random number in the interval (-1, 1). It controls the production of neighbor food sources around $z_{ij}$ and represents the visual comparison of two food positions by a bee. As can be seen from Eq. (15), as the difference between the parameters $z_{ij}$ and $z_{kj}$ decreases, the perturbation on the position $z_{ij}$ decreases too. Therefore, as the search approaches the optimal solution in the search space, the step length is reduced.
When the nectar of a food source is exhausted, that food source is replaced with a new location found by the scouts. In ABC, this is simulated by generating a position at random and using it to replace the abandoned one. If a position cannot be improved further within a fixed number of cycles, that food source is considered abandoned. This fixed number of cycles is an important control parameter of the ABC algorithm. Assume that the abandoned source is $z_i$ and $j \in \{1, 2, \ldots, D\}$; the scout then finds a new food source to replace $z_i$. This operation can be defined as in Eq. (16):

$$z_{ij} = z_{\min,j} + \mathrm{rand}(0,1)\left(z_{\max,j} - z_{\min,j}\right) \quad (16)$$

where $z_{\min,j}$ and $z_{\max,j}$ are the lower and upper bounds of parameter $j$. After each candidate source position is generated and evaluated by the artificial bee, its performance is compared with that of the old position. If the new food source is equal to or better than the old one, it replaces the old one in memory. Otherwise, the old one is retained. In other words, a greedy selection mechanism is employed to choose between the old and the candidate solutions. Finally, the global optimal result is obtained.
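The employed, onlooker and scout phases described above can be sketched as follows. This is a minimal, generic ABC for minimizing a non-negative cost function, not the configuration used in this study: the function name `abc_minimize`, the parameter defaults (`n_sources`, `limit`, `n_cycles`) and the sphere test function are illustrative assumptions, and the fitness is assumed to be 1 / (1 + cost).

```python
import numpy as np

def abc_minimize(cost, lower, upper, n_sources=20, limit=30, n_cycles=200, seed=0):
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # Initial food sources are placed uniformly at random within the bounds.
    sources = lower + rng.random((n_sources, dim)) * (upper - lower)
    costs = np.array([cost(z) for z in sources])
    trials = np.zeros(n_sources, dtype=int)
    idx = int(np.argmin(costs))
    best, best_cost = sources[idx].copy(), float(costs[idx])

    def try_neighbour(i):
        # Perturb one random coordinate j relative to a different source k.
        k = rng.choice([s for s in range(n_sources) if s != i])
        j = rng.integers(dim)
        candidate = sources[i].copy()
        candidate[j] += rng.uniform(-1.0, 1.0) * (sources[i, j] - sources[k, j])
        candidate = np.clip(candidate, lower, upper)
        c = cost(candidate)
        if c <= costs[i]:       # greedy selection: keep equal or better
            sources[i], costs[i], trials[i] = candidate, c, 0
        else:
            trials[i] += 1

    for _ in range(n_cycles):
        for i in range(n_sources):                    # employed bee phase
            try_neighbour(i)
        fit = 1.0 / (1.0 + costs)                     # assumes cost >= 0
        prob = fit / fit.sum()
        for i in rng.choice(n_sources, size=n_sources, p=prob):  # onlooker phase
            try_neighbour(int(i))
        idx = int(np.argmin(costs))                   # memorize the best source
        if costs[idx] < best_cost:
            best, best_cost = sources[idx].copy(), float(costs[idx])
        worn = int(np.argmax(trials))                 # scout phase
        if trials[worn] > limit:
            sources[worn] = lower + rng.random(dim) * (upper - lower)
            costs[worn] = cost(sources[worn])
            trials[worn] = 0
    return best, best_cost

# Toy problem: minimize the sphere function in three dimensions.
best, best_cost = abc_minimize(lambda z: float(np.sum(z ** 2)), [-5, -5, -5], [5, 5, 5])
```

In the proposed classifier, the cost would instead be derived from the validation performance of the AELM for the parameters encoded in each food source.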
ABC based AELM classifier: The best food source position in the update equations is thus obtained from the fitness value; in the update, the first term denotes the current velocity, the second term the local search and the third term the global search.
The fitness value of a candidate solution is the validation efficiency of the AELM classifier whose number of hidden neurons H, input weights V and bias b are initialized from that solution. ABC searches for the best H, V and b values, from which the output weights of the ELM classifier are computed analytically, resulting in improved generalization performance. The cross-validation performance of the best H, V and b is η+. An important consideration for the ABC based AELM classifier is to establish the degree of data set imbalance that the classifier can handle without losing performance significantly (Guo et al., 2011).

Analysis on imbalance data:
The analysis of the sample imbalance handling capability of the ABC based AELM classifier follows the technique in Suresh et al. (2008), where the number of samples in one of the classes was reduced and the performance of the classifier was examined for different imbalance conditions. A similar examination was conducted for the proposed ABC based AELM classifier, and the average, overall and individual classification efficiencies obtained are shown in Fig. 3.
It is observed that the average and overall classification efficiencies of the ABC based AELM classifier are almost constant up to 50% sample imbalance in the class 2 data. With proper selection of the input weights and bias values, better classification performance can be attained. Without such careful selection, the classification performance of the AELM classifier falls considerably with sample imbalance.
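The three efficiency measures can be computed from a confusion matrix, as in the sketch below. It assumes the usual definitions (individual efficiency as per-class recall, overall efficiency as the fraction of all samples classified correctly, and average efficiency as the mean of the individual efficiencies); the function name and the example counts are illustrative and are not results from this study.

```python
import numpy as np

def classification_efficiencies(confusion):
    """confusion[i, j] = number of class-i samples predicted as class j."""
    q = np.asarray(confusion, dtype=float)
    per_class = 100.0 * np.diag(q) / q.sum(axis=1)   # individual efficiencies (%)
    overall = 100.0 * np.trace(q) / q.sum()          # overall efficiency (%)
    average = float(per_class.mean())                # average efficiency (%)
    return per_class, float(overall), average

# Hypothetical imbalanced two-class case: 90 samples in class 1, 10 in class 2.
per_class, overall, average = classification_efficiencies([[88, 2], [6, 4]])
```

With imbalanced classes, the overall efficiency can stay high while the average efficiency drops, which is why both are reported when studying imbalance.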
Integer-coded genetic algorithm: Genetic algorithms are widely used to solve complex optimization problems in which the number of parameters and constraints is large and analytical solutions are difficult to obtain (Michalewicz, 1994). In recent years, a number of techniques have been proposed for integrating genetic algorithms and neural networks. Genetic algorithms have been found to be effective in gene selection and classification. The selection function and genetic operators of GA are described in Michalewicz (1994).
String representation: In this study, ICGA is used to select the N best independent features from the given data set. The characteristic string, which denotes N independent features, is given as a string of N integers, where the selected features belong to the set S and are independent.
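A minimal sketch of an integer-coded string and of operators that keep its genes distinct. The one-point crossover with repair and the replacement mutation shown here are illustrative choices and are not necessarily the operators used by ICGA in Saraswathi et al. (2011); all function names are hypothetical, and each string is assumed to hold at least two features.

```python
import random

def random_chromosome(n_features, n_total, rng):
    """Integer-coded string: n_features distinct gene indices in [0, n_total)."""
    return rng.sample(range(n_total), n_features)

def crossover(parent_a, parent_b, rng):
    """One-point crossover followed by a repair step that avoids duplicates."""
    cut = rng.randrange(1, len(parent_a))
    child = parent_a[:cut]
    # Fill the rest from parent_b, skipping genes already present.
    for gene in parent_b[cut:] + parent_b[:cut]:
        if len(child) == len(parent_a):
            break
        if gene not in child:
            child.append(gene)
    return child

def mutate(chromosome, n_total, rate, rng):
    """Replace each gene, with probability `rate`, by an index not yet in use."""
    result = list(chromosome)
    for pos in range(len(result)):
        if rng.random() < rate:
            new_gene = rng.randrange(n_total)
            while new_gene in result:
                new_gene = rng.randrange(n_total)
            result[pos] = new_gene
    return result

rng = random.Random(0)
a = random_chromosome(14, 16063, rng)
b = random_chromosome(14, 16063, rng)
child = mutate(crossover(a, b, rng), 16063, 0.1, rng)
```

The example uses 14 features out of 16,063 genes to mirror the smallest gene set considered in the experiments.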

Fitness:
The main aim of feature selection is to determine the features that capture the input-output characteristics of the data. The results of the ABC based AELM fivefold cross-validation test are used as the fitness criterion, i.e., for the selected features, ABC identifies the best number of hidden neurons, input weights and bias values and returns the validation efficiency obtained by the AELM algorithm along with the best AELM parameters. The features returning the best validation efficiency are eventually chosen as representative of the full data set. The best solution obtained after a given number of generations is used to develop a classifier (AELM) using the complete training set. This classifier is then used to classify the testing samples.

EXPERIMENTAL RESULTS
In this section, the performance of the proposed approach is compared with other methods on the Global Cancer Map (GCM) data set, in two steps. First, the classification results on the GCM data set are compared with those of other classifiers; then the gene selection results are compared with existing gene selection results. In the GCM data set, the number of samples per class is small and the sample imbalance is high; with a large number of classes and high dimensionality, the selection of samples for training and testing requires attention. In these experiments, the original data set is divided into training and testing data.

Global cancer map data:
The GCM data set was collected from six different medical institutions and covers 14 different types of malignant tumors. It consists of 190 primary tumor samples; 8 further metastatic samples are not used here. Each sample contains the expression of 16,063 genes (assuming a one-to-one mapping from gene to probe set ID). Of the 190 samples, 144 are used for gene selection and classifier development and the remaining 46 are used to assess the generalization performance. The number of training samples per class varies from 8 to 24, so the training data are sparse and imbalanced. In summary, the GCM data set is sparse in nature, with high sample imbalance and a high-dimensional feature space due to the large number of genes. The main objective is to select sets of genes from the 16,063-dimensional space and to identify the smallest number of genes needed to classify all tumor types simultaneously with high accuracy.
In order to evaluate the classifier performance for a sparse and imbalanced data set, the results obtained by the proposed ABC based AELM classifier for a given number of genes are compared with those of existing classifiers. Here, the 98 genes selected in Ramaswamy et al. (2001) are used as the basis for the classifier performance comparison. The ABC based AELM classifier is run to identify the optimal number of hidden neurons, input weights and bias using the 144 training samples. With the best AELM parameters, an AELM classifier is developed using the complete training data and the resulting classifier is tested on the remaining 46 samples. In this study, the experiments were conducted for a variety of random splits into 144 training and 46 testing samples, and the results are reported in Table 1.
Table 1: Comparative analysis on classification methods for GCM data set using 98 genes selected as explained in Ramaswamy et al. (2001)

ABC based AELM with ICGA based gene selection and classification results:
The proposed approach is run to select 14, 28, 42, 56, 70, 84 and 98 genes, respectively, from the original 16,063 genes using a 10-fold cross-validation method on the 144 training samples. The unused testing set (46 samples) is used to assess the generalization performance. ABC based AELM with ICGA identifies the best genes for each set. In these experiments, it was found that the best genes chosen in different runs do not share any common genes. The overlap between the best gene sets (14-98) chosen by the proposed approach is insignificant, but their ability to discriminate the cancer classes is more or less similar. These results show that there are several subsets of genes that can discriminate the cancer classes efficiently.
The performance of the proposed classifier is evaluated over 100 random trials on the training and testing data sets, using the optimal gene sets selected as above. This helps to assess the sensitivity of the classifier to variation in the data. The average, maximum and standard deviation of the training and testing performances are given in Table 2.

Performance comparison of proposed ABC based AELM with ICGA classifier with existing methods:
The results of the proposed approach on the GCM data set are compared with other existing methods. Table 4 shows the minimum number of genes needed by each method to attain maximum generalization performance. From Table 4, the proposed ABC based AELM with ICGA selects a minimum of 42 genes with a high average testing accuracy. GA/SVM selects a minimum of 26 genes and gives results close to those of ABC based AELM with ICGA. It was seen that the genes chosen in different runs for any given subset do not overlap substantially, and that there is no overlap of genes between any two subsets. Nevertheless, the classifiers developed using these sets of selected genes give similar classification performance and were observed to have the same discriminatory power.

CONCLUSION
In this study, accurate gene selection and sparse data classification for microarray data is performed using an ABC based AELM classifier with ICGA gene selection for multiclass cancer classification. The genes selected by ICGA, together with the optimal input weights and bias values selected by ABC, are used by the AELM classifier to deal efficiently with high sample imbalance and sparse data conditions. Hence, the ICGA gene selection approach is integrated with the ABC based AELM classifier to identify a compact set of genes that can discriminate cancer types efficiently, resulting in improved classification results.

Fig. 1: Schematic diagram of the two-stage ICGA-ABC-AELM multiclass classifier, combining the Integer-Coded Genetic Algorithm (ICGA) and the Artificial Bee Colony (ABC) algorithm with the Adaptive Extreme Learning Machine (AELM). ICGA and ABC are used for gene/feature selection and AELM for cancer classification. The classifier handles sparse and imbalanced data sets in the classification of gene expression data.

Fig. 2: The forecasting scheme of the adaptive extreme learning machine for classification

Fig. 3: Effects of data imbalance, where the performance of the AELM classifier was analyzed for different imbalance conditions
Descriptions of the string representation and the fitness are given below.

Table 2: Performance of the proposed classifier for the best set of features selected by the ABC based AELM with ICGA gene selection approach

Table 4: Minimum number of genes required by various methods to achieve maximum generalization performance

Table 5: Results for gene selection and classification by ABC based AELM with ICGA for different data sets
The minimum number of genes necessary for accurate classification, as selected by the ABC based AELM with ICGA gene selection and classifier, is shown in Table 4. The average classification accuracies are given in Table 5.