Self-Organizing Maps and Principal Component Analysis to Improve Classification Accuracy

The aim of this study is to combine the Kohonen Self-Organizing Map (SOM) with Principal Component Analysis (PCA). SOM is an algorithm commonly used to visualize and classify datasets because of its ability to project high-dimensional data onto a lower-dimensional space. However, its performance decreases when the size of the problem becomes too large. Reducing the size of the data, by removing irrelevant or redundant variables and selecting only the most significant ones according to a given criterion, has therefore become a requirement before any classification; this reduction should give the best performance according to a given objective function. Many researchers have tried to solve this problem. This study presents a new approach to improve SOM based on PCA. The experimental analysis of real data from the UCI Machine Learning Repository shows that the proposed SOM improves on the traditional approach: an improvement of more than 2% in classification accuracy is observed.


INTRODUCTION
In recent years, data have grown exponentially, and so has the number of their characteristics. Consequently, reducing the size of the data by removing irrelevant or redundant variables and selecting only the most significant ones according to some criterion has become a requirement before any classification; this reduction should give the best performance according to some objective function (Devaraj et al., 2002; Dudoit et al., 2002; Narayanan et al., 2004). In general, the performance of a classifier decreases when the dimensionality of the problem becomes too large.
Several approaches are used in classification, for example the Hopfield network, K-means and the Support Vector Machine; several of them are inspired by biological neural networks. Among these, Kohonen Self-Organizing Maps (SOM) are widely used for classification. SOM is a type of neural network commonly used for visualizing and classifying multidimensional data. It is applied in various areas: medicine, finance, ecology, engineering, law enforcement and other fields (Ettaouil et al., 2012, 2013; Kohonen, 1998; Pavel and Olga, 2011). However, certain topological constraints of the SOM are fixed before the training phase, and the dimension of the neurons has a great effect on the classification performance, which is discussed in this study. The interesting question is which features should be used: given a set of features, how do we select an optimal subset such that the execution time for classifying the data decreases and the accuracy increases (Arauzo-Azofra et al., 2011)?
One approach to solving this problem is feature selection, which consists of choosing a subset of input variables and deleting redundant or irrelevant ones from the original dataset. Feature selection methods are divided into three categories: filters, wrappers and embedded or hybrid selectors (Blum and Langley, 1997; Ding and Peng, 2005). Filters extract features from the data without any learning involved, by ranking all features and choosing the top ones (Guyon and Elisseeff, 2003; Ruiz et al., 2012). Several filters are widely used in the literature, such as Information Gain (IG) (Wang et al., 2006), Minimum Redundancy Maximum Relevance (mRMR) (Ding and Peng, 2005) and Relief F (Kira and Rendell, 1992). Wrappers use a classification algorithm to evaluate which features are useful; that is, the features are selected taking the classification algorithm into account (Gheyas and Smith, 2010; Kohavi and John, 1997). The third category of feature selection approaches is embedded methods, which take advantage of the two previous models by using their different evaluation criteria in different search stages (Guyon and Elisseeff, 2003; Maldonado et al., 2011; Mundra and Rajapakse, 2010).

The second approach, called feature extraction, replaces the set of n features by a set of m features, each of which is a combination of the original features. A well-known dimensionality reduction technique is Principal Component Analysis (PCA) (Abdi and Williams, 2010). PCA tries to find a linear subspace of lower dimensionality such that the largest variance of the original data is kept. Note, however, that the largest variance of the data does not necessarily represent the most discriminative information (Jolliffe, 1972).
This research addresses the classification of real-world data from the UCI Machine Learning Repository using SOM and PCA. The accuracy rate is used to evaluate the algorithm. The aim of our study is to reduce the number of features and to demonstrate the importance of feature selection for improving classification. The experimental analysis shows a speed-up of the proposed SOM training process in comparison to the classical approach.

PROPOSED MODEL
The proposed SOM-PCA is divided into two main steps. In the first, the network is trained by the classical SOM. In the second, the neurons resulting from the training phase are used as input for PCA, which transforms them into a new set of vectors of lower dimension. The dataset is thus reduced to a smaller number of dimensions with low information loss. Figure 1 shows a flowchart of this model.

Self-organizing maps:
The SOM usually consists of a regular grid of map units. Each unit $j$ is represented by a weight vector $w_j = (w_{j1}, \dots, w_{jd})$, where $d$ is the dimension of the input vectors. The units are connected to adjacent ones by a neighbourhood relation. The SOM is trained iteratively. At each training step $t$, a sample vector $x(t)$ is randomly chosen from the input dataset, and a distance is computed to all weight vectors to find the reference vector that satisfies a minimum-distance (or maximum-similarity) criterion, following Eq. (1). The neuron whose weight vector is most similar to the input pattern is called the Best Matching Unit (BMU), denoted $c$:

$$c = \arg\min_{1 \le j \le n} \lVert x(t) - w_j(t) \rVert \qquad (1)$$

where $n$ is the number of neurons in the map at instant $t$. The weights of the BMU and its neighbours are then adjusted towards the input pattern, following Eq. (2):

$$w_j(t+1) = w_j(t) + \alpha(t)\, h_{cj}(t)\, [x(t) - w_j(t)] \qquad (2)$$

where $\alpha(t)$ is the learning rate. One of the main parameters influencing the training process is the neighbourhood function $h_{cj}(t)$ between the BMU $c$ and neuron $j$.

Feature selection using PCA:
Principal Component Analysis (PCA) is a powerful statistical tool for reducing the dimensionality of multivariate datasets in many areas, such as image analysis, data compression, time series prediction and analysis of biological data, by finding a new set of variables (Abdi and Williams, 2010). The new set of variables, called Principal Components (PCs), has a smaller dimension than the original set and is ordered by the fraction of the total information that each component retains. The PCs are chosen so that the first principal component has the greatest possible variance; the second component is computed under the constraint of being orthogonal to the first and having the greatest possible remaining variance, and so on. In our study, we consider the use of PCA for extracting relevant features from the neuron vectors $w_j$, where $w_j$ is the $j$-th weight vector of the $n$ neurons resulting from the SOM training process, each having $p$ features (the dimension $d$ of the input vectors). We therefore have a matrix $W$ of size $n \times p$:

$$W = \begin{pmatrix} w_{11} & \cdots & w_{1p} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{np} \end{pmatrix}$$

These vectors are now subjected to principal component analysis.
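The training procedure of Eqs. (1) and (2), which yields the neuron weight matrix passed on to PCA, can be illustrated with a minimal Python (NumPy) sketch. The function name `train_som`, the Gaussian neighbourhood, the exponential decay of the learning rate and neighbourhood radius, and all default parameter values are assumptions made for illustration; they are not specified in this section.

```python
import numpy as np

def train_som(X, rows=5, cols=5, n_iter=1000, alpha0=0.5, sigma0=2.0, seed=0):
    """Train a rectangular SOM; return the weights W (n neurons x p features)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = rows * cols, X.shape[1]
    W = rng.random((n, p))                        # random initial weights in [0, 1)
    # Grid coordinates of every neuron, needed by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]               # randomly chosen sample x(t)
        c = np.argmin(np.linalg.norm(W - x, axis=1))   # Eq. (1): index of the BMU
        decay = np.exp(-t / n_iter)               # assumed exponential decay schedule
        alpha, sigma = alpha0 * decay, sigma0 * decay
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)     # squared grid distance to the BMU
        h = np.exp(-d2 / (2.0 * sigma ** 2))      # assumed Gaussian neighbourhood
        W += alpha * h[:, None] * (x - W)         # Eq. (2): move weights towards x(t)
    return W
```

With inputs scaled to [0, 1], every update is a convex combination of the current weight and the sample, so the weights remain in the same range. The returned array has one row per neuron and plays the role of the n x p neuron matrix analysed by PCA.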
They are transformed into a new set of vectors with derived dimensions (the principal components), but in this case their information content is ranked and concentrated in the first dimensions. The dataset can thus be reduced to a smaller number of dimensions with low information loss. The transformation is based on the matrix computation:

$$R = W A$$

under the constraints that $R^{T} R$ is a diagonal matrix and that $A^{T} A$ is an identity matrix. Matrix $R$ has the same dimension as $W$ and is related to it by the linear transformation $A$.
$R$ will have the property that most of its information content is stored in the first dimensions, and $A$ should be chosen so that $R$ represents the largest variance of the input data.
There are several ways of obtaining the solution to this problem. In this study, we construct the transformation matrix $A$ using the covariance method. Before calculating the covariance matrix, we need to centre the data in matrix $W$ as follows:

$$W_c = W - \mathbf{1}_n \bar{w}^{T}$$

where $\mathbf{1}_n$ is an $n \times 1$ column vector of ones and $\bar{w}$ is a vector of $p$ dimensions that contains the empirical mean of each column $j = 1, \dots, p$ of $W$, defined as:

$$\bar{w}_j = \frac{1}{n} \sum_{i=1}^{n} w_{ij}$$

The covariance matrix is now defined by the outer product of $W_c$ with itself:

$$C = \frac{1}{n-1} W_c^{T} W_c$$

The eigenvalues of $C$ for the given data should then be calculated. The $m$ eigenvectors corresponding to the largest eigenvalues of $C$ define a linear transformation from the $p$-dimensional space to an $m$-dimensional space in which the features are uncorrelated. An eigenvalue and eigenvector of the matrix $C$ are a scalar $\lambda$ and a nonzero vector $v$ such that:

$$C v = \lambda v$$

Let $\{\lambda_1, \dots, \lambda_p\}$, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, be the set of eigenvalues of $C$ and $\{v_1, \dots, v_p\}$ their corresponding eigenvectors, called the principal axes. Then:

$$A = [v_1, v_2, \dots, v_p]$$

The problem in using PCA for dimensionality reduction is to define the number of principal components needed to get a good representation of the data. Different methods exist for choosing this value (Abdi and Williams, 2010; Jolliffe, 1972; King and Jackson, 1999). Kaiser's stopping rule (Kaiser, 1960) retains and interprets any component whose eigenvalue is greater than 1.00. The scree test (Cattell, 1966) plots the eigenvalues in descending order of magnitude against their component number and determines where they level off (D'Agostino and Russell, 2005). The percentage of variance explained (Jolliffe, 1972; Shaharudin and Ahmad, 2017) retains components that each account for at least a given percentage of the total variance.
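The covariance method described above can be sketched in a few lines of Python (NumPy). The function name `pca_covariance` is hypothetical, and the 1/(n-1) normalisation of the covariance matrix is an assumption; any constant factor rescales all eigenvalues equally and leaves the principal axes unchanged.

```python
import numpy as np

def pca_covariance(W):
    """PCA of the neuron matrix W (n x p) by the covariance method.

    Returns the eigenvalues in descending order, the matching eigenvectors
    (principal axes) as columns, and the centred data projected onto them.
    """
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    W_c = W - W.mean(axis=0)                  # centre each column of W
    C = (W_c.T @ W_c) / (n - 1)               # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)      # C is symmetric; eigh sorts ascending
    order = np.argsort(eigvals)[::-1]         # reorder to descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    R = W_c @ eigvecs                         # coordinates on the principal axes
    return eigvals, eigvecs, R
```

Keeping only the first m columns of the projected data gives the reduced representation; `numpy.linalg.eigh` is used rather than `eig` because the covariance matrix is symmetric.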
The cumulative percentage of variance method retains components up to a chosen percentage of the cumulative variance; several such percentages have been suggested. In this study, the cumulative percentage of variance explained was used, following the equation:

$$r = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{p} \lambda_i} \times 100\%$$

where $\lambda_1 \ge \dots \ge \lambda_p$ are the eigenvalues of the covariance matrix and $m$ is the number of retained components. The chosen subset of components is a good estimate of the $p$-dimensional space if the ratio $r$ is sufficiently large, that is, greater than a threshold, usually at least 70%. Applying this method directly to the full dataset has a computational cost that grows with the number of samples; applying PCA to the neurons instead reduces the computation enormously (Fig. 2).
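As an illustration of the cumulative percentage of variance criterion, the sketch below returns the smallest number of components whose cumulative share of the total variance reaches a threshold. The function name `n_components_for_threshold` and the example eigenvalues are hypothetical; the 70% default follows the threshold mentioned above.

```python
def n_components_for_threshold(eigenvalues, threshold=70.0):
    """Smallest m whose cumulative percentage of variance reaches the threshold."""
    ordered = sorted(eigenvalues, reverse=True)   # largest eigenvalue first
    total = sum(ordered)
    cumulative = 0.0
    for m, lam in enumerate(ordered, start=1):
        cumulative += lam
        if 100.0 * cumulative / total >= threshold:
            return m
    return len(ordered)                           # fallback: keep every component

# Illustrative eigenvalues: cumulative shares are 50%, 80%, 90% and 100%.
print(n_components_for_threshold([5.0, 3.0, 1.0, 1.0]))        # prints 2
print(n_components_for_threshold([5.0, 3.0, 1.0, 1.0], 90.0))  # prints 3
```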

DATASETS DESCRIPTION
The performance of the proposed SOM-PCA method was evaluated on a variety of real classification problems, whose specifications are listed in Table 1. All datasets are available from the UCI Machine Learning Repository. Table 1 summarizes the number of features, instances and classes for each dataset used in this study.

Wisconsin breast cancer:
The dataset was collected by Dr. William H. Wolberg (1989-1991) at the University of Wisconsin-Madison Hospitals. It contains 699 instances, of which 458 (65.5%) are benign and 241 (34.5%) are malignant, characterized by nine features that are used to predict benign or malignant disease. The data contain 16 instances with a single missing value.

Heart-Statlog:
The dataset is based on data from the Cleveland Clinic Foundation and it contains 270 instances belonging to two classes: the presence or absence of heart disease. It is described by 13 features.

Cardiotocography Data Set:
The dataset consists of measurements of Fetal Heart Rate (FHR) and Uterine Contraction (UC) features on cardiotocograms classified by expert obstetricians. A total of 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features measured. The CTGs were also classified by three expert obstetricians and a consensus classification label was assigned to each of them. Classification was performed with respect to both a morphologic pattern (A, B, C, ...) and a fetal state (N, S, P). The dataset, available in the UCI Machine Learning Repository, can therefore be used for either 10-class or 3-class experiments.

RESULTS AND DISCUSSION
In order to show the efficiency of the proposed method, SOM-PCA was tested on a variety of real benchmark classification problems downloaded from the UCI Machine Learning Repository (a short description of each dataset is given in Table 1), evaluated in terms of accuracy and compared to the classical SOM. In our topology, the hidden layer consists of 25 neurons (rectangular 5×5 topology). The output layer consists of one neuron whose value is 0 or 1. The general architecture of the proposed network is shown in Fig. 1. A summary of the parameters used is given in Table 2. First, all datasets were prepared for classification: the missing values were replaced by the median value (Acuña and Rodriguez, 2004), the data were normalized using min-max normalization (Sola and Sevilla, 1997; Jain and Bhandare, 2011), the datasets were divided into two parts, 70% for training and 30% for testing, and all the weights were initialized to random numbers. The training process was then carried out. When training on the training data was complete, the final weights of the network were saved for the feature extraction procedure using the PCA algorithm, after which the test dataset was applied.
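The preparation steps described above (median imputation, min-max normalization and a 70/30 split) can be sketched in plain Python as follows. The function names, the use of `None` to mark a missing value, the fixed random seed and the choice to normalize before splitting are assumptions made for illustration, not details confirmed by the study.

```python
import random
from statistics import median

def impute_median(rows):
    """Replace missing values (None) in each column by that column's median."""
    cols = list(zip(*rows))
    medians = [median(v for v in col if v is not None) for col in cols]
    return [[medians[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column is mapped to 0."""
    cols = list(zip(*rows))
    lo = [min(col) for col in cols]
    hi = [max(col) for col in cols]
    return [[(v - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j, v in enumerate(row)] for row in rows]

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

A typical call chain would be `split_train_test(min_max_normalize(impute_median(data)))`.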
To evaluate SOM-PCA, we used the classification accuracy, defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP (True Positives) is the number of cases correctly classified as positive, TN (True Negatives) is the number of cases correctly classified as negative, FP (False Positives) is the number of cases incorrectly classified as positive and FN (False Negatives) is the number of cases incorrectly classified as negative.
Table 3 shows the best results obtained for the accuracy of the classifier using the proposed method for feature reduction. These results are taken from Fig. 3 to 5 and are expressed as percentages. In these figures, the horizontal axis represents the number of PCs and the vertical axis represents the classification accuracy (grey curve) and the cumulative percentage of variance explained (black curve), both as percentages. The figures demonstrate that with the proposed method the accuracy is almost unchanged or even increased: there is a slight improvement in the classification rate, and the maximum value is obtained when the cumulative percentage $r$ is between 75% and 95%, after which it begins to decrease. In other words, when the number of contributing variables increases further, the classification rate decreases; we can therefore keep only the variables whose cumulative percentage is less than 95%, since the remaining features have no effect on the classification rate. The rest of this section presents the results in detail for each dataset.
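For a two-class problem, accuracy is the fraction of correctly classified cases, (TP + TN) / (TP + TN + FP + FN). A minimal Python sketch of this computation is given below; the function name `accuracy` and the label encoding (1 for the positive class, 0 for the negative class) are assumptions chosen for illustration.

```python
def accuracy(y_true, y_pred, positive=1):
    """Fraction of correctly classified cases, (TP + TN) / (TP + TN + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative labels: TP = 2, TN = 1, FP = 1, FN = 1, so accuracy is 3/5.
print(accuracy([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # prints 0.6
```

Multiplying the result by 100 gives the accuracy as a percentage.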
Breast cancer dataset: Figure 3 shows the cumulative sum of explained variance over the number of selected components for the breast cancer dataset (black curve) and the accuracy obtained (grey curve). The black curve shows that most of the variance (79%) can be explained by the first two principal components. The third, fourth and fifth principal components still carry some information (16%), while the remaining principal components can safely be dropped without losing too much information. Together, the first five principal components contain 95% of the information. Turning to the grey curve, we can see that the accuracy is around 97% when using 5 features. Classifying the dataset with the original features gives a classification accuracy of 95.85%. With the proposed method, the accuracy increases to 97.14%. The highest accuracy for this dataset is obtained when the proposed SOM-PCA approach is employed with 5 components.
Heart-Statlog: Figure 4 shows the cumulative variance explained over the number of selected components for the Heart-Statlog dataset. Most of the variance can be explained by the first eight principal components, which contain 91% of the information. In contrast, the best accuracy, 81.48%, is obtained with the first three components. Compared to the 79% accuracy obtained by classifying the dataset with the original features, the accuracy with SOM-PCA increases slightly to 81.48%. The highest accuracy for this dataset is obtained when the proposed SOM-PCA approach is employed with three components.
Cardiotocography dataset: According to Fig. 5, the first five components account for 77% of the variance. The remaining components contribute gradually decreasing variance, and we assume that this smaller variation is mostly unimportant. The accuracy is around 89% when using 5 features and remains almost constant over the remaining components. The accuracy obtained using all original features is 79.93%. Applying the proposed method, the accuracy is increased significantly to 97.14%. The highest accuracy is reported for this dataset when the proposed SOM-PCA approach is employed with 5 components.

CONCLUSION
This study presents the results of the direct classification of a variety of datasets using the self-organizing map algorithm. A novel approach based on Self-Organizing Maps and Principal Component Analysis is proposed to address the problem of classification. The main innovation is to reduce the dimension of the neurons obtained after the SOM training; the new dataset represents the map with high accuracy. According to the numerical results, the improved method gives better accuracy and a shorter training time by reducing the dimension of the map, which also decreases the memory needed to store the map. The presented method was applied to low-dimensional datasets and can be extended to high-dimensional data. An improvement of up to 2% is obtained using SOM-PCA compared to the classical SOM; it can be concluded that this method can be a solution to some problems where very few training samples exist and feature reduction is needed to apply unsupervised classifiers.