Outlier Removal Approach as a Continuous Process in the Basic K-Means Clustering Algorithm

Clustering techniques group similar data items together. K-means clustering is a commonly used method that starts from randomly selected initial centroids. However, the conventional method does not consider data preprocessing, which is an important task before performing clustering on different databases. This study proposes a new approach to the k-means clustering algorithm. Experimental analysis shows that the proposed method performs well on an infectious disease data set when compared with the conventional k-means clustering method.


INTRODUCTION
Data analysis techniques are essential for studying the steadily growing volumes of large, high-dimensional data. In this setting, cluster analysis (Hastie et al., 2001) attempts to obtain a first structural insight into the data by dividing data items into disjoint classes such that items belonging to the same cluster are similar, while items belonging to different clusters are dissimilar. One of the best known and most effective clustering techniques is the K-means method (Hartigan and Wong, 1979), which uses prototypes (centroids) to represent clusters by optimizing the error sum of squares. (A detailed account of K-means and related techniques is given in Jain and Dubes (1988).)
The computational complexity of the traditional K-means algorithm is very high, especially for large data sets. Moreover, the number of distance computations grows rapidly with the dimensionality of the data. As dimensionality increases, usually only a few dimensions are relevant to particular clusters, while data on the irrelevant dimensions can produce a great deal of noise and conceal the true clusters that could otherwise be observed. Furthermore, as dimensionality rises, data tend to become very sparse: data points can all appear nearly equidistant from one another, and the distance measure, which is fundamental to cluster analysis, becomes meaningless.
Feature reduction, or dimensionality reduction, is therefore a central data-preprocessing step in cluster analysis for data sets with a large number of features.
High-dimensional data are therefore often transformed into lower-dimensional data through Principal Component Analysis (PCA) (Jolliffe, 2002) (or singular value decomposition), in which coherent patterns can be detected more easily. This type of unsupervised dimension reduction is used in very broad areas including meteorology, image processing, genomic analysis and information retrieval. It is also well known that PCA can be used to project data into a lower-dimensional subspace, to which K-means is then applied (Zha et al., 2002). In other cases, data are embedded in a low-dimensional space such as the eigenspace of the graph Laplacian before K-means is applied (Ng et al., 2001).
An important justification for PCA-based dimension reduction is that it retains the dimensions with the largest variances. This is equivalent to finding the optimal low-rank approximation (in the L2 norm) of the data using the SVD (Eckart and Young, 1936). However, the dimension-reduction property by itself is insufficient to explain the effectiveness of PCA.
In this study, we examine the link between these two widely used approaches and a data standardization procedure. We show that principal component analysis and standardization essentially provide the continuous solution for the cluster membership indicators of the K-means clustering technique, i.e., PCA dimension reduction automatically performs data clustering according to the K-means objective function. This gives a principled justification for PCA-based data reduction.
The result also suggests an effective way to address the K-means clustering problem. The K-means technique uses K prototypes, the centroids of the clusters, to characterize the data; these are determined by minimizing the error sum of squares.

K-means clustering algorithm:
A conventional procedure for k-means clustering is straightforward. To start, we choose the number of clusters K and assume a centroid or center for each of these clusters. We can take any random items as the initial centroids, or the first K items in the sequence can serve as the initial centroids.
The K-means technique then performs the three steps below until convergence. Iterate until stable (= zero items change group):

• Determine the centroid coordinates
• Determine the distance of every item to the centroids
• Assign each item to the cluster with the minimal distance

Principal component analysis: PCA can be viewed mathematically as an orthogonal linear transformation of the data to a new coordinate system such that the largest variance of any projection of the data lies on the first coordinate (called the first principal component), the second largest on the second coordinate, and so on. It transforms a number of possibly correlated variables into a smaller set of uncorrelated variables called principal components. PCA is a statistical technique for determining the key variables in a high-dimensional data set that account for differences in the observations, and it is very useful for analysis and visualization with little loss of information.
Principal component: Principal components can be determined by the eigenvalue decomposition of a data set's correlation or covariance matrix, or by SVD of the data matrix, normally after mean-centering the data for every feature. The covariance matrix is preferred when the variances of the features are comparable; when the features are of different types or widely different scales, it is better to use the correlation matrix. The SVD method is employed for numerical precision.
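As a minimal sketch of this choice (numpy only; function and variable names are illustrative, not from the paper), the PCs can be extracted from either the covariance or the correlation matrix:

```python
import numpy as np

def pcs(X, use_correlation=False):
    """Eigenvalue decomposition of the covariance (or correlation)
    matrix of the data; eigenvalues returned in descending order."""
    M = np.corrcoef(X, rowvar=False) if use_correlation else np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(M)       # eigh returns ascending order
    order = np.argsort(vals)[::-1]       # sort descending
    return vals[order], vecs[:, order]
```

With one feature on a much larger scale than the others, the covariance-based first PC absorbs nearly all the variance, while the correlation-based PCs treat the features on an equal footing, which is why the correlation matrix is preferred for mixed-scale features.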

LITERATURE REVIEW
Many efforts have been made by researchers to enhance the performance and efficiency of the traditional k-means algorithm. Principal Component Analysis (Valarmathie et al., 2009; Yan et al., 2006) is an unsupervised feature reduction technique for projecting high-dimensional data into a new lower-dimensional representation that explains as much of the variance in the data as possible with minimum reconstruction error. Chris and Xiaofeng (2006) proved that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering, and that the subspace spanned by the cluster centroids is given by the spectral expansion of the data covariance matrix truncated at K-1 terms. This result signifies that unsupervised dimension reduction is closely related to unsupervised learning. For dimension reduction, it gives new insights into the observed usefulness of PCA-based data reduction, beyond the traditional noise-reduction justification; mapping data points into a higher-dimensional space by means of kernels indicates that the solution for kernel K-means is given by kernel PCA. For learning, these results suggest effective techniques for K-means clustering. In Ding and He (2004), PCA is used to reduce the dimensionality of the data set and the k-means algorithm is then applied in the PCA subspace; performing PCA is equivalent to carrying out Singular Value Decomposition (SVD) on the covariance matrix of the data. Karthikeyani and Thangavel (2009) employed the SVD technique to determine arbitrarily oriented subspaces with good clustering, and extended the K-means clustering algorithm by applying global normalization before performing clustering on distributed data sets, without necessarily downloading all the data to a single site. The performance of the proposed normalization-based distributed K-means clustering algorithm was compared against the distributed K-means clustering algorithm and a normalization-based centralized K-means clustering algorithm. The quality of clustering was also compared across three normalization procedures, min-max, z-score and decimal scaling, for the proposed distributed clustering algorithm. The comparative analysis shows that the distributed clustering results depend on the type of normalization procedure. Alshalabi et al. (2006) designed an experiment to test the effect of different normalization methods on accuracy and simplicity; the results suggested choosing z-score normalization as the method giving the best accuracy.

Removal of the weaker principal components:
The transformation of the data set to the new principal component axes yields as many PCs as there were original features. For many data sets, however, the first few PCs account for most of the variance, so the remaining ones can be eliminated with minimal loss of information.
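The variance percentages and cumulative percentages used to decide which PCs to keep can be computed directly from the eigenvalues. A small illustrative helper (numpy only; the name is not from the paper):

```python
import numpy as np

def pc_variance_table(eigvals):
    """Variance, percentage of total variance and cumulative
    percentage for each PC, sorted from strongest to weakest."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    pct = 100.0 * eigvals / eigvals.sum()
    return eigvals, pct, np.cumsum(pct)
```

For example, eigenvalues (4, 2, 1, 1) give percentages (50, 25, 12.5, 12.5) and cumulative percentages (50, 75, 87.5, 100), so keeping the first two PCs already retains 75% of the variance.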

MATERIALS AND METHODS
Let Y = {X₁, X₂, …, Xₙ} denote the d-dimensional raw data set. Then the data matrix is the n×d matrix

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nd} \end{pmatrix}.$$

The z-score is a form of standardization used for transforming normal variates to standard-score form. Given a set of raw data Y, the z-score standardization formula is defined as

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j},$$

where $\bar{x}_j$ and $\sigma_j$ are the sample mean and standard deviation of the j-th attribute, respectively. The transformed variable has a mean of 0 and a variance of 1; the location and scale information of the original variable is lost (Jain and Dubes, 1988).
One important restriction of the z-score standardization is that it must be applied as a global standardization and not as a within-cluster standardization (Milligan and Cooper, 1988).
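A minimal sketch of this global standardization (numpy only; the population standard deviation is used here so that the transformed variance is exactly 1):

```python
import numpy as np

def zscore(X):
    """Global z-score standardization: rescale each attribute to
    mean 0 and variance 1 using the column statistics of the whole
    data set, not within-cluster statistics."""
    mean = X.mean(axis=0)            # x_bar_j for each attribute
    std = X.std(axis=0)              # sigma_j for each attribute
    return (X - mean) / std
```

After the transformation, every column of the result has mean 0 and variance 1, as stated above.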

Principal component analysis:
Let $v = (v_1, v_2, \ldots, v_d)'$ be a vector of d random variables, where $'$ denotes the transpose operation. The first step is to find a linear function $a_1'v$ of the elements of v that maximizes the variance, where $a_1$ is the d-dimensional vector $(a_{11}, a_{12}, \ldots, a_{1d})'$, so that

$$a_1'v = a_{11}v_1 + a_{12}v_2 + \cdots + a_{1d}v_d = \sum_{j=1}^{d} a_{1j}v_j.$$

In the j-th step, we find a linear function $a_j'v$ that is uncorrelated with the previous ones and has maximum variance; after d steps we obtain d such linear functions. The j-th derived variable $a_j'v$ is the j-th PC. In general, most of the variation in v will be accounted for by the first few PCs.
To find the form of the PCs, we need to know the covariance matrix Σ of v. In most realistic cases, the covariance matrix Σ is unknown and is replaced by a sample covariance matrix. For j = 1, 2, …, d, it can be shown that the j-th PC is $z_j = a_j'v$, where $a_j$ is an eigenvector of Σ corresponding to the j-th largest eigenvalue $\lambda_j$.
In the first step, $z_1 = a_1'v$ can be found by solving the following optimization problem: maximize $\mathrm{var}(a_1'v)$ subject to $a_1'a_1 = 1$, where

$$\mathrm{var}(a_1'v) = a_1'\Sigma a_1.$$

To solve this optimization problem, the technique of Lagrange multipliers can be used. Let λ be a Lagrange multiplier; we want to maximize

$$a_1'\Sigma a_1 - \lambda(a_1'a_1 - 1).$$

Differentiating with respect to $a_1$, we have

$$(\Sigma - \lambda I_d)a_1 = 0,$$

where $I_d$ is the d×d identity matrix. Thus λ is an eigenvalue of Σ and $a_1$ is the corresponding eigenvector; since $\mathrm{var}(a_1'v) = a_1'\Sigma a_1 = \lambda$, the maximum is attained when $a_1$ is the eigenvector corresponding to the largest eigenvalue of Σ. In fact, it can be shown that the j-th PC is $a_j'v$, where $a_j$ is an eigenvector of Σ corresponding to its j-th largest eigenvalue $\lambda_j$ (Jolliffe, 2002).
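The eigendecomposition route just derived can be sketched as follows (numpy only; names are illustrative). The check below confirms the Lagrange-multiplier result: the sample variance of each PC score equals the corresponding eigenvalue, with the largest first.

```python
import numpy as np

def principal_components(V):
    """PCs via the sample covariance matrix Sigma: the j-th PC is
    a_j' v, where a_j is the eigenvector of Sigma belonging to the
    j-th largest eigenvalue lambda_j."""
    sigma = np.cov(V, rowvar=False)           # sample covariance of v
    eigvals, eigvecs = np.linalg.eigh(sigma)  # ascending order
    order = np.argsort(eigvals)[::-1]         # sort descending
    return eigvals[order], eigvecs[:, order]
```

Projecting the mean-centered data onto the eigenvectors yields uncorrelated scores whose variances are exactly the eigenvalues.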

Singular value decomposition:
Let D = {x₁, x₂, …, xₙ} be a numerical data set in a d-dimensional space. Then D can be represented by an n×d matrix $X = (x_{ij})$, where $x_{ij}$ is the j-th component value of $x_i$. Let $\bar\mu = (\bar\mu_1, \bar\mu_2, \ldots, \bar\mu_d)$ be the vector of column means of X, and let $e_n$ be a column vector of length n with all elements equal to one. Then the SVD expresses the centered matrix as

$$X - e_n\bar\mu = USV^T,$$

where U is an n×n column-orthonormal matrix, i.e., $U^TU = I$ is an identity matrix, S is an n×d diagonal matrix containing the singular values and V is a d×d unitary matrix, i.e., $V^HV = I$, where $V^H$ is the conjugate transpose of V. The columns of V are the eigenvectors of the covariance matrix C of X; precisely,

$$C = \frac{1}{n}(X - e_n\bar\mu)^T(X - e_n\bar\mu).$$

Since C is a d×d positive semi-definite matrix, it has d nonnegative eigenvalues and d orthonormal eigenvectors. Without loss of generality, let the eigenvalues of C be ordered in decreasing order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. Let $\sigma_j$ (j = 1, 2, …, d) be the standard deviation of the j-th column of X. The trace of C is invariant under rotation, i.e.,

$$\mathrm{tr}(C) = \sum_{j=1}^{d}\lambda_j = \sum_{j=1}^{d}\sigma_j^2.$$

Noting that $e_n^TX = n\bar\mu$ and $e_n^Te_n = n$, and since V is an orthonormal matrix, the singular values are related to the eigenvalues by

$$s_j^2 = n\lambda_j, \quad j = 1, 2, \ldots, d.$$

The eigenvectors constitute the PCs of X, and uncorrelated features are obtained by the transformation $Y = (X - e_n\bar\mu)V$. PCA selects the features with the highest eigenvalues.
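The SVD route can be sketched as follows (numpy only; names are illustrative), using the convention C = (1/n)·XcᵀXc so that the squared singular values are n times the eigenvalues of C:

```python
import numpy as np

def svd_pca(X):
    """PCA via SVD of the mean-centered data matrix: the columns of V
    are the eigenvectors of the covariance matrix C, and Y = Xc @ V
    gives the uncorrelated features."""
    Xc = X - X.mean(axis=0)                  # X - e_n * mu_bar
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / len(X)                  # s_j^2 = n * lambda_j
    return Xc @ Vt.T, eigvals, Vt.T          # scores, eigenvalues, V
```

The resulting columns of Y are mutually uncorrelated: YᵀY is the diagonal matrix of squared singular values.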

K-means clustering:
Given a series of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets ($k \le n$) $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the Within-Cluster Sum of Squares (WCSS):

$$\underset{S}{\arg\min}\; \sum_{i=1}^{k}\sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2,$$

where $\mu_i$ is the mean of the items in $S_i$.
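A minimal sketch of the three-step procedure described earlier, together with the WCSS objective (numpy only; the function name and random initialization scheme are illustrative):

```python
import numpy as np

def kmeans_wcss(X, k, max_iter=100, seed=0):
    """Basic K-means: pick k random items as initial centroids, then
    alternate assignment and centroid update until no item changes
    group; returns the labels and the Within-Cluster Sum of Squares."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # distance of every item to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)      # nearest-centroid assignment
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # converged: zero items moved
        labels = new_labels
        for j in range(k):                     # recompute centroids
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    wcss = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(k))
    return labels, wcss
```

Note that K-means only finds a local minimum of the WCSS, so the result can depend on the random initial centroids.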

RESULTS AND DISCUSSION
Much of the noise present in a large data set is filtered out by the normalization and PCA/SVD preprocessing stages, since discarding the low-variance components removes noise while retaining the dominant structure of the data.
In this section, we examine and evaluate the performance of the following approaches: conventional k-means on the original data set, k-means on the normalized data set, k-means on the PCA/SVD-reduced data set and k-means on the normalized plus PCA/SVD-reduced data set, with respect to the objective function of the k-means technique. The quality of a particular clustering is evaluated by the error sum of squares of the intra-cluster distances, that is, the distances between the data vectors in a group and the centroid of that group; the smaller this sum of squared differences, the better the accuracy of the clustering.

Fig. 1: Basic K-means algorithm
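The four configurations evaluated here can be assembled into one comparison sketch (numpy only; names and the toy data in the usage note are illustrative — the paper's infectious disease data set and its reported SSE values of 211.21, 143.14, 65.57 and 51.26 are not reproduced by this code):

```python
import numpy as np

def kmeans_sse(X, k, seed=0, iters=50):
    """Tiny K-means returning the error sum of squares (WCSS)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        lab = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(1)
        for j in range(k):
            if np.any(lab == j):
                c[j] = X[lab == j].mean(0)
    return sum(((X[lab == j] - c[j]) ** 2).sum() for j in range(k))

def pca_reduce(X):
    """Keep only the PCs whose variance is at least the mean variance."""
    Xc = X - X.mean(0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    return Xc @ vecs[:, vals >= vals.mean()]

def zscore(X):
    """Global z-score standardization."""
    return (X - X.mean(0)) / X.std(0)

def compare(X, k=2, seed=0):
    """SSE of k-means under the four preprocessing variants."""
    return {
        "original": kmeans_sse(X, k, seed),
        "normalized": kmeans_sse(zscore(X), k, seed),
        "pca": kmeans_sse(pca_reduce(X), k, seed),
        "normalized+pca": kmeans_sse(pca_reduce(zscore(X)), k, seed),
    }
```

Note that the four SSE values are computed in different coordinate systems (original, standardized and reduced), so they are comparable in the sense used in this section — as clustering compactness under each preprocessing variant — rather than as distances in a single common space.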
Figure 1 presents the result of the basic K-means algorithm using the original data set of 20 data objects and 7 attributes shown in Table 1. Two points assigned to cluster 1 and four points assigned to cluster 2 fall outside the cluster formation, with an error sum of squares of 211.21.
The number of PCs found is the same as the number of initial features. To remove the weaker components from the PC set, we computed the corresponding variances, their percentages and cumulative percentages, shown in Tables 2 and 6. We then discarded the PCs with variances lower than the mean variance and kept the others. The reduced PCs are shown in Tables 3 and 7.
Table 2 presents the variances, the percentages of the variances and the cumulative percentages corresponding to the principal components.
Figure 2 shows the Pareto plot of the variance percentages against the principal components for the original data set of 20 data objects and 7 variables.
The transformation matrix built from the reduced PCs was applied to the original data set to generate a reduced approximate data set, which is used for the remaining data exploration; this reduced data set, containing 4 attributes, is shown in Table 4.
Figure 3 presents the result of the K-means algorithm after applying principal component analysis to the original data set. With the reduced data set of 20 data objects and 4 attributes shown in Table 4, all the points assigned to clusters 1 and 2 lie within the cluster formation, with an error sum of squares of 143.14.
Figure 4 presents the result of the K-means algorithm on the data set rescaled with the z-score standardization method, having 20 data objects and 7 attributes as shown in Table 5. All the points assigned to clusters 1 and 2 lie within the cluster formation, with an error sum of squares of 65.57. Table 6 presents the variances, the percentages of the variances and the cumulative percentages corresponding to the principal components.
The transformation matrix built from the reduced PCs (Table 7) was applied to the standardized data set to generate a reduced approximate data set for the remaining data exploration; this reduced data set, containing 4 attributes, is shown in Table 8.
Figure 5 presents the result of the K-means algorithm after applying standardization and principal component analysis to the original data set. With the reduced data set of 20 data objects and 4 attributes shown in Table 8, all the points assigned to clusters 1 and 2 lie within the cluster formation, with an error sum of squares of 51.26.

CONCLUSION
We have proposed a hybrid algorithm that draws on the speed and simplicity of k-means while adding z-score standardization and PCA/SVD feature reduction as a preprocessing stage; on the infectious disease data set, this preprocessing reduced the error sum of squares from 211.21 to 51.26.

Table 2: The variances and cumulative percentages
Table 5: The rescaled data set