Distance Based Hybrid Approach for Cluster Analysis Using Variants of K-means and Evolutionary Algorithm

Abstract: Clustering is the process of grouping similar objects into a specified number of clusters. K-means and K-medoids are the most popular partitional clustering techniques for large data sets. However, they are sensitive to the random selection of initial centroids and tend to become trapped in local optima. The K-means++ algorithm has a better convergence rate than other algorithms. A distance metric is used to measure the dissimilarity between objects, and the Euclidean distance is the metric most commonly used in clustering algorithms. In recent years, evolutionary algorithms have been applied as global optimization techniques for solving clustering problems. In this study, we present a hybrid clustering algorithm that combines K-means++ with PSO (K++_PSO) and is based on different distance metrics such as City Block and Chebyshev. The algorithms are tested on four popular benchmark data sets from the UCI machine learning repository and on an artificial data set. The clustering results are evaluated through fitness function values. We have made a comparative study of the proposed algorithm with other algorithms. It has been found that the K++_PSO algorithm using the Chebyshev distance metric produces better clustering results than the other approaches.


INTRODUCTION
With the rapid development of information technology, huge amounts of data collected from various fields are stored electronically. The most challenging task for a business analyst is to transform the large volume of data stored in data warehouses into meaningful information, called knowledge. Knowledge Discovery in Databases (KDD) is used to achieve this task, and data mining is one part of the KDD process. Data mining involves the use of data analysis techniques to discover previously unknown, valid patterns and relationships in large data sets. Clustering is one of the important data mining activities (Han and Kamber, 2001).
Cluster analysis is the process of grouping a set of data points in such a way that data points in the same group are similar to one another and data points from different groups are dissimilar. Clustering is called unsupervised learning because there is no prior knowledge of the patterns. The aim of clustering is to identify both dense and sparse regions in a data set. Clustering is used in many areas, including pattern recognition, pattern analysis, artificial intelligence, image segmentation, image processing, bioinformatics, information retrieval, data mining and knowledge discovery. It is therefore an important research topic in diverse areas.
Data clustering can be broadly categorized into hierarchical methods, partitional methods, fuzzy clustering methods, hard clustering methods and model-based methods (Han and Kamber, 2001; Kaufman and Rousseeuw, 1990). Hierarchical methods create a hierarchical decomposition of the data points. They can be either top-down or bottom-up. Top-down algorithms start with all data points in a single cluster and then split it into smaller groups until each data point is in its own cluster. Bottom-up algorithms begin with each data point forming a separate cluster. They successively merge the clusters that are close to one another, until all clusters are merged into one. Partitional methods partition the data set into a predefined number of clusters. Given a data set of 'N' data points, they attempt to find 'k' groups that satisfy the following requirements: each data point must belong to exactly one group and each group must contain at least one data point. In fuzzy clustering methods, each data point can belong to more than one cluster. A membership value between 0 and 1 is associated with each data point for each cluster. In hard clustering methods, each data point can belong to only one cluster. Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. They can be either hierarchical or partitional depending on the structure.
A broad review of the important clustering algorithms can be found in the literature (Jain and Dubes, 1998; Berkhin, 2002; Xu and Wunsch II, 2005). The K-means algorithm was proposed by MacQueen (1967). It is a center-based clustering method. The K-medoids algorithm (Han and Kamber, 2001; Kaufman and Rousseeuw, 1990) uses the most representative data points, called medoids, instead of centroids. K-means and K-medoids are the most popular and widely used partitional data clustering methods. However, they easily get stuck in a local optimal solution and are sensitive to the random selection of initial centers. The number of clusters must also be known in advance. K-means++ (Arthur and Vassilvitskii, 2007) is a variant of the K-means algorithm that selects the initial centroids at random with specific probabilities. The new seeding method has better performance and convergence rate than other algorithms. In recent years, evolutionary algorithms (Yu and Gen, 2010) such as Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) have been used to solve a wide range of optimization problems, including data mining tasks. They avoid the drawbacks of the K-means variants. The PSO algorithm was first proposed by Kennedy and Eberhart (1995). It is a population-based global optimization technique (Chen and Fun, 2004) and has been successfully applied to clustering problems by the research community.
Recently, hybrid techniques have become more popular for solving a variety of real-world optimization problems. The Euclidean distance metric is traditionally applied in most clustering algorithms in the literature. In this study, we have made an attempt to study the performance of algorithms using other important distance metrics such as City Block and Chebyshev. Cluster analysis based on the K-means++ and PSO algorithms (K++_PSO) is proposed in this research using different distance metrics. Through fitness function values, it is shown that the K++_PSO algorithm with the Chebyshev distance metric reports good clustering results on four benchmark data sets (teaching assistant evaluation, thyroid, seeds and breast cancer) and on an artificial data set.

Omran et al. (2002) proposed a new image classification algorithm based on particle swarm optimization. Van der Merwe and Engelbrecht (2003) proposed two new methods for clustering data. Esmin et al. (2008) proposed new data clustering approaches using particle swarm optimization. Tsai and Kao (2010) developed a novel data clustering algorithm based on Particle Swarm Optimization with Selective Regeneration (SRPSO), which includes two features: unbalanced parameter setting and a particle regeneration operation. Mohamed Jafar and Sivakumar (2013) presented a study of the particle swarm optimization algorithm for data clustering using different distance metrics. Bandyopadhyay and Maulik (2002) presented an evolutionary technique based on the K-means algorithm, called KGA-clustering. This algorithm utilizes the searching capability of K-means and avoids the drawback of getting stuck at local optima. Ye and Chen (2005) developed a hybrid PSO and K-means algorithm, called Alternative KPSO-clustering (AKPSO). They presented an evolutionary particle swarm optimization learning-based method to optimally cluster N data points into K clusters. Dong and Qi (2009) proposed a new hybrid clustering algorithm based on particle swarm optimization and K-means. The algorithm generates a better solution than the PSO and K-means algorithms. Yang et al. (2009) proposed a hybrid data clustering algorithm based on PSO and K-Harmonic Means (KHM). The performance of the proposed algorithm was compared with the PSO and KHM clustering algorithms on different data sets. Kao and Lee (2009) presented a new dynamic data clustering algorithm based on K-means and particle swarm optimization, called KCPSO. Rana et al. (2010) presented a hybrid sequential approach for data clustering using K-means and particle swarm optimization. The proposed algorithm avoids the limitations of both algorithms. Niknam and Amiri (2010) proposed an efficient hybrid approach based on the PSO, ACO and K-means algorithms, called the PSO-ACO-K approach, for cluster analysis. Danesh et al. (2011) proposed a data clustering algorithm based on an efficient hybrid of K-Harmonic Means, PSO and GA. The hybrid algorithm helps to solve the local optima problem and overcomes the limitation of slow convergence speed. Chuang et al. (2012) proposed an improved particle swarm optimization based on the Gauss chaotic map for clustering. They used the intra-cluster distance as a measure to search for data cluster centroids. Li et al. (2013)

Mathematical model of clustering problem:
The mathematical model of the clustering problem (Liu et al., 2006) is described as follows. For a given data set of 'n' points, each data point has to be allocated to one of the 'k' clusters such that the sum of the squared Euclidean distances between each data point and the center of the cluster to which it belongs is minimum:

$$\min J = \sum_{i=1}^{n} \sum_{j=1}^{k} w_{ij} \lVert x_i - c_j \rVert^2 \qquad (1)$$

where,

$$\sum_{j=1}^{k} w_{ij} = 1, \quad i = 1, 2, \ldots, n \qquad (2)$$

$$c_j = \frac{1}{N_j} \sum_{i=1}^{n} w_{ij} x_i, \quad j = 1, 2, \ldots, k \qquad (3)$$

where,
$n$ = The number of data points
$k$ = The number of clusters
$w_{ij}$ = The entry of the $n \times k$ 0-1 matrix that equals 1 if the i-th data point is allocated to the j-th cluster and 0 otherwise
$x_i$ = The location of the i-th data point
$c_j$ = The center of the j-th cluster
$N_j$ = The number of data points belonging to the cluster $c_j$

Particle swarm optimization: The PSO algorithm was proposed by Kennedy and Eberhart (1995). It is based on the social behavior of a school of fish, a bacterial colony, a flock of birds or a swarm of bees (Poli et al., 2007). In a PSO system, the individuals are referred to as particles. A population, or swarm, is a collection of particles. It is denoted by $P = (p_1, p_2, \ldots, p_n)$. Each particle flies through the search space, dynamically altering its position and velocity according to its own experience and that of neighboring particles. Therefore, particles tend to fly toward better and better search areas. A predefined fitness function is used to measure the performance of a particle. Each particle maintains a memory of its previous best position, called pbest. The position of the particle is updated using Eq. (5):

$$X_i(t+1) = X_i(t) + V_i(t+1) \qquad (5)$$

where $X_i(t)$ is the position of the i-th particle at iteration $t$ and $V_i(t+1)$ is its updated velocity. The description of the various parameters is shown in Table 1. Each particle X in the PSO system is constructed as follows:

$$X_i = (m_{i1}, m_{i2}, \ldots, m_{ij}, \ldots, m_{iN_c}) \qquad (6)$$

where,
$m_{ij}$ = The j-th cluster center vector of the i-th particle in cluster $C_{ij}$
$N_c$ = The total number of clusters

A swarm is a set of particles. Therefore, a swarm represents a number of candidate clustering solutions for a data set.
The fitness function value of the cluster analysis is calculated by Eq. (7). The fitness function value should be minimized.
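As an illustration of the velocity and position updates referred to as Eq. (4) and (5), the following Python sketch applies one update to a single particle. It assumes the standard inertia-weight form of PSO. The function name `pso_step`, its argument names and the flat-list representation of a particle are illustrative assumptions and are not taken from the authors' Java implementation.

```python
import random


def pso_step(position, velocity, pbest, gbest, w, c1=2.0, c2=2.0, rng=random):
    """Apply one inertia-weight PSO update to a single particle.

    All vectors are flat lists of floats of equal length (assumed layout).
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()  # fresh random factors per dimension
        # inertia term + pull toward the particle's best + pull toward the swarm's best
        v_next = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


# A particle that already sits on both pbest and gbest is moved only by inertia.
pos, vel = pso_step([1.0, 2.0], [1.0, -2.0], [1.0, 2.0], [1.0, 2.0], w=0.5)
print(pos, vel)  # [1.5, 1.0] [0.5, -1.0]
```

With `c1 = c2 = 2.0`, the two attraction terms are on average as strong as the distance to the respective best position, so the inertia weight `w` is the main control on how far a particle overshoots.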
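Distance metrics: Distance metrics are used to determine the similarity or dissimilarity between two objects. They play a vital role in clustering data objects.
The distance between two objects $x_i$ and $x_j$ is denoted by $d(x_i, x_j)$. The important properties of distance metrics are (Gan et al., 2007):
• Non-negativity: $d(x, y) \ge 0$
• Reflexivity: $d(x, y) = 0$ if and only if $x = y$
• Symmetry: $d(x, y) = d(y, x)$
• Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$ for all x, y and z
The various distance metrics and their formulas are shown in Table 2.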
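The three distance metrics named in this study can be sketched in a few lines of Python. The definitions below are the standard ones (square root of summed squared differences, sum of absolute differences and largest absolute difference); the function names are illustrative assumptions.

```python
import math


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def city_block(x, y):
    """City Block (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


p, q = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(p, q))   # 5.0
print(city_block(p, q))  # 7.0
print(chebyshev(p, q))   # 4.0
```

For any pair of points the Chebyshev distance is never larger than the Euclidean distance, which in turn is never larger than the City Block distance, so fitness values computed under different metrics are not directly comparable in magnitude.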

METHODOLOGY
In this section, the K-means clustering algorithm, the K-medoids clustering algorithm and the hybrid algorithm are described.

K-means clustering algorithm:
The aim of clustering is to classify the given data set $X = \{x_1, x_2, \ldots, x_N\}$ into a set of clusters $C_1, C_2, \ldots, C_c$ satisfying the following conditions (Niknam and Amiri, 2010):
• $C_i \neq \emptyset$ for $i = 1, \ldots, c$
• $C_i \cap C_j = \emptyset$ for $i \neq j$
• $\bigcup_{i=1}^{c} C_i = X$
Given a set of 'N' data points and the number of clusters 'c', the objective is to select 'c' cluster centers so as to minimize the mean squared distance. The algorithm produces a solution quickly. The K-means clustering algorithm is described as follows.

Input: Data set $X = \{x_1, x_2, \ldots, x_N\}$
Step 1: Compute the selected distance of each object in the data set from each of the cluster centroids.
Step 2: Assign each object to the cluster whose centroid is at the minimal distance.
Step 3: Calculate the cluster centers:

$$z_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$$

where $N_i$ is the number of data points in the i-th cluster.
Steps 1 to 3 are repeated until the termination condition is satisfied.
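The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' Java code: the stopping rule (a maximum number of iterations plus a center-shift tolerance `eps`), the handling of empty clusters and all function and parameter names are assumptions.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(data, centers, distance=euclidean, max_it=100, eps=1e-5):
    """Alternate nearest-center assignment and mean update from given initial centers."""
    centers = [list(c) for c in centers]
    labels = []
    for _ in range(max_it):
        # Steps 1 and 2: each point joins the cluster of its nearest center.
        labels = [min(range(len(centers)), key=lambda j: distance(x, centers[j]))
                  for x in data]
        # Step 3: each center becomes the mean of the points assigned to it.
        new_centers = []
        for j, old in enumerate(centers):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # assumption: an empty cluster keeps its center
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < eps:  # assumed termination condition
            break
    return centers, labels


data = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
centers, labels = kmeans(data, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [[1.0, 1.5], [9.0, 9.5]]
print(labels)   # [0, 0, 1, 1]
```

Passing a different `distance` function changes only the assignment step; the center update remains the arithmetic mean, as in Step 3.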

K-medoids clustering algorithm:
In this algorithm, the centers are located among the data points themselves. A medoid is defined as the data point of a cluster whose mean dissimilarity to all the data points in the cluster is minimum.
Input: Data set $X = \{x_1, x_2, \ldots, x_N\}$, a set of data points; select the number of cluster centers $1 < c < N$; initialize the random cluster centers $Z = \{z_1, z_2, \ldots, z_c\}$ selected from the data set.
Step 1: Choose c objects at random to be the initial cluster centroids.
Step 2: Assign each object to the cluster associated with the closest cluster center.
Step 3: Find the object i within each cluster that minimizes:

$$\sum_{j \in C_i} d(i, j)$$

where $C_i$ is the cluster containing the object i and $d(i, j)$ is the distance between objects i and j.
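The medoid update, which selects the cluster member with the smallest total distance to the other members, can be illustrated with a short Python sketch. The function names and the choice of the City Block distance for the example are assumptions made for illustration.

```python
def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def update_medoid(cluster, distance=city_block):
    """Return the member of `cluster` whose summed distance to all members is smallest."""
    return min(cluster, key=lambda i: sum(distance(i, j) for j in cluster))


cluster = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0), (10.0, 1.0)]
print(update_medoid(cluster))  # (3.0, 1.0)
```

In this example the arithmetic mean of the cluster is (4.0, 1.0), pulled toward the outlying point, whereas the medoid stays at (3.0, 1.0). This is the usual argument for the robustness of medoids to outliers.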
Hybrid algorithm: Hybrid algorithms integrate two or more optimization techniques. Hybrid algorithms are now popular because of their capability in handling various real-world applications that involve uncertainty and complexity. They make use of the strengths of the individual algorithms. In this study, we have combined the K-means++ and particle swarm optimization algorithms into an algorithm called K++_PSO for cluster analysis. Euclidean distance is the metric commonly used in most clustering algorithms. We have also made an attempt to study the performance of various algorithms with different distance metrics such as City Block and Chebyshev.

Description of K++_PSO algorithm:
Input: Data set
Step 1: Select the initial cluster centers with the K-means++ seeding procedure, in which the distance of a data point x to its closest already-selected center is $D(x) = \min_{j} \lVert x - z_j \rVert$
Step 2a): For iter = 1 to max_it do
Step 2b): Compute the selected distance of each object in the data set from each of the cluster centroids of Step 1
Step 2c): Assign each object to the cluster whose centroid is at the minimal distance
Step 2d): Calculate the new cluster centers using:

$$z_i = \frac{1}{N_i} \sum_{x_j \in C_i} x_j$$

where $N_i$ represents the number of data points in the i-th cluster.
Step 2e): Replace the old cluster centers with the new cluster centers
Step 3: Take the final cluster centers of Step 2 as the initial cluster centers for particle 1, and $N_c$ randomly selected cluster centroids for each of the remaining particles
Step 4: For t = 1 to max_it do
Step 5: For each particle i do
Step 6: For each data vector $z_p$:
• Calculate the fitness value (intra-cluster distance) using Eq. (7)
Step 7: Update the global best and local best positions
Step 8: Update the cluster centroids using Eq. (4) and (5)
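Steps 3 to 8 can be sketched in Python as a PSO refinement stage that starts from the centers produced by Step 2. This is a hedged illustration rather than the authors' implementation. It assumes that the fitness is the sum of distances from each data point to its nearest center, that the inertia weight decreases linearly from 0.9 to 0.4, and that the global best is updated as soon as a particle improves on it. All names (`pso_refine`, `fitness`, `seed_centers`) are hypothetical.

```python
import random


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def fitness(centers, data, distance):
    """Assumed fitness: sum over all points of the distance to the nearest center."""
    return sum(min(distance(x, c) for c in centers) for x in data)


def pso_refine(data, seed_centers, distance=chebyshev, n_particles=10,
               max_it=100, c1=2.0, c2=2.0, w_max=0.9, w_min=0.4, seed=0):
    rng = random.Random(seed)
    k = len(seed_centers)
    # Step 3: particle 1 holds the seed centers, the others hold k random data points.
    swarm = [[list(c) for c in seed_centers]]
    swarm += [[list(rng.choice(data)) for _ in range(k)]
              for _ in range(n_particles - 1)]
    velocity = [[[0.0] * len(c) for c in p] for p in swarm]
    pbest = [[c[:] for c in p] for p in swarm]
    pbest_fit = [fitness(p, data, distance) for p in swarm]
    best = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [c[:] for c in pbest[best]], pbest_fit[best]
    for t in range(1, max_it + 1):                      # Step 4
        w = w_max - (w_max - w_min) * t / max_it        # linearly decreasing inertia
        for i, particle in enumerate(swarm):            # Step 5
            for j, center in enumerate(particle):       # Step 8: move every centroid
                for d in range(len(center)):
                    r1, r2 = rng.random(), rng.random()
                    velocity[i][j][d] = (w * velocity[i][j][d]
                                         + c1 * r1 * (pbest[i][j][d] - center[d])
                                         + c2 * r2 * (gbest[j][d] - center[d]))
                    center[d] += velocity[i][j][d]
            fit = fitness(particle, data, distance)     # Step 6
            if fit < pbest_fit[i]:                      # Step 7: local best
                pbest[i], pbest_fit[i] = [c[:] for c in particle], fit
            if fit < gbest_fit:                         # Step 7: global best
                gbest, gbest_fit = [c[:] for c in particle], fit
    return gbest, gbest_fit


data = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
seed_centers = [(1.0, 1.5), (9.0, 9.5)]
centers, value = pso_refine(data, seed_centers, max_it=20)
print(value <= fitness(seed_centers, data, chebyshev))  # True
```

Because particle 1 starts from the seed centers and the global best is only ever replaced by a strictly better solution, the refined result can never be worse than the seed under the chosen fitness.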

EXPERIMENTAL RESULTS AND DISCUSSION
We compare the performance of the proposed hybrid algorithm with other clustering algorithms on four benchmark data sets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/), namely teaching assistant evaluation, thyroid, seeds and breast cancer, and on an artificial data set.
The teaching assistant evaluation data set consists of 151 objects and 3 different types of classes characterized by 5 features. The data consist of evaluations of teaching performance over three regular semesters and two summer semesters of 151 teaching assistant assignments at the Statistics Department of the University of Wisconsin-Madison. The scores were divided into 3 roughly equal-sized categories ("low", "medium" and "high") to form the class variable.
The thyroid data set consists of 215 instances. Each instance has 5 features: T3-resin uptake test, total serum thyroxin, total serum triiodothyronine, basal Thyroid-Stimulating Hormone (TSH) and maximal absolute difference of the TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value. Each of the samples has to be categorized into one of three classes: Class 1: normal (150 instances), Class 2: hyper (35 instances), Class 3: hypo functioning (30 instances).
The seeds data set contains 210 patterns belonging to 3 different varieties of wheat: Kama, Rosa and Canadian. Each pattern has 7 geometric parameters of wheat kernels: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove.
The breast cancer data set consists of 683 records characterized by 9 features: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. The two categories are benign cases (444 records) and malignant cases (239 records).

Artificial data set:
In this data set, there are 5 classes and each class has 50 samples consisting of 3 features. Each feature of a class is distributed as follows: Class 1 ~ Uniform(80, 100); Class 2 ~ Uniform(60, 80); Class 3 ~ Uniform(40, 60); Class 4 ~ Uniform(20, 40); and Class 5 ~ Uniform(1, 20). The characteristics of the above-mentioned data sets are shown in Table 3.

The algorithms perform best under the following selected parameter values. The number of particles (p) is set to 10. The cognitive component ($c_1$) and the social component ($c_2$) are set to 2.0. The inertia weight ($\omega$) decreases linearly from 0.9 to 0.4 throughout the search process and is calculated by Eq. (11):

$$\omega = \omega_{max} - \frac{\omega_{max} - \omega_{min}}{I_{max}} \times I \qquad (11)$$

where $\omega_{max}$ and $\omega_{min}$ are the initial and final values of the weighting coefficient, respectively ($\omega_{max} = 0.9$ and $\omega_{min} = 0.4$); $I_{max}$ is the maximum number of iterations; and $I$ is the current iteration number. The maximum number of iterations is 100. The experiments are conducted through 10 independent runs for all the algorithms. The iteration error ($\varepsilon$) is 0.00001.

The aim of this paper is to study the effect of the hybrid algorithm for data clustering using different distance metrics. The clustering algorithms are implemented in Java. The experiments were conducted on a Pentium IV PC (3.06 GHz CPU and 1.97 GB RAM) with the selected parameter values. In this study, the quality of the clustering algorithms is measured by fitness function values. Tables 4 to 8 present a comparison among the results of the different clustering algorithms on the selected data sets in terms of fitness function values.
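The experimental setup described above can be illustrated with a short Python sketch of the linearly decreasing inertia weight and of the artificial data set. The function names, the random seed and the convention that the iteration counter runs from 0 to the maximum number of iterations are assumptions. The authors' own data set was generated separately and will differ in its individual values.

```python
import random


def inertia_weight(iteration, max_iterations=100, w_max=0.9, w_min=0.4):
    """Inertia weight that decreases linearly from w_max to w_min."""
    return w_max - (w_max - w_min) * iteration / max_iterations


def make_artificial_data(samples_per_class=50, n_features=3, seed=0):
    """Draw every feature of a class from that class's uniform range."""
    rng = random.Random(seed)
    ranges = [(80, 100), (60, 80), (40, 60), (20, 40), (1, 20)]  # classes 1 to 5
    data, labels = [], []
    for label, (low, high) in enumerate(ranges, start=1):
        for _ in range(samples_per_class):
            data.append([rng.uniform(low, high) for _ in range(n_features)])
            labels.append(label)
    return data, labels


data, labels = make_artificial_data()
print(len(data), len(data[0]))          # 250 3
print(round(inertia_weight(50), 3))     # 0.65
```

A large inertia weight early in the run favors exploration of the search space, while the smaller values toward the end favor fine adjustment around the best solutions found.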

Fitness function values:
The distance between each data point within a cluster and the center of that cluster is computed and added up. It is calculated by using Eq. (12):

$$F = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, c_j) \qquad (12)$$

where $C_j$ denotes the set of data points in the j-th cluster and $k$ is the number of clusters.

Li et al. (2013) proposed the K-means clustering algorithm based on Chaos Particle Swarm (CPSOKM). The proposed algorithm solves the problem of the K-means algorithm and optimizes the clustering result. Sethi and Mishra (2013) developed a linear Principal Component Analysis (PCA) based hybrid K-means clustering and Particle Swarm Optimization (PSO) algorithm (PCA-K-PSO). The algorithm uses the global searching ability of PSO and the fast convergence of the K-means algorithm. Aghdasi et al. (2014) proposed a K-harmonic data clustering algorithm using a combination of PSO and Tabu Search.

Basic concepts: In this section, the mathematical model of the clustering problem, evolutionary algorithms and distance metrics are discussed.
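Input: Data set $X = \{x_1, x_2, \ldots, x_N\}$, a set of data points; select the number of cluster centers $1 < c < N$; initialize the random cluster centers selected from the data set.
Step 1a): Select an initial center $z_1$ uniformly at random from the data set X
Step 1b): While $|Z| < c$ do: choose the next center $z_i$ randomly from X, where every $x \in X$ has a probability of:

$$\frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$

where $Z$ is the set of centers selected so far and $D(x)$ is the distance from x to the closest center already selected.

In Eq. (12), $d(x_i, c_j)$ is the distance between the data point $x_i$ and the cluster center $c_j$. The minimum function value indicates the higher quality of clustering.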
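The K-means++ seeding of Steps 1a and 1b can be sketched in Python as follows. The sketch follows the sampling rule of Arthur and Vassilvitskii (2007), in which a point is chosen with probability proportional to its squared distance to the nearest center already selected. Using the selected distance metric (City Block here) in place of the Euclidean norm, and the function names, are assumptions. The data set is assumed to contain at least `c` distinct points.

```python
import random


def city_block(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def kmeans_pp_seeds(data, c, distance=city_block, seed=0):
    """Pick c initial centers with distance-squared weighted sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(data)]                 # Step 1a: first center uniformly at random
    while len(centers) < c:                      # Step 1b
        # Squared distance of every point to its closest selected center.
        weights = [min(distance(x, z) for z in centers) ** 2 for x in data]
        centers.append(rng.choices(data, weights=weights, k=1)[0])
    return centers


data = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
seeds = kmeans_pp_seeds(data, c=3)
print(sorted(seeds) == sorted(data))  # True
```

A point that is already a center has weight zero and cannot be drawn again, which is why the three seeds in the example are exactly the three distinct data points.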

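Table 1: Description of PSO parameters

The particles move in a D-dimensional search space. The positions and velocities are adjusted and the fitness function is computed with the new coordinates at each time step. The velocity and position of a particle are modified in each iteration, based upon its own pbest and gbest. The velocity is updated using Eq. (4):

$$V_i(t+1) = \omega V_i(t) + c_1 r_1 \left( pbest_i - X_i(t) \right) + c_2 r_2 \left( gbest - X_i(t) \right) \qquad (4)$$

where $V_i(t)$ and $X_i(t)$ denote the velocity and position of the i-th particle at iteration $t$, $\omega$ is the inertia weight, $c_1$ and $c_2$ are the cognitive and social components, and $r_1$ and $r_2$ are random numbers uniformly distributed in [0, 1].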

Table 2: List of different distance metrics

Table 3: Characteristics of selected data sets

Tables 4 to 8 show that the proposed algorithm has the minimum function values 1494.048, 2184.582 and 1211.850 on the teaching assistant evaluation data set; 1930.333, 2925.505 and 1622.335 on the thyroid data set; 312.159, 543.589 and 257.987 on the seeds data set; 2966.431, 6454.468 and 1880.628 on the breast cancer data set; 2290.905, 3535.108 and 1813.231 on artificially