Privacy Preserving Multiview Point Based BAT Clustering Algorithm and Graph Kernel Method for Data Disambiguation on Horizontally Partitioned Data



INTRODUCTION
Data mining extracts implicit and previously unknown information from data. It has attracted wide attention because huge amounts of data are available and need to be converted into useful information. Naturally, this raises privacy concerns about the collected data. In response, data mining researchers have started to address these concerns by developing special data mining techniques under the framework of "privacy preserving data mining". In contrast to regular data mining techniques, privacy preserving data mining can be applied to databases without violating the privacy of individuals. Privacy preserving techniques for many data mining models have been proposed in the past 5 years, for example techniques for privacy preserving association rule mining in distributed environments (Kantarcioglu and Clifton, 2004).
Privacy violations that arise through the mining process can pose real privacy issues, because gathering data and bringing them together to support data mining makes misuse easier. In other words, the problem is not the data mining results but the process that generates them. If the results were generated without sharing information and could not be used to deduce private information, data mining would not reduce privacy (Vaidya and Clifton, 2003). Although obtaining globally meaningful results without sharing information seems impossible, several solutions have been proposed, including earlier methods for horizontally partitioned data. Lindell and Pinkas (2000) proposed ID3 classification for two parties with horizontally partitioned data, using secure protocols to achieve complete zero knowledge leakage. Clifton et al. (2003) presented four efficient methods, namely secure sum, secure set union, secure size of set intersection and scalar product, for privacy preserving data mining in distributed environments. Kantarcioglu and Clifton (2004) addressed privacy preserving mining of association rules when the data is partitioned horizontally; their algorithm uses three basic ideas: randomization, encryption of site results and secure computation. Verykios et al. (2004) presented the state of the art in privacy preserving data mining and discussed classifications of privacy preserving techniques and algorithms, such as heuristic-based, cryptography-based and reconstruction-based techniques. Elisa et al. (2005) proposed a framework for evaluating privacy preserving data mining algorithms, with which the different features of such algorithms can be assessed according to different evaluation criteria.
Clustering is widely used in many applications such as customer behavior analysis and targeted marketing. Recently, privacy preserving clustering problems have also been studied by many authors. Existing privacy-preserving protocols are based on the k-means algorithm and fuzzy c-means clustering and do not reveal intermediate candidate cluster centers. These existing solutions can be made more secure, but only at the cost of a high communication complexity. Data holders need to send their data to the third party while keeping the data private, a problem that existing work does not solve. Moreover, none of the existing work performs multiview point based clustering on anonymized data, and the data disambiguation problem is not solved by these methods either. At the same time, clustering still requires more robust dissimilarity or similarity measures; recent works such as Lee and Lee (2010) illustrate this need.
The similarity measure plays a very important role in the success or failure of a clustering method. Our first objective is to derive a novel method for measuring similarity between data objects in sparse and high-dimensional domains, particularly for anonymized data. From the proposed similarity measure, new clustering criterion functions are then formulated. The overall objective of our work is to develop a privacy preserving multiview point based clustering method for horizontally partitioned data held by two parties. In the distributed architecture, a number of data holders are connected to a single third party that knows the multiview clustering procedure. Before multiview point clustering is performed on the horizontally partitioned data, the data disambiguation and anonymization problems are solved for each data holder. The data disambiguation problem is solved by using RGSGK. The data is then anonymized, by encrypting the original data with a secure key before clustering, in order to achieve privacy by using MDHKEA. To allow parties to obtain the final results without revealing intermediate candidate cluster centers, RBFHE methods are proposed for secure computation. The proposed privacy preserving MVBAT-means clustering for horizontally partitioned data is performed between two parties, so that neither party can learn extra information about the other.

LITERATURE REVIEW
DBSCAN (Liu et al., 2012) is a well-known density-based clustering algorithm which offers advantages for finding clusters of arbitrary shapes compared to partitioning and hierarchical clustering methods. However, few papers study the DBSCAN algorithm under the privacy preserving distributed data mining model, in which the data is distributed between two or more parties and the parties cooperate to obtain the clustering results without revealing the data held by the individual parties. Liu et al. (2012) address the problem of two-party privacy preserving DBSCAN clustering: they first propose two protocols for privacy preserving DBSCAN clustering over horizontally and vertically partitioned data respectively and then extend them to arbitrarily partitioned data. İnan et al. (2007) propose methods for constructing the dissimilarity matrix of objects from different sites in a privacy preserving manner, which can be used for privacy preserving clustering as well as database joins, record linkage and other operations that require pairwise comparison of individual private data objects horizontally distributed over multiple sites. They show the communication and computation complexity of their protocol by conducting experiments over synthetically generated and real datasets.
Jeckmans et al. (2012) propose a privacy-preserving collaborative filtering algorithm which allows one company to generate recommendations based on its own customer data and the customer data from other companies. The security property is based on rigorous cryptographic techniques and guarantees that no company will leak its customer data to others. In practice, such a guarantee not only protects companies' business incentives but also makes the operation compliant with privacy regulations. Mangasarian (2012) proposes a simple privacy-preserving reformulation of a linear program whose equality constraint matrix is partitioned into groups of rows. Each group of matrix rows and its corresponding right hand side vector are owned by a distinct private entity that is unwilling to share or make public its row group or right hand side vector. By multiplying each privately held constraint group by an appropriately generated and privately held random matrix, the original linear program is transformed into an equivalent one that does not reveal any of the privately held data or make it public. The solution vector of the transformed secure linear program is publicly generated and is available to all entities. Bunn and Ostrovsky (2007) present a two-party k-means clustering protocol that guarantees privacy and is more efficient than utilizing a general multiparty "compiler" to achieve the same task. In particular, a main contribution of their result is a way to compute efficiently multiple iterations of k-means clustering without revealing the intermediate values. To achieve this, they use novel techniques to perform two-party division and to sample uniformly at random from an unknown domain size.
Xue et al. (2009) propose a privacy preserving hierarchical k-means clustering algorithm on horizontally partitioned data, denoted HPPHKC. The algorithm has two phases. In the first phase, every object is initially treated as a cluster; a secure computation protocol is used to compute the dissimilarity matrix and the most similar clusters are merged. This process is repeated until the assigned number of clusters k is reached, which yields k clustering centers. In the second phase, the semi-honest third party and all parties holding data use the k-means algorithm to refine the results of the first phase and obtain the final clustering results.
All of the above clustering methods have to assume some cluster relationship among the data objects to which they are applied. Similarity between a pair of objects can be defined either explicitly or implicitly. Traditional dissimilarity/similarity measures perform clustering from a single viewpoint, which reduces the clustering accuracy for anonymized data. To overcome this problem, the proposed work uses multiple viewpoints, namely objects assumed not to be in the same cluster as the two objects being measured, so that a more informative assessment of similarity can be achieved. Theoretical analysis and empirical study are conducted to support this claim.

PROPOSED METHODOLOGY
In this study a novel horizontal partitioning approach for multiview point clustering of anonymized data is proposed. Before multiview point clustering is performed, it is important to secure the data, and hence the original data is anonymized with the help of an encryption technique. The encryption relies on a secure key, which is obtained with the key generation algorithm of Ring-Based Fully Homomorphic Encryption (RBFHE). In this study, data ambiguation occurs by blanking certain fields in the data table in such a way that no entry (row) in the table is unique. This makes it impossible to uniquely identify an entry by linking to another data table, since in an ambiguated table at least two rows will match any linking operation. Because the same field values then occur in several rows of the table, it is also critical to solve the data disambiguation problem. After anonymization, the data is conventionally grouped from a single viewpoint, that is, by measuring inter-cluster similarity and dissimilarity only. However, measuring intra-cluster similarity is also important for the clustering process. To solve these problems, the data table is formally converted into a graph for the Ramon-Gartner subtree graph kernel method, which then finds the repeated attributes that have the same attribute value in the table. The data disambiguation problem is thus solved by using the Ramon-Gartner subtree kernel. Multiview point based similarity measurement is then performed for the data points of the anonymized data. In this paper, privacy of the clusters is attained by the following three steps:
• Solve the data disambiguation problem by the Ramon-Gartner Subtree Graph Kernel (RGSGK) method.
• Anonymize the original data with the secure key, by using Ring-Based Fully Homomorphic Encryption (RBFHE).
• Cluster the anonymized data with the multiview point based BAT clustering algorithm, named MVBAT Clustering.

Fig. 1: Distributed architecture of data holders and third party
The proposed clustering method is used to cluster the anonymized data in a multiview point manner.
To perform this process, the problem first needs to be defined formally, with details on the trust levels of the involved parties and the amount of preliminary information that must be known by each one. There are k data holders, with k ≥ 2, each of which owns a horizontal partition of the data matrix D, denoted $D_k$. Table 1 illustrates the data matrix of each data holder, in which ambiguous data is present; after the ambiguous data is found in the table by the RGSGK method, the data is arranged in order of highest attribute value. For Table 1, let G = G(V, E, L) be an undirected graph where V is a set of vertices, E a set of edges and L the labels of the nodes. To address the data disambiguation problem, the following set of constraints between the different data attributes is represented in the graph (Bach, 2008).
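As an illustration of the setting, the following minimal Python sketch splits a data matrix row-wise among k data holders. The function name `horizontal_partition`, the even contiguous split and the small example matrix are assumptions made for illustration only; the paper does not prescribe how rows are assigned to holders.

```python
from typing import List, Sequence

Row = Sequence[float]

def horizontal_partition(data: List[Row], k: int) -> List[List[Row]]:
    """Split the rows of `data` into k near-equal contiguous blocks.

    Every block keeps all attributes (columns); only the objects (rows)
    are divided among the holders.
    """
    if k < 2:
        raise ValueError("the setting assumes at least two data holders")
    base, extra = divmod(len(data), k)
    parts, start = [], 0
    for holder in range(k):
        # The first `extra` holders receive one additional row.
        size = base + (1 if holder < extra else 0)
        parts.append(data[start:start + size])
        start += size
    return parts

# Hypothetical matrix with 5 objects and 3 attributes, shared by two holders.
D = [[63, 1, 145], [67, 1, 160], [41, 0, 130], [56, 1, 120], [57, 0, 120]]
D_1, D_2 = horizontal_partition(D, 2)
```

Each holder sees the full attribute list but only its own objects, which is why the holders must agree on the attributes before clustering.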

Ramon-Gartner Subtree Graph Kernel (RGSGK) method for data disambiguation:
The neighbourhood N(v) of a node v is the set of nodes to which v is connected by an edge, that is, $N(v) = \{v' \mid (v, v') \in E\}$. For simplicity, assume that every graph has n nodes, m edges and a maximum degree of d, and that there are N graphs in the given set of graphs. Let S(G) refer to the set of subtree patterns in the graph; in this study there are two types of graphs, based on the heart patient type. The first subtree kernel on graphs was defined by Ramon and Gartner (2003). It compares pairs of nodes from different patterns in the graphs $p_1 = (V, E, L)$ and $p_2 = (V', E', L')$ by iteratively comparing their neighbourhoods, where V represents the vertices of the graph of the current data sample and V′ represents the vertices of the graph (Fig. 2) of the remaining samples. In this comparison, α is a randomly assigned weight value for equal attribute values: for the same attribute with the same value, h = 1, and if the attribute values are different then h is greater than one. The set of relations between two records $x_n$ and $x_m$, whether direct or indirect, is represented as $r_{nm}$, and $\lambda_k$ is the weight applied to the attribute value if both records belong to the same attribute with different values. The data table is converted into a graph in the following manner.
Here, the variable weight value is applied only to two different rows that hold the same attribute values, for which h = 1. The weight α is assigned to such a value, and if the value is higher than in all the remaining records, a higher weight is also assigned to this attribute value for the same heart patients. The disambiguation problem is therefore applicable to rows 1, 3 and 5 in Table 1. In this example, records 1 and 3 have the same values for all attributes, so ambiguity occurs. The final weight value is applied to record 1 as:

$k_h(70, 70) = 0.9 \times 70 = 63$ (3)

The same weighting is applied to all attributes of record 1, and the table values are converted based on the calculated values. It is likewise applied to all attributes of record 3, which falls in the second part of the table. The disambiguation problem is thus solved by calculating the weight values, and the original table values are changed as well, which improves privacy because the original values are replaced by unknown values. Depending on the attribute value, a higher weight is multiplied into individual data in Table 1. The RGSGK task determines the best disambiguation result for the vertices, given a set of conditions. In this case, the conditions are the edge weights, which represent how strong the involved constraints are. Positive weights indicate that both adjacent vertices should have the same attribute value in the records of Table 1. The data matrix after RGSGK is given in Table 2.
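The weighting step of Eq. (3) can be sketched in Python as follows. This is a simplified illustration, not the RGSGK kernel itself: it only detects records whose attribute values all coincide and scales one record by a weight α. The helper names `find_ambiguous_groups` and `apply_weight` and the example records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[float, ...]

def find_ambiguous_groups(records: List[Record]) -> List[List[int]]:
    """Return index groups of records whose attribute values are all identical."""
    groups: Dict[Record, List[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[tuple(rec)].append(idx)
    # Only groups with at least two members are ambiguous.
    return [idxs for idxs in groups.values() if len(idxs) > 1]

def apply_weight(record: Record, alpha: float) -> Record:
    """Scale every attribute value by alpha, as in k_h(70, 70) = 0.9 * 70."""
    return tuple(alpha * value for value in record)

# Hypothetical records: the first and third rows coincide in every attribute.
records = [(70, 1, 130), (65, 0, 120), (70, 1, 130), (58, 1, 140)]
groups = find_ambiguous_groups(records)
weighted = apply_weight(records[groups[0][0]], alpha=0.9)
```

After weighting, the first record no longer coincides with the third one, which is the effect the worked example in Eq. (3) describes for a single attribute value.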
Each data holder needs to cluster its data. The clustering algorithm is available at the third party, but the third party and the other data holders are semi-trusted. If the data were sent directly to the third party, the whole data could become known to all other data holders and to the third party. Hence it is necessary to anonymize the original data before sending it to the third party. Here, the RBFHE encryption process is applied to the entire original data with a secure key to anonymize it. The secret key is an important aspect of achieving data privacy, and it is obtained with the RBFHE.keygen($d, q, t, \chi_{key}, \chi_{err}, w$) algorithm. The third party's duty in the protocol is to govern the communication between data holders, construct the dissimilarity matrix and publish clustering results to the data holders.

Ring-Based Fully Homomorphic Encryption (RBFHE) for data anonymization:
The proposed RBFHE key construction is carried out both at the third party and at the data holders, since both are semi-trusted. The third party generates one public key for all data holders and sends it over the network publicly, hiding some important values. Every data holder calculates a new private key from the public value received from the third party. The generation of the secure key has two main steps:
• Public key generation at the third party
• Secret key generation at the data holder
The entire procedure of the proposed RBFHE encryption scheme for anonymizing the data is specified in detail as follows. In this scheme a ciphertext consists of only a single ring element, as opposed to the two or more ring elements needed by schemes based purely on (ring) learning with errors. The scheme is scale-invariant and therefore avoids modulus switching. The data entries of each data holder are the samples $dms_i$, and $\varphi(rn, hk)$ is the key value for each data holder. The most important structure is the ring R. To perform the encryption and decryption process for the anonymized data, some parameters are defined. Let d be a positive integer and define

$R = \mathbb{Z}[dms_i] / (\Phi_d(dms_i))$

as the ring of polynomials with integer coefficients modulo the d-th cyclotomic polynomial $\Phi_d$, whose degree is $rn = \varphi(d)$, where $\varphi$ is Euler's totient function. The elements of R can be uniquely represented by all polynomials in $\mathbb{Z}[dms_i]$ of degree less than rn. Arithmetic in R is arithmetic modulo $\Phi_d(dms_i)$, which is implicit whenever terms or equalities involving elements of R are written down. An arbitrary element of R built from the data holder data matrix samples $dms_i$ is written as

$a = \sum_{i=0}^{rn-1} a_i \, dms^{i}$

where $a_i \in \mathbb{Z}$. The element a is identified with its vector of coefficients of the attributes in the data holder data matrix samples, and the maximum norm on $\mathbb{R}^{rn}$ is chosen to measure the size of elements in R.
When multiplying two elements g, hk ∈ R, the norm of their product g · hk expands with respect to the individual norms of g and hk. The maximal norm expansion that can occur over all g, hk ∈ R is a ring constant. Let χ be a probability distribution on R that samples small elements a ← χ with high probability. The distribution χ on R is called B-bounded for some B > 0 if for all a ← χ we have $\|a\|_\infty < B$. First, define the discrete Gaussian distribution $D_{\mathbb{Z},\sigma}$ with mean 0 and standard deviation σ over the integers, which assigns a probability proportional to $\exp(-\pi |r|^2 / \sigma^2)$ to each value $r \in \mathbb{Z}$ (here, an entry of the data holder data matrix samples $dms_i$). When d is a power of 2 and $\Phi_d(dms_i) = dms_i^{rn} + 1$, χ is taken to be a spherical discrete Gaussian probability distribution $\chi = D_{\mathbb{Z}^{rn},\sigma}$, where each coefficient of $dms_i$ is sampled according to the one-dimensional distribution $D_{\mathbb{Z},\sigma}$. This distribution is used in many ring-based fully homomorphic encryption schemes.
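To make the norm expansion concrete, the following Python sketch multiplies two ring elements for the power-of-two case mentioned above, where reduction modulo $x^n + 1$ means replacing $x^n$ by $-1$. The function names `ring_mul` and `inf_norm` are illustrative and not part of the scheme. In this case every product coefficient is a signed sum of n terms, so the maximum norm of the product is at most n times the product of the individual norms.

```python
from typing import List

def ring_mul(g: List[int], h: List[int]) -> List[int]:
    """Multiply g and h in Z[x]/(x^n + 1), with n = len(g) = len(h).

    Coefficient lists are ordered from the constant term upwards.
    """
    n = len(g)
    if len(h) != n:
        raise ValueError("both operands must have n coefficients")
    out = [0] * n
    for i, gi in enumerate(g):
        for j, hj in enumerate(h):
            k = i + j
            if k < n:
                out[k] += gi * hj
            else:
                # x^n = -1 in this ring, so the term wraps with a sign flip.
                out[k - n] -= gi * hj
    return out

def inf_norm(a: List[int]) -> int:
    """Maximum norm: the largest absolute coefficient."""
    return max(abs(c) for c in a)

g = [1, 1, 1, 1]
product = ring_mul(g, g)
expansion = inf_norm(product) / (inf_norm(g) * inf_norm(g))
```

For this choice of g the expansion reaches n = 4, which shows that the bound for the power-of-two case is attained.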
The public key encryption scheme is parameterized by a modulus q and a plaintext modulus t with $1 < t < q$. The secret key S of each data holder attribute value is derived from the key distribution $\chi_{key}$, and errors are sampled from the distribution $\chi_{err}$. The basic encryption and decryption steps for the data holder data matrix samples are defined below.
Given two ciphertexts $C_1, C_2 \in R$ which encrypt two messages $m_1, m_2$ with inherent noise terms $v_1, v_2$, their sum modulo q, $C_{add} = [C_1 + C_2]_q$, encrypts the sum of the messages modulo t, $[m_1 + m_2]_t$. Rewriting this sum in terms of the noise terms shows that the size of the inherent noise $v_{add}$ of $C_{add}$ is bounded.

The homomorphic multiplication operation is divided into two parts. The first part describes a basic procedure to obtain an intermediate ciphertext that encrypts the product $[m_1 m_2]_t$ modulo t of two messages $m_1$ and $m_2$. The second part performs a procedure which allows a public transformation of this intermediate ciphertext into a ciphertext that can be decrypted. This latter procedure was introduced by Brakerski and Vaikuntanathan (2011) in the form of relinearization and was later expanded by Brakerski et al. (2012) into a method called key switching, which transforms a ciphertext decryptable under one secret key into one decryptable under any other secret key. For our analysis, $\chi_{key}$ and $\chi_{err}$ are assumed to be bounded distributions. RBFHE.multi($C_1, C_2, evk$) computes the intermediate ciphertext $c_{multi}$. The second part of the homomorphic multiplication procedure is a key switching step, which transforms the ciphertext $c_{multi}$ into a ciphertext C that is decryptable under the original secret key. This step uses the evaluation key evk output by RBFHE.Keygen, where $e, s \leftarrow \chi_{err}^{\ell_{w,q}}$ are vectors of polynomials sampled from the error distribution $\chi_{err}$ and $[\cdot]_q$ is applied to each coefficient of the vector; evk is made public because it is needed for the homomorphic multiplication operation.

Every data holder and the third party must have access to the comparison functions so that they can compute the distance/dissimilarity between objects for clustering the anonymized data. Data holders are supposed to have agreed beforehand on the list of attributes that are going to be used for clustering. This attribute list is also shared with the third party (TP) so that it can run appropriate comparison functions for different data types. At the end of the protocol, the third party will have constructed the dissimilarity matrices for each attribute separately.

Multiview point based BAT clustering for anonymized data:
The third party receives only the ciphertext from all the data holders in the network. To perform MVBAT clustering on the ciphertext of all data holders, some of the echolocation characteristics of microbats are idealized to develop the bat algorithm. For simplicity, the following approximate rules are used to perform the multiview point based clustering method:
• All ciphertext bats of the data holders use echolocation to sense distance, and they also 'know' the difference between food/prey and background barriers in some magical way.
In the update of the frequency, velocity and position of each bat, β ∈ [0, 1] is a random vector drawn from a uniform distribution, and $x_*$ is the current best multiview point based cluster result, which is located after comparing all the solutions among the bats. As the product $\lambda_i f_i$ is the velocity increment, either $f_i$ or $\lambda_i$ can be used, depending on the type of problem of interest. For the local clustering step of multiview point based clustering of anonymized data, once the best cluster is found, a new solution for each bat is generated locally using a random walk, where ϵ ∈ [−1, 1] is a random number and $A^t = \langle A_i^{bt} \rangle$ is the average loudness of all the bats at this time step.
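The movement rules can be illustrated with the following Python sketch. It assumes the update equations of the standard bat algorithm (frequency $f_i = f_{min} + (f_{max} - f_{min})\beta$, velocity $v_i \leftarrow v_i + (x_i - x_*) f_i$, position $x_i \leftarrow x_i + v_i$, and local walk $x_{new} = x_* + \epsilon A^t$), since the corresponding equations are not reproduced in this text. The names `bat_move` and `local_walk`, the frequency bounds and the use of a scalar β per bat are assumptions made for illustration.

```python
import random
from typing import List, Tuple

def bat_move(x: List[float], v: List[float], x_best: List[float],
             f_min: float, f_max: float,
             rng: random.Random) -> Tuple[List[float], List[float], float]:
    """One global move of a single bat (standard bat algorithm, assumed)."""
    beta = rng.random()                       # beta drawn uniformly from [0, 1)
    f = f_min + (f_max - f_min) * beta        # pulse frequency of this bat
    v_new = [vi + (xi - bi) * f for vi, xi, bi in zip(v, x, x_best)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new, f

def local_walk(x_best: List[float], mean_loudness: float,
               rng: random.Random) -> List[float]:
    """Random walk around the current best solution, scaled by the mean loudness."""
    return [xi + rng.uniform(-1.0, 1.0) * mean_loudness for xi in x_best]

rng = random.Random(0)
x_new, v_new, f = bat_move([1.0, 2.0], [0.0, 0.0], [1.0, 2.0], 0.0, 2.0, rng)
candidate = local_walk([1.0, 2.0], 0.5, rng)
```

A bat that already sits on the best solution with zero velocity stays in place under the global move; only the local walk perturbs it, by at most the mean loudness in each coordinate.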
Loudness and pulse emission: Furthermore, the loudness $A_i$ and the rate $r_i$ of pulse emission have to be updated as the iterations proceed, and two criterion functions $I_o$ and $I$ are used to measure the similarity between the ciphertexts of two bats. As the pulse emission increases, the similarity value for forming a cluster of anonymized data becomes higher; the loudness can be chosen as any value of convenience. For simplicity, A = 1 and $A_{min} = 0$ can also be used. In the update rules, ρ and γ are constants. In fact, ρ is similar to the cooling factor, and the rules hold for any 0 < ρ < 1 and γ > 0. In the simplest case ρ = γ can be used, and ρ = γ = 0.9 has been used in our simulations. The initial emission rate $r_i$ can be determined using the two criterion functions $I_o$ and $I$, based on the calculation of the distance matrix and dissimilarity matrix.
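A minimal Python sketch of the loudness and pulse rate schedule follows. It assumes the update rules of the standard bat algorithm, $A_i^{t+1} = \rho A_i^t$ and $r_i^{t+1} = r_i^0 [1 - \exp(-\gamma t)]$, with ρ = γ = 0.9 as in the simulations; the function names and the initial rate of 0.5 are illustrative assumptions, and the paper derives the initial rate from its criterion functions instead.

```python
import math

def update_loudness(loudness: float, rho: float = 0.9) -> float:
    """Geometric decay of the loudness, with 0 < rho < 1 (assumed rule)."""
    return rho * loudness

def update_pulse_rate(initial_rate: float, t: int, gamma: float = 0.9) -> float:
    """Pulse emission rate at step t, rising towards initial_rate (assumed rule)."""
    return initial_rate * (1.0 - math.exp(-gamma * t))

# Trace both quantities over a few iterations.
loudness = 1.0
rates = []
for t in range(1, 51):
    loudness = update_loudness(loudness)
    rates.append(update_pulse_rate(0.5, t))
```

Under these rules the loudness tends to zero while the pulse rate approaches its initial value, so late iterations accept fewer but more focused local moves.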

Distance matrix:
The distance matrix is used to find the distance between all data points and the selected cluster centroids. This distance matrix helps the third party to calculate the similarity matrix and the dissimilarity matrix easily. With the help of the distance matrix, the similarity matrix can be found. The similarity matrix is an [n * cn] matrix, where n is the number of data points and cn is the number of selected cluster centroids. The matrix consists of the similarity value of each data point with respect to each cluster centroid. Eq. (27) is used to find the similarity value $S_{xy}$ of data points with each cluster, where x denotes the data point and y denotes the cluster centroid. Based on the distance from the cluster centroid to the data points, the third party calculates the similarity value of each data point. The similarity value of a data point states how close the data point is to the corresponding cluster centroid. The data point moves to the cluster centroid for which it has the highest similarity value.
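The following Python sketch builds an [n * cn] distance matrix and assigns each point to a centroid. It assumes Euclidean distance and treats the closest centroid as the one with the highest similarity, since Eq. (27) is not reproduced in this text; the function names and sample points are hypothetical, and the sketch operates on plain numbers rather than ciphertexts.

```python
import math
from typing import List, Sequence

Point = Sequence[float]

def distance_matrix(points: List[Point], centroids: List[Point]) -> List[List[float]]:
    """Return the n x cn matrix of Euclidean distances (assumed metric)."""
    return [[math.dist(p, c) for c in centroids] for p in points]

def assign_to_centroids(dist: List[List[float]]) -> List[int]:
    """Assign each point to its closest centroid.

    Assumption: similarity decreases as distance grows, so the closest
    centroid is the one with the highest similarity value.
    """
    return [min(range(len(row)), key=row.__getitem__) for row in dist]

points = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
dist = distance_matrix(points, centroids)
labels = assign_to_centroids(dist)
```

Any similarity that decreases monotonically with distance would give the same assignment, which is why the sketch does not commit to a particular similarity formula.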

Dissimilarity matrix:
The dissimilarity matrix is also an [n * cn] matrix, which consists of the dissimilarity value of each data point with respect to each cluster centroid. The dissimilarity value describes how far the data point has to move away from the cluster centroid. Eq. (28) is used to find the dissimilarity value of data points with each cluster, where i corresponds to the data point and j corresponds to the cluster centroid. The maximum distance of each cluster is found and the value of the data point is subtracted from it; the result is the dissimilarity value of the data point. With the help of the dissimilarity value, the third party can calculate the dissimilarity matrix by Eq. (29). The loudness and emission rates of the bats are updated only if the new solutions are improved, which means that these bats are moving towards the optimal solution. In the final form of our criterion function $I_o$, $Dss_{ij}^{r}$ denotes the dissimilarity matrix value based on the heart rate value, $S_{xy}$ represents the similarity matrix, b denotes the bat (ciphertext value of the data holders) and $b_r$ denotes the ciphertext value of the data holder along with the heart patient type. A second criterion function I is defined to perform the clustering process.

Proposed MVBAT clustering:
Objective function f(x), x = (x_1, ..., x_d)
Initialize the bat population x_i (i = 1, 2, ..., n) and v_i, assigning values from a data matrix from RBFHE
Define pulse frequency f_i at x_i
Initialize pulse rates r_i with the two criterion functions I_o and I from (30) and (31), and the loudness A_i
while (t < Max number of iterations)
Generate new solutions by adjusting frequency and updating velocities and locations/solutions, Eq. (20) to (23)
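The dissimilarity computation described in this subsection can be sketched in Python as follows. This is one literal reading of the text, in which the dissimilarity of point i with respect to centroid j is the maximum distance observed for centroid j minus the distance of point i; Eq. (28) and (29) are not reproduced here, so the exact form is an assumption and the function name is illustrative.

```python
from typing import List

def dissimilarity_matrix(dist: List[List[float]]) -> List[List[float]]:
    """Return an n x cn matrix with entries max_i dist[i][j] - dist[i][j].

    `dist` is an n x cn distance matrix (points in rows, centroids in columns).
    """
    n_centroids = len(dist[0])
    # Largest distance observed for each centroid (column-wise maximum).
    col_max = [max(row[j] for row in dist) for j in range(n_centroids)]
    return [[col_max[j] - row[j] for j in range(n_centroids)] for row in dist]

dist = [[0.0, 4.0], [3.0, 1.0]]
dss = dissimilarity_matrix(dist)
```

Under this reading every entry is non-negative, and the point farthest from a centroid receives the value zero for that centroid.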

EXPERIMENTAL RESULTS
The experiments for evaluating the performance of the proposed MVBAT Clustering on horizontally partitioned data are explained and discussed in detail. The proposed MVBAT Clustering differs from existing clustering methods in that clustering is performed on anonymized data from multiple viewpoints, whereas earlier work focuses on single viewpoint based clustering. It therefore produces less information loss than conventional methods, and each attribute value is encrypted by RBFHE. To measure the clustering results, the following performance evaluation metrics are used: communication cost, running time, information loss, utility, privacy loss and clustering accuracy. Our experiments use two data sets from the UCI Machine Learning Repository (Frank and Asuncion, 2010), namely the Adult Dataset and the Housing Data Set. The Housing Data Set is described in Table 4; it contains 14 attributes in total. The datasets are divided into datasets of size 2, 4, 6, 8 and 10 K, respectively, where K represents thousands.
In our experiments, the RBFHE cipher is used to generate the private keys with which users hide the data holders' inputs. The secret key of the two different third parties is shared between data holders, and the resulting ciphertext is used as anonymized data for the multiview point based clustering process. For the next encryption process, the ciphertext generated in the previous step is used as the message (plaintext) to be encrypted, which yields the next random number as a result. The communication cost is compared among the Advanced Encryption Standard (AES) (İnan et al., 2007), the Diffie Hellman Key Exchange Algorithm (DHKEA), the Modified Diffie Hellman Key Exchange Algorithm (MDHKEA) and the proposed RBFHE.
Figure 3 compares the communication cost of the proposed RBFHE with that of the existing methods MDHKEA, DHKEA and AES as the number of pair-wise entity comparisons increases. The Adult dataset containing 10 K entities is evenly distributed among data holders in these tests. The communication cost of the proposed RBFHE increases dramatically in our system due to secure comparison, and the communication cost of the existing methods is negligible compared to RBFHE.
Figure 4 compares the communication complexity of the proposed RBFHE with that of the existing methods MDHKEA, DHKEA and AES as the number of pair-wise entity comparisons increases. The House data set containing 10 K entities is evenly distributed among data holders in these tests. For the same reason, the communication cost of the proposed RBFHE increases dramatically in our system due to secure comparison, and the communication cost of the existing methods is negligible compared to the proposed RBFHE.
Figure 5 shows the running time of the clustering algorithms K-Means, Fuzzy C Means (FCM), Gaussian Firefly Algorithm (GFA) and MVBAT Clustering. The proposed MVBAT Clustering differs in the way it clusters: it uses the similarity matrix and the dissimilarity matrix, and the two criterion functions $I_o$ and $I$ serve as the objective value for multiview point based clustering. The running time of MVBAT does not exceed that of the existing K-means, FCM and GFA algorithms when clustering the adult data set.
Figure 6 shows the running time of the clustering algorithms K-Means, FCM, GFA and MVBAT Clustering. The proposed MVBAT Clustering differs in the way it clusters: it uses the similarity matrix and the dissimilarity matrix, and the two criterion functions $I_o$ and $I$ serve as the objective value for multiview point based clustering. The running time of MVBAT does not exceed that of the existing K-means, FCM and GFA algorithms when clustering the house data set.

F-measure:
The F-measure combines precision and recall to quantify how well a clustering matches a reference partitioning. It is a well-accepted and commonly used quality measure for automatically generated document clusterings. Let D represent the set of data matrix samples and let $C = \{C_1, \ldots, C_k\}$ be a clustering of D. Moreover, let $C^* = \{C_1^*, \ldots, C_l^*\}$ designate the reference partitioning.
From these, the recall of cluster j with respect to partition i, rec(i, j), and the precision of cluster j with respect to partition i, prec(i, j), are defined, and the overall F-measure of a clustering C is computed from them. Clustering results are evaluated using the F-measure and the match point between the four raw cluster structures; the results for the adult dataset are shown in Fig. 7. Likewise, the clustering accuracy of the four clustering methods is measured separately using the F-measure and the match point between the three raw cluster structures; the results for the house dataset are shown in Fig. 8. They show that the F-measure of the proposed MVBAT clustering is higher than that of the existing GFA, FCM and K-means clustering methods, since the proposed study additionally performs multiview point based similarity measurement based on the two criterion functions $I_o$ and $I$. Our results also demonstrate the privacy-utility loss trade-off on horizontally partitioned data for the adult and house datasets with the AES, DHKEA and MDHKEA methods and the proposed RBFHE. They show that RBFHE provides substantially better data utility than the existing encryption methods; the results are shown in Fig. 8 and 9.
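For reference, the F-measure can be computed as in the following Python sketch. It assumes the commonly used clustering definitions rec(i, j) = |C_j ∩ C*_i| / |C*_i|, prec(i, j) = |C_j ∩ C*_i| / |C_j|, F(i, j) as the harmonic mean of the two, and an overall score that weights the best F(i, j) of each reference class by its relative size. The formulas are not reproduced in this text, so these definitions, the function name and the toy clusters are assumptions.

```python
from typing import List, Set

def f_measure(clusters: List[Set[int]], reference: List[Set[int]]) -> float:
    """Overall F-measure of `clusters` against the `reference` partitioning."""
    total = sum(len(ref) for ref in reference)
    score = 0.0
    for ref in reference:
        best = 0.0
        for clu in clusters:
            overlap = len(ref & clu)
            if overlap == 0:
                continue
            prec = overlap / len(clu)   # share of the cluster inside the class
            rec = overlap / len(ref)    # share of the class that was recovered
            best = max(best, 2 * prec * rec / (prec + rec))
        # Weight the best match of this class by its relative size.
        score += len(ref) / total * best
    return score

reference = [{1, 2, 3}, {4, 5, 6}]
clusters = [{1, 2}, {3, 4, 5, 6}]
score = f_measure(clusters, reference)
```

A clustering identical to the reference scores 1.0, and every misplaced object lowers the score of the class it was taken from and of the cluster it was added to.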

CONCLUSION AND RECOMMENDATIONS
This study proposes a novel privacy preserving multiview point based BAT clustering method for horizontally partitioned data, in which the data disambiguation problem is solved by using RGSGK and the data is anonymized by using the RBFHE encryption technique with a secure key. RBFHE encloses the data records of the data holders, and key values are generated to perform the anonymization process. For MVBAT clustering, two criterion functions $I_o$ and $I$ have been introduced to measure the similarity and dissimilarity values of data points for the encrypted data samples from RBFHE. These two criterion functions are used as the objective function for the clustering process in the BAT algorithm. The quality of the resulting clusters from MVBAT can be easily measured and conveyed to data owners without any leakage of private information. The proposed MVBAT is compared experimentally with other state-of-the-art clustering methods that use different types of similarity measure, on UCI machine learning datasets (the adult and house datasets) and under different evaluation metrics. The results show that the proposed MVBAT improves the F-measure and requires less running time, since multiview similarity measurement is performed. Future research should examine the possibility of applying our method to clustering of vertically partitioned data and of performing the same work under semi-supervised clustering by considering unlabeled data matrix samples.
Each data holder owns a horizontal partition of the data matrix D, denoted $D_k$. The system consists of k data holders $D_k$ and a single third party TP. Each data holder $D_k$ holds a data matrix $Dm_j$, which consists of a attributes and b objects [r × a]. The distributed architecture is given in Fig. 1.

Fig. 2: Example of the graph
Each attribute in the data matrix is represented as a vertex v ∈ V in the graph, and an edge e ∈ E is added to the graph for every pair of vertices representing attributes which can potentially belong to the same heart patient type, which is either one or two. L is the label set of the graph kernel, which assigns label names to nodes such as Age, Sex, Blood Pressure (BP), Cholesterol, Sugar, Heart rate and Heart patient. The heart patient records which belong to type one are represented as one graph, and the records which belong to type two are represented as another graph. The neighbourhood of a node v is $N(v) = \{v' \mid (v, v') \in E\}$.

Fig. 3: Communication cost vs. methods for adult data set

Fig. 6: Running time vs. clustering methods for house data set

Fig. 9: Privacy loss vs. utility loss comparison for adult dataset

Table 1: Data matrix of the data holder

Table 2: Data matrix of the data holder after RGSGK

Table 3: Description of the adult data set