A Similarity Measure Method for Symbolization Time Series

Similarity measurement is a fundamental task in time series data mining. The LCSS measure has obvious limitations when a linear function must be selected for two time series of different lengths. The ELCS measure is proposed, which normalizes the sequences and introduces a scale factor to limit the search path in the similarity matrix. Experiments with a hierarchical clustering algorithm show that the improved measure makes up for the shortcomings of LCSS and improves the efficiency and accuracy of clustering as well as the time complexity.


INTRODUCTION
Time series have always been an important and interesting research field because they appear frequently in different applications. Time series similarity measurement, proposed by Agrawal et al. (1993), has become a hot research topic because of its wide use in data mining tasks such as time series classification, clustering and anomaly detection. Many methods have been developed for similarity search over time series in large data sets, and the similarity measure is a very important part of the data mining process.
Several similarity measures for time series exist. Faloutsos et al. (1994) proposed a fast subsequence matching method based on the Euclidean distance metric, in which the two time series are treated as two points in a space of the same dimension and a threshold decides whether they are similar. Euclidean distance requires two sequences of equal length and ignores the temporal characteristics of time series, which limits its application in time series similarity measurement. Chung et al. (2004) apply weights in the Euclidean distance and eliminate the transform offset, but the parameters must be set manually. Berndt and Clifford (1994) introduce the Dynamic Time Warping (DTW) distance to time series similarity measurement; it performs well when comparing the local characteristics of two sequences of unequal length, but the algorithm is computationally expensive. In addition, the DTW algorithm cannot find the correspondence between feature points of two time series, such as peaks, low points and inflection points, so its accuracy is low. Some researchers (Yi et al., 1998; Kim et al., 2001) improved DTW by introducing index techniques, which reduce its time complexity. Kim et al. (2001), in an index-based approach for similarity search supporting time warping in large sequence databases, proposed the Segment-wise Time Warping distance (STW), which greatly decreases the time complexity of DTW but also reduces the accuracy of the similarity measure. Latecki et al. (2005) put forward a minimum variance matching method to obtain flexible similarity matching.
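As a point of reference for the cost and matching behaviour discussed above, the following is a minimal sketch of the textbook DTW recurrence, not the implementation used in any of the cited works. The function name `dtw_distance` and the absolute-difference point cost are assumptions made for illustration.

```python
def dtw_distance(x, y):
    """Textbook dynamic time warping distance (illustrative sketch).

    Assumption: the local cost between two points is their absolute difference.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # A point may be matched to several points of the other series,
            # which is the one-to-many correspondence typical of DTW.
            cost[i][j] = d + min(cost[i - 1][j],      # x[i-1] repeats the match
                                 cost[i][j - 1],      # y[j-1] repeats the match
                                 cost[i - 1][j - 1])  # both series advance
    return cost[n][m]


# Sequences of unequal length can be compared directly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The double loop fills an n by m table, which is why the unindexed algorithm is expensive on long sequences.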
In 1994, the Longest Common Subsequence (LCS) measure (Paterson and Dancik, 1994) was applied to time series similarity measurement. Bollobas et al. (1997) put forward LCSS on the basis of LCS, which gives a better similarity measure for time series with amplitude translation, timeline stretching and bending deformation.
Other researchers have proposed slope-based, model-based and event-based similarity measures.
This research studies the similarity measure problem for symbolic time series. First, the study introduces the definitions and the classical similarity measures. Then, we propose a new similarity measure algorithm based on the LCSS algorithm: unlike LCSS, the new algorithm avoids the selection of a linear function, improves the accuracy of measurement and greatly improves time efficiency compared with the DTW measure. Finally, experiments verify the proposed algorithm.

LCS AND LCSS SIMILARITY MEASURE
LCS measure: Let $X, Y \in A$ be time series samples with vector forms $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$. Consider their longest common subsequence, where $l$ is the length of the common subsequence. The similarity between the time series X and Y is defined in terms of $l$.
LCSS measure: The LCS measure avoids the similarity problems caused by short-term mutations or interruptions in the time series. However, it does not give good similarity results for time series with amplitude translation, timeline stretching and bending deformation. The LCSS measure is designed to address these problems. Taking the longest subsequences in X and Y, respectively, that satisfy the matching conditions, the similarity between the time series is defined as formula (1).
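The length $l$ of the longest common subsequence can be computed with the standard dynamic programming table. The sketch below illustrates this for symbolic sequences; the names `lcs_length` and `lcs_similarity` are hypothetical, and dividing $l$ by the longer sequence length is an assumption made here for illustration rather than the normalization stated in the text.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two symbol sequences."""
    n, m = len(x), len(y)
    # table[i][j] = LCS length of the prefixes x[:i] and y[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcs_similarity(x, y):
    """Similarity in [0, 1]; ASSUMPTION: l divided by the longer length."""
    if not x or not y:
        return 0.0
    return lcs_length(x, y) / max(len(x), len(y))


# Two symbolic series over the alphabet {a, b, c, d}.
print(lcs_length("abcbdab", "bdcaba"))
```

Because unmatched symbols are simply skipped, a short burst of noise in one series lowers $l$ only slightly, which matches the robustness to short-term mutation described above.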

EXTENDED LONGEST COMMON SUBSEQUENCE MEASURE (ELCS)
Although the LCSS measure has some advantages, the following issues remain: Otherwise, candidate series will go undetected (Keogh and Pazzani, 2001). Thus, the support of the LCSS algorithm for timeline stretching is very limited.
To address the problems of the LCSS measure, this study presents an Extended Longest Common Subsequence (ELCS) measure. Let the scale factor and $\theta$ be real constants, bounded below by 1 and 0 respectively. Given two sequences $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, normalization places all sequence values in the interval $[0,1]$. Taking the longest subsequences in X and Y, respectively, that satisfy the matching conditions, the similarity between the time series is defined as formula (2).
In this definition, the scale factor concentrates the search path of the similarity matrix in a diamond-shaped area, which both prevents over-matching of the sequences and reduces the time complexity. Moreover, the search path area is closely related to the length of each sequence, which both avoids undetected sequences and adapts well to matching sequences with timeline stretching and deformation.
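The idea of matching within a tolerance while restricting the search region can be sketched as a thresholded, region-limited LCS. This is only an illustration under stated assumptions, not the ELCS definition of formula (2): the function name `constrained_lcs_length` is hypothetical, two values are assumed to match when they differ by at most `theta`, and a simple index band `abs(i - j) <= band` stands in for the diamond-shaped area controlled by the scale factor.

```python
def constrained_lcs_length(x, y, theta, band):
    """LCS length with a value tolerance and a restricted search region.

    ASSUMPTIONS (not taken from formula (2)):
      - x[i] and y[j] match when abs(x[i] - y[j]) <= theta;
      - only index pairs with abs(i - j) <= band may be matched.
    """
    n, m = len(x), len(y)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            in_region = abs(i - j) <= band
            if in_region and abs(x[i - 1] - y[j - 1]) <= theta:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


x = [0.1, 0.5, 0.9]
y = [0.9, 0.1, 0.5]
print(constrained_lcs_length(x, y, theta=0.05, band=0))  # no shift allowed
print(constrained_lcs_length(x, y, theta=0.05, band=1))  # shift of one step allowed
```

For clarity the sketch fills the whole table; an implementation aimed at speed would visit only the cells inside the permitted region.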
After normalization, the parameter θ in the definition gives the similarity measurement algorithm further flexibility in the matching space. Sequences are normalized according to formula (3), which avoids the difficulty of selecting the linear function f while retaining the numerical trend information of the sequence.
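One common way to map a sequence into [0,1] while preserving its trend is min-max scaling. The sketch below assumes that form for illustration only; it is not confirmed to be formula (3), and the function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(seq):
    """Scale a numeric sequence into [0, 1] (ASSUMED min-max form)."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        # A constant sequence carries no trend; map every value to 0.0.
        return [0.0 for _ in seq]
    return [(v - lo) / (hi - lo) for v in seq]


print(min_max_normalize([2, 4, 6]))
```

Because the scaling is monotone, the relative order of the values, and hence the rising or falling trend, is unchanged.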
is used as the final evaluation criterion.

EXPERIMENTAL RESULTS AND ANALYSIS
Parameter determination: The experiment on the SCC dataset analyzes the influence of the parameters on the algorithm. The ELCS measure contains the scale factor and θ, and the scale factor has a very significant effect on the performance of the algorithm. As the scale factor changes, the clustering accuracy rate is shown in Fig. 1, and the average internal class distance and average among class distance are shown in Fig. 2 and 3. As the scale factor increases, the clustering accuracy rate changes from low to high. When the scale factor is 2.2, the clustering accuracy rate is the highest, the average internal class distance is the smallest and the average among class distance is the largest. This result means that each sequence in the ELCS measure satisfies the length condition imposed by the scale factor. When the scale factor is too large, the positional correspondence of the test sequences is not well constrained and meaningless similar sequence segments are obtained; when the scale factor is too small, the search range of the similarity matrix is

CONCLUSION
Based on the LCS measure, this study introduces parameters that standardize the search path in the similarity matrix, improving the accuracy of the similarity measure and overcoming the inability of traditional Euclidean-distance-based similarity measures to deal with noise interference. In experiments on two different types of data sets, the ELCS measure achieves higher clustering correctness than the existing similarity measures, but its time expense is higher. In short, the measure can be applied effectively to a variety of time series similarity measurement tasks.

EXPERIMENT
Similarity measurement is the foundation of other data mining processes, and its accuracy directly affects their results. Conversely, we can use clustering results to estimate the accuracy of different similarity measures.
Experimental environment and data: The experimental environment is a 2.20 GHz E4500 CPU, 1024 MB of memory and the Windows XP Professional system. The experimental data sets are the Synthetic Control Chart Time Series (SCC) from the UCI KDD Archive and the CBF dataset. The SCC contains 600 sequences, each of length 60, divided into six categories. The CBF dataset contains Cylinder (C), Bell (B) and Funnel (F) classes and is a typical synthetic data set.
Experiment process: In cluster analysis, time series in the same group resemble each other, while time series in different groups are not similar. This study uses bottom-up hierarchical clustering:
Step 1: Set the initial data, with each time series as a class $C_i$
Step 2: Calculate the similarity between any two categories to get a similarity matrix
Step 3: Merge the two most similar categories, then return to Step 2 and loop until the number of classes equals the predetermined number of clusters
The distance between clusters is computed with the ELCS similarity measure. The results of the clustering are standard ,
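The bottom-up procedure in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the similarity between two clusters is assumed to be the average pairwise similarity of their members (the linkage rule is not specified in the text), and `match_ratio` is a toy stand-in for the ELCS measure.

```python
def match_ratio(a, b):
    """Toy similarity (stand-in for ELCS): fraction of equal positions."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(1 for p, q in zip(a, b) if p == q) / length


def agglomerative_clustering(series, similarity, k):
    """Merge the two most similar clusters until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Step 1: every series starts as its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(series))]

    def cluster_similarity(a, b):
        # ASSUMPTION: average pairwise similarity between the two clusters.
        total = sum(similarity(series[i], series[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Step 2: evaluate the similarity of every pair of clusters.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                s = cluster_similarity(clusters[p], clusters[q])
                if best is None or s > best[0]:
                    best = (s, p, q)
        # Step 3: merge the most similar pair, then repeat.
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


data = ["aaaa", "aaab", "zzzz", "zzzy"]
print(agglomerative_clustering(data, match_ratio, k=2))
```

Passing the similarity as a function makes it straightforward to swap in another measure and compare the resulting clusters, which is how the clustering results are used here to judge the measures.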

Fig. 6: A m
while the average among class distance is smaller. Because LCSS and ELCS are based on the LCS algorithm, they do not have the one-point-to-multiple-points correspondence problem of the DTW algorithm, and local noise can be ignored.