The Research on Data Mining of Slim Life Mode Based on Cycle Behavior

This study proposes a data mining method for the slim life mode based on cycle (periodic) behavior. The mining of periodic behavior is divided into four stages, of which the first two are data pre-processing. First, a sequence of stay points is parsed from the original location history, where a stay point represents a geographic area in which a person stays for some time. Second, the stay point sequence is clustered to find the significant places, such as the company, the supermarket and the home. Third, periods are mined at the significant places: taking a place as a reference point, the original location history is abstracted into a binary sequence according to whether each location point lies inside or outside the place, and two popular signal processing methods, the fast Fourier transform and autocorrelation, are combined to find the periods of every place. Fourth, the periodic behavior of places with the same period is mined: a probabilistic model of periodic behavior is constructed, and a method based on hierarchical clustering is then used to mine the periodic behavior across different places. Finally, an example is introduced.


INTRODUCTION
The theoretical basis of the balanced diet cycle is rather complex. It was summarized from the clinical conclusions of a central research department that spent three years tracking as many as 30,000 individuals. In simple terms, body fat changes to different degrees at different times and at different stages. If the timing of these changes could be captured accurately, the increase or decrease of fat could be controlled. In practice, however, modern medicine cannot yet pin the law of fat change down to the exact hour.
The normalized discrete Fourier transform of a sequence x(n), n = 0, 1, …, N-1, is a sequence of complex numbers X(f):

$$X(f_{k/N}) = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1,$$

where the subscript k/N denotes the frequency captured by each coefficient. In this paper we also write F(x) for the Fourier transform. For real signals the Fourier coefficients are symmetric; more precisely, each coefficient is the complex conjugate of its symmetric counterpart. The Fourier transform represents the original signal as a linear combination of complex sinusoids:

$$s_f(n) = \frac{e^{j 2\pi f n / N}}{\sqrt{N}}.$$

We can return from the frequency domain to the time domain with the inverse Fourier transform $F^{-1}(x) \equiv x(n)$:

$$x(n) = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} X(f_{k/N})\, e^{j 2\pi k n / N}, \quad n = 0, 1, \ldots, N-1.$$

As shown in Fig. 1, if some coefficients are discarded during the inverse transform, the result is an approximation of the original sequence. By choosing which coefficients to record, the Fourier transform can be applied in many fields, such as data compression and noise removal. Two methods built on the discrete Fourier transform of a sequence are described next: the periodogram and the cyclic autocorrelation (Lee et al., 2008a; Giannotti et al., 2007).
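The approximation obtained by discarding coefficients can be illustrated with a minimal sketch. It assumes the normalized transform with a 1/sqrt(N) factor in both directions and uses a naive O(N^2) transform only to stay self-contained; the function names `dft`, `reconstruct` and `reconstruction_error` are hypothetical and not part of the original method.

```python
import cmath
import math

def dft(x):
    """Normalized DFT: X[k] = (1/sqrt(N)) * sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t in range(n))
            for k in range(n)]

def reconstruct(x, m):
    """Rebuild x from its first m coefficients and their conjugate mirrors."""
    n = len(x)
    coeffs = dft(x)
    # Indices 0..m-1 plus the mirrored indices N-1, N-2, ..., N-(m-1).
    keep = {k % n for k in range(-(m - 1), m)}
    scale = 1.0 / math.sqrt(n)
    return [(scale * sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n)
                         for k in keep)).real
            for t in range(n)]

def reconstruction_error(x, m):
    """Sum of squared differences between x and its m-coefficient rebuild."""
    return sum((a - b) ** 2 for a, b in zip(x, reconstruct(x, m)))

# Hypothetical presence signal: 1 during hours 9..17 of each day, over 2 days.
signal = [1.0 if 9 <= t % 24 < 18 else 0.0 for t in range(48)]
coarse = reconstruction_error(signal, 5)
finer = reconstruction_error(signal, 10)
exact = reconstruction_error(signal, len(signal) // 2 + 1)
```

Keeping more coefficients can only reduce the squared error, and keeping all of them recovers the sequence up to rounding.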
Periodogram: Suppose X is the discrete Fourier transform of a sequence x. The periodogram P is given by the squared length of each Fourier coefficient:

$$P(f_{k/N}) = \|X(f_{k/N})\|^2, \quad k = 0, 1, \ldots, \lceil (N-1)/2 \rceil.$$

According to the Nyquist theorem, we can only detect frequencies of at most half the maximum signal frequency (Xia et al., 2006; Jeung et al., 2008; Liao et al., 2005). To find the K dominant periods, we pick the K largest values of the periodogram.
Each element of the periodogram gives the power at frequency k/N or, equivalently, at period N/k. More precisely, each discrete Fourier "bin" corresponds to a range of periods: the coefficient $X(f_{k/N})$ corresponds to the periods $[N/k, N/(k-1))$. It follows that the resolution of the periodogram becomes very coarse as the period length increases (Ye et al., 2009; Li et al., 2008; Cao et al., 2007).
The precision deteriorates for long periods because the width of the Fourier "bins" increases. A related issue is spectral leakage, which causes frequencies that are not integer multiples of the bin width to disperse over the entire spectrum. This can lead to "false alarms" in the detected periods. However, the periodogram can still provide an accurate indication of important short periods. Moreover, with the periodogram it is easy to automate the extraction of important periods by examining the statistical properties of the Fourier coefficients (Lee et al., 2008b; Yan et al., 2003; Krumm and Horvitz, 2006; Giannotti et al., 2006).
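As a minimal sketch of the periodogram step, the following computes the power of each coefficient and maps the strongest non-zero frequency k to its period range [N/k, N/(k-1)). The naive transform and the names `periodogram` and `dominant_period_ranges` are assumptions made for illustration only.

```python
import cmath
import math

def periodogram(x):
    """Power |X(f_{k/N})|^2 for k = 0 .. ceil((N-1)/2), normalized DFT."""
    n = len(x)
    half = math.ceil((n - 1) / 2)
    powers = []
    for k in range(half + 1):
        coeff = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        powers.append(abs(coeff) ** 2 / n)
    return powers

def dominant_period_ranges(x, top_k=1):
    """Return (k, power, (low, high)) for the top_k strongest frequencies.

    The constant term k = 0 is skipped because it carries no periodicity.
    Frequency k corresponds to the periods in [N/k, N/(k-1)).
    """
    n = len(x)
    powers = periodogram(x)
    ranked = sorted(range(1, len(powers)), key=lambda k: powers[k], reverse=True)
    result = []
    for k in ranked[:top_k]:
        low = n / k
        high = n / (k - 1) if k > 1 else math.inf
        result.append((k, powers[k], (low, high)))
    return result

# Hypothetical hourly presence at a place: hours 9..17 every day for 7 days.
presence = [1 if 9 <= t % 24 < 18 else 0 for t in range(24 * 7)]
top = dominant_period_ranges(presence, top_k=1)
best_k, best_power, best_range = top[0]
```

For this input the strongest frequency is k = 7, whose period range starts at 24 hours; the width of the range illustrates why a second step is needed to pin down the exact period.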
Cyclic autocorrelation: The second way to estimate the dominant periods of a time series x is the cyclic autocorrelation function (ACF), which examines how similar the sequence is to itself at different lags τ:

$$\mathrm{ACF}(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} x(n)\, x(n+\tau).$$

Since the autocorrelation is formally a convolution, we can avoid the quadratic cost of evaluating it in the time domain and compute it in the frequency domain with the normalized Fourier transform, up to a constant scaling factor:

$$\mathrm{ACF} = F^{-1}\langle X, X^{*} \rangle.$$

The symbol * denotes the complex conjugate. Compared with the periodogram, the cyclic autocorrelation gives finer-grained period detection and can therefore detect long periods more accurately. However, it is not suitable for automatic period discovery on its own, for the following reasons (Elfeky et al., 2005; Bar-David et al., 2009; Li et al., 2010):
• Discovering the important peaks automatically is more difficult than with the periodogram. Current autocorrelation-based methods require the significance threshold to be set manually.
• Even when a significance threshold is provided, many periods still satisfy the condition. Additional processing is therefore needed to eliminate the "false alarms".
• Low-amplitude events of high frequency may appear less important than high-amplitude events, even though the latter occur more rarely. As shown in Fig. 2, the 7-day period has low amplitude in the autocorrelation plot and does not appear important, whereas in the periodogram the 7-day period is very clear (Zhao and Mao, 2011; Zhu, 2011).
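The equivalence between the time-domain definition and the frequency-domain computation can be checked with a small sketch. It assumes an unnormalized naive transform, so the scaling factors are written out explicitly; the function names are hypothetical.

```python
import cmath

def circular_acf(x):
    """Time-domain definition: ACF(tau) = (1/N) * sum_n x[n] * x[(n+tau) mod N]."""
    n = len(x)
    return [sum(x[i] * x[(i + tau) % n] for i in range(n)) / n
            for tau in range(n)]

def circular_acf_via_dft(x):
    """Same quantity from the inverse transform of X multiplied by conj(X)."""
    n = len(x)
    coeffs = [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
              for k in range(n)]
    power = [c * c.conjugate() for c in coeffs]
    acf = []
    for tau in range(n):
        # Inverse transform (factor 1/n), then the 1/n average of the ACF.
        value = sum(power[k] * cmath.exp(2j * cmath.pi * k * tau / n)
                    for k in range(n)) / n
        acf.append(value.real / n)
    return acf

# Hypothetical weekly event observed daily for 4 weeks.
weekly = [1 if t % 7 == 0 else 0 for t in range(28)]
direct = circular_acf(weekly)
spectral = circular_acf_via_dft(weekly)
largest_gap = max(abs(a - b) for a, b in zip(direct, spectral))
```

In practice a fast Fourier transform would replace the naive transform; the sketch also shows the multiples problem from the list above, since lags 7, 14 and 21 all reach the same peak value.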
From the above discussion, although neither the periodogram nor the autocorrelation alone provides all of the spectral information, combining the two methods can provide it. In the following, the periods at an individual's important places are detected by combining the Fourier transform with autocorrelation.
Binary sequence period detection: For each important place of an individual, we propose a method for finding its potential periods. Taking the important place as a reference point, the movement sequence is converted into a binary sequence $B = b_1 b_2 \ldots b_n$, where $b_i = 1$ if the individual is at the important place at time stamp i and $b_i = 0$ otherwise.
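The conversion can be sketched as follows. The representation of a place as a circle with a center and a radius is an assumption made for this illustration, as is the name `to_binary_sequence`; the source does not specify how place membership is tested.

```python
import math

def to_binary_sequence(locations, center, radius):
    """Return b_i = 1 if location i lies inside the place, otherwise 0.

    The place is assumed to be a circle (center, radius); this is an
    illustrative choice, not the paper's definition of a place.
    """
    cx, cy = center
    return [1 if math.hypot(x - cx, y - cy) <= radius else 0
            for x, y in locations]

# Hypothetical track of five location samples, one per time stamp.
track = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.0), (4.8, 5.1), (0.1, -0.1)]
at_home = to_binary_sequence(track, center=(0.0, 0.0), radius=0.5)
```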
As stated above, we combine the Fourier transform and autocorrelation to find the periods in the binary sequence.
In the Discrete Fourier Transform (DFT), the sequence $B = b_1 b_2 \ldots b_n$ is transformed into a sequence of n complex numbers $X_1, \ldots, X_n$. For each coefficient $X_k$, the periodogram is defined as the squared length of the Fourier coefficient: $F_k = \|X_k\|^2$, where $F_k$ is the power of frequency k. To identify which frequencies are significant, we set a threshold and mark the frequencies whose power exceeds it.
The threshold is determined as follows. Let B' be a random permutation of the sequence B. Because B' should not have any periodicity, even its largest power does not indicate a period. We therefore record its maximum power as $P_{max}$, and only those frequencies of B whose power is higher than $P_{max}$ correspond to real periods. To reach a confidence level of 99% for the significant frequencies, we repeat the random permutation 100 times and record the maximum power of each permuted sequence. The 99th percentile of these 100 maximum power values (that is, the second largest) is used as the estimate of the power threshold.
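The permutation test can be sketched as below. The sketch assumes a naive transform, skips the constant term (which is identical in every permutation), and uses a fixed random seed so that the run is repeatable; `power_spectrum` and `power_threshold` are hypothetical names.

```python
import cmath
import random

def power_spectrum(b):
    """Power F_k = |X_k|^2 for k = 1 .. n//2 (normalized DFT, k = 0 skipped)."""
    n = len(b)
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(b[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def power_threshold(b, trials=100, seed=0):
    """99th percentile of the maximum power over random permutations of b."""
    rng = random.Random(seed)  # fixed seed: an assumption for repeatability
    maxima = []
    for _ in range(trials):
        shuffled = list(b)
        rng.shuffle(shuffled)
        maxima.append(max(power_spectrum(shuffled)))
    maxima.sort()
    # With 100 trials this is the 99th value in ascending order.
    return maxima[(99 * trials) // 100 - 1]

# Hypothetical hourly presence: hours 9..17 every day for 4 days.
b = [1 if 9 <= t % 24 < 18 else 0 for t in range(96)]
threshold = power_threshold(b)
spectrum = power_spectrum(b)
# spectrum[0] holds k = 1, so frequency k is stored at index k - 1.
significant = [k for k, p in enumerate(spectrum, start=1) if p > threshold]
```

For this input the daily frequency k = 4 (period 96/4 = 24 hours) exceeds the threshold by a wide margin, while most other frequencies do not.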
For each $F_k$ that exceeds the power threshold, we still need to determine the exact period in the time domain, because the frequency k in the frequency domain corresponds to a range of periods $[n/k, n/(k-1))$ in the time domain. To determine the period, we use the autocorrelation, which evaluates the similarity between a sequence and a shifted copy of itself: $R(\tau) = \sum_{i=1}^{n-\tau} b_i b_{i+\tau}$. Thus, given the period range [l, r) from the periodogram, we fit the values {R(l), R(l+1), …, R(r-1)} with a quadratic function to test whether there is a peak. If the fitted function is concave over the period range, which indicates that a peak exists, we return $t^{*} = \arg\max_{l \le t < r} R(t)$ as the detected period.
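A minimal sketch of this step follows. It assumes the range [l, r) is already given by the periodogram and obtains the sign of the quadratic coefficient of the least-squares fit by projecting onto a centered quadratic, which is valid because the lags are equally spaced; the helper names are hypothetical.

```python
def autocorrelation(b, tau):
    """R(tau) = sum of b[i] * b[i + tau] over all valid i."""
    return sum(b[i] * b[i + tau] for i in range(len(b) - tau))

def is_concave(values):
    """True if the least-squares quadratic through values opens downward.

    For equally spaced points the quadratic coefficient equals the
    projection of the data onto q_i = u_i^2 - mean(u^2), u_i centered.
    """
    m = len(values)
    if m < 3:
        return False  # a quadratic cannot be fitted to fewer than 3 points
    center = (m - 1) / 2
    squares = [(i - center) ** 2 for i in range(m)]
    mean_square = sum(squares) / m
    q = [s - mean_square for s in squares]
    coefficient = sum(y * qi for y, qi in zip(values, q)) / sum(qi * qi for qi in q)
    return coefficient < 0

def detect_period(b, low, high):
    """Return argmax of R(t) for low <= t < high if a peak exists, else None."""
    scores = [autocorrelation(b, t) for t in range(low, high)]
    if not is_concave(scores):
        return None
    return max(range(low, high), key=lambda t: scores[t - low])

# Hypothetical hourly presence (hours 9..17 each day) over 110 hours.
b = [1 if 9 <= t % 24 < 18 else 0 for t in range(110)]
# Frequency k = 5 covers periods [110/5, 110/4) = [22, 27.5): lags 22..27.
period = detect_period(b, 22, 28)
```

Note that when the sequence length is an exact multiple of the period, the true period falls on the left boundary of the range, so the concavity test may fail; handling that case is outside this sketch.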
Periodic behavior mining of important locations: After data processing and period detection have produced the periods of each important location, we next discuss periodic behavior mining. We group the important places that share the same period in order to obtain more concise and valuable periodic behaviors. However, because a behavior may exist in only part of the movement, several periodic behaviors may share the same period. For example, a movement record may contain two daily behaviors, one during the school term and another during the summer. Given a long movement history and a period of one day, we know neither how many periodic behaviors the movement contains nor which days belong to which periodic behavior. We observe, however, that "days" with the same periodic behavior have the same spatiotemporal pattern. A clustering method can therefore be used to mine the periodic behaviors. By applying a model that measures the distance between two "days", we can group the days into several clusters, each of which represents one periodic behavior. In the small example above, the "school" days should be grouped into one cluster and the "summer" days into another. The main issues in mining periodic behavior are therefore to establish a model of periodic behavior and to define a distance function on that model for clustering.
Periodic behavior model: First, we select all of an individual's important places that have period T. By combining the important places that share the same period, we obtain knowledge of the periodic behavior across different important places. For example, we can summarize Xiaoqiang's daily behavior as "at the company 9:00-18:00, in the dormitory 20:00-8:00".
Given the set of an individual's important places with period T, $O_T = \{o_1, o_2, \ldots, o_d\}$, we use $o_0$ to denote any location outside the important places $o_1, o_2, \ldots, o_d$. For each location sequence $LOC = loc_1 loc_2 \ldots loc_n$, we generate the corresponding symbolized movement sequence $S = s_1 s_2 \ldots s_n$, where $s_i = j$ if $loc_i$ is within $o_j$. S is further divided into $m = \lfloor n/T \rfloor$ segments. We use $I^j$ to denote the j-th segment and $t_k$ ($1 \le k \le T$) to denote the k-th relative time stamp within a period. $I_k^j = i$ means that the object is within $o_i$ at $t_k$ in the j-th segment. For example, for a period of T = 24 hours, a segment represents "one day", $t_9$ denotes 9:00 in a day, and $I_9^5 = 2$ means that the object is within $o_2$ at 9:00 on the fifth day. Naturally, we can use this distribution over space and time to establish the probability model.
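The segmentation can be sketched as below, assuming the symbol sequence is already available as a list of integers and that hours are indexed from 0, so that index 9 is 9:00. The schedule and the name `split_into_segments` are illustrative assumptions.

```python
def split_into_segments(symbols, period):
    """Cut S into m = floor(n / T) segments of length T; the tail is dropped."""
    m = len(symbols) // period
    return [symbols[j * period:(j + 1) * period] for j in range(m)]

def symbol_for_hour(hour):
    """Hypothetical schedule: o_2 company 9:00-18:00, o_1 dormitory 20:00-8:00."""
    if 9 <= hour < 18:
        return 2
    if hour >= 20 or hour < 8:
        return 1
    return 0  # o_0: anywhere outside the important places

# Five full days plus three extra hours of hourly samples.
symbols = [symbol_for_hour(t % 24) for t in range(24 * 5 + 3)]
segments = split_into_segments(symbols, period=24)

# Fifth day (list index 4) at 9:00 (hour index 9).
fifth_day_at_nine = segments[4][9]
```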

Maximum Likelihood Estimation (MLE):
In statistics, Maximum Likelihood Estimation (MLE) is a method for estimating the parameters of a statistical model. When a probabilistic model is applied to a data set, maximum likelihood estimation provides estimates of the model parameters. Given a data set and an underlying probability model, the maximum likelihood method selects the parameter values that give the observed data the greatest probability, that is, the parameters that maximize the likelihood function.
Suppose there is a sample $x_1, x_2, \ldots, x_n$ of n independent and identically distributed (IID) observations drawn from an unknown probability distribution function $f_0(x)$. It is surmised that $f_0$ belongs to a family of distributions $\{f(x \mid \theta), \theta \in \Theta\}$, called the parametric model, so that $f_0 = f(\cdot \mid \theta_0)$. The value $\theta_0$ is unknown and is regarded as the true value of the parameter. The aim is to find an estimate $\hat{\theta}$ that is as close to the true value as possible. Both the observed variables $x_i$ and the parameter θ may be vectors.
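To use maximum likelihood estimation, we first specify the joint distribution function of all observations, which for an IID sample gives the likelihood:

$$L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta).$$

In practice it is common to take the logarithm of both sides, which gives:

$$\ln L(\theta \mid x_1, \ldots, x_n) = \sum_{i=1}^{n} \ln f(x_i \mid \theta), \qquad \hat{\ell} = \frac{1}{n} \ln L.$$

Here $\ln L(\theta \mid x_1, \ldots, x_n)$ is called the log-likelihood and $\hat{\ell}$ the average log-likelihood. The maximum likelihood estimate is the parameter value that maximizes the average log-likelihood, i.e.:

$$\hat{\theta}_{mle} = \arg\max_{\theta \in \Theta} \hat{\ell}(\theta \mid x_1, \ldots, x_n).$$

For many models, the maximum likelihood estimate can be found as an explicit function of the observations $x_1, x_2, \ldots, x_n$. For other models, the maximization problem has no closed-form solution and the MLE must be found with a numerical optimization method.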
Space-time distribution matrix: Let $T = \{t_1, t_2, \ldots, t_T\}$ be the set of relative time stamps in a period, and let $x_k$ be the categorical random variable indicating which reference point is selected at time stamp $t_k$. $P = [p_1, p_2, \ldots, p_T]$ is a space-time distribution matrix in which each column $p_k = [p(x_k = 0), p(x_k = 1), \ldots, p(x_k = d)]^{\top}$ is an independent categorical distribution vector satisfying $\sum_{i=0}^{d} p(x_k = i) = 1$. Now assume that the segments $I^1, I^2, \ldots, I^l$ follow the same periodic behavior. The probability that the segment set $I = \bigcup_{j=1}^{l} I^j$ is generated by some distribution matrix P is:

$$P(I \mid P) = \prod_{I^j \in I} \prod_{k=1}^{T} p(x_k = I_k^j)$$
According to Maximum Likelihood Estimation (MLE), the optimal model is the solution of the following maximum likelihood problem:

$$\hat{P} = \arg\max_{P} \log P(I \mid P), \qquad p(x_k = i) = \frac{\sum_{I^j \in I} \mathbf{1}_{I_k^j = i}}{|I|}.$$

Periodic behavior: Let I be a set of segments. The periodic behavior over all segments in I, denoted H(I), is a pair <T, P>, where T is the period length and P is the space-time distribution matrix learned from the equation above. We further let |I| denote the number of segments covered by this periodic behavior.
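The maximum likelihood estimate of the distribution matrix reduces to counting relative frequencies, which the following sketch illustrates. The matrix is stored as a list of columns indexed `P[k][i]` with zero-based k; this layout and the function names are assumptions for illustration.

```python
from collections import Counter

def estimate_distribution_matrix(segments, num_places):
    """MLE: P[k][i] is the share of segments whose symbol at t_k equals i.

    Symbols range over 0..num_places, where 0 stands for o_0.
    """
    total = len(segments)
    matrix = []
    for k in range(len(segments[0])):
        counts = Counter(seg[k] for seg in segments)
        matrix.append([counts[i] / total for i in range(num_places + 1)])
    return matrix

def likelihood(segments, matrix):
    """P(I | P): product over segments and time stamps of p(x_k = I_k^j)."""
    value = 1.0
    for seg in segments:
        for k, symbol in enumerate(seg):
            value *= matrix[k][symbol]
    return value

# Three hypothetical segments with period T = 3 and d = 2 important places.
segments = [[1, 2, 0], [1, 2, 2], [1, 0, 2]]
P = estimate_distribution_matrix(segments, num_places=2)
uniform = [[1 / 3] * 3 for _ in range(3)]
best = likelihood(segments, P)
baseline = likelihood(segments, uniform)
```

The estimated matrix gives the observed segments a higher likelihood than the uniform matrix, as expected of a maximum likelihood solution.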
Periodic behavior mining: We first define the distance function used for clustering periodic behaviors. With periodic behavior defined, we can estimate the periodic behavior of a set of segments. Now, given a set of segments $\{I^1, I^2, \ldots, I^m\}$, we need to find which segments are generated by the same periodic behavior. Assuming there are K underlying periodic behaviors, each of which exists in part of the movement, the segments should be partitioned into K groups so that each group corresponds to one periodic behavior.
One potential solution to this problem is to apply a clustering method. To do so, a distance measure between two periodic behaviors must be defined. Since a behavior is represented as a pair <T, P> and T is fixed, the distance is determined by the space-time distribution matrices. Further, a small distance between two periodic behaviors indicates that the segments contained in each behavior are likely to have been generated by the same periodic behavior.
There are many ways to measure the distance between two space-time distribution matrices P and Q. Here, assuming that the variables at different time stamps are independent, we propose to use the well-known Kullback-Leibler divergence as the distance measure:

$$KL(P \| Q) = \sum_{k=1}^{T} \sum_{i=0}^{d} p(x_k = i) \log \frac{p(x_k = i)}{q(x_k = i)}$$
When KL(P||Q) is small, the distribution matrices P and Q are similar; when it is large, they are different.
Notice that when $p(x_k = i)$ or $q(x_k = i)$ has probability 0, the value of KL(P||Q) is infinite. To avoid this situation, we add a background variable u to $p(x_k = i)$ and $q(x_k = i)$ for all reference points:

$$p(x_k = i) \leftarrow (1 - \lambda)\, p(x_k = i) + \lambda u,$$

and likewise for $q(x_k = i)$, where λ is a small smoothing parameter, 0<λ<1.
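A minimal sketch of the smoothed divergence follows. It assumes that the background variable u is uniform over the d+1 symbols, which the text above does not state, and that matrices are stored as lists of columns; `smooth` and `kl_divergence` are hypothetical names.

```python
import math

def smooth(matrix, lam=0.05):
    """Apply p <- (1 - lam) * p + lam * u with u = 1 / (d + 1).

    The uniform background u is an assumption made for this sketch.
    """
    return [[(1 - lam) * p + lam / len(column) for p in column]
            for column in matrix]

def kl_divergence(P, Q):
    """KL(P||Q) summed over time stamps k and symbols i."""
    return sum(p * math.log(p / q)
               for p_col, q_col in zip(P, Q)
               for p, q in zip(p_col, q_col))

# Two hypothetical behaviors with T = 2 time stamps and symbols 0..2.
P = smooth([[0.0, 1.0, 0.0], [0.0, 0.5, 0.5]])
Q = smooth([[0.0, 0.0, 1.0], [0.0, 0.5, 0.5]])
same = kl_divergence(P, P)
different = kl_divergence(P, Q)
```

Without smoothing, the first column of P and Q would produce a division by zero; after smoothing the divergence is finite and positive.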
To justify from a statistical point of view that the above method solves our problem, we return to our proposed model. Because P is the distribution matrix estimated from the segment set I, KL(P||Q) can be further expanded:

$$KL(P \| Q) = -H(P) - \sum_{k=1}^{T} \sum_{i=0}^{d} p(x_k = i) \log q(x_k = i) = -H(P) - \frac{1}{|I|} \log P(I \mid Q),$$

where H(P) is the entropy of P and can be seen as a constant. Therefore, the Kullback-Leibler divergence measures the likelihood that the segment set I is generated by the distribution matrix Q: the smaller the divergence, the larger the likelihood. In our clustering algorithm, when choosing among candidate matrices Q, we simply choose the Q that maximizes the likelihood P(I|Q). Now, suppose we have two periodic behaviors, $H_1 = \langle T, P \rangle$ and $H_2 = \langle T, Q \rangle$. We define the distance between the two behaviors as $dist(H_1, H_2) = KL(P \| Q)$.

Periodic behavior clustering algorithm: Assuming that K potential periodic behaviors exist, there are many ways to group the segment set into K clusters. However, the number of potential periodic behaviors is usually not known. We propose a hierarchical clustering method to group the segments and to determine the optimal number of periodic behaviors. In each iteration of the hierarchical clustering, the two clusters with the minimum distance are merged. We use a representation error to monitor the quality of the clustering. If the representation error suddenly increases when the number of clusters goes from K to K-1, this suggests that K may be the correct number of periodic behaviors.
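The merging loop can be sketched as follows. Several choices here are assumptions rather than the source's method: the number of clusters K is passed in directly instead of being chosen from the representation error, the divergence is evaluated for both orderings of each pair and the smallest value wins, and smoothing uses a uniform background. All function names are hypothetical.

```python
import math
from collections import Counter

def distribution_matrix(segments, num_symbols, lam=0.05):
    """Smoothed relative-frequency matrix, one column per time stamp."""
    total = len(segments)
    matrix = []
    for k in range(len(segments[0])):
        counts = Counter(seg[k] for seg in segments)
        matrix.append([(1 - lam) * counts[i] / total + lam / num_symbols
                       for i in range(num_symbols)])
    return matrix

def kl_divergence(P, Q):
    """KL(P||Q) summed over all time stamps and symbols."""
    return sum(p * math.log(p / q)
               for p_col, q_col in zip(P, Q)
               for p, q in zip(p_col, q_col))

def cluster_segments(segments, num_symbols, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[j] for j in range(len(segments))]
    while len(clusters) > k:
        models = [distribution_matrix([segments[j] for j in c], num_symbols)
                  for c in clusters]
        best = None
        for a in range(len(clusters)):
            for b in range(len(clusters)):
                if a == b:
                    continue
                distance = kl_divergence(models[a], models[b])
                if best is None or distance < best[0]:
                    best = (distance, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Hypothetical days with T = 4: "school" days visit o_1, "summer" days o_2.
days = [
    [0, 1, 1, 0],  # school
    [0, 2, 2, 0],  # summer
    [0, 1, 1, 0],  # school
    [0, 2, 2, 0],  # summer
    [1, 1, 1, 0],  # school, slightly different
    [0, 2, 2, 2],  # summer, slightly different
]
groups = cluster_segments(days, num_symbols=3, k=2)
```

On this toy input the school days and the summer days end up in separate clusters, including the two slightly different days.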

CONCLUSION
Of course, the balanced diet cycle is just one application of the theory; as research deepens, the study of fat change will give rise to other applications in various forms. Ideally, in the hierarchical clustering process, segments generated by the same behavior are merged first because they have the smallest distance. We therefore judge a cluster to be good if, at each particular time stamp, the segments in the cluster are concentrated in a single important place. A natural representation error is thus a measure for estimating the quality of the clustering.
The Fourier transform uses complex sinusoids to represent the original signal. Therefore, the Fourier coefficients record the amplitude and phase of the sinusoids after the signal x is projected onto them.

Fig. 1: Reconstruction of the signal from its first 5 Fourier coefficients
In the maximum likelihood estimate of the space-time distribution matrix, $\mathbf{1}_A$ denotes the indicator function associated with event A. That is, $p(x_k = i)$ is the relative frequency of reference point $o_i$ at time stamp $t_k$ over the segments in I.