Relative Density Clustering for Motif Discovery (ReDMoDe)

Jaison Davis
Data Analytics, iBall Research Lab, Bangalore, India
jaison.davis@iball.co.in

Naveen Suresh
Department of CSE, PES University, Bangalore, India
naveen.213@gmail.com

Abstract—This paper presents a novel technique for parameter-light time series motif clustering of Internet of Things (IoT) time series data. It describes how a novel relative-density-based clustering algorithm is used to find clusters of similar waveforms.

Keywords—time series, motif discovery, clustering, DTW, time series slicing, time series subsequences

I. INTRODUCTION

In the recent past, numerous advancements have been made in the field of the Internet of Things (IoT), leading to more reliable and accurate methods of monitoring industrial processes and gathering data. IoT sensors typically produce time series data. These data contain recurring patterns which can prove quite useful if they are discovered and clustered together. For example, data from a sensor that monitors the level of a water tank will show various patterns that signify full fill cycles, partial fill cycles, light consumption, heavy consumption and even leaks. Such information can be useful to consumers as well as product manufacturers.

To cluster time series data, there needs to be a defined measure of similarity (or dissimilarity) between two waves. Many such similarity measures exist, including Euclidean distance, Dynamic Time Warping (DTW) distance, Pearson's correlation coefficient, Mahalanobis distance and Spearman's correlation coefficient. The DTW distance has proved to be an accurate measure of dissimilarity for time series data [1][2], because offsets in the shape of the curve do not affect the DTW distance, unlike most other distance measures (such as Euclidean distance or Pearson's correlation). Further, by imposing a window-size restriction during the DTW calculation, it can be made to match only alignments whose phase shift lies within a certain threshold. This restriction also makes the DTW computation more efficient [3]; a minimal sketch of such a windowed DTW appears at the end of this section.

Existing clustering algorithms suffer from various drawbacks. The standard k-means algorithm requires the measure of similarity to be a distance metric. DTW, however, is not a metric: it yields only pairwise dissimilarities between two time series, not an embedding in a metric space, so cluster centroids cannot be computed. Traditional density-based approaches such as DBSCAN and OPTICS have their own problems: a density-based algorithm can merge two groups of high-density clusters into one if the boundaries of the clusters lie close to one another. This often happens with IoT data, as the recurring patterns exhibit slight variations (due to changes in the durations of recurring patterns or due to sensor errors), and it can result in suboptimal clustering.

Our clustering algorithm, Relative Density Clustering for Motif Discovery (ReDMoDe), employs specially developed density clusters as a starting criterion for identifying clusters, followed by cluster refinement through set subtraction and boundary-point reassignment to further improve the quality of the clusters. This method yields groups that are clustered based on the relative density of the data points, which works especially well in capturing the essence of IoT data. It requires only a few parameters: the data, the distance (the EPS value) below which two wavelets are grouped as similar in the initial density step, and a minimum acceptable cluster size. We have also discovered an elegant and effective solution for determining the EPS value to be used as a parameter for the clustering: optimizing a simple cost function. The cost-function optimization, coupled with the clustering, yields a parameter-light algorithm for efficacious clustering of time series data.
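For illustration, the following minimal Python sketch (our own, not the implementation used in this paper) computes DTW under a Sakoe-Chiba window restriction; the function name and the band convention are assumptions made for this example.

```python
import numpy as np

def dtw_distance(x, y, window=None):
    """DTW distance between 1-D sequences x and y, restricted to a
    Sakoe-Chiba band of half-width `window` (None = unconstrained)."""
    n, m = len(x), len(y)
    w = max(n, m) if window is None else max(window, abs(n - m))
    # D[i, j] = cost of the best warping path aligning x[:i] with y[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# Two identical shapes, phase-shifted: DTW tolerates the shift
# far better than a pointwise (Euclidean) comparison does.
t = np.linspace(0, 2 * np.pi, 64)
a, b = np.sin(t), np.sin(t + 0.5)
print(np.linalg.norm(a - b), dtw_distance(a, b, window=8))
```

Shrinking the window both constrains how far matched points may drift apart in time and reduces the number of cells of the dynamic-programming table that must be filled.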
II. THE PROBLEM

Pattern mining of a time series is a well-understood and largely solved problem for longer waveforms. However, many applications require mining shorter waveforms for patterns and anomalies, and this remains an important, unsolved problem. Unlike longer waveforms, the distance between two short motifs cannot be computed with a correlation-based function; such motifs require measures like the DTW distance, which are non-metric in nature. Using DTW for clustering therefore moves the problem into a non-metric feature space, and clustering in non-metric spaces is what our algorithm addresses.

With our clustering algorithm, we aim to slice and cluster a time series without using any rigid, user-defined parameters. The clustering itself must be sensitive to the space it operates in: in certain cases, wavelets of different but similar shapes should form one cluster if those wavelets cannot form individual clusters of their own.

A. Why solving this is hard

Although DTW is an effective measure of wavelet "shape distance", DTW is not a metric (it violates the triangle inequality), which means it does not work well with traditional clustering algorithms. A new clustering algorithm is therefore needed. What compounds the problem is that the algorithm must be parameter-light and sensitive to the data it operates on.

B. Why we need to solve it

Data generated by IoT devices is a mix of noise and motifs. Discovering the motifs is necessary to interpret the data that the devices generate. For instance, data from a fitness tracker needs to be clustered and labelled in terms of walking, running, etc. for it to make any meaningful sense.

III. RELATED WORK

Several algorithms exist to slice and cluster a time series. Purely density-based approaches such as DBSCAN and OPTICS [4] tend to cluster data very loosely: the within-cluster distances of the resulting clusters are above optimal levels. Moreover, most of these algorithms either rely on hard parameters or are not robust enough for noisy IoT data.

A. k-Means Clustering

The primary problem with k-means clustering is that the algorithm is not suited to non-metric distance measures like DTW. Additionally, the algorithm needs the parameter k to work. Another drawback is that it assigns every data point to some cluster, which can corrupt the homogeneity of the clusters.

B. Density-Peaks Algorithms

Although the Density-Peaks algorithm [5] is designed to work with non-metric distance measures as well, it needs a computed parameter for it to work; a new algorithm would have to be developed to calculate the exact number of clusters required.

C. UCR Matrix Profile for Motif Search

The UCR matrix profile [6] is a fast algorithm that works well when searching for a motif in a time series. The correlation-based distance measure that the matrix profile uses works well for the discovery of longer motifs, but it does not work for discovering the clusters of shorter motifs that are important in IoT applications.

D. Affinity Propagation

Affinity Propagation [8] clusters data by exchanging messages between points to select exemplars, and it accepts an arbitrary similarity matrix, so it can in principle be applied to DTW dissimilarities. However, it requires a preference parameter that strongly influences the number of clusters produced, and in our evaluation (Section VI) it yielded unusable clusters on DTW-based dissimilarities.

IV. DISTANCE METRICS AND MEASURES

Some measures of similarity between time series data include Euclidean distance, Pearson's correlation coefficient and the Dynamic Time Warping (DTW) distance. Euclidean distance is a good, inexpensive similarity measure for time series data provided that there is nearly zero phase difference between the motifs to be discovered. In practice, however, ensuring this is challenging, and arbitrary slices generally have to be clustered.
Euclidean distance takes an adverse hit in such a scenario and does not capture the similarity between time series accurately; Pearson's correlation coefficient suffers from the same limitation. DTW, however, is much more forgiving of phase differences and minor variations in the signal, because similar regions of the series are matched even when they are shifted in time. Although DTW is significantly more expensive to compute, there exist methods to make its computation faster, such as restricting the computation to a window or using FastDTW [7]. Hence, DTW is chosen as the measure of similarity for time series clustering.

V. ALGORITHM

Our clustering algorithm, ReDMoDe, requires the following parameters: (i) the distance matrix Δ, containing the distances between the different wavelets in the dataset; (ii) the cluster cut-off size λ; and (iii) the "EPS" value ε.

The cluster cut-off size is the minimum number of motifs that a cluster should contain. It depends on the volume of data being clustered and can hence be heuristically determined for most purposes.

The EPS value ε is used in the algorithm to determine whether a point belongs to another's vicinity: for data points A and B,

A ∈ μ_B if Δ_AB ≤ ε, and A ∉ μ_B if Δ_AB > ε,

where μ_X is the set of points in X's vicinity. This is used to calculate the initial densities of points, as specified by the algorithm. A similar parameter is expected as an input to DBSCAN, which clusters points purely based on their density on the hyperplane. This parameter can be difficult to estimate heuristically, yet it is key to the algorithm's performance. However, it can be estimated accurately by optimizing a combined cost function, as specified in the following section.

A. Clustering Algorithm

Our clustering algorithm expects the distance matrix, the EPS value and the cluster cut-off size (the minimum number of points a group must contain to be considered a cluster). The algorithm can be divided into three parts: (i) "pure" cluster formation; (ii) removal of anomalies (points that cannot be clustered well); and (iii) reassignment of points from clusters that do not satisfy the cluster cut-off size.

"Pure" density clusters are groups of points clustered objectively based on an ε-vicinity, obtained by considering every point in the space as a centroid. These clusters contain very similar points, but a point may belong to multiple clusters. The pure clusters are also key to the EPS value calculation.

Initially, fuzzy pure clusters are formed: each point in the dataset acts as a cluster centre, and all points within a radius ε of that point are considered part of the corresponding cluster. Hence a point may belong to multiple clusters, but only N clusters exist in total. If a point has fewer than λ points in its ε-vicinity, it is also marked as an anomaly. The clusters are then sorted in descending order of size, and an iterative set subtraction is applied, wherein each smaller cluster is replaced by the set difference between itself and the larger clusters. This removes points that exist in multiple clusters from the smaller of the clusters, leaving the larger clusters large while making the smaller clusters smaller still. This is demonstrated in Algorithm 1 and creates the pure clusters. The clusters whose centres were marked anomalous are then removed from the list of pure clusters, as illustrated in the sketch below.
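The following Python sketch is a minimal reading of the pure-cluster formation described above (Algorithm 1), assuming a precomputed DTW distance matrix; names such as `pure_clusters` and the exact tie-breaking in the sort are our own choices, not part of the reference implementation.

```python
import numpy as np

def pure_clusters(dist, eps, cutoff):
    """Form 'pure' density clusters from an N x N DTW distance matrix.

    Returns (clusters, anomalies): disjoint sets of point indices after
    set subtraction, and the indices marked as anomalous centres.
    """
    n = dist.shape[0]
    # One fuzzy cluster per point: its eps-vicinity (the point included).
    vicinity = {i: set(np.flatnonzero(dist[i] <= eps)) for i in range(n)}
    # Centres with fewer than `cutoff` neighbours are marked anomalous.
    anomalies = {i for i in range(n) if len(vicinity[i]) < cutoff}
    # Sort cluster centres by vicinity size, largest first, and subtract
    # each larger cluster from all smaller ones: clusters become disjoint,
    # large clusters stay large, small ones shrink further.
    order = sorted(vicinity, key=lambda i: len(vicinity[i]), reverse=True)
    for a, big in enumerate(order):
        for small in order[a + 1:]:
            vicinity[small] -= vicinity[big]
    # Drop clusters centred on anomalous points and any emptied clusters.
    clusters = [vicinity[i] for i in order
                if i not in anomalies and vicinity[i]]
    return clusters, anomalies
```

Under these assumptions, the refinement and reassignment steps described next would operate on the returned disjoint clusters.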
These pure clusters are further refined by removing border points from each cluster and reassigning them to the nearest cluster. A border point j belonging to cluster C_i is one that meets the following criterion: the mean intra-cluster distance of C_i is less than the mean distance from point j to all other points in C_i.

After this, the pure clusters no longer contain any point that exists in more than one cluster. However, some clusters may still not meet the cut-off size constraint. For these, cluster reassignment is performed: all points in clusters that do not meet the cluster cut-off size are reassigned to the cluster closest to each point, where the average distance between the point and the members of a cluster determines the closest cluster. This process is outlined in Algorithm 2.

B. EPS Calculation

The EPS parameter can be specified as an input to the algorithm. However, an accurate and effective method to calculate the EPS value was discovered by formulating a cost function to be minimized. Our cost function combines the mean of the mean pairwise distances within clusters and the ratio of the number of clusters to the number of motifs.

Let δ denote the within-cluster distance of a pure cluster, defined as the average distance between every pair of points in the cluster. The within-cluster mean distance δ* is then the mean of δ over all clusters, for a given EPS value. Further, let ρ denote the grouped ratio: the ratio of the number of pure clusters to the number of data points, given an EPS value. The within-cluster mean distance δ* varies in a non-decreasing manner with increasing EPS value, whereas the grouped ratio ρ varies in a non-increasing manner. After normalizing these quantities, the cost function can be formulated as

Cost(ε) = A·δ*(ε) + B·ρ(ε),

where A and B are weights. The exact relationship of δ* and ρ with the EPS value depends on the dissimilarities between the data points; however, the cost function is observed to vary with ε as follows: (i) increasing for ε ∈ [0, α₁]; (ii) decreasing for ε ∈ [α₁, α₂]; (iii) increasing for ε ∈ [α₂, ∞). By minimizing this function and finding the global minimum at α₂, we obtain the optimum EPS value to pass to the clustering algorithm (a sketch of this search appears at the end of this section).

Figure 1: A generic variation of the cost function with respect to the EPS value. The minimum is taken as the most effective EPS value for clustering.

C. Time Complexity

The ReDMoDe clustering algorithm can be split into two parts for analysis: pure cluster formation and reassignment. For an input of N time series slices, pure cluster formation iterates through all pairs of clusters and hence has an asymptotic time complexity ∈ O(N²).

The reassignment can be further split into the refinement of border points and the assignment of points to the closest cluster. Although the worst-case time complexity of the border-point refinement appears to be ∈ O(N³), the worst case for computing the pairwise means occurs when only one cluster exists; that cluster then contains C(N, 2) pairs of points, but since only one cluster exists, the process has an amortized cost ∈ O(N²). The assignment of points to the nearest clusters must iterate over all points belonging to clusters smaller than the cut-off threshold. For each such point, the average distance must be calculated between the point and every cluster (i.e., the distances between the point and each point in every other large cluster). The amortized cost of this operation is ∈ O(N²). Hence, the entire clustering algorithm has an amortized cost ∈ O(N²).
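As a concrete reading of the EPS search of Section V.B, the sketch below scans candidate ε values, forms pure clusters at each, and minimizes the normalized cost A·δ*(ε) + B·ρ(ε). It assumes the `pure_clusters` helper from the earlier sketch; the equal weights A = B = 1 and the candidate grid are illustrative choices, not values prescribed by the paper.

```python
import numpy as np
from itertools import combinations

def mean_within_cluster_distance(dist, clusters):
    """delta*: mean over clusters of each cluster's mean pairwise distance."""
    per_cluster = []
    for c in clusters:
        pairs = list(combinations(sorted(c), 2))
        if pairs:
            per_cluster.append(np.mean([dist[i, j] for i, j in pairs]))
    return float(np.mean(per_cluster)) if per_cluster else 0.0

def estimate_eps(dist, cutoff, candidates, A=1.0, B=1.0):
    """Pick the eps minimizing A*delta*(eps) + B*rho(eps) over candidates."""
    n = dist.shape[0]
    deltas, rhos = [], []
    for eps in candidates:
        clusters, _ = pure_clusters(dist, eps, cutoff)  # earlier sketch
        deltas.append(mean_within_cluster_distance(dist, clusters))
        rhos.append(len(clusters) / n)  # grouped ratio
    def norm(v):
        v = np.asarray(v, dtype=float)
        rng = np.ptp(v)
        return (v - v.min()) / (rng if rng else 1.0)
    cost = A * norm(deltas) + B * norm(rhos)
    return candidates[int(np.argmin(cost))]
```

Under our assumptions, a reasonable candidate grid is a set of quantiles of the off-diagonal entries of the distance matrix, so the search adapts to the scale of the data.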
VI. RESULTS

We applied the ReDMoDe algorithm to time series data generated by various IoT devices. Some of the results we obtained are illustrated in Figures 2 and 3.

A. Case I

Figure 2 shows all the motifs and anomalies that the algorithm detected. The first of the six clusters shown (top left of the plots) contains 325 motifs, and the second cluster (top right) contains 112 motifs. These motifs share a common underlying pattern but exhibit phase differences, noise, etc. that make them difficult to cluster. The relative-density-based method we use nevertheless groups these motifs into individual clusters successfully. In the first cluster, although the dip in the data occurs at a different phase for different motifs, the algorithm is able to group them into the same cluster.

Figure 2: The first time series plot shows the IoT data. The second plot below it shows the 7 anomalous patterns. The subsequent plots show some of the motifs discovered by the algorithm.

B. Case II

With this time series data, the clusters were formed based on both the magnitude and the shape of the motifs. Although the magnitudes vary and there is much noise in the data, the estimated EPS value helps group the slices into appropriate clusters without including the noisy signals. Cluster number 4 in the figure shows the robustness of the algorithm when dealing with motifs that lie farther from each other in DTW distance. Considering that the algorithm does not require rigid parameters, it is dynamic enough to deal with such data.

Figure 3: The first plot shows IoT time series data that operates at different levels and frequencies through time. The plots below the main data show 8 of the many clusters generated by our algorithm.
C. Comparative Evaluation

We applied our clustering algorithm to a diverse set of time series data and compared the results obtained with those from other algorithms. We picked time series that cover a wide spectrum of shapes and lengths; we believe these datasets are representative of a large variety of real-world data. The window for rolling was set to … The datasets used were:

TS 1: Mean daily temperature, Fisher River near Dallas, Jan 01, 1988 to Dec 31, 1991. https://datamarket.com/data/set/235d/mean-daily-temperature-fisher-river-near-dallas-jan-01-1988-to-dec-31-1991#!ds=235d&display=line

TS 2: Monthly Lake Erie levels, 1921–1970. https://datamarket.com/data/set/22pw/monthly-lake-erie-levels-1921-1970#!ds=22pw&display=line

TS 3: Annual Swedish fertility rates (1000's), 1750–1849. https://datamarket.com/data/set/22s2/annual-swedish-fertility-rates-1000s-1750-1849-thomas-1940#!ds=22s2&display=line

TS 4: ECG signal data from a single patient. https://github.com/c-labpl/qrs_detector/tree/master/ecg_data (see https://github.com/c-labpl/qrs_detector/)

TS 5: Sensor temperature data from a refrigerator. Source: JNARK

We use the mean of the average intra-cluster DTW distance of the resulting clustering as the metric to compare algorithms; a sketch of its computation follows.
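For clarity, the comparison metric can be computed as below. This is a minimal sketch of our reading of the metric, not an official evaluation script; note that the unweighted value coincides with δ* from the EPS sketch, and the size-weighted variant is our assumption about the "weighted mean" reported in the figures.

```python
import numpy as np
from itertools import combinations

def mean_avg_intra_cluster_dtw(dist, clusters, weighted=False):
    """Mean over clusters of each cluster's average pairwise DTW distance.

    `dist` is the precomputed DTW distance matrix over all slices and
    `clusters` is a list of sets of slice indices (anomalies excluded).
    With weighted=True, clusters contribute proportionally to their size
    (our assumption for the weighted variant). Lower is better.
    """
    vals, sizes = [], []
    for c in clusters:
        pairs = list(combinations(sorted(c), 2))
        if pairs:
            vals.append(np.mean([dist[i, j] for i, j in pairs]))
            sizes.append(len(c))
    if not vals:
        return 0.0
    return float(np.average(vals, weights=sizes if weighted else None))
```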
Table: Results from the clustering algorithms applied to our test time series datasets. The metric is the mean of average intra-cluster DTW distance, with the number of clusters in parentheses; A denotes the number of anomalous slices detected. The second value pair in each cell is the weighted mean of the average intra-cluster distance (number of clusters); see Fig. 6.

| Algorithm | TS 1 (145) | TS 2 (58) | TS 3 (8) | TS 4 (270) | TS 5 (3985) |
| ReDMoDe | 0.10 (18), A:0, 0.04 (13) | 0.07 (8), A:12, 0.05 (7) | 0.27 (2), A:0, 0.13 (2) | 0.02 (30), A:1, 0.04 (24) | 0.04 (245), A:76, 0.04 (235) |
| k-Medoids + ReDMoDe | 0.10 (18), A:0, 0.11 (13) | 0.11 (8), A:0, 0.13 (7) | 0.31 (2), A:0, 0.26 (2) | 0.03 (30), A:0, 0.04 (24) | 0.06 (245), A:0, 0.06 (235) |
| OPTICS | 0.17 (3), A:41, 0.34 (3) | 0.07 (5), A:18, 0.14 (5) | 0.12 (1), A:5, 0.20 (2) | 0.07 (8), A:10, 0.34 (8) | 0.11 (11), A:24, 0.31 (8) |
| Affinity Prop. | 0.34 (4), A:0, 0.33 (4) | 0.41 (2), A:0, 0.41 (2) | 0.31 (3), A:0, 0.41 (3) | 0.77 (4), A:0, 0.57 (4) | 0.78 (3), A:0, 0.35 (3) |

ReDMoDe: Unlike most of its counterparts, ReDMoDe did not require input parameters. Nonetheless, it achieved an optimal balance between cluster count and homogeneity.

k-Medoids + ReDMoDe: We fed the parameter obtained from ReDMoDe to the k-medoids algorithm. With its help, k-medoids was able to generate fairly good clusters, but it was not effective at labelling anomalous motifs.

OPTICS: The clusters obtained were not up to the mark, and the number of anomalies was abnormally high. Considering that this algorithm needs three input parameters, it is ineffective for this problem.

Affinity Propagation: The clusters obtained from Affinity Propagation were unusable and essentially random.

Fig. 4: Comparison of the largest clusters of TS 4 across clustering algorithms: 1. ReDMoDe; 2. k-Medoids + ReDMoDe; 3. OPTICS; 4. Affinity Propagation.

While k-medoids clustering works fairly well with the help of ReDMoDe, on its own it is ineffective at clustering points in a DTW-based feature space. Most other clustering algorithms proved ineffective when working with data in a non-metric feature space. Our novel algorithm proved effective at clustering time series motifs with little or no user input.

Summary of the EPS calculation algorithm: Step 1: form a cluster around each point. Step 2: subtract the smaller clusters from the larger ones, leaving the largest clusters. Step 3: plot the within-cluster mean DTW distance against the cluster radius (ε) being tried.

Fig. 5: Normalized ratio of cluster count to maximum slice count (blue) plotted with the normalized mean of average within-cluster distance (orange).

Fig. 6: Weighted mean of average intra-cluster distance.

VII. CONCLUSION

Our novel algorithm for clustering time series data obtained from IoT devices proves to be highly effective. It is capable of locating and clustering motifs of various lengths, as well as patterns that vary in time and frequency. Furthermore, minor phase differences are ignored by the DTW distance measure, which leads to optimal clustering by the ReDMoDe algorithm. The EPS parameter that ReDMoDe needs can be calculated by optimizing a cost function, which makes the algorithm parameter-light in nature. The clustering of real-world IoT data yielded state-of-the-art results and produced clusters of various recurring patterns. These clusters are useful for the analysis of cycles and patterns that occur regularly.

REFERENCES

[1] A. Mueen and E. J. Keogh, "Extracting Optimal Performance from Dynamic Time Warping," in Proc. KDD, 2016, pp. 2129-2130.
[2] F. Iglesias and W. Kastner, "Analysis of Similarity Measures in Time Series Clustering for the Discovery of Building Energy Patterns," 2013.
[3] C. A. Ratanamahatana and E. Keogh, "Making Time-Series Classification More Accurate Using Learned Constraints," in Proc. SDM, 2004, pp. 11-22.
[4] M. Ankerst, M. Breunig, H.-P. Kriegel and J. Sander, "OPTICS: Ordering Points to Identify the Clustering Structure," ACM SIGMOD Record, vol. 28, no. 2, pp. 49-60, 1999.
[5] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[6] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria and E. Keogh, "Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping," in Proc. ACM SIGKDD, 2012.
[7] S. Salvador and P. Chan, "Toward Accurate Dynamic Time Warping in Linear Time and Space," Intelligent Data Analysis, vol. 11, no. 5, pp. 561-580, 2007.
[8] B. J. Frey and D. Dueck, "Clustering by Passing Messages Between Data Points," Science, vol. 315, no. 5814, pp. 972-976, 2007.