Relative Density Clustering for Motif Discovery (ReDMoDe)

Jaison Davis
Data Analytics, iBall Research Lab
Bangalore, India
[email protected]

Naveen Suresh
Department of CSE, PES University
Bangalore, India
[email protected]

Abstract: This paper defines a novel technique for parameter-light motif clustering of Internet of Things (IoT) time series data. It describes how a novel relative-density based clustering algorithm is used to find clusters of similar waveforms.

Keywords: time series, motif discovery, clustering, DTW, time series slicing, time series subsequences

I. INTRODUCTION

In the recent past, numerous advancements have been made in the field of the Internet of Things (IoT), leading to more reliable and accurate methods of monitoring industrial processes and gathering data. IoT sensors generally produce time series data. These data contain recurring patterns that can prove quite useful if they are discovered and clustered together. For example, data from a sensor that monitors the level in a water tank will show patterns that signify full fill cycles, partial fill cycles, light consumption, heavy consumption and even leaks. Such information can be useful to consumers as well as product manufacturers.

To cluster time series data, there needs to be a defined measure of similarity (or dissimilarity) between two waves. Many such measures exist, including Euclidean distance, Dynamic Time Warping (DTW) distance, Pearson's correlation coefficient, Mahalanobis distance and Spearman's correlation coefficient. DTW distance has proved to be an accurate measure of dissimilarity for time series data [1][2]. This is because small offsets in the shape of the curve have little effect on DTW distance, unlike most other measures (such as Euclidean distance or Pearson's correlation). Further, by imposing a window size restriction during the calculation of the DTW, it can be made to select only patterns that are within a certain threshold of the magnitude of the phase shift. This also makes the DTW calculation more efficient [3].

Existing clustering algorithms suffer from various drawbacks. The standard k-means algorithm requires that the measure of similarity be a distance metric. DTW, however, is not a metric: it gives only pairwise dissimilarities between time series rather than an absolute distance, so cluster centres cannot be computed directly.

Using a traditional density-based approach such as DBSCAN or OPTICS has its own problems. For one, a density-based algorithm could merge two high-density clusters into one if their boundaries lie close to one another. This often happens with IoT data, as the recurring patterns exhibit slight variations (due to changes in the durations of recurring patterns or due to sensor errors). Hence, this could result in suboptimal clustering.

Our clustering algorithm, Relative Density Clustering for Motif Discovery (ReDMoDe), employs specially developed density clusters as a starting criterion for the identification of clusters, followed by cluster refinement by set subtraction and boundary point reassignment to further improve the quality of the clusters. This method yields groups that are clustered based on the relative density of the data points, which works especially well in capturing the essence of IoT data. It requires only a few parameters: the data, the distance (EPS value) below which two wavelets can be grouped as similar (in the initial density step), and a minimum acceptable cluster size. We have also found a simple and effective way to determine the EPS value to be used as a parameter for the clustering, which involves optimizing a simple cost function. The cost function optimization coupled with the clustering yields a parameter-light algorithm for efficacious clustering of time series data.

II. THE PROBLEM

With our clustering algorithm, we are trying to slice and cluster a time series without using any rigid, user-defined parameters.

The clustering itself is required to be sensitive to the space it is operating in. This means that, in certain cases, wavelets of different but similar shapes can form one cluster if they cannot form their own individual clusters.

A. Why solving this is hard

Although DTW is an effective measure of wavelet "shape distance", DTW is not a metric (it violates the triangle inequality). This means it does not work well with traditional clustering algorithms. Therefore a new clustering algorithm is needed.

What compounds the problem is that the algorithm needs to be parameter-light and sensitive to the data it operates on.

B. Why we need to solve it

Data generated by IoT devices is a mix of noise and motifs. Discovering motifs is necessary to interpret the data that these devices generate. For instance, data from a fitness tracker needs to be clustered and labelled in terms of walking, running, etc. for it to make any meaningful sense.

III. RELATED WORK

There are several algorithms to slice and cluster a time series. Purely density-based approaches, such as DBSCAN and OPTICS, exist [4]. However, these methods tend to cluster data very loosely; the within-cluster distance for these clusters is above optimum levels. Most of these algorithms either rely on hard parameters or are not robust enough for noisy IoT data.

A. k-Means Clustering

The primary problem with k-means clustering is that the algorithm is not suited to non-metric distance measures like DTW. Additionally, the algorithm needs the parameter k to work. Another drawback is that it fits all the data to one of the clusters, which can corrupt the homogeneity of the clusters.

B. Density-Peaks Algorithms

Although the Density-Peaks algorithm [5] is designed to work with non-metric distance measures as well, it needs a parameter to be computed for it to work. A new algorithm would need to be developed in order to calculate the exact number of clusters required.

C. UCR Matrix Profile for Motif Search

The UCR matrix profile [6] is a fast algorithm that works well when searching for a motif in a time series. The correlation-based distance measure that the Matrix Profile uses works well for the discovery of longer motifs, but it does not work for the discovery of the shorter motif clusters that are important in IoT applications.

IV. DISTANCE METRICS AND MEASURES

Some measures of similarity between time series data include Euclidean distance, Pearson's correlation coefficient and Dynamic Time Warping (DTW) distance. Euclidean distance is a good and inexpensive measure of similarity provided that there is nearly zero phase difference between the motifs to be discovered. However, ensuring this in practice is a challenging task, and generally arbitrary slices are to be clustered. Euclidean distance suffers in such a scenario and does not capture the similarity between time series data accurately. Pearson's correlation coefficient suffers from the same limitation.

DTW, however, is much more forgiving of phase differences and minor variations in the signal, because similar regions of the series are matched even if they are shifted in time. Although DTW computation is significantly more expensive, there exist methods to make it faster, such as restricting the warping window or using FastDTW [7]. Hence, DTW is chosen as the measure of similarity for time series clustering.

V. ALGORITHM

Our clustering algorithm, ReDMoDe, by itself requires the following parameters:

(i) Distance matrix, containing the distances between different wavelets in the dataset (Δ)
(ii) Cluster cut-off size (λ)
(iii) "Eps" value (ε)

The cluster cut-off size is the minimum number of motifs that a cluster should contain. It depends on the volume of data being clustered and can hence be heuristically determined for most purposes. The "Eps" value, ε, is used in the algorithm to determine whether a point belongs to another's vicinity or not, i.e. for data points A and B,

$$A \in \mu_B \iff \Delta_{AB} \le \varepsilon, \qquad A \notin \mu_B \iff \Delta_{AB} > \varepsilon,$$

where $\mu_X$ is the set of points in X's vicinity.

This is used to calculate the initial densities of points, as specified by the algorithm. A similar parameter is expected as an input to DBSCAN, which clusters points purely based on the density of points on the hyperplane. This parameter can be difficult to estimate heuristically, and is key to the algorithm's performance. However, it can be estimated accurately by optimizing a combined cost function, as described in Section V-B.

A. Clustering Algorithm

Our clustering algorithm expects the distance matrix, the EPS value and the cluster cut-off size (the minimum number of points in a group for the group to be considered a cluster). The algorithm can be divided into three parts:

i) "Pure" cluster formation
ii) Removal of anomalies (points that cannot be clustered well)
iii) Reassignment of points from clusters that do not satisfy the cluster cut-off size

"Pure" density clusters are groups of points clustered objectively by vicinity, as determined by the EPS value. This is done by considering every point in the space as a centroid. These clusters contain very similar points, but a point may belong to multiple clusters. These clusters are also key to the EPS value calculation.

Initially, fuzzy pure clusters are formed. Here, each point in the dataset acts as a cluster centre, and all points within an EPS radius of that point are considered part of the corresponding cluster. Hence, a point may belong to multiple clusters, but only N clusters exist in total. If a point has fewer points in its EPS vicinity than the cluster cut-off size, it is marked as an anomaly. The clusters are then sorted in descending order by size.

Now, an iterative set subtraction is applied to the clusters, wherein each smaller cluster is replaced with the set difference between it and the larger one. This removes points that exist in multiple clusters from the smaller of the clusters, leaving the larger clusters large while making the smaller clusters smaller still. This is demonstrated in Algorithm 1. This creates the pure clusters. The clusters that have anomalous points as their cluster centres are now removed from the list of pure clusters.

These pure clusters are further refined by removing border points from each cluster and reassigning them to the nearest cluster. A border point j belonging to cluster i is one that meets the following criterion:

Mean intra-cluster distance of C_i < Mean distance of point j to all other points in C_i

The pure clusters obtained no longer contain any points that exist in more than one cluster. However, there could be clusters that do not meet the cut-off size constraint. For these, cluster reassignment is done: all points in clusters that do not meet the cluster cut-off size are reassigned to the cluster closest to each point. Here, the average distance between the point and the members of a cluster is used to determine the closest cluster to the point. This process is outlined in Algorithm 2.

B. EPS Calculation

The EPS parameter can be specified as an input to the algorithm. However, an accurate and effective method to calculate the EPS value was found by formulating a cost function to be minimized. Our cost function uses the mean of the mean pairwise distances within clusters and the ratio of the number of clusters to the total number of motifs.
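To make the idea of a cost function over EPS concrete, the sketch below scores each candidate EPS value by a weighted sum of the two quantities just named and returns the candidate with the lowest score. The weighted-sum form with weights A and B follows the paper's cost function; the min-max normalisation, the grid of candidate values, the equal default weights, the function names `normalise` and `best_eps`, and the sample numbers are all assumptions made for illustration.

```python
def normalise(values):
    """Min-max scale a list to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def best_eps(eps_values, within_means, cluster_ratios, a=1.0, b=1.0):
    """Return the candidate EPS with the lowest weighted cost, plus all costs."""
    deltas = normalise(within_means)    # normalised mean within-cluster distance
    ratios = normalise(cluster_ratios)  # normalised clusters-to-points ratio
    costs = [a * d + b * r for d, r in zip(deltas, ratios)]
    return eps_values[costs.index(min(costs))], costs


# Made-up measurements for four candidate EPS values (not from the paper).
eps_values = [0.5, 1.0, 1.5, 2.0]
within_means = [0.0, 0.2, 0.3, 1.0]    # grows as clusters widen
cluster_ratios = [1.0, 0.4, 0.2, 0.1]  # shrinks as clusters merge

eps, costs = best_eps(eps_values, within_means, cluster_ratios)
print("chosen EPS:", eps)
```

A grid scan like this simply picks the lowest-cost candidate, so in practice the grid should be fine enough to bracket the interior minimum of the cost curve.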
Let δ signify the mean within-cluster distance. This distance can be defined for each pure cluster as the average distance between every pair of points in the cluster. Hence the within-cluster mean distance, δ*, is the mean of the within-cluster distances δ over all clusters, for a given EPS value. Further, let ρ signify the grouped ratio. This is the ratio of the number of pure clusters to the number of data points, given an EPS value. The within-cluster mean distance δ* varies in a non-decreasing manner with increasing EPS value. The grouped ratio ρ, however, varies in a non-increasing manner with increasing EPS value. After normalizing these quantities, a cost function can be formulated as follows:

$$\mathrm{Cost}_{\varepsilon} = A\,\delta^{*}_{\varepsilon} + B\,\rho_{\varepsilon},$$

where A and B are weights.

The nature of the relationship of δ* and ρ with the EPS value depends on the dissimilarities between the data points. However, the cost function is seen to vary with the EPS parameter as follows:

(i) Increasing, when ε ∈ [0, α1]
(ii) Decreasing, when ε ∈ [α1, α2]
(iii) Increasing, when ε ∈ [α2, ∞)

By optimizing this function and finding the minimum at α2, we obtain the optimum EPS value to be passed to the clustering algorithm.

Figure 1: A generic variation of the cost function with respect to varying EPS value. The minimum is taken as the most effective EPS value for clustering.

C. Time Complexity

The ReDMoDe clustering algorithm can be split into two parts for analysis: the pure cluster formation and the reassignment.

For an input of N time series slices, the pure cluster formation iterates through all pairs of clusters, and hence has an asymptotic time complexity of O(N²).

The reassignment can be further split into the refinement of border points and the assignment of points to the closest cluster. Although the worst-case time complexity of the border point refinement appears to be O(N³), the worst case for finding the mean over the pairwise combinations occurs when only one cluster exists. In that case the cluster contains C(N, 2) pairs of points, but since there is only one cluster, this process has an amortized cost of O(N²).

The assignment of points to the nearest clusters must go through all the points belonging to clusters that have fewer points than the cut-off threshold. For each of these points, the average distance must be calculated between the point and every cluster (the distances between the point and each point in all the other, large clusters). The amortized cost of this operation is O(N²). Hence, the entire clustering algorithm has an amortized cost of O(N²).

VI. RESULTS

We applied the ReDMoDe algorithm to time series data generated by various IoT devices. Some of the results that we obtained are illustrated in Figure 2 and Figure 3.

A. Case I

Figure 2 shows all the motifs and anomalies that the algorithm detected. The first of the six clusters shown (top left of the plots) has a cluster size of 325 motifs, and the second cluster (top right) has a cluster size of 112 motifs. These motifs have a common underlying pattern but differ in phase, noise, etc., which makes them difficult to cluster. The relative-density based method that we use nevertheless successfully groups these motifs into individual clusters. In the first cluster, although the dip in the data happens at a different phase for different motifs, the algorithm is able to group them into the same cluster.

Figure 3: The first plot shows IoT time series data that operates at different levels and frequencies through time. The plots below the main data show 8 of the many clusters generated by our algorithm.

Figure 2: The first time series plot shows the IoT data. The second plot below it shows the 7 anomalous patterns.
The subsequent plots show some of the motifs discovered by the algorithm.

B. Case II

With this time series data, the clusters were formed based both on the magnitude and the shape of the motifs. Although the magnitudes vary and there is much noise in the data, the estimated Eps value helps in grouping the slices into appropriate clusters without including the noisy signals. Cluster number 4 in the figure shows the robustness of the algorithm when dealing with motifs that are further away from each other in DTW distance. Considering that the algorithm does not require rigid parameters, it seems to be dynamic enough to deal with such data.

VII. CONCLUSION

Our novel algorithm for clustering time series data obtained from IoT devices proves to be extremely effective. It is capable of locating and clustering motifs of various lengths, as well as patterns that vary in time and frequency. Furthermore, minor phase differences are tolerated by the DTW distance measure, which supports good clustering by the ReDMoDe algorithm. The EPS parameter that ReDMoDe needs can be calculated by optimizing a cost function, and hence the algorithm is parameter-light in nature. The clustering of real-world IoT data yields state-of-the-art results and produces clusters of various recurring patterns. These clusters are useful for the analysis of cycles and patterns that occur regularly.

REFERENCES

[1] A. Mueen and E. J. Keogh, "Extracting Optimal Performance from Dynamic Time Warping", KDD 2016, pp. 2129-2130.
[2] F. Iglesias and W. Kastner, "Analysis of Similarity Measures in Times Series Clustering for the Discovery of Building Energy Patterns", 2013.
[3] C. A. Ratanamahatana and E. Keogh, "Making Time-series Classification More Accurate Using Learned Constraints", SDM International Conference, pp. 11-22, 2004.
[4] M. Ankerst, M. Breunig, H. Kriegel and J. Sander, "OPTICS", ACM SIGMOD Record, vol. 28, no. 2, pp. 49-60, 1999.
[5] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks", Science, vol. 344, no. 6191, pp. 1492-1496, 2014.
[6] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria and E. Keogh, "Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping", IEEE ICDM 2011.
[7] S. Salvador and P. Chan, "Toward Accurate Dynamic Time Warping in Linear Time and Space", Intelligent Data Analysis, vol. 11, no. 5, pp. 561-580, 2007.

Problem

Pattern mining of a time series is a known and mostly solved problem for larger waveforms. However, many solutions require mining smaller waveforms for patterns and anomalies. This is an important yet unsolved problem.

Unlike for larger waveforms, the distance between two such motifs cannot be computed with a correlation-based function. This type of motif requires measures like the DTW distance, which are non-metric in nature.

Using DTW for clustering brings the problem into a non-metric feature space. Clustering in non-metric spaces is what our algorithm solves.

A. Why solving this is hard

Although DTW is an effective measure of wavelet "shape distance", DTW is not a metric (it violates the triangle inequality). This means it does not work well with traditional clustering algorithms. Therefore a new clustering algorithm is needed.

What compounds the problem is that the algorithm needs to be parameter-light and dynamic with respect to the space it operates in.

VIII. RELATED WORK

There are several algorithms to slice and cluster a time series. Purely density-based approaches, such as DBSCAN and OPTICS, exist [4]. However, these methods tend to cluster time series data very loosely. The within-cluster distance for these clusters is much higher than optimum levels. Looking specifically at individual clustering algorithms:

D. k-Means Clustering

The primary problem with k-means clustering is that the algorithm is not suited to non-metric distance measures like DTW. Additionally, the algorithm needs the parameter k to work.

E. Density-Peaks Algorithms

Although the Density-Peaks algorithm [5] could be made to work with non-metric distance measures, it needs a parameter for it to work. A new algorithm would need to be developed in order to calculate the exact number of clusters required.

F. UCR Matrix Profile for Motif Search

UCR matrix profile [6] based clustering is a fast algorithm that works well at clustering motifs in a time series. It uses a correlation-based distance measure to cluster time series data. This technique works well in the discovery of longer motifs, but it does not work for shorter motifs.

G. Affinity Propagation

IX. RESULTS

We applied our clustering algorithm to a diverse set of time series data and compared the results obtained with those from other algorithms. We picked time series that cover a wide spectrum of shapes and lengths. These data sets, we think, are representative of a large variety of real-world data sets. The window for rolling was set to

TS 1: Mean daily temperature, Fisher River near Dallas, Jan 01, 1988 to Dec 31, 1991 [s]
https://datamarket.com/data/set/235d/mean-daily-temperature-fisher-river-near-dallas-jan-01-1988-to-dec-31-1991#!ds=235d&display=line

TS 2: Monthly Lake Erie Levels 1921 – 1970 [s]
https://datamarket.com/data/set/22pw/monthly-lake-erie-levels-1921-1970#!ds=22pw&display=line

TS 3: Annual Swedish fertility rates (1000's) 1750-1849 [s]
https://datamarket.com/data/set/22s2/annual-swedish-fertility-rates-1000s-1750-1849-thomas-1940#!ds=22s2&display=line

TS 4: ECG Signal Data from a single patient
https://github.com/c-labpl/qrs_detector/tree/master/ecg_data
https://github.com/c-labpl/qrs_detector/

TS 5: Sensor temperature data from a refrigerator. Source: JNARK

We look at the mean of the average intra-cluster distance of the resulting clustering as a metric to compare algorithms.

Table: Results from clustering algorithms applied to our test time series datasets. The metric used is the mean of the average intra-cluster DTW distance, with the number of clusters in parentheses. A is the number of anomalies. Each algorithm has a second distance and cluster count, reproduced here as given.

Data (# of slices)          TS 1 (145)   TS 2 (58)   TS 3 (8)    TS 4 (270)   TS 5 (3985)

ReDMoDe
  distance (clusters)       0.10 (18)    0.07 (8)    0.27 (2)    0.02 (30)    0.04 (245)
  anomalies                 A: 0         A: 12       A: 0        A: 1         A: 76
  second value (clusters)   0.04 (13)    0.05 (7)    0.13 (2)    0.04 (24)    0.04 (235)

k-Medoids + ReDMoDe
  distance (clusters)       0.10 (18)    0.11 (8)    0.31 (2)    0.03 (30)    0.06 (245)
  anomalies                 A: 0         A: 0        A: 0        A: 0         A: 0
  second value (clusters)   0.11 (13)    0.13 (7)    0.26 (2)    0.04 (24)    0.06 (235)

OPTICS
  distance (clusters)       0.17 (3)     0.07 (5)    0.12 (1)    0.07 (8)     0.11 (11)
  anomalies                 A: 41        A: 18       A: 5        A: 10        A: 24
  second value (clusters)   0.34 (3)     0.14 (5)    0.20 (2)    0.34 (8)     0.31 (8)

Affinity Propagation
  distance (clusters)       0.34 (4)     0.41 (2)    0.31 (3)    0.77 (4)     0.78 (3)
  anomalies                 A: 0         A: 0        A: 0        A: 0         A: 0
  second value (clusters)   0.33 (4)     0.41 (2)    0.41 (3)    0.57 (4)     0.35 (3)

ReDMoDe Algorithm: Unlike most of its counterparts, ReDMoDe did not require input parameters. Nonetheless, we obtained an optimal balance between cluster counts and homogeneity.

k-Medoids + ReDMoDe Algorithm: We fed the parameter obtained from ReDMoDe as a parameter to the k-Medoids algorithm. With the help of this parameter, the k-Medoids algorithm was able to generate fairly good clusters, but it was not effective in labeling anomalous motifs.

OPTICS: The clusters obtained were not up to the mark and the number of anomalies was abnormally high. Considering that this algorithm needs three input parameters, it is ineffective in solving this problem.

Affinity Propagation: The clusters obtained from Affinity Propagation were unusable and random.

While k-Medoids clustering works fairly well with the help of ReDMoDe, on its own it is ineffective in clustering points in a DTW-based feature space. Most other clustering algorithms proved to be ineffective when working with data in a non-metric feature space. Our novel algorithm proved to be effective in clustering time series motifs with little or no user input.

Fig #: Comparison of the largest clusters of TS 4 from different clustering algorithms: 1. ReDMoDe, 2. k-Medoids + ReDMoDe, 3. OPTICS, 4. Affinity Propagation.

EPS Calculation Algorithm

Step 1: Add clusters around each point

Step 2: Subtract the larger clusters from the smaller ones, leaving the largest clusters intact

Fig #: Normalized Cluster Count to Max Slice Count ratio (blue) plotted with Normalized Mean of Average Within-cluster distance (orange)

Fig: Weighted Mean of Avg intra-cluster distance

Step 3: Plot the inner-cluster mean DTW distance against the radius of the clusters being tried
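The three steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`pure_clusters`, `mean_within_distance`, `eps_curve`) are hypothetical, a point counts as part of its own vicinity, each cluster has all larger clusters subtracted in a single pass over the size-sorted list, and the toy distance matrix uses absolute differences of integer positions as a stand-in for DTW distances. Instead of plotting, Step 3 returns the numbers one would plot.

```python
from itertools import combinations
from statistics import fmean


def pure_clusters(dist, eps, min_size):
    """Steps 1-2: one cluster per point, then set subtraction, largest first."""
    n = len(dist)
    # Step 1: every point is a centre; its cluster is everything within eps.
    raw = [(c, {j for j in range(n) if dist[c][j] <= eps}) for c in range(n)]
    # Assumption: a centre whose vicinity (itself included) is smaller than
    # min_size is treated as an anomaly.
    anomalies = {c for c, members in raw if len(members) < min_size}
    raw.sort(key=lambda item: len(item[1]), reverse=True)
    # Step 2: each cluster loses the points already held by larger clusters.
    taken, clusters = set(), []
    for centre, members in raw:
        remaining = members - taken
        taken |= remaining
        if centre not in anomalies and remaining:
            clusters.append(remaining)
    return clusters, anomalies


def mean_within_distance(dist, clusters):
    """Mean over clusters of the mean pairwise distance inside each cluster."""
    means = [fmean(dist[i][j] for i, j in combinations(sorted(c), 2))
             for c in clusters if len(c) > 1]  # singletons have no pairs
    return fmean(means) if means else 0.0


def eps_curve(dist, eps_values, min_size=1):
    """Step 3 data: (eps, mean within-cluster distance, clusters / points)."""
    n = len(dist)
    rows = []
    for eps in eps_values:
        clusters, _ = pure_clusters(dist, eps, min_size)
        rows.append((eps, mean_within_distance(dist, clusters), len(clusters) / n))
    return rows


# Toy data: two tight groups of three slices each (illustrative only).
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]

for row in eps_curve(dist, [0.5, 1.5, 20]):
    print(row)
```

On this toy input the within-cluster distance rises and the cluster ratio falls as the radius grows, which is the trade-off the two plotted curves are meant to expose.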