Time Series Synthesis Using the Matrix Profile

This is a complementary website to a short Vision paper submitted to BigData2023. Here, we house extra visualization examples, our code, hyperparameter discussion, and results.

Abstract: Publishing and sharing data is crucial for the data mining community, allowing collaboration and driving open innovation. However, many researchers cannot release their data due to privacy regulations or fear of leaking confidential business information. To alleviate such issues, we propose the Time Series Synthesis Using the Matrix Profile (TSSUMP) method, where synthesized time series can be released in lieu of the original data. The TSSUMP method synthesizes time series by preserving similarity join information while reducing the correlation between the synthesized and the original time series. As a result, neither the values for the individual time steps nor the local patterns (or shapes) from the original data can be recovered, yet the resulting data can be used for downstream tasks that data analysts are interested in. We concentrate on similarity joins because they are one of the most widely applied time series data mining routines across different data mining tasks. We test our method on a case study of ECG and gender masking prediction. In this case study, the synthesized time series preserves enough information in the original time series such that unmodified data mining tools can obtain near-identical performance on the synthesized time series as the original time series.

The code is available through this Google Drive link:

(Note, a previous iteration of this method was dubbed "TSAUMP" and any residual references to the TSSUMP method as such are artifacts.)

tsaump_release_1.zip

A note on the importance of preserving information within the Matrix Profile Index (MPI)...

The following text is further discussion most pertinent to Sections 1-3 of our manuscript.

While we believe that our paper is completely self-contained, here we take the liberty of further explaining the importance of the location information contained within the Matrix Profile Index.

We want to reiterate and expand on the text in our paper, because a reader who is not very familiar with time series data mining might balk at the idea that the actual shape of the subsequences may not be important, but the location of each subsequence's nearest neighbors is. Gharghabi et al. showed in [1][2] that for semantic segmentation we only need the location of the subsequence's nearest neighbors. More generally, there seems to be an emerging realization that such information is sufficient for most time series data mining tasks.
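For concreteness, the Matrix Profile Index (the location of each subsequence's nearest neighbor under z-normalized Euclidean distance) can be computed naively as follows. This brute-force sketch is for exposition only; it is far slower than the optimized algorithms used in practice, and the function name is ours.

```python
import numpy as np

def matrix_profile_index(ts, m):
    """Brute-force Matrix Profile Index: for every length-m subsequence,
    return the index of its nearest neighbor under z-normalized Euclidean
    distance, with a trivial-match exclusion zone of m // 2 around self."""
    n = len(ts) - m + 1
    # Extract and z-normalize every subsequence.
    subs = np.array([ts[i:i + m] for i in range(n)], dtype=float)
    subs = (subs - subs.mean(axis=1, keepdims=True)) / subs.std(axis=1, keepdims=True)
    mpi = np.empty(n, dtype=int)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        # Exclude trivial matches: subsequences overlapping position i.
        lo, hi = max(0, i - m // 2), min(n, i + m // 2 + 1)
        d[lo:hi] = np.inf
        mpi[i] = int(np.argmin(d))
    return mpi
```

For a periodic signal such as an ECG, mpi[i] will typically point roughly one period away, and it is exactly this location structure, rather than the subsequence shapes themselves, that TSSUMP preserves.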

For example, one of the main reasons why the community does motif discovery is to understand the temporal relationship between the motifs. Suppose we compare and contrast the motifs in New York taxi demand and the motifs in New York temperature. An analyst may make the following observations:

Note that all the analytics above (and much more) can be conducted with access only to the location of each subsequence's nearest neighbors.

[1] S. Gharghabi, Y. Ding, C. -C. M. Yeh, K. Kamgar, L. Ulanova and E. Keogh, "Matrix Profile VIII: Domain Agnostic Online Semantic Segmentation at Superhuman Performance Levels," 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 2017, pp. 117-126, doi: 10.1109/ICDM.2017.21.
[2] Gharghabi, S., Yeh, C.-C. M., Ding, Y. et al. Domain agnostic online semantic segmentation for multi-dimensional time series. Data Min Knowl Disc 33, 96–130 (2019). https://doi.org/10.1007/s10618-018-0589-3
[3] S. Ahmad et al., "Unsupervised Real-Time Anomaly Detection for Streaming Data," Neurocomputing, vol. 262, 2017, pp. 134-147. Original source of the Numenta Anomaly Benchmark: https://github.com/numenta/NAB/tree/master/data/realKnownCause

Optimization

Please see the PDF for this appendix.


Optimization_Appendix_for_TSSUMP__BigData_2023_Short_Vision_Paper_.pdf

Hyperparameter Discussion

The hyperparameters of TSSUMP can be categorized into three groups: 1) a parameter associated with the Matrix Profile, 2) parameters associated with the loss function, and 3) parameters associated with the optimization process. The Matrix Profile has just one well-known parameter, the window length m, and efforts to deparameterize the Matrix Profile can be found in the Pan-Matrix-Profile (PMP) [1]. The user can set the subsequence length based on their domain knowledge or adopt the PMP method.

Several research efforts have independently shown that for many problems, including segmentation [2] and anomaly detection [3], setting m to about one period (e.g., one heartbeat or one gait cycle) is a strong heuristic. For the loss function, we use $\alpha$, $\beta$, and $\gamma$ to control the relative importance of each term, as shown in the following:

We can directly evaluate how well the current solution solves the time series anonymization problem, and can tune these parameters based on how well the current solution performs on each loss term (L_distance, L_identity, L_cosmetic) relative to L_local. We found that setting $\alpha=1$, $\beta=2$, and $\gamma=1$ works well in our experiments.
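The exact combination of the loss terms is given in the paper; as a hedged sketch, assuming the overall objective is a weighted sum in which $\alpha$, $\beta$, and $\gamma$ scale their terms relative to L_local, the tuning knobs look like this (the function name, signature, and the weighted-sum form are illustrative assumptions):

```python
def total_loss(l_local, l_distance, l_identity, l_cosmetic,
               alpha=1.0, beta=2.0, gamma=1.0):
    """Hypothetical weighted combination of the TSSUMP loss terms.
    alpha, beta, and gamma control the relative importance of the
    distance, identity, and cosmetic terms relative to L_local;
    the defaults are the values found to work well in our experiments."""
    return l_local + alpha * l_distance + beta * l_identity + gamma * l_cosmetic
```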

While L_cosmetic was not discussed in the paper, it is an optional term. In this case, we chose to model the time series complexity [4], which is defined as: L_cosmetic = (Complexity(T_{i,m}) - Complexity(\hat{T}_{i,m}))^2. This measure of time series complexity can be understood as the length of a time series after it is "stretched" flat.
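In code, the complexity estimate of Batista et al. [4] (the root of the summed squared differences between consecutive points, i.e. the length of the series "stretched" flat) and the resulting optional loss term can be written as follows; the function names are ours:

```python
import math

def complexity(t):
    """Complexity estimate of Batista et al. [4]: the length of the line
    obtained by 'stretching' the series flat, computed as the root of the
    summed squared differences between consecutive points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, t[1:])))

def l_cosmetic(t, t_hat):
    """Optional cosmetic loss: squared difference in complexity between
    an original subsequence and its synthesized counterpart."""
    return (complexity(t) - complexity(t_hat)) ** 2
```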

The last group of hyperparameters pertains to the optimization process. As with the loss function parameters, these can be adjusted based on the quality of the current solution. In our experiments, we use the AdamW optimizer with its default parameter settings in PyTorch. We use the ReduceLROnPlateau learning rate scheduler from PyTorch with the initial learning rate set to 10, patience set to 5, cooldown set to 3, and the remaining settings at their defaults. We set the batch size to 512. The number of iterations is set according to the size of the dataset; longer time series require more iterations to converge.
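As a configuration sketch of the setup just described: the tensor name `synthesized`, its initialization, and the commented training loop are illustrative assumptions, not our released code.

```python
import torch

# `synthesized` stands for the candidate synthesized time series being
# optimized; the name and random initialization are assumptions.
synthesized = torch.randn(10_000, requires_grad=True)

# AdamW with PyTorch defaults, apart from the initial learning rate of 10.
optimizer = torch.optim.AdamW([synthesized], lr=10.0)

# ReduceLROnPlateau with patience 5 and cooldown 3; other settings default.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, patience=5, cooldown=3)

batch_size = 512

# Each iteration would evaluate the loss on a batch of subsequences,
# step the optimizer, and report the loss to the scheduler, e.g.:
# for step in range(num_iterations):      # num_iterations: per-dataset
#     loss = compute_loss(...)            # hypothetical loss routine
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
#     scheduler.step(loss.item())
```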

[1] Madrid, Frank, et al. "Matrix profile xx: Finding and visualizing time series motifs of all lengths using the matrix profile." 2019 IEEE International Conf. on Big Knowledge (ICBK). IEEE, 2019.
[2] Gharghabi, Shaghayegh, et al. "Matrix profile VIII: domain agnostic online semantic segmentation at superhuman performance levels." 2017 IEEE international conference on data mining (ICDM). IEEE, 2017.
[3] Lu, Yue, et al. "Matrix profile XXIV: scaling time series anomaly detection to trillions of datapoints and ultra-fast arriving data streams." Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2022.
[4] Batista, Gustavo EAPA, Xiaoyue Wang, and Eamonn J. Keogh. "A complexity-invariant distance measure for time series." Proceedings of the 2011 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2011. 

Further Visualization Examples

Due to page restrictions in the submission guidelines, we did not describe in depth how the time series were anonymized for clustering, and instead only presented an example of the results. We remedy this with a description here, and provide the code in our package.

UCR Archive: Symbols, Hierarchical Clustering under Ward Linkage

The Symbols dataset is composed of recordings from 13 people who were asked to copy randomly appearing symbols as well as they could. There were three possible symbols, and each person contributed about 30 attempts. The data is the X-axis motion in drawing the shape. The dataset authors consider this dataset to have six classes.

Five examples from each class were selected to be hierarchically clustered under Ward linkage. Leaf labels and colors correspond to datapoint indices and their class labels, respectively.

top) The clustering on the original data.

bottom) The clustering on the TSSUMP-synthesized data. Any "discrepancies" between the clustering on the original data and the TSSUMP-synthesized data are lightly highlighted.
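A clustering of this kind can be reproduced with SciPy's hierarchical clustering routines, as in the sketch below. The random data merely stands in for the thirty selected Symbols examples so that the snippet is self-contained; it is not our released code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Stand-in data: 30 series (five per class) of length 398, the length
# of the Symbols series in the UCR archive. Random values are used here
# purely so the snippet runs on its own.
rng = np.random.default_rng(0)
series = rng.standard_normal((30, 398))

# Ward linkage over Euclidean distances between the raw series.
Z = linkage(series, method='ward')

# dendrogram() renders the tree; with no_plot=True it just returns the
# leaf ordering, i.e. the datapoint indices used as leaf labels.
leaves = dendrogram(Z, no_plot=True)['leaves']
```

The same call with method='average' produces the Average-linkage clustering used for the GunPoint example below.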

UCR Archive: GunPoint, Hierarchical Clustering under Average Linkage

The GunPoint dataset involves one female actor and one male actor making a motion with their hand, where the two classes are Gun-Draw and Point. "For Gun-Draw the actors have their hands by their sides. They draw a replicate gun from a hip-mounted holster, point it at a target for approximately one second, then return the gun to the holster, and their hands to their sides. For Point the actors have their gun by their sides. They point with their index fingers to a target for approximately one second, and then return their hands to their sides. For both classes, we tracked the centroid of the actor's right hands in both X- and Y-axes, which appear to be highly correlated. The data in the archive is just the X-axis. Class 1 is "gun" and class 2 is "no gun (pointing)"."

Eight examples from both classes were selected to be hierarchically clustered under Average linkage. Leaf labels and colors correspond to datapoint indices and their class labels, respectively.

top) The clustering on the original data.

bottom) The clustering on the TSSUMP-synthesized data.

The Runtime of the TSSUMP Method

We evaluated the runtime performance of the proposed TSSUMP method using ECG time series data from the MIT-BIH Long-Term ECG Database [4]. To do this, we selected the first X data points from patient 14046 and ran TSSUMP until the algorithm converged, with X ranging from 1,000 to 100,000.

The two convergence criteria we used were: first, we required that the Pearson correlation between the input time series and the anonymized time series be between -0.02 and 0.02 (indicating no correlation between the two time series); and second, we required that the Pearson correlation between the input time series' matrix profile and the anonymized time series' matrix profile be greater than 0.7 (indicating sufficient preservation of utility information).
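These two stopping criteria are straightforward to express in code; the following sketch (the function names are ours) checks both:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def converged(ts, ts_anon, mp, mp_anon):
    """Both stopping criteria described above: near-zero correlation
    between the input and anonymized time series themselves, and
    correlation above 0.7 between their matrix profiles."""
    return (abs(pearson(ts, ts_anon)) <= 0.02
            and pearson(mp, mp_anon) > 0.7)
```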

Our results, presented in the figure on the left, show that the relationship between the runtime and the time series length is roughly linear. Specifically, as longer time series contain more subsequences, the algorithm requires more iterations to converge, leading to longer runtimes than for shorter time series.

[4] Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, 23 (2000), e215–e220