PRCIS: Pattern Representation Comparison in Series

Matrix Profile XXVII: A Novel Distance Measure for Comparing Long Time Series

The most useful data mining primitives are distance measures. With an effective distance measure, it is possible to perform classification, clustering, anomaly detection, segmentation, etc. For single-event time series, Euclidean Distance and Dynamic Time Warping distance are known to be extremely effective. However, for time series containing cyclical behaviors, the semantic meaningfulness of such comparisons is less clear. For example, on two separate days the telemetry from an athlete's workout routine might be very similar. The second day may change the order of performing push-ups and squats, add repetitions of pull-ups, or completely omit dumbbell curls. Any of these minor changes would defeat existing time series distance measures. Some bag-of-features methods have been proposed to address this problem, but we argue that in many cases, similarity is intimately tied to the shapes of subsequences within these longer time series. In such cases, summative features will lack discrimination ability. In this work we introduce PRCIS, which stands for Pattern Representation Comparison in Series. PRCIS is a distance measure for long time series, which exploits recent progress in our ability to summarize time series with dictionaries. We will demonstrate the utility of our ideas on diverse tasks and datasets.

We are happy to announce that PRCIS has been accepted to ICKG 2022.

Audrey Der, Chin-Chia Michael Yeh, Renjie Wu, Junpeng Wang, Yan Zheng, Zhongfang Zhuang, Liang Wang, Wei Zhang, Eamonn Keogh

emails: {ader003, rwu034}@ucr.edu, {miyeh, junpenwa, yazheng, zzhuang, liawang, wzhan}@visa.com, eamonn@cs.ucr.edu

Resources and Code

Paper (arxiv; PDF)
LINK TO REPOSITORY: Contains the codebase and subsets of data (when subsets were used).

Note: This is a supplementary website intended to be referenced in tandem with its corresponding paper, and is not meant to be used alone.

Note: "PRECIS" was the original spelling of the method's name. Any remaining instances of that spelling, including in code identifiers, are leftovers from the renaming.

Quickstart

Notation

In the paper we refer to the dictionary parameters S and L as the size of the dictionary and the length of the patterns within it. Because the codebase was written over an extended period of time, the names of the corresponding variables may vary.

S may be referred to as NUMPAT ("number of patterns") for short.
L may be referred to as WINLEN ("window length"), CYCLELEN ("cycle length"), or something along the lines of "pattern length".

A Short Tutorial

The codebase uses Experiment objects as defined below for generating Yeh Dictionaries and calculating distance matrices (regardless of dictionary creation method).

class Experiment:
    def __init__(self, distmet, dict_settings, algyield=True, multivariate=False, downsamplefactor=1):
        self.distmet = distmet  # distance metric: "DTW", "ED", or "PRECIS"
        self.numpatt = dict_settings[0]  # S, the number of patterns in the dictionary
        self.cyclelen = dict_settings[1]  # L, the length of each pattern
        self.algyield = algyield  # yield to dict method or exclude any generated patterns not of this exact length
        self.multivariate = multivariate  # multivar PRECIS extension; only used during the development of this work, not presented in paper
        self.downsamplefactor = downsamplefactor  # typically untouched; only used during development of this work, not presented in paper
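The comment on `algyield` suggests that, when the flag is False, generated patterns whose length differs from `cyclelen` are discarded. A minimal sketch of that filtering step is given here, assuming patterns are plain sequences. `filter_patterns` is a hypothetical helper written for illustration and is not taken from the codebase.

```python
from typing import List, Sequence


def filter_patterns(
    patterns: List[Sequence[float]], cyclelen: int, algyield: bool = True
) -> List[Sequence[float]]:
    """Hypothetical helper, not part of the PRCIS codebase.

    algyield=True: keep whatever the dictionary method produced.
    algyield=False: keep only patterns whose length is exactly cyclelen.
    """
    if algyield:
        return list(patterns)
    return [p for p in patterns if len(p) == cyclelen]


L = 3  # pattern length (CYCLELEN / WINLEN in the code)
candidates = [[0.1, 0.2, 0.3], [0.5, 0.4], [1.0, 0.9, 0.8], [0.0, 0.0, 0.0, 0.0]]

kept_all = filter_patterns(candidates, L, algyield=True)
kept_exact = filter_patterns(candidates, L, algyield=False)
print(len(kept_all), len(kept_exact))
```

With `algyield=False`, only the two candidates of length 3 survive; the default leaves the dictionary method's output untouched.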

Here is a simple snippet that creates a Yeh Dictionary from each time series and computes a PRECIS distance matrix:

exp = Experiment("PRECIS", [4, 150])  # S = 4 patterns, each of length L = 150

use_dicts = []
for ts in dataset:  # dataset: an iterable of time series
    # idxs will be a list of tuples in the form of (start, end) indices of each pattern from ts
    d, idxs = exp.make_exemplar(ts)
    use_dicts.append(d)

distmat = exp.distmat_from_dicts(use_dicts)

The Yeh Dictionary creation method is called directly inside the class methods and is used automatically by make_exemplar. To use a different dictionary method, do not call Experiment.make_exemplar; build the dictionaries with your own method instead.
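As one illustration of bypassing make_exemplar, the sketch below builds a dictionary by sampling random subsequences. It is a strawman, not the Yeh Dictionary method, and `random_dictionary` is a hypothetical name. It returns the patterns together with (start, end) index tuples, mirroring the return shape described for make_exemplar. Whether distmat_from_dicts accepts dictionaries in exactly this form is an assumption that should be checked against the repository.

```python
import random
from typing import List, Tuple


def random_dictionary(
    ts: List[float], numpat: int, winlen: int, seed: int = 0
) -> Tuple[List[List[float]], List[Tuple[int, int]]]:
    """Strawman dictionary: numpat random subsequences of length winlen.

    Hypothetical stand-in for a dictionary method; not the Yeh Dictionary.
    Returns (patterns, idxs), where idxs holds (start, end) tuples with
    end exclusive, so ts[start:end] is the pattern.
    """
    if winlen > len(ts):
        raise ValueError("winlen must not exceed the series length")
    rng = random.Random(seed)
    n_starts = len(ts) - winlen + 1
    # Sample distinct start positions; cap numpat if the series is short.
    starts = sorted(rng.sample(range(n_starts), min(numpat, n_starts)))
    idxs = [(s, s + winlen) for s in starts]
    patterns = [ts[s:e] for s, e in idxs]
    return patterns, idxs


# Toy data: three short series (placeholders for real time series).
dataset = [[float((i * k) % 7) for i in range(40)] for k in (1, 2, 3)]

use_dicts = []
for ts in dataset:
    d, idxs = random_dictionary(ts, numpat=4, winlen=10)
    use_dicts.append(d)

# use_dicts could then be handed to exp.distmat_from_dicts(use_dicts),
# assuming that method accepts a list of pattern lists.
print(len(use_dicts), len(use_dicts[0]), len(use_dicts[0][0]))
```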

Clustering

Note: Figure placement indicators may not be accurate when viewed on a mobile device.

Note: Rival methods not pictured on this website can be viewed in their notebooks in the GitHub repository.

OPSD_CLU.ipynb (left, top): OPSD. Two-month snippets of the electrical power demand data from four randomly selected countries in Europe. Includes:
- (Not Pictured) OPSD Random Day Strawman
WeAllWalk_CLU.ipynb (left, bottom): We-All-Walk Dendrogram. Includes:
- Catch22:
  - All features
  - (Not Pictured) FS features (as determined during classification)
- (Not Pictured) Random Non-Obvious Holiday
TaipeiMRT_CLU.ipynb (left, middle): Taipei MRT Clustering
NASAMill_CLU.ipynb (right, top): NASA Mill Dataset. Includes:
- (Not Pictured) k-shape
- (Not Pictured) Folder of Results: Cluster by Period
(right, bottom): Due to the sensitive nature of the data, we cannot share the dataset or code used to generate the Business Merchant figures at this time. We thank you for your understanding.
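
Dendrograms such as the ones referenced above are typically built from a pairwise distance matrix. As a rough, self-contained illustration of that step, the sketch below runs average-linkage agglomerative clustering on a small precomputed distance matrix in pure Python. The linkage choice and the toy matrix are assumptions made for illustration; the notebooks may use a different linkage or a library routine.

```python
from typing import List, Tuple

Merge = Tuple[List[int], List[int], float]


def average_linkage(distmat: List[List[float]]) -> List[Merge]:
    """Agglomerative clustering with average linkage on a square distance matrix.

    Returns the merge history as (members_a, members_b, distance) tuples,
    in the order the merges happen.
    """
    clusters = [[i] for i in range(len(distmat))]
    merges: List[Merge] = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters.
                total = sum(distmat[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Remove b first (b > a) so that index a stays valid.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
    return merges


# Toy symmetric distance matrix for four series: 0 and 1 are close, 2 and 3 are close.
distmat = [
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 5.0, 6.0],
    [6.0, 5.0, 0.0, 2.0],
    [7.0, 6.0, 2.0, 0.0],
]

merges = average_linkage(distmat)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

In this toy example, series 0 and 1 merge first, then 2 and 3, and the two pairs join last. In practice, a matrix such as the one returned by distmat_from_dicts would take the place of the toy matrix.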