Following the previously proposed by Li, Lin, and Oates approach for time-series variable length recurrent patterns discovery based on symbolic discretization and grammatical inference, I have implemented their algorithm in Java adding few new features. Among them are highly efficient time series anomaly discovery (anomaly detection): RRA (Rare Rule Anomaly) for exact anomaly discovery and the "rule density curve" for an approximate anomaly discovery. Both algorithms are based on the rule density curve concept which corresponds to the amount of grammatical rules which span over a time series point and effectively reflects the short- and long-term correlations between the point and its neighbors. Since grammatical inference process compresses the input data, the number of possible grammar' rules spanning a time-series segment relates to its algorithmic compressibility (i.e., Kolmogorov complexity) as we discuss in our work.
GrammarViz 2.0 is an end-to-end solution for time series data mining that also enables interactive discretization parameters tuning through visual navigation of discretized time-series and inferred grammar rules. We have released Grammarviz 2.0 GUI under GPL, please find the code and documentation at our GitHub repository.
An example of time series anomaly discovery in ECG signal. Top panel shows the true anomalous heartbeat location. Middle panel shows that the rule density curve clearly identifies the true anomaly by its global minimum. Bottom panel confirms that the RRA-reported discord has indeed the largest Euclidean distance value to its nearest non-self match.
An example of RRA algorithm application for discovery of spatio-temporal anomaly. The highlighted trajectory (path) corresponds to an abnormal behavior that does not conform to the usual pattern of exiting and entering the block's parking lot. The multi-dimensional trajectory data (time, latitude, longitude) was transformed into a sequence of scalars by mapping its values to the visit order of a Hilbert space filling curve embedded in the trajectory manifold space.
The SAX-VSM is an algorithm for time-series characteristic patterns discovery based on Symbolic Aggregate approXimation (SAX, a technique for time series discretization) and Vector Space Model (VSM an IR toolkit for word relevancy ranking).
Given a training set representing time series classes, SAX-VSM decomposes all class' time series into words via sliding window, building bags of words for each class. In turn, TF*IDF statistics is computed on these. The process yields vectors of class-chracteristic patterns where each pattern is ranked by the class specificity.
These vectors, as we have shown, can be used to build an excellent time series classification engine based on Cosine similarity. In addition to that, these enable interpretation of class specificity and classification results.
Below is shown an example of SAX-VSM application to a subset of the well studied MNIST dataset (10 classes of time series representing handwritten digits). This example illustrates the algorithm's rotational invariance, robustness, and the capacity of multiple characteristic features discovery and ranking. The sliding window of 190 was used in this example.
Note, that in addition, the SAX-VSM library provides SAX discretization parameters optimization scheme based on DIviding RECtangles algorithm, addressing the common for SAX-based techniques problem of optimal discretization parameters selection.
JMotif-SAX Java library provides an API for time series discretization based on Symbolic Aggregate approXimation (SAX). JMotif-SAX implements a multi-threaded SAX discretization procedure that scales very well: