Following the previously proposed by Li, Lin, and Oates approach for time-series variable length recurrent patterns discovery based on symbolic discretization and grammatical inference, I have implemented their algorithm in Java (so it runs on all platforms) adding few new features. Among them, is a highly efficient time series anomaly discovery (anomaly detection) algorithms based on the grammar rule density curve. The rule density curve accounts for the amount of grammatical rules which span over a time series point and effectively reflects the short- and long-term correlations between the point and its neighbors. Since grammatical inference process compresses the input data, the number of possible grammar' rules spanning a time-series segment relates to its algorithmic compressibility and reflects its Kolmogorov (i.e., algorithmic) complexity, as we discuss in our work. GrammarViz 2.0 implements two rule density curve-based algorithms for time series anomaly discovery: the straightforward and highly efficient heatmap-like visualization capable to highlight potentially anomalous segments in a linear time and RRA (rare rule anomaly) algorithm, which is a more computationally expensive exact solution for time series discord discovery.
GrammarViz 2.0 is an end-to-end solution for time series data mining that also enables interactive discretization parameters tuning through visual navigation of discretized time-series and inferred grammar rules. We have released Grammarviz 2.0 GUI under GPL, please find the code and documentation at our GitHub repository.
GrammarViz 2.0 screenshot
The background heatmap under time series reflects the values of "grammar rule density curve" that reflects the amount of grammar' rules encoding the string which was obtained by the discretization of this time series with SAX. Note how the anomalous heartbeat is clearly identified by a light blue color shade.
An example of time series anomaly discovery in ECG signal. Top panel shows the true anomalous heartbeat location. Middle panel shows that the rule density curve clearly identifies the true anomaly by its global minimum. Bottom panel confirms that the RRA-reported discord has indeed the largest Euclidean distance value to its nearest non-self match.
An example of RRA algorithm application for discovery of spatio-temporal anomaly. The highlighted trajectory (path) corresponds to an abnormal behavior that does not conform to the usual pattern of exiting and entering the block's parking lot. The multi-dimensional trajectory data (time, latitude, longitude) was transformed into a sequence of scalars by mapping its values to the visit order of a Hilbert space filling curve embedded in the trajectory manifold space.
Sliding window-based symbolic discretization and Sequitur are processing the input data sequentially from left to right, therefore, the rule density curve can be used for online (i.e., real-time) anomaly detection as shown in the next video.
The SAX-VSM is an algorithm for time-series characteristic patterns discovery based on Symbolic Aggregate approXimation (SAX, a technique for time series discretization) and Vector Space Model (VSM an IR toolkit for word relevancy ranking).
Given a training set representing time series classes, SAX-VSM decomposes all class' time series into words via sliding window, building bags of words for each class. In turn, TF*IDF statistics is computed on these. The process yields vectors of class-chracteristic patterns where each pattern is ranked by the class specificity.
These vectors, as we have shown, can be used to build an excellent time series classification engine based on Cosine similarity. In addition to that, these enable interpretation of class specificity and classification results.
Below is shown an example of SAX-VSM application to a subset of the well studied MNIST dataset (10 classes of time series representing handwritten digits). This example illustrates the algorithm's rotational invariance, robustness, and the capacity of multiple characteristic features discovery and ranking. The sliding window of 190 was used in this example.
Note, that in addition, the SAX-VSM library provides SAX discretization parameters optimization scheme based on DIviding RECtangles algorithm, addressing the common for SAX-based techniques problem of optimal discretization parameters selection.
JMotif-SAX Java library provides an API for time series discretization based on Symbolic Aggregate approXimation (SAX). JMotif-SAX implements a multi-threaded SAX discretization procedure that scales very well: