📚 Technical docs

Here is the functional diagram of the main objects of this component, as well as their technical documentation:

_images/tdaad.png — Functional diagram of the topological anomaly detection scheme.

class tdaad.anomaly_detectors.TopologicalAnomalyDetector(window_size: int = 100, step: int = 5, tda_max_dim: int = 2, n_centers_by_dim: int = 5, support_fraction: float = None, contamination: float = 0.1, random_state: int = 42)

Object for detecting anomalies based on Topological Embedding and sklearn.covariance.EllipticEnvelope.

This object analyzes multiple time series through the following operations:

- run a sliding window algorithm and represent each time series window with topological features (see Topological Embedding),
- use a MinCovDet algorithm to robustly estimate the data mean and covariance in the embedding space,
- use these to derive a Mahalanobis distance in the embedding space and the associated outlier detection procedure (see Elliptic Envelope).

After fitting, it can produce an anomaly score for each time segment of a time series, describing how normal or abnormal that segment is (the lower the score, the more abnormal). The predict method (inherited from EllipticEnvelope) converts that score into binary normal / anomaly labels.
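The scoring and labeling logic described above can be sketched with plain NumPy. This is a simplified illustration, not the library's code: it assumes an ordinary sample mean and covariance in place of the robust MinCovDet estimate, random vectors stand in for the topological embedding, and the helper names score_samples and predict are chosen only to mirror the methods mentioned above. Scores are negative squared Mahalanobis distances, and the threshold is the contamination quantile of the training scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the embedded windows (assumption: 200 windows, 4 features).
embedding = rng.normal(size=(200, 4))

# Non-robust location and scatter estimates (the detector uses MinCovDet instead).
mean = embedding.mean(axis=0)
precision = np.linalg.inv(np.cov(embedding, rowvar=False))


def score_samples(points):
    """Negative squared Mahalanobis distance: lower means more abnormal."""
    centered = points - mean
    sq_dist = np.einsum("ij,jk,ik->i", centered, precision, centered)
    return -sq_dist


# Threshold chosen so that roughly `contamination` of the training scores fall below it.
contamination = 0.1
offset = np.percentile(score_samples(embedding), 100 * contamination)


def predict(points):
    """Return 1 for normal points and -1 for anomalies."""
    decision = score_samples(points) - offset
    return np.where(decision >= 0, 1, -1)


labels = predict(embedding)
far_label = predict((mean + 10.0)[None, :])  # a point far from the bulk of the data
```

Points with a negative shifted score get the label -1 (anomaly) and the others get 1 (normal), which matches the EllipticEnvelope labeling convention.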

Read more in the User Guide.

Parameters:

window_size (int, default=100) – Size of the sliding window used to extract subsequences as input to named_pipeline.
step (int, default=5) – Step size between consecutive sliding windows.
tda_max_dim (int, default=2) – The maximum dimension of the topological feature extraction.
n_centers_by_dim (int, default=5) – The number of centroids to generate per dimension for vectorizing topological features. The resulting embedding has total dimension <= tda_max_dim * n_centers_by_dim; it may be smaller because of the KMeans algorithm in the Archipelago step.
support_fraction (float, default=None) – The proportion of points to be included in the support of the raw MCD estimate. If None, the algorithm uses its minimum support size, (n_samples + n_features + 1) / 2 points. Range is (0, 1).
contamination (float, default=0.1) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Range is (0, 0.5]. Only matters for computing the decision function.
random_state (int, RandomState instance or None, default=42) – Determines the pseudo random number generator for shuffling the data. Pass an int for reproducible results across multiple function calls.

topological_embedding_

TopologicalEmbedding transformer object, fitted when fit is called.

Type: object

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from tdaad.anomaly_detectors import TopologicalAnomalyDetector
>>> n_timestamps = 1000
>>> n_sensors = 20
>>> timestamps = pd.to_datetime('2024-01-01', utc=True) + pd.Timedelta(1, 'h') * np.arange(n_timestamps)
>>> X = pd.DataFrame(np.random.random(size=(n_timestamps, n_sensors)), index=timestamps)
>>> X.iloc[n_timestamps//2:,:10] = -X.iloc[n_timestamps//2:,10:20]
>>> detector = TopologicalAnomalyDetector(n_centers_by_dim=2, tda_max_dim=1).fit(X)
>>> anomaly_scores = detector.score_samples(X)
>>> decision = detector.decision_function(X)
>>> anomalies = detector.predict(X)

class tdaad.topological_embedding.TopologicalEmbedding(window_size: int = 40, step: int = 5, tda_max_dim: int = 2, n_centers_by_dim: int = 5)

Topological embedding for multiple time series.

Slices the time series into smaller time series windows, forms an affinity matrix on each window, and applies a Rips procedure to produce persistence diagrams for each affinity matrix. It then applies Atol [ref:Atol] to each dimension through the gudhi.representation.Archipelago representation to produce a topological vectorization.
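To make the "affinity matrix, then Rips" step more concrete, here is a minimal NumPy-only sketch for a single window, restricted to dimension 0. It rests on two assumptions that are not taken from the library: the dissimilarity between sensors is taken to be 1 - |correlation|, and the hypothetical helper h0_persistence replaces the Rips computation with a Kruskal-style union-find. That replacement works in dimension 0 because the finite 0-dimensional bars of a Rips filtration are born at 0 and die at the edge weights of a minimum spanning tree. Higher dimensions, up to tda_max_dim, need a full Rips complex.

```python
import numpy as np


def h0_persistence(dissimilarity):
    """Finite 0-dimensional Rips bars (birth, death) from a dissimilarity matrix."""
    n = dissimilarity.shape[0]
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges by increasing weight, as in Kruskal's algorithm.
    edges = sorted(
        (dissimilarity[i, j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    bars = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge: one of them dies at this edge weight.
            parent[root_i] = root_j
            bars.append((0.0, float(weight)))
    return np.array(bars)


rng = np.random.default_rng(0)
window = rng.random(size=(40, 5))  # one window: 40 timestamps, 5 sensors

# Assumed dissimilarity between sensors: strongly correlated sensors are "close".
dissimilarity = 1.0 - np.abs(np.corrcoef(window, rowvar=False))

diagram = h0_persistence(dissimilarity)  # one (birth, death) row per merge
```

With 5 sensors the diagram contains 4 finite bars; the remaining connected component never dies and is omitted here.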

Read more in the User Guide.

Parameters:

window_size (int, default=40) – Size of the sliding window used to extract subsequences as input to named_pipeline.
step (int, default=5) – Step size between consecutive sliding windows.
n_centers_by_dim (int, default=5) – The number of centroids to generate per dimension for vectorizing topological features. The resulting embedding has total dimension <= tda_max_dim * n_centers_by_dim; it may be smaller because of the KMeans algorithm in the Archipelago step.
tda_max_dim (int, default=2) – The maximum dimension of the topological feature extraction.
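As a rough illustration of how window_size and step interact, the sketch below slices a toy array with NumPy's sliding_window_view. It assumes that only full windows are kept and that windows start at row 0. The library's own slicing may handle edges differently.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_timestamps, n_sensors = 100, 5
window_size, step = 40, 5

# Toy data: each row is one timestamp, each column one sensor.
X = np.arange(n_timestamps * n_sensors, dtype=float).reshape(n_timestamps, n_sensors)

# All windows of length `window_size` along the time axis: shape (61, 5, 40).
all_windows = sliding_window_view(X, window_size, axis=0)

# Keep every `step`-th window and reorder axes to (window, time, sensor).
windows = all_windows[::step].transpose(0, 2, 1)

# Closed-form count of full windows under the same assumptions.
n_windows = (n_timestamps - window_size) // step + 1
```

Here windows has shape (13, 40, 5), so a 100-row series yields 13 windows of 40 timestamps each, and windows[1] covers rows 5 to 44 of X.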

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from tdaad.topological_embedding import TopologicalEmbedding
>>> n_timestamps = 100
>>> n_sensors = 5
>>> timestamps = pd.to_datetime('2024-01-01', utc=True) + pd.Timedelta(1, 'h') * np.arange(n_timestamps)
>>> X = pd.DataFrame(np.random.random(size=(n_timestamps, n_sensors)), index=timestamps)
>>> TopologicalEmbedding(n_centers_by_dim=2, tda_max_dim=1).fit_transform(X)