Models and Configuration Guide

This section shows how to configure the machine learning models available in ObServML, organized by experiment type. Each model comes with a YAML configuration example; selected models also include REST API usage, parameter explanations, and a use case.

ObServML supports four main experiment types:
- Time Series Analysis: Forecasting and anomaly detection in temporal data
- Fault Detection: Unsupervised anomaly detection in sensor data
- Fault Isolation: Supervised classification for root cause analysis
- Process Mining: Analysis of sequential processes and workflows

Configuration Structure

All models follow a consistent YAML configuration structure:

load_object:
  module: framework.{ExperimentType}
  name: {ExperimentType}Experiment
setup:
  datetime_column: "timestamp"  # Column containing timestamps
  target: "target_variable"     # Target variable (for supervised learning)
  # Additional setup parameters...
eda:  # Optional: Exploratory Data Analysis
create_model:
  model: "model_name"
  params:
    # Model-specific parameters
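
As a rough illustration of this layout, the Python sketch below checks that a configuration (already loaded into a dict) contains the sections shown above. The helper name `check_config` and the choice of which sections to treat as required are assumptions made for this example; they are not part of ObServML.

```python
# Hypothetical helper: verifies the top-level layout of a configuration dict.
# Which sections are mandatory is an assumption based on the template above.
REQUIRED_SECTIONS = ("load_object", "setup", "create_model")

def check_config(cfg):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in cfg:
            problems.append(f"missing section: {section}")
    # An empty YAML section loads as None, so fall back to an empty dict.
    load_object = cfg.get("load_object") or {}
    for key in ("module", "name"):
        if key not in load_object:
            problems.append(f"load_object.{key} is missing")
    create_model = cfg.get("create_model") or {}
    if "model" not in create_model:
        problems.append("create_model.model is missing")
    return problems

config = {
    "load_object": {
        "module": "framework.FaultDetection",
        "name": "FaultDetectionExperiment",
    },
    "setup": {"datetime_column": "ds"},
    "eda": None,
    "create_model": {"model": "iforest", "params": {"n_estimators": 100}},
}
print(check_config(config))
```

Running a check like this before posting a configuration can catch a misspelled section name early.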

Time Series Analysis

Time Series Analysis experiments are designed for monitoring sensor data over time, offering insights into individual sensor behavior and enabling predictive maintenance through forecasting and anomaly detection.

Prophet

Prophet is excellent for time series forecasting with strong seasonal patterns and holiday effects.

Configuration Example:

load_object:
  module: framework.TimeSeriesAnalysis
  name: TimeSeriesAnomalyExperiment
setup: 
  datetime_column: "ds"
  datetime_format: "ms"
  target: "y"
  predict_window: 1000
  retrain:
    retrain_window: 5000
    metric: "MAPE"
    metric_threshold: 0.3
    higher_better: false
eda:
create_model:
  model: "prophet"
  params:
    periods: 0
    factor: 1.0
    forecast_window: 100

REST API Usage:

# Train Prophet model
curl -X POST "http://localhost:8010/timeseries_prophet/train" \
  -H "Content-Type: application/json" \
  -d '{
    "load_object": {
      "module": "framework.TimeSeriesAnalysis",
      "name": "TimeSeriesAnomalyExperiment"
    },
    "setup": {
      "datetime_column": "ds",
      "target": "y",
      "predict_window": 1000
    },
    "create_model": {
      "model": "prophet",
      "params": {
        "periods": 0,
        "factor": 1.0,
        "forecast_window": 100
      }
    }
  }'

# Make predictions
curl -X POST "http://localhost:8010/timeseries_prophet/predict"

# Get forecast plot
curl "http://localhost:8010/timeseries_prophet/plot/forecast"

Parameters:
- periods: Number of periods to forecast forward
- factor: Anomaly detection factor
- forecast_window: Number of future points to predict

ARIMA/SARIMA

ARIMA models are ideal for stationary time series with clear autoregressive patterns.

Configuration Example:

load_object:
  module: framework.TimeSeriesAnalysis
  name: TimeSeriesAnomalyExperiment
setup: 
  datetime_column: "ds"
  target: "y"
eda:
create_model:
  model: "arima"
  params:  
    start_p: 10
    d: 1
    start_q: 10
    max_p: 100
    max_q: 100
    seasonal: false
    threshold_for_anomaly: 3

Parameters:
- start_p: Starting value for autoregressive order
- d: Differencing order
- start_q: Starting value for moving average order
- max_p/max_q: Maximum values for parameter search
- seasonal: Enable seasonal ARIMA (SARIMA)
- threshold_for_anomaly: Standard deviations for anomaly detection
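
One common way to apply a threshold expressed in standard deviations is to compare each forecast residual with the spread of all residuals. The sketch below shows that idea in plain Python. It is an assumption about how such a threshold can be used, not a description of ObServML's internal logic, and `flag_anomalies` is a hypothetical name.

```python
from statistics import mean, stdev

def flag_anomalies(actual, predicted, threshold_for_anomaly=3.0):
    """Flag points whose residual is more than `threshold_for_anomaly`
    standard deviations away from the mean residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu, sigma = mean(residuals), stdev(residuals)
    if sigma == 0:
        return [False] * len(residuals)  # flat residuals: nothing to flag
    return [abs(r - mu) > threshold_for_anomaly * sigma for r in residuals]

# Toy data: a flat forecast, small alternating noise, and one spike at index 12.
predicted = [10.0] * 20
actual = [10.1 if i % 2 == 0 else 9.9 for i in range(20)]
actual[12] = 15.0

flags = flag_anomalies(actual, predicted, threshold_for_anomaly=3)
print([i for i, flagged in enumerate(flags) if flagged])  # [12]
```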

LSTM

Long Short-Term Memory networks for complex temporal patterns and non-linear relationships.

Configuration Example:

load_object:
  module: framework.TimeSeriesAnalysis
  name: TimeSeriesAnomalyExperiment
setup: 
  datetime_column: "ds"
  target: "y"
eda:
create_model:
  model: "lstm"
  params:
    seq_length: 50
    layer_no: 2
    cell_no: 64
    epoch_no: 100
    batch_size: 32
    shuffle: true
    patience: 10
    threshold_for_anomaly: 3

Parameters:
- seq_length: Length of input sequences
- layer_no: Number of LSTM layers
- cell_no: Number of cells per LSTM layer
- epoch_no: Training epochs
- batch_size: Training batch size
- patience: Early stopping patience
- threshold_for_anomaly: Anomaly detection threshold
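
To make `seq_length` concrete, the sketch below shows the usual sliding-window preparation for a sequence model: each training input is a run of `seq_length` consecutive values, and the target is the value that follows. The function name `make_sequences` is hypothetical, and ObServML may prepare its sequences differently.

```python
def make_sequences(values, seq_length):
    """Split a series into (input window, next value) training pairs."""
    inputs, targets = [], []
    for start in range(len(values) - seq_length):
        inputs.append(values[start:start + seq_length])
        targets.append(values[start + seq_length])
    return inputs, targets

series = [1, 2, 3, 4, 5, 6]
X, y = make_sequences(series, seq_length=3)
print(X)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(y)  # [4, 5, 6]
```

A series of N points yields N - seq_length training pairs, so a long `seq_length` on a short series leaves little data to train on.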

Autoencoder

Neural network autoencoders for anomaly detection in time series through reconstruction error.

Configuration Example:

load_object:
  module: framework.TimeSeriesAnalysis
  name: TimeSeriesAnomalyExperiment
setup: 
  datetime_column: "ds"
  target: "y"
eda:
create_model:
  model: "ae"
  params:
    layer_no: 6
    window: 250
    epoch_no: 50
    batch_size: 64
    shuffle: false
    threshold_for_anomaly: 3
    neuron_no_enc: [30, 25, 20, 15, 10, 5]
    neuron_no_dec: [5, 10, 15, 20, 25, 30]
    act_enc: 'relu'
    act_dec: 'relu'

Parameters:
- layer_no: Number of layers in encoder/decoder
- window: Moving window size for sequences
- neuron_no_enc/dec: Neurons per layer in encoder/decoder
- act_enc/dec: Activation functions
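
Reconstruction error can be illustrated without a neural network: given input windows and whatever the autoencoder reproduces for them, the error per window is the mean squared difference. The sketch below uses hand-written reconstructions as stand-ins for model output; `reconstruction_errors` is a hypothetical helper, and the exact error measure used by ObServML is not stated in this guide.

```python
def reconstruction_errors(windows, reconstructions):
    """Mean squared error between each input window and its reconstruction."""
    errors = []
    for original, rebuilt in zip(windows, reconstructions):
        squared = [(o - r) ** 2 for o, r in zip(original, rebuilt)]
        errors.append(sum(squared) / len(squared))
    return errors

# Stand-in data: the second window ends in a spike that the model fails to reproduce.
windows = [[1.0, 2.0, 3.0], [2.0, 3.0, 9.0]]
reconstructions = [[1.0, 2.0, 3.0], [2.0, 3.0, 3.0]]

print(reconstruction_errors(windows, reconstructions))  # [0.0, 12.0]
```

Windows the model has learned to reproduce score near zero, while unusual windows score high and can then be compared against an anomaly threshold.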

SSA (Singular Spectrum Analysis)

SSA for time series decomposition and anomaly detection.

Configuration Example:

load_object:
  module: framework.TimeSeriesAnalysis
  name: TimeSeriesAnomalyExperiment
setup: 
  datetime_column: "ds"
  target: "y"
eda:
create_model:
  model: "ssa"
  params:
    window_size: 10
    lower_frequency_bound: 0.05
    lower_frequency_contribution: 0.975
    threshold: 3

Fault Detection

Fault Detection experiments focus on unsupervised anomaly detection in multivariate sensor data, identifying deviations from normal operating conditions without requiring labeled data.

Isolation Forest

Isolation Forest is highly effective for anomaly detection in high-dimensional data by isolating anomalies through random partitioning.

Configuration Example:

load_object:
  module: framework.FaultDetection  
  name: FaultDetectionExperiment
setup:
  datetime_column: "ds"
  datetime_format: "ms"
eda:
create_model:
  model: "iforest"
  params:
    n_estimators: 100
    contamination: "auto"
    random_state: 0

REST API Usage:

# Train Isolation Forest model
curl -X POST "http://localhost:8010/pump_anomaly/train" \
  -H "Content-Type: application/json" \
  -d '{
    "load_object": {
      "module": "framework.FaultDetection",
      "name": "FaultDetectionExperiment"
    },
    "setup": {
      "datetime_column": "ds"
    },
    "create_model": {
      "model": "iforest",
      "params": {
        "n_estimators": 100,
        "contamination": "auto"
      }
    }
  }'

# Detect anomalies in new data
curl -X POST "http://localhost:8010/pump_anomaly/predict"

# Get anomaly visualization
curl "http://localhost:8010/pump_anomaly/plot/outliers"

Parameters:
- n_estimators: Number of isolation trees
- contamination: Expected proportion of anomalies ("auto" for automatic detection)
- random_state: Random seed for reproducibility

PCA (Principal Component Analysis)

PCA with Hotelling's T² and SPE tests for multivariate anomaly detection, as demonstrated in the research paper with pump sensor data.

Configuration Example:

load_object:
  module: framework.FaultDetection  
  name: FaultDetectionExperiment
setup:
  datetime_column: "ds"
  datetime_format: "ms"
eda:
create_model:
  model: "pca"
  params:
    alpha: 0.05
    detect_outliers: ['ht2', 'spe']
    n_components: 0.95
    normalize: true

Use Case Example (from research paper): The pump dataset contains 50 sensors without labeled data, making it ideal for fault detection. PCA reduces dimensionality while preserving 95% of variance, then applies statistical tests to identify anomalies.

Parameters:
- alpha: Significance level for Hotelling's T² test
- detect_outliers: Types of outlier detection to apply ('ht2' for Hotelling's T², 'spe' for SPE)
- n_components: Fraction of variance the retained principal components must explain (0.95 keeps 95%)
- normalize: Apply data normalization
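
For reference, the two statistics have the following standard textbook definitions. The notation is introduced here for illustration: x is a normalized observation, P holds the loadings of the k retained components as columns, and λ_i are the corresponding eigenvalues of the covariance matrix. How ObServML computes the control limits for each statistic is not covered in this guide.

```latex
\begin{align}
  t   &= P^{\top} x \\
  T^2 &= t^{\top} \Lambda^{-1} t = \sum_{i=1}^{k} \frac{t_i^{2}}{\lambda_i},
         \qquad \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_k) \\
  \mathrm{SPE} &= \lVert x - P P^{\top} x \rVert^{2} = x^{\top} \left( I - P P^{\top} \right) x
\end{align}
```

T² measures how far an observation lies from the center within the retained component space, while SPE measures how much of the observation the retained components fail to explain. An observation is flagged when a statistic exceeds its control limit at the chosen significance level.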

DBSCAN

Density-based clustering for anomaly detection: dense regions form clusters of normal behavior, and points that belong to no cluster are treated as anomalies.

Configuration Example:

load_object:
  module: framework.FaultDetection  
  name: FaultDetectionExperiment
setup:
  datetime_column: "ds"
  datetime_format: "ms"
eda:
create_model:
  model: "dbscan"
  params:
    eps: 2
    min_samples: 5

Parameters:
- eps: Maximum distance between points in a neighborhood
- min_samples: Minimum points required to form a cluster
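
The interplay of `eps` and `min_samples` can be shown with a small pure-Python sketch that finds the points DBSCAN would label as noise. It follows the common convention that a point counts as its own neighbor. `noise_points` is a hypothetical helper written for illustration, not the implementation ObServML uses.

```python
from math import dist

def noise_points(points, eps, min_samples):
    """Return indices of points that DBSCAN would label as noise.

    A core point has at least `min_samples` points (itself included) within
    `eps`. Noise is any point that is neither a core point nor within `eps`
    of one.
    """
    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    core = {i for i in range(len(points)) if len(neighbours(i)) >= min_samples}
    return [
        i for i in range(len(points))
        if i not in core and not any(j in core for j in neighbours(i))
    ]

# Four readings close together and one far away.
readings = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(noise_points(readings, eps=2, min_samples=3))  # [4]
```

A larger `eps` or a smaller `min_samples` makes clusters easier to form and therefore flags fewer points.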

Elliptic Envelope

Robust covariance estimation for outlier detection in multivariate data.

Configuration Example:

load_object:
  module: framework.FaultDetection  
  name: FaultDetectionExperiment
setup:
  datetime_column: "ds"
  datetime_format: "ms"
eda:
create_model:
  model: "ee"
  params:
    contamination: 0.1

Parameters:
- contamination: Proportion of outliers in the dataset (0.0 to 0.5)

Fault Isolation

Fault Isolation experiments perform supervised classification when labeled data is available, enabling root cause analysis and identification of specific fault types or machine states.

Decision Tree

Decision trees provide interpretable classification with feature importance analysis, as demonstrated in the research paper with electrical circuit data.

Configuration Example:

load_object:
  module: framework.FaultIsolation
  name: FaultIsolationExperiment
setup:
  datetime_column: "ds"
  datetime_format: "ms"
  target: "Output (S)"
  predict_window: 1000
  retrain:
    retrain_window: 5000
    metric: "Accuracy"
    metric_threshold: 0.9
    higher_better: true
eda:
create_model:
  model: "dt"
  params:

REST API Usage:

# Train Decision Tree for fault isolation
curl -X POST "http://localhost:8010/electrical_faults/train" \
  -H "Content-Type: application/json" \
  -d '{
    "load_object": {
      "module": "framework.FaultIsolation",
      "name": "FaultIsolationExperiment"
    },
    "setup": {
      "datetime_column": "ds",
      "target": "Output (S)",
      "predict_window": 1000
    },
    "create_model": {
      "model": "dt",
      "params": {}
    }
  }'

# Classify new faults
curl -X POST "http://localhost:8010/electrical_faults/predict"

# Get feature importance plot
curl "http://localhost:8010/electrical_faults/plot/feature_importance"

# Get confusion matrix
curl "http://localhost:8010/electrical_faults/plot/confusion_matrix"

Use Case Example (from research paper): Electrical circuit monitoring with 3-phase electricity data (Va, Vb, Vc voltages and Ia, Ib, Ic currents). The decision tree identifies which sensors contribute most to fault detection, with current Ib showing highest feature importance.

Random Forest

Ensemble method combining multiple decision trees for robust classification.

Configuration Example:

load_object:
  module: framework.FaultIsolation
  name: FaultIsolationExperiment
setup:
  datetime_column: "ds"
  target: "fault_type"
eda:
create_model:
  model: "rf"
  params:
    n_estimators: 100
    max_depth: 10
    random_state: 42

Naive Bayes

Probabilistic classifier based on Bayes' theorem, effective for categorical fault classification.

Configuration Example:

load_object:
  module: framework.FaultIsolation
  name: FaultIsolationExperiment
setup:
  datetime_column: "ds"
  target: "fault_category"
eda:
create_model:
  model: "nb"
  params:

Hidden Markov Models (HMM)

HMM for sequential fault pattern recognition and state-based fault isolation.

Configuration Example:

load_object:
  module: framework.FaultIsolation
  name: FaultIsolationExperiment
setup:
  datetime_column: "ds"
  target: "machine_state"
eda:
create_model:
  model: "hmm"
  params:
    n_iter: 1000
    covariance_type: "diag"
    n_mix: 10

Parameters:
- n_iter: Maximum number of EM iterations
- covariance_type: Type of covariance matrix
- n_mix: Number of mixture components

Bayesian Network

Probabilistic graphical model for understanding causal relationships between variables.

Configuration Example:

load_object:
  module: framework.FaultIsolation
  name: FaultIsolationExperiment
setup:
  datetime_column: "ds"
  target: "fault_root_cause"
eda:
create_model:
  model: "bn"
  params: 
    learningMethod: 'MIIC'
    prior: 'Smoothing'
    priorWeight: 1
    discretizationNbBins: 30
    discretizationStrategy: "quantile"
    discretizationThreshold: 0.01
    usePR: false

Note: Bayesian Network prediction requires a target variable, and the model may fail to run in Docker because of pyAgrum compatibility issues.

Process Mining

Process Mining experiments analyze sequential data to understand workflows, operator behavior, and process optimization opportunities.

Heuristics Miner

Discovers process models from event logs using heuristic rules.

Configuration Example:

load_object:
  module: framework.ProcessMining
  name: ProcessMiningExperiment
setup:
eda:
create_model:
  model: "heuristics"
  params:

Apriori Association Rules

Finds frequent patterns and association rules in sequential data.

Configuration Example:

load_object:
  module: framework.ProcessMining
  name: ProcessMiningExperiment
setup:
eda:
create_model:
  model: "apriori"
  params:
    min_support: 0.1
    min_confidence: 0.8
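
The two thresholds refer to the standard support and confidence measures of association rule mining. The sketch below computes both for a single rule over a handful of made-up event sets. The event names are invented for illustration, and the helpers are not ObServML functions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by antecedent support."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Made-up event sets, one per process run.
runs = [
    {"start", "heat", "stop"},
    {"start", "heat"},
    {"start", "stop"},
    {"heat", "stop"},
]

rule_support = support(runs, {"start", "heat"})
rule_confidence = confidence(runs, {"start"}, {"heat"})
print(rule_support)  # 0.5
print(round(rule_confidence, 3))  # 0.667
```

With `min_support: 0.1` and `min_confidence: 0.8`, the rule start -> heat would clear the support threshold but be dropped for low confidence.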

TopK Rules

Discovers the top-K most interesting association rules.

Configuration Example:

load_object:
  module: framework.ProcessMining
  name: ProcessMiningExperiment
setup:
eda:
create_model:
  model: "topk"
  params:
    k: 100
    min_confidence: 0.5

CMSPAM

Closed sequential pattern mining for discovering frequent subsequences.

Configuration Example:

load_object:
  module: framework.ProcessMining
  name: ProcessMiningExperiment
setup:
eda:
create_model:
  model: "cmspam"
  params:
    min_support: 0.1

Advanced Configuration Options

Automatic Retraining

Configure automatic model retraining based on performance metrics:

setup:
  retrain:
    retrain_window: 5000        # Use last 5000 samples for retraining
    metric: "Accuracy"          # Metric to monitor (Accuracy, F1, Precision, Recall, MAPE, MSE)
    metric_threshold: 0.9       # Threshold that triggers retraining
    higher_better: true         # Whether higher metric values are better
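
The interaction between `metric_threshold` and `higher_better` can be summarized as a single comparison. The sketch below assumes that retraining is triggered when the monitored metric is worse than the threshold. That reading follows from the comments above but is not confirmed elsewhere in this guide, and `should_retrain` is a hypothetical name.

```python
def should_retrain(metric_value, metric_threshold, higher_better):
    """Return True when the metric is on the wrong side of the threshold."""
    if higher_better:
        return metric_value < metric_threshold
    return metric_value > metric_threshold

# Accuracy dropped below 0.9: retrain.
print(should_retrain(0.85, metric_threshold=0.9, higher_better=True))   # True
# MAPE of 0.2 is still under the 0.3 limit: keep the current model.
print(should_retrain(0.2, metric_threshold=0.3, higher_better=False))   # False
```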

Data Formatting

Handle different data formats with custom formatting options:

setup:
  format:
    name: "pivot"               # Formatting mode
    id: "tsdata"               # Data variable name
    max_level: 1
    columns: "target"
    index: "date"
    values: "value"
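
The `index`, `columns`, and `values` keys suggest a long-to-wide reshape, in which each distinct `columns` entry becomes its own field. The sketch below shows that reshape in plain Python on invented records. Treating the "pivot" mode this way is an assumption, and both the sensor names and the `pivot_long_records` helper are hypothetical.

```python
def pivot_long_records(records, index, columns, values):
    """Reshape long records into one row per index value, one key per column value."""
    wide = {}
    for rec in records:
        row = wide.setdefault(rec[index], {})
        row[rec[columns]] = rec[values]
    return wide

# Invented long-format records: one measurement per row.
long_records = [
    {"date": "2024-01-01", "target": "temp", "value": 20.5},
    {"date": "2024-01-01", "target": "pressure", "value": 1.2},
    {"date": "2024-01-02", "target": "temp", "value": 21.0},
]

wide = pivot_long_records(long_records, index="date", columns="target", values="value")
print(wide["2024-01-01"])  # {'temp': 20.5, 'pressure': 1.2}
```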

Prediction Windows

Control visualization and prediction scope:

setup:
  predict_window: 1000          # Number of samples to show in prediction plots
  forecast_window: 100          # Number of future points to predict (the Prophet example sets this under create_model.params)

REST API Patterns

Common Endpoints

All experiments support these standard endpoints:

POST /{experiment_name}/train - Train a new model
POST /{experiment_name}/predict - Make predictions
POST /{experiment_name}/save - Save model to MLflow
POST /{experiment_name}/load - Load model from MLflow
GET /{experiment_name}/plot/{plot_name} - Get visualization
GET /{experiment_name}/cfg - Get configuration
GET /{experiment_name}/run_id - Get MLflow run ID
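
For callers who prefer Python over curl, the sketch below builds the train request with the standard library and deliberately stops short of sending it, so it runs without a server. The base URL and experiment name are taken from the curl examples in this guide; the helper `train_request` is hypothetical.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8010"  # host and port as used in the curl examples

def train_request(experiment_name, config):
    """Build (but do not send) the POST request for /{experiment_name}/train."""
    body = json.dumps(config).encode("utf-8")
    return Request(
        f"{BASE_URL}/{experiment_name}/train",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = train_request(
    "pump_anomaly",
    {
        "load_object": {
            "module": "framework.FaultDetection",
            "name": "FaultDetectionExperiment",
        },
        "setup": {"datetime_column": "ds"},
        "create_model": {"model": "iforest", "params": {"n_estimators": 100}},
    },
)
print(req.get_method(), req.full_url)
```

Against a running ObServML instance, the request could then be sent with `urllib.request.urlopen(req)`.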

Batch Processing Example

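# Train multiple models in sequence.
# The train endpoint is shown above receiving JSON, so each YAML config is
# converted first (assumes Python with PyYAML is available on the client).
for model in "iforest" "pca" "dbscan"; do
  python -c 'import json, sys, yaml; json.dump(yaml.safe_load(sys.stdin), sys.stdout)' \
    < "configs/pump/${model}.yaml" |
    curl -X POST "http://localhost:8010/sensor_${model}/train" \
      -H "Content-Type: application/json" \
      --data-binary @-
done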

# Monitor all models
for model in "iforest" "pca" "dbscan"; do
  curl -X POST "http://localhost:8010/sensor_${model}/predict"
done

Model Selection Guidelines

Time Series Analysis

Prophet: Strong seasonality, holiday effects, missing data tolerance
ARIMA: Stationary data, clear autoregressive patterns
LSTM: Complex non-linear patterns, large datasets
Autoencoder: Anomaly detection, reconstruction-based analysis
SSA: Trend and seasonal decomposition

Fault Detection

Isolation Forest: High-dimensional data, unknown anomaly types
PCA: Multivariate data, statistical anomaly detection
DBSCAN: Density-based clusters, irregular cluster shapes
Elliptic Envelope: Gaussian-distributed data, robust outlier detection

Fault Isolation

Decision Tree: Interpretability required, feature importance analysis
Random Forest: Robust classification, ensemble benefits
Naive Bayes: Categorical features, probabilistic classification
HMM: Sequential patterns, state-based analysis
Bayesian Network: Causal relationships, probabilistic inference

Process Mining

Heuristics Miner: Process discovery from event logs
Apriori: Frequent pattern mining, association rules
TopK Rules: Most interesting patterns, rule ranking
CMSPAM: Sequential pattern mining, closed patterns

Troubleshooting

Common Issues

BayesNet Docker Issues: Works locally but may fail in Docker due to pyAgrum compilation requirements
PCA Indexing: Avoid duplicate indices in input data
Process Mining: Some models repeat training during the prediction phase, so predictions can take as long as training
Memory Requirements: Deep learning models (LSTM, Autoencoder) require sufficient RAM

Performance Optimization

Use appropriate predict_window sizes to balance visualization and performance
Configure retrain_window based on data velocity and model stability
Monitor MLflow for model performance metrics and storage usage
Use batch processing for multiple model training

This guide covers the configuration structure, model parameters, and REST API patterns needed to set up monitoring with ObServML across its four experiment types.