# MLOps Tracking and Empirical Benchmarking (Engineering)

**Navigation:**
  * **Theory introduction:** [See the Intro](../../THEORY.md)
  * **Related mathematical theory:** [See the Mathematical Theory](../../math/meta/10_mlops_evaluation.md)
  
This chapter details the software engineering behind the `evaluation/` module (`metrics.py`, `performance_analyzer.py`, `eval_plotting.py`, and `tracker.py`). While the mathematical solvers handle the internal physics of the model, these scripts act as the external "MLOps wrapper," providing robust, production-safe telemetry to track how models actually perform on out-of-sample data.

---

## Object-Oriented Telemetry (`BenchmarkTracker`)

In complex pipelines (especially evolutionary AutoML where hundreds of models are trained and discarded), passing around loose dictionaries of metrics quickly becomes unmanageable. 

**Architectural Choice (The State Holder):** The framework introduces the `BenchmarkTracker` class. This Object-Oriented component acts as a unified state holder for a specific model's lifecycle. It stores the model's metadata (`time_fit`, `time_predict`), the raw predictions across all data splits (Train, Validation, Test), and the computed diagnostics.

When `slice_and_evaluate` is called, the tracker automatically partitions the contiguous prediction tensor into the correct cross-validation folds and routes them to the metric and diagnostic sub-modules, centralizing the telemetry logic in a single, auditable object.


```{literalinclude} ../../../../src/tam/evaluation/tracker.py
:language: python
:start-after: "#: <tracker_slice_and_evaluate>"
:end-before: "#: </tracker_slice_and_evaluate>"
:caption: src/tam/evaluation/tracker.py (Object-Oriented Telemetry Slicing)
```
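
As a rough illustration of the slicing idea, the sketch below shows a minimal state holder that splits contiguous arrays into Train, Validation, and Test and scores each split. It is not the project's `BenchmarkTracker`: the class name `TrackerSketch`, the `n_train`/`n_val` parameters, and the `metrics` attribute are assumptions made for this example; only `time_fit`, `time_predict`, and the method name `slice_and_evaluate` come from the text above.

```python
import numpy as np


class TrackerSketch:
    """Hypothetical stand-in for BenchmarkTracker (illustration only)."""

    def __init__(self, name, time_fit=0.0, time_predict=0.0):
        self.name = name
        self.time_fit = time_fit          # seconds spent fitting
        self.time_predict = time_predict  # seconds spent predicting
        self.metrics = {}                 # per-split diagnostics (assumed layout)

    def slice_and_evaluate(self, y_true, y_pred, n_train, n_val):
        """Split contiguous arrays into train/validation/test and score each."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        bounds = {
            "train": slice(0, n_train),
            "validation": slice(n_train, n_train + n_val),
            "test": slice(n_train + n_val, len(y_true)),
        }
        for split, idx in bounds.items():
            err = y_true[idx] - y_pred[idx]
            # RMSE only, to keep the sketch short.
            self.metrics[split] = {"rmse": float(np.sqrt(np.mean(err ** 2)))}
        return self.metrics


tracker = TrackerSketch(name="demo")
y = np.arange(10, dtype=float)
result = tracker.slice_and_evaluate(y, y + 1.0, n_train=6, n_val=2)
```

Keeping the split boundaries and the resulting metrics on one object means downstream code can ask the tracker for `metrics["test"]` instead of re-deriving index ranges.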

---

## Robust Metric Calculation (Pipeline Safety)

When evaluating edge-case models generated by evolutionary pipelines, predictions can occasionally explode (producing `Inf`) or fail entirely (producing `NaN`). A single `NaN` is enough to break a standard Mean Squared Error computation: a plain NumPy mean returns `NaN` for the whole score, and stricter implementations (for example, scikit-learn's input validation) raise an exception that can halt the entire pipeline.

**Architectural Choice (Safe NumPy Vectorization):** The `calculate_regression_metrics` function relies exclusively on pure, vectorized NumPy operations without heavy dependencies like `sklearn`. Crucially, it enforces a strict boolean mask (`~np.isnan(y_true) & ~np.isnan(y_pred) & ~np.isinf(y_pred)`) before any math is executed.

If an evolutionary algorithm proposes a mathematically unstable formula, this module safely isolates the failure, returning empty or constrained metrics rather than crashing the global optimization loop. 

```{literalinclude} ../../../../src/tam/evaluation/metrics.py
:language: python
:start-after: "#: <calculate_regression_metrics>"
:end-before: "#: </calculate_regression_metrics>"
:caption: src/tam/evaluation/metrics.py (NaN-Safe Metric Vectorization)
```
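
The masking pattern can be sketched in a few lines. This is a simplified illustration, not the project's `calculate_regression_metrics`: the function name `safe_regression_metrics`, the returned keys, and the choice to return an empty dict when nothing survives the mask are assumptions for this example. The mask itself is the one quoted above.

```python
import numpy as np


def safe_regression_metrics(y_true, y_pred):
    """Illustrative NaN/Inf-safe metrics (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    # Drop NaN targets and NaN/Inf predictions before any arithmetic.
    mask = ~np.isnan(y_true) & ~np.isnan(y_pred) & ~np.isinf(y_pred)
    n_valid = int(mask.sum())
    if n_valid == 0:
        # Assumption: an empty dict signals "nothing to score".
        return {}

    err = y_true[mask] - y_pred[mask]
    mse = float(np.mean(err ** 2))
    return {
        "n_valid": n_valid,
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(err))),
    }


# Two of the four pairs are unusable (one Inf prediction, one NaN target).
metrics = safe_regression_metrics([1.0, 2.0, np.nan, 4.0], [1.0, np.inf, 3.0, 2.0])
empty = safe_regression_metrics([1.0], [np.nan])
```

Reporting `n_valid` alongside the scores is a design choice worth considering: a model that scores well on only a handful of surviving points should not outrank one that scores on all of them.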

---

## MLOps Diagnostics: Residuals and Degradation

To move beyond simple point-accuracy metrics, the `performance_analyzer.py` module computes structural diagnostics. 

**Architectural Choice (Fast Proxy Statistics):** Instead of importing a heavy statistical library (such as `statsmodels`) to run a formal Durbin-Watson test for serial correlation, the `analyze_residuals` function computes the **Lag-1 Autocorrelation** natively via `np.corrcoef(res[:-1], res[1:])`. The two are closely related: for a reasonably long residual series the Durbin-Watson statistic is approximately `2 * (1 - r1)`, where `r1` is the lag-1 autocorrelation, so a value of `r1` far from zero flags the same problem (typically missing temporal features). The proxy is a single vectorized NumPy call with no extra dependency, which keeps the per-model cost low when analyzing hundreds of models in an AutoTAM loop.


```{literalinclude} ../../../../src/tam/evaluation/performance_analyzer.py
:language: python
:start-after: "#: <performance_analyzer_residuals>"
:end-before: "#: </performance_analyzer_residuals>"
:caption: src/tam/evaluation/performance_analyzer.py (Vectorized Residual Analysis)
```
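
To make the relationship between the proxy and the Durbin-Watson statistic concrete, the sketch below computes both on synthetic residuals. The helper names `lag1_autocorr` and `durbin_watson`, the NaN return for degenerate inputs, and the simulated AR(1) series are assumptions for this example; only the `np.corrcoef(res[:-1], res[1:])` call comes from the text.

```python
import numpy as np


def lag1_autocorr(res):
    """Pearson correlation between the residuals and themselves shifted by one."""
    res = np.asarray(res, dtype=float)
    if res.size < 3 or np.std(res[:-1]) == 0 or np.std(res[1:]) == 0:
        return float("nan")  # assumption: undefined cases are reported as NaN
    return float(np.corrcoef(res[:-1], res[1:])[0, 1])


def durbin_watson(res):
    """Textbook Durbin-Watson statistic, included for comparison only."""
    res = np.asarray(res, dtype=float)
    return float(np.sum(np.diff(res) ** 2) / np.sum(res ** 2))


rng = np.random.default_rng(0)
noise = rng.normal(size=2000)

# AR(1) residuals with coefficient 0.8: strongly serially correlated.
ar = np.empty_like(noise)
ar[0] = noise[0]
for t in range(1, noise.size):
    ar[t] = 0.8 * ar[t - 1] + noise[t]

r_white = lag1_autocorr(noise)  # close to 0 for uncorrelated residuals
r_ar = lag1_autocorr(ar)        # close to 0.8 for the AR(1) series
dw_ar = durbin_watson(ar)       # close to 2 * (1 - r_ar)
```

A positive `r1` well above zero suggests the model is leaving temporal structure in its residuals; a value near zero is what well-behaved residuals should produce.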

To measure concept drift, the `detect_temporal_degradation` function slices the target arrays into two chronological halves (`midpoint = len(yt) // 2`). It routes each half back through the safe metric calculator and compares the results to determine whether the test-set error is stable over time or growing in the later half.

```{literalinclude} ../../../../src/tam/evaluation/performance_analyzer.py
:language: python
:start-after: "#: <performance_analyzer_degradation>"
:end-before: "#: </performance_analyzer_degradation>"
:caption: src/tam/evaluation/performance_analyzer.py (Temporal Degradation Tracking)
```
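
The half-split comparison can be sketched as follows. This is an illustration under stated assumptions, not the project's `detect_temporal_degradation`: the helper `rmse_safe`, the function name `degradation_sketch`, the returned keys, and the use of an RMSE ratio as the drift signal are all hypothetical. Only the `midpoint = len(yt) // 2` split and the NaN/Inf mask come from the text.

```python
import numpy as np


def rmse_safe(yt, yp):
    """RMSE over the pairs that survive the NaN/Inf mask (hypothetical helper)."""
    mask = ~np.isnan(yt) & ~np.isnan(yp) & ~np.isinf(yp)
    if not mask.any():
        return float("nan")
    return float(np.sqrt(np.mean((yt[mask] - yp[mask]) ** 2)))


def degradation_sketch(y_true, y_pred):
    """Compare the error on the first and second chronological halves."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    midpoint = len(yt) // 2

    first = rmse_safe(yt[:midpoint], yp[:midpoint])
    second = rmse_safe(yt[midpoint:], yp[midpoint:])
    # Assumption: a ratio above 1 means the error grew in the later half.
    ratio = second / first if first > 0 else float("nan")
    return {"rmse_first_half": first, "rmse_second_half": second, "ratio": ratio}


t = np.arange(100, dtype=float)
y_true = np.zeros(100)
y_pred = np.where(t < 50, 1.0, 3.0)  # the error triples in the second half
report = degradation_sketch(y_true, y_pred)
```

A ratio is only a descriptive signal; deciding how large it must be before a model is flagged is a threshold choice that this sketch leaves open.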

---

## The Dashboard Facade (`eval_plotting.py`)

Visualizing the results of multiple competing models requires a dynamic layout system.

**Architectural Choice (Dynamic GridSpec):** The `plot_benchmark_dashboard` function acts as a universal Facade. It dynamically adjusts its internal Matplotlib `GridSpec` layout based on the nature of the data.
* If `is_timeseries=True`, it allocates wide, full-width panels to render the temporal prediction curves against the ground truth.
* If the data is cross-sectional, it reconfigures the grid to prioritize Parity Plots (Predicted vs True scatter plots) and Residual Distribution histograms.

This abstraction lets data scientists invoke a single reporting command (`plot_benchmark_dashboard`) and receive a presentation-ready, multi-pane diagnostic view suited to the structure of their dataset.


```{literalinclude} ../../../../src/tam/evaluation/eval_plotting.py
:language: python
:start-after: "#: <plotting_dashboard>"
:end-before: "#: </plotting_dashboard>"
:caption: src/tam/evaluation/eval_plotting.py (Dynamic Matplotlib Dashboard Layout)
```
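
The layout switch described above can be sketched with a small `GridSpec` example. This is not the project's `plot_benchmark_dashboard`: the function name `dashboard_sketch`, the two-column grid, and the panel keys are assumptions for illustration; only the `is_timeseries` flag and the panel types (prediction curves, parity plot, residual histogram) come from the text.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.gridspec import GridSpec


def dashboard_sketch(y_true, y_pred, is_timeseries=False):
    """Build a 2- or 3-panel figure depending on the kind of data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    res = y_true - y_pred

    # Time series get an extra top row; cross-sectional data use one row.
    n_rows = 2 if is_timeseries else 1
    fig = plt.figure(figsize=(10, 4 * n_rows))
    gs = GridSpec(n_rows, 2, figure=fig)
    axes = {}

    if is_timeseries:
        # The top panel spans both columns for the prediction curves.
        axes["series"] = fig.add_subplot(gs[0, :])
        axes["series"].plot(y_true, label="true")
        axes["series"].plot(y_pred, label="predicted")
        axes["series"].legend()

    last = n_rows - 1
    axes["parity"] = fig.add_subplot(gs[last, 0])
    axes["parity"].scatter(y_true, y_pred, s=8)
    axes["parity"].set_xlabel("true")
    axes["parity"].set_ylabel("predicted")

    axes["residuals"] = fig.add_subplot(gs[last, 1])
    axes["residuals"].hist(res, bins=20)
    axes["residuals"].set_xlabel("residual")
    return fig, axes


rng = np.random.default_rng(0)
y = rng.normal(size=200)
y_hat = y + rng.normal(scale=0.1, size=200)
fig_ts, axes_ts = dashboard_sketch(y, y_hat, is_timeseries=True)
fig_cs, axes_cs = dashboard_sketch(y, y_hat, is_timeseries=False)
```

Returning the axes in a dict keyed by panel name lets callers add annotations to a specific panel without depending on its position in the grid.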