# Empirical Evaluation and MLOps Diagnostics

**Navigation:**

* **Theory introduction:** [See the Intro](../../THEORY.md)
* **Related code architecture:** [See the Code Architecture](../../architecture/meta/10_mlops_tracking_code.md)
* **Related inference topic:** [Statistical Diagnostics](07_statistical_diagnostics.md)

While the Primal exact solver and the `DiagnosticsTAM` module provide "Glass-Box" statistical inference on the training data (e.g., T-statistics and EDoF), modern Machine Learning Operations (MLOps) require rigorous empirical validation on out-of-sample data. This chapter establishes the mathematical rationale for the specific error metrics and tracking algorithms implemented in the framework's evaluation engine.

---

## 1. The Mathematics of Forecasting Metrics

The evaluation engine computes a diverse suite of metrics because no single error measure captures all dimensions of predictive failure {cite:p}`hyndman2006another`.

### $L_1$ and $L_2$ Loss Metrics

The fundamental scale-dependent metrics report the error in the units of the target:

* **Mean Absolute Error (MAE):**

  $$\text{MAE} = \frac{1}{N} \sum_{i=1}^N |Y_i - \hat{Y}_i|$$

  MAE penalizes errors linearly, which makes it less sensitive to isolated target outliers than squared-error metrics.

* **Root Mean Square Error (RMSE):**

  $$\text{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^N (Y_i - \hat{Y}_i)^2 }$$

  Because RMSE squares each error before averaging (an $L_2$-type penalty), large errors dominate the score {cite:p}`chai2014root`. A low MAE combined with a high RMSE therefore indicates that the model is generally accurate but occasionally makes very large forecasting errors.

### Relative Percentage Errors (The SMAPE Advantage)

In industrial datasets (e.g., varying smart meters), targets exist on vastly different scales. Scale-independent metrics are required to average performance across heterogeneous topologies.
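As a rough illustration of the two points above (the sensitivity of RMSE to isolated large errors, and the scale dependence of both metrics), the following sketch uses plain Python. The function names `mae` and `rmse` and all sample values are hypothetical; they are not taken from the framework's evaluation engine.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |Y_i - Yhat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical series with one isolated large miss (the last reading).
outlier_true = [10.0, 10.0, 10.0, 10.0, 10.0]
outlier_pred = [11.0, 11.0, 11.0, 11.0, 20.0]
outlier_mae = mae(outlier_true, outlier_pred)
outlier_rmse = rmse(outlier_true, outlier_pred)

# Two hypothetical meters with the same 10% relative miss on every reading,
# but targets that differ by a factor of 1000.
small_true = [1.0, 2.0, 4.0]
small_pred = [1.1, 2.2, 4.4]
large_true = [1000.0, 2000.0, 4000.0]
large_pred = [1100.0, 2200.0, 4400.0]
small_mae = mae(small_true, small_pred)
large_mae = mae(large_true, large_pred)

print(f"outlier series: MAE={outlier_mae:.3f}, RMSE={outlier_rmse:.3f}")
print(f"small meter MAE={small_mae:.3f}, large meter MAE={large_mae:.3f}")
```

The single large miss pulls RMSE well above MAE, while the two meters receive MAE values a factor of 1000 apart despite identical relative accuracy, which is why averaging scale-dependent metrics across such series is not meaningful.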
* **Mean Absolute Percentage Error (MAPE):**

  $$\text{MAPE} = \frac{100}{N} \sum_{i=1}^N \left| \frac{Y_i - \hat{Y}_i}{Y_i} \right|$$

  While highly interpretable, MAPE is asymmetric. For non-negative targets and forecasts, an under-forecast ($\hat{Y}_i < Y_i$) contributes at most $100\%$, whereas an over-forecast ($\hat{Y}_i > Y_i$) can contribute an unbounded error, so over-forecasting is penalized more heavily. Furthermore, if the true target $Y_i = 0$, the term involves a division by zero and the metric becomes infinite or undefined, which can break automated evolutionary pipelines (AutoTAM).

* **Symmetric Mean Absolute Percentage Error (SMAPE):**

  $$\text{SMAPE} = \frac{100}{N} \sum_{i=1}^N \frac{|Y_i - \hat{Y}_i|}{(|Y_i| + |\hat{Y}_i|)/2}$$

  To provide a numerically safe fitness signal for the `AutoTAM` orchestrator, the framework relies heavily on SMAPE {cite:p}`hyndman2006another`. Because the denominator is the average of $|Y_i|$ and $|\hat{Y}_i|$, the triangle inequality $|Y_i - \hat{Y}_i| \le |Y_i| + |\hat{Y}_i|$ bounds the contribution of any single observation to at most $200\%$. A zero target with a nonzero forecast contributes exactly $200\%$ rather than infinity, so a single zero-target anomaly cannot destabilize the global fitness function during evolutionary hyperparameter search. The only undefined case is $Y_i = \hat{Y}_i = 0$, where the term is $0/0$ and must be handled by convention.

---

## 2. Temporal Degradation Tracking

A core premise of time-series forecasting is that the assumption of exchangeability (i.i.d.) rarely holds: data-generating processes undergo Concept Drift over time.

To quantify this, the `detect_temporal_degradation` algorithm splits a contiguous, out-of-sample test array into two chronological halves, $\mathcal{H}_1$ and $\mathcal{H}_2$, and computes the relative change in RMSE between them:

$$\text{Degradation} (\%) = \left( \frac{\text{RMSE}_{\mathcal{H}_2} - \text{RMSE}_{\mathcal{H}_1}}{\text{RMSE}_{\mathcal{H}_1}} \right) \times 100$$

A value of $+20\%$ means that the RMSE on the later half is $20\%$ higher than on the earlier half. This is evidence that the fitted structure is losing validity over time, although a single split cannot by itself distinguish drift from an intrinsically harder second period. It signals the operational need to trigger the `AdaptiveTAM` or `KalmanTAM` meta-learners.

---

## 3. Residual Autocorrelation (The Durbin-Watson Proxy)

If a Generalized Additive Model captures the conditional expectation $\mu(X)$ and no temporal structure remains unmodeled, the residuals $\epsilon_t = Y_t - \hat{Y}_t$ should behave as White Noise ($\mathbb{E}[\epsilon_t \epsilon_{t-k}] = 0$ for all $k > 0$). If the residuals exhibit serial correlation, this suggests that the current topology is missing a time-dependent feature (e.g., an unmodeled daily seasonality or an auto-regressive tensor product).

The `analyze_residuals` module computes the **Lag-1 Autocorrelation** ($\rho_1$) as a computationally cheap proxy for the canonical Durbin-Watson ($DW$) statistic {cite:p}`durbin1950testing`. The classical $DW$ test evaluates:

$$DW = \frac{\sum_{t=2}^N (\epsilon_t - \epsilon_{t-1})^2}{\sum_{t=1}^N \epsilon_t^2}$$

Expanding the square in the numerator gives:

$$DW = \frac{\sum_{t=2}^N \epsilon_t^2 + \sum_{t=2}^N \epsilon_{t-1}^2 - 2 \sum_{t=2}^N \epsilon_t \epsilon_{t-1}}{\sum_{t=1}^N \epsilon_t^2}$$

For large $N$, each of the first two sums is approximately equal to the denominator, and the ratio of the cross term to the denominator is approximately $\rho_1$, so that:

$$DW \approx 2(1 - \rho_1)$$

The lag-1 autocorrelation is estimated as the Pearson correlation between $\epsilon_t$ and $\epsilon_{t-1}$, which for a stationary residual series reduces to:

$$\rho_1 = \frac{\text{Cov}(\epsilon_t, \epsilon_{t-1})}{\text{Var}(\epsilon_t)}$$

This gives the orchestrator an inexpensive check on the structural integrity of the time-domain. A value of $\rho_1 \approx 0$ corresponds to $DW \approx 2$ (no first-order autocorrelation). As $\rho_1 \to 1$, the $DW$ statistic approaches $0$, signaling severe positive autocorrelation and triggering the `AutoTAM` knowledge graph to propose deeper temporal spline expansions. Conversely, as $\rho_1 \to -1$, $DW$ approaches $4$.
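The relationship between $\rho_1$ and $DW$ can be checked numerically. The following is a minimal sketch in plain Python, not the `analyze_residuals` implementation; the function names `lag1_autocorrelation` and `durbin_watson` and the two synthetic residual series are hypothetical and chosen only for illustration.

```python
import math


def lag1_autocorrelation(residuals):
    """Sample lag-1 autocorrelation: lag-1 autocovariance over variance.

    The residuals are mean-centered first, which is an assumption of this
    sketch (well-behaved residuals are close to zero-mean anyway).
    """
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [e - mean for e in residuals]
    denom = sum(c * c for c in centered)
    num = sum(centered[t] * centered[t - 1] for t in range(1, n))
    return num / denom


def durbin_watson(residuals):
    """Classical DW statistic: squared successive differences over sum of squares."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denom = sum(e * e for e in residuals)
    return num / denom


# Hypothetical residuals with a slow unmodeled cycle: strong positive autocorrelation.
smooth = [math.sin(2 * math.pi * t / 50) for t in range(200)]
# Hypothetical residuals that flip sign every step: strong negative autocorrelation.
alternating = [(-1.0) ** t for t in range(200)]

rho_smooth, dw_smooth = lag1_autocorrelation(smooth), durbin_watson(smooth)
rho_alt, dw_alt = lag1_autocorrelation(alternating), durbin_watson(alternating)

for name, rho, dw in [
    ("smooth", rho_smooth, dw_smooth),
    ("alternating", rho_alt, dw_alt),
]:
    print(f"{name}: rho_1={rho:.4f}, DW={dw:.4f}, 2(1-rho_1)={2 * (1 - rho):.4f}")
```

On both series the exact $DW$ value and the proxy $2(1 - \rho_1)$ agree closely; the small gap comes from the boundary terms that the large-$N$ approximation drops.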