- Regression Metrics
- Mean Squared Error (MSE): The most common regression metric, measuring the average of the squares of the errors. It is sensitive to outliers because errors are squared.
- Root Mean Squared Error (RMSE): The square root of MSE, which brings the error back to the same scale and units as the target variable.
- Mean Absolute Error (MAE): The average of the absolute differences between predictions and targets. It is more robust to outliers than MSE.
- Coefficient of Determination (R²): Measures the proportion of variance in the target variable that is explained by the model.
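The four regression metrics above can be sketched in a few lines of pure Python. This is a minimal illustration, not a production implementation; the function name `regression_metrics` and the toy data are ours.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R^2 for paired lists of targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n          # squaring makes MSE outlier-sensitive
    rmse = math.sqrt(mse)                          # same units as the target
    mae = sum(abs(e) for e in errors) / n          # more robust to outliers
    mean_t = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total variance around the mean
    r2 = 1 - ss_res / ss_tot                       # fraction of variance explained
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

print(regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))
```

Note how a single large error would inflate MSE (and hence RMSE) far more than MAE, which is exactly the outlier sensitivity described above.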
- Ranking and Sequence Metrics
- Discounted Cumulative Gain (DCG) and nDCG: These metrics evaluate the quality of ranking in search engines by accounting for the position of relevant results; higher positions contribute more to the score.
- BLEU Score (Bilingual Evaluation Understudy): A standard metric for evaluating machine translation quality by comparing n-gram overlaps between the model's output and human-provided targets.
- Perplexity: A common metric for language models, defined as the exponentiated average negative log-likelihood of a sample; lower perplexity means the model assigns higher probability to the observed text.
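Two of the metrics in this section are easy to sketch directly from their definitions. Below is a minimal pure-Python illustration of nDCG (position-discounted relevance, normalized by the ideal ordering) and perplexity (exponentiated mean negative log-probability); the function names and example values are ours, not from any particular library.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance at 1-based position i is divided by log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalized DCG: divide by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability the model assigned to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# The ideal ordering scores exactly 1.0; swapping the top two results scores below 1.0.
print(ndcg([3, 2, 1, 0]))  # 1.0
print(ndcg([2, 3, 1, 0]))  # < 1.0

# A model that assigns probability 1/4 to every token has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

The logarithmic discount in DCG is what makes the top positions dominate the score, matching the intuition that users rarely look past the first few results.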
- Unsupervised and Clustering Metrics
- Adjusted Rand Index (ARI): Measures the similarity between two different clusterings (e.g., predicted clusters vs. ground truth labels) while correcting for chance.
- Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters, providing a way to assess cluster compactness and separation without ground truth.
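The Adjusted Rand Index has a closed-form pair-counting formula over the contingency table of two labelings. The sketch below is a minimal illustration of that formula under the assumption of non-degenerate input (at least two distinct clusters on each side); the function name is ours.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI via pair counting: (index - expected index) / (max index - expected index)."""
    n = len(labels_a)
    # Pairs of points that fall in the same cluster under both, under A only, under B only.
    contingency = Counter(zip(labels_a, labels_b))
    same_both = sum(comb(c, 2) for c in contingency.values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = same_a * same_b / total_pairs  # chance-level agreement
    max_index = (same_a + same_b) / 2
    if max_index == expected:  # degenerate case, e.g. everything in one cluster
        return 1.0
    return (same_both - expected) / (max_index - expected)

# Identical clusterings (up to renaming the labels) score exactly 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because the expected index is subtracted out, a random labeling scores near 0 and disagreement can even go negative, which is the "correcting for chance" property mentioned above.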
- Theoretical and Model Capacity Metrics
- VC Dimension (Vapnik-Chervonenkis Dimension): A measure of the complexity or "wiggliness" of a class of functions, used to provide theoretical bounds on generalization error.
- Information Criteria (AIC/BIC): Used for model selection, these metrics penalize model complexity (the number of parameters) to prevent overfitting.
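The information criteria have simple closed forms: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of parameters, n the sample size, and ln L the maximized log-likelihood. A minimal sketch (the function names and example numbers are illustrative, not from a specific library):

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L; the penalty grows with sample size."""
    return k * math.log(n) - 2 * log_likelihood

# A better-fitting model with many more parameters can win on AIC yet lose on BIC,
# because BIC's ln(n) penalty is harsher for large samples.
print(aic(-100.0, k=3), bic(-100.0, k=3, n=500))
print(aic(-90.0, k=10), bic(-90.0, k=10, n=500))
```

In both criteria the fit term rewards likelihood while the complexity term penalizes parameters, which is exactly the overfitting guard described above.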
- Advanced Interpretation Tools
- F1-Score: The harmonic mean of precision and recall, providing a single number to balance the trade-off between the two.
- Cohen’s Kappa statistic (κ): A statistic that measures model performance while accounting for the possibility of a classifier guessing correctly by chance. Values between 0.61 and 0.80 are generally considered "good".
- Cost-Sensitive Evaluation: In some scenarios, different errors have different consequences (e.g., missing a cancer diagnosis is worse than a false alarm). You can assign a cost matrix to weight FP and FN errors differently when evaluating performance.
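All three tools in this section can be computed from the four binary confusion counts (TP, FP, FN, TN). The sketch below is a minimal illustration, assuming a binary task and a simple two-entry cost matrix (cost per false positive, cost per false negative); the function name and example counts are ours.

```python
def binary_metrics(tp, fp, fn, tn, cost_fp=1.0, cost_fn=1.0):
    """F1, Cohen's kappa, and a weighted error cost from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # raw accuracy
    # Chance agreement: probability that model and truth both say "positive"
    # plus the probability that both say "negative".
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_yes + p_no
    kappa = (observed - expected) / (1 - expected)

    cost = cost_fp * fp + cost_fn * fn  # cost-sensitive total from the cost matrix
    return f1, kappa, cost

# Missing a positive (FN) costs 10x a false alarm (FP), as in the cancer example.
f1, kappa, cost = binary_metrics(tp=40, fp=10, fn=5, tn=45, cost_fp=1.0, cost_fn=10.0)
print(f1, kappa, cost)
```

For these counts kappa comes out at 0.70, inside the 0.61 to 0.80 band the text calls "good", even though raw accuracy (0.85) looks higher: the difference is exactly the chance-agreement correction.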