Accuracy
Accuracy is the most common quantitative performance measure used to evaluate classification systems. It provides a single-number summary of a model’s success by measuring the fraction of instances it correctly identified out of the total population.
1. Calculating Accuracy
In the context of a Confusion Matrix, accuracy is computed by summing the correct predictions (the main diagonal) and dividing by the total number of examples.
- Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN).
- Alternative Definition: Accuracy can also be expressed as 1 − Error Rate, where the error rate is the proportion of incorrect predictions.
- Multiclass Extension: For tasks with more than two classes, accuracy remains the ratio of correct predictions to the total sample size, though the Confusion Matrix expands to a K×K grid.
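The formula above can be sketched in plain Python. This is a minimal illustration (the matrix values are made up): because accuracy is just the main diagonal over the total, the same function covers both the binary 2×2 case and any K×K multiclass grid.

```python
def accuracy_from_confusion(cm):
    # cm: K x K list of lists, rows = actual class, columns = predicted class.
    correct = sum(cm[i][i] for i in range(len(cm)))  # main diagonal = correct predictions
    total = sum(sum(row) for row in cm)              # all cells = total examples
    return correct / total

# Binary case, laid out as [[TN, FP], [FN, TP]]:
binary_cm = [[50, 10],
             [5, 35]]
print(accuracy_from_confusion(binary_cm))  # (50 + 35) / 100 = 0.85

# Multiclass (3x3): same formula, larger grid.
multi_cm = [[30, 2, 1],
            [3, 25, 2],
            [1, 1, 35]]
print(accuracy_from_confusion(multi_cm))   # 90 / 100 = 0.9
```

Note that 1 − accuracy_from_confusion(cm) gives the error rate, matching the alternative definition.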
2. How to Interpret Accuracy
Accuracy is most effectively interpreted when errors in all classes are judged to be equally important. For example, in an object recognition task for a domestic robot, misidentifying a chair may be no more or less critical than misidentifying a table.
However, interpreting a high accuracy score requires context:
- Generalization: High accuracy on training data often indicates nothing more than the model's ability to memorize the dataset (overfitting). True performance must be measured on a test set or through cross-validation to estimate how well the model generalizes to unseen data.
- Baselines: An accuracy score is meaningless without a baseline. A model should always be compared to a random baseline or a simple heuristic (like always predicting the most frequent class) to determine if it has learned meaningful patterns.
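The most-frequent-class heuristic mentioned above is simple to compute, and doing so before any modeling sets the bar a real model must clear. A minimal sketch (the label set is hypothetical):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    # Accuracy of a "model" that always predicts the most frequent class:
    # it is right exactly as often as that class occurs.
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical dataset: 90 examples of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
print(majority_baseline_accuracy(labels))  # 0.9
```

A trained model scoring 91% on this data has barely outperformed doing nothing intelligent at all.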
3. The Critical Flaw: Imbalanced Datasets
The most significant limitation of accuracy is its tendency to be highly misleading on skewed or imbalanced datasets. In many real-world scenarios, one class is much more frequent than others.
- The Trivial Solution: If a medical screening task for a rare disease has a 0.1% prevalence (1 in 1,000 people), a "dumb" model that always predicts "healthy" will achieve 99.9% accuracy while failing to detect a single actual case of the disease.
- Domination by Majority: Because accuracy treats all samples equally, the model's performance on the majority class will dominate the metric, masking poor performance on the minority class that is often of higher interest or danger.
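The rare-disease scenario above can be reproduced in a few lines. This sketch uses the same numbers as the text (1 case in 1,000 patients) and shows the headline accuracy alongside the number of cases actually caught:

```python
# 1,000 patients at 0.1% prevalence: exactly one actual case of the disease.
actual = ["sick"] + ["healthy"] * 999
predicted = ["healthy"] * 1000  # the "dumb" always-healthy model

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
detected = sum(a == "sick" and p == "sick"
               for a, p in zip(actual, predicted))

print(accuracy)  # 0.999 -- looks excellent
print(detected)  # 0 -- not a single case caught
```

The 99.9% figure is entirely driven by the majority class, which is exactly the domination effect described above.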
4. Advanced Accuracy Metrics
To address the limitations of basic accuracy, practitioners use more nuanced versions:
- Cost-Sensitive Accuracy: This involves assigning a positive "cost" weight to specific types of mistakes (False Positives vs. False Negatives). The counts for FP and FN are multiplied by these costs before calculating the final score, reflecting that missing a cancer diagnosis is more costly than a false alarm.
- Per-Class Accuracy: This calculates the accuracy for each individual class and then takes the average. This ensures that the model is performing well across all categories, regardless of their size.
- Balanced Error Rate (BER): This is the average of the per-class error rates, designed to prevent the metric from being dominated by the most frequent classes.
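Per-class accuracy and BER can be sketched directly from a confusion matrix. In this illustration (the matrix values are made up), per-class accuracy is taken as each class's diagonal count over its row total, and BER is one minus their average; note how the skewed example below looks strong under plain accuracy but poor under the balanced metrics:

```python
def per_class_rates(cm):
    # cm: K x K list of lists, rows = actual class, columns = predicted class.
    # Per-class accuracy for class i: correct predictions / actual instances of i.
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

def balanced_accuracy(cm):
    # Unweighted average of the per-class rates: every class counts equally.
    rates = per_class_rates(cm)
    return sum(rates) / len(rates)

def balanced_error_rate(cm):
    # BER is the average of the per-class error rates.
    return 1 - balanced_accuracy(cm)

# Skewed example: 990 actual negatives, only 10 actual positives.
cm = [[980, 10],  # actual negative: 980 right, 10 wrong
      [8, 2]]     # actual positive: only 2 of 10 caught
plain = (980 + 2) / 1000
print(plain)                   # 0.982 -- plain accuracy looks great
print(balanced_accuracy(cm))   # roughly 0.59 -- the balanced view does not
print(balanced_error_rate(cm))
```

Cost-sensitive variants follow the same pattern, multiplying the FP and FN counts by their assigned cost weights before forming the ratio.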
5. Accuracy in the Model Lifecycle
- Learning Curves: By plotting accuracy against the size of the training set, analysts can diagnose model problems. High bias (underfitting) is indicated when both training and validation accuracy are low. High variance (overfitting) is indicated by a large gap between high training accuracy and significantly lower validation accuracy.
- Trade-offs: Accuracy often trades off against other properties like compute requirements (complex models may be more accurate but slower) or interpretability (simple models like logistic regression may be less accurate but easier to explain).
- Slice-based Evaluation: Rather than looking at a single aggregate accuracy score, it is crucial to evaluate accuracy on individual slices or subgroups of data to detect hidden biases or performance drops in critical segments (e.g., lower accuracy for a specific demographic group).
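The slice-based evaluation described above amounts to grouping predictions by a slice key before computing accuracy. A minimal sketch (the group names and counts are hypothetical) showing how a respectable aggregate score can hide a weak slice:

```python
def slice_accuracies(records):
    # records: iterable of (group, actual, predicted) tuples.
    totals, correct = {}, {}
    for group, actual, predicted in records:
        totals[group] = totals.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (actual == predicted)
    return {g: correct[g] / totals[g] for g in totals}

# Hypothetical results: strong on slice A, much weaker on slice B.
records = (
    [("group_a", 1, 1)] * 95 + [("group_a", 1, 0)] * 5 +   # 95% on slice A
    [("group_b", 1, 1)] * 60 + [("group_b", 1, 0)] * 40    # 60% on slice B
)
overall = sum(a == p for _, a, p in records) / len(records)
print(overall)                  # 0.775 aggregate
print(slice_accuracies(records))
```

A single 77.5% figure gives no hint that one subgroup is served far worse than the other, which is exactly the kind of hidden performance drop slice-based evaluation is meant to surface.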