
Accuracy

Accuracy is one of the most straightforward and widely used evaluation metrics for classification models. It measures the proportion of correctly predicted instances out of the total number of predictions made. In simple terms, accuracy answers the question: "How often is the classifier right?"

The value of accuracy ranges between 0 and 1, or 0% to 100% when expressed as a percentage. A value of 1 (or 100%) indicates perfect predictions, while a value of 0 means that the model predicted every instance incorrectly.

Definition and Intuition

Accuracy is a classification metric that quantifies how often a machine learning model makes correct predictions. Formally, it is defined as the ratio of correctly predicted observations to the total number of observations. The metric gives a simple scalar value, usually expressed as a percentage, that indicates the proportion of predictions the model got right.

Model Performance Insights

High Accuracy (closer to 1 or 100%)

Indicates that the model is correctly predicting a large number of cases. This usually suggests a good model fit, especially when the dataset is balanced and prediction costs are uniform.

Moderate Accuracy (around 0.5 or 50%)

May suggest that the model is not much better than random guessing, especially in binary classification. It could point to a need for better features, model tuning, or data preprocessing.

Low Accuracy (close to 0)

Implies poor model performance: most predictions are incorrect. This is often a sign of severe underfitting or a fundamentally flawed model or dataset.

A high accuracy does not always mean a good model, especially in imbalanced datasets, where one class dominates. For example, in a dataset where 95% of samples belong to class A, a model that always predicts class A will achieve 95% accuracy but is effectively useless for detecting class B.

Core Concepts

Simplicity and Interpretability

Accuracy is easy to compute and understand, making it a natural first step in model evaluation. For many real-world classification tasks it's the go-to metric, especially when misclassification costs are equal.

Class Balance Sensitivity

Accuracy assumes that all classes are equally important and balanced. In datasets with significant class imbalance, accuracy may give a false impression of model performance.

Use in Multi-Class Problems

In multi-class classification, accuracy is still defined as the number of correct predictions divided by the total number of samples. It works in the same way, though individual class performance can get obscured. For a better understanding of which classes are misclassified, confusion matrices or class-wise metrics are more informative.

Diagnostic, Not Prescriptive

Accuracy tells you how well your model is doing overall but doesn’t explain why it's performing that way. It doesn’t reveal error patterns or guide model improvement.

Threshold Dependence (for Probabilistic Models)

In models that output probabilities (e.g., logistic regression, neural networks), the final class prediction often depends on a decision threshold (commonly 0.5). Changing the threshold changes the predicted labels, and therefore the accuracy, which can be misleading if the threshold is not evaluated carefully.
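
A minimal sketch of this effect in Rust: the labels and predicted probabilities below are invented for illustration, not the output of a real model, but they show how shifting the threshold changes the hard labels and therefore the measured accuracy.

fn threshold_predictions(probs: &[f64], threshold: f64) -> Vec<u8> {
    // Convert each probability into a hard 0/1 label using the given threshold.
    probs.iter().map(|&p| if p >= threshold { 1 } else { 0 }).collect()
}

fn accuracy(actual: &[u8], predicted: &[u8]) -> f64 {
    let correct = actual.iter().zip(predicted).filter(|(a, p)| a == p).count();
    correct as f64 / actual.len() as f64
}

fn main() {
    // Illustrative data: true labels and the model's predicted probabilities.
    let actual = vec![1u8, 0, 1, 1, 0];
    let probs = vec![0.65, 0.40, 0.55, 0.80, 0.45];

    for &t in &[0.5f64, 0.6, 0.7] {
        let preds = threshold_predictions(&probs, t);
        println!("threshold {:.1} -> accuracy {:.2}", t, accuracy(&actual, &preds));
    }
}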

Mathematical Formulation

Accuracy is mathematically defined as the proportion of correct predictions among all predictions made by the model.

Binary and Multi-class Accuracy

For binary classification:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
  • TP - True Positives : correctly predicted positive instances.
  • TN - True Negatives : correctly predicted negative instances.
  • FP - False Positives : negative instances incorrectly predicted as positive.
  • FN - False Negatives : positive instances incorrectly predicted as negative.

For multi-class classification:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

This simply counts all correct predictions across all classes and divides by the total number of instances.
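
As a small illustrative sketch (the three-class labels below are made up), the multi-class computation is the same equality check as in the binary case, just over more than two label values.

// Multi-class accuracy: count exact label matches and divide by the total.
fn accuracy(actual: &[u8], predicted: &[u8]) -> f64 {
    let correct = actual.iter().zip(predicted).filter(|(a, p)| a == p).count();
    correct as f64 / actual.len() as f64
}

fn main() {
    // Illustrative three-class labels (classes 0, 1 and 2).
    let actual = vec![0u8, 1, 2, 2, 1, 0];
    let predicted = vec![0u8, 2, 2, 2, 1, 1];

    // Correct at indices 0, 2, 3 and 4 -> 4 / 6 ≈ 0.67
    println!("Multi-class accuracy: {:.2}", accuracy(&actual, &predicted));
}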

Calculation Procedure

  1. Generate predictions for all samples in the dataset using your model.
  2. Compare predicted labels with the actual (true) labels.
  3. Count the number of correct predictions (i.e., where predicted label equals the true label).
  4. Divide the number of correct predictions by the total number of samples:
$$\text{Accuracy} = \frac{\sum_{i=1}^{n} \mathbf{1}(y_i = \hat{y}_i)}{n}$$
  • \(y_i\) is the true label.
  • \(\hat{y}_i\) is the predicted label.
  • \(\mathbf{1}(\cdot)\) is an indicator function that returns 1 if the argument is true, and 0 otherwise.
  • \(n\) is the total number of predictions.

Example

Let’s walk through an example of computing accuracy for a binary classification problem.

True labels:

\(y = [1, 0, 1, 1, 0, 0, 1, 0]\)

Predicted labels:

\(\hat{y} = [1, 0, 0, 1, 0, 1, 1, 0]\)

Step 1: Compare each prediction with the true label

Index   True Label (\(y_i\))   Predicted Label (\(\hat{y}_i\))   Correct?
0       1                      1                                 Yes
1       0                      0                                 Yes
2       1                      0                                 No
3       1                      1                                 Yes
4       0                      0                                 Yes
5       0                      1                                 No
6       1                      1                                 Yes
7       0                      0                                 Yes

Step 2: Count correct predictions

Correct predictions: indices 0, 1, 3, 4, 6, 7 ⇒ 6 correct predictions.

Step 3: Total number of predictions

8 predictions.

Step 4: Compute Accuracy

$$\text{Accuracy} = \frac{6}{8} = 0.75$$
The model achieved an accuracy of 75%, meaning it correctly classified 6 out of 8 instances.

Properties and Behavior

Understanding the properties of accuracy is essential to interpreting it correctly and knowing when it may or may not be an appropriate performance measure. While accuracy is intuitive and easy to compute, it comes with specific behaviors that can mislead model evaluation.

Typical Value Ranges

Best case:

All predictions are correct.

\(\text{Accuracy} = 1.0 \quad \text{(or 100%)}\)

Worst case:

All predictions are incorrect.

\(\text{Accuracy} = 0.0 \quad \text{(or 0%)}\)

Baseline (random or majority guess):

The baseline accuracy depends on the class distribution. For example:

  • In a balanced binary classification task, random guessing yields ~50% accuracy.
  • In an imbalanced dataset (e.g., 90% of instances in one class), always predicting the majority class gives 90% accuracy, but this doesn’t mean the model is good.

Accuracy is meaningful when classes are balanced and the cost of misclassification is uniform.

Accuracy is misleading when one class dominates or when different types of errors have unequal costs.

Sensitivity to Outliers or Noise

Unlike regression metrics such as MSE or RMSE, accuracy is not sensitive to the magnitude of prediction error: it only considers whether a prediction is right or wrong, not how wrong.

Label noise:

If true labels are incorrectly assigned, accuracy may drop even if the model is making reasonable predictions.

Outliers in class distribution:

A few rare class samples (minority class) may be ignored entirely by a model that optimizes for accuracy. This leads to high accuracy but poor recall or F1-score for the minority class.

In a dataset with 99% of samples labeled as class 0 and 1% as class 1, a model that always predicts class 0 will have 99% accuracy, but 0% recall for class 1.

Differentiability and Role in Optimization

Accuracy is not differentiable, which makes it unsuitable as a loss function for gradient-based optimization. It involves discrete comparisons (whether a prediction matches the label), which are non-differentiable operations, so we cannot compute gradients of accuracy with respect to model parameters (e.g., weights in a neural network).

Instead, differentiable loss functions are used during training, such as Binary Cross-Entropy, Categorical Cross-Entropy or Hinge Loss.

Accuracy is then used after training as a diagnostic or validation metric to evaluate how well the model is performing on unseen data.
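
As a rough illustration (labels and probabilities invented for this sketch), binary cross-entropy varies smoothly with the model's confidence, while accuracy would score both sets of predictions below identically because they produce the same hard labels.

// Binary cross-entropy responds to prediction confidence; accuracy does not.
fn binary_cross_entropy(y_true: &[f64], p_pred: &[f64]) -> f64 {
    let n = y_true.len() as f64;
    y_true.iter()
        .zip(p_pred)
        .map(|(&y, &p)| -(y * p.ln() + (1.0 - y) * (1.0 - p).ln()))
        .sum::<f64>() / n
}

fn main() {
    let y = vec![1.0, 0.0, 1.0];

    // Both prediction sets yield the same hard labels at a 0.5 threshold
    // (and hence the same accuracy), but very different loss values.
    let hesitant = vec![0.55, 0.45, 0.60];
    let confident = vec![0.95, 0.05, 0.90];

    println!("BCE (hesitant):  {:.3}", binary_cross_entropy(&y, &hesitant));
    println!("BCE (confident): {:.3}", binary_cross_entropy(&y, &confident));
}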

Assumptions and Limitations

Despite its simplicity, accuracy makes several implicit assumptions that can lead to misleading interpretations.

Equal Class Distribution:

Assumes that all classes occur in roughly equal proportions. In practice, many datasets are imbalanced, where one or more classes dominate. In such cases, accuracy favors the majority class.

Equal Error Costs:

Assumes that the cost of all types of misclassification is the same. False negatives might be much worse than false positives (e.g., in cancer detection). Accuracy does not capture this distinction.

No Insight into Confidence:

Accuracy doesn’t consider the confidence level of predictions. A model that’s 51% confident in its predictions is treated the same as one that’s 99% confident, as long as both get the label correct.

Single Threshold Dependence:

For probabilistic models, accuracy is calculated using a specific classification threshold (e.g., 0.5). Changing the threshold can significantly alter accuracy. Other metrics like ROC-AUC or F1-score offer a more threshold-independent view.

While accuracy is easy to understand and compute, it’s not always trustworthy, especially in high-stakes, imbalanced, or cost-sensitive classification tasks. Always consider the context and complement accuracy with other classification metrics.

Code Example

/// Computes classification accuracy: the fraction of positions where the
/// predicted label matches the actual label.
fn calculate_accuracy(actual: &[u8], predicted: &[u8]) -> f64 {
    if actual.len() != predicted.len() {
        panic!("Length of actual and predicted arrays must be the same.");
    }

    let correct = actual.iter()
        .zip(predicted.iter())
        .filter(|(a, p)| a == p)
        .count();

    let total = actual.len();

    correct as f64 / total as f64
}

fn main() {
    // Example data
    let actual_labels = vec![1, 0, 1, 1, 0, 0, 1, 0];
    let predicted_labels = vec![1, 0, 0, 1, 0, 1, 1, 0];

    // Calculate accuracy
    let accuracy = calculate_accuracy(&actual_labels, &predicted_labels);

    // Output result
    println!("Accuracy: {:.2}%", accuracy * 100.0);
}

Explanation

Input Validation:

Ensures that actual and predicted slices have the same length. This is essential to avoid runtime panics or incorrect calculations.

Main Logic:

  • The zip function pairs each corresponding element.
  • The filter keeps only matching pairs (correct predictions).
  • The count() gives the total number of correct predictions.

Accuracy Calculation:

The expression correct as f64 / total as f64 converts both counts to floating-point numbers and divides them to obtain the proportion of correct predictions.

Output

Accuracy: 75.00%
  • True Labels: [1, 0, 1, 1, 0, 0, 1, 0]
  • Predictions: [1, 0, 0, 1, 0, 1, 1, 0]

Correct predictions are at indices 0, 1, 3, 4, 6, 7 (6 out of 8):

\(\frac{6}{8} = 0.75 \Rightarrow 75\%\)

Alternative Metrics

While accuracy is a widely used and intuitive metric, it has significant limitations, especially when dealing with imbalanced datasets, cost-sensitive tasks, or when a deeper understanding of error types is needed. To address these limitations, several alternative metrics are used to provide a more nuanced view of classification performance.

Precision

Precision (also known as Positive Predictive Value) measures the proportion of positive predictions that were actually correct.

$$\text{Precision} = \frac{TP}{TP + FP}$$
  • High precision means that when the model predicts a positive class, it's usually right.
  • Especially important in cases where false positives are costly (e.g., spam filters, fraud detection).

"Out of all the samples predicted as positive, how many are truly positive?"

👉 A detailed explanation of Precision can be found in the section: Precision

Recall

Recall (also called Sensitivity or True Positive Rate) measures the proportion of actual positives that were correctly identified.

$$\text{Recall} = \frac{TP}{TP + FN}$$
  • High recall means the model captures most of the actual positives.
  • Critical when false negatives are costly (e.g., disease detection, security screening).

"Out of all actual positive samples, how many did we correctly find?"

👉 A detailed explanation of Recall can be found in the section: Recall

F1 Score

The F1 Score is the harmonic mean of precision and recall. It provides a single score that balances both concerns, especially when we need to avoid both false positives and false negatives.

$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
  • Best used when class distribution is uneven or when both types of error matter.
  • Ranges from 0 (worst) to 1 (perfect).

A low F1 indicates an imbalance between precision and recall.

👉 A detailed explanation of F1 Score can be found in the section: F1 Score
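
Putting the three formulas above together, here is a minimal Rust sketch that derives precision, recall, and F1 from the TP, FP, and FN counts of the worked example used earlier on this page.

// Precision, recall and F1 computed from TP/FP/FN counts.
fn main() {
    // Same toy labels as in the accuracy example above.
    let actual = vec![1u8, 0, 1, 1, 0, 0, 1, 0];
    let predicted = vec![1u8, 0, 0, 1, 0, 1, 1, 0];

    let mut tp = 0.0;
    let mut fp = 0.0;
    let mut fn_ = 0.0;
    for (&a, &p) in actual.iter().zip(&predicted) {
        match (a, p) {
            (1, 1) => tp += 1.0,
            (0, 1) => fp += 1.0,
            (1, 0) => fn_ += 1.0,
            _ => {} // true negatives are not needed for these three metrics
        }
    }

    let precision = tp / (tp + fp);
    let recall = tp / (tp + fn_);
    let f1 = 2.0 * precision * recall / (precision + recall);

    println!("Precision: {:.2}, Recall: {:.2}, F1: {:.2}", precision, recall, f1);
}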

ROC-AUC

ROC-AUC evaluates a model’s ability to distinguish between classes at various classification thresholds.

  • ROC Curve plots True Positive Rate (Recall) vs. False Positive Rate.
  • AUC (Area Under Curve) measures overall performance:
    • 0.5 = random guessing.
    • 1.0 = perfect separation.

Ideal for comparing classifiers across different thresholds, especially in binary classification.
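
For completeness, here is a compact sketch of AUC using the rank-sum (Mann-Whitney) formulation; it assumes no tied scores, and the labels and scores are toy values rather than real model output.

// ROC-AUC via the rank-sum formulation (no tie handling).
fn roc_auc(labels: &[u8], scores: &[f64]) -> f64 {
    // Pair each score with its label and sort ascending by score.
    let mut pairs: Vec<(f64, u8)> = scores.iter().copied().zip(labels.iter().copied()).collect();
    pairs.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());

    let n_pos = labels.iter().filter(|&&l| l == 1).count() as f64;
    let n_neg = labels.len() as f64 - n_pos;

    // Sum of the 1-based ranks of the positive examples.
    let rank_sum: f64 = pairs.iter()
        .enumerate()
        .filter(|(_, pair)| pair.1 == 1)
        .map(|(i, _)| (i + 1) as f64)
        .sum();

    (rank_sum - n_pos * (n_pos + 1.0) / 2.0) / (n_pos * n_neg)
}

fn main() {
    // Toy labels and scores for illustration.
    let labels = vec![0u8, 0, 1, 1];
    let scores = vec![0.1, 0.4, 0.35, 0.8];
    println!("ROC-AUC: {:.2}", roc_auc(&labels, &scores));
}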

Log Loss

Measures the uncertainty of predictions based on their confidence. It penalizes confident but wrong predictions more heavily than uncertain ones.

$$\text{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[y_i \log(\hat{p}_i) + (1 - y_i)\log(1 - \hat{p}_i)\right]$$
  • \(y_i\) is the true label (0 or 1).
  • \(\hat{p}_i\) is the predicted probability of the positive class.

Useful in probabilistic models where confidence matters, like logistic regression or neural networks.

Balanced Accuracy

Accounts for class imbalance by averaging the recall obtained on each class:

$$\text{Balanced Accuracy} = \frac{1}{2} \left( \frac{TP}{TP + FN} + \frac{TN}{TN + FP} \right)$$

Helps prevent the inflated performance that can occur with traditional accuracy on imbalanced data.
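
A quick sketch on a deliberately imbalanced toy dataset (nine samples of one class, one of the other, invented for illustration) shows how balanced accuracy exposes a majority-class predictor that plain accuracy rewards.

// Plain accuracy vs. balanced accuracy for a majority-class predictor.
fn main() {
    // Nine samples of class 0, one of class 1; the "model" always predicts 0.
    let actual = vec![0u8, 0, 0, 0, 0, 0, 0, 0, 0, 1];
    let predicted = vec![0u8; 10];

    let (mut tp, mut tn, mut fp, mut fn_) = (0.0, 0.0, 0.0, 0.0);
    for (&a, &p) in actual.iter().zip(&predicted) {
        match (a, p) {
            (1, 1) => tp += 1.0,
            (0, 0) => tn += 1.0,
            (0, 1) => fp += 1.0,
            _ => fn_ += 1.0,
        }
    }

    let accuracy = (tp + tn) / (tp + tn + fp + fn_);
    let balanced = 0.5 * (tp / (tp + fn_) + tn / (tn + fp));

    // Accuracy is 0.90 here, while balanced accuracy is only 0.50.
    println!("Accuracy: {:.2}, Balanced accuracy: {:.2}", accuracy, balanced);
}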

Confusion Matrix

A confusion matrix provides a complete breakdown of the classification results across all predicted and actual classes.

For binary classification:

              Predicted: 0    Predicted: 1
Actual: 0     TN              FP
Actual: 1     FN              TP

From this matrix, you can derive accuracy, precision, recall, specificity, and more.

It provides insight into what types of errors the model is making.
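
As a small sketch, the matrix above can be built with a pair of nested counters; this reuses the same toy labels as the earlier accuracy example.

// Build a 2x2 confusion matrix indexed as matrix[actual][predicted].
fn main() {
    let actual = vec![1u8, 0, 1, 1, 0, 0, 1, 0];
    let predicted = vec![1u8, 0, 0, 1, 0, 1, 1, 0];

    let mut matrix = [[0u32; 2]; 2];
    for (&a, &p) in actual.iter().zip(&predicted) {
        matrix[a as usize][p as usize] += 1;
    }

    println!("            Predicted: 0   Predicted: 1");
    println!("Actual: 0   TN = {}         FP = {}", matrix[0][0], matrix[0][1]);
    println!("Actual: 1   FN = {}         TP = {}", matrix[1][0], matrix[1][1]);
}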

Advantages and Disadvantages

Accuracy is one of the most well-known and frequently used evaluation metrics for classification tasks. However, while it’s intuitive and easy to compute, its effectiveness depends heavily on the context in which it’s applied.

Advantages:

  • Easy to understand: it answers "What percentage of predictions were correct?" and is often the first metric new practitioners learn.
  • Requires only a straightforward comparison of predicted vs. actual labels. No complex math, no probability thresholds, no parameter tuning.
  • When class distributions are roughly equal and all misclassification costs are the same, accuracy provides a reliable measure of model performance.
  • Acts as a quick sanity check or starting point for evaluating classification models. Useful for comparing with naive models (e.g. majority class predictors).

Disadvantages:

  • High accuracy can be achieved by always predicting the majority class.
  • Doesn’t distinguish between false positives and false negatives. Offers no insight into which class is being misclassified, or how frequently.
  • Assumes that all types of errors have equal cost, which is rarely true in real-world applications (e.g., fraud detection, medical diagnosis).
  • Treats all predictions as either right or wrong. Disregards the confidence of predictions, which is crucial in many probabilistic or risk-aware applications.
  • For models outputting probabilities, accuracy depends on the classification threshold (often 0.5 by default). Changing the threshold can significantly affect the accuracy.

Best used when: classes are balanced, and all misclassifications are equally costly.

Avoid relying on accuracy alone: when classes are imbalanced, or error costs differ significantly.

Conclusion

Accuracy is the most straightforward and commonly used metric for evaluating classification models. It answers a simple question: What proportion of predictions were correct? Thanks to its simplicity and ease of interpretation, it's often the first go-to metric when assessing model performance.

In order to make informed decisions, we should complement accuracy with more insightful metrics such as precision, recall, F1 score, ROC-AUC, or even visual tools like the confusion matrix. The choice of metric should always be guided by the problem domain, data characteristics, and business objectives. Use accuracy as a baseline metric, but never as your only one.
