F1 Score

The F1 Score is an evaluation metric for classification models, especially in situations where there is an uneven class distribution or when both false positives and false negatives are costly. It is the harmonic mean of Precision and Recall, combining both into a single, balanced measure.
In many real-world classification problems, such as spam detection, fraud identification, or medical diagnosis, optimizing only precision or only recall can lead to misleading conclusions. The F1 Score addresses this by taking both into account, penalizing extreme values in either metric.
Definition and Intuition
The F1 Score is a classification metric that balances two important aspects of model performance: Precision (how many predicted positives are actually correct) and Recall (how many actual positives the model successfully identified).
It is defined as the harmonic mean of Precision and Recall, which gives a more conservative average than the arithmetic mean, especially when there's an imbalance between the two. The harmonic mean punishes extreme differences. For example, if either precision or recall is very low, the F1 Score will be low, even if the other is high.
Model Performance Insights
High F1 Score
Indicates the model has a good balance between precision and recall. It correctly identifies a large number of positive cases and does so with minimal false positives.
Low F1 Score
Suggests that the model is either missing many positives (low recall), making too many incorrect positive predictions (low precision), or both.
Trade-off Awareness
If our application needs to catch positives without overwhelming users with false alarms (e.g., search engines, recommender systems), the F1 Score can be a guiding metric.
When Not to Use
If the cost of false positives and false negatives is significantly different, it may be better to look at precision and recall separately or use a weighted variant (like the Fβ score).
Core Concepts
Harmonic Mean vs. Arithmetic Mean
The harmonic mean used in F1 gives more weight to the lower value. This ensures that the F1 Score will only be high if both precision and recall are high.
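As a quick illustration with hypothetical values, suppose precision is 0.9 and recall is 0.1:
$$\text{Arithmetic mean} = \frac{0.9 + 0.1}{2} = 0.5, \qquad \text{F1} = 2 \times \frac{0.9 \times 0.1}{0.9 + 0.1} = 0.18$$
The harmonic mean is pulled toward the weaker of the two values, so a model cannot score well by excelling at only one of them.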
Symmetry
F1 treats precision and recall as equally important. If our application cares more about one over the other, consider using Fβ Score, which allows weighting one more heavily.
Applicability to Imbalanced Data
In scenarios where the positive class is rare (e.g., fraud detection), accuracy can be misleading. The F1 Score focuses on the minority class performance, making it more suitable for such tasks.
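As a hypothetical illustration, consider a dataset of 1,000 samples containing only 10 positives, and a model that predicts every sample as negative:
$$\text{Accuracy} = \frac{990}{1000} = 0.99, \qquad \text{Recall} = \frac{0}{10} = 0 \;\Rightarrow\; \text{F1} = 0$$
Accuracy looks excellent, while the F1 Score exposes that the model never finds a single positive.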
Mathematical Formulation
The F1 Score is calculated using the harmonic mean of Precision and Recall. It combines both metrics into a single scalar that reflects the balance between them.
- Precision : \(\frac{TP}{TP + FP}\)
- Recall : \(\frac{TP}{TP + FN}\)
- TP - True Positives : correctly predicted positive instances.
- FP - False Positives : negative instances incorrectly predicted as positive.
- FN - False Negatives : positive instances incorrectly predicted as negative.
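Substituting these definitions, the F1 Score can also be written directly in terms of the confusion-matrix counts:
$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$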
Calculation Procedure
- Count Confusion Matrix Values
  - True Positives (TP): Correctly predicted positives.
  - False Positives (FP): Incorrectly predicted positives.
  - False Negatives (FN): Missed actual positives.
- Calculate Precision
  $$\text{Precision} = \frac{TP}{TP + FP}$$
- Calculate Recall
  $$\text{Recall} = \frac{TP}{TP + FN}$$
- Compute F1 Score
  $$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Example
Let’s use a binary classification example.
Confusion Matrix:
- True Positives (TP): 40
- False Positives (FP): 10
- False Negatives (FN): 20
Precision:
$$\text{Precision} = \frac{40}{40 + 10} = 0.8$$
Recall:
$$\text{Recall} = \frac{40}{40 + 20} \approx 0.6667$$
F1 Score:
$$\text{F1 Score} = 2 \times \frac{0.8 \times 0.6667}{0.8 + 0.6667} \approx 0.7273$$
Interpretation
An F1 Score of 0.7273 suggests that the model performs reasonably well in balancing both precision and recall. It correctly identifies many of the actual positives while keeping false positives at a moderate level.
Properties and Behavior
The F1 Score is a widely used metric due to its balanced nature, but understanding its behavior under different conditions is critical for using it appropriately in model evaluation.
Typical Value Ranges
The F1 Score ranges between 0 and 1:
- 1: Perfect precision and recall, every positive is correctly identified, and no false positives are made.
- 0: Either precision or recall is zero, indicating failure in either detecting positives or doing so accurately.
What counts as a “good” F1 Score is highly context-dependent, shaped by domain expectations and acceptable error trade-offs.
Sensitivity to Outliers or Noise
Outliers in Predictions:
If a model makes occasional extreme misclassifications (e.g., false positives on rare classes), the F1 Score can drop significantly, especially in small datasets.
Label Noise:
Incorrectly labeled data (especially false negatives or positives) directly impacts both precision and recall, and thus, the F1 Score.
Differentiability and Role in Optimization
The F1 Score, being based on discrete counts (TP, FP, FN), is not differentiable, which makes it unsuitable as a direct loss function in gradient-based optimization algorithms. In practice, models are trained with differentiable surrogates like binary cross-entropy or focal loss. The F1 Score is used post-training to evaluate performance. Some advanced techniques (e.g. reinforcement learning-inspired loss functions or F1-approximation objectives) can optimize for F1 directly, but they are complex and less commonly used.
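As a rough sketch of the idea behind such F1-approximation objectives (not a production implementation), the example below computes a "soft" F1 in which the confusion-matrix counts are replaced by sums over predicted probabilities. The function name, the ε smoothing term, and the sample values are illustrative assumptions; in practice gradients would come from an automatic-differentiation framework rather than plain Rust.

```rust
/// "Soft" F1: replaces hard TP/FP/FN counts with sums of predicted probabilities,
/// yielding a smooth function of the model outputs (illustrative sketch).
fn soft_f1(probs: &[f64], labels: &[f64]) -> f64 {
    let eps = 1e-8; // avoids division by zero when everything is negative
    let mut tp = 0.0;
    let mut fp = 0.0;
    let mut fn_ = 0.0;
    for (&p, &y) in probs.iter().zip(labels.iter()) {
        tp += p * y; // probability mass assigned to actual positives
        fp += p * (1.0 - y); // probability mass wrongly assigned to negatives
        fn_ += (1.0 - p) * y; // probability mass missed on actual positives
    }
    2.0 * tp / (2.0 * tp + fp + fn_ + eps)
}

fn main() {
    // Hypothetical predicted probabilities and 0/1 labels.
    let probs = [0.9, 0.2, 0.7, 0.4];
    let labels = [1.0, 0.0, 1.0, 1.0];
    // 1.0 - soft F1 could serve as a differentiable surrogate loss.
    println!("Soft F1: {:.4}", soft_f1(&probs, &labels));
}
```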
Assumptions and Limitations
The F1 Score assumes equal importance of precision and recall. If your application places more emphasis on one over the other, the Fβ Score (a generalized version) may be more appropriate.
- \(\beta > 1\) : more emphasis on recall.
- \(\beta < 1\) : more emphasis on precision.
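For reference, the generalized formula (with precision \(P\) and recall \(R\)) is:
$$F_\beta = (1 + \beta^2) \times \frac{P \times R}{\beta^2 \times P + R}$$
Setting \(\beta = 1\) recovers the standard F1 Score.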
F1 Score does not consider true negatives, making it unsuitable when the negative class is also important (e.g., evaluating overall correctness or resource usage).
Code Example
```rust
/// Precision = TP / (TP + FP); returns 0.0 when there are no predicted positives.
fn calculate_precision(tp: usize, fp: usize) -> f64 {
    if tp + fp == 0 {
        return 0.0;
    }
    tp as f64 / (tp + fp) as f64
}

/// Recall = TP / (TP + FN); returns 0.0 when there are no actual positives.
fn calculate_recall(tp: usize, fn_: usize) -> f64 {
    if tp + fn_ == 0 {
        return 0.0;
    }
    tp as f64 / (tp + fn_) as f64
}

/// F1 Score = harmonic mean of precision and recall; returns 0.0 when both are zero.
fn calculate_f1_score(tp: usize, fp: usize, fn_: usize) -> f64 {
    let precision = calculate_precision(tp, fp);
    let recall = calculate_recall(tp, fn_);
    if precision + recall == 0.0 {
        return 0.0;
    }
    2.0 * (precision * recall) / (precision + recall)
}

fn main() {
    // Example confusion matrix values
    let tp = 40;
    let fp = 10;
    let fn_ = 20;

    let f1 = calculate_f1_score(tp, fp, fn_);
    println!("F1 Score: {:.4}", f1);
}
```
Explanation
Input:
- tp: True Positives (correctly identified positives).
- fp: False Positives (incorrect positive predictions).
- fn_: False Negatives (missed actual positives).
Helper Functions:
- calculate_precision and calculate_recall compute the standard definitions with safeguards against division by zero.
Main Function:
- Computes the F1 Score using the harmonic mean formula.
- Outputs the score with 4 decimal places for clarity.
Output
F1 Score: 0.7273
An F1 Score of 0.7273 indicates a moderate balance between precision and recall, reflecting that the model is performing well but still missing some positives or including some false alarms.
Alternative Metrics
While the F1 Score provides a useful balance between precision and recall, it may not always be the best choice depending on the specific requirements of a task.
Precision
Precision (also known as Positive Predictive Value) measures the proportion of positive predictions that were actually correct.
- High precision means that when the model predicts a positive class, it's usually right.
- Especially important in cases where false positives are costly (e.g., spam filters, fraud detection).
"Out of all the samples predicted as positive, how many are truly positive?"
👉 A detailed explanation of Precision can be found in the section: Precision
Recall
Recall (also called Sensitivity or True Positive Rate) measures the proportion of actual positives that were correctly identified.
- High recall means the model captures most of the actual positives.
- Critical when false negatives are costly (e.g., disease detection, security screening).
"Out of all actual positive samples, how many did we correctly find?"
👉 A detailed explanation of Recall can be found in the section: Recall
Accuracy
Accuracy measures the overall proportion of correct predictions. While easy to understand, it can be misleading in imbalanced datasets where predicting the majority class yields high scores. Use it cautiously when class distributions are skewed.
👉 A detailed explanation of Accuracy can be found in the section: Accuracy
ROC-AUC
ROC-AUC evaluates a model’s ability to distinguish between classes at various classification thresholds.
- ROC Curve plots True Positive Rate (Recall) vs. False Positive Rate.
- AUC (Area Under Curve) measures overall performance:
  - 0.5 = random guessing.
  - 1.0 = perfect separation.
Ideal for comparing classifiers across different thresholds, especially in binary classification.
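As a minimal sketch (separate from the F1 example code above), ROC-AUC can be estimated as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The roc_auc function below and its sample inputs are illustrative assumptions based on that pairwise definition, with ties counted as 0.5.

```rust
/// Estimates ROC-AUC as the fraction of (positive, negative) score pairs ranked
/// correctly, counting ties as half a correct pair (illustrative sketch).
fn roc_auc(scores: &[f64], labels: &[bool]) -> f64 {
    let mut positives = Vec::new();
    let mut negatives = Vec::new();
    for (&score, &label) in scores.iter().zip(labels.iter()) {
        if label {
            positives.push(score);
        } else {
            negatives.push(score);
        }
    }
    if positives.is_empty() || negatives.is_empty() {
        return 0.5; // undefined without both classes; fall back to chance level
    }
    let mut correct = 0.0;
    for &p in &positives {
        for &n in &negatives {
            if p > n {
                correct += 1.0;
            } else if p == n {
                correct += 0.5;
            }
        }
    }
    correct / (positives.len() * negatives.len()) as f64
}

fn main() {
    // Hypothetical scores and labels.
    let scores = [0.9, 0.8, 0.35, 0.1];
    let labels = [true, false, true, false];
    println!("ROC-AUC: {:.2}", roc_auc(&scores, &labels));
}
```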
Fβ Score
The Fβ Score generalizes the F1 Score by letting you control the balance between precision and recall. When \(\beta > 1\), recall is given more weight; when \(\beta < 1\), precision is prioritized. This is useful in domains where one type of error is much more costly than the other.
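A minimal sketch of how the F1 function from the code example above could be generalized; the fbeta_score name and signature are illustrative, reusing the precision and recall values from the worked example.

```rust
/// Fβ = (1 + β²) · P · R / (β² · P + R); β > 1 favors recall, β < 1 favors precision.
fn fbeta_score(precision: f64, recall: f64, beta: f64) -> f64 {
    let beta_sq = beta * beta;
    let denominator = beta_sq * precision + recall;
    if denominator == 0.0 {
        return 0.0;
    }
    (1.0 + beta_sq) * precision * recall / denominator
}

fn main() {
    // Precision and recall from the worked example (0.8 and ~0.6667).
    println!("F1   : {:.4}", fbeta_score(0.8, 0.6667, 1.0)); // equal weighting
    println!("F2   : {:.4}", fbeta_score(0.8, 0.6667, 2.0)); // emphasizes recall
    println!("F0.5 : {:.4}", fbeta_score(0.8, 0.6667, 0.5)); // emphasizes precision
}
```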
Matthews Correlation Coefficient
MCC is a correlation-based metric that considers all four outcomes in the confusion matrix. It is especially valuable for imbalanced datasets, providing a balanced measure even when class sizes differ. An MCC close to 1 indicates a strong predictive relationship.
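For reference, MCC is computed from all four confusion-matrix cells as:
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$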
Advantages and Disadvantages
The F1 Score is widely used in classification tasks, particularly when there is a class imbalance or when both precision and recall are important. However, like any metric, it has its strengths and limitations depending on the context.
✅ Advantages:
- The F1 Score provides a single, balanced measure when both false positives and false negatives matter equally. This is especially helpful when optimizing for one metric in isolation could harm the other.
- Unlike accuracy, the F1 Score is less biased by class imbalance, making it a more reliable choice in domains like fraud detection or rare disease diagnosis.
- As the harmonic mean of precision and recall, it offers an intuitive summary of model performance that can be easily communicated and compared.
❌ Disadvantages:
- The F1 Score only considers precision and recall (true positives, false positives, and false negatives), leaving out true negatives. This can be misleading in cases where correct negative predictions are important.
- The F1 Score cannot be used directly as a loss function in model training with gradient-based optimization, requiring surrogate loss functions like binary cross-entropy.
- In many real-world tasks, precision or recall may be more important than the other (e.g., recall in cancer detection, precision in spam filtering). F1 doesn’t account for this without modification (e.g., the Fβ Score).
Conclusion
The F1 Score is a widely-used metric for evaluating classification models, especially in situations where the data is imbalanced or both types of classification errors (false positives and false negatives) carry significant consequences. By combining precision and recall into a single harmonic mean, it offers a balanced view of a model’s effectiveness in identifying positive cases accurately and consistently.
However, it’s important to recognize that the F1 Score doesn’t account for true negatives and may not reflect priorities where one type of error is more critical than the other. In such cases, using the Fβ Score, Precision, Recall, or other metrics in combination is more appropriate.
External resources:
- Example code in Rust available on 👉 GitHub Repository
Feedback
Found this helpful? Let me know what you think or suggest improvements 👉 Contact me.