F1 Score

The F1 Score is an evaluation metric for classification models, especially in situations where there is an uneven class distribution or when both false positives and false negatives are costly. It is the harmonic mean of Precision and Recall, combining both into a single, balanced measure.
In many real-world classification problems, such as spam detection, fraud identification, or medical diagnosis, optimizing only precision or only recall can lead to misleading conclusions. The F1 Score addresses this by taking both into account, penalizing extreme values in either metric.
Definition and Intuition
The F1 Score is a classification metric that balances two important aspects of model performance: Precision (how many predicted positives are actually correct) and Recall (how many actual positives the model successfully identified).
It is defined as the harmonic mean of Precision and Recall, which gives a more conservative average than the arithmetic mean, especially when there's an imbalance between the two. The harmonic mean punishes extreme differences. For example, if either precision or recall is very low, the F1 Score will be low, even if the other is high.
Model Performance Insights
High F1 Score
Indicates the model has a good balance between precision and recall. It correctly identifies a large number of positive cases and does so with minimal false positives.
Low F1 Score
Suggests that the model is either missing many positives (low recall), making too many incorrect positive predictions (low precision), or both.
Trade-off Awareness
If our application needs to catch positives without overwhelming users with false alarms (e.g., search engines, recommender systems), the F1 Score can be a guiding metric.
When Not to Use
If the cost of false positives and false negatives is significantly different, it may be better to look at precision and recall separately or use a weighted variant (like the Fβ score).
Core Concepts
Harmonic Mean vs. Arithmetic Mean
The harmonic mean used in F1 gives more weight to the lower value. This ensures that the F1 Score will only be high if both precision and recall are high.
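As a quick illustration with hypothetical values, suppose precision is 0.9 and recall is 0.1:
$$\text{Arithmetic mean} = \frac{0.9 + 0.1}{2} = 0.5, \qquad \text{F1} = 2 \times \frac{0.9 \times 0.1}{0.9 + 0.1} = 0.18$$
The harmonic mean is pulled toward the weaker of the two values, so a model cannot score well by excelling at only one of them.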
Symmetry
F1 treats precision and recall as equally important. If our application cares more about one over the other, consider using Fβ Score, which allows weighting one more heavily.
Applicability to Imbalanced Data
In scenarios where the positive class is rare (e.g., fraud detection), accuracy can be misleading. The F1 Score focuses on the minority class performance, making it more suitable for such tasks.
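As a hypothetical illustration, consider a dataset of 1,000 samples containing only 10 positives, and a model that predicts every sample as negative:
$$\text{Accuracy} = \frac{990}{1000} = 0.99, \qquad \text{Recall} = \frac{0}{10} = 0 \;\Rightarrow\; \text{F1} = 0$$
Accuracy looks excellent, while the F1 Score exposes that the model never finds a single positive.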
Mathematical Formulation
The F1 Score is calculated using the harmonic mean of Precision and Recall. It combines both metrics into a single scalar that reflects the balance between them.
- Precision : \(\frac{TP}{TP + FP}\)
- Recall : \(\frac{TP}{TP + FN}\)
- TP - True Positives : correctly predicted positive instances.
- FP - False Positives : negative instances incorrectly predicted as positive.
- FN - False Negatives : positive instances incorrectly predicted as negative.
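Substituting these definitions, the F1 Score can also be written directly in terms of the confusion-matrix counts:
$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$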
Calculation Procedure
- Count Confusion Matrix Values
  - True Positives (TP): Correctly predicted positives.
  - False Positives (FP): Incorrectly predicted positives.
  - False Negatives (FN): Missed actual positives.
- Calculate Precision
  $$\text{Precision} = \frac{TP}{TP + FP}$$
- Calculate Recall
  $$\text{Recall} = \frac{TP}{TP + FN}$$
- Compute F1 Score
  $$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Example
Let’s use a binary classification example.
Confusion Matrix:
- True Positives (TP): 40
- False Positives (FP): 10
- False Negatives (FN): 20
Precision:
$$\text{Precision} = \frac{40}{40 + 10} = 0.8$$
Recall:
$$\text{Recall} = \frac{40}{40 + 20} \approx 0.6667$$
F1 Score:
$$\text{F1 Score} = 2 \times \frac{0.8 \times 0.6667}{0.8 + 0.6667} \approx 0.7273$$
Interpretation
An F1 Score of 0.7273 suggests that the model performs reasonably well in balancing both precision and recall. It correctly identifies many of the actual positives while keeping false positives at a moderate level.
Properties and Behavior
The F1 Score is a widely used metric due to its balanced nature, but understanding its behavior under different conditions is critical for using it appropriately in model evaluation.
Typical Value Ranges
The F1 Score ranges between 0 and 1:
- 1: Perfect precision and recall, every positive is correctly identified, and no false positives are made.
- 0: Either precision or recall is zero, indicating failure in either detecting positives or doing so accurately.
What counts as a “good” F1 Score is highly context-dependent, shaped by domain expectations and acceptable error trade-offs.
Sensitivity to Outliers or Noise
Outliers in Predictions:
If a model makes occasional extreme misclassifications (e.g., false positives on rare classes), the F1 Score can drop significantly, especially in small datasets.
Label Noise:
Incorrectly labeled data (especially false negatives or positives) directly impacts both precision and recall, and thus, the F1 Score.
Differentiability and Role in Optimization
The F1 Score, being based on discrete counts (TP, FP, FN), is not differentiable, which makes it unsuitable as a direct loss function in gradient-based optimization algorithms. In practice, models are trained with differentiable surrogates like binary cross-entropy or focal loss. The F1 Score is used post-training to evaluate performance. Some advanced techniques (e.g. reinforcement learning-inspired loss functions or F1-approximation objectives) can optimize for F1 directly, but they are complex and less commonly used.
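As a rough sketch of the idea behind such F1-approximation objectives (not a production implementation), the example below computes a "soft" F1 in which the confusion-matrix counts are replaced by sums over predicted probabilities. The function name, the ε smoothing term, and the sample values are illustrative assumptions; in practice gradients would come from an automatic-differentiation framework rather than plain Rust.

```rust
/// "Soft" F1: replaces hard TP/FP/FN counts with sums of predicted probabilities,
/// yielding a smooth function of the model outputs (illustrative sketch).
fn soft_f1(probs: &[f64], labels: &[f64]) -> f64 {
    let eps = 1e-8; // avoids division by zero when everything is negative
    let mut tp = 0.0;
    let mut fp = 0.0;
    let mut fn_ = 0.0;
    for (&p, &y) in probs.iter().zip(labels.iter()) {
        tp += p * y; // probability mass assigned to actual positives
        fp += p * (1.0 - y); // probability mass wrongly assigned to negatives
        fn_ += (1.0 - p) * y; // probability mass missed on actual positives
    }
    2.0 * tp / (2.0 * tp + fp + fn_ + eps)
}

fn main() {
    // Hypothetical predicted probabilities and 0/1 labels.
    let probs = [0.9, 0.2, 0.7, 0.4];
    let labels = [1.0, 0.0, 1.0, 1.0];
    // 1.0 - soft F1 could serve as a differentiable surrogate loss.
    println!("Soft F1: {:.4}", soft_f1(&probs, &labels));
}
```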
Assumptions and Limitations
The F1 Score assumes equal importance of precision and recall. If your application places more emphasis on one over the other, the Fβ Score (a generalized version) may be more appropriate.
- \(\beta > 1\) : more emphasis on recall.
- \(\beta < 1\) : more emphasis on precision.
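For reference, the generalized formula (with precision \(P\) and recall \(R\)) is:
$$F_\beta = (1 + \beta^2) \times \frac{P \times R}{\beta^2 \times P + R}$$
Setting \(\beta = 1\) recovers the standard F1 Score.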
F1 Score does not consider true negatives, making it unsuitable when the negative class is also important (e.g., evaluating overall correctness or resource usage).
Code Example
```rust
/// Precision = TP / (TP + FP); returns 0.0 when there are no predicted positives.
fn calculate_precision(tp: usize, fp: usize) -> f64 {
    if tp + fp == 0 {
        return 0.0;
    }
    tp as f64 / (tp + fp) as f64
}

/// Recall = TP / (TP + FN); returns 0.0 when there are no actual positives.
fn calculate_recall(tp: usize, fn_: usize) -> f64 {
    if tp + fn_ == 0 {
        return 0.0;
    }
    tp as f64 / (tp + fn_) as f64
}

/// F1 Score = harmonic mean of precision and recall; returns 0.0 when both are zero.
fn calculate_f1_score(tp: usize, fp: usize, fn_: usize) -> f64 {
    let precision = calculate_precision(tp, fp);
    let recall = calculate_recall(tp, fn_);
    if precision + recall == 0.0 {
        return 0.0;
    }
    2.0 * (precision * recall) / (precision + recall)
}

fn main() {
    // Example confusion matrix values
    let tp = 40;
    let fp = 10;
    let fn_ = 20;

    let f1 = calculate_f1_score(tp, fp, fn_);
    println!("F1 Score: {:.4}", f1);
}
```
Explanation
Input:
- tp: True Positives (correctly identified positives).
- fp: False Positives (incorrect positive predictions).
- fn_: False Negatives (missed actual positives).
Helper Functions:
- calculate_precision and calculate_recall compute the standard definitions with safeguards against division by zero.
Main Function:
- Computes the F1 Score using the harmonic mean formula.
- Outputs the score with 4 decimal places for clarity.
Output
F1 Score: 0.7273
An F1 Score of 0.7273 indicates a moderate balance between precision and recall, reflecting that the model is performing well but still missing some positives or including some false alarms.
Alternative Metrics
While the F1 Score provides a useful balance between precision and recall, it may not always be the best choice depending on the specific requirements of a task.
Precision
Precision (also known as Positive Predictive Value) measures the proportion of positive predictions that were actually correct.
- High precision means that when the model predicts a positive class, it's usually right.
- Especially important in cases where false positives are costly (e.g., spam filters, fraud detection).
"Out of all the samples predicted as positive, how many are truly positive?"
👉 A detailed explanation of Precision can be found in the section: Precision
Recall
Recall (also called Sensitivity or True Positive Rate) measures the proportion of actual positives that were correctly identified.
- High recall means the model captures most of the actual positives.
- Critical when false negatives are costly (e.g., disease detection, security screening).
"Out of all actual positive samples, how many did we correctly find?"
👉 A detailed explanation of Recall can be found in the section: Recall
Accuracy
Accuracy measures the overall proportion of correct predictions. While easy to understand, it can be misleading in imbalanced datasets where predicting the majority class yields high scores. Use it cautiously when class distributions are skewed.
👉 A detailed explanation of Accuracy can be found in the section: Accuracy
ROC-AUC
ROC-AUC evaluates a model’s ability to distinguish between classes at various classification thresholds.
- ROC Curve plots True Positive Rate (Recall) vs. False Positive Rate.
- AUC (Area Under Curve) measures overall performance:
  - 0.5 = random guessing.
  - 1.0 = perfect separation.
Ideal for comparing classifiers across different thresholds, especially in binary classification.
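As a minimal sketch (separate from the F1 example code above), ROC-AUC can be estimated as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The roc_auc function below and its sample inputs are illustrative assumptions based on that pairwise definition, with ties counted as 0.5.

```rust
/// Estimates ROC-AUC as the fraction of (positive, negative) score pairs ranked
/// correctly, counting ties as half a correct pair (illustrative sketch).
fn roc_auc(scores: &[f64], labels: &[bool]) -> f64 {
    let mut positives = Vec::new();
    let mut negatives = Vec::new();
    for (&score, &label) in scores.iter().zip(labels.iter()) {
        if label {
            positives.push(score);
        } else {
            negatives.push(score);
        }
    }
    if positives.is_empty() || negatives.is_empty() {
        return 0.5; // undefined without both classes; fall back to chance level
    }
    let mut correct = 0.0;
    for &p in &positives {
        for &n in &negatives {
            if p > n {
                correct += 1.0;
            } else if p == n {
                correct += 0.5;
            }
        }
    }
    correct / (positives.len() * negatives.len()) as f64
}

fn main() {
    // Hypothetical scores and labels.
    let scores = [0.9, 0.8, 0.35, 0.1];
    let labels = [true, false, true, false];
    println!("ROC-AUC: {:.2}", roc_auc(&scores, &labels));
}
```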
Fβ Score
The Fβ Score generalizes the F1 Score by letting you control the balance between precision and recall. When \(\beta > 1\), recall is given more weight; when \(\beta < 1\), precision is prioritized. This is useful in domains where one type of error is much more costly than the other.
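A minimal sketch of how the F1 function from the code example above could be generalized; the fbeta_score name and signature are illustrative, reusing the precision and recall values from the worked example.

```rust
/// Fβ = (1 + β²) · P · R / (β² · P + R); β > 1 favors recall, β < 1 favors precision.
fn fbeta_score(precision: f64, recall: f64, beta: f64) -> f64 {
    let beta_sq = beta * beta;
    let denominator = beta_sq * precision + recall;
    if denominator == 0.0 {
        return 0.0;
    }
    (1.0 + beta_sq) * precision * recall / denominator
}

fn main() {
    // Precision and recall from the worked example (0.8 and ~0.6667).
    println!("F1   : {:.4}", fbeta_score(0.8, 0.6667, 1.0)); // equal weighting
    println!("F2   : {:.4}", fbeta_score(0.8, 0.6667, 2.0)); // emphasizes recall
    println!("F0.5 : {:.4}", fbeta_score(0.8, 0.6667, 0.5)); // emphasizes precision
}
```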
Matthews Correlation Coefficient
MCC is a correlation-based metric that considers all four outcomes in the confusion matrix. It is especially valuable for imbalanced datasets, providing a balanced measure even when class sizes differ. An MCC close to 1 indicates a strong predictive relationship.
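For reference, MCC is computed from all four confusion-matrix cells as:
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$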
Advantages and Disadvantages
The F1 Score is widely used in classification tasks, particularly when there is a class imbalance or when both precision and recall are important. However, like any metric, it has its strengths and limitations depending on the context.
✅ Advantages:
- The F1 Score provides a single, balanced measure when both false positives and false negatives matter equally. This is especially helpful when optimizing for one metric in isolation could harm the other.
- Unlike accuracy, the F1 Score is less biased by class imbalance, making it a more reliable choice in domains like fraud detection or rare disease diagnosis.
- As the harmonic mean of precision and recall, it offers an intuitive summary of model performance that can be easily communicated and compared.
❌ Disadvantages:
- The F1 Score only considers precision and recall (true positives, false positives, and false negatives), leaving out true negatives. This can be misleading in cases where correct negative predictions are important.
- The F1 Score cannot be used directly as a loss function in model training with gradient-based optimization, requiring surrogate loss functions like binary cross-entropy.
- In many real-world tasks, precision or recall may be more important than the other (e.g., recall in cancer detection, precision in spam filtering). F1 doesn’t account for this without modification (e.g., the Fβ Score).
Conclusion
The F1 Score is a widely-used metric for evaluating classification models, especially in situations where the data is imbalanced or both types of classification errors (false positives and false negatives) carry significant consequences. By combining precision and recall into a single harmonic mean, it offers a balanced view of a model’s effectiveness in identifying positive cases accurately and consistently.
However, it’s important to recognize that the F1 Score doesn’t account for true negatives and may not reflect priorities where one type of error is more critical than the other. In such cases, using the Fβ Score, Precision, Recall, or other metrics in combination is more appropriate.
External resources:
- Example code in Rust available on 👉 GitHub Repository
Feedback
Found this helpful? Let me know what you think or suggest improvements 👉 Contact me.