Precision, Recall & F1 Score for Classification Models
Updated on January 30, 2026
When you build a classification model, the real question is not only "Is it accurate?" but "Is it useful for this decision?" In 2026, classification models are often deployed into products and workflows where different mistakes have different costs.
That is why evaluation usually starts with three core metrics: precision, recall, and the F1 score. Together, they help you understand what kinds of errors your model makes, not just how many.
Start with the confusion matrix
Most classification metrics are built from four counts. If you can explain these clearly, you can explain every metric that follows.
A confusion matrix compares model predictions against the true labels:
- TP (True Positives): predicted positive, actually positive
- FP (False Positives): predicted positive, actually negative
- TN (True Negatives): predicted negative, actually negative
- FN (False Negatives): predicted negative, actually positive
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Once you have TP, FP, TN, and FN, you can compute precision, recall, and F1 with simple formulas.
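To make the four counts concrete, here is a minimal sketch that tallies them from two lists of binary labels. The function name `confusion_counts`, the `positive` parameter, and the sample labels are all made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for two equal-length lists of binary labels."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Hypothetical labels, invented for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=3 FP=1 TN=3 FN=1
```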
Precision
Precision measures how reliable your positive predictions are. It answers: "When the model says positive, how often is it correct?"
Formula: Precision = TP / (TP + FP)
Precision matters most when false positives are expensive. In these cases, an incorrect "yes" triggers a costly action or a poor user experience.
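As a quick illustration, the precision formula can be sketched as a small function. The counts are hypothetical, and returning 0.0 when the model makes no positive predictions is a convention assumed here, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    # With no positive predictions, precision is undefined.
    # Returning 0.0 is an assumed convention for this sketch.
    if predicted_positive == 0:
        return 0.0
    return tp / predicted_positive


# Hypothetical counts: 80 correct alerts and 20 false alarms.
alert_precision = precision(tp=80, fp=20)
print(alert_precision)  # 0.8
```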
When to prioritize precision
- You want fewer false alarms, even if you miss some true cases.
- A positive prediction triggers a limited resource (manual review, on-call response).
- You care about user trust and want "positive" to mean "high confidence."
Recall
Recall measures how well your model finds the actual positives. It answers: "Of all the cases that are actually positive, how many did we catch?"
Formula: Recall = TP / (TP + FN)
Recall matters most when false negatives are expensive. In those cases, missing an actual positive is worse than raising a few extra false alarms.
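The recall formula can be sketched the same way. The counts below are hypothetical, and the 0.0 fallback for a dataset with no actual positives is an assumed convention rather than a fixed rule.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    actual_positive = tp + fn
    # With no actual positives, recall is undefined.
    # Returning 0.0 is an assumed convention for this sketch.
    if actual_positive == 0:
        return 0.0
    return tp / actual_positive


# Hypothetical counts: 30 positives caught and 10 missed.
caught_fraction = recall(tp=30, fn=10)
print(caught_fraction)  # 0.75
```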
When to prioritize recall
- Missing a positive case has high consequences (safety, compliance, fraud loss).
- You can tolerate extra false positives because they can be filtered later.
- Your goal is coverage: catch as many true cases as possible.
F1 score
Precision and recall often move in opposite directions. If you change your decision threshold, you may gain recall but lose precision, or the reverse.
The F1 score is the harmonic mean of precision and recall, so it gives you a single value that is high only when both are high. It is especially useful when classes are imbalanced and you care about performance on the positive class.
Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
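A short sketch shows how this formula behaves when precision and recall are far apart. The input values are hypothetical, and the 0.0 fallback when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * (precision * recall) / (precision + recall)


balanced = f1_score(0.75, 0.75)
lopsided = f1_score(0.9, 0.1)
arithmetic_mean = (0.9 + 0.1) / 2

print(round(balanced, 2))         # 0.75
print(round(lopsided, 2))         # 0.18
print(round(arithmetic_mean, 2))  # 0.5
```

A plain average would rate the lopsided model at 0.5, while F1 drops to about 0.18, because the formula penalizes whichever of the two values is weak.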
A practical note for 2026 workflows
If you are comparing models, do not compare F1 scores blindly across different thresholds. Decide whether you are comparing fixed-threshold or threshold-optimized performance, and document your choice.
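To see why the threshold policy matters, the sketch below scores one hypothetical model at three cutoffs. The labels, scores, thresholds, and the helper name `metrics_at_threshold` are invented for illustration, and a score at or above the threshold is assumed to mean "predict positive."

```python
def metrics_at_threshold(labels, scores, threshold):
    """Return (precision, recall, f1) when score >= threshold means positive."""
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    # Undefined ratios fall back to 0.0 (an assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical true labels and predicted probabilities.
labels = [1, 1, 0, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]

sweep = {t: metrics_at_threshold(labels, scores, t) for t in (0.3, 0.5, 0.7)}
for t, (p, r, f1) in sweep.items():
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

On this toy data the same model has an F1 of roughly 0.80, 0.75, or 0.67 depending only on the cutoff, which is why a reported F1 is hard to interpret without the threshold behind it.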
Choosing the right metric for your use case
The "best" metric depends on the decision your model supports. Start by writing down what a false positive and a false negative cost you.
Use this quick guide:
- Prioritize precision when false positives are costly or disruptive.
- Prioritize recall when false negatives are dangerous or expensive.
- Use F1 when you need a balanced score and the positive class matters.
If stakeholders cannot agree on the trade-off, it is a signal to clarify the decision. Metrics do not replace product requirements; they make them measurable.
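One way to write the trade-off down is to attach a cost to each error type and total it up. The sketch below is only an illustration: the two models, their error counts, and the cost values are all hypothetical, and a simple linear cost is assumed.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of errors, assuming each error has a fixed unit cost."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical error counts for two candidate models.
model_a = {"fp": 50, "fn": 5}   # many false alarms, few misses
model_b = {"fp": 10, "fn": 20}  # few false alarms, more misses

# Two hypothetical cost settings.
misses_costly = {"cost_fp": 1, "cost_fn": 10}
alarms_costly = {"cost_fp": 10, "cost_fn": 1}

cost_a_misses = total_error_cost(**model_a, **misses_costly)  # 100
cost_b_misses = total_error_cost(**model_b, **misses_costly)  # 210
cost_a_alarms = total_error_cost(**model_a, **alarms_costly)  # 505
cost_b_alarms = total_error_cost(**model_b, **alarms_costly)  # 120

print(cost_a_misses, cost_b_misses)
print(cost_a_alarms, cost_b_alarms)
```

With these made-up numbers, model A is cheaper when misses are costly and model B is cheaper when false alarms are costly, so the preferred model flips with the cost assumptions rather than with the models themselves.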
Other useful classification metrics to know
Precision, recall, and F1 are foundational, but they are not the whole toolbox. Depending on your problem, you may also track the metrics below.
Accuracy (and why it can mislead)
Accuracy is the fraction of all predictions that are correct.
Formula: Accuracy = (TP + TN) / (TP + FP + TN + FN)
Accuracy can look strong even when your model fails on the minority class. If positives are rare, a model can be "accurate" by predicting negatives almost all the time.
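A small sketch makes the problem visible. The dataset is hypothetical: 1,000 examples with only 10 positives, scored by a "model" that always predicts negative.

```python
# Hypothetical imbalanced dataset: 10 positives, 990 negatives.
y_true = [1] * 10 + [0] * 990
# A trivial "model" that predicts negative for every example.
y_pred = [0] * 1000

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

The model is 99% accurate and still finds none of the positive cases.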
Specificity
Specificity (true negative rate) measures how well the model identifies negatives.
Formula: Specificity = TN / (TN + FP)
This can be useful alongside recall (which is the true positive rate) when you want a balanced view of both classes.
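As a sketch, specificity can be computed next to recall to get one number per class. The counts are hypothetical, and the 0.0 fallback for a dataset with no actual negatives is an assumed convention.

```python
def specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    actual_negative = tn + fp
    # With no actual negatives, specificity is undefined.
    # Returning 0.0 is an assumed convention for this sketch.
    if actual_negative == 0:
        return 0.0
    return tn / actual_negative


# Hypothetical counts for a single model.
tp, fn = 30, 10
tn, fp = 90, 10

true_negative_rate = specificity(tn, fp)
true_positive_rate = tp / (tp + fn)  # recall

print(true_negative_rate)  # 0.9
print(true_positive_rate)  # 0.75
```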
ROC AUC and PR AUC
When your model outputs probabilities, it helps to evaluate it across many thresholds rather than a single cutoff.
- ROC AUC summarizes the trade-off between true positive rate and false positive rate.
- PR AUC summarizes the precision-recall curve, which is often more informative when the positive class is rare.
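One way to build intuition for ROC AUC is its pairwise reading: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as half. The sketch below computes it that way on hypothetical scores. It compares every positive with every negative, so it is meant for illustration on small data, not as an efficient implementation.

```python
def roc_auc_pairwise(labels, scores):
    """ROC AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # pair ranked correctly
            elif pos_score == neg_score:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and predicted probabilities.
labels = [1, 1, 0, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]

auc = roc_auc_pairwise(labels, scores)
print(auc)  # 0.875
```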
Multi-class and imbalanced classification
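For multi-class problems, you will often report macro, micro, or weighted averages for precision, recall, and F1. Macro averaging weights every class equally, micro averaging pools the counts across all classes, and weighted averaging weights each class by its support, so the choice changes what "good performance" means.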
If class imbalance is part of your reality, include class support and consider reporting per-class metrics. A single average can hide critical failures.
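The sketch below shows how much the averaging choice can matter, using precision on a hypothetical three-class problem. The class names, labels, and the helper name `averaged_precision` are invented for illustration, and a class that is never predicted is assigned a precision of 0.0 by assumption.

```python
from collections import Counter


def averaged_precision(y_true, y_pred):
    """Return (per_class, macro, micro, weighted) precision."""
    classes = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)      # actual examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # A class that is never predicted gets 0.0 (assumed convention).
    per_class = {
        c: (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in classes
    }
    macro = sum(per_class.values()) / len(classes)
    micro = sum(correct.values()) / len(y_pred)
    weighted = sum(per_class[c] * support[c] for c in classes) / len(y_true)
    return per_class, macro, micro, weighted


# Hypothetical labels: 6 cats, 3 dogs, 1 bird.
y_true = ["cat"] * 6 + ["dog"] * 3 + ["bird"]
y_pred = ["cat"] * 5 + ["dog"] + ["dog"] * 2 + ["cat"] + ["cat"]

per_class, macro, micro, weighted = averaged_precision(y_true, y_pred)
print({c: round(v, 2) for c, v in per_class.items()})
print(round(macro, 2), round(micro, 2), round(weighted, 2))  # 0.46 0.7 0.63
```

Here the rare "bird" class is never predicted correctly. The macro average drops to about 0.46 because of it, while the micro average stays at 0.70, which is the kind of failure a single averaged number can hide.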
Common pitfalls to avoid
Strong models can look weak, and weak models can look strong, when evaluation is misapplied. These issues appear frequently in real deployments.
- Relying on accuracy alone when classes are imbalanced.
- Skipping the confusion matrix, which makes it harder to explain what went wrong.
- Comparing models with different thresholds without stating the threshold policy.
- Tuning on the test set, which inflates reported metrics and makes them a poor predictor of real-world performance.
- Ignoring business costs when choosing what to optimize.
Quick checklist for reporting classification performance
Use this checklist to make your results understandable and comparable:
- Include a confusion matrix (or per-class confusion matrices).
- Report precision, recall, and F1 (overall and per-class where relevant).
- State your decision threshold and how it was chosen.
- Mention class imbalance and class support.
- Add ROC or precision-recall curves if your model outputs probabilities.