What does “k” mean in k-fold cross-validation?

“k” is the number of folds (subsets) you split your dataset into. The model is trained k times, each time validating on a different fold, and the k validation scores are then averaged.

Do I still need a separate test set if I use k-fold cross-validation?

Usually, yes. Use k-fold cross-validation during model selection and tuning, then reserve a final holdout test set for a one-time, unbiased performance estimate.

When should I use Stratified K-fold instead of regular K-fold?

Use stratified splits for classification problems, especially with imbalanced classes. Stratification keeps class proportions similar in each fold, making validation scores more comparable.

Should I shuffle the data before running k-fold cross-validation?

Shuffling is often helpful for independent samples, but it can be harmful for time series or grouped data. For time-dependent problems, use time-aware splits that preserve ordering.

K-Fold Cross-Validation in Machine Learning (2026 Guide)

Updated on January 23, 2026. 5-minute read.

K-fold cross-validation is a standard way to evaluate a machine learning model on multiple data splits.
Instead of trusting one train/test split, you rotate the validation fold and summarize performance across runs.
In 2026, it is still a go-to choice for strong baselines, especially when data is limited.

At a high level, you split your dataset into k folds of roughly equal size.
You train on k-1 folds and validate on the remaining fold, repeating until every fold is used once.
You then average the scores to estimate how the model may perform on new, unseen data.

What k-fold cross-validation helps you measure

Cross-validation helps you estimate generalization performance, meaning how well a model might perform beyond the training data.
It is especially useful when you want to compare algorithms, feature sets, or training settings consistently.
It also shows stability: if results vary a lot across folds, your model may be sensitive to sampling.

Cross-validation does not prevent overfitting by itself.
What it does provide is a more reliable evaluation loop that can reveal overfitting earlier than a single split.
For final reporting, many teams still keep a separate holdout test set that is not used during tuning.

How k-fold cross-validation works

1) Split the dataset into k folds

Divide your dataset into k subsets (folds) of roughly equal size.
For example, with 1,000 samples and k = 5, each fold contains about 200 samples.
Most ML libraries can create these folds for you, with optional shuffling.
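To make the fold arithmetic concrete, here is a minimal pure-Python sketch. The helper names `fold_sizes` and `fold_indices` are made up for this illustration, and the rule of giving the first few folds one extra sample is just one reasonable way to handle a remainder. It does no shuffling.

```python
def fold_sizes(n_samples, k):
    """Return k fold sizes that differ by at most one sample."""
    base, extra = divmod(n_samples, k)
    # The first `extra` folds absorb the remainder, one sample each.
    return [base + 1 if i < extra else base for i in range(k)]


def fold_indices(n_samples, k):
    """Return contiguous index lists, one per fold (no shuffling)."""
    folds, start = [], 0
    for size in fold_sizes(n_samples, k):
        folds.append(list(range(start, start + size)))
        start += size
    return folds


sizes_even = fold_sizes(1000, 5)
sizes_uneven = fold_sizes(1003, 5)
print(sizes_even)    # [200, 200, 200, 200, 200]
print(sizes_uneven)  # [201, 201, 201, 200, 200]
```

In practice you would normally let your ML library build the folds; the sketch only shows why "about 200 samples" is the right expectation for 1,000 samples and k = 5.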

2) Train and validate k times

Run k training rounds. In each round, you use a different fold as validation and the rest as training.

Round 1: validate on Fold 1, train on Folds 2 to 5
Round 2: validate on Fold 2, train on Folds 1 and 3 to 5
Round 3: validate on Fold 3, train on Folds 1, 2, 4, and 5
Continue until every fold has been the validation fold once

You end up with k validation scores, one per fold.
As long as you keep the metric consistent, these scores are directly comparable.
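The rounds above can be written as an explicit loop. This is a sketch, not the only way to do it: it assumes scikit-learn is installed, uses synthetic data from `make_classification` as a stand-in for a real dataset, and picks logistic regression and accuracy purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption for this example).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for round_no, (train_idx, val_idx) in enumerate(cv.split(X), start=1):
    # A fresh model each round, so nothing carries over between folds.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])

    # Score only on the fold that was held out this round.
    score = accuracy_score(y[val_idx], model.predict(X[val_idx]))
    fold_scores.append(score)
    print(f"Round {round_no}: train={len(train_idx)}, "
          f"validate={len(val_idx)}, accuracy={score:.3f}")
```

Creating a new model inside the loop matters: reusing a fitted model across rounds would let earlier folds influence later ones.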

3) Aggregate the results

Combine the fold scores into a summary of expected performance.
A common summary is the mean and standard deviation of the fold scores: the mean reflects typical performance, and the standard deviation reflects stability.
If the standard deviation is high, the model may be unstable, or the dataset may be noisy.
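Here is a small sketch of the aggregation step using only the standard library. The five fold scores are hypothetical numbers invented for this example, not results from a real model.

```python
from statistics import mean, pstdev, stdev

# Hypothetical accuracy scores from a 5-fold run (made up for illustration).
fold_scores = [0.82, 0.85, 0.79, 0.84, 0.80]

mean_score = mean(fold_scores)
pop_std = pstdev(fold_scores)    # population std (divides by k)
sample_std = stdev(fold_scores)  # sample std (divides by k - 1)

print(f"Accuracy: {mean_score:.3f} +/- {pop_std:.3f} (population std)")
print(f"Accuracy: {mean_score:.3f} +/- {sample_std:.3f} (sample std)")
```

The two standard deviations differ noticeably when k is small. NumPy's `.std()` defaults to the population version, so say which one you report when comparing numbers across tools.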

Why k-fold is often better than a single train/test split

A single split can be unusually easy or unusually hard, especially on smaller datasets.
K-fold reduces this sensitivity by evaluating the model across multiple validation slices.
That typically produces a more dependable estimate when choosing between models.

K-fold also uses data efficiently.
Each sample is used for training in k-1 rounds and for validation in 1 round.
This is helpful when collecting labeled data is expensive or slow.

The trade-off is compute.
Training k times can be costly for large datasets and heavy models.
In those cases, you may choose a smaller k, fewer repeats, or a holdout strategy for faster iteration.

Choosing a good value of k in 2026

Many practitioners start with k = 5 or k = 10 as practical defaults.
These often balance runtime with evaluation quality for common ML tasks.
But the best choice depends on your dataset size, model cost, and risk of leakage.

Choose a smaller k when training is expensive and you need quicker feedback.
Choose a larger k when data is scarce and you want larger training sets in each round.
As k increases, the runtime increases, and because each validation fold gets smaller, individual fold scores can become more variable.
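A quick way to see the trade-off is to tabulate it. The sketch below uses a hypothetical helper, `cv_cost`, and approximates fold sizes with integer division. It only counts model fits and samples; it does not measure real runtime.

```python
def cv_cost(n_samples, k):
    """Approximate per-round sizes and number of model fits for k-fold CV."""
    val_size = n_samples // k            # approximate validation fold size
    train_size = n_samples - val_size    # the remaining samples train the model
    return {"k": k, "fits": k, "train_size": train_size, "val_size": val_size}


table = [cv_cost(1000, k) for k in (2, 5, 10, 20)]
for row in table:
    print(row)
```

With 1,000 samples, moving from k = 5 to k = 10 doubles the number of fits while the training set grows from 800 to 900 samples and the validation fold shrinks from 200 to 100.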

Common variants you should know

Plain K-fold assumes samples are independent and similarly distributed.
Real-world data often breaks that assumption, so the splitting strategy matters.
These variants are widely used in applied ML:

Stratified K-fold (classification): keeps class proportions similar in each fold, useful for imbalanced classes
Group K-fold: keeps related samples together (for example, same user or same patient) to avoid leakage
Time-series splits: preserve time order to prevent training on future information
Repeated K-fold: repeats CV with different random splits to reduce randomness in the estimate
Nested cross-validation: separates hyperparameter tuning from final evaluation to avoid optimistic results
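Three of these variants can be sketched with scikit-learn splitters. The toy data is an assumption made for this example (100 rows, 10% positive class, 20 groups of 5 rows), and the features are all zeros because only the split indices matter here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, RepeatedKFold, StratifiedKFold

# Hypothetical toy data: features are irrelevant for inspecting splits.
X = np.zeros((100, 1))
y = np.array([1] * 10 + [0] * 90)        # imbalanced labels
groups = np.repeat(np.arange(20), 5)     # e.g. 20 users with 5 rows each

# Stratified K-fold: count the positives that land in each validation fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print("Positives per validation fold:", positives_per_fold)

# Group K-fold: check that no group appears on both sides of a split.
gkf = GroupKFold(n_splits=5)
group_overlaps = [
    set(groups[train_idx]) & set(groups[val_idx])
    for train_idx, val_idx in gkf.split(X, y, groups=groups)
]
print("Groups shared between train and validation:", group_overlaps)

# Repeated K-fold: 5 folds repeated 3 times with different shuffles.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
n_rounds = sum(1 for _ in rkf.split(X))
print("Total training rounds:", n_rounds)
```

Inspecting splits like this before training is a cheap sanity check that the splitter matches the structure of your data.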

Common pitfalls and how to avoid them

Data leakage during preprocessing

If you scale features, impute missing values, select features, or resample data, do it inside each training fold.
Doing preprocessing once on the full dataset leaks information from validation folds into training steps.
In practice, this is why end-to-end pipelines are important.
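As an illustration, the sketch below wraps scaling and the model in one scikit-learn pipeline, so the scaler is refit on the training portion of every fold. The synthetic data and the choice of `StandardScaler` with logistic regression are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this example).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The scaler lives inside the pipeline, so cross_val_score fits it
# only on each round's training folds, never on the validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = KFold(n_splits=5, shuffle=True, random_state=42)
pipeline_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print("Fold accuracies:", pipeline_scores.round(3))
print("Mean accuracy:", pipeline_scores.mean().round(3))
```

The leaky alternative would be calling `StandardScaler().fit_transform(X)` on the full dataset before cross-validation, which lets validation rows influence the scaling statistics.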

Shuffling when the data has an order or groups

Shuffling is often fine for independent samples.
For time series, grouped records, or ordered processes, shuffling can create unrealistic validation sets.
Use time-aware or group-aware splitters when the problem requires it.
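For ordered data, a time-aware splitter keeps every validation window after its training window. This sketch uses scikit-learn's `TimeSeriesSplit` on 12 hypothetical time steps, where the row index stands in for time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical observations, already sorted by time.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = [
    (train_idx.tolist(), val_idx.tolist())
    for train_idx, val_idx in tscv.split(X)
]

for round_no, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Round {round_no}: train up to t={train_idx[-1]}, "
          f"validate on t={val_idx[0]}..{val_idx[-1]}")
```

Unlike plain K-fold, the training window grows from round to round and earlier samples are never used for validation against later training data.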

Mixing model selection with final reporting

Cross-validation is great for choosing models and tuning settings.
For a clean final estimate, reserve a test set that you never touch during tuning.
Evaluate on that test set once, after decisions are finalized.
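One way to keep tuning and final reporting apart is sketched below. It assumes synthetic data, a small hypothetical grid over the regularization strength `C`, and an 80/20 split; none of these values come from a real project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in data (an assumption for this example).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Set the test set aside first; it plays no part in tuning.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Cross-validation runs only on the development portion.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # hypothetical grid
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_dev, y_dev)

# One final evaluation on the untouched test set.
test_accuracy = search.score(X_test, y_test)
print("Best C:", search.best_params_["C"])
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(test_accuracy, 3))
```

If you find yourself going back to adjust the model after seeing the test score, the test set has effectively become part of tuning and the estimate is no longer clean.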

A small scikit-learn example

This is a minimal example using a K-fold splitter and multiple scoring metrics.
In real projects, consider using a pipeline so that preprocessing happens inside each fold.

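from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Placeholder binary classification data; replace with your own X and y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 5 shuffled folds with a fixed seed for reproducible splits.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(max_iter=1000)

# Score every fold on several metrics at once.
# "precision" and "recall" as used here assume a binary target.
scores = cross_validate(
    model,
    X, y,
    cv=cv,
    scoring=["accuracy", "precision", "recall"],
)

# Results are keyed as "test_<metric>", with one value per fold.
print("Mean accuracy:", scores["test_accuracy"].mean())
print("Accuracy std:", scores["test_accuracy"].std())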
