ML Decision Trees and Random Forests
How decision trees work, why a single tree overfits, and how random forests solve that problem by averaging many trees trained on different data.
What you'll learn
- How a decision tree splits data
- Why single trees overfit
- How bagging builds a forest
- When to use forests vs boosted trees
- Practical tuning tips
Prerequisites
- Basic Python familiarity
Decision trees and random forests are some of the most useful models in practical machine learning. They handle mixed data types, need little preprocessing, and produce interpretable splits. They also have well-understood failure modes that the random forest extension fixes. This post covers both.
What a decision tree really is
A decision tree splits the data into smaller and smaller groups by asking yes-or-no questions about one feature at a time. At each node, the algorithm picks the feature and threshold that best separate the classes (for classification) or reduce variance (for regression). It keeps splitting until a stopping rule kicks in, like a minimum number of samples or a maximum depth.
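To make the split search concrete, here is a minimal sketch of how one node could choose its feature and threshold for classification. It assumes Gini impurity as the purity measure and tries midpoints between consecutive distinct values as candidate thresholds. The helper names `gini` and `best_split` and the toy arrays are made up for illustration; real libraries use much faster implementations.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels (0.0 means perfectly pure)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return (feature_index, threshold, weighted_impurity) of the best single split."""
    n_samples, n_features = X.shape
    # Start from the impurity of the unsplit node; a split must improve on it.
    best = (None, None, gini(y))
    for j in range(n_features):
        values = np.unique(X[:, j])  # sorted distinct values of feature j
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = y[X[:, j] <= t]
            right = y[X[:, j] > t]
            # Impurity of the two children, weighted by their sizes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical toy data: feature 0 separates the classes cleanly, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])
feature, threshold, impurity = best_split(X_toy, y_toy)
print(feature, threshold, impurity)
```

A full tree repeats this search recursively on each child until a stopping rule applies.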
The leaves of the tree hold predictions. For classification, the prediction is the majority class in the leaf. For regression, it is the mean target value. To predict for a new sample, you walk down the tree following the splits and read the leaf.
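The walk from root to leaf can be reproduced by hand. The sketch below uses the arrays scikit-learn exposes on a fitted tree (`tree_.children_left`, `tree_.children_right`, `tree_.feature`, `tree_.threshold`, `tree_.value`). The iris dataset and the helper name `walk_tree` are assumptions for illustration, and the cast to float32 mirrors how scikit-learn stores inputs internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_iris, y_iris)

def walk_tree(clf, sample):
    """Follow the splits from the root to a leaf and return the leaf's majority class."""
    t = clf.tree_
    sample = np.asarray(sample, dtype=np.float32)
    node = 0  # the root is always node 0
    while t.children_left[node] != -1:  # -1 marks a leaf
        if sample[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    # value[node][0] holds the per-class weights of the training rows in this leaf.
    return clf.classes_[np.argmax(t.value[node][0])]

manual = np.array([walk_tree(clf, row) for row in X_iris])
print("matches clf.predict:", (manual == clf.predict(X_iris)).all())
```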
Mental model
A tree is a flowchart that the algorithm builds for you from data. Each split tries to make the resulting groups more pure, where pure means most rows share the same label.
single tree:
        root
       /    \
      A      B
     / \    / \
    leaves...

forest:
    tree1  tree2  tree3 ... treeN
        \     |     /
     majority vote / mean

Hands-on example
Scikit-learn lets you train either model in a single line.
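from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# Load a binary classification dataset and hold out a test split.
X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
# Fit one unconstrained tree and one 200-tree forest on the same training data.
tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
# score() returns mean accuracy; the tree's training score shows how much it memorized.
print("tree (train):", tree.score(Xtr, ytr))
print("tree :", tree.score(Xte, yte))
print("forest :", forest.score(Xte, yte))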
A deep, unpruned tree typically scores near-perfectly on the training data and noticeably worse on the test set, which is the classic sign of overfitting. The forest, which trains many trees on bootstrap samples and averages their predictions, usually beats a single tree on the test set.
The random forest adds two tricks. Each tree sees a bootstrap sample of the data, not the whole dataset. At each split, the tree considers only a random subset of features. These randomizations decorrelate the trees, which is what makes averaging help so much.
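The two tricks can be sketched by hand with plain decision trees. This is an illustrative approximation, not how `RandomForestClassifier` is implemented internally: it draws bootstrap samples with NumPy, relies on `max_features="sqrt"` to restrict each split to a random feature subset, and combines the trees with a hard majority vote (scikit-learn's forest averages predicted probabilities instead). The tree count of 25 is an arbitrary odd number chosen to avoid tied votes.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary; odd so a binary majority vote cannot tie
trees = []
for i in range(n_trees):
    # Trick 1: bootstrap sample, i.e. draw len(Xtr) rows with replacement.
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    # Trick 2: each split considers only a random subset of sqrt(n_features) features.
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(t.fit(Xtr[idx], ytr[idx]))

# Each row of votes holds one tree's predictions for the test set.
votes = np.stack([t.predict(Xte) for t in trees])
# Labels are 0/1, so the mean vote exceeds 0.5 exactly when class 1 wins.
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == yte).mean()
print("hand-rolled forest accuracy:", accuracy)
```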
Trade-offs
A single tree is easy to inspect. You can read its rules, plot it, and explain a prediction by walking the path. A forest is much harder to inspect because no single tree carries the model.
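As a small illustration of reading a tree's rules, scikit-learn's `export_text` prints a fitted tree as indented if/else rules. The depth limit of 2 is an assumption chosen only to keep the output short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
# A shallow tree keeps the printed rule set small enough to read at a glance.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

# Pass feature names so the rules mention columns by name instead of index.
rules = export_text(small_tree, feature_names=list(data.feature_names))
print(rules)
```

For a graphical view, `sklearn.tree.plot_tree` draws the same structure with matplotlib.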
Forests are slower at prediction time, scaling linearly with the number of trees. For most workloads this is fine, but on latency-critical paths, a smaller forest with shallower trees can hit a sweet spot.
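Trees are insensitive to feature scale, so they need no standardization. Support for missing values and categorical features depends on the implementation: some libraries handle both natively, while scikit-learn's tree estimators expect categorical features to be numerically encoded and gained native missing-value support only in recent releases. Linear models need careful scaling and encoding; trees often do not.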
Forests still trail gradient-boosted trees, like XGBoost, LightGBM, and CatBoost, on most tabular benchmarks. The forest is a safe baseline. Boosting is the typical winner once you tune.
Practical tips
Set a maximum depth or minimum samples per leaf for single trees. An unconstrained tree will memorize every training point. For forests, deep trees are fine because averaging reduces variance.
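A quick way to see the effect of these constraints is to compare an unconstrained tree with a constrained one on the same split. The specific values `max_depth=4` and `min_samples_leaf=5` below are illustrative assumptions, not recommendations; the right settings depend on your data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Unconstrained: keeps splitting until the leaves are pure.
deep = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
# Constrained: limited depth and a minimum number of samples in every leaf.
shallow = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=5, random_state=0
).fit(Xtr, ytr)

for name, model in [("deep", deep), ("shallow", shallow)]:
    print(
        f"{name:8s} depth={model.get_depth()} "
        f"train={model.score(Xtr, ytr):.3f} test={model.score(Xte, yte):.3f}"
    )
```

The gap between the train and test columns is the thing to watch: a large gap suggests the tree is memorizing.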
Always evaluate on a held-out test set or cross-validation. Tree training accuracy is misleadingly high on small datasets. Cross-validation gives a realistic number.
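A minimal cross-validation sketch with scikit-learn's `cross_val_score` might look like this. The choice of 5 folds and 100 trees is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Each fold is held out once while the model trains on the remaining folds.
cv_scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", cv_scores)
print("mean:", cv_scores.mean(), "std:", cv_scores.std())
```

Reporting the spread alongside the mean helps show how much the number depends on the particular split.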
Use feature importances as a starting point, not as gospel. Tree importances can be biased toward high-cardinality features. Permutation importance is slower but more honest.
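To compare the two kinds of importance side by side, something like the following sketch works. It uses the forest's built-in `feature_importances_` and `sklearn.inspection.permutation_importance` computed on the held-out split; `n_repeats=5` is an arbitrary choice that keeps the run short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
Xtr, Xte, ytr, yte = train_test_split(data.data, data.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Impurity-based importances: computed from the training data during fitting.
impurity_imp = forest.feature_importances_

# Permutation importances: drop in test accuracy when one column is shuffled.
perm = permutation_importance(forest, Xte, yte, n_repeats=5, random_state=0)
perm_imp = perm.importances_mean

# Show the top five features under each measure.
for label, imp in [("impurity", impurity_imp), ("permutation", perm_imp)]:
    top = np.argsort(imp)[::-1][:5]
    print(label, [data.feature_names[i] for i in top])
```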
Tune n_estimators by watching the validation score plateau. Adding trees rarely hurts accuracy, but each one costs memory and prediction time. Stop adding once the curve flattens.
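One way to trace that curve without retraining from scratch is scikit-learn's `warm_start` option, which keeps the existing trees and only fits the additional ones when `n_estimators` grows. The grid of tree counts and the use of a single held-out validation split are assumptions for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xval, ytr, yval = train_test_split(X, y, random_state=0)

# warm_start=True reuses already-fitted trees on each subsequent fit() call.
forest = RandomForestClassifier(warm_start=True, random_state=0)
val_scores = {}
for n in [10, 25, 50, 100, 200]:
    forest.set_params(n_estimators=n)
    forest.fit(Xtr, ytr)  # only the newly requested trees are trained
    val_scores[n] = forest.score(Xval, yval)

for n, score in val_scores.items():
    print(f"n_estimators={n:3d} validation accuracy={score:.3f}")
```

Once consecutive rows stop changing meaningfully, the smaller tree count is usually the better trade.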
Set a random seed when you compare runs. Forests are stochastic, and small score differences across runs can be due to seed alone.
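To get a feel for how much of a score difference is seed noise, you can refit the same forest under a few seeds and look at the spread. The five seeds and 50 trees below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Keep the data split fixed so only the forest's own randomness varies.
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

seed_scores = []
for seed in range(5):
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    seed_scores.append(model.fit(Xtr, ytr).score(Xte, yte))

seed_scores = np.array(seed_scores)
print("scores per seed:", seed_scores)
print("spread (max - min):", seed_scores.max() - seed_scores.min())
```

If a tuning change moves the score by less than this spread, it is hard to credit the change rather than the seed.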
Wrap-up
Decision trees give you a transparent, no-preprocessing baseline that often performs surprisingly well. Random forests fix the main weakness of a single tree by averaging many decorrelated ones. When you need more accuracy, reach for gradient boosting next. But for a fast, reliable first cut on tabular data, a random forest is hard to beat.