
ML Decision Trees and Random Forests

How decision trees work, why a single tree overfits, and how random forests solve that problem by averaging many trees trained on different data.

4 min read · By Codeloom
Intermediate 10 min read

What you'll learn

  • How a decision tree splits data
  • Why single trees overfit
  • How bagging builds a forest
  • When to use forests vs boosted trees
  • Practical tuning tips

Prerequisites

  • Basic Python familiarity

Decision trees and random forests are some of the most useful models in practical machine learning. They handle mixed data types, need little preprocessing, and produce interpretable splits. A single tree also has a well-understood failure mode, overfitting, that the random forest extension largely fixes. This post covers both.

What a decision tree really is

A decision tree splits the data into smaller and smaller groups by asking yes-or-no questions about one feature at a time. At each node, the algorithm picks the feature and threshold that best separate the classes (for classification) or reduce variance (for regression). It keeps splitting until a stopping rule kicks in, like a minimum number of samples or a maximum depth.
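To make the split search concrete, here is a minimal sketch of how one node might pick a threshold on a single feature. It assumes Gini impurity as the purity measure (one common choice, and scikit-learn's default criterion for classification). The helper names `gini` and `best_threshold` and the toy arrays are made up for illustration; real libraries use far more optimized code.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Try midpoints between sorted unique values of one feature.

    Returns (threshold, weighted child impurity); threshold is None
    if no split improves on the unsplit node.
    """
    values = np.unique(x)
    best = (None, gini(y))  # baseline: impurity of the node before splitting
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2.0
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data (illustrative only): low values are class 0, high values class 1.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_threshold(x, y)
print(threshold, impurity)
```

On this toy data the search settles on the midpoint 6.5, which separates the two classes perfectly, so the weighted impurity drops to 0. A full tree builder would repeat this search over every feature and then recurse into each child.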

The leaves of the tree hold predictions. For classification, the prediction is the majority class in the leaf. For regression, it is the mean target value. To predict for a new sample, you walk down the tree following the splits and read the leaf.
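The walk from root to leaf can be reproduced by hand from a fitted scikit-learn tree. The sketch below reads the low-level `tree_` arrays (`children_left`, `children_right`, `feature`, `threshold`, `value`); the helper name `walk` and the choice of `max_depth=3` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

def walk(clf, sample):
    """Follow the splits from the root to a leaf; return the leaf's majority class."""
    t = clf.tree_
    # scikit-learn stores tree inputs as float32, so compare in the same precision.
    sample = np.asarray(sample, dtype=np.float32)
    node = 0  # the root is always node 0
    while t.children_left[node] != -1:  # -1 marks a leaf
        if sample[t.feature[node]] <= t.threshold[node]:
            node = t.children_left[node]
        else:
            node = t.children_right[node]
    # value[node][0] holds the per-class weights in this leaf.
    return clf.classes_[np.argmax(t.value[node][0])]

manual = np.array([walk(clf, row) for row in X[:20]])
print((manual == clf.predict(X[:20])).all())
```

In everyday code you would simply call `predict`; the manual walk is only there to show that a prediction is nothing more than a chain of threshold comparisons.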

Mental model

A tree is a flowchart that the algorithm builds for you from data. Each split tries to make the resulting groups more pure, where pure means most rows share the same label.

single tree:
        root
       /    \
      A      B
     / \    / \
    leaves ...

forest:
tree1  tree2  tree3  ...  treeN
     \    |    /
   majority vote / mean
Tree split and forest aggregation

Hands-on example

Scikit-learn lets you fit either model in a single line.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Load a small binary classification dataset and hold out a test set.
X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Fit one unconstrained tree and a forest of 200 trees on the same training data.
tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)

# score() returns mean accuracy, here on the held-out test set.
print("tree   :", tree.score(Xte, yte))
print("forest :", forest.score(Xte, yte))

A deep, unpruned tree typically scores near-perfectly on the training data and noticeably worse on the test set, the classic sign of overfitting. The forest, which trains many trees on bootstrap samples and averages their predictions, usually beats a single tree on the test set.
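One way to see the overfitting directly is to print training and test accuracy side by side. This is a small sketch on the same breast cancer dataset; the `gaps` dictionary is just an illustrative name, and the exact numbers will depend on your scikit-learn version and the split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
gaps = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    train_acc = model.score(Xtr, ytr)  # accuracy on data the model has seen
    test_acc = model.score(Xte, yte)   # accuracy on held-out data
    gaps[name] = train_acc - test_acc
    print(f"{name:6s} train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:.3f}")
```

A large gap between the two columns is the signal to watch for. Note that a forest of deep trees also fits its training data very closely; what matters is how it does on the held-out rows.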

The random forest adds two tricks. Each tree sees a bootstrap sample of the data, not the whole dataset. At each split, the tree considers only a random subset of features. These randomizations decorrelate the trees, which is what makes averaging help so much.
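The two tricks can be sketched by hand with plain decision trees. This is a simplified illustration, not how scikit-learn builds its forests internally: it uses hard majority voting, whereas `RandomForestClassifier` averages predicted class probabilities. The tree count of 25 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary; odd so a binary vote cannot tie
trees = []
for i in range(n_trees):
    # Trick 1: bootstrap sample, i.e. draw training rows with replacement.
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    # Trick 2: max_features="sqrt" makes every split look at a random feature subset.
    t = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(t.fit(Xtr[idx], ytr[idx]))

# Majority vote: labels are 0/1, so the mean of the votes is the share voting 1.
votes = np.stack([t.predict(Xte) for t in trees])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("hand-rolled forest accuracy:", (y_pred == yte).mean())
```

Because each tree sees different rows and different candidate features, their mistakes tend to differ, and the vote cancels many of them out.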

Trade-offs

A single tree is easy to inspect. You can read its rules, plot it, and explain a prediction by walking the path. A forest is much harder to inspect because no single tree carries the model.
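Reading a tree's rules takes only a few lines with scikit-learn's `export_text`. The sketch below limits the tree to `max_depth=2` purely to keep the printout short; that depth is an illustrative choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# export_text renders the fitted tree as indented if/else style rules.
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one split, and each `class:` line is a leaf, so explaining a prediction means reading off the branch the sample follows. There is no equivalent single printout for a forest of hundreds of trees.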

Forests are slower at prediction time, scaling linearly with the number of trees. For most workloads this is fine, but on latency-critical paths, a smaller forest with shallower trees can hit a sweet spot.
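To get a feel for this trade-off, you can time prediction for a large forest and a smaller, shallower one. The two configurations below are arbitrary examples, and wall-clock timings on a dataset this small are noisy, so treat the output as a rough indication only.

```python
import time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Illustrative settings, not recommendations.
configs = {
    "large": dict(n_estimators=500, max_depth=None),
    "small": dict(n_estimators=50, max_depth=6),
}
results = {}
for name, params in configs.items():
    model = RandomForestClassifier(random_state=0, **params).fit(Xtr, ytr)
    start = time.perf_counter()
    model.predict(Xte)
    elapsed = time.perf_counter() - start
    results[name] = (model.score(Xte, yte), elapsed)
    print(f"{name}: accuracy={results[name][0]:.3f} predict_time={elapsed * 1000:.1f} ms")
```

If the smaller forest gives up little or no accuracy on your validation data, it is usually the better choice for a latency-sensitive service.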

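Trees are insensitive to feature scaling because a split depends only on the ordering of values, so they need far less preprocessing than linear models, which require careful scaling and encoding. Missing values and categorical features depend on the implementation: scikit-learn's trees and forests expect numeric input, so categorical columns still need encoding, and native handling of missing values is only available in recent releases.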

Forests often trail well-tuned gradient-boosted trees, such as XGBoost, LightGBM, and CatBoost, on tabular benchmarks. The forest is a safe baseline that works well with little tuning. Boosting is the typical winner once you tune.

Practical tips

Set a maximum depth or minimum samples per leaf for single trees. An unconstrained tree keeps splitting until its leaves are pure, which in practice means memorizing the training set. For forests, deep trees are usually fine because averaging reduces variance.
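Here is a small sketch comparing an unconstrained tree with a constrained one. The values `max_depth=4` and `min_samples_leaf=5` are illustrative starting points, not recommendations; in practice you would pick them with cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
# Illustrative limits: at most 4 levels, at least 5 training rows per leaf.
pruned = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=5, random_state=0
).fit(Xtr, ytr)

for name, model in [("unconstrained", full), ("constrained", pruned)]:
    print(
        f"{name:13s} depth={model.get_depth()} leaves={model.get_n_leaves()} "
        f"train={model.score(Xtr, ytr):.3f} test={model.score(Xte, yte):.3f}"
    )
```

The constrained tree will normally show a lower training score, and the point of the comparison is whether its test score holds up or improves.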

Always evaluate on a held-out test set or cross-validation. Tree training accuracy is misleadingly high on small datasets. Cross-validation gives a realistic number.
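The difference between the two numbers is easy to demonstrate with `cross_val_score`. This sketch uses 5 folds, which is a common default rather than a rule.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

train_acc = clf.fit(X, y).score(X, y)         # scored on the rows it was fit on
cv_scores = cross_val_score(clf, X, y, cv=5)  # each fold scored on held-out rows
print(f"training accuracy : {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Report the cross-validated mean together with its spread; the training figure says almost nothing about how the tree will behave on new data.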

Use feature importances as a starting point, not as gospel. The default impurity-based importances are computed from the training data and can be biased toward high-cardinality features. Permutation importance, measured on held-out data, is slower but more honest.
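Both kinds of importance are available in scikit-learn, so you can compare their rankings. The sketch below uses `feature_importances_` for the impurity-based version and `sklearn.inspection.permutation_importance` on the test split; `n_repeats=10` and the top-5 cutoff are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
Xtr, Xte, ytr, yte = train_test_split(data.data, data.target, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)

# Impurity-based importances are a by-product of training.
impurity_rank = np.argsort(forest.feature_importances_)[::-1][:5]

# Permutation importance shuffles one feature at a time on held-out data
# and measures how much the score drops.
perm = permutation_importance(forest, Xte, yte, n_repeats=10, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1][:5]

print("impurity top 5   :", [data.feature_names[i] for i in impurity_rank])
print("permutation top 5:", [data.feature_names[i] for i in perm_rank])
```

When the two rankings disagree, that is a hint to look more closely before drawing conclusions. Keep in mind that strongly correlated features can share or mask each other's permutation importance.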

Tune n_estimators by watching the validation score plateau. Adding trees does not make a forest overfit, so accuracy rarely gets worse, but each extra tree costs memory and prediction time. Stop adding once the curve flattens.
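One way to trace that curve without refitting from scratch is scikit-learn's `warm_start` option, which keeps the trees already built and only adds new ones. The list of tree counts below is an arbitrary grid, and the variable name `curve` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xval, ytr, yval = train_test_split(X, y, random_state=0)

# warm_start=True: each fit() call reuses existing trees and grows the forest.
forest = RandomForestClassifier(warm_start=True, random_state=0)
curve = []
for n in [10, 25, 50, 100, 200, 400]:
    forest.set_params(n_estimators=n)
    forest.fit(Xtr, ytr)
    curve.append((n, forest.score(Xval, yval)))

for n, acc in curve:
    print(f"n_estimators={n:4d} validation accuracy={acc:.3f}")
```

Pick the smallest tree count after which the validation accuracy stops moving in any meaningful way.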

Set a random seed when you compare runs. Forests are stochastic, and small score differences across runs can be due to seed alone.
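A short experiment shows how much of a score difference the seed alone can explain. The sketch keeps the train/test split fixed and varies only the forest's `random_state`; five seeds and 50 trees are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)  # fixed split

# Same data, different seeds: any variation comes from the forest's randomness.
scores = np.array([
    RandomForestClassifier(n_estimators=50, random_state=s).fit(Xtr, ytr).score(Xte, yte)
    for s in range(5)
])
print("scores per seed:", scores.round(3))
print(f"min={scores.min():.3f} max={scores.max():.3f} std={scores.std():.3f}")

# Same seed, same data: the result is reproducible.
a = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr).predict(Xte)
b = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr).predict(Xte)
print("reproducible with fixed seed:", (a == b).all())
```

If a change you are testing moves the score by less than the spread across seeds, it is not yet evidence of a real improvement.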

Wrap-up

Decision trees give you a transparent, low-preprocessing baseline that often performs surprisingly well. Random forests fix the main weakness of a single tree by averaging many decorrelated ones. When you need more accuracy, reach for gradient boosting next. But for a fast, reliable first cut on tabular data, a random forest is hard to beat.