Random Forests, from first principles
A six-minute tour of decision trees, the two tricks that tame them, why random forests work, and the place where they quietly stop being the right tool.
- ml
- tabular
- ensembles
Random forests are one of the few ML ideas that get stronger the longer you stare at them. They are simple, they almost always work on tabular data, and they let you explain the result to a skeptical product manager without a whiteboard. This post rebuilds the model from scratch, no scikit-learn magic, just the two ideas that do the real work.
The single tree, and what's wrong with it
A decision tree splits the feature space with axis-aligned cuts until each leaf is mostly pure. Greedy, recursive, ruthless.
One tree is a machine for memorizing. Grow it deep enough and it nails every training point, yet generalizes to new data about as well as a sticky note. The error of a deep tree is dominated by variance: tiny perturbations in the training set redraw the decision boundary in wild ways.
You have two honest options. Regularize the tree into shallowness and eat the bias, or build lots of them and average away the variance. Random forests take the second route, with two ideas.
Idea 1: bagging
Bagging (bootstrap aggregating) is almost embarrassingly simple: to train T trees, draw T bootstrap samples from your data (sample with replacement, each the same size as the original) and fit one tree to each. At prediction time, average their outputs (for regression) or take the majority vote (for classification).
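To make the recipe concrete, here is a minimal numpy sketch of bagging for regression, using a hand-rolled decision stump as the base learner. The stump, `n_trees=50`, and the helper names are illustrative choices, not part of the canonical algorithm:

```python
import numpy as np

def fit_stump(X, y):
    """Fit a one-split regression tree: best (feature, threshold) by SSE."""
    best = (0, 0.0, y.mean(), y.mean())
    best_sse = np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # every threshold except the max
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, t, left.mean(), right.mean())
    return best

def predict_stump(stump, X):
    j, t, left_val, right_val = stump
    return np.where(X[:, j] <= t, left_val, right_val)

def bagged_predict(X_train, y_train, X_test, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)           # bootstrap: n rows, with replacement
        stump = fit_stump(X_train[idx], y_train[idx])
        preds.append(predict_stump(stump, X_test))
    return np.mean(preds, axis=0)                  # regression: average the trees
```

Swap the averaging line for a majority vote and you have the classification version.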
Why does this help? Assume each tree produces an unbiased estimator with variance $\sigma^2$, and assume, for a moment, that the trees are independent. The average of T independent estimators has variance $\sigma^2 / T$. You just divided your variance by the number of trees. Bias unchanged, variance crushed.
The assumption is a lie, of course. Trees trained on bootstrap samples from the same dataset are correlated. If the correlation between any pair is $\rho$, the actual variance of the ensemble is:
$$ \operatorname{Var}(\bar{f}) = \rho \sigma^2 + \frac{1 - \rho}{T} \sigma^2 $$
As T grows, the second term vanishes and you are left with $\rho \sigma^2$. Correlation is the floor. Which brings us to the second trick.
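The formula is easy to verify exactly with a few lines of numpy: build the equicorrelated covariance matrix and compute the variance of the uniform average directly. A sanity check under the same assumptions as the derivation, nothing more:

```python
import numpy as np

def ensemble_variance(T, sigma2, rho):
    """Exact variance of the mean of T equicorrelated estimators."""
    # Covariance matrix: sigma2 on the diagonal, rho * sigma2 off it.
    cov = sigma2 * (rho * np.ones((T, T)) + (1 - rho) * np.eye(T))
    w = np.full(T, 1.0 / T)        # uniform averaging weights
    return w @ cov @ w             # Var(mean) = w' Sigma w

T, sigma2, rho = 100, 4.0, 0.3
exact = ensemble_variance(T, sigma2, rho)
formula = rho * sigma2 + (1 - rho) / T * sigma2
# The two agree, and as T grows both approach rho * sigma2: the floor.
```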
Idea 2: feature randomness
At each split in each tree, instead of considering all p features, consider a random subset of size m (typically $\sqrt{p}$ for classification, $p/3$ for regression). Pick the best split from that subset only.
This is the move that makes random forests random. It forces trees to disagree. A dominant feature can no longer hijack every tree's root split, so trees explore different parts of the hypothesis space. That drops $\rho$, and the floor drops with it.
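In code, the change is tiny. A sketch of the split search, assuming the same SSE criterion as a regression tree; the only random-forest-specific line is the one that subsamples the candidate features (`best_split` and its signature are illustrative):

```python
import numpy as np

def best_split(X, y, rng, max_features=None):
    """Greedy SSE split, restricted to a random feature subset."""
    p = X.shape[1]
    m = max_features or max(1, int(np.sqrt(p)))        # sqrt(p): the usual default
    candidates = rng.choice(p, size=m, replace=False)  # the random-forest twist
    best, best_sse = None, np.inf
    for j in candidates:                               # only m features, not all p
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, t)
    return best
```

Resampling the subset at every split, not once per tree, is what keeps even sibling nodes from agreeing by default.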
That's the whole algorithm. Not a metaphor or a lossy summary, that is it.
Why it works: the bias-variance story
A model's expected error decomposes into three pieces:
$$ \mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}] - f)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}(\hat{f})}_{\text{variance}} + \sigma_{\text{noise}}^2 $$
A deep decision tree has low bias (it can fit anything) but ruinous variance. The forest leaves the bias alone (each tree is still deep and expressive) and slashes variance by averaging decorrelated predictors. That is the trade you're making.
A beautiful consequence: adding more trees does not cause overfitting. The ensemble error converges to a floor set by $\rho \sigma^2 + \sigma_{\text{noise}}^2$. Pick T by compute budget and sleep well.
The pragmatic bits
A few things that fall out of the structure for free:
- Out-of-bag (OOB) error. Each bootstrap sample leaves out roughly a third of the data (about $1/e \approx 36.8\%$ in the limit). Use those left-out points as a built-in validation set. No cross-validation needed.
- Feature importance. Average the impurity decrease each feature creates across all splits, across all trees. Cheap, coarse, useful.
- Parallelism. Trees don't talk to each other. Train them on separate cores, machines, or planets.
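The "roughly a third" behind OOB error falls straight out of the bootstrap: the chance a given row is never drawn in n samples with replacement is $(1 - 1/n)^n$, which tends to $1/e \approx 0.368$. A quick numerical check (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 10_000, 50
oob_fracs = []
for _ in range(trials):
    idx = rng.integers(0, n, size=n)        # one bootstrap sample
    in_bag = np.zeros(n, dtype=bool)
    in_bag[idx] = True
    oob_fracs.append(1.0 - in_bag.mean())   # fraction of rows never drawn
mean_oob = float(np.mean(oob_fracs))
# mean_oob lands close to 1/e ~ 0.368
```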
Where it stops being the right tool
Random forests are boringly reliable on tabular data with a few thousand to a few million rows. They stop being the right tool when:
- You need a monotonic extrapolation. Trees can't extrapolate. A forest trained on `x ∈ [0, 10]` will be flat past 10. For time series with a trend, or physics where you know a bound, this is a trap.
- Features are high-dimensional and sparse. Text, images, raw audio. Trees hunt for axis-aligned cuts; neural nets learn the axes. Use the right tool.
- Inference latency matters and you have millions of trees. Gradient boosting + `treelite` compiles to tight C++. Plain RF inference is fine for batch, rough for sub-ms paths.
- You need calibrated probabilities. Forest vote fractions are biased toward the middle. Post-hoc calibration (isotonic, Platt) fixes it, but you have to do it.
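On the calibration point, scikit-learn's `CalibratedClassifierCV` wraps the forest and learns the post-hoc correction for you. A sketch, assuming isotonic regression as the method and a synthetic dataset standing in for your real one:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data; swap in your own X, y.
X, y = make_classification(n_samples=1500, n_features=10, random_state=0)

raw = RandomForestClassifier(n_estimators=100, random_state=0)
# Isotonic regression remaps vote fractions toward honest probabilities.
calibrated = CalibratedClassifierCV(raw, method="isotonic", cv=3)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]
```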
The practitioner's version
For most tabular problems you'll reach for a gradient-boosted cousin: XGBoost, LightGBM, or CatBoost. Same family, different bias-variance knob: GBMs build trees sequentially, each correcting the last, trading some of the forest's robustness for accuracy.
But start with a random forest. 500 trees, default depth, a reasonable max_features. If it already works, and it usually does, ship it, and spend the afternoon on something harder.
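That baseline is a few lines with scikit-learn; the dataset here is synthetic and illustrative, and the forest settings follow the recipe above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for "most tabular problems".
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

# 500 trees, default (unlimited) depth, sqrt(p) features per split.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True, n_jobs=-1, random_state=0)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")   # built-in validation, no CV
```

The `oob_score_` attribute is exactly the free validation set from the pragmatic list: if it looks good here, it usually looks good in production.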