Tackling overfitting in tree-based models
A distilled guide to LightGBM & Random Forest hyperparameter optimization
Introduction
If you're using LightGBM or Random Forest, you know they're powerful. But if you're just using the default settings, you're leaving performance on the table and likely overfitting. To get models that actually work well on new, unseen data, you have to tune their hyperparameters.
This post distills practical experience into a focused discussion of the most impactful hyperparameters for LightGBM and Random Forest: not just which ones to tune, but what each does and why it matters.
To better understand the technical parts of this post, it is also advisable to read my previous one: Linear vs nonlinear algos - when and why
LightGBM part
LightGBM is incredibly fast and flexible. But with great power comes the need for careful control. The following parameters, presented in a generally recommended order of tuning priority, are critical. That does not mean you should tune them all at once: doing so is a sure path to so-called validation overfitting. It is best to start simple and add parameters step by step, checking whether out-of-sample (OOS) performance improves.
max_depth: Fundamentally dictates the complexity of individual learners. Left unconstrained (-1, LightGBM's default), trees can grow excessively deep, capturing noise rather than signal: a classic path to overfitting. For most financial time series or cross-sectional datasets, a constrained range, typically between 3 and 10, is advisable. Iterative testing within this band is key.
num_leaves: Controls the number of terminal nodes per tree. While the theoretical upper bound is 2^max_depth, allowing this many leaves, especially with deeper trees, drastically increases model capacity and overfitting risk. A more conservative strategy involves setting num_leaves to be significantly less than 2^max_depth. For instance, with a max_depth of 7 (128 potential leaves), values in the 30-60 range might offer a better bias-variance trade-off. Systematic exploration (e.g., 15, 31, 45, 63) is more effective than adhering rigidly to the power-of-2 rule.
n_estimators: The ensemble size. While more trees can improve performance, diminishing returns are typical, and computational cost increases linearly. This parameter interacts strongly with learning_rate; effective calibration often involves finding a balance, frequently guided by early stopping. Initial sweeps from 100 to 1000 in steps of 50-100 are common.
min_child_samples (aliased as min_data_in_leaf): Enforces a minimum data threshold for a leaf node. This prevents the model from creating splits based on idiosyncrasies of very small data subsets, thereby promoting smoother decision boundaries. Its optimal value is highly dataset-dependent; for lower-frequency data (e.g., weekly), 20 to 100 (with steps of 10-20) might be appropriate, whereas higher-frequency, granular data may warrant values in the hundreds.
min_child_weight (aliased as min_sum_hessian_in_leaf): A more refined control than min_child_samples. It refers to the minimum sum of second-order derivatives of the loss function (Hessians) required in a child. This considers not just the count but the "information content" or weight of samples. Starting from the default (1e-3) and exploring up to 10-20 (e.g., in steps of 0.5-1.0) can fine-tune leaf quality.
learning_rate (or eta): The shrinkage factor applied to each tree's contribution. Lower values (e.g., 0.01-0.05) generally necessitate higher n_estimators but often yield better generalization by allowing the model to learn more gradually. A typical range for exploration is 0.01 to 0.3.
reg_lambda (L2) and reg_alpha (L1): Regularization terms crucial for combating overfitting. L2 regularization penalizes the squared magnitude of feature weights, encouraging smaller, more diffuse weights. L1 regularization penalizes the absolute magnitude, potentially leading to sparse solutions (implicit feature selection). Exploring values from 0.0 (no regularization) to 5.0-10.0 in steps of 0.1-0.5 for both is a standard practice.
colsample_bytree (or feature_fraction): The fraction of features randomly sampled for constructing each tree. This is a primary mechanism for de-correlating trees in the ensemble, reducing variance. Values between 0.4 and 0.9 (e.g., with 0.1 steps) are effective. Setting this below 1.0 is almost always beneficial.
subsample (or bagging_fraction): The fraction of observations randomly sampled (without replacement, by default in LightGBM) for each tree. Note that bagging only takes effect when bagging_freq (subsample_freq in the scikit-learn API) is set to a positive value. This introduces instance-level diversity. Similar to colsample_bytree, values between 0.5 and 0.9 (avoiding 1.0) often improve robustness.
path_smooth: A newer regularization control that smooths the outputs of leaves with few samples by blending each leaf's raw value with its parent's smoothed value. Values from 0 to perhaps 20 could be explored, noting its potential to increase bias if set too high.
min_split_gain (or min_gain_to_split): The minimum improvement in the loss function required to justify a split. This acts as a pre-pruning mechanism, preventing splits that offer negligible informational gain. A range of 0.0 to 1.0 (e.g., in 0.1 steps) is typical.
extra_trees: If True, enables the use of Extremely Randomized Trees logic for finding splits. This introduces more randomness by selecting split points randomly for each considered feature, rather than optimally. While sometimes beneficial, its impact is dataset-specific.
boosting_type: LightGBM supports various boosting algorithms: gbdt (traditional Gradient Boosting Decision Tree), rf (Random Forest), dart (Dropouts meet Multiple Additive Regression Trees, often useful for mitigating over-specialization on large datasets but slower), and goss (Gradient-based One-Side Sampling, for efficiency). gbdt is the workhorse, but exploring rf or dart can sometimes yield improvements.
Random Forest part
Random Forest is often easier to get good results with and can be more robust, especially with noisy data. Its key hyperparameters, while fewer than LightGBM's, are equally significant for generalization performance.
n_estimators: The number of trees in the forest. Performance typically plateaus after a certain number of trees; 100 to 500 is a common and often sufficient range, with the upper end reserved for larger datasets.
max_depth: Controls the depth of individual trees. Allowing trees to grow to their maximum depth (None) can lead to overfitting, though Random Forest is generally more resilient than gradient boosting methods. Constraining max_depth (e.g., 3 to 15-20) is a good practice for regularization.
min_samples_split: The minimum number of samples an internal node must hold to be eligible for splitting. Increasing this from the default (often 2) to values like 5, 10, or even 20 makes the model more conservative, preventing splits on small, potentially noisy sample groups.
min_samples_leaf: The minimum number of samples that must reside in each terminal node (leaf). This ensures that splits are only made if each resulting child node is sufficiently populated, leading to smoother model predictions. Values such as 1, 3 (a common setting), 5, or 10 are worth evaluating.
max_features: The size of the random subset of features considered at each split point. This is critical for de-correlating the trees.
'sqrt': Uses the square root of the total number of features – a widely adopted heuristic and often a good default.
A float (e.g., 0.25 to 0.75): Specifies a fraction of the features to consider; roughly 1/3 is a common rule of thumb, particularly for regression tasks.
min_weight_fraction_leaf: Similar to min_samples_leaf but considers the fraction of the sum of (weighted) sample weights, rather than raw counts. Particularly relevant when using sample weights. Exploring small fractions (e.g., 0.0 to 0.05) is appropriate.
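Putting the Random Forest parameters together, here is a minimal sketch using scikit-learn on synthetic data; the chosen values mirror the ranges above and are illustrative, not tuned for any real problem.

```python
# Sketch: a conservatively regularized RandomForestRegressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=2000)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(
    n_estimators=300,       # within the typical plateau region
    max_depth=12,           # constrain depth instead of growing fully
    min_samples_split=10,   # no split on fewer than 10 samples
    min_samples_leaf=5,     # each leaf keeps at least 5 samples
    max_features="sqrt",    # de-correlate trees via feature subsampling
    n_jobs=-1,
    random_state=0,
)
forest.fit(X_train, y_train)
print(round(forest.score(X_val, y_val), 3))  # R^2 on held-out data
```

Unlike the LightGBM example, there is no early stopping here: Random Forest trees are independent, so adding trees past the plateau wastes compute but does not overfit further.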
Synthesis
Knowing the parameters is step one. Using them wisely and testing your model correctly is step two. Here’s what really matters:
What your hyperparams are really doing
Most of the parameters we've talked about for LightGBM and Random Forest boil down to a few key jobs:
Controlling Tree Size/Complexity: Things like max_depth, num_leaves, min_child_samples (min_samples_leaf), and min_samples_split. These directly stop your individual trees from becoming too complex and fitting every tiny wiggle in your training data. Simpler trees often mean a model that works better on new data.
Making Trees Different: Ensemble models like these work best when their individual trees are diverse. Parameters like colsample_bytree (feature_fraction), subsample (bagging_fraction), and max_features (for RF) do this by making sure each tree sees a slightly different slice of your data or features. This diversity is key to reducing the model's overall error.
Directly Fighting Overfitting (Regularization): Terms like reg_alpha, reg_lambda, and min_split_gain in LightGBM directly penalize complexity or splits that don't add much value. They help force the model to be simpler and more robust.
Don't "over-tune" – your validation set isn’t perfect!
It's tempting to tweak parameters endlessly to get the absolute best score on your validation set. But be careful! Your validation set is just one sample of data. If you tune too hard to it, you might just be fitting to the specific noise or quirks of that particular validation set. Your model might look great there, but then fail on completely new test data because it learned the wrong things.
This is called "overfitting to the validation set." A slightly lower score on the validation set with a simpler, more stable model is often better than the highest possible score from a very complex or very specific parameter set.
The golden rule: validate for tuning, test for truth – always!
This is the most important part of any modeling process. If you get this wrong, you can't trust your results.
Validation set: You use this to choose your hyperparameters. You try different settings, train on your training data, and see which settings give the best performance on the validation set.
Test set: This data is kept completely separate. You only use it once, at the very end, after you've picked your best hyperparameters using the validation set. The performance on the test set tells you how well your model is likely to do on brand new, unseen data.
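The discipline described above can be made concrete in a few lines. This is a minimal sketch on synthetic data with a chronological three-way split; the small grid over max_depth and min_samples_leaf is an arbitrary illustration, not a recommended search space.

```python
# Sketch: tune on the validation set, touch the test set exactly once.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 8))
y = X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=1500)

# Chronological split (no shuffling) for time-ordered data
X_train, y_train = X[:900], y[:900]
X_val, y_val = X[900:1200], y[900:1200]
X_test, y_test = X[1200:], y[1200:]

best_mse, best_params = float("inf"), None
for depth in (4, 8, 12):          # small, deliberate grid
    for leaf in (3, 5, 10):
        model = RandomForestRegressor(
            n_estimators=200, max_depth=depth,
            min_samples_leaf=leaf, random_state=0,
        ).fit(X_train, y_train)
        mse = mean_squared_error(y_val, model.predict(X_val))
        if mse < best_mse:
            best_mse = mse
            best_params = {"max_depth": depth, "min_samples_leaf": leaf}

# Refit with the chosen settings, then evaluate on the test set once
final = RandomForestRegressor(n_estimators=200, random_state=0, **best_params)
final.fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
test_mse = mean_squared_error(y_test, final.predict(X_test))
print(best_params, round(test_mse, 3))
```

The key property is that the test slice appears in exactly one line of the whole script: the final evaluation.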
This rule is even more important for things like walk-forward optimization, where you're re-training your model over time as new data comes in:
For each period you're re-training (each "walk-forward" step), you split your available history into a training part and a validation part.
You pick the best hyperparameters for that period using that training and validation data.
Then, you use those chosen settings to make your predictions for the next period (which acts as your test set for that step).
You never let information from the period you're trying to predict leak into your training or hyperparameter choices for that step. Keeping this separation clean at every stage is crucial if you want to build models you can actually rely on.
In short: know what your parameters do, don't chase tiny gains on your validation set too hard, and always, always keep your test data truly separate until the final evaluation. That’s how you build tree models that actually work.
AI Conclusion
So, that's the lowdown on getting your LightGBM and Random Forest models in shape. It's not about memorizing a giant list of parameters, but understanding the core ideas: controlling how complex your trees get, making sure your ensemble has enough variety, and directly fighting off overfitting.
The defaults are just a starting line. The real craft comes in thoughtful tuning, guided by solid testing practices. Don't just throw every parameter combination at the wall; think about what each one is doing. And above all, respect your validation and test sets – they're your compass for navigating the path to a truly robust model. Get these fundamentals right, and you'll be miles ahead in building tree-based models that you can actually trust in the wild. Good luck, and happy tuning!