
Data recruitment guide

XGBoost and Random Forest technical test: Data Scientist interview

XGBoost and Random Forest are among the most widely used algorithms for tabular data science. Interviews go beyond basic usage: hyperparameters, diagnostics, SHAP.

Data Builder · June 2025 · 7 min read · Data Scientist

Contents

Random Forest in depth
XGBoost and gradient boosting
Critical hyperparameters
Feature importance and SHAP
Overfitting and regularization
LightGBM vs XGBoost
Grid by level

1. Random Forest: how it really works

Discriminating question

What is the difference between bagging and boosting? Why is Random Forest robust to overfitting?

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X_train, y_train, X and y are assumed to be already loaded
rf = RandomForestClassifier(
    n_estimators=200,        # number of trees (more is better, with diminishing returns)
    max_features='sqrt',     # random subset of features tried at each split (the key to RF)
    max_depth=None,          # fully grown trees by default
    min_samples_leaf=1,      # regularization: increase if overfitting
    oob_score=True,          # Out-Of-Bag: generalization estimate at no extra cost
    n_jobs=-1,               # use all cores
    random_state=42
)

rf.fit(X_train, y_train)
print(f'OOB score: {rf.oob_score_:.3f}')  # accuracy on out-of-bag rows, no separate CV needed

# Cross-validation
scores = cross_val_score(rf, X, y, cv=5, scoring='roc_auc')
print(f'CV AUC: {scores.mean():.3f}')

Bagging: train N independent models on bootstrap samples and average their predictions. Reduces variance
Boosting: train models sequentially, each one correcting the errors of the previous ones. Reduces bias
OOB score: each tree is evaluated on the rows left out of its bootstrap sample. Gives a generalization estimate without a separate CV
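
To see why the OOB estimate costs nothing extra, here is a minimal sketch using NumPy only (the sample size and seed are arbitrary assumptions). It draws one bootstrap sample and measures the share of rows that were never picked. In expectation this share tends to 1/e, about 37%, so every tree has a sizeable set of unseen rows to be scored on.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, for reproducibility
n_rows = 10_000                   # hypothetical training-set size

# One bootstrap sample: n_rows draws with replacement from the row indices
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Rows never drawn are "out-of-bag" for the tree trained on this sample
in_bag = np.zeros(n_rows, dtype=bool)
in_bag[bootstrap_idx] = True
oob_fraction = 1.0 - in_bag.mean()

print(f"OOB fraction: {oob_fraction:.3f} (theory: {np.exp(-1):.3f})")
```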

2. XGBoost: advanced gradient boosting

Discriminating question

How does XGBoost improve on classic gradient boosting?

import xgboost as xgb

# Typical production configuration
# (passing early_stopping_rounds and eval_metric to the constructor requires xgboost >= 1.6)
model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,      # lower = less overfitting, but more trees needed
    max_depth=6,             # tree depth (3-8 is typical)
    subsample=0.8,           # fraction of rows per tree
    colsample_bytree=0.8,    # fraction of features per tree
    reg_alpha=0.1,           # L1 regularization
    reg_lambda=1.0,          # L2 regularization
    early_stopping_rounds=50, # stop after 50 rounds without improvement
    eval_metric='auc',
    random_state=42
)

# Early stopping: limits overfitting automatically, using the validation set
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=100              # log the metric every 100 rounds
)

L1/L2 regularization: built into XGBoost, absent from classic gradient boosting
Early stopping: stop training when performance on the validation set no longer improves
Missing values: XGBoost automatically learns a default direction for missing values
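
The sequential logic and the early-stopping rule can be sketched by hand. This is an illustrative toy for squared-error regression built on scikit-learn's DecisionTreeRegressor, not XGBoost's implementation: the function name, the synthetic data and the default values are assumptions. Each tree is fitted to the current residuals, and training stops once the validation error has not improved for `patience` rounds.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_with_early_stopping(X_tr, y_tr, X_va, y_va,
                              n_rounds=500, lr=0.05, patience=50, max_depth=3):
    """Toy gradient boosting for squared error, with early stopping."""
    pred_tr = np.full(len(y_tr), y_tr.mean())   # start from the mean prediction
    pred_va = np.full(len(y_va), y_tr.mean())
    best_mse, best_round, history = np.inf, 0, []
    for i in range(n_rounds):
        residual = y_tr - pred_tr               # negative gradient of 1/2 * squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X_tr, residual)
        pred_tr += lr * tree.predict(X_tr)      # shrink each tree's contribution
        pred_va += lr * tree.predict(X_va)
        mse = float(np.mean((y_va - pred_va) ** 2))
        history.append(mse)
        if mse < best_mse:
            best_mse, best_round = mse, i + 1
        elif i + 1 - best_round >= patience:    # no improvement for `patience` rounds
            break
    return best_round, best_mse, history

# Synthetic data (assumption): a noisy sine wave
rng = np.random.default_rng(0)
X_all = rng.uniform(-3, 3, size=(600, 1))
y_all = np.sin(X_all[:, 0]) + rng.normal(0, 0.3, size=600)

best_round, best_mse, history = boost_with_early_stopping(
    X_all[:400], y_all[:400], X_all[400:], y_all[400:]
)
print(f"best round: {best_round}, rounds run: {len(history)}, val MSE: {best_mse:.3f}")
```

A lower `lr` makes each step smaller, which is why a low learning rate needs more trees to reach the same validation error.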

3. Critical hyperparameters: which one to tune first

Discriminating question

Which hyperparameters do you tune first for XGBoost? In what order?

Hyperparameter	Impact	Direction if overfitting
n_estimators + learning_rate	Critical	Lower the learning rate, raise n_estimators and rely on early stopping
max_depth	High	Reduce (3-5 instead of 6-8)
subsample	Moderate	Reduce to 0.6-0.8
colsample_bytree	Moderate	Reduce to 0.6-0.8
min_child_weight	Moderate	Increase
reg_alpha / reg_lambda	Variable	Increase

Recommended order: 1. Set n_estimators with early stopping. 2. Tune max_depth + min_child_weight. 3. Tune subsample + colsample_bytree. 4. Fine-tune the learning rate.
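
The staged order can be sketched as a sequence of small grid searches, each one freezing the best values found so far. To stay dependency-light this sketch uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, with `min_samples_leaf` and `max_features` as loose analogues of `min_child_weight` and `colsample_bytree`. The helper name, the grids and the synthetic dataset are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (assumption), small enough to run in a few seconds
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=6, random_state=0
)

def tune_stage(base_params, grid):
    """Search `grid` with everything in `base_params` held fixed."""
    search = GridSearchCV(
        GradientBoostingClassifier(random_state=0, **base_params),
        grid, cv=3, scoring="roc_auc",
    )
    search.fit(X_demo, y_demo)
    return {**base_params, **search.best_params_}, search.best_score_

# Stage 1: generous tree budget, capped by early stopping (n_iter_no_change)
params = {"n_estimators": 200, "learning_rate": 0.1, "n_iter_no_change": 10}
# Stage 2: tree complexity
params, score = tune_stage(params, {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 10]})
# Stage 3: row and feature subsampling
params, score = tune_stage(params, {"subsample": [0.7, 1.0], "max_features": [0.7, 1.0]})
# Stage 4: refine the learning rate last
params, score = tune_stage(params, {"learning_rate": [0.03, 0.1]})

print(params)
print(f"CV AUC after the last stage: {score:.3f}")
```

Tuning in stages keeps each grid small, at the cost of missing interactions between parameters tuned in different stages.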

4. Feature importance: feature_importances_ vs SHAP

Discriminating question

Why do you prefer SHAP over scikit-learn's native feature importances?

import shap

# model is the fitted XGBClassifier, X_test a pandas DataFrame
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global feature importance
shap.summary_plot(shap_values, X_test, max_display=15)

# Explanation of a single prediction (first row of X_test)
shap.waterfall_plot(shap.Explanation(
    values=shap_values[0],
    base_values=explainer.expected_value,
    data=X_test.iloc[0],
    feature_names=X_test.columns.tolist()
))

# Interaction between features ('age' and 'nb_achats' are example column names)
shap.dependence_plot('age', shap_values, X_test,
                     interaction_index='nb_achats')

feature_importances_ (sklearn): based on impurity reduction. Biased toward high-cardinality features and can mislead when features are correlated
SHAP: based on game theory (Shapley values). Consistent, and gives both global and local explanations
SHAP interactions: SHAP interaction values show how two features interact
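
To make the Shapley idea concrete, here is a brute-force sketch in pure Python. It is not how shap.TreeExplainer works internally: it enumerates every feature subset and fills in the missing features from a single baseline point, which is a simplifying assumption. The toy model and all names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a single baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the instance's value, the others the baseline's
        return f([x[j] if j in subset else baseline[j] for j in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(v):
    a, b, c = v
    return 2 * a + b * c          # one additive term, one interaction term

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions come out as 2, 3 and 3. The additive term goes entirely to the first feature, the interaction term b * c = 6 is split evenly between the two features involved, and the contributions sum to the gap between the prediction and the baseline output.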

5. Diagnosing and fixing overfitting

Discriminating question

How do you detect overfitting? What tools do you use?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Note: learning_curve calls fit() without eval_set, so pass an estimator
# created without early_stopping_rounds
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 10),
    n_jobs=-1
)

# Overfitting: train score >> validation score
# Underfitting: both scores are low
# Good fit: train and validation scores converge

plt.plot(train_sizes, train_scores.mean(axis=1), label='Train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training set size')
plt.ylabel('AUC')
plt.legend()
plt.show()
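
A complementary diagnostic is to track the train/validation gap as model complexity grows. The sketch below uses scikit-learn's validation_curve on a single decision tree with noisy synthetic labels. The dataset, the depth grid and the variable names are assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)

depths = [1, 2, 4, 8, 16, 32]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)

# Gap between mean train and mean validation accuracy at each depth
gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
for depth, g in zip(depths, gap):
    print(f"max_depth={depth:>2}  train-val gap={g:.3f}")
```

A gap that keeps widening as depth increases is the signature of overfitting. The depth where validation accuracy peaks is a reasonable starting point for regularization.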

6. LightGBM vs XGBoost: when to choose which

Discriminating question

When do you prefer LightGBM over XGBoost?

	XGBoost	LightGBM
Speed	Moderate	2-10x faster
Memory	High	Lower (histograms)
Categorical features	Encoding needed	Native support
Large volumes	Slow	Excellent
Small datasets	Good	Can overfit more easily

LightGBM: leaf-wise growth instead of level-wise. Often more accurate, but more prone to overfitting on small datasets
Native categorical features: LightGBM handles categorical columns internally, so one-hot encoding (OHE) is not needed
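
The difference between the two growth strategies can be illustrated without LightGBM. In scikit-learn, a decision tree limited by `max_depth` grows level by level, while one limited by `max_leaf_nodes` grows best-first, which is the same idea as leaf-wise growth. This is a sketch on a hypothetical target whose structure is concentrated in a small region; it is not a benchmark of either library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=2000)
# Hypothetical target: all the structure sits in the first 20% of the range
y = np.where(x < 0.2, np.sin(40 * x), 0.0) + rng.normal(0, 0.05, size=2000)
X = x.reshape(-1, 1)

# Level-wise: depth capped at 3, so at most 2**3 = 8 leaves
level_wise = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
# Leaf-wise (best-first): same leaf budget, no depth cap
leaf_wise = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

for name, tree in [("level-wise", level_wise), ("leaf-wise", leaf_wise)]:
    mse = float(np.mean((y - tree.predict(X)) ** 2))
    print(f"{name}: leaves={tree.get_n_leaves()}, depth={tree.get_depth()}, train MSE={mse:.4f}")
```

With the same leaf budget, the best-first tree is free to spend its splits where the loss drops most, so it can end up deeper and more unbalanced. That focus is what tends to help accuracy on large data and hurt on small, noisy data.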

import optuna
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# XGBoost with early stopping
xgb_model = xgb.XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50,
    eval_metric='auc',
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    scale_pos_weight=10,  # class imbalance: roughly n_negative / n_positive
    tree_method='hist'    # histogram-based split finding, faster on large data
)
xgb_model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

# Optuna for hyperparameter tuning
def objective(trial):
    # Trial names match the XGBClassifier arguments, so study.best_params
    # can be passed straight back to the constructor
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
    }
    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    return roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
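
The value `scale_pos_weight=10` above is only a placeholder. A common starting point is the ratio of negative to positive samples in the training labels, as in this minimal sketch (the helper name and the example labels are hypothetical).

```python
import numpy as np

def suggested_scale_pos_weight(y):
    """Ratio of negative to positive samples, a usual first value to try."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples in y")
    return n_neg / n_pos

# Hypothetical labels: 90 negatives, 10 positives
y_example = np.array([0] * 90 + [1] * 10)
weight = suggested_scale_pos_weight(y_example)
print(weight)  # 9.0
```

Compute the ratio on the training split only, and treat it as one more hyperparameter to validate rather than a fixed rule.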

XGBoost vs Random Forest: RF trains its trees in parallel, does not overfit more as trees are added, and is robust with little tuning. XGBoost uses sequential boosting and is often more accurate, but needs careful tuning
Early stopping: stop XGBoost training when the validation metric no longer improves. Limits overfitting and reduces training time
SHAP for interpretability: shap.TreeExplainer(model) decomposes each prediction into per-feature contributions. Widely used to explain decisions in production
Optuna vs GridSearchCV: Optuna (Bayesian optimization) can be 10-50x more efficient than GridSearchCV at finding good hyperparameters, depending on the search space. Often faster than random search as well
Feature importance: gain (XGBoost) measures the average loss reduction from the splits that use a feature. It can be biased toward high-cardinality features. Prefer SHAP for per-prediction attributions, keeping in mind that they describe the model and not causal effects

7. Grid by level

Level	Skills	GO signal	NO-GO
Junior	sklearn RandomForest, basic XGBoost, train/test split	Can train an RF and an XGBoost model, knows the OOB score	Uses accuracy without checking class balance
Mid-level	XGBoost hyperparameters, early stopping, SHAP	Uses early stopping, tunes with RandomizedSearchCV, explains with SHAP	Does not know what early stopping is
Senior	Learning curves, LightGBM, SHAP interactions, production	Diagnoses overfitting with learning curves, compares XGBoost and LightGBM	Cannot explain the difference between bagging and boosting
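
The Junior NO-GO in the grid (accuracy without checking class balance) is easy to demonstrate. In this pure-Python sketch with hypothetical labels, a model that always predicts the majority class reaches 95% accuracy while detecting none of the positives.

```python
# Hypothetical labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

This is why the examples in this guide score models with ROC AUC rather than accuracy.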

