Semi-supervised Gradient Boosting

Semi-supervised Gradient Boosting (Self-training / Pseudo-labeling with Gradient Boosted Trees)

Also known as: pseudo-label gradient boosting, self-training GBM, semi-supervised GBT, label-propagation boosting

Semi-supervised gradient boosting combines gradient boosted trees with self-training (pseudo-labeling) to exploit a large pool of unlabeled data alongside a small labeled set. An initial gradient boosting machine (GBM) is fit on the labeled data and used to predict the unlabeled examples; predictions whose confidence clears a threshold are kept as pseudo-labels, folded back into the training set, and the model is re-boosted. The cycle repeats until convergence, that is, until the validation metric stops improving or almost no new pseudo-labels are accepted. This lets practitioners use cheap unlabeled data when labels are scarce or expensive.
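The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes scikit-learn's GradientBoostingClassifier as the base model, synthetic data from make_classification, and hypothetical names and defaults (self_train_gbm, tau=0.9, max_rounds=5) that do not come from the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier


def self_train_gbm(X_lab, y_lab, X_unlab, tau=0.9, max_rounds=5, seed=0):
    """Illustrative self-training loop; tau and max_rounds are assumed defaults."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    model = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        proba = model.predict_proba(pool)
        confident = proba.max(axis=1) >= tau  # confidence gate
        if not confident.any():
            break  # nothing passes the gate: treat as converged
        # Pseudo-label = most probable class for each confident row.
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, pool[confident]])
        y_train = np.concatenate([y_train, pseudo])
        pool = pool[~confident]  # accepted rows leave the unlabeled pool
        model = GradientBoostingClassifier(random_state=seed).fit(X_train, y_train)
    return model, len(y_train) - len(y_lab)


# Synthetic stand-in for a small labeled set and a larger unlabeled pool.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_lab, y_lab, X_unlab = X[:100], y[:100], X[100:]
model, n_pseudo = self_train_gbm(X_lab, y_lab, X_unlab)
print("pseudo-labeled rows added:", n_pseudo)
```

In this sketch accepted pseudo-labels are never revisited; a variant that re-predicts the whole unlabeled pool each round is equally plausible and trades stability for the chance to correct early mistakes.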

When to use it

Use semi-supervised gradient boosting when labeled data are scarce (e.g., fewer than a few hundred examples) yet large amounts of unlabeled data from the same distribution are available, and collecting additional labels is costly. It is well-suited to tabular classification and regression tasks where a gradient boosted model already performs decently on the labeled set alone. Avoid it when: labeled and unlabeled data come from different distributions (distribution shift will amplify error); the base model's accuracy on labeled data is very low (poor pseudo-labels corrupt retraining); or the labeled set is already large enough to saturate a standard GBM, in which case the semi-supervised overhead yields negligible benefit.

Strengths & limitations

Strengths

Leverages cheap unlabeled data to improve performance when labels are expensive or scarce.
Compatible with any gradient boosting implementation (XGBoost, LightGBM, CatBoost, scikit-learn) without architectural changes.
Confidence thresholding provides a built-in quality gate that limits pseudo-label noise.
Maintains the full predictive power and feature-interaction modeling of gradient boosted trees.
Iterative refinement allows progressive incorporation of unlabeled data as the model improves.
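As one illustration of the compatibility point, scikit-learn ships a generic wrapper, SelfTrainingClassifier, that runs threshold-based self-training around any classifier exposing predict_proba. The sketch below wraps scikit-learn's own GradientBoostingClassifier on synthetic data; the threshold and iteration count are assumed values, and substituting another library's scikit-learn-style classifier is an assumption to verify for that library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=1)
y_partial = y.copy()
y_partial[80:] = -1  # scikit-learn's convention: -1 marks an unlabeled row

# The base estimator is passed positionally; the wrapper handles the
# predict / threshold / refit cycle internally.
clf = SelfTrainingClassifier(
    GradientBoostingClassifier(random_state=1),
    threshold=0.9,  # assumed confidence threshold (tau)
    max_iter=5,     # assumed cap on self-training rounds
)
clf.fit(X, y_partial)

# labeled_iter_ is 0 for originally labeled rows, -1 for rows never labeled,
# and the round number for rows that received a pseudo-label.
n_pseudo = int((clf.labeled_iter_ > 0).sum())
print("pseudo-labeled rows:", n_pseudo, "| stopped because:", clf.termination_condition_)
```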

Limitations

Pseudo-label quality depends on the base model; a weak initial model can corrupt subsequent iterations.
Assumes labeled and unlabeled data are drawn from the same distribution — violating this causes error amplification.
Threshold tuning (tau) adds a hyperparameter with no principled default.
Computational cost grows with each self-training iteration, especially on large unlabeled pools.
Validation must be performed on labeled data only; standard cross-validation over the augmented set inflates estimates.
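The validation point can be made concrete by carving out a hold-out split from the truly labeled rows before any pseudo-labeling takes place, and scoring both the supervised baseline and the semi-supervised model on that split only. The sketch below assumes scikit-learn, synthetic data, and illustrative split sizes; none of these values come from the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=2)
X_lab, y_lab, X_unlab = X[:150], y[:150], X[150:]

# Hold out part of the true-labeled data BEFORE pseudo-labeling starts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_lab, y_lab, test_size=0.3, stratify=y_lab, random_state=2
)

# Augmented training set: labeled training rows plus unlabeled rows (marked -1).
X_aug = np.vstack([X_tr, X_unlab])
y_aug = np.concatenate([y_tr, np.full(len(X_unlab), -1)])

baseline = GradientBoostingClassifier(random_state=2).fit(X_tr, y_tr)
semi = SelfTrainingClassifier(
    GradientBoostingClassifier(random_state=2), threshold=0.9, max_iter=5
).fit(X_aug, y_aug)

# Both models are scored on rows whose labels are real, never on pseudo-labels.
acc_baseline = accuracy_score(y_val, baseline.predict(X_val))
acc_semi = accuracy_score(y_val, semi.predict(X_val))
print(f"baseline={acc_baseline:.3f}  semi-supervised={acc_semi:.3f}")
```

Comparing the two scores on the same hold-out also gives a direct check of whether the unlabeled data helped at all on a given dataset.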

Frequently asked

How do I choose the confidence threshold?

Start conservatively: for binary classification, thresholds above 0.9 are common in the first round. After each round, monitor accuracy on a held-out labeled validation set and relax the threshold gradually only if performance holds or improves. There is no single principled value; treat it as a tunable hyperparameter.
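One way to encode that advice is a small update rule applied between rounds. The function below is a hypothetical helper, not part of any library: the name next_threshold, the step of 0.02, and the floor of 0.8 are all assumptions chosen for illustration.

```python
def next_threshold(tau, prev_score, new_score, step=0.02, floor=0.8):
    """Relax tau by `step` only if the validation score held or improved.

    If the score dropped, tau is left unchanged. `step` and `floor`
    are illustrative values, not recommended defaults.
    """
    if new_score >= prev_score:
        return max(floor, round(tau - step, 4))
    return tau


# Simulated validation scores after each self-training round.
scores = [0.80, 0.82, 0.82, 0.79]
tau = 0.95
history = [tau]
for prev, new in zip(scores, scores[1:]):
    tau = next_threshold(tau, prev, new)
    history.append(tau)
print(history)  # [0.95, 0.93, 0.91, 0.91]
```

The threshold relaxes twice while the score holds, then freezes once the score drops in the last round.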

Should pseudo-labeled examples be weighted equally to labeled ones?

Generally no. Assigning a weight less than 1.0 (e.g., 0.5) to pseudo-labeled examples reduces the influence of uncertain predictions on the loss function. Some implementations tie the weight to the model's predicted confidence, giving higher weight to higher-confidence pseudo-labels.
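Both weighting schemes mentioned above reduce to building a per-row weight vector. The helper below is a hypothetical sketch (the name pseudo_label_weights and its arguments are assumptions); it presumes the training matrix stacks labeled rows first and pseudo-labeled rows after them.

```python
import numpy as np


def pseudo_label_weights(n_labeled, confidences, scheme="constant", constant=0.5):
    """Weights for a training set of labeled rows followed by pseudo-labeled rows.

    scheme="constant": every pseudo-labeled row gets `constant`.
    scheme="confidence": each pseudo-labeled row gets its predicted confidence.
    """
    confidences = np.asarray(confidences, dtype=float)
    if scheme == "constant":
        pseudo_w = np.full(confidences.shape, constant)
    elif scheme == "confidence":
        pseudo_w = confidences
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    # True labels always keep full weight 1.0.
    return np.concatenate([np.ones(n_labeled), pseudo_w])


w_const = pseudo_label_weights(3, [0.95, 0.99])
w_conf = pseudo_label_weights(3, [0.95, 0.99], scheme="confidence")
print(w_const.tolist())  # [1.0, 1.0, 1.0, 0.5, 0.5]
print(w_conf.tolist())   # [1.0, 1.0, 1.0, 0.95, 0.99]
```

With scikit-learn's gradient boosting estimators the vector can be passed as fit(X_train, y_train, sample_weight=w); other libraries' wrappers generally offer a comparable argument, but check the signature for the version in use.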

When does semi-supervised gradient boosting hurt performance?

It hurts when the initial model is too weak to produce reliable pseudo-labels or when labeled and unlabeled data differ in distribution. When the unlabeled set is small relative to the labeled set there is little to gain, so any pseudo-label noise can outweigh the benefit. In these scenarios a standard supervised GBM trained on the labeled data alone is usually safer.

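Does CatBoost have a built-in semi-supervised mode?

To our knowledge, CatBoost's documented training API does not include a native semi-supervised mode, so check the documentation for your version before relying on one. What is usually meant by semi-supervised CatBoost is either an external pseudo-labeling loop around a CatBoost model, as described here, or a custom joint objective that optimizes a supervised loss on labeled data and an unsupervised or pseudo-label loss on unlabeled data within a single training run. The joint approach can be more stable than iterative external pseudo-labeling. Manual self-training loops are more general and work with any GBM library.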

How many iterations of pseudo-labeling are needed?

Typically 3 to 10 rounds suffice; convergence is detected when the labeled-set validation metric stops improving or the number of newly accepted pseudo-labels drops to near zero. More rounds risk compounding pseudo-label errors.
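The stopping criteria above can be combined into a single check run after every round. The function below is a hypothetical sketch: should_stop, patience, and min_accepted are assumed names and values, with max_rounds=10 taken from the upper end of the range quoted above.

```python
def should_stop(val_scores, n_accepted, max_rounds=10, patience=2, min_accepted=1):
    """Return a reason string if self-training should stop, else None.

    val_scores: validation metric after each completed round.
    n_accepted: number of pseudo-labels accepted in each completed round.
    """
    rounds_done = len(val_scores)
    if rounds_done >= max_rounds:
        return "max_rounds"
    if n_accepted and n_accepted[-1] < min_accepted:
        return "no_new_pseudo_labels"
    if rounds_done > patience:
        # Plateau: the last `patience` rounds did not beat the earlier best.
        best_before = max(val_scores[:-patience])
        if max(val_scores[-patience:]) <= best_before:
            return "validation_plateau"
    return None


print(should_stop([0.80, 0.82], [40, 25]))                     # None
print(should_stop([0.80, 0.83, 0.83, 0.82], [40, 25, 10, 5]))  # validation_plateau
print(should_stop([0.80, 0.81], [40, 0]))                      # no_new_pseudo_labels
```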

Sources

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995, 189–196. (Foundational self-training framework underlying pseudo-label approaches.)
Chapelle, O., Schölkopf, B., & Zien, A. (Eds.) (2006). Semi-Supervised Learning. MIT Press. ISBN: 978-0-262-03358-9

How to cite this page

ScholarGate. (2026, June 3). Semi-supervised Gradient Boosting (Self-training / Pseudo-labeling with Gradient Boosted Trees). ScholarGate. https://scholargate.app/en/machine-learning/semi-supervised-gradient-boosting

Which method?

Set this method beside its closest kin and read them side by side — the library lays the books on the table; the choice is yours.

Boosting
Gradient Boosting
Self-supervised Learning
Semi-supervised Learning
Semi-supervised Random Forest
XGBoost

Referenced by

Online Gradient Boosting
Semi-supervised CatBoost
Semi-supervised LightGBM

Related reference concepts

Ensemble Methods
Supervised Learning
Unsupervised Learning
Self-Supervised and Representation Learning
Learning to Rank
Cross-Validation and Resampling
