

Lower-triangular Pearson correlation matrix of the final feature set used in this study. Each cell shows the Pearson correlation coefficient between a pair of planetary or stellar properties, with the color scale ranging from −1 (strong negative correlation) to +1 (strong positive correlation). The diagonal elements represent self-correlations. To improve readability, only moderate to strong correlations (|r| ≥ 0.5) are annotated. — astro-ph.EP
The increasing size and heterogeneity of exoplanet catalogs have made systematic habitability assessment challenging, particularly given the extreme scarcity of potentially habitable planets and the evolving nature of their labels.
In this study, we explore the use of pool-based active learning to improve the efficiency of habitability classification under realistic observational constraints. We construct a unified dataset from the Habitable World Catalog and the NASA Exoplanet Archive and formulate habitability assessment as a binary classification problem.
A supervised baseline based on gradient-boosted decision trees is established and optimized for recall in order to prioritize the identification of rare potentially habitable planets.
This model is then embedded within an active learning framework, where uncertainty-based margin sampling is compared against random querying across multiple runs and labeling budgets. We find that active learning substantially reduces the number of labeled instances required to approach supervised performance, demonstrating clear gains in label efficiency.
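As a minimal sketch of the two query strategies compared above, the snippet below selects pool instances either by smallest binary margin or uniformly at random. The function names, the batch size, and the toy probabilities are illustrative assumptions, not the authors' implementation. In practice the probabilities would come from the gradient-boosted classifier fitted on the currently labeled set.

```python
import random
from typing import List, Sequence


def margin_scores(probs: Sequence[float]) -> List[float]:
    """Binary margin |P(habitable) - P(non-habitable)| = |2p - 1|.

    A small margin means the model is close to undecided.
    """
    return [abs(2.0 * p - 1.0) for p in probs]


def select_by_margin(probs: Sequence[float], batch_size: int) -> List[int]:
    """Return indices of the `batch_size` pool items with the smallest margin."""
    scores = margin_scores(probs)
    order = sorted(range(len(probs)), key=lambda i: scores[i])
    return order[:batch_size]


def select_at_random(n_pool: int, batch_size: int, rng: random.Random) -> List[int]:
    """Random-querying baseline: sample pool indices without replacement."""
    return rng.sample(range(n_pool), batch_size)


# Hypothetical predicted probabilities of the "habitable" class for five
# unlabeled pool planets (toy values, not taken from the paper).
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10]

queried = select_by_margin(pool_probs, batch_size=2)
baseline = select_at_random(len(pool_probs), batch_size=2, rng=random.Random(0))
```

Here margin sampling picks the planets whose predicted probability lies nearest 0.5, whereas the random baseline ignores the model entirely; repeating both over several seeds and labeling budgets gives the kind of comparison the abstract describes.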
To connect these results to a practical astronomical use case, we aggregate predictions from independently trained active-learning models into an ensemble and use the resulting mean probabilities and uncertainties to rank planets originally labeled as non-habitable. This procedure identifies a single robust candidate for further study, illustrating how active learning can support conservative, uncertainty-aware prioritization of follow-up targets rather than speculative reclassification.
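One way such an ensemble ranking could be sketched is shown below. It assumes, purely for illustration, that the uncertainty is the standard deviation of the predicted probability across models and that candidates are ordered by a conservative score (mean minus `k` standard deviations). The abstract does not specify the authors' exact scoring rule, and the planet names and probabilities are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def rank_candidates(
    model_probs: Sequence[Sequence[float]],
    names: Sequence[str],
    k: float = 1.0,
) -> List[Tuple[str, float, float, float]]:
    """Rank planets by a conservative score across an ensemble.

    model_probs[m][j] is model m's predicted probability that planet j is
    potentially habitable. Returns (name, mean, std, mean - k * std) tuples,
    sorted from highest to lowest score.
    Assumption: std across models stands in for the ensemble uncertainty.
    """
    ranked = []
    for j, name in enumerate(names):
        p = [row[j] for row in model_probs]
        mu, sigma = mean(p), pstdev(p)
        ranked.append((name, mu, sigma, mu - k * sigma))
    ranked.sort(key=lambda r: r[3], reverse=True)
    return ranked


# Hypothetical planets originally labeled non-habitable, scored by three
# independently trained models (toy values only).
names = ["planet_a", "planet_b", "planet_c"]
model_probs = [
    [0.80, 0.90, 0.10],
    [0.82, 0.30, 0.12],
    [0.78, 0.95, 0.08],
]

ranking = rank_candidates(model_probs, names)
```

Penalizing disagreement between models means that a planet with a high mean but a wide spread (such as `planet_b` here) ranks below one on which the ensemble agrees, which matches the conservative, uncertainty-aware prioritization the abstract describes.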
Our results indicate that active learning provides a principled framework for guiding habitability studies in data regimes characterized by label imbalance, incomplete information, and limited observational resources.
R. I. El-Kholy, Z. M. Hayman
Comments: 19 pages, 9 figures, 2 tables
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Cite as: arXiv:2602.23666 [astro-ph.EP] (or arXiv:2602.23666v1 [astro-ph.EP] for this version)
https://doi.org/10.48550/arXiv.2602.23666
Submission history
From: Reham El-Kholy PhD
[v1] Fri, 27 Feb 2026 04:18:11 UTC (398 KB)
https://arxiv.org/abs/2602.23666






