Quentin Changeat, Kai Hou Yip, ESA-Ariel Data Challenge NeurIPS 2022: introduction to exo-atmospheric studies and presentation of the Atmospheric Big Challenge (ABC) Database, RAS Techniques and Instruments, Volume 2, Issue 1, January 2023, Pages 45–61, https://doi.org/10.1093/rasti/rzad001
Abstract
This is an exciting era for exoplanetary exploration. The recently launched JWST, and other upcoming missions and facilities such as Ariel, Twinkle, and the ELTs, are set to bring fresh insights into the convoluted processes of planetary formation and evolution and their connections to atmospheric composition. However, with new opportunities come new challenges. The field of exoplanet atmospheres is already struggling with the incoming volume and quality of data, and machine learning (ML) techniques present themselves as a promising alternative. Developing techniques of this kind is an interdisciplinary task, one that requires domain knowledge of the field, access to relevant tools, and expert insights on the capabilities and limitations of current ML models. These stringent requirements have so far limited the development of ML in the field to a few isolated initiatives. In this paper, we present the Atmospheric Big Challenge Database (ABC Database), a carefully designed, organized, and publicly available data base dedicated to the study of the inverse problem in the context of exoplanetary studies. We have generated 105 887 forward models and 26 109 complementary posterior distributions obtained with a Nested Sampling algorithm. Alongside the data base, this paper provides a jargon-free introduction for non-field experts interested in diving into the intricacies of atmospheric studies. This data base forms the basis for a multitude of research directions, including, but not limited to, developing rapid inference techniques, benchmarking model performance, and mitigating data drifts. A successful application of this data base is demonstrated in the NeurIPS Ariel ML Data Challenge 2022.
1 CONTEXT
The field of exoplanets has come a long way since the discovery of the first exoplanets in 1992 (Wolszczan & Frail 1992). With the launch of telescopes dedicated to the detection of exoplanets, such as the Convection, Rotation et Transits planétaires (CoRoT; Pätzold et al. 2012), Kepler (Borucki et al. 2010), and the Transiting Exoplanet Survey Satellite (TESS; Ricker et al. 2015) space telescopes, we now have basic characteristics, such as planetary radii or masses, for more than 5000 alien worlds. From the observed population, we deduced that, while exoplanets are ubiquitous (Cassan et al. 2012; Batalha 2014), the architecture of our Solar system does not appear to be a typical outcome of planetary formation. For instance, the first exoplanet detected around a Sun-like star is classified as a hot Jupiter (Mayor & Queloz 1995): a planet of similar size to Jupiter (i.e. about 10 times the size of Earth) but orbiting so close to its host star that it completes a full revolution in about 4 d. No such planet exists in our Solar system, and the same is true of the majority of the observed planets, referred to as sub-Neptunes because their sizes lie between those of Earth and Neptune (Howard et al. 2010; Fulton et al. 2017; Petigura et al. 2022). To answer the most fundamental questions of the field, such as ‘what are exoplanets made of?’ or ‘how do planets form?’, one must obtain information complementary to planetary masses and radii.
In the last decade, astronomers have therefore turned their attention to exoplanetary atmospheres, or exo-atmospheres, in the quest for further constraints on these worlds (Charbonneau et al. 2002; Tinetti et al. 2007; Swain, Vasisht & Tinetti 2008; Kreidberg et al. 2014; Schwarz et al. 2015; Sing et al. 2016; Stevenson et al. 2017; de Wit et al. 2018; Hoeijmakers et al. 2018; Tsiaras et al. 2018, 2019; Brogi & Line 2019; Welbanks et al. 2019; Edwards et al. 2020; Changeat & Edwards 2021; Roudier et al. 2021; Yip et al. 2021a; Changeat et al. 2022; Chen et al. 2022; Edwards et al. 2022; Estrela, Swain & Roudier 2022; Mikal-Evans et al. 2022). The study of exoplanet atmospheres has been enabled by the use of space-based instrumentation, such as the Hubble Space Telescope (HST) and the retired Spitzer Space Telescope, and ground-based facilities such as the Very Large Telescope (VLT). Many discoveries were made: we know, for instance, that water vapour is present in many hot Jupiter atmospheres, and we have recently recovered evidence for links between atmospheric chemistry and formation pathways. However, with the recent launch of the NASA/ESA/CSA James Webb Space Telescope (JWST; Greene et al. 2016), and the upcoming ESA Ariel Mission (Tinetti et al. 2021) and BSSL Twinkle Mission (Edwards et al. 2019b), the field of exoplanetary atmospheres will undergo a revolution. The quality and quantity of atmospheric data will increase dramatically, bringing many new challenges.
One of the main challenges in the study of exo-atmospheres, even today, concerns the reliable extraction of information content from observed data. Atmospheres are complex dynamical systems, involving many physical processes (chemical reactions, cloud formation, energy transport, fluid dynamics) that are coupled, poorly understood, and difficult to reproduce on Earth. Astronomers have therefore attempted to interpret observations of atmospheres using retrieval techniques: simplified models (or reduced-order models) for which the parameter space of possible solutions is explored using a statistical framework (Irwin et al. 2008; Madhusudhan & Seager 2009; Line et al. 2012, 2013; Waldmann et al. 2015a,b; Lavie et al. 2017; Gandhi & Madhusudhan 2018; Mollière et al. 2019; Zhang et al. 2019; Min et al. 2020; Al-Refaie et al. 2021; Harrington et al. 2022). With current observational data, state-of-the-art retrieval models use sampling-based Bayesian techniques, such as MCMC or Nested Sampling, with non-informative (uniform) priors to obtain the posterior distributions of between 10 and 30 free parameters (Changeat et al. 2021a). The number of free parameters depends on the information content available in the observational data and the chosen atmospheric model. As of today, there is no consensus on the most appropriate atmospheric model to employ, and we cannot obtain in-situ observations (e.g. we cannot travel there). Sampling-based techniques typically require between 10^5 and 10^8 forward model calls to reach convergence, meaning that only models providing spectra of the order of seconds are viable. The increase in data quality brought by JWST, Ariel, and Twinkle will enable a wider range of atmospheric processes to be probed by the observations, implying that forward models must grow in complexity, and so does the dimensionality of the problem (The JWST Transiting Exoplanet Community Early Release Science Team et al. 2022). As such, interpreting next-generation telescope data is a real issue, highlighted multiple times by studies relying on simulations, and it will require a revolution in both our models and our information extraction techniques (Rocchetto et al. 2016; Caldas et al. 2019; Changeat et al. 2019; Taylor et al. 2020, 2021; Yip et al. 2020; Changeat et al. 2021a; Al-Refaie et al. 2022a; Yip et al. 2022a).
In recent years, the community has started to explore alternative approaches to circumvent the bottleneck of sampling-based approaches. ML models present themselves as promising candidates, with their high flexibility and rapid inference times. Waldmann (2016) pioneered the use of deep learning in the context of atmospheric retrieval, training a Deep Belief Network to identify molecules from simulated spectra. In parallel, Márquez-Neila et al. (2018) led the first attempt to train a Random Forest regressor to predict planetary parameters directly. Since then, the field has started to look at different ML methodologies to bypass the lengthy and computationally intensive retrieval process (Soboczenski et al. 2018; Zingales & Waldmann 2018; Cobb et al. 2019; Hayes et al. 2020; Nixon & Madhusudhan 2020; Oreshenko et al. 2020; Ardevol Martinez et al. 2022; Haldemann et al. 2022; Himes et al. 2022; Yip et al. 2022a). Pushed by astronomers’ need for explainable solutions, other groups have also looked into the information content of exoplanetary spectra with AI (Guzmán-Mesa et al. 2020; Yip et al. 2021b).
The publicly available Atmospheric Big Challenge (ABC) Database of forward models and retrievals aims to provide the resources to address the aforementioned issues via the participation of external communities, and to encourage novel, cross-disciplinary solutions. It is constructed as a permanent data repository for further investigations. The data base is accessible at the following link: https://doi.org/10.5281/zenodo.6770103.
Since the creation of such a data base constitutes a major barrier for anyone interested in applying ML to the domain of exoplanet atmospheres, we emphasize its release as a community asset. The organization and creation of this data set poses a challenge of its own because of the following:
It requires cross-disciplinary collaboration. The problem requires domain knowledge (atmospheric chemistry, radiative transfer, atmospheric retrievals) to ensure the data product represents a meaningful science case rather than a trivial example. At the same time, it requires ML expertise to ensure the data product is representative of the problem at hand and, ideally, one that adequately reflects reality.
It requires access to the relevant tools, which are often exclusive to the exoplanet community: atmospheric retrieval and chemistry codes as well as instrument noise simulators.
It requires significant computing resources. For this project, more than 2 000 000 CPUh were used. Simulations of this scale have never been attempted before.
This paper is written to (1) provide non-field experts with a lightweight introduction to the science behind the data generation process, (2) document the steps involved in the creation of the ABC Database, and (3) provide a carefully curated, well-organized, and scientifically relevant data set for any research community. This manuscript complements the data challenge proposal description (Yip et al. 2022b) accepted as a NeurIPS 2022 data challenge. It is intended to provide the required domain knowledge for non-field experts. We present a simplified, jargon-free introduction to the most commonly employed techniques in the field of exo-atmospheres in Appendix A.
2 DATA GENERATION
For the data generation, we employed Alfnoor (Changeat et al. 2021a), a tool built to expand the forward model and atmospheric retrieval capabilities of TauREx 3 (Al-Refaie et al. 2021) to large populations of exo-atmospheres. Alfnoor allows us to automate the generation of telescope simulations and to perform large-scale standardized atmospheric retrievals. A lightweight description of the main concepts behind atmospheric studies of exoplanets is given in Appendix A. In the context of ESA-Ariel, we generated 105 887 simulated forward observations as well as 26 109 standardized retrieval outputs.
2.1 Source of input parameters
To model those extrasolar systems, some preliminary assumptions were required. In particular, all the parameters that are not linked to the atmospheric chemistry needed to be fixed to realistic values. Those parameters include, but are not limited to, the stellar radius (Rs), the distance to Earth (d), the stellar K magnitude (Kmag), the planetary radius (Rp), the planetary mass (Mp), the planet equilibrium temperature (T), and the transit duration (t14).
The planetary objects in this data base were selected from the list of confirmed known exoplanets and the list of TESS exoplanet candidates (TOIs). This list was constructed as part of the ESA-Ariel Target list initiative (Edwards et al. 2019a; Edwards & Tinetti 2022), frozen to 2022 March 1 for this data base. For the TOIs, we are aware that some of those objects will not be exoplanets; however, the observation of their transit by TESS and the first preliminary checks of their inferred properties make them compelling objects. Follow-up observations will allow us to classify their nature, but for the purpose of building this data base, they are as close as possible to what reality looks like. As radial velocity follow-up cannot be, and is not, systematically conducted for all targets, the mass of some of those objects is unknown. In this case, as in Edwards & Tinetti (2022), we replace the planetary mass by an estimate from the relation described in Chen & Kipping (2017). From those lists of objects, we then filtered out all the planets with radii below 1.5 R⊕, the conservative value for the middle of the Radius Valley (Fulton et al. 2017; Cloutier & Menou 2020; Petigura et al. 2022); a minimal sketch of this cut is given below. This is because the atmospheric composition of small planets would require a much more complex treatment (e.g. the assumption of a hydrogen-dominated atmosphere is not theoretically sound) than is proposed here. In total, we obtained data for 2972 confirmed exoplanets and 2928 candidate exoplanets, bringing our total to 5900 unique objects. The list of selected planets for this data base is shown in Fig. 1.
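For illustration, this selection cut can be reproduced in a few lines of pandas. This is a minimal sketch assuming a hypothetical CSV export of the target list with self-descriptive column names; the actual target list uses its own schema.

```python
import pandas as pd

R_EARTH_IN_RJUP = 0.0892  # 1 Earth radius expressed in Jupiter radii

# Hypothetical file and column names for the frozen target list.
targets = pd.read_csv("ariel_target_list_2022_03_01.csv")

# Drop planets below the Radius Valley cut (1.5 R_Earth): for these, the
# hydrogen-dominated atmosphere assumption is not theoretically sound.
keep = targets["planet_radius_Rjup"] >= 1.5 * R_EARTH_IN_RJUP
targets = targets[keep].reset_index(drop=True)
print(f"{len(targets)} objects retained after the radius cut")
```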

Figure 1. Size of the considered planets versus the mass of their host star. We exclude planets with radii below 1.5 R⊕, marked in grey, approximately corresponding to the lower limit of the Radius Valley.
Fig. 2 shows the distributions of nine selected stellar and planetary parameters. These values are taken from the actual planetary systems and therefore follow the currently observed demographics; they remain unchanged throughout the data generation process. However, relying on currently known planets is a double-edged sword. While it saves us from making unverified assumptions, our data are prone to the selection biases stemming from the observation techniques, strategies, and instrument specifications. These biases can easily be spotted in Fig. 2. For instance, the distribution of orbital periods is skewed towards short periods (peaking around ∼3 d), as planets close to their host star are easier to discover.

Figure 2. Distribution of nine stellar and planetary parameters used to generate the synthetic spectra. These distributions closely follow the actual demographics of the currently known population of exoplanets, and are therefore also subject to the biases present in the original population.
2.2 The atmospheric forward model setup
We produce batches of randomized observations for the population described in the previous section. For each planet, the stellar (Rs, d, Kmag), orbital (t14), and bulk parameters (Rp, Mp, T) are fixed to their literature values, while the chemistry of the atmosphere is randomly generated. The thermal profile is assumed to be isothermal (constant temperature) at the equilibrium temperature of the planet, and we simulate the planet’s atmosphere from 10 to 10−10 bar using 100 layers (divided uniformly in log-pressure space).
For the chemistry, we assume a primary atmosphere made mainly of hydrogen and helium (He/H2 = 0.17), to which we add trace gases. The trace gases are H2O (Polyansky et al. 2018), CH4 (Yurchenko et al. 2017; Chubb et al. 2021), CO (Li et al. 2015), CO2 (Yurchenko et al. 2020), and NH3 (Coles, Yurchenko & Tennyson 2019), selected based on our current understanding of exoplanetary chemistry (Agúndez et al. 2012; Venot & Agúndez 2015; Drummond et al. 2016; Madhusudhan et al. 2016; Stock et al. 2018; Woitke et al. 2018; Venot et al. 2020; Baeyens et al. 2022; Al-Refaie et al. 2022a). The mixing ratio, or trace abundance, of those gases is randomly chosen using a log-uniform law whose bounds depend on the molecule considered. The log-uniform law is chosen rather than a more informative law (such as equilibrium chemistry) because we are looking for solutions that are unbiased with respect to our current, most likely limited, understanding of atmospheric chemistry. Such a training set is suitable for producing ML solutions that behave in a similar way to the so-called free chemistry retrievals. If a correlation exists in a real population (e.g. between the chemistry of the atmosphere and its thermal structure), such a method should allow the extraction of this trend without implicitly making a physical assumption. Note that this is required in cases where the data have undergone a data shift (here, when the data are generated using a different atmospheric assumption). Another important point to consider involves the detection capabilities of Ariel for each molecule and the degeneracies between molecular species. For instance, CO shares similar features with CO2 in Ariel, but it is a much harder molecule to detect due to its weaker absorption properties. Owing to those differences in the strength of spectral features, and guided by the Ariel Tier-2 detection limits investigated in Changeat et al. (2020a), we select different bounds for the randomized chemical abundances. This process allows us to balance our data set and ensure that a significant fraction of the planets have a detectable amount of CO. The bounds employed for this data set are as follows (a minimal sampling sketch is given after the list):
H2O: RandomLogUniform(min = −9, max = −3).
CO: RandomLogUniform(min = −6, max = −3).
CO2: RandomLogUniform(min = −9, max = −4).
CH4: RandomLogUniform(min = −9, max = −3).
NH3: RandomLogUniform(min = −9, max = −4).
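The bounds above translate directly into log-uniform draws; the function name and seed in this sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# log10 volume mixing ratio bounds from the list above.
BOUNDS = {
    "H2O": (-9, -3),
    "CO":  (-6, -3),
    "CO2": (-9, -4),
    "CH4": (-9, -3),
    "NH3": (-9, -4),
}

def sample_chemistry(rng):
    """Draw one randomized set of trace-gas abundances (in VMR)."""
    return {mol: 10.0 ** rng.uniform(lo, hi) for mol, (lo, hi) in BOUNDS.items()}

abundances = sample_chemistry(rng)
```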
For each parametrized atmosphere, we compute the radiative transfer (see Appendix A) layer by layer, including the contributions from molecular absorption, Collision Induced Absorption (CIA), and Rayleigh Scattering.
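To make this step concrete, the condensed sketch below follows the public TauREx 3 quickstart to assemble an isothermal transmission model with the three contributions listed above. The opacity paths and the stellar and planetary values are placeholders, and keyword names should be checked against the TauREx 3 documentation for the installed version.

```python
from taurex.cache import OpacityCache, CIACache
from taurex.temperature import Isothermal
from taurex.planet import Planet
from taurex.stellar import BlackbodyStar
from taurex.chemistry import TaurexChemistry, ConstantGas
from taurex.model import TransmissionModel
from taurex.contributions import (AbsorptionContribution, CIAContribution,
                                  RayleighContribution)

# Point the caches to locally stored cross-section and CIA tables.
OpacityCache().set_opacity_path("path/to/xsec")  # placeholder path
CIACache().set_cia_path("path/to/cia")           # placeholder path

planet = Planet(planet_radius=1.0, planet_mass=1.0)    # Jupiter units
star = BlackbodyStar(temperature=5700.0, radius=1.0)   # Solar units
temperature = Isothermal(T=1450.0)                     # equilibrium temperature

# H2/He fill gases with He/H2 = 0.17, plus the randomized trace gases
# (`abundances` comes from the sampling sketch above).
chemistry = TaurexChemistry(fill_gases=["H2", "He"], ratio=0.17)
for mol, vmr in abundances.items():
    chemistry.addGas(ConstantGas(mol, mix_ratio=vmr))

# 100 layers between 10 bar (1e6 Pa) and 1e-10 bar (1e-5 Pa).
tm = TransmissionModel(planet=planet, star=star, chemistry=chemistry,
                       temperature_profile=temperature,
                       atm_min_pressure=1e-5, atm_max_pressure=1e6,
                       nlayers=100)
tm.add_contribution(AbsorptionContribution())
tm.add_contribution(CIAContribution(cia_pairs=["H2-H2", "H2-He"]))
tm.add_contribution(RayleighContribution())
tm.build()

result = tm.model()  # high-resolution native-grid spectrum (to be binned)
```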
Each spectrum is first computed at a high resolution,1 before being convolved with an Ariel instrument simulation. For each planet, we employed the TauREx plugin for ArielRad (Mugnai et al. 2020), the official Ariel noise simulator, to estimate the noise on the observation at each wavelength. With ArielRad, we force each observation to satisfy the criteria for Ariel Tier-2 observations (Tinetti et al. 2021), meaning that the observations have a specific resolution profile (e.g. R ≈ 10 for 1.10 < λ < 1.95 μm; R ≈ 50 for 1.95 < λ < 3.90 μm; R ≈ 15 for 3.90 < λ < 7.80 μm) and that the signal-to-noise ratio (SNR) of the observations must be higher than 7 on average. The SNR is here defined on the atmospheric signal (e.g. the second part of equation A2). To produce the simulated spectra, we select the minimum number of transits that allows us to reach this threshold, meaning that our sample of observations contains a wide range of final SNRs. Since we used real objects for those simulations, and not all planets are favourable targets for Ariel, some targets require an unrealistic number of observations to reach the SNR condition of Tier 2. However, this does not affect the purpose of this data set, which is to provide independent instances of realistic noise profiles.
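The Tier-2 criterion translates into a minimum number of stacked transits per target. Below is a minimal sketch, assuming the standard 1/√N averaging of the single-transit noise; the function and variable names are ours.

```python
import numpy as np

def transits_needed(atm_signal, noise_one_transit, snr_target=7.0, n_max=1000):
    """Smallest number of stacked transits giving mean SNR >= snr_target.

    atm_signal: per-channel atmospheric signal (the 2*Rp*h/Rs^2 term).
    noise_one_transit: per-channel 1-sigma uncertainty for a single
    transit (e.g. from ArielRad); stacking N transits is assumed to
    average the noise down as 1/sqrt(N).
    """
    for n in range(1, n_max + 1):
        snr = np.mean(atm_signal / (noise_one_transit / np.sqrt(n)))
        if snr >= snr_target:
            return n
    return None  # SNR target unreachable within n_max transits
```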
Following those steps, we obtain a realistic Ariel simulated observation for each planet and each randomized chemistry. We show an example of such a simulated observation in Fig. 3. In total, we produced 105 887 simulated observations for the ABC Database.

Figure 3. Example of a simulated Ariel observation with error bars (data points) for a randomized chemistry. The best-fitting model obtained using atmospheric retrieval is also shown (solid line). The slope at the lowest wavelengths arises from Rayleigh Scattering, while most of the other spectral modulations in this example can be attributed to CH4. The data points around 4.5 μm are associated with CO and CO2 absorption. Note the difference in wavelength coverage (0.5–7.8 μm) as compared to the HST spectrum (1.1–1.7 μm) in Fig. A2, which allows us to extract precise information for many molecules.
2.3 The atmospheric retrieval setup
For 26 109 (25 per cent) of the simulated observations generated in the previous step, we performed the traditional inversion technique using Alfnoor.
For the retrieval model, we kept the same setup as presented in the previous section and performed the parameter search on the following free parameters: the isothermal temperature (T) and the log-abundances of H2O, CO2, CH4, CO, and NH3. The priors are wide and uninformative, with the atmospheric temperature fitted between 100 and 5500 K and the chemical abundances between 10−12 and 10−1 in volume mixing ratio. The widely used Nested Sampling optimizer MultiNest (Feroz, Hobson & Bridges 2009) was employed with 200 live points and an evidence tolerance of 0.5.
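For orientation, such a retrieval can be set up through TauREx 3's optimizer interface. The sketch below uses the pure-Python Nestle backend as a stand-in for MultiNest; the observation file is hypothetical, and the gas parameter names assume the ConstantGas convention of exposing one fitting parameter per molecule.

```python
from taurex.spectrum.observed import ObservedSpectrum
from taurex.optimizer.nestle import NestleOptimizer

# `tm` is the TransmissionModel assembled in the Section 2.2 sketch.
obs = ObservedSpectrum("simulated_ariel_observation.dat")  # hypothetical file

opt = NestleOptimizer(num_live_points=200, tol=0.5)
opt.set_model(tm)
opt.set_observed(obs)

# Free parameters with wide, uninformative priors (Section 2.3).
opt.enable_fit("T")
opt.set_boundary("T", [100.0, 5500.0])
for mol in ["H2O", "CO2", "CH4", "CO", "NH3"]:
    opt.enable_fit(mol)
    opt.set_boundary(mol, [1e-12, 1e-1])

solution = opt.fit()
```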
For a single example on Ariel data, we provide the best-fitting spectrum in Fig. 3. From the optimization process, we are able to extract the traces of each parameter and the weights of the corresponding models. This allows us to construct the posterior distribution of the free parameters with, for instance, the corner library. The posterior distribution of the same example is shown in Appendix C, Fig. C1. Processing of the posterior distribution also allows us to extract statistical indicators describing the chemical properties of the planet, such as the mean, median, and quantiles for each of the investigated parameters.
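Given the stored traces and weights, the posterior figure and the summary statistics can be reproduced with the corner library; the array names and the parameter ordering below are assumptions.

```python
import corner

# `trace` is an (n_samples, 6) array of Nested Sampling samples and
# `weights` the corresponding sample weights from the data base.
labels = ["T", "log H2O", "log CO2", "log CH4", "log CO", "log NH3"]
fig = corner.corner(trace, weights=weights, labels=labels,
                    quantiles=[0.16, 0.5, 0.84], show_titles=True)
fig.savefig("posterior_corner.png")

# Weighted quantiles for a single parameter (here the temperature).
t16, t50, t84 = corner.quantile(trace[:, 0], [0.16, 0.5, 0.84], weights=weights)
```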
2.4 Data overview
Following the data generation process outlined above, we have generated a total of 105 887 forward models at Ariel Tier-2 resolution. Of them, 25 per cent are complemented with results from atmospheric retrievals (following the generic setup described in Section 2.3).
Fig. 4 shows the distribution of the mean transit depth (red) overlapped with the distribution of the feature height (orange). The former serves as a proxy for the diverse planetary classes present in the data set; its characteristic dichotomy stems from current demographic studies2 and from the selection bias of our observation technique.3 The latter is calculated as the difference between the maximum and minimum transit depth of each spectrum; it serves as a proxy for the ‘strength’ of the spectroscopic features present in the spectra, e.g. the peaks and troughs seen in Figs A2 and 3. We note that an SED with a linear slope will also produce a non-negligible feature height, which is still considered a spectroscopic feature in our case. The two quantities are closely linked to our targets of interest, which means that any successful model must not only account for the inter-variation between different spectra, but also for the intra-variation across wavelength channels, which is always one to three orders of magnitude smaller than the variation in mean transit depth.

Figure 4. Distribution of the mean transit depth (red) overlapped with the distribution of the feature height (orange), both measured on a logarithmic scale. The dichotomy displayed in the mean transit depth distribution stems from the observational demographics of planet radius, showing the diversity of currently known exoplanets in our data set. The feature height, on the other hand, documents the ‘strength’ of spectroscopic features in each spectrum (such as absorption features or strong trends induced by Rayleigh Scattering). Any successful model must be able to account for the variations on both scales.
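Both summary quantities of Fig. 4 are simple reductions over the spectra; below is a sketch assuming an (n_planets, n_channels) array of transit depths.

```python
import numpy as np

def spectrum_summaries(spectra):
    """Log mean transit depth and log feature height per spectrum.

    spectra: array of shape (n_planets, n_channels) holding transit
    depths on the Ariel Tier-2 wavelength grid.
    """
    mean_depth = spectra.mean(axis=1)
    feature_height = spectra.max(axis=1) - spectra.min(axis=1)
    return np.log10(mean_depth), np.log10(feature_height)
```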
Next, we look at the results from the atmospheric retrievals. The quality of the retrieved product is closely related to the information content of the individual spectrum, which is a function of the wavelength coverage, the size of the spectral bins, the observational uncertainties, and the abundance of the molecule. Fig. 5 compares the retrieval results against the input values for the six targets of interest (H2O, CO2, CH4, CO, NH3, temperature). Each data point in every subplot represents a single spectrum and is coloured according to the size of the inter-quartile range (IQR).4 Points lying along the diagonal line – those that are retrieved correctly – tend to have tighter constraints, while points that deviate from the diagonal line tend to carry larger uncertainties. For most gases there is a transition region where molecules below a certain abundance level start to depart from the diagonal line. The extent and onset of the transition region is a function of the instrument specification (e.g. its detection limits), the composition of the atmosphere, and the strength of the molecular absorption. Changeat et al. (2020a) pioneered an initial study of this transition region and derived the detection limit for each gas based on the size of the error bars obtained. Here, we find similar results: the detection limits of Ariel correspond to the region where the retrieved values in Fig. 5 deviate from the diagonal line (associated with colours from green to red).

Figure 5. Comparison of the retrieved values against the input values for six different targets. Each data point represents a single instance and is colour-coded according to the respective size of the IQR. Ariel data at Tier-2 resolution are able to place tight constraints on the temperature and on most molecules down to a certain abundance. Beyond that, the retrieved values start to deviate from the diagonal and become less constrained, highlighting the limitations of the telescope.
Appendix D continues our discussion into other aspects of the data product.
2.5 Structure of the ABC data base
The data base contains two levels of data products: the first level is for general use, and the second level is designed specifically for the competition. We describe each level next.
2.5.1 Level 1: cleaned data
Level 1 contains data products for general use. As TauREx 3 performs forward modelling and retrieval on a planet-by-planet basis, the data are pre-processed to provide a unified structure for effective data navigation and a foundation for further processing. Below is the list of operations we performed:
Removed any spectra with NaN values.
Removed spectra with transit depth larger than 0.1 in any wavelength bin.
Removed spectra with transit depth smaller than 1 × 10−8 in any wavelength bin.
Standardized units and data formats.
Extracted all Stellar, Planetary and Instrumental metadata.
Combined all instances into a single, unified file.
Level 1 data are organized into all_data.csv, observations.hdf5, and all_target.hdf5. all_data.csv contains information on the planetary systems and the input values for the generation process; observations.hdf5 contains information on the individual observations; and all_target.hdf5 contains the corresponding retrieval results (posterior distributions of each atmospheric target). In total, there are 105 887 planet instances, 25 per cent of which (26 109) have complementary retrievals from Nested Sampling.
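A minimal loading sketch with pandas and h5py is shown below; the per-instance group layout is an assumption, and the Jupyter Notebook bundled with the data base documents the exact structure.

```python
import pandas as pd
import h5py

# Planetary-system metadata and generation inputs.
meta = pd.read_csv("all_data.csv")
print(meta.shape)

# Simulated observations: inspect the first instance's datasets.
with h5py.File("observations.hdf5", "r") as obs_file:
    first = next(iter(obs_file.keys()))
    print(first, list(obs_file[first].keys()))

# Retrieval products (posterior traces and weights) for 25 per cent of instances.
with h5py.File("all_target.hdf5", "r") as target_file:
    print(len(target_file), "instances with retrieval products")
```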
2.5.2 Level 2: curated data for model training
This level is designed for statistical model training. In order to allow for the broadest possible participation and to minimize the overhead for non-field experts, we pre-processed the data set with our domain knowledge so that the end product is ready for model development. At the same time, we tailored the train/test split procedure to allow for a diverse array of solutions and research directions. Here we outline the list of operations we performed:
Removed data with fewer than 1500 points in the tracedata. This is to allow for more accurate comparisons.
Removed uninformative and duplicated astrophysical and instrumental features.5
Split data into training and test sets (more details in Appendix E).
After performing the above operations, the training data contain 91 392 planet instances, 21 988 of which have complementary retrieval results. The test data contain 2997 instances, all of which are complemented with retrieval results. There is a notable difference in data volume between Level 1 and Level 2 due to the pre-processing step and the train/test split. We have devoted Appendix E to describing the Level 2 data in detail.
2.6 Additional resources
Published along with the data base, we provide a series of complementary resources. In particular, the data base comes with a Jupyter Notebook describing the data structure, showing how to load the data set, and demonstrating its main characteristics. We also include a dedicated TauREx 3 tutorial for those eager to learn the practical aspects of building forward models and performing atmospheric retrievals. All those resources are available under the same link as the data base.
3 OPEN CHALLENGES
With the constructed data set, we intend to accelerate and incentivize dedicated efforts to tackle a number of open challenges common to both the exoplanet and ML fields.
3.1 Fast and accurate Bayesian Inference
One of the aims of the data base is to enable the development of advanced inference methods that are (1) able to produce posterior distributions while (2) not requiring as much computational resource as conventional sampling-based methods. This activity was proposed as part of the NeurIPS 2022 competition with simplified atmospheric cases and has already proven very successful (Yip et al., in preparation).
3.2 Estimating and mitigating the effect of data shifts
ML models are prone to performance degradation when the incoming data differ from the training distribution. This phenomenon is commonly known as data shift (e.g. Lu et al. 2018; Bayram, Ahmed & Kassler 2022).
Any ML application to the study of exoplanetary atmospheres is likely to experience data shifts. Most ML models in the literature are currently limited to simulation-based inference, as the number of actual spectroscopic observations falls short of what is needed for model training and has to be supplemented by simulations. The discrepancy between our simplistic atmospheric models and actual atmospheres means that data shift is inevitable (Humphrey et al. 2022).
To emulate this situation, the test set in the Level 2 data is specifically designed to include chemical-equilibrium forward models for which the provided ground truth from atmospheric retrievals assumed free chemistry. In some cases, clouds are included in the forward model to force degenerate behaviours in the test set (Line & Parmentier 2016; Pinhas & Madhusudhan 2017; Mai & Line 2019; Barstow 2020; Mukherjee, Batalha & Marley 2021; Changeat et al. 2021b). Those offsets between training and test sets were deliberately introduced to evaluate whether the performance of ML solutions remains robust and consistent under ‘unseen’ distributions (this is typically the case in real life, since we know little about real exo-atmospheres) and whether they have correctly learned to faithfully reproduce the Bayesian retrieval technique.
3.3 Adaptation to other atmospheric assumptions
Atmospheric models are physical models built on varying levels of complexity and modelling assumptions. ML models, however, are trained to optimize their performance with respect to the provided training set and its assumptions. In this data set, we have included forward models built from two different modelling assumptions: simple chemistry and equilibrium chemistry. It remains an open question how easily one can ‘switch’ from one model assumption to another. In ML terminology, this kind of learning falls under the umbrella of transfer learning/domain adaptation, where one strives to adapt from a source domain (the original training set) to a target domain with a limited number of training examples (Wilson & Cook 2020).
3.4 Benchmarking different retrieval techniques
The constructed data set can also be used for more traditional code comparisons. The TauREx retrieval code was rigorously benchmarked against other established codes (Barstow et al. 2020, 2022). With this data set, the exoplanet community now has access to a wide range of well-referenced forward models and retrieval runs that can be used for standard benchmarking of atmospheric models (e.g. forward models) and of a diverse array of retrieval techniques (e.g. MCMC, Nested Sampling, Normalizing Flows: Foreman-Mackey et al. 2013; Feroz et al. 2009; Buchner 2021; Yip et al. 2022a).
4 FUTURE EXPANSION OF ABC DATABASE
The data base currently builds on highly simplified atmospheric model assumptions (constant or equilibrium chemistry, isothermal temperature, clear atmosphere). This is done to (1) gauge the success of such initiatives and (2) provide a rich data set to complete the required task.
Future iterations could explore more complex atmospheres with a much more limited number of training examples. This is because, as more complexity is embedded into the model (e.g. GCMs, complex chemistry, stellar activity effects), the computation of a single sample can take months. In such instances, traditional parameter sampling is not an option, and faster, AI-accelerated techniques will be required. We therefore plan to further extend this data base over the coming years and to provide new training/test sets to develop both exoplanet and ML activities. For example, future instances of this data base could feature the following:
JWST and HST complementary data sets: This would allow us to develop telescope-independent ML techniques and to evaluate the information content of the different data sets.
Other classes of exoplanets: The current set focuses on gaseous exoplanets. Future data releases could include small rocky exoplanets with secondary atmospheres, or water worlds.
More complex processes: Alternative chemical models (with more complete sets of species and disequilibrium processes: Stock et al. 2018; Woitke et al. 2018; Venot et al. 2020; Al-Refaie et al. 2022b) could be provided to study retrieval biases and develop chemistry-robust ML methods. Similarly, complementary sets could include stellar activity, for which the relevance of AI methods has already been shown (Nikolaou et al. 2020), or even complex cloud models (Ackerman & Marley 2001; Kawashima & Ikoma 2018; Gao et al. 2020; Ma et al. 2022).
More complex models: Eclipse observations or phase-curve observations produced using Global Climate Models could be included. This would allow us to extend the data base to new types of observations, to study three-dimensional effects (Cho et al. 2003; Rauscher et al. 2008; Dobbs-Dixon, Cumming & Lin 2010; Showman, Cho & Menou 2010; Cho, Polichtchouk & Thrastarson 2015; Caldas et al. 2019; Komacek & Showman 2020; Skinner & Cho 2022), and to develop fast recovery techniques for phase-curve data. Current approaches to retrieving phase-curve data are limited by computational resources (Feng, Line & Fortney 2020; Irwin et al. 2020; Cubillos et al. 2021; Changeat et al. 2021a; Changeat 2022; Chubb & Min 2022) and can require up to 10 million samples (e.g. weeks on HPC facilities) to fully explore the parameter space of solutions with Hubble data (Changeat et al. 2021a).
5 CONCLUSIONS
We present here the publicly available ABC Database (https://doi.org/10.5281/zenodo.6770103), a data base of atmospheric forward and inverse models dedicated to the development of ML approaches in the field of exoplanets. In this paper, we introduced, for a non-expert community, the basic physical and chemical processes involved in the creation of such a data base, describing the utilized tools6 and clearly stating the adopted hypotheses. The constructed set includes 105 887 forward models and 26 109 atmospheric retrievals from conventional sampling techniques, and should serve as a community asset for exploring novel techniques to solve the inverse problem of retrieving chemical compositions from spectroscopic data. This data base was used to support the third instalment of the Ariel Data Challenge, conducted as part of the NeurIPS Conference,7 which led to new innovative ML-based solutions to infer posterior distributions from Ariel spectra. With this effort, and with future updates of this permanent data base, we hope to facilitate the development and adoption of ML solutions to a pressing issue for the next generation of space telescopes.
ACKNOWLEDGEMENTS
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 758892, ExoAI), the Science and Technology Facilities Council grants ST/S002634/1 and ST/T001836/1, and the UK Space Agency grant ST/W00254X/1. QC is funded by the European Space Agency under the 2022 ESA Research Fellowship Program. The authors thank Ingo P. Waldmann, Giovanna Tinetti, and Ahmed F. Al-Refaie for their useful recommendations and discussions. The authors also wish to thank the two anonymous referees for their useful comments.
This work utilized resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk).
DATA AVAILABILITY
The data underlying this paper are available as a Zenodo Digital Repository, at https://doi.org/10.5281/zenodo.6770103.
Footnotes
1. Spectra first have to be computed at high resolution before being convolved down to the instrument resolution.
2. Recent studies show that super-Earth-sized planets are prevalent, while there is a deficiency in the population of sub-Neptunes.
3. The transit technique tends to favour larger planets.
4. Here we define the IQR as the difference between the 84th and the 16th percentiles.
5. Including star_magnitudeK, star_metallicity, star_type, planet_type, star_type, star_mass_kg, star_radius_m, planet_albedo, planet_impact_param, planet_mass_kg, planet_radius_m, planet_transit_time, instrument_nobs.
6. The main simulation code, TauREx 3, is open-source and publicly available at: https://github.com/ucl-exoplanets/TauREx3_public.
8. For example, a star’s brightness can vary from time to time.
9. We cannot observe any non-opaque (transparent) part of the atmosphere.
10. Trace molecules like H2O account for only a very tiny portion of the atmospheric composition; the rest is filled by gases like hydrogen and helium. This kind of atmosphere is known as a Primary Atmosphere (e.g. Jupiter has a Primary Atmosphere), as opposed to a Secondary Atmosphere, which is principally made of heavier elements (e.g. Earth has a Secondary Atmosphere).
11. Parts per million, 10−6.
REFERENCES
APPENDIX A: INTRODUCTION TO ATMOSPHERIC STUDIES OF EXOPLANETS
This section provides a summary of the domain knowledge required to properly exploit the ABC Database. It is written as an introduction for a non-exoplanet audience.
A1 Observations of transiting exoplanets
Exoplanets are detected using various methods, but the two most popular techniques used today are radial velocity and transit. In particular, transit is an indirect technique that relies on monitoring the host star’s variations in brightness. A transit event occurs when the planet passes in front of the star, blocking a fraction of the light received here on Earth. Transit events can be observed, revealing the presence of the planet and some of its important properties, such as its radius. A typical transit observation is described, along with the relevant quantities, in Fig. A1. Transit events are periodic, so they can easily be disentangled from other astrophysical sources of noise (stellar variations,8 instrument systematics, and observing conditions) when long-term monitoring is employed.

Figure A1. Diagram of an observation of the transiting exoplanet KELT-11 b (top panel) and the corresponding normalized flux from a real observation, also called a light curve (bottom panel). The phase, which labels the x-axis, is the position of the planet in its orbit, with 0 (by convention) being the middle of the transit (tmid). The transit starts at the event t1 and finishes at the event t4, spanning the transit duration t14. The transit depth (Δ) is the difference in observed normalized flux between the in- and out-of-transit situations. The observation is adapted from Changeat et al. (2020b).
For most observatories, absolute measurements are challenging. This is especially true when the required precision is high, as is the case for exoplanets. As such, for exoplanets, we prefer to rely on differential quantities such as the transit depth (Δ). The transit depth is the normalized difference between the flux received from the star when the planet is out-of-transit (Fout) and when the planet is in-transit (Fin):

$$\Delta = \frac{F_{\rm out} - F_{\rm in}}{F_{\rm out}} = \left(\frac{R_p}{R_s}\right)^2, \tag{A1}$$

where Rp is the radius of the planet and Rs is the radius of the star.
To first order, to account for the contribution of an atmosphere, one can simply replace the planetary radius Rp by Rp + h, where h is the effective size of the atmosphere. Neglecting second-order terms, this gives

$$\Delta \simeq \left(\frac{R_p}{R_s}\right)^2 + \frac{2 R_p h}{R_s^2}. \tag{A2}$$
Now, crudely, the size of the atmosphere depends on the atmospheric scaleheight H such that h = NH, where N is a scaling factor encoding information regarding the atmospheric composition. The scaleheight is defined as

$$H = \frac{k_b T}{\mu g}, \tag{A3}$$

where kb is the Boltzmann constant, T is the temperature, μ is the mean molecular weight, and g is the gravity.
From those simple expressions, which serve here an illustrative purpose and are an oversimplification of the model used to build the ABC Database, we can deduce some standard behaviours of atmospheric properties. First, to extract information on the planet and its atmosphere, we will always require some knowledge of the host star. This is because the planet is not observed directly and the observed quantity (Δ) is a function of the stellar parameters (here, the stellar radius Rs). In addition, we observe that for more massive planets (larger g) the contribution of the atmosphere is diminished, as the atmosphere contracts under gravity. In contrast, if the temperature increases, the atmosphere inflates and the atmospheric signal becomes larger. The chemistry of the atmosphere plays a part in the scaling factor N, but its relation cannot be easily deduced here. Intuitively, molecules with larger abundances tend to make the atmosphere opaque at higher altitudes, therefore increasing the apparent size of the atmosphere.9
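To make the orders of magnitude concrete, the short sketch below evaluates the scaleheight of equation (A3) and the atmospheric term of equation (A2) for an illustrative hot Jupiter; all input values are placeholders chosen for illustration only.

```python
K_B = 1.380649e-23   # Boltzmann constant [J/K]
AMU = 1.660539e-27   # atomic mass unit [kg]

def scaleheight(T, mu_amu, g):
    """Atmospheric scaleheight H = k_b * T / (mu * g), in metres."""
    return K_B * T / (mu_amu * AMU * g)

# Illustrative hot Jupiter: T = 1200 K, H2/He atmosphere (mu ~ 2.3 amu),
# g ~ 25 m/s^2, Jupiter-sized planet around a Sun-like star.
H = scaleheight(1200.0, 2.3, 25.0)   # ~1.7e5 m
R_p, R_s = 7.0e7, 7.0e8              # planet and star radii [m]
h = 5 * H                            # assume N = 5 scaleheights
atm_signal = 2 * R_p * h / R_s**2    # second term of equation (A2)
print(f"H ~ {H:.2e} m, atmospheric signal ~ {atm_signal * 1e6:.0f} ppm")
```

With these numbers the atmospheric signal is a few hundred ppm, which illustrates why instrument uncertainties at the tens-of-ppm level matter.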
Those concepts, while useful to acquire an intuitive understanding of the behaviour of planetary atmospheres, are rather limiting and proper modelling is required to correctly interpret exo-atmospheric observations.
A2 Modelling exoplanet atmospheres
Observing exoplanetary transits at various wavelengths, meaning obtaining Δ as a function of λ, provides information about the atmospheric properties. This is because a planetary atmosphere contributes to the transit depth by absorbing the incoming stellar light (slightly) differently at different wavelengths (e.g. the atmospheric contribution is wavelength-dependent). The absorption profile of the atmosphere depends on its constituents (molecular species, clouds, hazes) and properties (thermal structure). To model the observed signal as a function of wavelengths, also called a spectrum, astronomers use simplified models of the relevant processes occurring in exoplanet atmospheres. In Appendix B, we describe the mathematical formulation of one such model for the transit geometry, commonly used as a parametrized 1D forward model. Put simply, the light from the host star is propagated through an atmosphere layer by layer and impacted according to the absorption of the atmosphere. In our case, the absorptions considered are absorptions by molecular species, Rayleigh Scattering, and Collision Induced Absorption (CIA).
Through this process, from a parametrized one-dimensional description of an atmosphere controlled by a finite number of parameters, one can compute the theoretical spectrum of an exoplanet. This process, called forward modelling, can be made relatively fast (on the order of seconds), but due to the non-linearity of the equations and the input spectroscopic data (cross-sections), it cannot be directly inverted. In the next section, we explain how traditional techniques (e.g. Bayesian sampling) are used to perform model fitting and retrieve the properties of an exo-atmosphere from its observed spectrum.
Before explaining the use of inversion techniques, or atmospheric retrievals, applied in the context of exoplanet atmospheres, we wish to present a series of simple models to illustrate further the sensitivity analysis made in the previous section. We have created a mock planet with a non-negligible atmosphere, and we will show how changing the values of some of the model parameters affects the observation (e.g. the spectrum).
For simplicity, we set the planet with an isothermal atmosphere, meaning the temperature of the atmosphere is constant with altitude (e.g. constant at all pressure levels) and can therefore be defined by a single parameter (T). To this atmosphere, we add a single trace molecule (H2O) defined by its absolute abundance in volume (volume mixing ratio), and we fill the remaining atmosphere with hydrogen and helium in the standard solar ratio (He/H2 = 0.17).10 On top of the molecular absorption from water vapour, we also consider three additional absorption processes: CIA, Rayleigh Scattering, and Grey Clouds (the latter not considered in this version of the ABC Database). Equipped with this model, we set up the following cases, for which the spectra are shown in Fig. A2:
Case 1 (black): planetary radius Rp = 1.0 RJ, temperature T = 1200 K, mixing ratio of H2O = 10−3, and no clouds.
Case 2 (blue): same as Case 1 but the temperature is decreased to T = 500 K.
Case 3 (purple): same as Case 1 but the water content is decreased to H2O = 10−5, while the planetary radius is increased to Rp = 1.0085 RJ.
Case 4 (red): same as Case 1 but with clouds (cloud top pressure is set at 0.01 bar).
Case 5 (green): same as Case 2 but with an increased planetary radius to Rp = 1.013RJ.

Figure A2. Spectra illustrating the sensitivity of atmospheric models to input parameters. In black: Model 1; in blue: Model 2; in purple: Model 3; in red: Model 4; and in green: Model 5. We also show, as a dashed grey line, a model similar to Model 1 but without absorption by water, leaving only the continuum contributions of Rayleigh Scattering (short wavelengths) and CIA (long wavelengths). The red and yellow points represent a simulated observation with the Hubble Space Telescope (HST) at 30 parts per million (ppm), highlighting the difficulty of constraining atmospheric properties from current data.
From those specifically designed cases, one can compare Cases 1 and 2, for which only the temperature is changed. As a consequence of this change, the size of the atmosphere is decreased, as explained in Section A1, and the atmospheric features are smaller, bringing the whole spectrum down. In this case, distinguishing between Case 1 and Case 2 would be relatively easy. For Cases 3, 4, and 5, however, the story is a little more complicated, as multiple parameters are changed; those cases can be used to highlight degeneracies typically encountered in the interpretation of exoplanet spectra, therefore justifying the need for more sophisticated atmospheric retrieval techniques.
For those cases, the spectral features are reduced compared to Case 1, but the spectra appear much closer to each other in the 1–2 μm wavelength range. This is because Case 3 has less water than Case 1, which we expect to decrease the spectral features; thanks to the slightly larger radius, however, the spectrum is brought back to a similar level. Case 4 has opaque clouds, which ‘cut’ the spectral features above a certain pressure level, making it look exactly like Case 3 in the 1–2 μm range. Finally, Case 5 has a lower temperature (500 K) and is brought back to the same level by an increase in radius. With current telescopes, such as HST, the wavelength coverage is relatively small. One typical instrument on board HST is the Wide Field Camera 3 with its G141 grism, which covers the wavelengths from 1.1 to 1.7 μm and reaches errors of the order of 30 ppm.11 Highlighting a typical observation with HST in the same figure, we show how difficult it would be to distinguish between Cases 3, 4, and 5. This highlights the need for next-generation space telescopes such as Ariel to constrain atmospheric properties.
A3 Solving the inverse problem for exo-atmospheres
The study of exo-atmospheres relies on spectroscopic observations to infer fundamental atmospheric properties that cannot be directly observed. This kind of problem is broadly described as an inverse problem (Potthast 2006), where one tries to uncover the cause (atmospheric properties) from the effect (observations). However, more often than not, the full effect is seldom observed; instead, observers receive a corrupted form of the effect, the observation. In terms of exoplanetary spectra, there are several sources of corruption, such as the presence of noise and the limited spectroscopic coverage. The loss of information often means that the inverse mapping is ill-defined: distinct causes may produce effects that are indistinguishable at the level of the observation (see Fig. A3).

Figure A3. Schematic of a typical inverse problem setup. The forward process produces an effect (full spectrum, S) from a hidden cause (e.g. atmospheric parameters, Θ). However, the full effect is often unavailable to the observer due to the loss of information (such as instrument systematics, limitations in spectroscopic coverage, etc.). Instead, observers can only receive the partial effect (otherwise known as the observation, D). The aim of the inverse problem is to recover the hidden cause that produced them in the first place.
Our goal is to estimate the set of parameters Θ that best explains the observed spectrum D under a given atmospheric forward model. In a Bayesian framework, this amounts to finding the conditional distribution of the model parameters given the observation, also known as the posterior distribution P(Θ|D), which follows from Bayes' theorem:

$$P(\Theta|D) = \frac{P(D|\Theta)\, P(\Theta)}{P(D)}, \tag{A4}$$

where P(D|Θ) is the likelihood function, P(Θ) is the prior, and P(D) is the Bayesian Evidence.
The (log-) Gaussian likelihood function is commonly used to compare the observation D with the output from the forward model:

$$\log P(D|\Theta) = -\frac{1}{2} \sum_{\lambda} \left[ \frac{\left(D_\lambda - S_\lambda\right)^2}{\sigma_\lambda^2} + \log\left(2\pi \sigma_\lambda^2\right) \right],$$

where S is a simulated spectrum generated using the forward model with parameters Θ, and σ is the observational uncertainty.
The relation between the observed spectrum D and the simulated spectrum S is

$$D \approx S + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2),$$

where the approximation sign reflects the fact that the model remains an approximation of the real phenomena.
As for the prior function P(Θ), it represents our prior belief about the distribution of the random variables. With limited knowledge of exo-atmospheres, the community typically opts for an uninformative prior (also known as a uniform prior).
Unfortunately, in most cases, equation (A4) cannot be computed analytically. The main reason lies with the Bayesian Evidence, P(D) = ∫P(D, Θ)dΘ: the integral demands an evaluation of the probability for every possible parameter combination, which makes the quantity intractable for any meaningful case.
A common strategy is to sample the parameter space and to use the distribution of the samples to compute the maximum likelihood estimate (MLE) and the Bayesian Evidence. There are many sampling strategies available, including grid sampling, optimal estimation, Markov chain Monte Carlo (MCMC), and Nested Sampling, amongst others. Those are, however, computationally intensive and require the evaluation of millions of forward models.
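To illustrate the principle at toy scale, the sketch below implements a bare-bones Metropolis sampler for the Gaussian likelihood defined above with uniform priors; real retrievals rely on far more efficient schemes such as Nested Sampling, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_likelihood(theta, D, sigma, forward):
    """Gaussian log-likelihood comparing the data D with a model spectrum."""
    S = forward(theta)
    return -0.5 * np.sum(((D - S) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

def metropolis(D, sigma, forward, theta0, bounds, n_steps=20000, step=0.05):
    """Toy Metropolis sampler; `bounds` is an (n_params, 2) array of uniform priors."""
    theta = np.asarray(theta0, dtype=float)
    logl = log_likelihood(theta, D, sigma, forward)
    chain = np.empty((n_steps, theta.size))
    for i in range(n_steps):
        proposal = theta + step * rng.standard_normal(theta.size)
        # Reject proposals outside the uniform prior support.
        if np.all((proposal >= bounds[:, 0]) & (proposal <= bounds[:, 1])):
            logl_prop = log_likelihood(proposal, D, sigma, forward)
            if np.log(rng.uniform()) < logl_prop - logl:
                theta, logl = proposal, logl_prop
        chain[i] = theta
    return chain
```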
There have been efforts from the ML community to develop scalable sampling algorithms. Stochastic Gradient MCMC (SG-MCMC) is a popular class of algorithms that utilizes data sub-sampling techniques to reduce computational time to construct the chain (Welling & Teh 2011; Ma et al. 2015; Baker et al. 2019; Nemeth & Fearnhead 2019). Stochastic Gradient Descent (SGD)’s link to approximate Bayesian Inference has prompted further investigation into its statistical properties (Chen et al. 2016; Mandt, Hoffman & Blei 2017; Xing et al. 2018); it has since been shown that SGD with constant step size (Constant-SGD) can approximate Bayesian Posterior Distribution. Other algorithms, such as Hamiltonian Monte Carlo (HMC), incorporated information on the gradient within the proposal to improve the sampling efficiency (Neal 2011; Homan & Gelman 2014). Chen, Fox & Guestrin (2014) introduced SG-HMC, a fusion between SG-MCMC and HMC, to provide further speed up to the algorithm.
Other approaches focus on architectural design or post-processing techniques to incorporate elements of Bayesian Inference, such as Dropout (Gal & Ghahramani 2015), Neural Network Ensembles (Lakshminarayanan, Pritzel & Blundell 2017; Cobb et al. 2019; Pearce et al. 2020), SWA-Gaussian (SWAG; Maddox et al. 2019), KF-Laplace (Ritter, Botev & Barber 2018), and temperature scaling (Guo et al. 2017).
The availability of many state-of-the-art algorithms prompts the need to benchmark their performance under different data sets and scenarios (Yao et al. 2019; Izmailov et al. 2021). Aligned with this objective, the aim of this data base and of the machine learning (ML) challenge is to leverage recent developments in scalable Bayesian Inference and to identify potential solutions going forward.
APPENDIX B: ATMOSPHERIC TRANSMISSION MODEL IN TauREx
In this section, we describe the simplified transit (forward) model used in the code TauREx 3. The atmosphere of the planet is separated into NL homogeneous layers following a one-dimensional plane-parallel geometry (see Fig. B1). The light rays from the host star are propagated through the atmospheric layers, being impacted by extinction processes (absorption and scattering) at the different wavelengths (λ).

Figure B1. Illustration of the transmission of stellar radiation (left-hand side) through an exoplanet atmosphere (transit) towards an observer (right-hand side). R0 is the reference radius at which the atmosphere becomes fully opaque. A light ray at altitude z propagates along the line-of-sight x. The atmosphere is separated into NL layers of size Δz, which are labelled by the index l = j + k, where j refers to the z-component and k to the x-component. The discretized altitude zl corresponds to the altitude at the lower boundary of layer l.
The normalized differential flux (Δλ), or transit depth, at wavelength λ reaching the observer is simply the ratio of the surface area of the planet to that of the host star, which can be further simplified to the planet-to-star radius ratio squared:

$$\Delta_\lambda = \frac{F_{{\rm out},\lambda} - F_{{\rm in},\lambda}}{F_{{\rm out},\lambda}} = \left(\frac{R_p(\lambda)}{R_s}\right)^2, \tag{B1}$$

where Fout, λ is the total flux received from the system out-of-transit, Fin, λ is the total flux received in-transit, Rp(λ) is the wavelength-dependent radius that includes the atmospheric contribution, and Rs is the stellar radius. In our case, the atmospheric contribution consists in the absorption of the star light by the atmosphere (e.g. we do not include scattering processes), which follows the Beer–Lambert law.
The wavelength-dependent contribution of the atmosphere starts at the surface labelled R0. Note that for gaseous planets (e.g. without a solid surface), R0 is a reference radius at which we consider the atmosphere to be fully opaque at all wavelengths. We obtain

$$\pi R_p(\lambda)^2 = \underbrace{\pi R_0^2}_{C_{\rm sur}} + \underbrace{\int_{R_0}^{\infty} 2\pi r \left(1 - e^{-\tau(r, \lambda)}\right) \mathrm{d}r}_{C_{\rm atm}}, \tag{B2}$$

where Csur is the contribution from the planet surface, Catm is the contribution from the atmosphere, and r is the radial coordinate. In most cases, the former term is assumed to be completely opaque and can therefore be simply evaluated as the surface area of the planet at radius R0; the latter term involves the computation of the optical depth of the atmosphere at each layer, which summarizes the contributions from the various processes happening within the atmosphere.
The optical depth τ(r, λ) is computed along the line of sight as follows:

$$\tau(r, \lambda) = \sum_{i=1}^{N_G} \int_{0}^{x_f} \chi_i(x)\, \rho(x)\, \sigma_i(\lambda, x)\, \mathrm{d}x. \tag{B3}$$

Here, χi is the mixing ratio (or abundance) of the ith species, ρ is the number density, and σi is the absorption cross-section of the ith species. The number of gases is noted NG. The variable xf is the maximum distance considered for the numerical integration.
Considering the one-dimensional geometry, the integration of τ along the x-axis can be decomposed into unit elements τ(j, k), where j represents the indices along the z-axis and k the indices along the x-axis. Physical quantities (e.g. the altitude z, the mixing ratio χ) defined at a layer l can then be related to the j, k indices using l = j + k, noting that k can only span the values from j to NL. These are indexed with an additional subscript; for instance, χi, l is the mixing ratio of the ith species at layer l.
It follows that the unit path integral, labelled Δx(j, k) and identified by the red element in Fig. B1, can be expressed as

$$\Delta x(j, k) = \sqrt{\left(R_0 + z_l + \Delta z_l\right)^2 - \left(R_0 + z_j\right)^2} - \sqrt{\left(R_0 + z_l\right)^2 - \left(R_0 + z_j\right)^2}, \quad l = j + k, \tag{B4}$$

where zl is the altitude at layer l and Δzl is the change in altitude across layer l.
Since the layers are equally spaced in log-pressure, we also have

$$\Delta z_l = H_l \ln\!\left(\frac{P_l}{P_{l+1}}\right), \tag{B5}$$

where Hl is the scaleheight at layer l and Pl is the pressure at layer l.
Expressing the optical layer element as

$$\Delta \tau(j, k) = \sum_{i=1}^{N_G} \chi_{i,l}\, \rho_l\, \sigma_i(\lambda)\, \Delta x(j, k), \tag{B6}$$

one obtains the final contribution for the atmosphere as

$$C_{\rm atm} = \sum_{j=0}^{N_L} 2\pi \left(R_0 + z_j\right) \left(1 - e^{-2 \sum_{k=j}^{N_L} \Delta \tau(j, k)}\right) \Delta z_j, \tag{B7}$$
and the transit depth as a function of wavelengths or the transmission spectrum can be computed following equation (B1). In this investigation, we produced a grid of transmission spectra through a randomized and uniform grid of free parameters.
The absorbing properties of the different molecules (H2O, CO, CO2, CH4, and NH3) and processes (Rayleigh Scattering, CIA) are encoded in the cross-sections (σi in equation B3). Cross-sections are temperature-, pressure-, and wavelength-dependent, and have a highly non-linear behaviour. Since the computation of cross-sections is computationally intensive and complex, most codes, including the one used here, pre-compute them in tabulated files that are then interpolated to obtain the absorbing profile of the relevant molecules and processes at a given temperature, pressure, and wavelength.
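The interpolation step can be pictured with the sketch below; the grid axes, sizes, and the random stand-in table are placeholders for the real tabulated opacities.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Hypothetical tabulated cross-sections for one molecule:
# axes are temperature [K], log10 pressure [Pa], and wavenumber [1/cm].
T_grid = np.linspace(300.0, 3000.0, 28)
logP_grid = np.linspace(-1.0, 6.0, 22)
wn_grid = np.linspace(500.0, 20000.0, 5000)
xsec_table = np.random.rand(T_grid.size, logP_grid.size, wn_grid.size)  # stand-in

interpolator = RegularGridInterpolator((T_grid, logP_grid, wn_grid), xsec_table)

def cross_section(T, P, wn):
    """sigma_i(T, P, wn) interpolated from the table; inputs clipped to range."""
    T = np.clip(T, T_grid[0], T_grid[-1])
    logP = np.clip(np.log10(P), logP_grid[0], logP_grid[-1])
    points = np.column_stack([np.full_like(wn, T), np.full_like(wn, logP), wn])
    return interpolator(points)

sigma = cross_section(1200.0, 1e4, wn_grid)  # one layer's absorption profile
```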
For a more complete description of the employed code, we refer the reader to the original papers (Al-Refaie et al. 2021, 2022a) and the NeurIPS TauREx tutorial available at Zenodo: https://doi.org/10.5281/zenodo.6770103.
APPENDIX C: POSTERIOR DISTRIBUTION OF ATMOSPHERIC RETRIEVAL
Fig. C1 shows an example of a posterior distribution resulting from a TauREx atmospheric retrieval. This posterior distribution corresponds to the data shown in Fig. 3.
Example of a posterior distribution obtained with TauREx 3 on a simulated Ariel observation. This correlation map is constructed using the Nested Sampling traces and weights, with the corner library.
The posterior distribution shows the correlations between the free parameters of the model (here, the atmospheric temperature and the abundances of five gases). This inverse problem is particularly challenging for ML solutions because, owing to the high level of degeneracy between the parameters of interest, the exoplanet community requires full probability distributions rather than a single best guess. Solutions to this inverse problem are required to (i) correctly identify the abundances of detectable molecules (see CO2, CH4, and CO); (ii) characterize the correlations between parameters (see e.g. the negative correlation between temperature and the abundance of CH4); (iii) constrain upper limits for the parameters that cannot be determined (see e.g. the NH3 distribution); and (iv) identify multimodal solutions (not shown in this example).
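For completeness, a correlation map like Fig. C1 can be produced from weighted Nested Sampling traces along the following lines. The trace and weight arrays below are random placeholders standing in for the content of the retrieval output; only the corner call itself reflects the library's actual interface.

```python
import numpy as np
import corner

# Placeholder weighted posterior trace; in practice these arrays come from
# the Nested Sampling output (e.g. TraceData.hdf5).
rng     = np.random.default_rng(0)
n_samp  = 5000
labels  = ["T", "log H2O", "log CO2", "log CO", "log CH4", "log NH3"]
trace   = rng.normal(size=(n_samp, len(labels)))   # (samples, parameters)
weights = rng.uniform(size=n_samp)
weights /= weights.sum()                           # normalized NS weights

# Weighted corner plot: 1D marginals on the diagonal, 2D correlations below
fig = corner.corner(trace, weights=weights, labels=labels,
                    quantiles=[0.16, 0.5, 0.84], show_titles=True)
fig.savefig("posterior_corner.png")
```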
APPENDIX D: DATA OVERVIEW – CONTINUED
A strength of this large-scale data generation lies in the use of currently known demographics as the source of planetary candidates. Planet formation and evolution remains an actively researched area, and there are contrasting theories as to how and why certain planets are more prevalent than others. By relying solely on observed planets, we avoided producing fictitious planets that would otherwise be impossible to form. Furthermore, the bimodal distribution of planet mass and radius contributes to the dichotomy seen in Fig. 4.
Due to the extremely low S/N of exoplanetary observations and non-linear instrument systematics, actual observations are usually accompanied by non-negligible measurement errors. These errors are specific to the brightness of the host star, the data reduction process, the instruments onboard, and the system's distance from us. ArielRad, the official radiometric simulator dedicated to the Ariel Space Mission, is designed specifically to account for the aforementioned effects and to provide realistic estimates of the observational uncertainties (Mugnai et al. 2020). Fig. D1 shows the distribution of (log-) observational uncertainties across the 52 wavelength channels. All of them display a non-Gaussian distribution, and some even present a bimodal distribution. There are also noticeable differences in shape and magnitude across channels. For instance, uncertainties associated with the blue end of the spectrum tend to be smaller than those at the red end, as the blue end of the spectrum comes from a photometer.
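One simple way to exploit such per-channel uncertainty estimates during model training is to draw noisy realizations of each clean spectrum. The sketch below assumes Gaussian per-channel noise and uses placeholder values, not actual ArielRad outputs.

```python
import numpy as np

rng = np.random.default_rng(42)
n_channels = 52

clean_spectrum = np.full(n_channels, 1.0e-2)    # placeholder transit depths
sigma = rng.uniform(1e-5, 1e-4, n_channels)     # placeholder per-channel errors

def noisy_realizations(spectrum, sigma, n_draws):
    """Draw Gaussian noise instances with channel-dependent standard
    deviation, e.g. for training-set augmentation."""
    return spectrum + rng.normal(0.0, sigma, size=(n_draws, spectrum.size))

augmented = noisy_realizations(clean_spectrum, sigma, n_draws=10)
print(augmented.shape)                          # (10, 52)
```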
Distribution of (log-) uncertainty across the different wavelength channels used at the Ariel Tier-2 resolution. These uncertainties are generated using ArielRad, which accounts for the different instrumentation onboard Ariel, stellar properties, as well as planetary properties. Since the S/N requirement of Ariel Tier-2 data is defined on the atmospheric signal, these distributions are offset by approximately one order of magnitude compared to the 'Feature height' distribution in Fig. 4.
APPENDIX E: LEVEL 2 DATA – DETAILED DESCRIPTIONS
E1 Structure
Level 2 data were originally designed for the NeurIPS 2022 competition, but the data structure can be reused for general model training. The data consist of a training set and a test set. The two sets share the same structure, with the aim of improving readability for non-field experts (a minimal loading sketch in Python follows the list):
AuxillaryTable.csv: contains supplementary astrophysical parameters.
SpectralData.hdf5: contains details of the spectroscopic observations.
Ground Truth Package: contains the ground truth targets for the competition.
TraceData.hdf5: records the traces of the empirical distribution obtained from Nested Sampling; it is primarily used for the Regular Track.
QuartilesTable.csv: records the 16th, 50th, and 84th percentiles of the posterior distribution; it is mainly used as a target for the Light Track.
FM_Parameter_Table.csv: records the model values that generated the spectra in the first place. While these may differ from the ground truth, they can be used as soft labels.
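Below is a minimal sketch of how these files could be opened in Python. The internal HDF5 group layout is an assumption and should be verified against the released data.

```python
import pandas as pd
import h5py

# CSV tables load directly into data frames
aux    = pd.read_csv("AuxillaryTable.csv")
quarts = pd.read_csv("QuartilesTable.csv")
fm     = pd.read_csv("FM_Parameter_Table.csv")

# HDF5 files: inspect the stored layout before relying on any key names
with h5py.File("SpectralData.hdf5", "r") as f:
    planet_ids = list(f.keys())          # one group per observation (assumed)
    print(planet_ids[0], list(f[planet_ids[0]].keys()))

with h5py.File("TraceData.hdf5", "r") as f:
    f.visit(print)                       # print every group/dataset path
```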
E2 Train-test split
Given our vast separation from any exoplanet and the limitations of current technologies, it is almost impossible to ascertain the true nature of a target exo-atmosphere. In other words, our test distribution will always differ from the training distribution, a situation known as domain shift in the ML literature (Wang & Deng 2018; Wilson & Cook 2020). To reflect this limitation, the train/test split is designed to uncover solutions that can maintain their performance even in unknown situations (unseen atmospheric behaviour and/or unseen planets).
To support this goal, we abandoned the usual practice of randomly dividing a data set into training and test sets, which only tests a model's ability to generalize under a homogeneous distribution. Instead, the test set is designed to contain In-Training Parameter Range (In-Range) and Out-of-Training Parameter Range (Out-Range) components. In-Range samples come from the same distribution as the training data; Out-Range samples are unseen by the model during training, including unseen planetary and atmospheric properties.
As a result, some planets were purposely removed from the training set to create unseen planetary properties; any theoretical spectra created from those planets were also withheld from training, causing a slight drop in the amount of available training data compared to Level 1 data. We further generated 5461 spectra under an equilibrium chemistry scheme (Agúndez et al. 2012, 2020), assuming solar elemental ratios, to create unseen atmospheric properties. As stated above, these spectra are unseen and thus are not included in the training set.
The test set is stratified into four subsets, each representing a varying degree of similarity to the training data. Table E1 summarizes the configurations of the four subsets. Subset 1 is the most similar to the training set, as all its components are In-Range, while Subset 4 is the most dissimilar, as all its components are Out-Range. Each subset contains roughly the same proportion (∼25 per cent) of the test set.
Table E1. Configuration of the four test subsets.

| | Subset 1 | Subset 2 | Subset 3 | Subset 4 |
|---|---|---|---|---|
| Planetary configuration | In-Range | Out-Range | In-Range | Out-Range |
| Atmospheric properties | In-Range | In-Range | Out-Range | Out-Range |
All examples, regardless of their initial atmospheric assumptions or planetary properties, are homogeneously retrieved using the free chemistry settings outlined in Section 2.3. By doing so, our retrievals are purposely biased and will not retrieve the input chemistry (ground truth). Participants are tasked with reproducing the results of our biased retrievals.
The combined effect of these two changes means that any proposed solution must maintain reliable and consistent behaviour when exposed to distributions that are unknown and unseen during training. We explicitly did not include any spectra generated under the equilibrium chemistry assumption in the training set, as a proxy for the actual situation: our atmospheric models cannot adequately describe real atmospheres.
The stratification of test examples provides flexibility for future investigations. The test set can be used to evaluate a trained model under different testing conditions; for instance, one can test a model on Subset 1 examples to understand its performance in the homogeneous case (a minimal evaluation sketch is given below). In any case, spectra generated with free and/or equilibrium chemistry are available online for any interested parties to construct their own training and test sets.
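As an example of such a conditioned evaluation, the sketch below scores a model separately on each test subset. The 'subset' column and the score function are hypothetical placeholders for the user's own metadata and metric.

```python
import numpy as np
import pandas as pd

# Hypothetical test-set metadata: one row per example, with its subset label
test_meta = pd.DataFrame({
    "planet_id": np.arange(8),
    "subset": [1, 1, 2, 2, 3, 3, 4, 4],
})

def score(model, ids):
    """Placeholder metric: replace with the challenge's actual scoring
    of predicted posteriors against the retrieval ground truth."""
    return float(len(ids))

model = None  # stand-in for a trained model
for subset, group in test_meta.groupby("subset"):
    s = score(model, group["planet_id"].to_numpy())
    print(f"Subset {subset}: score = {s:.2f}")
```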
The data set for the NeurIPS 2022 competition has a similar data structure, but features more diverse, unseen atmospheric assumptions than the ones presented here. A discussion of these atmospheric assumptions is outside the scope of this paper; readers are advised to refer to Yip et al. (2022b) for a more detailed description of the respective test set for the data challenge.
Author notes
ESA Research Fellow.