ABSTRACT

We show that convolutional neural networks (CNNs) trained to find strong gravitational lens systems are biased towards systems with larger Einstein radii and larger, more concentrated sources. This selection function is key to fully realizing the potential of the large samples of strong gravitational lens systems that will be found in upcoming wide-field surveys. In this paper, we use a CNN and three training data sets to quantify the network selection function and its implications for the many scientific applications of strong gravitational lensing. We use CNNs with architectures similar to those commonly found in the literature. The networks preferentially select systems with larger Einstein radii and larger sources with more concentrated source-light distributions. Increasing the detection significance threshold from 8$\sigma$ to 12$\sigma$ results in 50 per cent of the selected strong lens systems having Einstein radii $\theta_\mathrm{E} \ge 1.04$ arcsec (from $\theta_\mathrm{E} \ge 0.879$ arcsec), source radii $R_S \ge 0.194$ arcsec (from $R_S \ge 0.178$ arcsec), and source Sérsic indices $n_{\mathrm{Sc}}^{\mathrm{S}} \ge 2.62$ (from $n_{\mathrm{Sc}}^{\mathrm{S}} \ge 2.55$). The model trained to find lensed quasars shows a stronger preference for higher lens ellipticities than those trained to find lensed galaxies. The selection function is independent of the slope of the power-law mass profiles; hence, measurements of this quantity will be unaffected. The lens finder selection function reinforces that of the lensing cross-section, and thus we expect our findings to be a general result for all galaxy–galaxy and galaxy–quasar lens-finding neural networks.

1 INTRODUCTION

Strong gravitational lensing allows one to address a broad range of cosmological and astrophysical questions, from the nature of dark matter (e.g. Vegetti et al. 2018; Ritondale et al. 2019; Gilman et al. 2020; Hsueh et al. 2020) and galaxy evolution (e.g. Sonnenfeld et al. 2019; Mukherjee et al. 2021; Rizzo et al. 2021) to measuring the Hubble constant and other cosmological parameters (e.g. Collett & Auger 2014; Birrer & Treu 2021). The field is set to profit enormously from ongoing and upcoming large-sky surveys (with e.g. Euclid, the Square Kilometre Array, and the Vera Rubin Observatory), as the number of known strong gravitational lens systems is expected to increase by many orders of magnitude (Collett 2015; McKean et al. 2015). An understanding of the selection function of these surveys is required for a correct interpretation of the scientific results from strong gravitational lensing analyses of the larger sample.

There are two sides to the problem of selection bias in the newly discovered strong lens systems: the effect of the lensing cross-section, and that of the lens finding method used to identify the lenses. Sonnenfeld et al. (2023) focused on the former and showed that strong lenses are a biased subset of the true population with respect to the lens and source parameters. The effect of the lens finding methodology, however, is still not well understood.

The task of finding strong gravitational lens systems among the large number of objects imaged in future surveys is a challenging one, especially as strong lenses are, by their nature, very rare. Hence, automation in lens finding is expected to play an important role in the identification of these systems.

Neural networks are a class of machine learning models that are often applied in astronomy to search for rare objects. Lens finder neural networks are those trained to sift through large volumes of image data specifically to find strong gravitational lens candidates. Over the years, convolutional neural networks (CNNs) have emerged as the state of the art for lens finding. These models are essentially classifiers, trained to distinguish between lens and non-lens systems. For example, Lanusse et al. (2018) developed CMU DeepLens, a CNN trained to find strong gravitational lens systems in LSST data. The ResNet architecture implemented in their work became the standard for this task, and was adapted for the DESI Legacy survey by Huang et al. (2020) and Huang et al. (2021). CNNs have also been used to find strong lenses in the Kilo-Degree Survey (Petrillo et al. 2017, 2019a, b), HST/ACS data of the COSMOS field (Pourrahmani, Nayyeri & Cooray 2018), Pan-STARRS (Cañameras et al. 2020), HSC (Cañameras et al. 2021, 2023), and LOFAR (Rezaei et al. 2022).

In this paper, we present the first systematic study of the strong lens finder selection function. We develop lens finder neural networks in a similar fashion as commonly done in the lens finding literature and then characterize the selection effects that they introduce into the recovered sample of strong gravitational lenses. In particular, we are interested in identifying the characteristics of a strong gravitational lens system that drive the classification. Finally, we discuss the implications of these selection effects for different scientific applications of strong gravitational lensing.

The paper is organized as follows. In Section 2, we discuss the mathematical formalism used. In Section 3, we describe the image simulation process for the three data sets used in this work. In Section 4, we detail the machine learning training and interpretability framework. In Section 5, we present our results on the selection function. In Sections 6 and 7, we discuss our results and summarize our conclusions.

2 SELECTION BIAS FORMALISM

From Sonnenfeld et al. (2023), the probability $\mathrm{P}_{\mathrm{SL}}$ of selecting a sample of strong lens systems for a given selection criterion S can be expressed as

$$\mathrm{P}_{\mathrm{SL}} = \int \mathrm{P}_{\mathrm{sel}}(S \mid \Psi_l, \Psi_s)\, \mathrm{P}_l(\Psi_l)\, \mathrm{P}_s(\Psi_s)\, \mathrm{d}\Psi_l\, \mathrm{d}\Psi_s . \quad (1)$$

Here, $\Psi_l$ and $\Psi_s$ are the sets of parameters describing the lens galaxy mass and the background source light distributions, respectively, and $\mathrm{P}_l$ and $\mathrm{P}_s$ are the corresponding probability density distributions. $\mathrm{P}_{\mathrm{sel}}$ encapsulates the probability that a specific combination of lens and source produces a strong lens system and that this system is found in the survey. It can be further separated into two components:

$$\mathrm{P}_{\mathrm{sel}} = \mathrm{P}_{\mathrm{det}} \times \mathrm{P}_{\mathrm{find}}, \quad (2)$$

where $\mathrm{P}_{\mathrm{det}}$ is the probability that multiple images of the source form as clearly distinct features in the survey image, and $\mathrm{P}_{\mathrm{find}}$ is the probability that this image is correctly classified as a strong lens system. The latter depends on the identification procedure adopted. By focusing on the first term, Sonnenfeld (2022) and Sonnenfeld et al. (2023) have quantified how strongly the properties ($\Psi_l$ and $\Psi_s$) of the lenses and background sources are biased with respect to the general population of galaxies. The goal of this paper is to quantify $\mathrm{P}_{\mathrm{find}}$ for the case in which a CNN is used to identify lens systems in a given survey.

3 DATA

The Euclid mission is expected to find about $10^5$ strong gravitational lens systems (Collett 2015). In this work, we focus on neural networks trained only with single-band data because most of the gravitational lens systems that will be found in the Euclid Wide Survey will be observed using the Visual imager (VIS) instrument.

VIS is a broad-band optical instrument with a resolution of 0.16 arcsec, about three times better than that of the three infrared bands of the Near Infrared Spectrometer and Photometer (NISP) instrument (Euclid Collaboration 2022; O’Riordan et al. 2023). Since the Einstein radius distribution of the strong gravitational lenses expected to be found by Euclid peaks around 0.5 arcsec (Collett 2015), the arcs and lens light will often be blended together in NISP images; these features will be much better resolved by VIS, at the cost of losing colour information. The absence of multiband data may strongly influence what the CNNs learn during training, and may have a significant effect on the type of strong gravitational lens systems that are identified. Moreover, Petrillo et al. (2019a) found that the best-performing network in terms of purity and completeness was the one trained on single-band data. Similarly, Lanusse et al. (2018) showed that CMU DeepLens is a successful tool for finding strong gravitational lens systems, and that even without colour information it is able to learn enough about the lensed-arc morphology to solve the classification problem.

We create three data sets, $\mathcal{D_A}$, $\mathcal{D_B}$, and $\mathcal{D_C}$, each with $10^6$ images for training and $2 \times 10^5$ for testing. Each sample is split evenly between two classes: lens, containing only strong lens systems, and non-lens, containing none.

Figs 1 and 2 show examples of the two classes from $\mathcal{D_B}$ and $\mathcal{D_C}$, and Table 1 summarizes the simulation parameter distributions for all data sets.

Figure 1. A representative sample from $\mathcal{D_B}$, ranked in increasing order of SNR from the top left to the bottom right in each panel. Note the range of complexities of the lens light models in the sample. In some cases, distinguishing between lens light and lensed source emission is trivial, but in cases where the lens light model contains complex features like spiral arms, this is not the case. Examples of lenses are shown on the left, and examples of non-lenses are shown on the right. The simulation parameters are sampled as shown in Table 1.

Figure 2. A representative sample from $\mathcal{D_C}$, ranked in increasing order of SNR from the top left to the bottom right in each panel. The images are simulated as described in Section 4, with the inclusion of contaminant galaxies and a variable PSF FWHM. Examples of lenses are shown on the left, and examples of non-lenses are shown on the right. The simulation parameters are sampled as shown in Table 1.

Table 1. Parameter distributions used to create all data sets. From top to bottom: the parameters describing the source light distribution, the mass and light properties of the lens, the distribution of the light contaminants, and the observational set-up. The superscripts S and L denote source and lens properties, respectively.

Parameter | $\mathcal{D_A}$ | $\mathcal{D_B}$ | $\mathcal{D_C}$
Source light | Sérsic | Sérsic | Point-like
Source radius, $R_{\mathrm{S}}$ (arcsec) | $\mathcal{U}(0.05, 0.3)$ | $\mathcal{U}(0.05, 0.3)$ | -
Source Sérsic index, $n_{\mathrm{Sc}}^{\mathrm{S}}$ | $\mathcal{U}(1, 4)$ | $\mathcal{U}(1, 4)$ | -
Source axis-ratio, $q_{\mathrm{S}}$ | $\mathcal{U}(0.5, 1.0)$ | $\mathcal{U}(0.5, 1.0)$ | -
Source redshift | 2.0 | 2.0 | 2.0
Source apparent magnitude, $M_{\mathrm{VIS}}^{\mathrm{S}}$ | $\mathcal{U}(M_{\mathrm{VIS}}^{\mathrm{L}}, 24.0)$ | $\mathcal{U}(M_{\mathrm{VIS}}^{\mathrm{L}}, 24.0)$ | -
Source flux | - | - | 1.0
Lens light | Sérsic | GAN | GAN
Lens mass axis-ratio, $q$ | $\mathcal{U}(0.5, 1.0)$ | - | -
Lens power-law slope, $\gamma$ | $\mathcal{U}(1.8, 2.2)$ | $\mathcal{U}(1.8, 2.2)$ | $\mathcal{U}(1.8, 2.2)$
Einstein radius, $\theta_\mathrm{E}$ (arcsec) | $\mathcal{U}(0.5, 2.0)$ | - | -
Lens redshift | 0.8 | 0.8 | 0.8
Lens Sérsic index, $n_{\mathrm{Sc}}^{\mathrm{L}}$ | $\mathcal{U}(1, 4)$ | - | -
Lens apparent magnitude, $M_{\mathrm{VIS}}^{\mathrm{L}}$ | $\mathcal{U}(18.0, 22.0)$ | $\mathcal{U}(18.0, 22.0)$ | -
Lens flux, $F_\mathrm{L}$ | - | - | $\mathcal{U}(5.0, 100.0)$
Shear strength | $\mathcal{U}(0.0, 0.1)$ | $\mathcal{U}(0.0, 0.1)$ | $\mathcal{U}(0.0, 0.1)$
Contaminants | No | No | Yes
Contaminant flux, $F_\mathrm{C}$ | - | - | $\mathcal{U}(1.0, 100.0)$
Pixel size (arcsec) | 0.1 | 0.1 | 0.1
Field of view (arcsec) | 10 | 10 | 10
$\mathrm{PSF}_{\mathrm{fwhm}}$ (arcsec) | 0.16 | 0.16 | $\mathcal{U}(0.16, 0.3)$

We refer the reader to the following sections for more details on the properties of each data set. Briefly, the data sets $\mathcal{D_A}$ and $\mathcal{D_B}$ have extended sources, while the sources in $\mathcal{D_C}$ are unresolved. This is done to understand how the selection function, $\mathrm{P}_{\mathrm{find}}$, differs between galaxy–galaxy and galaxy–quasar lens systems. The lens galaxies in $\mathcal{D_A}$ have a simple analytical model for the lens-light distribution, while those in $\mathcal{D_B}$ and $\mathcal{D_C}$ are characterized by a higher level of complexity. This allows us to quantify how lens-light model complexity affects the neural network selection function.

We add complexity to the lens rather than the source light because distinguishing between these components becomes non-trivial when the lens contains features resembling lensed source emission (e.g. arcs). The lensed emission is typically sufficiently distinct from analytic Sérsic components because of its distorted morphology, such that introducing structure in the source will only make the task of lens finding easier. Moreover, using analytical source models allows us to more easily quantify how their parameters affect the network selection function for different choices of the lens light distribution.

3.1 Source light

We use Sérsic profiles (Sérsic 1963; Ciotti & Bertin 1999) for the source-light model in the case of $\mathcal{D_A}$ and $\mathcal{D_B}$. A point-like source model is used for $\mathcal{D_C}$. We refer the reader to Table 1 for more detailed information.
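For reference, the Sérsic profile can be evaluated in a few lines. The sketch below uses the common approximation $b_n \approx 2n - 1/3$ (Ciotti & Bertin 1999), which is accurate for the index range used here; the exact $b_n$ solves an incomplete-gamma equation.

```python
import numpy as np

def sersic_profile(r, r_eff, n, amp=1.0):
    """Sérsic surface brightness I(r) = amp * exp(-b_n * ((r/r_eff)^(1/n) - 1)).

    Uses the approximation b_n ~ 2n - 1/3, accurate for n >~ 1.
    By construction, I(r_eff) = amp.
    """
    b_n = 2.0 * n - 1.0 / 3.0
    return amp * np.exp(-b_n * ((r / r_eff) ** (1.0 / n) - 1.0))
```

By this normalization, `amp` is the surface brightness at the effective radius, and higher `n` gives a more centrally concentrated profile with more extended wings.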

3.2 Lens light

The lens light distribution in $\mathcal{D_A}$ is created using a Sérsic profile with an index and effective radius sampled uniformly from between 1.0 and 4.0 and between 0.05 and 0.5 arcsec, respectively. $\mathcal{D_B}$ and $\mathcal{D_C}$ use images generated with a generative adversarial network (GAN; see Holzschuh et al. 2022 for more details). The GAN was trained to generate images that imitate those created by applying the SKIRT code to the IllustrisTNG simulation (Springel et al. 2018; Rodriguez-Gomez et al. 2019). The resulting data sets therefore consist of realistic early-type (i.e. Sérsic-like) and late-type galaxy images, several of which contain complex features like star-forming clumps, spiral arms, and satellite galaxies (see Fig. 1 for examples of lens systems taken from $\mathcal{D_B}$). Structures of this kind can easily be confused with lensed-source emission when dealing with single-band data, making the problem of lens finding more challenging. In this respect, $\mathcal{D_A}$ represents the simplest formulation of the problem for the network to solve. The lens magnitudes in $\mathcal{D_A}$ and $\mathcal{D_B}$ are sampled from a uniform distribution, and the source magnitudes are chosen such that the source is dimmer than the lens. In $\mathcal{D_C}$, we instead use fluxes defined relative to the source.

3.3 Light contaminants

We do not include any light contaminants in $\mathcal{D_A}$ and $\mathcal{D_B}$. $\mathcal{D_C}$, on the other hand, does contain light contributions from nearby field galaxies. These are sampled from the GAN data set and placed randomly on the image, while ensuring that they lie outside the Einstein radius of the system and at least a set minimum distance apart from each other. These contaminants are 1 to 100 times brighter than the source light, and their number in each image is sampled uniformly between 0 and 4. We sample their redshifts between 0.2 and 5.0 from a probability density distribution based on the comoving volume at these redshifts. Examples of gravitational lens systems from $\mathcal{D_C}$ are shown in Fig. 2.
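The placement logic described above can be sketched as follows. The minimum-separation value `min_sep` is an illustrative assumption (the text does not state the exact value); the field of view and flux range follow Table 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def place_contaminants(theta_e, fov=10.0, min_sep=1.0, max_tries=100):
    """Randomly place 0-4 contaminants outside the Einstein radius,
    keeping a minimum mutual separation (all lengths in arcsec).

    min_sep is an illustrative value, not taken from the paper.
    """
    n = rng.integers(0, 5)  # number of contaminants, uniform in {0, ..., 4}
    positions = []
    tries = 0
    while len(positions) < n and tries < max_tries:
        tries += 1
        xy = rng.uniform(-fov / 2, fov / 2, size=2)
        if np.hypot(*xy) <= theta_e:
            continue  # inside the Einstein radius: rejected
        if any(np.hypot(*(xy - p)) < min_sep for p in positions):
            continue  # too close to an already placed contaminant
        positions.append(xy)
    # Contaminant fluxes relative to the (unit) source flux, U(1, 100)
    fluxes = rng.uniform(1.0, 100.0, size=len(positions))
    return np.array(positions), fluxes
```

Rejection sampling keeps the implementation simple; `max_tries` guards against the (rare) case where the constraints cannot all be satisfied.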

When choosing the brightness of the contaminants relative to the source, we aim to keep the flux per pixel area the same, as the source light is concentrated into a point. Thus, we set the contaminant fluxes to be 1 to 100 times the source flux. In addition, we try to minimize the number of mock images where the lens light is completely absent. We define $\mu$ as the magnification of each of the multiple images of the source in a single mock in our image simulation. The distribution of $\mu$ for the range of parameters chosen for the lens mass models in $\mathcal{D_C}$ (see Section 3.4 for more details) peaks at $\mu \approx 3$ (i.e. the lensed image is three times brighter than the source). Setting the lens fluxes to be 1 to 100 times the source flux, as we did for the contaminants, would therefore produce mock systems where the lensed emission is so bright that the lensing galaxy is not visible at all. To avoid having a large number of such images in the data set, we set the lens flux to be 5 to 100 times the source flux.

3.4 Lens mass

For all three data sets, an elliptical power law is assumed for the lens mass model, with the slope $\gamma$ ranging uniformly between 1.8 and 2.2 (Koopmans et al. 2006). In the cases where the lens light is a GAN image (i.e. $\mathcal{D_B}$ and $\mathcal{D_C}$), we use the image moments of the latter to align the position angle and axis-ratio of the mass profile with those of the lens light, thus ensuring that the light and mass distributions are consistent with each other. Moreover, the Einstein radius is scaled proportionally to the total flux of the selected GAN image.

We assume that the total integrated flux F is proportional to the total mass within the Einstein radius, $M = \pi \rho \theta_\mathrm{E}^2$, where $\rho$ is the average surface mass density. The Einstein radius is therefore proportional to the square root of the total flux, i.e. $\theta_\mathrm{E} \propto \sqrt{F}$, which ensures that the Einstein radius is related to the mass of the galaxy in a realistic way. $\mathcal{D_B}$ and $\mathcal{D_C}$ have the same simulation parameters as $\mathcal{D_A}$, except for the Einstein radius and lens axis-ratio, for which the distributions are shown in the left and centre panels of Fig. 3, respectively.
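A minimal sketch of this scaling; the reference flux and reference radius below are illustrative values (the actual normalization is set by the GAN-image fluxes), not numbers from the paper.

```python
import numpy as np

def einstein_radius_from_flux(flux, flux_ref=50.0, theta_e_ref=1.0):
    """Scale the Einstein radius with the square root of the total flux
    of the selected GAN image, theta_E ∝ sqrt(F).

    flux_ref and theta_e_ref are illustrative reference values.
    """
    return theta_e_ref * np.sqrt(flux / flux_ref)
```

Doubling the flux thus increases the Einstein radius by a factor of $\sqrt{2}$, consistent with $M \propto \theta_\mathrm{E}^2$ at fixed surface density.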

Figure 3. From left to right: distribution of lens Einstein radius, axis-ratio, and SNR of the images for the data sets $\mathcal{D_B}$ (orange histograms) and $\mathcal{D_C}$ (green histograms). The bins are in log-space for the panel on the right.

External shear with no preferred direction and with strength sampled from a uniform distribution is applied to all data sets. Additionally, for $\mathcal{D_C}$, we further ensure that the numbers of doubly and quadruply imaged systems are equal within the lens class. This is done by first calculating the area inside the inner caustic for the specific axis-ratio chosen. We then find the radius of a circle whose area is twice this value and sample the source position uniformly from within this circle.
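The sampling step can be sketched as follows, where `caustic_area` is assumed to have been computed beforehand for the chosen lens model. Since a source inside the inner caustic is quadruply imaged, drawing uniformly from a circle of twice the caustic area yields roughly equal numbers of doubles and quads.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_source_position(caustic_area):
    """Sample a source position uniformly within a circle whose area is
    twice the area enclosed by the inner caustic, so that roughly half
    of the lenses are quads (source inside the caustic) and half doubles."""
    r_max = np.sqrt(2.0 * caustic_area / np.pi)  # circle with area 2A
    r = r_max * np.sqrt(rng.uniform())           # sqrt => uniform in area
    phi = rng.uniform(0.0, 2.0 * np.pi)
    return r * np.cos(phi), r * np.sin(phi)
```

The `sqrt` of the uniform variate is what makes the distribution uniform in area rather than in radius.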

3.5 Noise

In order to add noise to the data, we follow the definition of signal-to-noise ratio (SNR) by O’Riordan, Warren & Mortlock (2019):

$$\mathrm{SNR} = \frac{\sum_i m_i d_i}{\sigma_d \sqrt{\sum_i m_i}}, \quad (3)$$

where $\sigma_d$ is the standard deviation of the sky noise, $d_i$ is the pixel value at index i, and $m_i$ is a masking variable defined such that $m_i = 1$ for pixels corresponding to lensed source light and 0 otherwise. For $\mathcal{D_A}$ and $\mathcal{D_B}$, pixels that lie within twice the Einstein radius in the lens-subtracted image of the system are considered source light; for $\mathcal{D_C}$, source light is taken to be all pixels with a value greater than 1.5 times the standard deviation of the lens-subtracted image.

The SNR of each image in $\mathcal{D_A}$ and $\mathcal{D_B}$ is set by the specific combination of source and lens magnitudes; even though we do not have explicit control over the SNR of each image, this results in a reasonable range of SNRs, as shown in the right panel of Fig. 3. In the case of $\mathcal{D_C}$, we first sample a value for the SNR from a uniform distribution (see Table 1), then add uncorrelated Gaussian noise to the images with a standard deviation calculated from equation (3). In order to set the SNR of each image a priori, we use relative fluxes instead of magnitudes when generating $\mathcal{D_C}$.
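For $\mathcal{D_C}$, the sky-noise level follows from inverting the SNR definition for $\sigma_d$ at a chosen target SNR; a minimal sketch, assuming the integrated-SNR form $\mathrm{SNR} = \sum_i m_i d_i / (\sigma_d \sqrt{\sum_i m_i})$ of O’Riordan, Warren & Mortlock (2019):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigma_for_target_snr(image, mask, snr_target):
    """Sky-noise standard deviation sigma_d giving the requested integrated
    SNR, assuming SNR = sum(m_i * d_i) / (sigma_d * sqrt(sum(m_i)))."""
    signal = np.sum(image * mask)
    return signal / (snr_target * np.sqrt(np.sum(mask)))

def add_sky_noise(image, mask, snr_target):
    """Add uncorrelated Gaussian noise at the level set by the target SNR."""
    sigma_d = sigma_for_target_snr(image, mask, snr_target)
    return image + rng.normal(0.0, sigma_d, size=image.shape), sigma_d
```

The mask here plays the role of $m_i$: only pixels flagged as lensed source light enter the SNR calculation.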

3.6 Point spread function

We use a circular Gaussian function for the point spread function (PSF). The full width at half-maximum (FWHM) is fixed at 0.16 arcsec for $\mathcal{D_A}$ and $\mathcal{D_B}$. For $\mathcal{D_C}$, we employ a variable FWHM, sampled uniformly between 0.16 and 0.3 arcsec. An FWHM of 0.16 arcsec corresponds to that of the VIS instrument aboard the Euclid telescope, while 0.3 arcsec is roughly twice as large.
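The convolution kernel follows from the standard FWHM-to-$\sigma$ conversion, $\sigma = \mathrm{FWHM}/(2\sqrt{2\ln 2})$. A sketch (the kernel size is our own illustrative choice):

```python
import numpy as np

def gaussian_psf(fwhm, pixel_size=0.1, size=11):
    """Normalized circular Gaussian PSF kernel.

    fwhm and pixel_size are in arcsec; sigma = FWHM / (2 sqrt(2 ln 2)),
    converted to pixels. size is an illustrative kernel width.
    """
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0))) / pixel_size  # pixels
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return psf / psf.sum()  # unit total flux
```

Normalizing the kernel to unit sum ensures convolution conserves the total image flux.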

4 METHODS

4.1 Training phase

During the training phase, the data sets were split into batches of $10^3$ images each. ResNet18 CNNs were trained for 800 epochs each on $\mathcal{D_A}$ and $\mathcal{D_B}$. For $\mathcal{D_C}$, we also included dynamic learning rate scheduling, wherein the learning rate is multiplied by a factor of $10^{-0.5}$ if the test loss remains stagnant for 20 epochs. These values were chosen after extensive hyperparameter tuning. The convergence criterion is three consecutive drops in learning rate without a change in the test loss. With this set-up, the network is trained for 260 epochs.

For all data sets, several data augmentation techniques were used during training, namely random cropping to 80 × 80 pixels, random vertical and horizontal flips, rotation, and erasing. The learning rate was set to 0.001, and the Adam optimizer was used to minimize the binary cross-entropy loss. The images were normalized such that each pixel value lies between 0 and 1 before being passed to the neural network.
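Purely for illustration, numpy analogues of the normalization and of two of the augmentations (random crop and flips) are sketched below; in practice these operations are supplied by the deep-learning framework used for training.

```python
import numpy as np

rng = np.random.default_rng(3)

def normalize(image):
    """Scale pixel values to [0, 1] before passing the image to the CNN."""
    lo, hi = image.min(), image.max()
    return (image - lo) / (hi - lo)

def augment(image, crop=80):
    """Illustrative analogues of two training augmentations:
    random crop to crop x crop pixels, then random horizontal/vertical flips.
    (Rotation and erasing, also used in training, are omitted here.)"""
    h, w = image.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    out = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]  # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]  # vertical flip
    return out
```

Random crops and flips teach the network invariance to the position and orientation of the lensed features, which carry no physical information about the lens itself.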

Table 2 shows the final testing accuracy and loss for each of the three networks, as well as the True Positive Rate (TPR), False Positive Rate (FPR), and Area Under the Receiver Operating Characteristic curve (AUROC) for each network on its corresponding data set. All three networks converge and achieve very high classification accuracy. This is also reflected in the AUROC values: the closer the AUROC is to 1, the closer the network is to a perfect classifier, so the AUROC measures the quality of the classifier on the particular data set it is trained on. Instead of focusing on lens systems with specific properties and data quality, this work uses a uniform sample representative of all possible systems, with a broad range of SNR values. As a consequence, our data sets contain a larger fraction of objects that are easy to classify compared to other published works on lens finding. This is reflected in the performance metrics of the CNNs, which are much better than those of other studies in the literature that were specific to a certain instrument and tailored for use on real data. Our performance metrics would probably suffer if real data were included in the testing data sets; however, this does not affect the main result of this analysis, which focuses on the selection effects of the CNNs.

Table 2. Network performance in terms of the final test accuracy, test data set loss, the True Positive Rate, the False Positive Rate, and the Area Under the ROC curve (AUROC), for each data set.

Data set | Test accuracy (per cent) | Loss | TPR (per cent) | FPR (per cent) | AUROC
$\mathcal{D_A}$ | 99.99 | 0.035 | 99.99 | 0.002 | 0.9999998
$\mathcal{D_B}$ | 99.64 | 0.064 | 99.61 | 0.202 | 0.9998687
$\mathcal{D_C}$ | 99.88 | 0.006 | 99.89 | 0.081 | 0.9999866

The small decrease in network performance between $\mathcal{D_A}$ and $\mathcal{D_B}$ is a consequence of the network being exposed to a larger fraction of lens-light models and negative examples (galaxies in the non-lens class) with complex structure that can be confused with lensed arcs. We note that the higher test accuracy of the network trained on $\mathcal{D_A}$ does not imply superior performance; rather, it reflects the relative simplicity of the lens light model.

4.2 Detection significance of a lens image

We use the inverse error function to convert from probabilities to detection significance:

$$C = \sqrt{2}\, \mathrm{erf}^{-1}(p), \quad (4)$$

where C is the network lens detection significance and p is the probability that an image belongs to the lens class. More details can be found in Appendix A1. In a lens-finding campaign, a detection significance threshold is chosen such that all images classified by the network above this $\sigma$-cut are accepted as strong lens candidates. The distributions of the outputs of the three networks on their respective data sets, after conversion to significance, are shown in the right panel of Fig. 4. Note that the distributions are roughly Gaussian and peak at different values for each data set.
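Assuming the conventional two-sided Gaussian relation $p = \mathrm{erf}(C/\sqrt{2})$ (so that, e.g., $p \approx 0.9545$ maps to $C = 2\sigma$), the conversion can be implemented with SciPy's inverse error function:

```python
import numpy as np
from scipy.special import erfinv

def detection_significance(p):
    """Convert the network's lens probability p into a detection
    significance C (in sigma), C = sqrt(2) * erfinv(p), assuming the
    two-sided Gaussian convention p = erf(C / sqrt(2))."""
    return np.sqrt(2.0) * erfinv(p)
```

Note that as p approaches 1 the significance diverges, so in practice probabilities saturated at machine precision correspond to a finite maximum C.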

Figure 4. Left: the number of lens images at each detection significance threshold for $\mathcal{D_A}$, $\mathcal{D_B}$, and $\mathcal{D_C}$. Right: distribution of detection significance values for the three networks trained on $\mathcal{D_A}$, $\mathcal{D_B}$, and $\mathcal{D_C}$.

4.3 Kullback–Leibler divergence

We would like to quantify the difference between the ground truth population of lenses and sources (⁠|$\Psi _l$|⁠, |$\Psi _s$|⁠) and the population recovered by the lens finder. To do this, we employ the Kullback–Leibler (KL) divergence, which is defined as

|$D_{KL}(p \, || \, q) = \int p(x) \, \ln \left( \frac{p(x)}{q(x)} \right) \mathrm{d}x$|(5)

Here, |$p(x)$| is the probability distribution function of the true parent distribution (all lenses in the testing data set) for the parameter x, and |$q(x)$| is that of the selected sample at a particular detection significance threshold. We estimate these distribution functions at a specific |$\sigma$|-cut using Kernel Density Estimation (KDE) with a Gaussian kernel. We use Scott’s rule (Scott 2015) to estimate a reasonable bandwidth for the KDE.
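A minimal sketch of this estimate, using scipy's `gaussian_kde` (whose default bandwidth is Scott's rule); the function name and the toy biased selection are illustrative, not the paper's code:

```python
# Hedged sketch of the KL-divergence estimate of equation (5): Gaussian
# KDEs for the parent sample p(x) and the selected sample q(x), with the
# divergence integrated numerically on a regular grid.
import numpy as np
from scipy.stats import gaussian_kde

def kl_divergence(parent, selected, n_grid=512):
    """Approximate D_KL(p || q) over the range of the parent sample."""
    p_kde = gaussian_kde(parent)      # bw_method defaults to Scott's rule
    q_kde = gaussian_kde(selected)
    x = np.linspace(parent.min(), parent.max(), n_grid)
    p, q = p_kde(x), q_kde(x)
    mask = (p > 0) & (q > 0)          # guard against log(0)
    dx = x[1] - x[0]
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

rng = np.random.default_rng(0)
parent = rng.normal(0.0, 1.0, 10_000)
# A toy selection biased towards larger parameter values:
selected = parent[parent > rng.normal(0.0, 2.0, parent.size)]
print(kl_divergence(parent, selected))  # > 0, grows as the cut tightens
```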

The purpose of using the KL-divergence as a metric is twofold: (i) from an interpretability point of view, the parameters that show an increase in the KL-divergence at high detection significance thresholds are clearly important for the networks to classify a system as a lens or a non-lens, lending insight into what the networks have learnt in order to solve the lens finding problem. (ii) For physical parameters that are measured in a strong lensing survey, the KL-divergence indicates by how much the parameters inferred will deviate from the parent distribution for a given detection threshold.

It must be noted here that we have neglected the correlations between the different parameters when calculating the KL-divergence, which is akin to marginalizing over all parameters except the one being considered in the calculation. This means that the variation of the KL-divergence for a specific parameter as the detection significance threshold becomes stricter indicates by how much this parameter will differ from the true sample, if we were only interested in measuring this specific parameter in a given survey. A more thorough analysis, which is beyond the scope of this work, would require the covariance of the parameters to be taken into account.

We also note that, as the detection significance threshold increases, the number of lens images drops exponentially, as shown in the left panel of Fig. 4. The resulting sampling noise inflates the KL-divergence, and this contribution must be disentangled from the increase caused by the selected sample genuinely differing from the truth. To this end, we draw samples of different total size (N), corresponding to each |$\sigma$|-cut, from the true distribution, and calculate the KL-divergence of each sample relative to the entire true distribution of |$10^5$| systems. This is repeated several thousand times to obtain enough realizations to account for sampling variance.
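The sampling-noise baseline can be sketched as follows, here with a simple histogram-based KL estimate to keep the snippet self-contained; all names, the trial count, and the choice of 50 bins are ours:

```python
# Estimate how much KL-divergence arises purely from drawing a finite
# subsample of size n_selected from the parent distribution, by repeating
# the draw many times and recording percentiles of the resulting D_KL.
import numpy as np

def kl_from_sampling_noise(parent, n_selected, n_trials=200, bins=50, seed=1):
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(parent, bins=bins)
    p, _ = np.histogram(parent, bins=edges, density=True)
    kls = []
    for _ in range(n_trials):
        sub = rng.choice(parent, size=n_selected, replace=False)
        q, _ = np.histogram(sub, bins=edges, density=True)
        mask = (p > 0) & (q > 0)          # skip empty bins
        dx = np.diff(edges)[mask]
        kls.append(np.sum(p[mask] * np.log(p[mask] / q[mask]) * dx))
    return np.percentile(kls, [16, 50, 84])   # ~1-sigma band and median

parent = np.random.default_rng(0).normal(size=100_000)
# The noise floor rises as fewer systems survive a stricter threshold:
print(kl_from_sampling_noise(parent, 5_000))
print(kl_from_sampling_noise(parent, 200))
```

The shaded bands in Figs 8-10 play the role of these percentile curves: only a KL-divergence rising well above them signals a genuine selection bias.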

5 RESULTS

Our results are shown qualitatively in Figs 5, 6, and 7 for the networks trained on |$\mathcal {D_A}$|⁠, |$\mathcal {D_B}$|⁠, and |$\mathcal {D_C}$|⁠, respectively. The contours containing 68 per cent of the mass of the distribution are shown for different detection significance thresholds. As the latter increases, the selected sample begins to deviate from the parent distribution.

Figure 5.

Distributions of the parent and selected sample for detection significance thresholds of 3σ, 7σ, and 10|$\sigma$| for the network trained on |$\mathcal {D_A}$|⁠. |$\theta _\mathrm{E}$| is the Einstein radius, q is the axis-ratio of the lens mass, |$\gamma$| is the power-law slope, |$M_{\mathrm{VIS}}^{\mathrm{L}}$| and |$M_{\mathrm{VIS}}^{\mathrm{S}}$| are the apparent magnitude of the lens and source, respectively, |$R_{\mathrm{S}}$| is the source radius, |$n_{\mathrm{Sc}}^{\mathrm{S}}$| is the Sérsic index of the source, |$q_{\mathrm{S}}$| is the axis-ratio of the source, and SNR is the signal-to-noise ratio.

Figure 6.

Distributions of the parent and selected sample for detection significance thresholds of 3σ, 7σ, and 10|$\sigma$| for the network trained on |$\mathcal {D_B}$|⁠. |$\theta _\mathrm{E}$| is the Einstein radius, q is the axis-ratio of the lens mass, |$\gamma$| is the power-law slope, |$M_{\mathrm{VIS}}^{\mathrm{L}}$| and |$M_{\mathrm{VIS}}^{\mathrm{S}}$| are the apparent magnitude of the lens and source, respectively, |$R_{\mathrm{S}}$| is the source radius, |$n_{\mathrm{Sc}}^{\mathrm{S}}$| is the Sérsic index of the source, |$q_{\mathrm{S}}$| is the axis-ratio of the source, and SNR is the signal-to-noise ratio.

Figure 7.

Distributions of the parent and selected sample for detection significance thresholds of 3σ, 4σ, 5σ, and 6|$\sigma$| for the network trained on |$\mathcal {D_C}$|⁠. |$\theta _\mathrm{E}$| is the Einstein radius, q is the axis-ratio of the lens mass, |$\gamma$| is the power-law slope, |$F_{\mathrm{L}}$| and |$F_{\mathrm{C}}$| are the lens and contaminant flux, respectively, and SNR is the signal-to-noise ratio.

A more quantitative measure is given by the KL-divergence, plotted in Figs 8, 9, and 10, calculated for a range of detection significance thresholds and for several parameters of interest, as described in Section 4.3. The blue regions in these figures show the 1σ, 2σ, and 3|$\sigma$| contours of the KL-divergence from sampling noise. If a certain parameter is important for classification, we expect the effect to be captured by the KL-divergence: it should manifest as an increase in the KL-divergence between the selected sample and the parent distribution that is much greater than the increase due to sampling noise. We use these variations in the KL-divergence to infer which parameter distributions are biased, and the contour plots to characterize the nature of the bias. Together, they provide a valuable tool for understanding the selection function of the three neural networks considered here.

Figure 8.

The increase in the KL divergence (⁠|$D_{KL}$|⁠) as the detection significance threshold is increased for the network trained on |$\mathcal {D_A}$|⁠. The blue lines show the |$D_{KL}$| that is due to sampling noise, with the increasingly shaded regions depicting 1σ, 2σ, and 3|$\sigma$| contours.

Figure 9.

The increase in the KL divergence (⁠|$D_{KL}$|⁠) as the detection significance threshold is increased for the network trained on |$\mathcal {D_B}$|⁠. The blue lines show the |$D_{KL}$| that is due to sampling noise, with the increasingly shaded regions depicting 1σ, 2σ, and 3|$\sigma$| contours.

Figure 10.

The increase in the KL divergence (⁠|$D_{KL}$|⁠) as the detection significance threshold is increased for the network trained on |$\mathcal {D_C}$|⁠. The blue lines show the |$D_{KL}$| that is due to sampling noise, with the increasingly shaded regions depicting 1σ, 2σ, and 3|$\sigma$| contours.

The detection significance thresholds at which selection biases become important are high for the neural networks used in this work. However, we urge the reader to keep two important caveats in mind: (i) typically, a combination of people and neural networks is used to find new lens systems, with the networks identifying possible targets that are then classified manually (see for example Rojas et al. 2023). The large survey volumes expected necessitate high detection significance cuts in the first pass, in order to keep the number of lens candidates requiring manual confirmation manageable. (ii) The actual detection significance values at which the distributions of the true and selected samples begin to differ strongly depend on the details of the simulation used to create the training data set. We have already shown (in Fig. 4) that the three networks used in this work have distinct distributions of detection significance on their training data sets, even though the differences between them are small. Thus, other lens finder neural networks trained on data sets created with other simulation codes (as will be the case for the networks used on Euclid data, LSST data, etc.) will have different detection significance thresholds that are acceptable from a selection bias point of view. The absolute values of the detection significance thresholds at which selection bias manifests for our networks are a relative measure of the importance of each parameter to the network classification, and can only facilitate a comparison between the networks used here. However, we expect the nature and relative strengths of these biases to be a general result for all lens finding CNNs.

From Figs 5, 6, and 7, we see that all three networks show a clear preference for systems with larger Einstein radii. For example, 50 per cent of all ground truth lens systems have Einstein radii |$\theta _\mathrm{E}$||$\ge$| 1.25, 0.839, and 0.842 arcsec for |$\mathcal {D_A}$|, |$\mathcal {D_B}$|, and |$\mathcal {D_C}$|, respectively. Beyond 8|$\sigma$|, however, 50 per cent of the identified lens systems have |$\theta _\mathrm{E}$||$\ge$| 0.880 arcsec, and an additional 4|$\sigma$| leads to a further increase to |$\theta _\mathrm{E}$||$\ge$| 1.04 arcsec for the network trained on |$\mathcal {D_A}$|. For |$\mathcal {D_B}$| and |$\mathcal {D_C}$|, the same increase in detection significance shifts the 50 per cent mark of the selected sample from |$\theta _\mathrm{E}$||$\ge$| 0.879 arcsec to |$\theta _\mathrm{E}$||$\ge$| 1.04 arcsec and from |$\theta _\mathrm{E}$||$\ge$| 0.844 arcsec to |$\theta _\mathrm{E}$||$\ge$| 0.975 arcsec, respectively. This can be understood intuitively: lensed images that lie further from the lens light are easier to de-blend. From the KL-divergence plots, we see that the Einstein radius is the quantity that most strongly affects the decision of the neural network. Indeed, this is the parameter for which the increase in KL-divergence is the steepest for all three networks. Figs 8, 9, and 10 show that the KL-divergence rises sharply at |$\approx 8\sigma$| for |$\mathcal {D_A}$| and |$\mathcal {D_B}$|, and at |$\approx 6\sigma$| for |$\mathcal {D_C}$|. Our finding justifies the assumption by Sonnenfeld et al. (2023) that |$\mathrm{P}_{\mathrm{find}}$| (the probability of finding a lens system in a survey) is, to first approximation, solely a function of the Einstein radius.

As the training data are made more complex and realistic, however, other properties of strong gravitational lens systems also become important for the network to correctly identify them as such. The network trained on |$\mathcal {D_A}$| shows a small degree of bias at very high detection significance values (> 14|$\sigma$|⁠) where we see a sensitivity to parameters other than the Einstein radius, namely, the lens axis-ratio, source radius, and source and lens magnitudes (as seen in Fig. 8). The most important difference between the networks trained on |$\mathcal {D_A}$| and |$\mathcal {D_B}$| is a bias of the latter in favour of specific source properties. This is related to the fact that |$\mathcal {D_A}$| consists of a simpler lens-light model and therefore the network does not need to learn complex concepts to distinguish between lensed and un-lensed emission. The more complex lens-light distributions in |$\mathcal {D_B}$| cause the network to not only be more sensitive to the lens axis-ratio and the source radius (we see a steep rise in KL-divergence already at > 7|$\sigma$| for |$R_S$|⁠), but also to the Sérsic index of the source. In particular, it is more likely to select sources with larger radii (⁠|$R_\mathrm{S}$|⁠), Sérsic indices (⁠|$n_{\mathrm{Sc}}^{\mathrm{S}}$|⁠), and axis-ratios (⁠|$q_\mathrm{S}$|⁠). Specifically, 50 per cent of the selected sample have |$R_\mathrm{S}$||$\ge$| 0.194 arcsec, |$n_{\mathrm{Sc}}^{\mathrm{S}}$||$\ge$| 2.62, and |$q_\mathrm{S}$||$\ge$| 0.758 at a 12|$\sigma$| cut as opposed to the ground truth values of |$R_\mathrm{S}$||$\ge$| 0.175 arcsec, |$n_{\mathrm{Sc}}^{\mathrm{S}}$||$\ge$| 2.51, and |$q_\mathrm{S}$||$\ge$| 0.751. All these properties lead to larger and more concentrated source light distributions, making the lensed arcs more easily distinguishable from the lens light.

Additionally, at very strict detection significance thresholds (i.e. > 12|$\sigma$| for the networks trained on |$\mathcal {D_A}$| and |$\mathcal {D_B}$|), all three networks tend to select lens systems with a slightly more elliptical lens mass distribution. An increase in the detection significance cut from 8σ to 12|$\sigma$| shifts the 50 per cent mark of the selected lens systems from q|$\le$| 0.750 to q|$\le$| 0.736 and from q|$\le$| 0.723 to q|$\le$| 0.713 for |$\mathcal {D_A}$| and |$\mathcal {D_B}$|, respectively (the 8|$\sigma$| values coincide with the true ones). The KL-divergence plot for |$\mathcal {D_C}$| (Fig. 10) shows a greater sensitivity (a steep rise in KL-divergence at C > 6|$\sigma$|) to the lens axis-ratio than for the extended-source lens finders. At a 7|$\sigma$| cut, half of the lens systems have lens axis-ratios q|$\le$| 0.670, as opposed to the ground truth value of q|$\le$| 0.723. Moreover, this network is more efficient at selecting images with lower |$\mathrm{PSF}_{\mathrm{FWHM}}$| values: at lower |$\mathrm{PSF}_{\mathrm{FWHM}}$|, it is easier to distinguish between the lensed point source and the lens/contaminant galaxies. However, the inclusion of stars as contaminants would make this task more difficult, and could render the preference for lower |$\mathrm{PSF}_{\mathrm{FWHM}}$| negligible.

For this data set, we also calculate the KL-divergence as a function of the |$\sigma$|-cut for the flux of the lens, |$F_{\mathrm{L}}$|, and of the contaminants, |$F_{\mathrm{C}}$|, which are proxies for how bright these components are relative to the source. We find that the KL-divergence is negligible at all |$\sigma$|-cuts, indicating that the network has learnt to distinguish between the lens/contaminant light and the source light very effectively within our mock data set.

6 DISCUSSION

Our main findings indicate that the sample of strong gravitational lens systems identified in wide-sky surveys with CNNs is biased with respect to both lens and source properties. This could have important implications for the many scientific applications of strong gravitational lensing, which we discuss in this section.

The CNN selection function, coupled with the lensing cross-section, will lead to samples of deflectors which are biased towards higher masses. This will effectively limit our prospects of studying the properties of lens galaxies with strong lensing to only the most massive objects. Essentially, the blurring by the PSF produces a lower limit on the Einstein radii of lens systems that can be found in a survey. This effect will be increased by the selection function of the neural networks.

Interestingly, we find no selection bias in favour of specific values of the slope, |$\gamma$|⁠, of the lens mass density profiles. Assuming that this result is confirmed with other CNNs, we can conclude that the large samples of new lens systems will allow one to measure this parameter and its evolution with redshift. This will provide a valuable probe of galaxy evolution and feedback models (e.g. Koopmans et al. 2006; Mukherjee et al. 2021).

Strong gravitational lensing acts as a cosmic telescope, allowing the study of high-redshift galaxies at higher physical resolution and SNR than usually possible otherwise. We have shown that networks trained on images with complex and more realistic lens light models prefer larger source radii and Sérsic indices. Sources with these properties are bigger and more concentrated, and thus produce arcs which are easier to distinguish from the lens light. A similar selection bias also occurs from the strong lensing cross-section (Hezaveh, Marrone & Holder 2012; Serjeant 2012), and the neural networks will reinforce this effect when the source profile is Sérsic-like. Hence, attempts to interpret the properties of lensed galaxies in the context of galaxy evolution models (as for example in Oldham et al. 2017a, b; Stacey et al. 2021) from the large sample of discovered lenses will need to additionally account for the effect of the neural network selection function.

Time-delay cosmography can be used as an additional probe of the Hubble constant (Refsdal 1964; Birrer et al. 2020; Birrer & Treu 2021; Shajib et al. 2023). However, this requires breaking the mass-sheet degeneracy (Falco, Gorenstein & Shapiro 1985) by incorporating independent mass measurements from stellar dynamics. Velocity dispersion measurement uncertainties limit the overall precision on |$H_0$| to |$\approx$| 10 per cent (Schneider & Sluse 2013; Kochanek 2020), which is too coarse to address the Hubble tension in a meaningful way. To overcome this issue, Birrer & Treu (2021) proposed a joint analysis using the mass models and kinematic properties of galaxy–galaxy lenses as a prior on the mass profiles of the time-delayed galaxy–quasar lenses. This method inherently assumes that the two types of strong gravitational lens systems are drawn from the same deflector parent population (see also Sonnenfeld 2021). However, Sonnenfeld et al. (2023) pointed out that in the case of quadruply imaged quasars, the lens galaxies tend to have larger ellipticities and halo masses for a given stellar mass. Similarly, we find that the neural network trained to find lensed quasars has a preference for higher ellipticity lens profiles, as these have wider caustics and thus a higher chance of producing four-image systems. More importantly, the networks trained to find lensed extended sources show a much weaker preference for higher ellipticities. Hence, the approach proposed by Birrer & Treu (2021, see also Birrer et al. 2020; Gomer et al. 2022) is more likely to introduce additional systematic errors rather than tighter constraints. These errors can be accounted for with an understanding of the selection function of the lens finder used, which can be obtained with an analysis of the type carried out in this work.

6.1 Limitations of our work

Our image simulations already account for several characteristics of real data that are often ignored in the lens finding literature, such as complexity in the lens-light models, field galaxies, and a variable PSF FWHM. However, the training data used in this work could be further improved by including complexities that (i) make the data more diverse (e.g. an elliptical PSF model with varying ellipticity, stars and non-lensed quasars as contaminants, complex source light models, cosmic rays, and CCD artefacts) and (ii) make the task of separating lensed emission from lens light more difficult (for example, ring galaxies). We have seen that complexities of the former kind require the network to become adept at ignoring contaminants, as with the network trained on |$\mathcal {D_C}$|. For the latter kind, we expect a more complex selection function when trained on more realistic data (as with the neural network trained on |$\mathcal {D_B}$|). Note that for a study such as this, we need to compare the parameter distributions of the input data set with the population found by the CNNs, so using real data would require lens modelling that would introduce further uncertainties. Including real lenses in our training data set or reducing the SNR values of the images will lower the training metrics, but will not affect the kinds of objects identified as lenses.

We have confined this study to the case of single-band data, thus it is unclear how training with multiband data will influence the selection function of the neural networks. It is possible that in this scenario, colour-related selection biases may be introduced in addition to the ones listed in this paper. Moreover, we have considered only the case of the ResNet18 architecture, as it is often the best-performing network for lens finding (Cañameras et al. 2023). How the selection function changes for different CNN architectures and for different ML models is an interesting question which is beyond the scope of this work.

Note that our data sets are split evenly between the two classes, whereas a real survey will have far more non-lens than lens images. This is an example of the class imbalance problem in the machine learning literature. A real sample would have at most one lens system for every 100 objects, and a neural network trained on a data set with a class ratio of 1:99 could achieve 99 per cent accuracy by simply classifying every object as a non-lens. To alleviate this issue, the machine learning community has developed many techniques that centre on artificially bringing the class ratio closer to 1:1, for example by oversampling the minority class or undersampling the majority class during training. Since we use simulated data, we can circumvent this issue by creating a data set with an equal number of lens and non-lens systems. Using a realistic class ratio could alter the notions that the networks learn during training, which may further exacerbate the selection effects that we outline in this paper.

7 CONCLUSIONS

In this paper, we quantified the selection function of the machine learning models that will likely be used to find the strong lens systems in future surveys. We focused our efforts on the ResNet18 architecture trained on three different data sets, and found that the classification task of lens finding leads to a selection bias on the parameters of the identified sample of strong lenses.

This results from the fact that lens finding neural networks are more efficient at finding lenses with certain properties, making them more likely to be above any detection significance threshold chosen for a given survey. Moreover, samples of lenses found by neural networks in the first pass are used as training data for the next iteration of lens finding networks. This might have the effect of further exacerbating the selection bias in each iteration.

We have shown that neural networks are most sensitive to the Einstein radius of the system as they preferentially select strong lens systems with larger values of |$\theta _\mathrm{E}$|⁠. In addition, the networks are biased towards bigger sources with more concentrated light distributions. Galaxy–quasar lens finding neural networks also show a stronger preference for more elliptical lens mass distributions than those trained to find galaxy–galaxy lens systems. We also find that the networks show no preference for any values of the lens power-law slope.

Lens finding neural networks reinforce the biases introduced by the lensing cross-section. Our results clearly show that an analysis of the selection effects of lens finding neural networks is a key additional step that needs to be incorporated into any systematic attempt to find strong gravitational lenses in upcoming surveys.

ACKNOWLEDGEMENTS

AH thanks the Max Planck Computing and Data Facility (MPCDF) for computational resources and support. AH also thanks Daniel Grün, Matteo Guardiani, and Philipp Frank for useful insights and discussions. SV thanks the Max Planck Society for support through a Max Planck Lise Meitner Group, and acknowledges funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (LEDA: grant agreement No 758853).

DATA AVAILABILITY

The data used in this paper are available from the corresponding author on request.

REFERENCES

Birrer S., Treu T., 2021, A&A, 649, A61
Birrer S. et al., 2020, A&A, 643, A165
Cañameras R. et al., 2020, A&A, 644, A163
Cañameras R. et al., 2021, A&A, 653, L6
Cañameras R. et al., 2023, preprint
Ciotti L., Bertin G., 1999, A&A, 352, 447
Collett T. E., 2015, ApJ, 811, 20
Collett T. E., Auger M. W., 2014, MNRAS, 443, 969
Euclid Collaboration, 2022, A&A, 662, A112
Falco E. E., Gorenstein M. V., Shapiro I. I., 1985, ApJ, 289, L1
Gilman D., Birrer S., Nierenberg A., Treu T., Du X., Benson A., 2020, MNRAS, 491, 6077
Gomer M. R., Sluse D., Van de Vyvere L., Birrer S., Courbin F., 2022, A&A, 667, A86
Hezaveh Y. D., Marrone D. P., Holder G. P., 2012, ApJ, 761, 20
Holzschuh B. J., O'Riordan C. M., Vegetti S., Rodriguez-Gomez V., Thuerey N., 2022, MNRAS, 515, 652
Hsueh J. W., Enzi W., Vegetti S., Auger M. W., Fassnacht C. D., Despali G., Koopmans L. V. E., McKean J. P., 2020, MNRAS, 492, 3047
Huang X. et al., 2020, ApJ, 894, 78
Huang X. et al., 2021, ApJ, 909, 27
Kochanek C. S., 2020, MNRAS, 493, 1725
Koopmans L. V. E., Treu T., Bolton A. S., Burles S., Moustakas L. A., 2006, ApJ, 649, 599
Lanusse F., Ma Q., Li N., Collett T. E., Li C.-L., Ravanbakhsh S., Mandelbaum R., Póczos B., 2018, MNRAS, 473, 3895
McKean J. et al., 2015, Proc. Sci., Strong Gravitational Lensing with the SKA. SISSA, Trieste, PoS#84
Mukherjee S., Koopmans L. V. E., Metcalf R. B., Tortora C., Schaller M., Schaye J., Vernardos G., Bellagamba F., 2021, MNRAS, 504, 3455
O'Riordan C. M., Warren S. J., Mortlock D. J., 2019, MNRAS, 487, 5143
O'Riordan C. M., Despali G., Vegetti S., Lovell M. R., Moliné Á., 2023, MNRAS, 521, 2342
Oldham L. et al., 2017a, MNRAS, 465, 3185
Oldham L., Auger M., Fassnacht C. D., Treu T., Koopmans L. V. E., Lagattuta D., McKean J., Vegetti S., 2017b, MNRAS, 470, 3497
Petrillo C. E. et al., 2017, MNRAS, 472, 1129
Petrillo C. E. et al., 2019a, MNRAS, 482, 807
Petrillo C. E. et al., 2019b, MNRAS, 484, 3879
Pourrahmani M., Nayyeri H., Cooray A., 2018, ApJ, 856, 68
Refsdal S., 1964, MNRAS, 128, 307
Rezaei S., McKean J. P., Biehl M., de Roo W., Lafontaine A., 2022, MNRAS, 517, 1156
Ritondale E., Vegetti S., Despali G., Auger M. W., Koopmans L. V. E., McKean J. P., 2019, MNRAS, 485, 2179
Rizzo F., Vegetti S., Fraternali F., Stacey H. R., Powell D., 2021, MNRAS, 507, 3952
Rodriguez-Gomez V. et al., 2019, MNRAS, 483, 4140
Rojas K. et al., 2023, MNRAS, 523, 4413
Schneider P., Sluse D., 2013, A&A, 559, A37
Scott D. W., 2015, Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley Series in Probability and Statistics
Serjeant S., 2012, MNRAS, 424, 2429
Sérsic J. L., 1963, Boletin de la Asociacion Argentina de Astronomia La Plata Argentina, 6, 41
Shajib A. J. et al., 2023, A&A, 673, A9
Sonnenfeld A., 2021, A&A, 656, A153
Sonnenfeld A., 2022, A&A, 659, A132
Sonnenfeld A., Jaelani A. T., Chan J., More A., Suyu S. H., Wong K. C., Oguri M., Lee C.-H., 2019, A&A, 630, A71
Sonnenfeld A., Li S.-S., Despali G., Shajib A. J., Taylor E. N., 2023, A&A, 678, A4
Springel V. et al., 2018, MNRAS, 475, 676
Stacey H. R. et al., 2021, MNRAS, 500, 3667
Vegetti S., Despali G., Lovell M. R., Enzi W., 2018, MNRAS, 481, 3661

APPENDIX A

A1 From logits to detection significance

The networks were trained for a binary classification task, i.e. for each image they produce one 2D vector containing the logit assigned to each class. This is encapsulated in the following equation:

|$(Y_{\mathrm{NL}, i}, Y_{\mathrm{L}, i}) = f(x_i)$|(A1)

where |$x_i$| is the input to the network (ith instance of the test data set) and |$f()$| is the mapping learnt by the network. The outputs correspond to a logit value for each of the two possible classes, lens and non-lens. |$Y_{\mathrm{NL}, i}$| and |$Y_{\mathrm{L}, i}$| are the non-lens and lens logits, respectively, and are typically passed through a softmax function which transforms the network’s outputs from logit space to prediction space. The softmax function is defined by

|$\sigma (\mathbf {z})_j = \frac{e^{z_j}}{\sum _{k=1}^{K} e^{z_k}}$|(A2)

where z is the vector corresponding to the output of the final fully connected layer, and K is the number of classes the network is trained to classify. In our case, the final prediction probability for a given input being a lens is

|$p(\mathrm{L} \mid x_i) = \frac{e^{Y_{\mathrm{L}, i}}}{e^{Y_{\mathrm{L}, i}} + e^{Y_{\mathrm{NL}, i}}}$|(A3)

and

|$p(\mathrm{NL} \mid x_i) = \frac{e^{Y_{\mathrm{NL}, i}}}{e^{Y_{\mathrm{L}, i}} + e^{Y_{\mathrm{NL}, i}}} = 1 - p(\mathrm{L} \mid x_i)$|(A4)

where |$\mathrm{L}$| and |$\mathrm{NL}$| are the lens and non-lens classes, respectively.

The inverse error function provides a measure of how close an input is to 1.0: it is zero at 0 and diverges asymptotically as its argument approaches 1.0, which can be interpreted as a detection significance. Thus, an image with a lens probability of exactly 1.0 is assigned an infinite detection significance, while a probability of 0.9973 corresponds to a 3|$\sigma$| detection. For probabilities between 0.99 and 1.0, the network's decision confidence is much easier to interpret after this conversion.

However, the conversion from logits to detection significance using the inverse error function is non-trivial, as one runs into numerical precision issues due to the asymptotic behaviour of the function as its argument approaches 1.0; high-precision floating point arithmetic would otherwise be required. Instead, we convert from logit space directly to detection significance (without the intermediate conversion to probability space via the softmax function) using the relation

|$C = \sqrt{2} \, \mathrm{erfc}^{-1} \left( \frac{1}{1 + e^{Y_{\mathrm{L}, i} - Y_{\mathrm{NL}, i}}} \right)$|(A5)

where C is the network’s lens detection significance.
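One numerically stable form consistent with this description exploits the fact that |$1-p$| is a logistic function of the logit difference, so the inverse complementary error function can be applied without ever forming p itself. The code below is our sketch of this idea, not the paper's implementation:

```python
# Stable logit-space conversion to detection significance: since
# 1 - p = 1 / (1 + exp(Y_L - Y_NL)), we can feed the small complement
# directly to erfc^-1 and avoid the precision loss as p -> 1.
import numpy as np
from scipy.special import erfinv, erfcinv

def significance_from_logits(y_lens, y_nonlens):
    one_minus_p = 1.0 / (1.0 + np.exp(y_lens - y_nonlens))
    return np.sqrt(2.0) * erfcinv(one_minus_p)

# The naive route overflows to infinity once p rounds to 1.0 in float64,
# while the logit-space route still returns a finite significance:
y_l, y_nl = 50.0, 0.0                       # p = 1/(1 + e^-50) rounds to 1.0
p = 1.0 / (1.0 + np.exp(-(y_l - y_nl)))
print(np.sqrt(2.0) * erfinv(p))             # inf
print(significance_from_logits(y_l, y_nl))  # finite, roughly 9.7 sigma
```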

A2 Number of images and contaminants

We ensure that the numbers of quadruply and doubly lensed quasars in |$\mathcal {D_C}$| are roughly equal (in reality, we would expect more doubles than quads); there is also a negligible fraction of three-image systems. As the detection significance threshold is increased, the left panel of Fig. A1 shows that the network very strongly prefers quadruply imaged quasars over doubly imaged ones.

Figure A1.

Left: distribution of the number of images of the source at different detection significance thresholds for the network trained on |$\mathcal {D_C}$|⁠. Right: distribution of the number of contaminants (field galaxies) in the image at different detection significance thresholds for the network trained on |$\mathcal {D_C}$|⁠.

In the testing data set, the number of contaminants in each image is sampled uniformly between 0 and 4. The right panel of Fig. A1 shows the distributions of the number of contaminants in the image at different detection significance thresholds. The plot indicates a subtle preference for images with fewer field galaxies as these may make the separation of the lensed point images from the rest of the field more complex.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.