Anupreeta More, Raoul Cañameras, Anton T Jaelani, Yiping Shu, Yuichiro Ishida, Kenneth C Wong, Kaiki Taro Inoue, Stefan Schuldt, Alessandro Sonnenfeld, Systematic comparison of neural networks used in discovering strong gravitational lenses, Monthly Notices of the Royal Astronomical Society, Volume 533, Issue 1, September 2024, Pages 525–537, https://doi.org/10.1093/mnras/stae1597
ABSTRACT
Efficient algorithms are being developed to search for strong gravitational lens systems owing to the increasing number of large imaging surveys. Neural networks have been successfully used to discover galaxy-scale lens systems in imaging surveys such as the Kilo Degree Survey, the Hyper Suprime-Cam (HSC) Survey, and the Dark Energy Survey over the last few years. Thus, it has become imperative to understand how these networks compare, what their strengths are, and what role is played by the training data sets, which are essential to the supervised learning algorithms commonly used in neural networks. In this work, we present the first-of-its-kind systematic comparison and benchmarking of networks from four teams that have analysed the HSC Survey data. Each team designed its training samples and developed its neural networks independently, but coordinated a priori in reserving specific data sets strictly for test purposes. The test sample consists of mock lenses, real (candidate) lenses, and real non-lenses gathered from various sources to benchmark and characterize the performance of each network. While each team’s network performed much better on its own constructed test samples than on those from the others, all networks performed comparably on the test sample with real (candidate) lenses and non-lenses. We also investigate the impact of swapping the training samples among the teams while retaining the same network architecture, and find that this improves the performance of some networks. These results have direct implications for the measures to be taken for lens searches with upcoming imaging surveys such as the Rubin Legacy Survey of Space and Time, Roman, and Euclid.
1 INTRODUCTION
Machine learning applications in astronomy have been growing within the last decade, including in the field of gravitational lensing. In strong gravitational lensing, multiple lensed images of the same distant galaxy or quasar are observed owing to the gravitational deflection by a massive galaxy or cluster in the foreground. Since this requires a sufficiently close line-of-sight alignment of the distant source and the foreground lens with the observer, such lens systems are a rare occurrence. However, with the increasing number of large imaging surveys with sufficiently deep observations, the discovery of large lens samples has become feasible, for instance, from the Dark Energy Survey (Diehl et al. 2017; O’Donnell et al. 2022), the Survey of Gravitationally lensed Objects in HSC Imaging (SuGOHI, e.g. Sonnenfeld et al. 2018, 2020; Jaelani et al. 2021; Wong et al. 2022; Chan et al. 2024), the Kilo Degree Survey (KiDS, e.g. Petrillo et al. 2017; Khramtsov et al. 2019; Li et al. 2020), and the DECam Legacy Survey (DECaLS, e.g. Huang et al. 2020; Storfer et al. 2022). Searching for lens systems is a classical pattern-recognition problem, as it involves identifying specific configurations, morphologies, and colours that are expected as a result of lensing. Additionally, the rarity of lens systems requires sifting through hundreds of images before a promising candidate lens system is discovered. Thus, this is an apt challenge to be addressed with machine learning algorithms.
Supervised deep learning algorithms based on convolutional neural networks (CNNs) are favourable because the majority of astronomy data involve the analysis of multiwavelength imaging. In the last few years, CNNs have been successfully implemented for searching, primarily, galaxy-scale lenses (e.g. Jacobs et al. 2017; Petrillo et al. 2017, 2019; Cañameras et al. 2020; He et al. 2020; Rojas et al. 2023). A few studies have compared neural network algorithms with other lens search methods on real survey data. For instance, Jacobs et al. (2017) compared the results of a CNN search on Canada–France–Hawaii Telescope Legacy Survey (CFHTLS) data with the results from a purely visual-inspection-based search conducted via Space Warps (Marshall et al. 2016; More et al. 2016), a citizen science program. It is worth noting that the Space Warps results from CFHTLS data are also produced with a supervised-learning approach. Similarly, the citizen-science-based results of More et al. (2016) have been compared with those from non-machine-learning algorithms (Gavazzi et al. 2014; More et al. 2012). Such comparison studies have suggested that each of these approaches and algorithms tends to find a subset of lens systems with some overlap with the others.
Others have compared diverse lens search methods, including pure visual inspection and algorithms with and without machine learning, on simulated space-based and ground-based data sets (Metcalf et al. 2019). They highlighted that multiband imaging plays an important role in increasing the efficiency of lens identification. A further study by Magro et al. (2021) on the same data sets, but after applying modified data pre-processing and augmentation, showed an improved performance of the various neural networks and emphasized the adaptability of CNNs. In Knabel et al. (2020), lens search methods such as machine learning, visual inspection, and spectroscopy were compared by analysing data from KiDS and the Galaxy And Mass Assembly (GAMA) survey. They found that each of the methods had distinct selection functions, resulting in hardly any overlapping candidates in spite of analysing the same footprints across three different fields. Surveys from upcoming telescopes such as the Vera Rubin Observatory,1 Euclid,2 and Nancy Grace Roman3 will increase the rate of detection of lenses by an order of magnitude. The need for efficient and robust machine learning algorithms is stronger than ever before given the challenge of big data.
In this work, we present the first systematic comparison of multiple networks and training sets tested on a common and diverse test data set. Such a study is crucial for identifying the strengths and weaknesses of the network architectures, along with the construction strategies of the different training-validation data sets, and thus for enabling the development of a superior and robust approach that will produce lens searches with high efficiency. In Holloway et al. (2024), a companion study, we combine different machine learning networks and Space Warps with the goal of constructing a unified, superior ensemble classifier that is much more efficient than any of the individual methods.
The paper is structured as follows. In Section 2, we briefly introduce the various networks and the methodologies used in generating the training-validation data sets. In Section 3, we describe the construction of the various common test data sets. In Section 4, we list the metrics used in our comparison study. In Section 5, we present the results, and in Section 6 we give our conclusions.
2 OVERVIEW OF NEURAL NETWORKS
Below we give a brief overview of the different neural networks that are used for comparison in this work. The participating teams have used the data from the HSC SSP Public Data Release 2 (PDR2) (Aihara et al. 2019) for this study.
2.1 Cañameras et al.
The classification in Cañameras et al. (2021, hereafter C21) uses a residual neural network (ResNet) inspired from the ResNet-18 architecture (He et al. 2016). After the 64
The network was trained and validated on
In C21, this ResNet was chosen among a range of networks to optimize lens identification over all extended galaxies in DR2 with i-band Kron radius
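As a concrete illustration, the following is a minimal Keras sketch of a ResNet-18-style binary lens classifier of the kind described above. It is not the C21 implementation: the input cut-out size (64 × 64 pixels in gri), the optimizer, and the classification head are assumptions.

```python
# Minimal sketch of a ResNet-18-style binary lens classifier
# (hypothetical configuration, not the actual C21 network).
import tensorflow as tf
from tensorflow.keras import layers, Model

def residual_block(x, filters, downsample=False):
    """Basic two-convolution residual block, as in ResNet-18."""
    stride = 2 if downsample else 1
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if downsample or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def build_resnet18(input_shape=(64, 64, 3)):
    """ResNet-18-like backbone ending in a single sigmoid lens-probability output."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, 7, strides=2, padding="same", use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    # four stages of two basic blocks each, as in ResNet-18
    for filters, n_blocks in [(64, 2), (128, 2), (256, 2), (512, 2)]:
        for i in range(n_blocks):
            x = residual_block(x, filters, downsample=(i == 0 and filters != 64))
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # lens probability
    return Model(inputs, outputs)

model = build_resnet18()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```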
2.2 Shu et al.
Two lens classifiers were presented in Shu et al. (2022, hereafter S22), both of which were constructed based on the deep residual network, deeplens_classifier, pre-built in the CMU DeepLens package (Lanusse et al. 2018). The main difference between those two lens classifiers was the mock lens population in the training set. For Classifier-1, the mock lenses in the training set covered a lens redshift range of 0–1.0 with a peak at
A full description of how Classifier-1 was built, trained, and tested can be found in S22. Here, we only summarize a few aspects that are relevant for comparing with the other networks. Classifier-1 was trained and validated on HSC
Classifier-1 was optimized based on a test set consisting of 92 grade-A or B strong-lens candidates from the SuGOHI project that were also in the parent sample and 50 000 non-lens objects randomly selected from the parent sample. In S22, the probability threshold was chosen to be
2.3 Jaelani et al.
The lens classification in Jaelani et al. (hereafter J23) uses a classical CNN inspired by the architecture used in Jacobs et al. (2017). The network comprises five convolutional layers with 11
The CNN was trained and validated on HSC
The parent sample of 2.3 million galaxies that we used in J23 was selected based on criteria on, e.g. multiband magnitudes, stellar mass, star formation rate, extendedness limit, and photometric redshift range.
2.4 Ishida et al.
This strong lens classifier (Ishida et al. 2024 in preparation; hereafter I24) uses a classical CNN architecture. The CNN is composed of six blocks. Each block consists of two convolutional layers with an equal number of filters and a batch normalization layer. Convolutional layers within these blocks have 32, 64, 64, 64, 128, and 128 filters, respectively, with ReLU activation. The first layer uses a 7
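For illustration, below is a minimal Keras sketch of such a six-block CNN. The filter counts (32, 64, 64, 64, 128, 128), the paired convolutions per block, the batch normalization, and the ReLU activations follow the description above; the 64 × 64 pixel gri input, the first-layer kernel size of 7 × 7, and the remaining kernel sizes, pooling, and dense head are assumptions rather than the actual I24 configuration.

```python
# Sketch of a six-block CNN matching the description of the I24 classifier
# (kernel sizes after the first layer, pooling, and the dense head are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_six_block_cnn(input_shape=(64, 64, 3)):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for i, filters in enumerate([32, 64, 64, 64, 128, 128]):
        kernel = 7 if i == 0 else 3          # 7x7 assumed for the very first layer
        model.add(layers.Conv2D(filters, kernel, padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(2))    # assumed down-sampling between blocks
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))   # assumed dense head
    model.add(layers.Dense(1, activation="sigmoid"))  # lens probability
    return model

model = build_six_block_cnn()
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```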
The training and validation data are the same as for the J23 network, comprising 18 660 mock lenses and 18 660 non-lenses. We scale the FITS image data using an algorithm (hereafter ‘SDSS normalization’) based on Lupton et al. (2004). We first scale the g-, r-, and i-band images by multiplicative factors of 2.0, 1.2, and 1.0, respectively. These values were determined through testing of various scaling factors and were found to give the best results. We then apply the normalization described by the equations
where B are the fluxes of each pixel in the respective bands, while
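Since the normalization equations themselves are not reproduced above, the following is only a hedged sketch of the ‘SDSS normalization’ step, using astropy’s implementation of the Lupton et al. (2004) asinh scaling as a stand-in. The band scaling factors (2.0, 1.2, 1.0 for g, r, i) come from the text; the Q and stretch parameters are assumptions.

```python
# Hedged sketch of the 'SDSS normalization' step: the exact equations used by I24
# are not reproduced here, so astropy's Lupton et al. (2004) asinh scaling is used
# as a stand-in. Q and stretch values are illustrative assumptions.
import numpy as np
from astropy.visualization import make_lupton_rgb

def sdss_normalize(g_img, r_img, i_img, Q=8.0, stretch=0.5):
    """Scale the gri bands and apply a Lupton-style asinh normalization."""
    g = 2.0 * g_img   # multiplicative band factors from the text
    r = 1.2 * r_img
    i = 1.0 * i_img
    # make_lupton_rgb expects (R, G, B)-ordered images and returns uint8;
    # rescale to [0, 1] so the result can be fed to the CNN.
    rgb = make_lupton_rgb(i, r, g, Q=Q, stretch=stretch)
    return rgb.astype(np.float32) / 255.0

# Example usage with gri cut-out arrays (loaded e.g. from HSC FITS cut-outs):
# x = sdss_normalize(g_img, r_img, i_img)   # shape (H, W, 3)
```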
We then apply data augmentation to the data set as follows:
a random shift ranging between −6 and +6 pixels in both the x and y directions;
a random horizontal and vertical flip, each with 50 per cent probability;
a random rotation in the range [−36,36] deg;
a random adjustment of the image contrast in the range [0.9, 1.1];
a random scaling of the image brightness in the range [−0.1, 0.1].
The data augmentation is applied directly to the input training and validation data at the start of training (i.e. it does not create duplicate objects) and is not re-applied at each epoch. The data augmentation steps can result in the transformed images containing points outside the original cut-outs of the input images, so we fill these regions with zeros to maintain the original cut-out dimensions.
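A minimal numpy/scipy sketch of these augmentation steps, applied once per image before training, could look as follows; the exact implementation used by I24 may differ, in particular in how the contrast and brightness adjustments are defined.

```python
# Sketch of the augmentation steps listed above: random shift, flips, rotation,
# contrast, and brightness, with zero-filling outside the original cut-out.
import numpy as np
from scipy.ndimage import shift, rotate

rng = np.random.default_rng(42)

def augment(image):
    """image: float array of shape (H, W, bands), already normalized to [0, 1]."""
    # random shift of -6..+6 pixels in x and y, zero-filled outside the cut-out
    dy, dx = rng.integers(-6, 7, size=2)
    out = shift(image, (dy, dx, 0), order=1, mode="constant", cval=0.0)
    # random horizontal / vertical flips, each with 50 per cent probability
    if rng.random() < 0.5:
        out = out[:, ::-1]
    if rng.random() < 0.5:
        out = out[::-1, :]
    # random rotation in [-36, 36] degrees, zero-filled
    angle = rng.uniform(-36.0, 36.0)
    out = rotate(out, angle, axes=(0, 1), reshape=False, order=1,
                 mode="constant", cval=0.0)
    # random contrast in [0.9, 1.1] and brightness offset in [-0.1, 0.1]
    out = out * rng.uniform(0.9, 1.1) + rng.uniform(-0.1, 0.1)
    return np.clip(out, 0.0, 1.0)
```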
3 CONSTRUCTION OF THE COMMON TEST SAMPLES
Here, we describe the various real and simulated lens and non-lens samples used in constructing the common test data sets, with which the networks are systematically compared and their performances benchmarked. The participating teams had agreed that all of the HSC data from the GAMA09H field be reserved for testing and comparison of the various networks. A summary of the sample sizes of the various data sets is given in Table 1, and further details are given in the following.
3.1 Known galaxy-scale lenses - L1
Each network is tested on an observational data set of 42 galaxy-scale strong lenses in GAMA09H that have been either spectroscopically confirmed or listed as high-quality candidates. First, we use all systems listed as galaxy–galaxy systems with grade A or B in SuGOHI papers (Sonnenfeld et al. 2018, 2020; Wong et al. 2018; Jaelani et al. 2020), which corresponds to four grade A and 32 grade B systems. These lenses were found in HSC Wide imaging from a range of data releases up to PDR2, either with YattaLens, an arc-finder combining lens light subtraction and lens modelling, or with crowdsourcing. All high-quality candidates were also validated by experts. Secondly, we consider the galaxy-scale lens candidates identified in GAMA09H with deep learning classification of images from Data Release 4 of the Kilo-Degree Survey (LinKS, Petrillo et al. 2017, 2019). We only consider the subset classified as highest quality, with a visual score larger than 28 in the grading scheme adopted by the authors.
In summary, the strong lenses in data set L1 have been found either via non-machine-learning techniques applied to HSC multiband imaging, or via supervised CNNs applied to KiDS imaging, but none has been identified by neural networks from the HSC Wide images on which we are testing the networks. They cover a large variety of multiple-image configurations and angular separations, as well as various source-to-lens flux ratios (see Fig. 1).

Mosaic of grade A and B galaxy–galaxy lenses and lens candidates in data set L1. We list the neural networks that successfully classify them as lenses, for the score threshold assumed in the corresponding lens search papers.
3.2 Lens candidates from our own networks - L2
The data set L2 contains lens candidates found in HSC PDR2 images of GAMA09H with three of our networks.4 After removing duplicates and galaxy-scale systems that are part of data set L1, we obtained 138 grade A or B candidates with visual grades

Mosaic of grade A and B galaxy–galaxy lenses and lens candidates in data set L2, obtained from our four machine learning searches in HSC PDR2 images. We list the neural networks that successfully classify these cut-outs as lenses, for the score thresholds used in the corresponding papers.

3.3 Mock lenses from Cañameras et al. - L3
The data set L3 includes 3000 mock lenses generated following the methodology of C21, but using galaxies in the GAMA09H field, which was excluded during training. In short, the simulations follow the procedure described in Schuldt et al. (2021) and Cañameras et al. (2020) by co-adding lensed sources to HSC Wide images of LRGs in GAMA09H with SDSS DR16 spectroscopy. We used the spectroscopic redshifts
For each lens system, the HUDF-selected source was randomly centred at positions with
3.4 Mock lenses from Jaelani et al. - L4
The data set L4 is a sample of 3000 mock lenses from GAMA09H generated by J23 and excluded during their training. These mocks are hybrid in nature, that is, simulated lensed features are superposed on real images of HSC galaxies. We use the simct5 pipeline (see More et al. 2016, for details) for this purpose. Starting with a catalogue of massive galaxies, we use the photometric redshifts and magnitudes to determine the luminosity of each galaxy. Using standard relations, we obtain the velocity dispersion and, assuming mass follows light, we characterize the lens mass model for each potential lens galaxy. The background galaxies are drawn from luminosity functions, their colours are taken from real galaxy catalogues, and their light profiles are parametrized with a Sersic model. Lensed arc-like images of the Sersic source are generated and then merged with the
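To make the hybrid-mock idea concrete, here is a toy numpy sketch (not the simct pipeline itself): a Sersic source is ray-shot through a singular isothermal sphere (SIS) lens and the resulting arcs can be co-added onto a real galaxy cut-out. The Einstein radius, source parameters, and pixel scale are illustrative assumptions.

```python
# Toy illustration of a hybrid mock: lensed arcs of a Sersic source behind an SIS
# lens, to be co-added onto a real HSC cut-out. All parameter values are assumptions.
import numpy as np

def sersic(r, amp=1.0, r_eff=0.3, n=1.0):
    """Sersic surface brightness profile (approximate b_n, adequate for n >~ 0.5)."""
    b_n = 2.0 * n - 1.0 / 3.0
    return amp * np.exp(-b_n * ((r / r_eff) ** (1.0 / n) - 1.0))

def lensed_arcs(npix=64, pixscale=0.168, theta_e=1.2, src_x=0.1, src_y=0.05):
    """Image-plane surface brightness of a Sersic source lensed by an SIS."""
    half = npix * pixscale / 2.0
    x, y = np.meshgrid(np.linspace(-half, half, npix),
                       np.linspace(-half, half, npix))
    r = np.hypot(x, y) + 1e-12
    # SIS deflection: alpha = theta_E * (x, y) / |theta|; lens equation beta = theta - alpha
    beta_x = x - theta_e * x / r
    beta_y = y - theta_e * y / r
    r_src = np.hypot(beta_x - src_x, beta_y - src_y)
    return sersic(r_src)

# Co-add the arcs onto a real galaxy cut-out (array name hypothetical):
# mock = lrg_cutout_i + flux_scale * lensed_arcs()
```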
3.5 Random non-lenses - N1
We randomly selected 2996 non-lenses from the GAMA09H field which were classified by the Space Warps citizens as low lensing probability candidates. We ensure that none of the SuGOHI lenses or the recently published grade A and B strong lens candidates are part of this sample.
3.6 Selection of non-lenses following Cañameras et al. - N2
This data set includes 3000 real galaxies from GAMA09H selected with the same recipe as non-lenses for training the network in C21. Galaxies are selected from the HSC Wide DR2 catalogue to match four different classes. First, a total of 33 per cent spiral galaxies from Tadaki et al. (2020) with i-band Kron radii below 2 arcsec are selected. This size cut is intended to exclude the brightest, most extended galaxies in the input catalogue to focus on spirals with arms located at similar angular separation as multiple images of galaxy-scale lenses. Secondly, 27 per cent LRGs from the input sample of the lens simulation pipeline, namely isolated LRGs from data set L2 without lensed arcs, are selected. Thirdly, 6 per cent compact galaxy groups are selected from Wen, Han & Liu (2012), with at least four galaxies falling within the HSC cut-out. Lastly, 33 per cent random galaxies with r
3.7 Selection of non-lenses following Shu et al. - N3
The data set N3 includes 3000 real galaxies within the GAMA09H field that satisfy a set of criteria defined in section 2 of S22, which were used to construct the non-lens examples for training, validation, and testing purposes. When selecting, we made sure that none of the 3000 galaxies was included for training or validation in S22 and that none of them was reported as a strong-lens system or candidate according to the strong-lens compilation built by S22.
3.8 Selection of non-lenses following Jaelani et al. - N4
The data set N4 includes 3000 real galaxies from GAMA09H following a similar selection to the non-lenses in J23. We selected non-lens objects for the negatives, which contain 40 per cent galaxies randomly selected with photometric redshifts between 0.2 and 1.2, and
3.9 Tricky non-lenses - N5
The data set N5 comprises a sample of 727 non-lenses in GAMA09H from Space Warps (found by YattaLens, visually classified as false positives, and used for training the citizen volunteers), after excluding any overlap with SuGOHI or with recently published grade A or B strong lens candidates.
4 METRICS USED IN COMPARISON ANALYSIS
The performance of each network is evaluated with a range of metrics and various combinations of test data sets. First, the Receiver Operating Characteristic (ROC) curves are computed by varying the network thresholds between 0 and 1, as shown in Fig. 4, using the following definitions of the true positive rate (TPR or recall) and false positive rate (FPR or contamination):

TPR = TP / (TP + FN),  FPR = FP / (FP + TN),

where TP, FN, FP, and TN are the numbers of true positives, false negatives, false positives, and true negatives, respectively.

ROC curves for the networks presented in C21 (maroon), S22 (blue), J23 (orange), and I24 (green). The first and second rows focus on test data sets drawn from observations. The first five panels from the top-left measure recall with real SuGOHI and KiDS lenses and lens candidates from data set L1 + L2, and derive contamination from various combinations of non-lenses in data sets N1, N2, N3, N4, and N5. The last panel in the second row combines all lenses and non-lenses with real HSC images (excluding the mock lenses in data sets L3 and L4), after removing duplicates. The panels in the bottom two rows focus on the positive and negative examples used for training the networks in C21, S22, and J23. Sections with fewer than five objects are masked to account for the variations in sample sizes. The threshold scores defined in each paper are indicated by dots. The thick grey lines show a random classifier.
This allows us to deduce the classical metrics for binary classification problems that are used in the previous lens finding challenges (Metcalf et al. 2019) and other studies (Schaefer et al. 2018; Cheng et al. 2020), namely, the Area Under the ROC (AUROC). Computing these quantities for the current HSC test data sets, including a wide range of spirals, rings, mergers, and other types of non-lens galaxies, allows us to compare the network classification performances with previous challenges focusing on less representative test samples drawn from simulations.
In Fig. 5, we show an additional metric, the F1 score, as a function of the threshold score. We use the standard definitions of the F1 score, precision, and recall (i.e. TPR) as follows:

precision = TP / (TP + FP),  recall = TP / (TP + FN),  F1 = 2 × precision × recall / (precision + recall).

Performance metrics as a function of network threshold for the same combination of data sets as shown in Fig. 4. Top: precision (solid) and recall (dashed) curves and Bottom: F1 scores for the four networks by C21 (maroon), S22 (blue), J23 (orange), and I24 (green).
While an ideal network would have both high precision and high recall, the networks, in practice, tend to perform better on only one of them while compromising the other. The F1 score allows one to assess the accuracy of the network by combining precision and recall. It is defined to result in a high value when both the precision and the recall are high.
We also compute the performance for different combinations of test data sets. We start exclusively with test data sets drawn from observations by measuring recall from SuGOHI and KiDS lenses and lens candidates from data set L1, and deriving contamination from non-lenses in data sets N1 to N5. We then combine all non-lenses together; we include lens candidates from data set L2, and we estimate the performance jointly for all real and simulated lenses, and for all non-lens galaxies. Finally, we focus on data sets L3/N2, L3/N3, and L4/N4 that mimic the positive and negative examples used for training networks in C21, S22, and J23, respectively. These various combinations of test data sets range from 40 to 6200 positive examples and 700–12 700 negatives (see Table 1).
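As an illustration of how these metrics can be computed, the following scikit-learn sketch evaluates the ROC curve, the AUROC, and the precision, recall, and F1 score at a fixed threshold. The `scores` and `labels` arrays are synthetic placeholders, not the actual network outputs.

```python
# Sketch of the metric computation with scikit-learn: ROC/AUROC over all thresholds,
# plus precision, recall, and F1 at a fixed network threshold.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_fscore_support

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)                          # 1 = lens, 0 = non-lens
scores = np.clip(labels * 0.6 + rng.random(1000) * 0.5, 0, 1)   # placeholder scores

fpr, tpr, thresholds = roc_curve(labels, scores)   # contamination and recall per threshold
auroc = roc_auc_score(labels, scores)

threshold = 0.9                                    # e.g. the threshold adopted by I24
pred = (scores >= threshold).astype(int)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, pred, average="binary")
print(f"AUROC={auroc:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  F1={f1:.3f}")
```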
5 RESULTS OF COMPARISON
We have characterized the various networks using the different metrics listed in the previous section. We have two networks based on the ResNet architecture, i.e. C21 and S22. Their respective training sets are L3
5.1 Comparison of different networks
Fig. 4 shows the ROC curves for different combinations of test data sets in each panel where the random classifier (dark grey) always serves as a reference. The real (candidate) lens sample (L1
A similar behaviour is noted for the J23 (orange) and I24 (green) networks. In the final panel ‘l’, where all (simulated and real) lenses are combined with all non-lenses, the J23 (orange) network performs better than the rest, even if marginally so. The main reason is that the scatter in the ROCs of J23 across different data set combinations is smaller than for the other networks, providing a relatively stable performance. While the above conclusions are also evident from the quantitative results based on the AUROCs reported in Table 2, we want to emphasize that the differences between the AUROCs for the various networks across different data set combinations are fairly small (particularly for rows f and l). In fact, Fig. 4 is made linear
Table 2. AUROC and F1 scores of the four networks for the different combinations of test data sets (rows a–l correspond to the panels of Fig. 4).

| Test data sets | AUROC (C21) | AUROC (S22) | AUROC (J23) | AUROC (I24) | F1 (C21) | F1 (S22) | F1 (J23) | F1 (I24) |
|---|---|---|---|---|---|---|---|---|
| a) L1 + L2 | 0.97 | 0.98 | 0.86 | 0.80 | 0.77 | 0.76 | 0.55 | 0.45 |
| b) L1 + L2 | 0.97 | 0.91 | 0.90 | 0.80 | 0.74 | 0.53 | 0.61 | 0.47 |
| c) L1 + L2 | 0.97 | 0.99 | 0.88 | 0.83 | 0.78 | 0.77 | 0.62 | 0.48 |
| d) L1 + L2 | 0.94 | 0.96 | 0.97 | 0.87 | 0.68 | 0.72 | 0.74 | 0.51 |
| e) L1 + L2 | 0.96 | 0.90 | 0.84 | 0.79 | 0.77 | 0.70 | 0.69 | 0.52 |
| f) L1 + L2 | 0.96 | 0.96 | 0.90 | 0.82 | 0.59 | 0.47 | 0.35 | 0.37 |
| g) L3 | 1.00 | 0.99 | 0.90 | 0.71 | 0.99 | 0.94 | 0.70 | 0.25 |
| h) L3 | 1.00 | 1.00 | 0.88 | 0.76 | 1.00 | 0.96 | 0.70 | 0.25 |
| i) L3 | 1.00 | 1.00 | 0.96 | 0.82 | 0.99 | 0.96 | 0.71 | 0.25 |
| j) L4 | 0.80 | 0.70 | 0.98 | 0.99 | 0.18 | 0.30 | 0.95 | 0.95 |
| k) L4 | 0.65 | 0.81 | 0.99 | 0.99 | 0.17 | 0.31 | 0.97 | 0.96 |
| l) L (all) | 0.87 | 0.91 | 0.94 | 0.87 | 0.70 | 0.70 | 0.82 | 0.68 |

Next, we show the precision (solid curves, top half), recall (dashed curves, top half), and F1 score (bottom half) in Fig. 5 as a function of threshold for the same combinations of data sets as in the previous figure. Since the F1 score combines both the precision and recall of a network at varying thresholds, it allows us to compare the overall accuracy of the various networks more comprehensively across different data sets. In panel ‘f’ of the top half of Fig. 5, when analysing real lenses and non-lenses, we see that the precisions of S22, J23, and I24 are all comparable and relatively poor compared to C21, which has much higher precision. In terms of recall, S22 shows the best performance, with J23 following closely, and I24 and C21 come next in that order. In panel ‘l’, which now also includes simulated lenses, the performance improves quantitatively but the same qualitative trends are seen as in panel ‘f’.
In the bottom half of Fig. 5, we find that, apart from C21, all of the other networks show a similar rising or nearly constant trend in the F1 score with increasing threshold. Thus, a low threshold for C21 and high thresholds for the other networks are expected to give better network accuracy and performance. It is not a surprise, then, that the thresholds chosen for the respective networks by C21, S22, J23, and I24 are about 0.1, 0.97, 0.99, and 0.9, respectively. In spite of being trained on the same simulated lenses, the distinction between the C21 and S22 networks becomes more apparent in panels ‘a’ to ‘f’, where they are tested on real lenses combined with different kinds of non-lenses. The F1 score curve of S22 shows moderately to marginally better performance in most of these panels, followed by J23 and then I24.
As before, the networks trained on their own simulated lenses (L3 for C21 and S22, and L4 for J23 and I24; see panels ‘g’ to ‘k’) produce superior F1 scores at all thresholds. On the all-lens, all-non-lens sample (panel ‘l’), the J23 network performs better than the others. These results highlight a general and prevalent issue of networks overfitting to specific training samples, which then results in an unknown and, usually, poorly characterized performance on real data.
We also report the F1 score for the aforementioned specific thresholds in Table 2. These thresholds, chosen by each team, are applied when conducting the actual lens search on the entire HSC survey data (except for I24 which has not been applied to the entire HSC data yet). We note similar trends as before. The F1
5.2 Comparison of networks when trained on interchanged data sets
To further understand the possible causes of the different trends seen in the previous section, we perform a number of tests among the various networks in which we keep each individual network architecture the same, but train on the training data initially used by the other teams. These tests help us to assess the role of the training sample in the performance of the networks.
The I24 network, which was initially trained on the J23 training sample, is subsequently trained on the C21 and S22 training samples with no modifications to the network architecture. The preprocessing steps, including the SDSS normalization and data augmentation described in Section 2, are kept the same and applied to the new training samples. The results of these tests are shown in Table 3 and Fig. 6, where the dashed curves represent the new I24 performance and the solid curves of the four original networks are shown for reference. Here, curves of the same colour correspond to the same training data sets.

Comparison of performance for the network from I24 when trained on different training data samples. For simplicity, only a subset of cases are shown. The dashed curves show the I24 CNN trained on the data set from C21 (maroon) and from S22 (blue). For reference, the solid lines reproduce the original curves from Fig. 4, namely for the C21 (maroon), S22 (blue), and J23 (orange) networks trained on the corresponding data sets. Note that I24 (green solid curve) was originally trained on J23 data set.
Comparison of performance for the network from I24 when trained on different training data samples. The original I24 CNN was trained on the J23 data set (see I24 in Table 2), which is shown here in the column labelled J23 for reference. Columns labelled C21 and S22 correspond to the I24 CNN trained on those data sets, respectively.
| Test data sets | AUROC (C21) | AUROC (S22) | AUROC (J23) | F1 (C21) | F1 (S22) | F1 (J23) |
|---|---|---|---|---|---|---|
| g) L3 (mock) | 1.00 | 0.99 | 0.71 | 0.74 | 0.90 | 0.25 |
| h) L3 (mock) | 0.99 | 1.00 | 0.82 | 0.74 | 0.91 | 0.25 |
| i) L1 + L2 (real) | 0.97 | 0.95 | 0.82 | 0.36 | 0.33 | 0.37 |
| j) L4 (mock) | 0.94 | 0.70 | 0.99 | 0.21 | 0.21 | 0.95 |
| k) L4 (mock) | 0.87 | 0.82 | 0.99 | 0.21 | 0.21 | 0.96 |
| l) L (all) | 0.95 | 0.92 | 0.87 | 0.52 | 0.63 | 0.68 |

Based on these metrics, the I24 network performs better when trained on the C21 (maroon dashed) and S22 (blue dashed) data sets than when trained on the J23 data set (green solid), even though the network architecture is unchanged. In fact, comparing the dashed and solid maroon curves in Fig. 6 indicates that, for most combinations of test data sets, the I24 network performs comparably to or even better than the C21 network (see panels ‘i’, ‘l’). Furthermore, the F1 score as a function of threshold (Fig. 7) also shows that I24 improves when trained on the C21 and S22 data sets (dashed curves with respect to solid green; see panels ‘g’, ‘h’, ‘i’, ‘l’) and that it outperforms C21 (dashed with respect to solid; panels ‘i’, ‘j’, ‘k’, and ‘l’).

F1 score of the I24 network as a function of threshold for the same combinations of data sets as shown in Fig. 6. As before, the dashed curves show the F1 score of I24 when trained on the C21 data set (maroon) and the S22 data set (blue).
Encouraged by these results, we perform a similar exercise with the J23 and C21 networks by interchanging their training data sets. When the CNN of J23 is trained on the C21 data set, Fig. 8 shows a similar improvement in the performance of J23 (maroon dashed curve) compared to the original CNN trained on its own data set (orange solid curve). Interestingly, the performance of the C21 network when trained on the J23 data set becomes substantially worse, as is evident from the orange dashed curve in Fig. 8 with respect to the original ResNet of C21 trained on its own data set (maroon solid curve; panels ‘i’ and ‘l’).

Comparison of performance for the network from C21 when trained on J23 (orange dashed) and the one from J23 when trained on C21 (maroon dashed). For reference, solid curves reproduce the original ROC from Fig. 4, namely for the C21 (maroon), S22 (blue), J23 (orange), and I24 (green) networks.
These tests reinforce the fact that the training sample has a significant impact on the performance of the network. Also, the CNNs (I24 and J23) perform better on the C21 training data set, in which the parent galaxy catalogue comes from broader selection cuts (see section 2.1 of C21). It will be worth investigating in the future whether the differences in the selection of the parent catalogues are themselves the cause of these improvements, which is beyond the scope of this work.
5.3 Qualitative comparison of performance on SuGOHI lenses
We also make a qualitative comparison between the output of different networks. To illustrate this, we plot in Fig. 1 all grade A and B galaxy–galaxy lenses and lens candidates included in data set L1 and we list the networks that would recover these systems based on their respective detection thresholds. Interestingly, HSCJ091331
6 SUMMARY AND CONCLUSIONS
We present one of the first systematic studies comparing and benchmarking multiple neural networks for searching for strong gravitational lenses, tested on common data sets generated from HSC Survey imaging. Four teams devised their own training samples selected from the HSC Survey data, with some teams having partially common data sets during training. Every team refrained from using the varied, fixed test data sets, comprising known and/or simulated lenses and real non-lenses, for training and validation. Subsequently, the teams exchanged their training data sets to retrain their original networks and tested again on the same common test data sets to evaluate the (lack of) sensitivity of the networks to the nature of the training samples. We note that the analyses always include non-lenses selected from real galaxy catalogues. We use standard metrics such as the ROC curve and the F1 score, which combines precision and recall, to compare the performance of the four networks.
Our main conclusions are: (i) each network performs extremely well, and better than the rest, when trained and tested on its own data sets, which are drawn from the same population of its own simulated lenses and real non-lenses; (ii) all networks show comparable performance on the sample of real lenses (also when combined with all simulated lenses) and non-lenses; (iii) while the C21 network has a somewhat better AUROC on the combined test data sets, the J23 network is found to be more robust across the different combinations of test data sets; (iv) when training data sets are exchanged, the CNNs (I24 and J23) give better performance on most test data sets and at times outperform all of the original networks, whereas the newly trained ResNet network (e.g. C21) tends to underperform on the various test data sets, implying that the nature of the training sample plays a crucial role; and (v) prima facie, the combination of the CNN architectures and the training data set of C21 is found to give the best performance, which needs further investigation.
ACKNOWLEDGEMENTS
This research is supported in part by the Max Planck Society, and by the Excellence Cluster ORIGINS, which is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC-2094 - 390783311. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (LENSNOVA: grant agreement No. 771776). ATJ is supported by the Riset Unggulan ITB 2024.
This paper is based on data collected at the Subaru Telescope and retrieved from the HSC data archive system, which is operated by Subaru Telescope and Astronomy Data Center at National Astronomical Observatory of Japan. The Hyper Suprime-Cam (HSC) collaboration includes the astronomical communities of Japan and Taiwan, and Princeton University. The HSC instrumentation and software were developed by the National Astronomical Observatory of Japan (NAOJ), the Kavli Institute for the Physics and Mathematics of the Universe (Kavli IPMU), the University of Tokyo, the High Energy Accelerator Research Organization (KEK), the Academia Sinica Institute for Astronomy and Astrophysics in Taiwan (ASIAA), and Princeton University. Funding was contributed by the FIRST program from Japanese Cabinet Office, the Ministry of Education, Culture, Sports, Science and Technology (MEXT), the Japan Society for the Promotion of Science (JSPS), Japan Science and Technology Agency (JST), the Toray Science Foundation, NAOJ, Kavli IPMU, KEK, ASIAA, and Princeton University. This paper uses software developed for the LSST. We thank the LSST Project for making their code available as free software at http://dm.lsst.org. This work is supported by JSPS KAKENHI Grant Numbers JP20K14511 and JP24K07089.
DATA AVAILABILITY
The various lens and non-lens test data sets are available upon reasonable request to the authors.
Footnotes
I24 network is yet to be run on the entire HSC footprint and does not yet have a corresponding sample of lens candidates.