Discovering Ca ii absorption lines with a neural network

Xia, Iona; Ge, Jian; Willis, Kevin; Zhao, Yinan

doi:10.1093/mnras/stac2905

ABSTRACT

Quasar absorption line analysis is critical for studying gas and dust components and their physical and chemical properties as well as the evolution and formation of galaxies in the early universe. Calcium II (Ca ii) absorbers, which are among the dustiest absorbers and are located at lower redshifts than most other absorbers, are especially valuable when studying physical processes and conditions in recent galaxies. However, the number of known quasar Ca ii absorbers is relatively low due to the difficulty of detecting them with traditional methods. In this work, we developed an accurate and quick approach to search for Ca ii absorption lines using deep learning. In our deep learning model, a convolutional neural network, tuned using simulated data, is used for the classification task. The simulated training data are generated by inserting artificial Ca ii absorption lines into original quasar spectra from the Sloan Digital Sky Survey (SDSS), while an existing Ca ii catalogue is adopted as the test set. The resulting model achieves an accuracy of 96 per cent on the real data in the test set. Our solution runs thousands of times faster than traditional methods, taking a fraction of a second to analyse thousands of quasars, while traditional methods may take days to weeks. The trained neural network was applied to quasar spectra from SDSS DR7 and DR12 and discovered 399 new quasar Ca ii absorbers. In addition, we confirmed 409 known quasar Ca ii absorbers identified previously by other research groups through traditional methods.

methods: data analysis, techniques: spectroscopic, quasars: absorption lines

1 INTRODUCTION

Quasar absorption lines are critical to study gas and dust components in galaxies and galaxy evolution as these lines can be analysed to investigate the physical properties and kinematics of interstellar medium (ISM) in their associated galaxies at different redshifts (Sardane, Turnshek & Rao 2014). Most previous studies involving quasar absorbers concentrate on absorption lines at higher redshifts such as magnesium II (Mg ii) absorbers and damped Ly α absorbers (DLAs). The rare calcium II (Ca ii) absorbers are the main type of absorbers that allow the study of environments with absorption redshifts z_abs of <0.4 (i.e. the most recent 4.3 billion years), as the Ca ii H&K doublet at wavelengths of λλ3934, 3969 can be detected in quasar optical spectra to trace galaxies at a redshift range of 0 < z < 1.4, which covers galaxies from about 8.9 billion years ago to the present epoch. Previous studies show that most Ca ii absorbers lie in areas that are dense and dusty, full of metal and molecular hydrogen, the perfect spots for star formation (Wild & Hewett 2005; Wild, Hewett & Pettini 2006, 2007; Zych et al. 2007, 2009; Nestor et al. 2008; Sardane et al. 2014).

However, Ca ii absorption lines are uncommon because calcium is a highly refractory element, causing it to be depleted in the ISM (Savage & Sembach 1996; Wild & Hewett 2005; Wild et al. 2006; Sardane et al. 2014). Ca⁺ is also not the dominant ionization state of calcium, as the ionization potential of Ca ii is only 11.871 eV, while the ionization potential of H is 13.598 eV (Moore 1970). Because the ionization potential of Ca ii is lower than that of H, Ca⁺ is not shielded by hydrogen in the ISM. With both of these factors combined, Ca ii absorption lines are extremely rare. In fact, only about 3 per cent of quasar spectra containing the more common Mg ii absorption lines also contain Ca ii absorption lines (Sardane et al. 2014). Therefore, relatively few quasar spectra with Ca ii absorption lines are currently known, with the largest catalogue being that of Sardane et al. (2014) with 435 quasar Ca ii absorbers. This situation limits the number of statistical tests that can be accurately performed, making it challenging to confirm previous theoretical models of the physical processes and environments associated with these absorbers and to understand their properties and evolution with redshift.

Currently, a few groups have proposed theoretical models of how Ca ii absorption lines form in the ISM associated with distant galaxies. For instance, Sardane et al. (2014) found that quasar Ca ii absorbers may be composed of two distinct populations, as the distribution of the equivalent width (EW) of the λ3934 line requires a double exponential function to produce a satisfactory fit to the data. They inferred that the two populations split where the ratio of the EW of the Mg ii λ2796 line to the EW of the Ca ii λ3934 line equals 1.8, and at a λ3934 EW of 0.7 Å: absorbers with a λ3934 EW below 0.7 Å are considered weak, and those with a λ3934 EW above 0.7 Å are considered strong. The weak population can be identified with halo-like gas, while the strong population can be identified with a mix of halo and disc-like gas. Investigations of the chemical and dust depletion properties of Ca ii absorber subsamples by Sardane, Turnshek & Rao (2015) and Fang et al. (in preparation) support the picture that stronger absorbers are likely to be associated with low impact parameters and disc-like environments, whereas weaker absorbers are associated with the larger impact parameters typical of galactic haloes. Strong absorbers are also found to be six times more reddened than weak absorbers. They also noted an imbalance between the two populations, with weak absorbers being much more numerous.

Another theoretical model, postulated by Wild & Hewett (2005), speculates that Ca ii absorbers are an unusual subclass of DLA systems, as Ca ii absorbers have similar strengths of neutral hydrogen column densities of various lines of sight through the Milky Way galaxy as DLAs. Zych et al. (2007) found that 18 of the 19 Ca ii quasar absorbers they studied have <0.3 dex variation in [Cr/Zn], much like the general DLA population, providing more evidence for this model. However, the data available for verifying these models are relatively sparse and the uncertainties are large. Thus, more data are required to provide strong constraints on these models and to understand the characteristics of these absorbers.

There is also a possible link to molecular clouds and star-forming galaxies, which is suggested by the detection of H₂ absorption in a few Ca ii absorbers. Sardane et al. (2015) suggested that this might be a common phenomenon and linked this with their depletion of dust grains and reddening.

Nevertheless, despite the need to discover more quasars with Ca ii absorption lines, the traditional method of detecting them is extremely slow, usually taking many days to many weeks to detect absorption lines in large data sets such as the Sloan Digital Sky Survey (SDSS) releases (e.g. Fang et al. in preparation). It also requires considerable human intervention, which makes it inconvenient and can produce inconsistent results. The basic search procedure for Ca ii absorbers is actually rather simple. As summarized in Sardane et al. (2014), which uses methods described in Nestor, Turnshek & Rao (2005), Rimoldini (2007), and Quider et al. (2011), there is first an automated process in which the quasar spectrum is normalized and fitted using a combination of cubic splines and Gaussian functions. Providing a good fit takes significant computational time, which is not ideal as there are tens of thousands of quasar spectra that need to be processed. A sliding line-finding window is then applied to the quasar spectrum to search for a possible Ca ii doublet. If a doublet is found and it has a sufficient signal-to-noise ratio (S/N), it is flagged as a candidate set of Ca ii absorption lines. Next, the candidates are manually checked to determine whether they are real Ca ii absorption lines. These manual checking criteria differ between groups, with most earlier research work seeking stronger Ca ii absorption lines. For example, Zych et al. (2007) required an EW > 0.2 Å and a minimum significance level of 4σ for the Ca ii λ3934 line and a minimum significance level of 1σ for the λ3969 line. Sardane et al. (2014), meanwhile, had a harsher criterion with a 5σ minimum level of significance for the λ3934 line and a 2.5σ minimum level of significance for the λ3969 line, and also sought a 2:1 line ratio between the λ3934 line and the λ3969 line. This step takes the largest amount of time and human intervention, as there are often thousands of absorption lines that need to be double checked one at a time. Given all the time and human intervention required to search for Ca ii absorption lines, a faster and more accurate method would be extremely helpful.
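The significance cuts quoted above can be written down compactly. The following Python sketch is illustrative only: it assumes that the significance of a line is its EW divided by its 1σ EW error, it does not enforce the 2:1 line ratio (the text does not say how that was applied), and the candidate values are made up.

```python
def significance(ew, ew_err):
    """Detection significance: equivalent width over its 1-sigma error (assumed definition)."""
    return ew / ew_err


def passes_criteria(ew_3934, err_3934, ew_3969, err_3969,
                    min_sig_3934, min_sig_3969, min_ew_3934=0.0):
    """Apply significance (and optional minimum EW) thresholds to a candidate doublet."""
    return (
        ew_3934 > min_ew_3934
        and significance(ew_3934, err_3934) >= min_sig_3934
        and significance(ew_3969, err_3969) >= min_sig_3969
    )


# Thresholds quoted in the text for the two groups.
ZYCH_2007 = dict(min_sig_3934=4.0, min_sig_3969=1.0, min_ew_3934=0.2)
SARDANE_2014 = dict(min_sig_3934=5.0, min_sig_3969=2.5)

# Hypothetical candidate (EWs and errors in Angstrom), invented for illustration.
candidate = dict(ew_3934=0.45, err_3934=0.10, ew_3969=0.25, err_3969=0.12)

zych_ok = passes_criteria(**candidate, **ZYCH_2007)        # ~4.5 sigma and ~2.1 sigma
sardane_ok = passes_criteria(**candidate, **SARDANE_2014)  # ~4.5 sigma is below 5 sigma
```

With these invented numbers the candidate satisfies the looser cut but not the stricter one, which shows how the same line can enter one catalogue and be absent from another.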

Because of this, we decided to develop a deep neural network approach in order to create a novel, fast, and accurate method. A neural network is a set of algorithms that are designed to recognize patterns in a series of data in a process that mimics the human brain and is one kind of artificial intelligence. It is able to solve problems that are otherwise impossible or difficult to deal with as it can act fast like a computer and accurately like the human brain. Neural networks consist of neurons that are connected through the input layer, hidden layers, and the output layer. The two most popular neural networks that are used for data classification are the convolutional neural network (CNN) and the recurrent neural network (RNN). The CNN is adopted as it contains classification capabilities that are necessary to identify Ca ii absorption lines as opposed to RNNs that are more adapted for time series data analysis (Zhao et al. 2019).

In the astronomy field, neural networks have been used in a variety of ways from 1D spectra analysis (e.g. Hála 2014; Graff et al. 2016; Hampton et al. 2017; Parks et al. 2018) to 2D image pattern recognition in star galaxy classification (e.g. Bertin & Arnouts 1996; Kim & Brunner 2017). Zhao et al. (2019) developed a deep neural network for identifying Mg ii λλ2796, 2803 Å doublets in the SDSS DR12 data set and were able to achieve about a 94 per cent accuracy while analysing 50 000 quasar spectra in around 9 s for identifying Mg ii absorber candidates.

In this paper, we aim to modify the deep neural network developed by Zhao et al. (2019) for identifying Ca ii absorbers in the SDSS DR7 (Abazajian et al. 2005) and DR12 (Alam et al. 2015) data sets. As Ca ii absorbers are weaker than Mg ii absorbers, it is a more difficult and challenging task to identify them. In Section 2, the data sets used for the training and testing purposes in this paper are discussed, including the artificial spectra generated, as well as data pre-processing for maximum accuracy and efficiency. In Section 3, the steps to build our neural network are introduced and the developed model is described. Experimental results are summarized in Section 4, followed by a discussion, concluding remarks, and plans for future improvements in Section 5.

2 DATA SETS AND DATA PRE-PROCESSING

Because few quasar spectra with Ca ii absorption lines have been discovered, and a large amount of data is necessary for an effective and accurate training set for the neural network model, an artificial training data set built from existing quasar data is needed. Our artificial spectral data are generated based on SDSS DR12 (Alam et al. 2015). An existing catalogue of quasar Ca ii absorbers from Sardane et al. (2014), however, is used as our test set, since real data may contain parameter values and outliers (such as imperfections in the spectra) that are not covered by the artificial spectra. Using a real data set as the test set therefore makes our model as practical as possible: it not only tests the model but also evaluates the training data set and helps identify potential loose ends in our neural network design. It is also important to note that all negative samples in our test set and all data sets used to search for Ca ii absorption lines include previously detected Mg ii absorption lines, as our approach requires the use of z_abs to select the spectral region covering both Ca ii absorption lines. All quasar Ca ii absorption lines have been found to also have Mg ii absorption lines with a z_abs > 0.4 and a good S/N at the Mg ii absorption line location.

2.1 Real Ca ii absorption lines and test data set

431 SDSS quasar spectra in the quasar Ca ii absorber’s catalogue of Sardane et al. (2014) are used as the positive samples, making half of our test data set. Only four samples in this catalogue were missed because we were unable to locate the mjd, fibre, and plate for them. This catalogue has quasar Ca ii absorbers with 0.0277 < z_abs < 1.3428, with a mean z_abs of 0.5747. Their catalogue was chosen with a 5σ minimum level of significance for the λ3934 line, and a 2.5σ minimum level of significance for the λ3969 line. They also mentioned that their data set should be very complete with no more than 10 absorption lines missed (Sardane et al. 2014). These Ca ii absorbers were found from quasar spectra from SDSS DR9 (Ahn et al. 2012) and DR7 (Abazajian et al. 2005).

In addition, we randomly chose 431 other quasar spectra that do not contain Ca ii absorption lines as the negative half of our test data set. A 1:1 ratio between the sizes of the positive and negative data sets was adopted, as studies have shown that having equal numbers of positive and negative samples is the optimal way to train and test neural networks (Masko & Hensman 2015). Some examples of positive samples in the test set are shown in Fig. 1.

Figure 1.

Examples of positive quasar spectra samples with Ca ii absorption lines (from Sardane et al. 2014) in the test set. The λλ3934 and 3969 lines are indicated by the black lines.


2.2 Artificial Ca ii absorbers and training data set

Since it is essential to produce an abundance of training data to train an accurate neural network, artificial quasar spectra were generated. In order for the artificial data to be representative of real quasar spectra with Ca ii absorption lines, artificial Ca ii absorption lines need to follow parameter distributions of real Ca ii absorption lines. Therefore, measurements of key parameters describing Ca ii absorption lines, such as the EW, the full width at half-maximum (FWHM), and z_abs in real Ca ii absorber data, were conducted. The measurements are used for generating artificial Ca ii absorption lines.

The EW values from Sardane’s catalogue were remeasured by our program to be consistent with our measurements for new Ca ii absorbers, even though Sardane et al. (2014) listed the EWs of their discovered absorbers in their paper (Fig. 2). When comparing our measurements with theirs, we found that our EW measurements were lower overall by about 6 per cent. This could be due to differences in continuum fitting. Fig. 5 shows an example of our continuum fitting, which rejects spectral data points that deviate by more than 2σ before fitting the continuum. Sometimes, multiple fitting iterations are necessary before converging to a final continuum solution. Since a typical measurement error for most weak absorption lines is of the order of 20–25 per cent, this small 6 per cent systematic offset is at least three times smaller than these measurement errors. Therefore, our EW measurements are consistent with the measurements by Sardane et al. (2014).
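The iterative rejection of 2σ outliers before continuum fitting can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not taken from this work: the continuum is modelled as a straight line rather than a polynomial or spline, the scatter is the population standard deviation of the residuals, and the function names and the iteration cap are hypothetical.

```python
import math
import statistics


def fit_line(x, y):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def clipped_continuum(wave, flux, n_sigma=2.0, max_iter=10):
    """Fit a line, drop points deviating by more than n_sigma, and refit until stable."""
    keep = list(range(len(wave)))
    for _ in range(max_iter):
        slope, intercept = fit_line([wave[i] for i in keep], [flux[i] for i in keep])
        resid = [flux[i] - (slope * wave[i] + intercept) for i in keep]
        sigma = statistics.pstdev(resid)
        if sigma == 0.0:
            break
        new_keep = [i for i, r in zip(keep, resid) if abs(r) <= n_sigma * sigma]
        if len(new_keep) == len(keep) or len(new_keep) < 3:
            break  # converged, or too few points left to refit
        keep = new_keep
    return slope, intercept, keep


# Toy spectrum: flat continuum with a small ripple and a three-pixel absorption dip.
wave = [3900.0 + i for i in range(100)]
flux = [1.0 + 0.01 * math.sin(i) for i in range(100)]
for i in (34, 35, 36):
    flux[i] = 0.5

slope, intercept, kept = clipped_continuum(wave, flux)
continuum_at_line = slope * 3935.0 + intercept
```

In this toy case the first pass rejects the three dip pixels and the second pass leaves the remaining points unchanged, so the fitted continuum is no longer dragged down by the absorption line.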

Figure 2.

A comparison between our measured EWs and Sardane et al. (2014)’s measurements for its catalogue. While there were a couple of outliers, most of the measurements were extremely similar (being very close to the black line x = y). We find that some of the outliers may be false detections as described in Section 4.1. Top panel: A comparison with our measurements on the y-axis and Sardane et al. (2014)’s measurements on the x-axis. Bottom panel: A comparison between the fractional difference of our measurements and Sardane et al. (2014)’s measurements with the original EW measurements of Sardane et al. (2014)’s. There was a greater fractional difference for absorption lines with a lower EW due to uncertainties in continuum fitting for weak absorption lines. We also found that our measurements were slightly lower overall, which may be due to different approaches in spectral continuum fitting adopted by us and Sardane et al. (2014).


In addition, the FWHM values of their absorbers were measured. Each spectrum was first shifted to the rest frame and cropped to 3900–4000 Å, the region containing the absorption lines. The remaining segment was then normalized by fitting its continuum with a polynomial function. Our absorption fitting program is based on a Python Voigt fitting package called VoigtFit (Krogager 2018), which can measure both the FWHM and the EW of absorption lines. The FWHMs follow an approximately Gaussian distribution with μ = 2.1 Å and σ = 0.9 Å. The EWs of the λ3934 line range between 0.16 and 2.57 Å with an average of 0.76 Å, while those of the λ3969 line range between 0.11 and 1.60 Å with an average of 0.48 Å. Their z_abs values are between 0.03 and 1.34.
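For orientation, the EW of a line in a continuum-normalized spectrum is the integral of (1 − F) over wavelength. The sketch below evaluates this with the trapezoidal rule in plain Python. It is a simplified stand-in for illustration and not the Voigt-profile fitting program described above; the function name is hypothetical.

```python
def equivalent_width(wave, flux_norm):
    """Trapezoidal integral of (1 - normalized flux) over wavelength, in Angstrom."""
    ew = 0.0
    for i in range(len(wave) - 1):
        depth_left = 1.0 - flux_norm[i]
        depth_right = 1.0 - flux_norm[i + 1]
        ew += 0.5 * (depth_left + depth_right) * (wave[i + 1] - wave[i])
    return ew


# Toy example: 1 Angstrom pixels, continuum at 1.0, a single pixel at half depth.
wave = [3930.0 + i for i in range(9)]
flux_norm = [1.0] * 9
flux_norm[4] = 0.5

ew_toy = equivalent_width(wave, flux_norm)
```

Direct integration of this kind is sensitive to the continuum placement, which is one reason small systematic EW offsets between groups are plausible.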

Our artificial data set follows a similar distribution, with EWs from 0.05 to 4.00 Å to account for possible stronger absorption lines, but with a greater number of low-EW lines to match the distribution of the real data set. The distributions are compared in Fig. 3. As can be seen, the distributions of the two data sets are similar and have the same trend for both the λλ3934 and 3969 lines, indicating that a randomly selected sample from the artificial data would roughly match the distribution of Sardane et al. (2014)’s catalogue. However, the simulated set contains a greater number of low EWs for the λ3934 line, so that the trained neural network can be more sensitive to samples with low EWs. As for z_abs, because we are searching quasar spectra with detected Mg ii absorption lines, the redshift distribution was changed to 0.36 < z_abs < 1.4. In other words, we use the redshifts of Mg ii absorbers to search for potentially corresponding Ca ii absorbers at the same redshifts, which makes the search much quicker and more accurate than a blind search, but loses the ability to search for Ca ii absorbers at redshifts lower than 0.36. We also explored the Sloan quasar spectra and noted that quasar spectra are heavily contaminated by sky noise at observed wavelengths greater than 8500 Å due to strong sky emission lines. However, our method, with the normalization of data noise in pre-processing described in Section 2.3, is able to reliably identify Ca ii absorbers in these noisy spectra.
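To put these redshift limits in terms of observed wavelength, the standard relation between rest and observed wavelength can be evaluated for the Ca ii doublet. The worked numbers below are simple arithmetic using the rest wavelengths 3934.78 and 3969.59 Å adopted for the inserted lines, rounded to the nearest Å.

```latex
\begin{align*}
\lambda_{\rm obs} &= \lambda_{\rm rest}\,(1 + z_{\rm abs}), \\
\lambda_{\rm obs}(\lambda 3934,\ z_{\rm abs} = 0.36) &= 3934.78 \times 1.36 \approx 5351\ \text{\AA}, \\
\lambda_{\rm obs}(\lambda 3969,\ z_{\rm abs} = 1.4) &= 3969.59 \times 2.4 \approx 9527\ \text{\AA}, \\
z_{\rm abs}(\lambda 3934\ \text{at}\ 8500\ \text{\AA}) &= \frac{8500}{3934.78} - 1 \approx 1.16 .
\end{align*}
```

Under these numbers, absorbers with z_abs above roughly 1.16 have their λ3934 line in the region affected by strong sky emission lines.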

Figure 3.

The distribution of the EWs for the Sardane et al. (2014)’s catalogue that we used as a guiding set compared to the distribution of our simulated data set, with the red bars being the simulated data and the blue bars being the real data. Left-hand panel: the EW distributions of the λ3934 line. The distributions of both data sets are similar and have the same trend. However, our simulated data have a greater amount of lower EWs for the λ3934 line, so it can be more sensitive to samples with low EWs. Right-hand panel: the EW distributions of the λ3969 line. The distributions of the two data sets are very similar.


The base data for our artificial data set are drawn from DR12 of SDSS (Alam et al. 2015). We randomly chose a subset of around 12 000 spectra with the correct z_qso range from the DR12 SDSS catalogue, making sure that none are part of our test set. Because an even larger training set is desired, six variations are created for every spectrum, each with a different redshift and EW of the inserted lines. When inserting an absorption line, the method used in Zhao et al. (2019) was adopted, which involves first performing a continuum fit for each spectrum, injecting artificial Ca ii absorption lines into each continuum, and then adding noise. The continuum fitting of the quasar spectra was conducted using the principal component analysis method, which is a commonly used unsupervised machine learning algorithm for data de-noising and smoothing. The entire quasar sample was divided into a number of subsets by quasar emission redshift with a bin size of 0.2; we then performed the fitting on each subset and derived the eigenspectra.

The Ca ii absorption lines are inserted at their rest wavelengths of 3934.78 and 3969.59 Å, shifted to the observer frame of a quasar spectrum at the chosen redshift, based upon the FWHM and EW distributions of real Ca ii absorbers. The redshift is randomly selected in the range stated above. The quasar spectrum’s S/N around the Ca ii absorption line area must meet the minimum requirement of 3.0, measured with the spectrum flux and error arrays. In addition, both of the inserted λλ3934 and 3969 lines must have an S/N greater than 2.0. The EW error is computed as the product of the FWHM and the local noise. In total, 72 000 artificial spectra were generated with lines inserted, together with another 72 000 spectra without line insertion, which were created in the same way as the 72 000 positive samples by randomly selecting six different redshifts for each of the 12 000 real spectra. Examples of these positive and negative samples are shown in Fig. 4. This data set is randomly divided between the training and validation sets with a ratio of 4:1 for neural network training.
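The line-insertion step can be sketched as follows. This is a hedged illustration and not the pipeline of this work or of Zhao et al. (2019): it assumes Gaussian line profiles, treats the EW and FWHM as rest-frame quantities, clamps the flux at zero for very strong lines, and adds Gaussian noise with Python's standard `random` module. The function names are hypothetical.

```python
import math
import random

CA_II_LINES = (3934.78, 3969.59)  # rest wavelengths quoted in the text (Angstrom)


def gaussian_depth(wave_rest, centre, ew, fwhm):
    """Fractional depth of a Gaussian line whose integral over wavelength equals ew."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    amplitude = ew / (sigma * math.sqrt(2.0 * math.pi))
    return amplitude * math.exp(-0.5 * ((wave_rest - centre) / sigma) ** 2)


def insert_doublet(wave_obs, continuum, z_abs, ews, fwhm):
    """Multiply a continuum by a Ca II doublet placed at absorber redshift z_abs."""
    flux = []
    for w_obs, cont in zip(wave_obs, continuum):
        w_rest = w_obs / (1.0 + z_abs)
        depth = sum(gaussian_depth(w_rest, lam0, ew, fwhm)
                    for lam0, ew in zip(CA_II_LINES, ews))
        flux.append(cont * max(0.0, 1.0 - depth))  # clamp: flux cannot go negative
    return flux


def add_noise(flux, noise_sigma, rng):
    """Add independent Gaussian noise to each pixel (assumed noise model)."""
    return [f + rng.gauss(0.0, noise_sigma) for f in flux]


# Toy example: flat continuum, absorber at z_abs = 0.5, mean EWs and FWHM from the text.
step = 0.1
wave_obs = [5850.0 + step * i for i in range(2000)]
continuum = [1.0] * len(wave_obs)
clean = insert_doublet(wave_obs, continuum, z_abs=0.5, ews=(0.76, 0.48), fwhm=2.1)
noisy = add_noise(clean, noise_sigma=0.05, rng=random.Random(1))

deepest = wave_obs[clean.index(min(clean))]
rest_ew_total = sum(1.0 - f for f in clean) * step / 1.5
```

In the toy example the deepest pixel falls at the redshifted λ3934 position and the integrated rest-frame absorption recovers the sum of the two input EWs, which is a quick consistency check on the profile normalization.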

Figure 4.

Examples of the artificial spectra generated. For each example, the whole spectrum is shown in the bigger box with the red line being the flux and the black line being the continuum fit of the spectrum. A close-up of the normalized areas around the Ca ii absorption lines is shown in the smaller box on the top right-hand side of the plot. The blue lines mark where Ca ii absorption lines are inserted. Top panel: a positive artificial spectrum. Bottom panel: a negative artificial spectrum made using the same original spectrum as the one shown above.


2.3 Data pre-processing

An optimized data pre-processing method was developed to improve the neural network’s accuracy and efficiency. In this method, the quasar spectra are shifted to the absorber rest frame for searching for Ca ii absorption lines. By doing this, the absorption lines are always at the same location on every spectrum and thus it is easier for the neural network model to be trained and used to find them. This allows cropping the quasar spectra to a relatively small fixed wavelength range between 3900 and 4000 Å for data processing and absorption line search, which not only decreases the processing time and training time, but also allows the model to be focused on the region of importance, therefore making it easier for the model to locate the absorption lines and increasing the accuracy. On the other hand, this wavelength range is still large enough so that the model is able to evaluate and consider local features in spectra to assist its determination. Therefore, in order to make this method work, the absorber’s redshifts of all spectra that are analysed by the model are required to be known so that the spectra can be shifted to the rest frame.
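The rest-frame shift and crop described above amounts to dividing the observed wavelengths by (1 + z_abs) and keeping the 3900–4000 Å window. A minimal Python sketch, with hypothetical function and variable names, is given below; any resampling to a fixed number of pixels is left out.

```python
def to_rest_frame_window(wave_obs, flux, z_abs, lo=3900.0, hi=4000.0):
    """Shift a spectrum to the absorber rest frame and keep pixels within [lo, hi]."""
    wave_rest = [w / (1.0 + z_abs) for w in wave_obs]
    selected = [(w, f) for w, f in zip(wave_rest, flux) if lo <= w <= hi]
    return [w for w, _ in selected], [f for _, f in selected]


# Toy example: 1 Angstrom observed pixels and an absorber at z_abs = 1.0.
wave_obs = [7000.0 + i for i in range(2001)]
flux = [1.0] * len(wave_obs)
wave_win, flux_win = to_rest_frame_window(wave_obs, flux, z_abs=1.0)
```

Because the window is defined in the rest frame, the doublet always falls at the same position in the cropped segment regardless of the absorber redshift, which is the property the classifier relies on.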

In the processing, each spectrum is first normalized through continuum fitting. If absorption lines are close to or on quasar emission lines at z_qso, a flexible smoothing spline function is used to fit and normalize the continuum. Otherwise, the spectrum is normalized with a polynomial fit. In the second step, a constant of 1 is subtracted from the normalized spectrum, and the result is divided by the standard deviation of the spectrum between 3900 and 4000 Å, excluding 3930–3938 Å and 3965–3973 Å, to normalize the noise in this spectral region. In the third step, the noise-normalized spectrum is divided by a constant value, A, and a constant offset, B, is then added to complete the processing. Dividing by A scales all the flux values into a small enough range, with the maximum difference between any two samples not greater than 1.0. Adding the constant offset B then gives the desired relative intensity range from 0.0 to 1.0. Many different combinations of A and B were tried, and values of A = 30 and B = 0.5 appear to make the model more sensitive for detecting Ca ii absorption lines than other combinations. This process, illustrated in Fig. 5, is especially helpful when searching for Ca ii absorption lines in spectral regions with strong sky noise due to sky emission lines, which helps extend the Ca ii absorber search to higher redshifts. This pre-processing has been applied to all the 862 spectra in the test set and the 144 000 spectra in the training set. In the last step, the positive samples are labelled as 1 and the negative samples as 0 in both the training and test sets.
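The second and third steps can be summarized in a few lines of Python. The sketch below takes an already continuum-normalized, rest-frame spectrum as input and uses the quoted values A = 30 and B = 0.5. The use of the population standard deviation and the function name are assumptions made for illustration.

```python
import statistics

# Rest-frame windows around the two Ca II lines, excluded from the noise estimate.
LINE_WINDOWS = ((3930.0, 3938.0), (3965.0, 3973.0))


def noise_normalize(wave_rest, flux_norm, a=30.0, b=0.5):
    """Subtract 1, divide by the line-free standard deviation, then scale by 1/a and add b."""
    def in_line_window(w):
        return any(lo <= w <= hi for lo, hi in LINE_WINDOWS)

    shifted = [f - 1.0 for f in flux_norm]
    line_free = [s for w, s in zip(wave_rest, shifted) if not in_line_window(w)]
    noise = statistics.pstdev(line_free)  # assumed estimator of the local noise
    return [s / noise / a + b for s in shifted]


# Toy example: +/-2 per cent alternating "noise" and one deep pixel at 3934 Angstrom.
wave_rest = [3900.0 + i for i in range(100)]
flux_norm = [1.02 if i % 2 == 0 else 0.98 for i in range(100)]
flux_norm[34] = 0.6

processed = noise_normalize(wave_rest, flux_norm)
```

With A = 30 and B = 0.5, a pixel that deviates by x standard deviations from the continuum maps to x/30 + 0.5, so deviations between −15σ and +15σ land in the interval from 0.0 to 1.0 and the continuum itself sits at 0.5.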

Figure 5.

A demonstration of the procedure used for pre-processing. Top left-hand panel: the original SDSS quasar spectrum with Ca ii absorption lines with the wavelength in log scale. The part of the spectrum (3900–4000 Å) that is actually pre-processed is shown in black colour. Top right-hand panel: the spectrum after the original spectrum is shifted to the rest frame and cropped, with its continuum fit shown as a red line. Middle left-hand panel: the cropped spectrum is normalized. Middle right-hand panel: the normalized spectrum is subtracted by 1, then divided by the standard deviation of the flux in the wavelength window except for 3930–3938 and 3965–3973 Å to have a noise normalized spectrum. Bottom left-hand panel: the noise normalized spectrum is divided by 30. Bottom right-hand panel: 0.5 is added to the processed spectrum to have the desired flux range from 0.0 to 1.0, which has an optimal sensitivity for feeding our trained neural network.


3 NEURAL NETWORK MODEL STRUCTURE

We chose to use a CNN model, which is designed to search within a fixed wavelength window of each input spectrum. This significantly simplifies the task and the model, as shown in Fig. 6, allowing for fewer model layers and significantly reducing processing time. Our model includes convolutional layers, fully connected layers, dropout layers, activation layers, and normalization layers.

Figure 6.

A diagram of the developed neural network model featuring three convolutional layers, two fully connected layers, and a dropout layer with an input spectrum of 100 pixels.


Convolutional layers are the main components of a CNN, with the layers applying a number of filters to the input. These filters will then create feature maps that represent certain features of the input data. In our model, the convolutional layers served to identify patterns of the spectral lines. Key characteristics of these layers include the kernel shape and quantity, which were determined through hyperparameter tuning, and further explained in Section 3.1. Likewise, convolutional layer quantity was also selected in a similar fashion.

Following each convolutional layer are activation and batch normalization layers. Activation layers are used to introduce non-linearity to the model so that it is able to learn and represent the complexity in the data patterns. The batch normalization layers are applied to ensure that inputs to the next layer are scaled to the standard normal distribution, giving the model training stability and consistency. A max pooling layer was also considered following each convolutional layer, to reduce the amount of input data to the next layers. A dropout layer is often included towards the end of the model to randomly set a certain amount of input units to zero in order to prevent the model from overfitting. Overfitting would cause the model to have a hard time predicting anything other than the training set because it is too attuned to the specific training set’s detailed features, including noises. Fully connected layers are finally used at the end after extracting the features with the convolutional layers to classify features and get the final result.

3.1 Hyperparameter tuning

In order to produce a model with optimal accuracy, many different configurations of hyperparameters of the neural network were tested. These include basic hyperparameters such as which types of layers were used, the number of layers, and the order of the layers in the model. This constitutes the basic structure of our model. Other optimized attributes include features of each layer such as the number of kernels, filter sizes, and strides in the convolutional layers, and the dropout rate and output activation function. The results of these hyperparameter tuning tests are explained further in Section 3.2.

The first important hyperparameter test determined the number of convolutional layers to be used, which was tested together with the max pooling size of each pooling layer (a max pooling size of 1 has no effect on the model). It was important to test how many convolutional layers were necessary, as too few layers would not be able to pick up on the complexities of the spectral data features. On the other hand, too many convolutional layers would add unnecessary computation to the model and even reduce its sensitivity. The max pooling size was crucial because, while a pooling layer helps reduce the amount of computation, too much pooling would remove key information from the feature maps, which would reduce the final sensitivity of the CNN in detecting key features.

Kernel shapes are the sizes of the filters that the convolutional layer uses. Kernel shapes for each convolutional layer were chosen by estimating the size of the features (spectral lines) we would be looking for (Zhao et al. 2019). Kernel count, or the number of filters applied, is the other key characteristic, which was determined in hyperparameter tuning tests. The number of kernels affects the ways that the layer can represent the features in the spectrum. If there are too many kernels, there would be a high possibility of overfitting. If there are too few kernels, the feature maps produced would not be enough to represent various aspects of the patterns of the spectral lines. Because the convolutional layer requires the movement of the filter along the data points, the stride is a parameter used to define the amount of movement each time, or the number of sections the layer will look at throughout the spectrum. Therefore, if the stride is too large, it is easy for the convolutional layer to skip some of the details. However, if the stride is too small, the computation time would be quite large.
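The roles of kernel size and stride can be made concrete with a small pure-Python 1D convolution. This is an illustrative sketch that assumes no padding (the padding scheme is not specified in the text) and, as in most deep learning frameworks, slides the kernel without flipping it.

```python
def conv_output_length(n_input, kernel_size, stride=1):
    """Number of output samples of an unpadded 1D convolution."""
    return (n_input - kernel_size) // stride + 1


def conv1d(signal, kernel, stride=1):
    """Slide the kernel along the signal in steps of `stride` (no padding, no flip)."""
    k = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


# A kernel shaped like a narrow dip responds most strongly where the signal dips.
response = conv1d([1, 1, 0, 1, 1], [1, -2, 1])

# For a 100-pixel input and a filter size of 5, stride 1 keeps 96 positions
# while stride 2 keeps 48, halving the computation but sampling the spectrum more coarsely.
n_stride1 = conv_output_length(100, 5, stride=1)
n_stride2 = conv_output_length(100, 5, stride=2)
```

The output-length formula shows the trade-off described above: a larger stride reduces the number of positions examined, and hence the computation, at the cost of possibly stepping over narrow features.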

The dropout rate is the fraction of a layer’s inputs that are randomly set to zero during training. If the dropout rate is too small, the layer fails in its purpose, which is to prevent overfitting. However, if the dropout rate is too large, many of the important features in the data could be removed. The output activation is the function applied in the final classification step to produce the model score. Different output activations yield different classification results, so it is important to determine which one achieves the best accuracy.
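The effect of the dropout rate can be sketched in a few lines. The example below is a plain-Python stand-in for a dropout layer, assuming the common "inverted dropout" convention in which surviving values are rescaled by 1/(1 − rate) during training; the function name `dropout` and the toy activations are hypothetical.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate` and
    rescale the survivors by 1 / (1 - rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000    # toy layer output (hypothetical)
dropped = dropout(activations, rate=0.3, rng=rng)

# Fraction of values that were zeroed; close to the requested rate.
zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2))
```

At a rate of 0.3 roughly 70 per cent of the values survive each training step, whereas at 0.9 only about one value in ten would be passed on.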

Extensive tuning of the aforementioned hyperparameters was conducted with grid search and cross-validation (Dangeti 2017) using the training and validation sets. Some of the hyperparameters tested and the resulting selections are shown in Table 1.

Table 1.

Examples of model hyperparameters tested and selected, where the selected choice is marked by an ‘X’.

Max pooling		Conv. layers		Output activation
1	X	1		ReLU
2		2		sigmoid	X
3		3	X	tanh
Dropout rate		Number of first layer kernels		First layer filter size
0.1		4		1
0.3	X	8	X	3
0.5		16		5	X
0.7		32		7
0.9		64		9
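To give a sense of the size of such a search, the candidate values in Table 1 can be enumerated as a grid. This is only a sketch: the text does not state whether the full Cartesian product was evaluated, and the dictionary keys and the helper `iter_configs` are names chosen here for illustration.

```python
from itertools import product

# Candidate values as listed in Table 1.
grid = {
    "max_pooling": [1, 2, 3],
    "conv_layers": [1, 2, 3],
    "output_activation": ["relu", "sigmoid", "tanh"],
    "dropout_rate": [0.1, 0.3, 0.5, 0.7, 0.9],
    "first_layer_kernels": [4, 8, 16, 32, 64],
    "first_layer_filter_size": [1, 3, 5, 7, 9],
}

# Choices marked with an 'X' in Table 1.
selected = {
    "max_pooling": 1,
    "conv_layers": 3,
    "output_activation": "sigmoid",
    "dropout_rate": 0.3,
    "first_layer_kernels": 8,
    "first_layer_filter_size": 5,
}


def iter_configs(grid):
    """Yield every combination of the grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(grid))
print(len(configs))          # 3 * 3 * 3 * 5 * 5 * 5 combinations
print(selected in configs)
```

In a real search each configuration would be trained and scored on the validation set, so a grid of this size is only practical because the individual models are small.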

3.2 Details about our model design and tuning

Our model is built in Python with the open-source libraries Keras and TensorFlow. The input spectra are 100 pixels long and represent the flux between the wavelengths of 3900 and 4000 Å. The model features five major layers (excluding normalization, activation, and dropout): three convolutional layers and two fully connected layers (Fig. 6). We decided not to include any max pooling layers because our input is relatively low-dimensional and highly information-dense, so reducing it further in this manner is more destructive than helpful. We included both an activation layer and a batch normalization layer, and found that placing the batch normalization layer after the activation layer improved the accuracy by about 1 per cent.

In the first convolutional layer, the 1D kernel size is 5 pixels, which is approximately the average width of the spectral lines plus a data point on each side. When the filter size was too large (9 pixels) or too small (3 pixels), the accuracy decreased by around 2 per cent. A filter count of 8 was chosen for the first layer to best represent the intricacies of the spectrum, including the locations, separation, widths, depths, and profiles of the two spectral lines, as well as the noise characteristics. The other two convolutional layers have filter counts of 16 and 32, respectively, to capture more details of the characteristics of both lines. Tests using a greater number of kernels did not show much of a difference in model performance, which is not surprising as there is limited information to represent these two spectral lines and their variations. A stride (or moving step) of 1 was chosen for each convolutional layer so that all of the important spectral features within our data are examined; because the input size is relatively small, computational efficiency was not a problem. We chose a 30 per cent dropout rate because, given our small input size, it was necessary that the majority of the input still remain at the end. We used the rectified linear unit (ReLU) activation function after each of the convolutional layers and the first fully connected layer (i.e. dense or densely connected layer), as it proved to be the most effective activation function among the models tested, owing to its simplicity. For the output activation, however, we used the sigmoid function, the most commonly used output activation for binary classification, because it converts the model output into a probability score between 0 and 1.
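The dimensions implied by this design can be traced with a short calculation. The sketch below assumes unpadded convolutions and a kernel size of 5 in all three convolutional layers; only the first-layer kernel size, the filter counts, and the stride are given in the text, so the other values and the helper `conv1d_output` are assumptions made for illustration. Batch normalization and dense-layer parameters are not counted.

```python
def conv1d_output(length, channels_in, filters, kernel_size, stride=1):
    """Return (output_length, output_channels, trainable_parameters)
    for an unpadded 1D convolution with one bias term per filter."""
    out_length = (length - kernel_size) // stride + 1
    params = (kernel_size * channels_in + 1) * filters
    return out_length, filters, params


# Filter counts and stride follow the text; the kernel size of 5 is stated
# only for the first layer and is ASSUMED here for the other two.
layers = [
    {"filters": 8, "kernel_size": 5, "stride": 1},
    {"filters": 16, "kernel_size": 5, "stride": 1},
    {"filters": 32, "kernel_size": 5, "stride": 1},
]

length, channels = 100, 1   # 100-pixel input spectrum, one flux channel
summary = []
for layer in layers:
    length, channels, params = conv1d_output(length, channels, **layer)
    summary.append((length, channels, params))

flattened = length * channels   # size of the vector fed to the dense layers
print(summary, flattened)
```

Under these assumptions the three convolutional layers together hold only a few thousand weights, which is consistent with the remark that computational efficiency was not a concern for an input of this size.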

For training, the adaptive moment estimation (Adam; Kingma & Ba 2015) optimizer was chosen with a mini-batch size of 64. This mini-batch size, which is the number of training samples per iteration, is usually chosen through hyperparameter tuning after the model is created. While a mini-batch size of 32 is usually the default, we found that we had a higher accuracy by about 1 per cent with a mini-batch size of 64. The Adam optimizer computes adaptive learning rates for each parameter, which effectively mitigates the fluctuation of the cost function during training, and keeps the training more stable. Test results and the receiver operating characteristic (ROC) curve of the model are shown in Fig. 7. Since the true scores are concentrated near 1.0 and false scores are comparatively spread out, the true–false separation is set at 0.9, as indicated by the red marker in the ROC curve.
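The final classification step described here, a sigmoid score compared with the 0.9 threshold, can be sketched as follows. The function names and the example raw outputs are hypothetical; only the sigmoid activation and the threshold value come from the text.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def is_detection(score, threshold=0.9):
    """Label a spectrum as a Ca II detection when its score reaches the threshold."""
    return score >= threshold


# In exact arithmetic a raw output of ln(9) maps to a score of 0.9,
# so only raw outputs above about 2.2 pass the cut.
boundary_logit = math.log(9.0)

for raw in (-2.0, 1.0, 3.0):   # hypothetical raw network outputs
    score = sigmoid(raw)
    print(raw, round(score, 3), is_detection(score))
```

Raising the threshold from the conventional 0.5 to 0.9 trades some recall for precision, which matches the pattern reported in Table 2.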

Figure 7.

The output scores from the model given to negative samples and positive samples from the test set and the ROC curve of the model. Top left-hand panel: model scores for negative samples in the test set. Although most of the scores are below 0.5, quite a few of them are higher than 0.5. Top right-hand panel: the ROC curve of the model, showing its sensitivity and specificity at different threshold values (from 0 to 1). Bottom left-hand panel: model scores for positive samples in the test set. The vast majority of them are between 0.8 and 1.0. The number of samples is shown in log scale to better represent the distribution. Bottom right-hand panel: a closer look at the model scores within [0.8, 1.0] for positive samples; most of them lie above 0.90, which became the selected threshold for the model. The number of samples is shown in log scale.

4 RESULTS

Four metrics are used to evaluate the classification results of our model: accuracy, precision, recall rate, and F₁ score, which are defined below.

Accuracy: the percentage of correct predictions out of the total predictions, including both positive and negative samples, i.e. Accuracy = (true positives + true negatives)/(true positives + true negatives + false positives + false negatives).

Precision: the fraction of samples correctly identified as positive out of all samples predicted as positives, i.e. Precision = true positives/(true positives + false positives).

Recall rate: the ratio between the number of correct predictions of a positive sample and the total number of positive samples, which indicates the sensitivity of the classifier, i.e. Recall rate = true positives/(true positives + false negatives).

F₁ score: the harmonic mean of the precision and the recall rate, i.e. F₁ score = 2*(Precision*Recall)/(Precision + Recall).

The F₁ score favours classifiers that have a similar precision and recall rate and thus is a better measure to use when seeking a balance between precision and recall.
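The four definitions above translate directly into code. The sketch below evaluates precision, recall, and F₁ using the counts reported in Section 4.1 for the 0.90 threshold (409 true positives, 13 false positives, 22 false negatives). The number of true negatives is not quoted in this part of the text, so accuracy is provided as a function only.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of true positive samples that are recovered."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Counts reported in Section 4.1 for the 0.90 threshold.
tp, fp, fn = 409, 13, 22

p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision = {p:.3f}")               # 0.969
print(f"recall    = {r:.3f}")               # 0.949
print(f"F1        = {f1_score(p, r):.3f}")  # 0.959
```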

Our experimental results on the test set are summarized in Table 2 and the confusion matrix is shown in Fig. 8. Overall, all the evaluation metrics of our model are at least 94.9 per cent, with the precision a little higher than the recall. We were also able to discover a large number of new quasar Ca ii absorbers with the model. More details on the results are described in the following subsections.

Figure 8.

The confusion matrix of applying the model on the test set.

Table 2.

The classification metrics of our model on the test set with different prediction thresholds. A threshold of 0.90 was chosen, which gave the highest overall accuracy.

Threshold	0.90	0.80	0.65	0.30
Accuracy	0.959	0.957	0.949	0.920
Precision	0.969	0.958	0.935	0.880
Recall	0.949	0.956	0.965	0.972
F₁ score	0.959	0.957	0.950	0.924
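The threshold choice can be reproduced from the tabulated values alone. The short sketch below, with the table transcribed into a dictionary (the variable and function names are chosen here for illustration), checks that each quoted F₁ score agrees with the quoted precision and recall to within rounding and then selects the threshold with the highest accuracy.

```python
# Rows of Table 2: threshold -> (accuracy, precision, recall, F1).
table2 = {
    0.90: (0.959, 0.969, 0.949, 0.959),
    0.80: (0.957, 0.958, 0.956, 0.957),
    0.65: (0.949, 0.935, 0.965, 0.950),
    0.30: (0.920, 0.880, 0.972, 0.924),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# The tabulated F1 should agree with the tabulated precision and recall
# to within the rounding of the three quoted decimals.
for threshold, (_, p, r, f1) in table2.items():
    assert abs(f1_from(p, r) - f1) < 1e-3, threshold

best = max(table2, key=lambda t: table2[t][0])
print(best)   # 0.9
```

The same table shows the expected trade-off: lowering the threshold raises recall from 0.949 to 0.972 while precision falls from 0.969 to 0.880.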

4.1 Results on the existing catalogue

Out of the 431 spectra in our positive test set from Sardane et al. (2014)’s catalogue, our model was able to accurately identify 409 quasar spectra with Ca ii absorbers, giving a 94.9 per cent recall rate. In our model, we categorized absorption lines as real Ca ii absorption lines or candidate ones based on the S/N of the two absorption lines at λλ3934 and 3969. The real Ca ii absorption lines are defined as having an absorption line at λ3934 with an S/N > 3 and an absorption line at λ3969 with an S/N > 2.5, and candidate Ca ii absorption lines as having an absorption line at λ3934 with an S/N of 2.5–3 and an absorption line at λ3969 with an S/N of 2.0–2.5.
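The two-tier definition above can be written as a small decision function. This is a sketch under two assumptions that are not spelled out in the text: the S/N of a line is taken to be its EW divided by the EW error, and mixed cases (one line passing the stricter cut and the other only the looser cut) are treated as candidates. The function names are hypothetical.

```python
def line_snr(ew, ew_err):
    """Detection significance, ASSUMED here to be EW divided by its error."""
    return ew / ew_err


def classify_doublet(ew_k, err_k, ew_h, err_h):
    """Return 'real', 'candidate', or 'none' for a lambda 3934 (K) and
    lambda 3969 (H) line pair."""
    snr_k = line_snr(ew_k, err_k)
    snr_h = line_snr(ew_h, err_h)
    if snr_k > 3.0 and snr_h > 2.5:
        return "real"
    # ASSUMPTION: mixed cases (one line above the 'real' cut, the other only
    # above the 'candidate' cut) are treated as candidates.
    if snr_k >= 2.5 and snr_h >= 2.0:
        return "candidate"
    return "none"


# Two rows of Table 4 (EW and error in angstroms).
print(classify_doublet(0.61, 0.06, 0.36, 0.05))
print(classify_doublet(0.80, 0.29, 0.51, 0.17))
```

Under these assumptions the first example is classified as a real absorber and the second as a candidate, because its λ3934 line falls between the two S/N cuts.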

We analysed the remaining 22 quasar spectra that were not identified by our model as Ca ii absorber detections by measuring the EWs and EW errors of the potential Ca ii absorption lines. Only 5 of these spectra have real Ca ii absorption lines and 4 have candidate Ca ii absorption lines that were missed by our model. Our model most likely missed these nine quasar Ca ii absorbers because of narrow lines or other factors, such as the strong nearby absorption lines in the bottom right-hand spectrum of Fig. 9. The remaining 13 quasar spectra have no absorption lines meeting our definitions of Ca ii absorbers or candidate absorbers (Table 3). Their absorption lines are either not centred at 3934.78 and 3969.59 Å, as in the top left-hand spectrum of Fig. 9; or too narrow and weak to be identified by the program, as in the bottom left-hand spectrum of Fig. 9; or both, as in the top right-hand spectrum of Fig. 9. It is likely that these absorption lines were simply false detections.

Figure 9.

Examples of false negatives that are in the catalogue of Sardane et al. (2014), plotted after both spectrum and noise normalization (after pre-processing). Based on our measurements, we believe that all but the bottom right-hand example are actually true negatives. Top left-hand panel: quasar J005355.15−000309.3, where the two absorption lines are not centred at 3934.78 and 3969.59 Å, respectively, making it possible that this is in fact not a Ca ii absorber. Top right-hand panel: quasar J012412.47−010049.7, where the λ3934 line and the λ3969 line are off-centre in different directions. Furthermore, both lines appear to be narrow and weak, and may have been produced by random noise. Bottom left-hand panel: quasar J101748.68+222659.2, whose absorption lines, the λ3934 line especially, are too weak for it to be considered as having Ca ii absorption lines. Bottom right-hand panel: quasar J160335.78+453656.3, one of the real absorbers that the model missed. The miss is most likely because the two absorption lines in the middle lead the model to judge the Ca ii lines as too weak, especially after noise normalization.

Table 3.

The 13 quasars in Sardane et al. (2014) that we believe may be false detections.

SDSS name
J005355.15−000309.3
J012412.47−010049.7
J094613.97+133441.2
J101748.68+222659.2
J124347.60+374512.5
J133526.01−010028.1
J134246.25−003543.7
J142119.39+313219.6
J152941.57+254815.9
J155453.30+245622.5
J103451.42+233435.4
J085223.93+565725.7
J165118.61+400124.8

On the precision side, our model made 422 positive predictions in total, of which 409 were correct, giving a precision of 96.9 per cent. Among the 13 false positives, many had a large amount of noise or absorption lines very close to λλ3934 and 3969, which the model would mark as real absorption lines. Weak lines were also often present in these false detections.

4.2 Results on DR7 and DR12

Because our pre-processing method requires the redshifts of quasar Mg ii absorbers to locate potential Ca ii absorption lines, only quasar spectra with Mg ii absorption lines were searched for Ca ii absorption lines, resulting in Ca ii absorber redshifts in the range of 0.36 < z_abs < 1.4. In total, there are 35 752 quasar spectra with Mg ii absorption lines in SDSS’s DR7 identified by Zhu & Ménard (2013) and 41 895 quasar spectra with Mg ii absorption lines in DR12 found by Zhao et al. (2019). We then limited the sample to those quasar spectra whose Mg ii absorber redshifts lie within the range in which Ca ii absorption lines can be found, i.e. 0.36 < z_abs < 1.4, which left a total of 24 827 quasars from DR7 and 18 418 quasars from DR12. The neural network model was run on these two sets separately, identifying 1267 spectra with potential Ca ii absorption lines in DR7 and 694 spectra in DR12.
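The redshift selection and the shift between observed and rest frames can be sketched as follows. The rest wavelengths and the redshift limits are those quoted in the text; the function names and the example redshifts are hypothetical, and the sketch ignores details such as the wavelength coverage of individual spectra.

```python
CA_II_K = 3934.78   # rest wavelengths quoted in the text, in angstroms
CA_II_H = 3969.59
Z_MIN, Z_MAX = 0.36, 1.4


def in_search_range(z_abs):
    """True when an Mg II absorber redshift lies inside the Ca II search range."""
    return Z_MIN < z_abs < Z_MAX


def observed_wavelength(rest_wavelength, z_abs):
    """Observed-frame wavelength of a line absorbed at redshift z_abs."""
    return rest_wavelength * (1.0 + z_abs)


def to_rest_frame(observed, z_abs):
    """Shift a list of observed wavelengths to the absorber rest frame."""
    return [w / (1.0 + z_abs) for w in observed]


# Hypothetical Mg II absorber redshifts.
for z in (0.20, 0.73, 1.20, 1.55):
    if in_search_range(z):
        print(z,
              round(observed_wavelength(CA_II_K, z), 1),
              round(observed_wavelength(CA_II_H, z), 1))
```

For absorbers near the upper end of the range the doublet is observed at wavelengths beyond about 8500 Å, where sky emission is strongest.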

The EWs of the λ3934 and λ3969 lines of every candidate were measured by normalizing each spectrum from 3900 to 4000 Å and then fitting a Gaussian to each line individually. Real absorption lines and candidates are those meeting the S/N requirements described in Section 4.1. A total of 399 new Ca ii absorbers were identified (137 in DR7 and 262 in DR12). Examples of these absorption lines are shown in Fig. 10. A sample of absorbers and their properties are listed in Table 4.
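Once a Gaussian has been fitted to a continuum-normalized line, its EW follows from the fitted depth and width: the integral of (1 − flux) over a Gaussian of depth d and standard deviation σ is d·σ·√(2π). The sketch below checks this analytic value against a numerical integral over the 3900–4000 Å window. The line parameters are hypothetical and the fitting step itself, which would need a least-squares routine, is not shown.

```python
import math


def gaussian_profile(wavelength, centre, depth, sigma):
    """Continuum-normalized flux of a Gaussian absorption line."""
    return 1.0 - depth * math.exp(-0.5 * ((wavelength - centre) / sigma) ** 2)


def gaussian_ew(depth, sigma):
    """Analytic equivalent width of the profile above, in the units of sigma."""
    return depth * sigma * math.sqrt(2.0 * math.pi)


def numeric_ew(wavelengths, fluxes):
    """Trapezoidal integral of (1 - flux) over the sampled window."""
    total = 0.0
    for i in range(len(wavelengths) - 1):
        step = wavelengths[i + 1] - wavelengths[i]
        total += 0.5 * step * ((1.0 - fluxes[i]) + (1.0 - fluxes[i + 1]))
    return total


# Hypothetical line parameters; 3900-4000 angstroms sampled every 0.1 angstrom.
centre, depth, sigma = 3934.78, 0.30, 1.0
wavelengths = [3900.0 + 0.1 * i for i in range(1001)]
flux = [gaussian_profile(w, centre, depth, sigma) for w in wavelengths]

print(round(gaussian_ew(depth, sigma), 3), round(numeric_ew(wavelengths, flux), 3))
```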

Figure 10.

Examples of new quasar spectra with Ca ii absorption lines that were discovered by our neural network. The examples on the left are from SDSS’s DR12 (top left-hand panel: J100806.18+234942.1, bottom left-hand panel: J120301.01+063441.5) and the examples on the right are from SDSS’s DR7 (top right-hand panel: J122016.87+112628.1, bottom right-hand panel: J065412.58+283007.2).

Table 4.

Some examples from our new catalogue of 399 quasars with Ca ii absorption lines, with the entire catalogue listed online as supplementary information.

SDSS name	RA (°)	Dec. (°)	z_abs	z_qso	EW[3934] (Å)	Err[3934] (Å)	EW[3969] (Å)	Err[3969] (Å)
J105220.99+183636.7	163.087	18.610	0.959	1.091	0.8	0.29	0.51	0.17
J094636.86+323949.5	146.654	32.664	0.798	1.308	0.61	0.06	0.36	0.05
J132323.78−002155.2	200.849	−0.365	0.716	1.392	1.04	0.10	0.48	0.08
J081739.19+453228.3	124.413	45.541	0.749	1.510	0.95	0.10	0.58	0.09
J090122.67+204446.5	135.344	20.746	1.019	2.103	0.48	0.06	0.27	0.04
J065412.58+283007.2	103.552	28.502	0.630	1.634	1.52	0.18	0.86	0.14
J075123.60+084248.8	117.848	8.714	0.543	1.550	0.84	0.09	0.41	0.09
J072400.03+320226.6	111.000	32.041	0.572	1.159	1.50	0.19	0.81	0.14
J094145.03+303503.6	145.438	30.584	0.937	1.226	0.93	0.12	0.64	0.08
J122016.87+112628.1	185.070	11.441	0.731	1.896	1.59	0.10	1.10	0.13

Some of the samples identified by the neural network but then discarded in the manual process have only one strong line, with the other line either too weak or non-existent. Other samples have features at λλ3934 and 3969 that could have been caused by noise and have large measurement errors. It is possible that the artificial training set contained too large a proportion of weak absorbers to match the EW distribution of the real data set, causing many false positives. Examples of these false absorption lines are shown in Fig. 11.

Figure 11.

Examples of quasar spectra that were identified by the model but then manually discarded. The one on the left has a strong λ3934 absorption line but an insignificant λ3969 absorption line. Meanwhile, the one on the right has plenty of noise, making both lines insignificant.

4.3 Comparison of the new catalogue with the past catalogues

Our new catalogue has a total of 399 quasar Ca ii absorbers. A summary of measurement comparisons is shown in Table 5, and the redshift and EW distributions are shown in Figs 12 and 13. Compared to the redshift distribution of Sardane et al. (2014), ours misses quasar spectra that have Ca ii absorbers with redshifts in the range of 0.03 < z_abs < 0.36, owing to the reliance on Mg ii absorption lines to find Ca ii absorption lines in the rest frame (Fig. 12). This limitation could be lifted in our future work by doing a direct survey of Ca ii absorption lines in quasar spectra without using the Mg ii absorber’s redshift as a signpost. Nevertheless, we were able to find a greater number of absorption lines at higher redshifts. This is most likely because our pre-processing method emphasizes normalizing the noise around Ca ii absorption lines, which allowed our model to detect absorption lines even at higher redshifts, where the lines fall at observed wavelengths of >8500 Å and are affected by a greater amount of sky emission.

Figure 12.

The comparison between the absorber’s redshift distributions of our test set from Sardane et al. (2014), marked in blue colour, and our new catalogue, marked in red colour. Our new catalogue has a more confined redshift distribution (i.e. z_abs > 0.36) due to the dependence on Mg ii absorption lines. Overall, there are significantly more absorbers with mid to higher redshift and a greater average redshift in our new catalogue than what is found in Sardane et al. (2014).

Figure 13.

The comparison between the EW distributions of the Sardane et al. (2014) set (our test set), marked in blue colour, and our new catalogue, marked in red colour. Left-hand panel: the comparison of the EW distributions for the λ3934 line. Right-hand panel: the comparison of EW distributions for the λ3969 line. Our catalogue extends to lower EW values for both absorption lines and is more skewed towards them, indicating that our approach is able to discover weaker absorption lines.

Table 5.

The comparison of the distributions of the absorption redshift, EW for the λ3934 line, and EW for the λ3969 line of Sardane et al. (2014)’s catalogue with our catalogue.

Measurement	Sardane’s catalogue distribution	Mean ± standard deviation	Our catalogue distribution	Mean ± standard deviation
Absorption redshift	0.03 < z_abs < 1.34	0.58 ± 0.30	0.36 < z_abs < 1.36	0.84 ± 0.24
EW λ3934 (Å)	0.16 < EW < 2.57	0.76 ± 0.39	0.07 < EW < 2.31	0.51 ± 0.31
EW λ3969 (Å)	0.11 < EW < 1.60	0.48 ± 0.26	0.04 < EW < 1.38	0.33 ± 0.20

On the other hand, the EW distributions for both the λ3934 and the λ3969 lines extend to lower values and are overall more skewed to weaker absorbers than those reported in Sardane et al. (2014) (Table 5). This means that our model is able to discover absorption lines with lower EW values, and therefore weaker absorbers. We are able to find weak absorption lines because our model was trained on absorption lines with a wide range of EWs, including many weak ones. These weak absorption lines are also likely harder to examine. Examples of weak absorption lines are shown in Fig. 14. Since most of the strong absorption lines in DR7, and some of those in DR12 [as DR9 used in Sardane et al. (2014) is part of DR12], have already been identified by Sardane et al. (2014), the samples in our catalogue overall show weaker absorption lines. When comparing our EW distributions for DR7 and DR12 for both the λλ3934 and 3969 lines, DR12 appears to have an overall lower EW distribution than DR7, as shown in Fig. 15. This trend remained even after we included all confirmed samples from Sardane et al. (2014), who discovered most of the strong DR7 absorption lines. Because we have a greater number of quasars in DR12, it is logical that our catalogue may have weaker absorption lines. This also demonstrates the power of our model to find both strong and weak absorption lines in catalogues that were already searched.

Figure 14.

Examples of EW fittings of weak absorption lines that our model was able to identify. The blue solid line shows the noise from the SDSS catalogue, and the pink dotted line is the average of the noise over 3900–4000 Å. The green lines show the locations of 3934.78 and 3969.59 Å, and the orange lines show the centres of the EW fittings for each line. Top panel: J142106.86+533745.4, which has a pair of Ca ii absorption lines with a λ3934 EW of 0.12 Å and a λ3969 EW of 0.05 Å. Bottom panel: J142106.86+533745.4, which has a pair of Ca ii absorption lines with a λ3934 EW of 0.13 Å and a λ3969 EW of 0.05 Å.

Figure 15.

The comparison between the EW distributions of the λλ3934 and 3969 lines for quasar Ca ii absorbers in our catalogue that were discovered in DR7 and those that were discovered in DR12. Left-hand panel: the EW distribution for λ3934, with DR7 in black colour and DR12 in blue colour. The means of the EWs for both the DR7 and DR12 are also shown in their respective colours, with 0.52 Å for DR7 and 0.49 Å for DR12. Right-hand panel: the EW distribution for λ3969, with DR7 in black colour and DR12 in blue colour. The means of the EWs for both the DR7 and DR12 are also shown in their respective colours, with 0.36 Å for DR7 and 0.32 Å for DR12.

5 CONCLUSION AND DISCUSSION

Overall, in this work, we successfully developed a neural network model that can detect Ca ii absorption lines with an accuracy of 95.9 per cent. Our proposed approach can accurately discover Ca ii absorption lines, including almost all of the absorption lines identified in previous catalogues, as well as new absorbers. Furthermore, its detection speed is much faster than that of traditional methods. When applying traditional methods, as we did in Fang et al. (in preparation), we found that it takes many days to a couple of weeks to search about 10 000 spectra. With our method, however, the neural network can analyse about 10 000 spectra in only 0.7 s. This makes our method tens of thousands of times faster than traditional methods.

In our approach, the large number of artificial quasar Ca ii absorption spectra used for neural network training resolved the shortage of real Ca ii absorber samples and helped improve the accuracy of the neural network. Our technique of moving the absorption lines to the rest frame and limiting the neural network search to a relatively small, fixed wavelength window appears to have further simplified our neural network design and increased the accuracy and speed of the model. Our data pre-processing, which involves noise normalization, largely improves the accuracy and effectiveness of searching for Ca ii absorption lines at wavelengths where there is a great amount of sky emission noise, which used to be an especially challenging situation. To the best of our knowledge, these techniques have not been reported in any other neural network model for discovering absorption lines. This is also the first time that deep learning has been applied to discovering Ca ii absorption lines.

Our model was tested on SDSS’s DR7 and DR12, both of which were already partially searched by Sardane et al. (2014), who inspected DR7 and DR9 (which overlaps with DR12). With our neural network model, we have been able to confirm a total of 409 spectra with Ca ii absorption lines that were found in the past. We have also been able to find a total of 399 new Ca ii absorption lines in DR7 and DR12, which is over 100 more than we were able to find using a traditional method in Fang et al. (in preparation). Furthermore, 137 new Ca ii absorption lines were found in DR7, which had already been searched by Sardane et al. (2014). Of these 137, it is unsurprising that the majority are weak (λ3934 EW < 0.7 Å), with about 107 absorption lines being weak and 30 being strong. Most of these new absorption lines have lower EWs, indicating the capability of our model to detect weaker Ca ii absorption lines.

When looking into the 22 absorption lines reported in Sardane et al. (2014) but missed by our model, we found that 13 of them are false signals, often having misplaced absorption lines or absorption lines that were not significant enough. This demonstrates the ability of our model to discover false absorption lines from existing work. Using our model, we have been able to confirm and discover a total of 808 Ca ii absorption lines altogether among quasar spectra from DR7 and DR12 (399 new Ca ii absorption lines and 409 previously found absorption lines), greatly adding to the currently available set of found Ca ii absorption lines that can be then analysed to understand dust absorbers and chemical abundances of recent galaxies.

As stated previously, prior studies have shown that all quasar Ca ii absorbers with z_abs > 0.4 have associated Mg ii absorption lines wherever the S/N in the Mg ii region permits their detection. In previous studies, such as Sardane et al. (2014), the corresponding Mg ii λ2796 line had a minimum EW of 0.08 Å. In fact, the minimum EW of the Mg ii λ2796 line for all previously detected Ca ii absorbers has been much larger than 0.08 Å. This indicates that it is very likely that Ca ii absorbers can only be detected together with Mg ii absorbers with a λ2796 EW of at least 0.08 Å. In our study, since we have searched quasars with Mg ii absorption lines with a minimum λ2796 EW of 0.07 Å, it is highly likely that we have searched all quasars that could host Ca ii absorption lines with z_abs > 0.4, giving us reason to believe that we have found almost all Ca ii absorption lines with z_abs > 0.4 in DR7 and DR12.

In comparison with Zhao et al. (2019), who successfully identified over 40 000 Mg ii absorption lines in SDSS’s DR12 quasar spectra using a neural network for the first time, we used a neural network to detect Ca ii absorption lines for the first time. These Ca ii absorption lines are much weaker and rarer than Mg ii absorption lines, making them more difficult to identify in quasar spectra. Our newly developed pre-processing techniques have helped identify these Ca ii absorbers, especially the weak ones, with high accuracy and search speed.

Nevertheless, our neural network approach does have room for improvement. The model detected many false absorption lines in both DR7 and DR12, most likely because it misclassified noise as absorption lines. It also tends to miss absorption lines that appear noisier and narrower. Training sets could also be restricted to absorption lines with a good λ3934-to-λ3969 line ratio of around 2:1, which may weed out more of the false positives. Moreover, our processing approach currently requires each absorber’s redshift before a spectrum can be searched, and this redshift must be obtained from the Mg ii absorption lines in the quasar spectrum, which limits the spectra that can be searched. In the future, we plan to explore a different pre-processing technique that can find Ca ii absorption lines in the observer frame, so that the absorber’s redshift is not required. For example, it is possible to use a combination of a sliding window and a neural network to inspect local regions of the spectrum where Ca ii absorption lines may potentially exist and solve for the absorber’s redshift this way. Alternatively, the entire spectrum may be used as input to the neural network, as in Zhao et al. (2019), but the extra processing and noise involved would then need to be dealt with. We also plan to continue using our neural network to discover more Ca ii absorption lines in DR14 and later data releases from SDSS that have not yet been processed, so as to continuously expand the current catalogues of Ca ii absorption lines.
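The line-ratio criterion mentioned above could be applied with a check like the one sketched here. The target ratio of 2 comes from the text; the tolerance of 0.5 and the function names are purely illustrative assumptions, and a real selection would also need to account for the measurement errors on both EWs.

```python
def doublet_ratio(ew_3934, ew_3969):
    """Ratio of the lambda 3934 to the lambda 3969 equivalent width."""
    return ew_3934 / ew_3969


def has_good_ratio(ew_3934, ew_3969, target=2.0, tolerance=0.5):
    """True when the ratio lies within `tolerance` of `target`.
    Both default values are illustrative assumptions."""
    return abs(doublet_ratio(ew_3934, ew_3969) - target) <= tolerance


# (EW 3934, EW 3969) pairs taken from three rows of Table 4, in angstroms.
pairs = [(0.61, 0.36), (1.04, 0.48), (1.59, 1.10)]
for ew_k, ew_h in pairs:
    print(round(doublet_ratio(ew_k, ew_h), 2), has_good_ratio(ew_k, ew_h))
```

With this tolerance the first two pairs pass and the third, whose ratio is closer to 1.4, does not, which shows that a strict cut would also exclude some absorbers in the published catalogue.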

SUPPORTING INFORMATION

Table 4. Some examples from our new catalogue of 399 quasars with Ca ii absorption lines.

Please note: Oxford University Press is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

ACKNOWLEDGEMENTS

Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS website is http://www.sdss.org/.

The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max Planck Institute for Astronomy (MPIA), the Max Planck Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.

Funding for SDSS-III has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, and the U.S. Department of Energy Office of Science. The SDSS-III website is http://www.sdss3.org/.

SDSS-III is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS-III Collaboration, including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofisica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington, and Yale University.

The authors would like to thank the anonymous referee who has provided valuable suggestions that helped improve the quality of this paper.

DATA AVAILABILITY

The data underlying this paper were accessed from the Sloan Digital Sky Survey (https://classic.sdss.org/). The data generated from this research are available in the paper and in its online supplementary material.

REFERENCES

Abazajian K. et al., 2005, AJ, 129, 1755, doi:10.1086/427544

Ahn C. P. et al., 2012, ApJS, 203, 21, doi:10.1088/0067-0049/203/2/21

Alam S. et al., 2015, ApJS, 219, 12, doi:10.1088/0067-0049/219/1/12

Bertin E., Arnouts S., 1996, A&AS, 117, 393, doi:10.1051/aas:1996164

Dangeti P., 2017, Statistics for Machine Learning. Packt Publishing, Birmingham

Graff P. B., Lien A. Y., Baker J. G., Sakamoto T., 2016, ApJ, 818, 55, doi:10.3847/0004-637X/818/1/55

Hála P., 2014, preprint (arXiv:1412.8341)

Hampton E. J. et al., 2017, MNRAS, 470, 3395, doi:10.1093/mnras/stx1413

Kim E. J., Brunner R. J., 2017, MNRAS, 464, 4463, doi:10.1093/mnras/stw2672

Kingma D. P., Ba J., 2015, preprint (arXiv:1412.6980)

Krogager J.-K., 2018, preprint (arXiv:1803.01187)

Masko D., Hensman P., 2015, The Impact of Imbalanced Training Data for Convolutional Neural Networks (Dissertation), KTH Royal Institute of Technology

Moore C. E., 1970, NSRDS-NBS 34, National Bureau of Standards, Washington D.C.

Nestor D. B., Turnshek D. A., Rao S. M., 2005, ApJ, 628, 637, doi:10.1086/427547

Nestor D. B., Pettini M., Hewett P. C., Rao S., Wild V., 2008, MNRAS, 367, 1670, doi:10.1111/j.1365-2966.2008.13857.x

Parks D., Prochaska J. X., Dong S., Cai Z., 2018, MNRAS, 476, 1151, doi:10.1093/mnras/sty196

Quider A. M., Nestor D. B., Turnshek D. A., Rao S. M., Monier E. M., Weyant A. N., Busche J. R., 2011, AJ, 141, 137, doi:10.1088/0004-6256/141/4/137

Rimoldini L. G., 2007, PhD thesis, Univ. Pittsburgh

Sardane G. M., Turnshek D. A., Rao S. M., 2014, MNRAS, 444, 1747, doi:10.1093/mnras/stu1554

Sardane G. M., Turnshek D. A., Rao S. M., 2015, MNRAS, 452, 3192, doi:10.1093/mnras/stv1506

Savage B. D., Sembach K. R., 1996, ARA&A, 34, 279, doi:10.1146/annurev.astro.34.1.279

Wild V., Hewett P. C., 2005, MNRAS, 361, L30, doi:10.1111/j.1745-3933.2005.00058.x

Wild V., Hewett P. C., Pettini M., 2006, MNRAS, 367, 211, doi:10.1111/j.1365-2966.2005.09935.x

Wild V., Hewett P. C., Pettini M., 2007, MNRAS, 374, 292, doi:10.1111/j.1365-2966.2006.11146.x

Zhao Y. et al., 2019, MNRAS, 487, 801, doi:10.1093/mnras/stz1197

Zhu G., Ménard B., 2013, ApJ, 770, 130, doi:10.1088/0004-637X/770/2/130

Zych B. J., Murphy M. T., Pettini M., Hewett P. C., Ryan-Weber E. V., Ellison S. L., 2007, MNRAS, 379, 1409, doi:10.1111/j.1365-2966.2007.12015.x

Zych B. J., Murphy M. T., Hewett P. C., Prochaska J. X., 2009, MNRAS, 392, 1429, doi:10.1111/j.1365-2966.2008.14157.x

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
