Abstract

This article investigates the problem of estimating stellar atmospheric parameters from spectra. Feature extraction is a key procedure in estimating stellar parameters automatically. We propose a scheme for spectral feature extraction and atmospheric parameter estimation using the following three procedures: firstly, learn a set of basic structure elements (BSEs) from stellar spectra using an autoencoder; secondly, extract representative features from stellar spectra based on the learned BSEs through some procedures of convolution and pooling; thirdly, estimate stellar parameters (Teff, log g, [Fe/H]) using a back-propagation (BP) network. The proposed scheme has been evaluated on both real spectra from Sloan Digital Sky Survey (SDSS)/Sloan Extension for Galactic Understanding and Exploration (SEGUE) and synthetic spectra calculated from Kurucz's new opacity distribution function (NEWODF) models. The best mean absolute errors (MAEs) are 0.0060 dex for log Teff, 0.1978 dex for log g and 0.1770 dex for [Fe/H] for the real spectra and 0.0004 dex for log Teff, 0.0145 dex for log g and 0.0070 dex for [Fe/H] for the synthetic spectra.

1 INTRODUCTION

With the development of astronomical observation technology, more and more large-scale sky survey projects have been proposed and implemented: for example, the Sloan Digital Sky Survey (SDSS: York et al. 2000; Ahn et al. 2012), the Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST/Guoshoujing Telescope: Zhao et al. 2006; Cui et al. 2012; Luo et al. 2015) and the Gaia–ESO Survey (Gilmore et al. 2012; Randich et al. 2013). Such surveys automatically observe and collect massive amounts of astronomical spectral data. However, manual processing of these massive, high-dimensional data cannot keep pace with their rapid growth. Therefore, automatic processing and analysis methods are urgently needed for astronomical data (Zhang et al. 2009). For this reason, automatic estimation of stellar atmospheric parameters from spectra is currently an active research topic (Recio-Blanco, Bijaoui & De Laverny 2006; Fiorentin et al. 2007; Jofre et al. 2010; Wu et al. 2011; Liu, Zhang & Lu 2014).

Estimating the stellar parameters (Teff, log g, [Fe/H]) from spectra can be abstracted to a process of finding a mapping from a stellar spectrum to its parameters. If there is a set of representative stellar spectra with known parameters (training data), a supervised machine-learning method can learn the mapping from training data: for example nearest neighbour (NN), artificial neural networks (ANN) and support vector machines (SVM). Based on the learned mapping, we can estimate the atmospheric parameters for a new spectrum.

An astronomical spectrum usually consists of thousands of fluxes, and it is necessary to extract a small number of features in order to estimate stellar parameters accurately (Section 8.3). Orthogonal transforms such as wavelet transforms (WTs) and principal-component analysis (PCA) are popular choices for extracting features. They give a complete representation of the data by a set of complete orthogonal bases. By selecting a small number of transform coefficients as features, dimensionality reduction is achieved. Wavelet coefficients of a spectrum have been used to estimate stellar parameters (Manteiga et al. 2010; Lu, Li & Li 2012; Lu et al. 2013) and to classify spectra (Guo, Xing & Jiang 2004; Xing & Guo 2006). As the most popular dimension-reduction method, PCA has also been widely used in astronomical data analysis (Whitney 1983; Bailer-Jones, Irwin & Von Hippel 1998; Zhang et al. 2005, 2006; Singh et al. 2006; McGurk, Kimball & Ivezić 2010). Many parameter-estimation schemes use the projections of data/spectra on some principal components directly as spectral features and have achieved good results (Zhang et al. 2005, 2006; Singh et al. 2006; Rosalie et al. 2010).

The motivation of our research is to explore the feasibility of predicting atmospheric parameters (Teff, log g, [Fe/H]) from a small number of descriptions computed from information within a local wavelength range of a spectrum. For convenience, this kind of description is referred to as ‘local and sparse features’. This characteristic of local representation helps to trace back the potential spectral lines effective in estimation (see Section 5.3).

This work proposes a scheme based on autoencoders, convolution and pooling techniques to extract local and sparse features. Autoencoder and convolution operations give a statistical non-orthogonal decomposition, which leads to a redundant and overcomplete representation of the data. The 'overcomplete' representation (equation 15) of a spectrum has many more dimensions than the original spectrum (equation 1). While complete representations based on orthogonal transforms are mature and popular feature-extraction methods, redundant, 'overcomplete' representations (Olshausen 2001; Teh et al. 2003) have been advocated and used successfully in many fields. In the literature, researchers have shown that redundancy and overcompleteness help in computing some features in subsequent procedures (Yee et al. 2003), with improved robustness in the presence of noise (Simoncelli et al. 1992), more compactness and more interpretability (Mallat & Zhang 1993).

From the overcomplete representation, a sparse representation (equation 18) is computed using pooling and maximization operations. The pooling and maximization operations are actually a competition strategy. In this procedure, much redundancy is removed by the competition between multiple redundant components. This competition helps to restrain the copies with more noise and a robust representation is obtained. A typical advantage of this scheme is that it can express many suitable and meaningful structures in data in some applications. It is shown that this scheme does extract some meaningful local features (Section 5.3) for automatically estimating atmospheric parameters (Section 8).

The rest of this article is organized as follows: Section 2 describes the spectra used in this work. Section 3 presents the framework of the proposed scheme. Section 4 introduces the autoencoder network, the concept of basic structure elements (BSEs) and the BSE learning method using autoencoders. Sections 5 and 6 describe the proposed feature-extraction method based on BSEs and the estimation method, a back-propagation (BP) network, respectively. Section 7 investigates the optimization of the proposed scheme. Section 8 reports some experimental evaluations and discusses the rationality and robustness of the proposed scheme. Section 9 concludes this work.

2 DATA SETS AND PREPROCESSING

2.1 Real spectra

In this article, we use two spectral sets to evaluate the proposed scheme. 5000 stellar spectra selected from SDSS-DR7 (Abazajian et al. 2009) compose the real spectrum set. Each spectrum has 3000 fluxes in a logarithmic wavelength range [3.6000, 3.8999] with a sampling resolution of 0.0001. The ranges of the three parameters are [4163, 9685] K for Teff, [1.260, 4.994] dex for log g and [−3.437, 0.1820] dex for [Fe/H].

2.2 Synthetic spectra

A set of 18 969 synthetic spectra is generated with Kurucz's new opacity distribution function (NEWODF) models (Castelli & Kurucz 2003) using the package spectrum (Gray & Corbally 1994) and 830 828 atomic and molecular lines. The solar atomic abundances we used are derived from Grevesse & Sauval (1998). The grids of the synthetic stellar spectra span the parameter ranges [4000, 9750] K in Teff (45 values, step size 100 K between 4000 and 7500 K and 250 K between 7750 and 9750 K), [1, 5] dex in log g (17 values, step size 0.25 dex) and [−3.6, 0.3] dex in [Fe/H] (27 values, step size 0.2 between −3.6 and −1 dex, 0.1 between −1 and 0.3 dex).

2.3 Preprocessing and data partitioning

The proposed scheme is implemented based on ANNs. Usually an ANN requires each input component to be normalized, to eliminate the impact on the input data of range differences from flux to flux. Suppose that
$\boldsymbol{x} = (x_1, x_2, \ldots, x_l)^{\rm T} \qquad (1)$
is a spectrum. The normalized spectrum $\bar{\boldsymbol{x}}$ is calculated by
$\bar{\boldsymbol{x}} = \boldsymbol{x}/\Vert\boldsymbol{x}\Vert_2, \qquad (2)$
where $\Vert \boldsymbol{x}\Vert_2 = (\sum_{i=1}^{l}{x_i^2})^{0.5}$. In addition, to reduce the dynamic range and to better represent the uncertainties of spectral data, Teff is replaced by log Teff in both sets (Fiorentin et al. 2007; Li et al. 2014).
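As an illustration, here is a minimal Python (NumPy) sketch of this preprocessing. The array and variable names are ours, and the random arrays are placeholders that stand in for real SDSS fluxes and labels so that the sketch is runnable; this is not the authors' pipeline.

```python
import numpy as np

def normalize_spectra(fluxes):
    """L2-normalize each spectrum (one spectrum per row), as in equation (2)."""
    norms = np.sqrt((fluxes ** 2).sum(axis=1, keepdims=True))
    return fluxes / norms

# Placeholder data: 5000 spectra with 3000 fluxes each and their Teff values.
fluxes = np.random.rand(5000, 3000)          # stand-in for real SDSS fluxes
teff = np.random.uniform(4163, 9685, 5000)   # stand-in Teff values in K

x_bar = normalize_spectra(fluxes)            # normalized spectra
log_teff = np.log10(teff)                    # Teff replaced by log10(Teff)
```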

The proposed scheme of this work belongs to the class of statistical learning methods. The fundamental idea is to discover the predictive relationships between stellar spectra and the atmospheric parameters Teff, log g and [Fe/H] from empirical data, which constitute a training set. At the same time, the performance of the discovered predictive relationships should also be evaluated objectively. Therefore, a separate, independent set of stellar spectra is needed for this evaluation, usually called a test set in machine learning. However, most learning methods tend to overfit the empirical data. In other words, statistical learning methods can extract apparent relationships from the training data that do not hold in general. In order to avoid overfitting, we need a third independent spectrum set for objectively tuning the parameters of the framework (Section 7) while investigating the potential relationships. This third spectrum set, together with its reference parameters, constitutes the validation set.

Therefore, in each experiment, we split the total spectrum sample into three subsets: a training set (60 per cent), a validation set (20 per cent) and a test set (20 per cent). The training set is the carrier of knowledge from which the proposed scheme should learn. The validation set is a mentor/instructor of the proposed scheme that can independently and objectively give advice to the learning process. The training set and the validation set are used in establishing a spectral parametrization, while the test set acts as a referee to evaluate the performance of the established spectral parametrization objectively. The roles of these three subsets are listed in Table 1.
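A minimal sketch of such a 60/20/20 random partition is given below; the paper does not specify how the split was implemented, so the helper and its seed are hypothetical.

```python
import numpy as np

def split_data(n_samples, seed=0):
    """Randomly split sample indices into 60% training, 20% validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_tr = int(0.6 * n_samples)
    n_va = int(0.2 * n_samples)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

train_idx, valid_idx, test_idx = split_data(5000)   # e.g. the 5000 SDSS spectra
```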

Table 1. Roles of the three data sets.

Data set        Roles
Training set    (1) Generate spectral patches to construct a training set for an autoencoder (Subsection 4.2).
                (2) Learn in optimizing the configuration (Section 7).
                (3) Train the BP network (Section 6).
Validation set  Evaluate the performance of the learned spectral parametrization in optimizing the configuration (Section 7).
Test set        Evaluate the performance of the proposed scheme (Section 8).

3 FRAMEWORK

There are multiple procedures in our scheme; the flowchart shown in Fig. 1 illustrates the end-to-end flow.

Figure 1.

A flowchart to show the order of procedures in the proposed scheme. The procedure 'Learning BSE' is only used in the investigation/training stage. In the application/test stage, features are extracted by convolution and pooling, after normalizing the spectra, using the BSEs learned in training.

Overall, our work can be divided into two stages: (1) a research stage and (2) an application stage. These two stages can be implemented automatically based on the flowchart in Fig. 1. As shown in this flowchart, in the research stage, we can obtain an optimized configuration for the proposed scheme and a spectral parametrization by which we can map a spectrum approximately to its atmospheric parameters. In the application stage, we can compute the atmospheric parameters from a spectrum. More about optimization of the configuration is discussed in Section 7.

This work uses multiple acronyms and notations. To facilitate readability, we summarize them in Table 2.

Table 2. Acronyms and notations used in this work. AN: acronym or notation; the first appearance of each is given in parentheses.

$a^{(l)}_j(\boldsymbol{x})$: the output of the jth node in the lth layer for an input x (Section 4.1)
ap: an atmospheric parameter log Teff, log g or [Fe/H] (Section 7)
$b^{(l)}_j$: the bias of the jth node in the (l + 1)th layer (Section 4.1)
β: a regularization parameter in the objective function J(·, ·) (Section 4.1)
BSE: basic structure element (Abstract)
DCP: description by convolution and pooling (Section 5.2)
f(·): an estimation method (Section 6)
g(·): an activation function of an autoencoder (Section 4.1)
hW, b: the mapping of an autoencoder (Section 4.1)
J(·, ·): the objective function of an autoencoder (Section 4.1)
KL(·∥·): relative entropy of the average output of a hidden node and its expected output (Section 4.1)
λ: a weight decay parameter in the objective function J(·, ·) (Section 4.1)
$\hat{\lambda}$: optimized value for λ; $\hat{\beta}$, $\hat{\rho}$, $\hat{n}$, $\hat{N}_{\rm p}$, $\hat{n}^{\rm BP}_{\rm hl}$ and $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ are defined similarly (Section 7)
m: number of nodes in the input layer of an autoencoder (Section 4.1)
n: number of nodes in the hidden layer of an autoencoder (Section 4.1)
$n^{\rm BP}_{\rm hl}$: number of hidden layers in a BP network (Section 6)
$\tilde{n}^{\rm BP}_{\rm hl}$: initialized value of $n^{\rm BP}_{\rm hl}$ (Section 7)
$\boldsymbol{n}^{\rm BP}_{\rm nhl}$: number of nodes in the hidden layers of a BP network (Section 6)
$\tilde{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$: an initialization of $\boldsymbol{n}^{\rm BP}_{\rm nhl}$ (Section 7)
N: number of samples in a data set S or Str, depending on its context (Section 4.1)
Np: number of pools (Section 5.2)
ρ: a desired activation level (Section 4.1)
$\bar{\rho}_j$: the average activation of the jth hidden node of an autoencoder (Section 4.1)
RRλ: restricted (search) range for λ; RRβ, RRρ, RRn, $RR_{N_{\rm p}}$, $RR_{n^{\rm BP}_{\rm hl}}$ and $RR_{\boldsymbol{n}^{\rm BP}_{\rm nhl}}$ are defined similarly (Section 7)
S: a data set with spectra and atmospheric parameters (Section 6)
SBSE: a set of BSEs (Section 4.2)
Str: a training set of stellar spectra (Section 4.2)
$S^{\rm tr\_ae}$: a training set for an autoencoder, consisting of spectral patches (Section 4.1)
$v^{(j)}_q$: the maximum convolution response with the jth BSE in the qth pool (Section 5.2)
w: description by convolution and pooling (DCP) (Section 5.2)
$W^{(1)}_{j\cdot}$: the jth BSE (Section 4.1)
$W^{(l)}_{ji}$: a weight between the ith node in the lth layer and the jth node in the (l + 1)th layer (Section 4.1)
x: a spectrum or a spectral patch for an autoencoder, depending on its context (Section 2.3)
$\bar{\boldsymbol{x}}$: a normalized spectrum (Section 2.3)
y: output of an autoencoder (Section 4.1)
z: a convolution response of a spectrum with BSEs (Section 5.1)

4 LEARNING BSEs USING AN AUTOENCODER

4.1 An autoencoder

An autoencoder is a special kind of ANN, initially proposed as a data dimensionality reduction scheme (Hinton & Salakhutdinov 2006) and now widely used in image analysis (Tan & Eswaran 2010; Shin et al. 2011) and speech processing (Vishnubhotla, Fernandez & Ramabhadran 2010; Deng et al. 2013).

An autoencoder adopts the framework shown in Fig. 2 and is usually used to extract features by unsupervised learning. In an autoencoder, there is only one hidden layer and the number of nodes in the output layer is equal to that in the input layer. The output yi is an approximation of the corresponding input xi,
$y_i \approx x_i, \quad i = 1, \ldots, m. \qquad (3)$
The learning objective of an autoencoder is to find a set of weights, labelled with lines in Fig. 2, between the nodes in different layers based on a set of empirical data (a training set). In other words, the autoencoder tries to approximate an identity function on the training set.
Figure 2.

A framework of an autoencoder. The number of output nodes is equal to the number of input nodes. The learning objective of the network is to make the output hW, b(x) as close as possible to input x.

By setting the number of hidden nodes to be far smaller than that of input nodes, an autoencoder can achieve dimension reduction. For example, when an autoencoder has 200 input nodes and 20 hidden nodes (equivalently, m = 200 and n = 20 in Fig. 2), the original 200-dimensional input could be ‘reconstructed’ approximately from the 20-dimensional output of the hidden layer. If we use the output of the hidden layer as a representation of an input of the network, the autoencoder plays the role of a feature extractor.

To introduce the implementation details of the autoencoder, we utilize the notations in an online tutorial ‘UFLDL Tutorial’ (Andrew et al. 2010).

In the network of Fig. 2, let the layer labels be 1 for the input nodes, 2 for the hidden nodes and 3 for the output nodes and suppose |$W^{(l)}_{ji}$| (l = 1, 2) represents a weight between the ith node in the lth layer and the jth node in the (l + 1)th layer, |$b^{(l)}_j$| (l = 1, 2) is the bias of the jth node in the (l + 1)th layer and |$a^{(l)}_j(\boldsymbol {x})$| (l = 1, 2, 3) is the output of the jth node in the lth layer for input x = (x1, …, xm)T. Then, |$a^{(1)}_i(\boldsymbol {x}) = x_i$| for the input nodes, |$a^{(3)}_k(\boldsymbol {x}) = y_k$| for the output nodes.

For compact expressions in equations (4) and (8), we introduce three variables, s1, s2 and s3, respectively representing the numbers of nodes in the three layers of an autoencoder network (Fig. 2). Then s1 = m, s2 = n and s3 = m, and the relationship between the nodes in adjacent layers is
$a^{(l+1)}_j(\boldsymbol{x}) = g\Big(\sum_{i=1}^{s_l} W^{(l)}_{ji}\, a^{(l)}_i(\boldsymbol{x}) + b^{(l)}_j\Big), \quad l = 1, 2, \quad j = 1, \ldots, s_{l+1}. \qquad (4)$
In equation (4), the g(·) is referred to in the literature as an activation function. Two common choices for the activation function are a sigmoid function,
$g(z) = \dfrac{1}{1 + {\rm e}^{-z}}, \qquad (5)$
and a hyperbolic tangent function,
$g(z) = \tanh(z) = \dfrac{{\rm e}^{z} - {\rm e}^{-z}}{{\rm e}^{z} + {\rm e}^{-z}}. \qquad (6)$
This work uses the sigmoid function in equation (5).
Overall, the network in Fig. 2 implements a non-linear mapping hW, b(·) from an input x = (x1, …, xm)T to an output y = (y1, …, ym)T:
$\boldsymbol{y} = h_{W,b}(\boldsymbol{x}), \qquad (7)$
where |$W = \lbrace W^{(l)}_{ji}\rbrace$| represents the set of weights of an autoencoder network and |$b = \lbrace b^{(l)}_j\rbrace$| the set of biases.
Suppose that |$S^{\rm tr\_ae}$| is a training set for an autoencoder. To obtain the parameters W and b for an autoencoder network based on equation (3), we can minimize an objective function, J, in equation (8):
$J(W, b) = \dfrac{1}{N}\sum\limits_{\boldsymbol{x}\in S^{\rm tr\_ae}} \dfrac{1}{2}\big\Vert h_{W,b}(\boldsymbol{x}) - \boldsymbol{x}\big\Vert_2^2 + \dfrac{\lambda}{2}\sum\limits_{l=1}^{2}\sum\limits_{i=1}^{s_l}\sum\limits_{j=1}^{s_{l+1}} \big(W^{(l)}_{ji}\big)^2 + \beta\sum\limits_{j=1}^{n}\mathbf{KL}(\rho \parallel \bar{\rho}_j), \qquad (8)$
where |$\bar{\rho }_j$| is the average output of the jth hidden node,
$\bar{\rho}_j = \dfrac{1}{N}\sum\limits_{\boldsymbol{x}\in S^{\rm tr\_ae}} a^{(2)}_j(\boldsymbol{x}), \qquad (9)$
$\mathbf{KL}(\rho \parallel \bar{\rho}_j) = \rho\log\dfrac{\rho}{\bar{\rho}_j} + (1-\rho)\log\dfrac{1-\rho}{1-\bar{\rho}_j}, \qquad (10)$
N is the number of samples in the training set1 |$S^{\rm tr\_ae}$| and λ ≥ 0, β ≥ 0 and ρ > 0 are three preset parameters of the spectral parametrization. These three preset parameters control the relative importance of the three terms in equation (8). In the literature, λ is usually referred to as the weight decay parameter.

In equation (8), the first term represents an empirical error evaluation between the actual output and the expected output of an autoencoder and ensures a good reconstruction performance of the network. The second term, a regularization term of |$W^{(l)}_{ji}$|⁠, is used to overcome possible overfitting to the training set by reducing the scheme's complexity.

The third term with weighted coefficient β is a penalty term for sparsity in outputs of the hidden layer. |$\mathbf {KL}(\rho \!\parallel\! \bar{\rho }_j)$| is the relative entropy of the average output, |$\bar{\rho }_j$|⁠, and a desired activation level ρ. |$\mathbf {KL}(\rho \!\parallel\! \bar{\rho }_j)$| increases monotonically with increasing distance between |$\bar{\rho }_j$| and ρ and encourages the average activation, |$\bar{\rho }$|⁠, of the hidden layer to be close to a desired average activation ρ.
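To make the objective concrete, the following is a minimal NumPy sketch of the forward pass and the objective in equation (8) for the three-layer autoencoder of Fig. 2. It assumes the UFLDL-style formulation (mean squared reconstruction error plus weight decay plus KL sparsity penalty); training by back-propagation is omitted, and all function and array names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_objective(W1, b1, W2, b2, X, lam, beta, rho):
    """Objective J(W, b) of a sparse autoencoder (cf. equation 8).

    X  : (N, m) array of spectral patches (the training set S^tr_ae)
    W1 : (n, m) input-to-hidden weights; b1 : (n,) hidden biases
    W2 : (m, n) hidden-to-output weights; b2 : (m,) output biases
    """
    A2 = sigmoid(X @ W1.T + b1)             # hidden activations, shape (N, n)
    Y = sigmoid(A2 @ W2.T + b2)             # reconstructions, shape (N, m)

    recon = 0.5 * np.mean(np.sum((Y - X) ** 2, axis=1))        # empirical error term
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))    # weight decay term
    rho_bar = A2.mean(axis=0)               # average activation of each hidden node
    kl = np.sum(rho * np.log(rho / rho_bar)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_bar)))  # sparsity penalty
    return recon + decay + beta * kl
```

After training, the rows of W1 play the role of the BSEs introduced in Section 4.2.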

4.2 Learning BSEs

BSEs are a set of templates of spectral patches with a limited wavelength range. Using the BSEs, we can extract local and sparse features (Section 5). This work studies the feasibility of learning BSEs through an autoencoder for automatic estimation of atmospheric parameters. In an autoencoder, the weighted sum |$\sum ^{m}_{i=1}W^{(1)}_{ji}a^{(1)}_i(\boldsymbol {x})+b^{(1)}_j$| in the jth hidden node is essentially a projection of an input x on the vector |$W^{(1)}_{j\cdot }=(W^{(1)}_{j1},W^{(1)}_{j2},\ldots ,W^{(1)}_{jm})^{\rm T}$| of weights, which is similar to computing the coefficient of a vector in a coordinate system. Thus, each |$W^{(1)}_{j\cdot }$| (j = 1, …, n) can be regarded as a 'basis' for the input data, and in this article we call it a basic structure element (BSE).

To learn BSEs through an autoencoder, a training set |$S^{\rm tr\_ae}$| is constructed to represent the local information of a stellar spectrum. Let Str be a training set of stellar spectra (Section 2). The BSE training set |$S^{\rm tr\_ae}$| is constructed from the spectral training set Str in the following way:

  1. randomly select a spectrum, x = (x1, …, xl), from Str, where l > 0 represents the number of fluxes of a spectrum;

  2. randomly generate an integer j satisfying 1 ≤ j ≤ l − m + 1, where m > 0 represents the dimension of a sample in |$S^{\rm tr\_ae}$| and is consistent with the number of input nodes of the autoencoder in Fig. 2;

  3. take (xj, xj + 1, …, xj + m − 1)T as a sample of |$S^{\rm tr\_ae}$|⁠.

By repeating the above three procedures, we obtain a BSE training set |$S^{\rm tr\_ae}$|. Therefore, |$S^{\rm tr\_ae}$| actually consists of a series of spectral patches. Considering the widths of lines in a stellar spectrum, m is empirically taken as 81 in this work. In the proposed scheme, 100 000 such patches are generated to constitute the BSE training set. These patches are not generated from a specific wavelength position; therefore, |$S^{\rm tr\_ae}$| expresses the general structure of all spectral patches of length m.
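The patch-generation procedure can be sketched as follows; the helper name and the random seed are ours, and the input is assumed to be a 2D array of normalized training spectra.

```python
import numpy as np

def build_patch_set(spectra, m=81, n_patches=100_000, seed=0):
    """Randomly cut n_patches patches of length m from the training spectra.

    spectra : (N, l) array of normalized training spectra (one spectrum per row).
    """
    rng = np.random.default_rng(seed)
    n_spec, l = spectra.shape
    patches = np.empty((n_patches, m))
    for k in range(n_patches):
        s = rng.integers(n_spec)        # pick a spectrum at random
        j = rng.integers(l - m + 1)     # pick a starting flux position at random
        patches[k] = spectra[s, j:j + m]
    return patches

# Usage (with the normalized training spectra from Section 2.3):
# S_tr_ae = build_patch_set(normalized_training_spectra)
```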

To learn a set of BSEs, we input the generated training set |$S^{\rm tr\_ae}$| into an autoencoder (Section 4.1) and compute a set of spectral templates, BSEs:
$S^{\rm BSE} = \big\lbrace W^{(1)}_{j\cdot},\ j = 1, \ldots, n \big\rbrace, \qquad (11)$
where every BSE, |$W^{(1)}_{j\cdot }$|⁠, is a vector representing a basic pattern of spectral patches,
$W^{(1)}_{j\cdot} = \big(W^{(1)}_{j1}, W^{(1)}_{j2}, \ldots, W^{(1)}_{jm}\big)^{\rm T}. \qquad (12)$

5 FEATURE EXTRACTION

This work proposes to extract features by performing convolution and pooling operations on the computed BSEs and a stellar spectrum.

5.1 Convolution

Let
$\boldsymbol{x} = (x_1, x_2, \ldots, x_l)^{\rm T} \qquad (13)$
denote a spectrum. Using the extracted BSEs SBSE in equation (11) and a convolution operation, we filter the spectrum x:
$z^{(j)}_i = \sum\limits_{k=1}^{m} W^{(1)}_{jk}\, x_{i+k-1}, \qquad (14)$
and transform the spectrum x into
$\boldsymbol{z} = \big\lbrace z^{(j)}_i \big\rbrace, \qquad (15)$
where i = 1, …, l − m + 1, j = 1, …, n. Then,
$\boldsymbol{z}^{(j)} = \big(z^{(j)}_1, z^{(j)}_2, \ldots, z^{(j)}_{l-m+1}\big)^{\rm T} \qquad (16)$
is the convolution response vector of the jth BSE structure |$W^{(1)}_{j\cdot }$| in equation (11).

In this work, a BSE structure template has m = 81 components (equation 12) and a SDSS spectrum is represented by l = 3000 fluxes. Therefore, there are l − m + 1 = 2920 convolution responses for any one BSE structure template and spectrum. From the n = 25 BSE structure templates in equation (11), we obtain (l − m + 1) × n = 2920 × 25 convolution responses (equation 15) for every spectrum.
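A minimal sketch of this convolution step is given below. The excerpt does not spell out whether a bias or an activation function is applied to each response, so the sketch uses plain sliding inner products; the function and variable names are ours.

```python
import numpy as np

def convolve_with_bses(spectrum, bses):
    """Sliding inner products of a spectrum with each BSE (cf. equations 14-16).

    spectrum : (l,) array of normalized fluxes.
    bses     : (n, m) array whose rows are the learned BSEs W^(1).
    Returns an (n, l - m + 1) array of convolution responses.
    """
    n, m = bses.shape
    l = spectrum.size
    z = np.empty((n, l - m + 1))
    for i in range(l - m + 1):
        z[:, i] = bses @ spectrum[i:i + m]   # responses of all BSEs at position i
    return z

# e.g. l = 3000 fluxes, m = 81, n = 25 BSEs  ->  a 25 x 2920 response array
```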

5.2 Pooling

However, for atmospheric parameter estimation, the convolution response description z of a spectrum (equation 15) contains many redundancies. Redundancies usually result in overfitting. To overcome this problem, a 'pooling' method is adopted to reduce the dimension by merging the features at different positions. In this pooling method, we first partition the convolution response description equally into Np pools according to their wavelength positions and then choose the maximum response in each pool as the final spectral feature:
$v^{(j)}_q = \max\limits_{i\,\in\,{\rm pool}\ q}\ z^{(j)}_i, \qquad (17)$
where q = 1, …, Np, j = 1, …, n, l/Np is the length of every pool, l is the number of fluxes in a spectrum and Np is the number of pools. The |$v_q^{(j)}$| is referred to as a maximum convolution response. Thus, the spectrum x is transformed to
$\boldsymbol{w} = \big(v^{(1)}_1, \ldots, v^{(1)}_{N_{\rm p}},\ v^{(2)}_1, \ldots, v^{(2)}_{N_{\rm p}},\ \ldots,\ v^{(n)}_1, \ldots, v^{(n)}_{N_{\rm p}}\big)^{\rm T}. \qquad (18)$
For convenience, the detected features in equation (18) are named the description by convolution and pooling (DCP). For a SDSS spectrum, the dimension of its description decreases from (l − m + 1) × n = 2920 × 25 (convolution responses) to Np × n = 10 × 25 (DCP) if the pool number Np is 10.
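A sketch of the max-pooling step follows. The exact assignment of responses to pools (the pooling is done in logarithmic wavelength space, see Fig. 4) is not reproduced here, so the sketch simply splits the response vector into Np nearly equal segments; names are ours.

```python
import numpy as np

def max_pool(z, n_pools=10):
    """Maximum convolution response in each pool (cf. equations 17-18).

    z : (n, l - m + 1) array of convolution responses for one spectrum.
    Returns the DCP feature vector of length n * n_pools.
    """
    n, width = z.shape
    edges = np.linspace(0, width, n_pools + 1).astype(int)   # pool boundaries
    v = np.empty((n, n_pools))
    for q in range(n_pools):
        v[:, q] = z[:, edges[q]:edges[q + 1]].max(axis=1)    # max response per pool
    return v.ravel()

# e.g. z of shape (25, 2920)  ->  a DCP feature vector of length 25 x 10 = 250
```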

5.3 Discussion on convolution and pooling

Pooling is a form of non-linear downsampling operation, which combines the responses of feature detectors at nearby locations into some statistical variables. The pooling method was proposed in Hubel and Wiesel's work on complex cells in the visual cortex (Hubel & Wiesel 1962) and is now used in a large number of applications: for example computer vision and speech recognition. Theoretical analysis (Boureau, Ponce & LeCun 2010) suggests that the pooling method works well when the features are sparse. In image- and speech-processing fields, convolution and pooling based on local bases has been proven to be an effective feature-extraction method (Nagi et al. 2011; Shin et al. 2011; Ashish, Murat & Anthony 2013).

In theory, the convolution operation can reduce negative effects from noise, and the pooling operation can restrain the negative influence of some imperfections (e.g. sky lines and cosmic-ray removal residuals, residual calibration defects) through competition between the multiple convolution responses of the extracted BSEs with a spectrum (the maximization procedure). Some effects of convolution and pooling are investigated in the remainder of this subsection and further discussed in Sections 8.3 and 8.4 based on experimental results.

Using the SDSS training data (Section 2.1) and the scheme proposed in Section 4.2, we learned 25 BSE structures (Fig. 3) by setting the number of hidden nodes n = 25 in Fig. 2 and equation (11); the optimality of n = 25 is investigated in Section 7. It is shown that the BSE structure templates characterize the local structure of spectra. Using the learned BSE structures and the convolution-pooling scheme described in this section, we can extract features from every spectrum.

Figure 3.

A visualization of BSEs when the hidden layer has 25 nodes.

For example, Fig. 4(a) presents a spectrum from SDSS; there is a stitching error near 5580 Å. In Fig. 4(b), the curve shows the convolution response vector of the third BSE structure |$W^{(1)}_{3\cdot }$| on the spectrum in Fig. 4(a); the points labelled with quadrangles are the pooling responses |$\lbrace v_q^{(3)}, q = 1, \ldots , N_{\rm p}\rbrace$| in equation (17). The results in Figs 4(b) and (c) show that the responses to the spuriously strong signal of the stitching error are suppressed.

Figure 4.

Convolution responses of a spectrum and a histogram of the maximum convolution responses in a pooling process with pool number Np = 10. (a) A normalized SDSS spectrum. (b) Convolution responses of the spectrum shown in (a) and the third BSE shown in Fig. 3. In (b), nine vertical dashed lines are boundaries of each pool (the pooling is implemented in logarithmic wavelength space, so the sizes of multiple pools are unequal in a wavelength space) and ten quadrangles indicate the maximum convolution response in every pool. (c) The statistical histogram of the maximum convolution responses in the pooling process for the third BSE in Fig. 3 on 5000 SDSS spectra (Section 2.1).

To further investigate the characteristics of the features extracted by convolution and pooling, we calculate the statistical histogram of the maximum convolution responses on 5000 SDSS spectra (Section 2.1). Every maximum convolution response |$v_q^{(j)}$| has a specific wavelength position (Fig. 4b, equation 17). The statistical histogram of the maximum convolution responses is obtained by accumulating the number of pooling responses |$v_q^{(j)}$| of the 5000 SDSS spectra at each wavelength position. Fig. 4(c) shows the statistical histogram of the maximum convolution responses corresponding to the third BSE structure |$W^{(1)}_{3\cdot }$|, and more statistical histograms of the maximum convolution responses are presented in Fig. 5. The results in Fig. 5 also show that the effects of stitching errors near 5580 Å in Fig. 4(a) are negligible in the pooling response. More experimental investigations of the stitching error are conducted in Section 8.4.

Figure 5.

The cumulative histogram of the local maximum response position for the third, fourth, sixth, 13th, 14th, 16th and 24th BSE structures. The dashed lines indicate eight significant local maximum cumulative responses of pooling operations on 5000 SDSS spectra.

The statistical histograms of the maximum convolution responses (Figs 4c and 5) on 5000 SDSS spectra also show that the pooling responses are statistically sparse: only at a few wavelength positions are there non-zero responses. The sparseness helps to trace back the physical interpretation of the extracted features.

That is to say, most of the cumulative responses are close to zero, except for a few wavelength patches with large cumulative responses. As can be seen in Fig. 4, these local maximum responses appear near wavelengths of evident fluctuation in a spectrum. In the pooling process, the responses at these wavelengths are frequently taken as spectral features. Some of the new features characterize such local structures in spectra and they are local and sparse.
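A rough sketch of how such a cumulative histogram can be accumulated is given below: for every spectrum and every BSE, the position of the maximum response in each pool is recorded and counted. It is an unoptimized, self-contained illustration with our own names, not the authors' code.

```python
import numpy as np

def max_response_histogram(spectra, bses, n_pools=10):
    """Count, per BSE and response position, how often that position yields the
    maximum convolution response in a pool (cf. the histograms in Figs 4c and 5)."""
    n, m = bses.shape
    n_spec, l = spectra.shape
    width = l - m + 1
    counts = np.zeros((n, width), dtype=int)
    edges = np.linspace(0, width, n_pools + 1).astype(int)
    for x in spectra:
        # sliding inner products of the spectrum with every BSE (cf. Section 5.1)
        z = np.stack([bses @ x[i:i + m] for i in range(width)], axis=1)
        for q in range(n_pools):
            pos = edges[q] + z[:, edges[q]:edges[q + 1]].argmax(axis=1)
            counts[np.arange(n), pos] += 1   # one count per BSE per pool per spectrum
    return counts
```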

Considering that a pooling response |$v_q^{(j)}$| is obtained by a convolution calculation (equations 17 and 14), each pooling feature is actually calculated from some spectral fluxes in a wavelength range. Therefore, we present the wavelength ranges in the second column of Table 3 for those features with large cumulative responses in Fig. 5.

Table 3. The wavelengths of the eight local-maximum cumulative responses and some potential lines near them. WPT: wavelength position of a typical local-maximum cumulative response of a pooling operation on 5000 SDSS spectra; this position is represented by a three-dimensional vector (a b c), where a, b and c are respectively the starting wavelength, central wavelength and ending wavelength, and log10 b = (log10 a + log10 c)/2. PL: potential lines probably related to the description of the detected feature. Wavelength b is the position of the local-maximum cumulative response and fluxes from the range [a, c] determine the response at b in a convolution process.

No.  WPT (Å)             PL
1    (4000 4074 4150)    Ca i, Hδ, He i
2    (4199 4276 4356)    Ca i, Hγ
3    (4233 4311 4391)    Fe ii, Na ii, O i, Fe i, Ca iii
4    (4744 4831 4922)    Fe i, O iii, Na i, Na ii, O ii, Fe ii
5    (5047 5141 5237)    Fe ii, He i, Na i, Ca iii, O iii, Fe i
6    (5757 5864 5973)    Na i, Fe i, O ii
7    (6401 6520 6641)    Hα, CaH
8    (7473 7612 7754)    Fe i, Fe ii, O iii, O i

Although the features are not extracted by identifying spectral lines, the results in Fig. 5 do show that the detected features are near some spectral lines potentially existing in a specific stellar spectrum (the third column of Table 3).

6 ESTIMATING ATMOSPHERIC PARAMETERS

As a typical non-linear learning method, the ANN has been widely used in automated estimation of stellar atmospheric parameters (Bailer-Jones 2000; Willemsen et al. 2003; Ordóñez et al. 2007, 2008; Zhao, Luo & Zhao 2008; Pan & Tu 2009; Manteiga et al. 2010; Giridhar et al. 2012) and in spectrum classification (Gulati et al. 1994; Hippel et al. 1994; Vieira & Pons 1995; Weaver & Torres-Dodgen 1995; Schierscher & Paunzen 2011). For example, Manteiga et al. (2010) extracted spectral features using the fast Fourier transform (FFT) and the discrete wavelet transform (DWT) and estimated the parameters Teff, log g, [Fe/H] and [α/Fe] with a one-hidden-layer ANN fed with the FFT and wavelet coefficients. Giridhar et al. (2012) studied how to estimate atmospheric parameters directly from spectra with a BP neural network. Pan & Tu (2009) proposed to parametrize a stellar spectrum using an ANN fed with a few PCA features.

In this article, we use BP networks to learn the mapping from the extracted DCP features (Section 5) to the stellar parameters Teff, log g and [Fe/H]. Training a BP network is an iterative process; in each iteration the estimation errors are calculated on two sets, the training set and the validation set. A BP network optimizes its parameters by minimizing the difference between the network's output and the expected output (e.g. the stellar atmospheric parameters) according to the estimation errors on the training set, and the process stops when the estimation errors on the validation set show no improvement over successive iterations. This helps to avoid overfitting.

In a BP network, there are two preset parameters: the number of hidden layers and the number of nodes in each hidden layer. For convenience, the former is denoted by |$n^{\rm BP}_{\rm hl}$| and the latter by |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$|. In this work, we investigated the cases |$n^{\rm BP}_{\rm hl} = 1$| and |$n^{\rm BP}_{\rm hl} = 2$| for computational feasibility. If |$n^{\rm BP}_{\rm hl} = 1$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| is a positive integer. If |$n^{\rm BP}_{\rm hl} = 2$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| is a two-dimensional row vector consisting of the numbers of nodes in the first and second hidden layers.

Suppose that S = {(x, y)} is a data set, where x represents the information of a spectrum and y is an atmospheric parameter. In this work, the accuracy of a spectral parametrization f(·) on a data set S is evaluated by the mean absolute error (MAE):
${\rm MAE} = \dfrac{1}{N}\sum\limits_{(\boldsymbol{x},\, y)\in S} \big|f(\boldsymbol{x}) - y\big|, \qquad (19)$
where N is the number of samples in S.
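As a rough stand-in for the BP network described above, scikit-learn's MLPRegressor (a multilayer perceptron trained by back-propagation with early stopping on an internal validation split) can sketch the idea; it is not the authors' implementation. The hidden-layer sizes (16, 4) follow the log g configuration of Table 4, and all names here are ours.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_and_evaluate(dcp_train, y_train, dcp_test, y_test):
    """Fit an MLP on DCP features and report the MAE of equation (19)."""
    model = MLPRegressor(hidden_layer_sizes=(16, 4),   # two hidden layers, cf. Table 4
                         early_stopping=True,          # stop when validation error stalls
                         validation_fraction=0.25,
                         max_iter=2000,
                         random_state=0)
    model.fit(dcp_train, y_train)
    mae = np.mean(np.abs(model.predict(dcp_test) - y_test))
    return model, mae
```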

7 OPTIMIZING THE CONFIGURATION

The proposed scheme consists of four steps (Fig. 1, Section 3). In the second step, ‘Learning BSE’, there are four parameters, λ, β, ρ and n, to be preset, where n denotes the number of nodes in the hidden layer of an autoencoder (Fig. 2). In the third step, ‘Extract features by convolution and pooling’, there is a preset parameter, Np, representing the number of pools (Section 5). In the estimation method, BP network, there are two preset parameters |$n^{\rm BP}_{\rm hl}$| and |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| (Section 6).

To optimize the configuration of the proposed scheme, the spectrum-parametrization scheme was estimated from a training set from SDSS. The performance of the estimated spectrum-parametrization scheme was evaluated on a validation set from SDSS (Section 2).

Therefore, the optimal configuration
$\big(\hat{\lambda}, \hat{\beta}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}, \hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}\big)_{\rm ap} \qquad (20)$
can be found by
$\big(\hat{\lambda}, \hat{\beta}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}, \hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}\big)_{\rm ap} = \mathop{\arg\min}\limits_{\lambda,\, \beta,\, \rho,\, n,\, N_{\rm p},\, n^{\rm BP}_{\rm hl},\, \boldsymbol{n}^{\rm BP}_{\rm nhl}} {\rm MAE}\big(\lambda, \beta, \rho, n, N_{\rm p}, n^{\rm BP}_{\rm hl}, \boldsymbol{n}^{\rm BP}_{\rm nhl};\ {\rm ap}\big), \qquad (21)$
where ap denotes log Teff, log g or [Fe/H] and MAE is the prediction error on the SDSS validation set. The spectral parametrization is learned from a SDSS training set with a specific configuration of λ, β, ρ, n, Np, |$n^{\rm BP}_{\rm hl}$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| and ap.

Because there is no analytical expression for the objective function in equation (21), we can, in theory, obtain the optimal configuration |$(\hat{\lambda }, \hat{\beta }, \hat{\rho }, \hat{n}, \hat{N_{\rm p}}, \hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol {n}}^{\rm BP}_{\rm nhl})_{\rm ap}$| by repeating the four procedures of the proposed scheme (Fig. 1, Section 3) with every possible configuration of |$\lambda , \beta , \rho , n, N_{\rm p}, n^{\rm BP}_{\rm hl}, \boldsymbol {n}^{\rm BP}_{\rm nhl}$| and choosing the one with the minimal MAE as the optimal configuration.

However, this exhaustive optimization scheme is computationally infeasible. Therefore, instead of seeking an optimal configuration, we propose to find an acceptable configuration, i.e. a suboptimal solution.

To find a suboptimal solution, we restrict the search ranges for λ, β, ρ, n and Np empirically, as follows:
(22)
(23)
(24)
(25)
and |$\hat{\beta } =3$|⁠, where RRλ represents a restricted search range for λ; the other symbols RRρ, RRn and |$RR_{N_{\rm p}}$| are defined similarly. The suboptimal solutions of λ, ρ, n and Np can be found by optimizing
$\big(\hat{\lambda}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}\big)_{\rm ap} = \mathop{\arg\min}\limits_{\lambda\in RR_{\lambda},\ \rho\in RR_{\rho},\ n\in RR_{n},\ N_{\rm p}\in RR_{N_{\rm p}}} {\rm MAE}\big(\lambda, \hat{\beta}, \rho, n, N_{\rm p}, \tilde{n}^{\rm BP}_{\rm hl}, \tilde{\boldsymbol{n}}^{\rm BP}_{\rm nhl};\ {\rm ap}\big), \qquad (26)$
based on the framework in Fig. 1, where |$n^{\rm BP}_{\rm hl}$| and |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| are initialized with |$\tilde{n}_{\rm hl} =1$| and |$\tilde{\boldsymbol {n}}_{\rm nhl} = 6$|. The |$\tilde{n}_{\rm hl}$| and |$\tilde{\boldsymbol {n}}_{\rm nhl}$| are determined empirically, balancing computational burden against estimation accuracy. The configurations obtained are presented in Table 4.
Table 4. The suboptimal configuration obtained for the proposed scheme (Fig. 1, Section 3). $\hat{\lambda}$, $\hat{\beta}$, $\hat{\rho}$ and $\hat{n}$ are the optimized values of the four parameters in the second step, 'Learning BSE' (Fig. 1), where $\hat{n}$ denotes the number of nodes in the hidden layer (Fig. 2). $\hat{N}_{\rm p}$ represents the number of pools (Section 5) in the third step, 'Extract features by convolution and pooling' (Fig. 1). $\hat{n}^{\rm BP}_{\rm hl}$ and $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ are the optimized values of the two preset parameters of the estimation method, the BP network (Section 6).

log Teff: $\hat{\lambda}$ = 0.0074, $\hat{\beta}$ = 3, $\hat{\rho}$ = 0.0050, $\hat{n}$ = 25, $\hat{N}_{\rm p}$ = 10, $\hat{n}^{\rm BP}_{\rm hl}$ = 2, $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ = (14, 10)
log g:    $\hat{\lambda}$ = 0.0087, $\hat{\beta}$ = 3, $\hat{\rho}$ = 0.114,  $\hat{n}$ = 25, $\hat{N}_{\rm p}$ = 12, $\hat{n}^{\rm BP}_{\rm hl}$ = 2, $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ = (16, 4)
[Fe/H]:   $\hat{\lambda}$ = 0.0087, $\hat{\beta}$ = 3, $\hat{\rho}$ = 0.114,  $\hat{n}$ = 25, $\hat{N}_{\rm p}$ = 15, $\hat{n}^{\rm BP}_{\rm hl}$ = 2, $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ = (22, 6)
Based on the configurations of λ, β, ρ, n and Np, the suboptimal solutions of |$n^{\rm BP}_{\rm hl}$| and |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| are computed by
$\big(\hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}\big)_{\rm ap} = \mathop{\arg\min}\limits_{n^{\rm BP}_{\rm hl},\ \boldsymbol{n}^{\rm BP}_{\rm nhl}} {\rm MAE}\big(\hat{\lambda}, \hat{\beta}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}, n^{\rm BP}_{\rm hl}, \boldsymbol{n}^{\rm BP}_{\rm nhl};\ {\rm ap}\big), \qquad (27)$
where
If |$n^{\rm BP}_{\rm hl} = 1$|⁠, then
If |$n^{\rm BP}_{\rm hl} = 2$|⁠, then
where
(28)
represents possible numbers of nodes in a hidden layer of a network investigated in this work. The computed |$(\hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol {n}}^{\rm BP}_{\rm nhl})$| are presented in Table 4.

Note, however, that the restricted search ranges are selected empirically based on performance on the validation set; the solutions obtained are therefore usually not global minima, but acceptable values obtained with a feasible computational burden. If we expanded the restricted ranges, a better configuration could possibly be found at greater computational cost.
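A hedged sketch of this restricted grid search follows. The candidate lists are placeholders of our own choosing (the actual restricted ranges are given in equations 22-25 and are not reproduced here), and the evaluation callback stands for one full run of BSE learning, feature extraction and BP training on the training set, scored on the validation set.

```python
import itertools
import numpy as np

def grid_search(candidates, evaluate):
    """Pick the configuration with the smallest validation MAE.

    candidates : dict mapping parameter name -> list of candidate values
                 (placeholders for the restricted ranges RR of Section 7).
    evaluate   : callable(config dict) -> validation MAE for that configuration.
    """
    best_cfg, best_mae = None, np.inf
    keys = list(candidates)
    for values in itertools.product(*(candidates[k] for k in keys)):
        cfg = dict(zip(keys, values))
        mae = evaluate(cfg)
        if mae < best_mae:
            best_cfg, best_mae = cfg, mae
    return best_cfg, best_mae

# Placeholder candidate lists; the true restricted ranges are those of
# equations (22)-(25) in the paper, not these illustrative values.
candidates = {'lambda': [0.005, 0.0074, 0.0087, 0.01],
              'rho': [0.005, 0.05, 0.114],
              'n': [16, 25, 36],
              'N_p': [8, 10, 12, 15]}
```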

8 EXPERIMENTS AND DISCUSSION

8.1 Performance for SDSS spectra

From the training set and the validation set consisting of SDSS spectra (Section 2), we obtain a spectral parametrization using the proposed scheme (Section 3, Fig. 1) and the proposed optimization scheme (Section 7, Table 4).

On a SDSS test set, the MAEs of this spectral parametrization are 0.0060 dex for log Teff, 0.1978 dex for log g and 0.1770 dex for [Fe/H] (the row with index 1 in Table 5). Similarly, on real spectra from SDSS, Fiorentin et al. (2007) investigated the stellar parameter estimation problem using PCA and an ANN and obtained accuracies of 0.0126 dex for log Teff, 0.3644 dex for log g and 0.1949 dex for [Fe/H] on a SDSS test set based on 50 PCA features. Liu et al. (2014) studied the spectrum-parametrization problem using the least absolute shrinkage and selection operator (LASSO) algorithm and the support vector regression (SVR) method; their optimal MAEs are 0.0094 dex for log Teff, 0.2672 dex for log g and 0.2728 dex for [Fe/H].

Table 5. Performance of the proposed scheme on test data sets.

Index  Data source        log Teff (dex)  log g (dex)  [Fe/H] (dex)
(a) Performance of the proposed scheme. More details are presented in Sections 8.1 and 8.2.
1      SDSS spectra       0.0060          0.1978       0.1770
2      Synthetic spectra  0.0004          0.0145       0.0070
(b) Rationality of deleting some data components in application (Section 8.3).
3      SDSS spectra       0.0080          0.2994       0.2163
(c) Robustness to the stitching error (Section 8.4).
4      SDSS spectra       0.0063          0.2371       0.1827

8.2 Performance for synthetic spectra

To investigate the effectiveness of the proposed scheme further, we also evaluated it on synthetic spectra. The synthetic spectra used in this work and those used in Fiorentin et al. (2007) are all calculated from Kurucz's models. The synthetic data set is described in Section 2.2. This experiment shares the same configuration as the experiment on SDSS data (Section 8.1), and the BP estimator is trained on the synthetic training set (Section 2).

On the synthetic test set, the MAEs of the spectral parametrization learned from the synthetic training set are 0.0004 dex for log Teff, 0.0145 dex for log g and 0.0070 dex for [Fe/H] (the row with index 2 in Table 5). In Fiorentin et al. (2007), the best consistencies on synthetic spectra are obtained with 100 principal components, with MAEs of 0.0030 dex for log Teff, 0.0251 dex for log g and 0.0269 dex for [Fe/H] (table 1 in Fiorentin et al. 2007). Using the LASSO algorithm and the SVR method, Li et al. (2014) reached MAEs of 0.0008 dex for log Teff, 0.0179 dex for log g and 0.0131 dex for [Fe/H]. On synthetic spectra, the mean absolute errors in Manteiga et al. (2010) are 0.07 dex for log g and 0.06 dex for [Fe/H].

8.3 Effective data components, unwanted influences and their balances

The proposed scheme extracts features by throwing away some information. In this procedure, it is very probable that some useful (at least in theory) components are discarded. Is this positive or negative?

Actually, besides the useful data components in spectra, there are also redundancy and/or noise and pre-processing imperfections (e.g. sky lines and cosmic-ray removal residuals, residual calibration defects). Redundancy means that some components of the data duplicate one another, and different components usually carry different amounts of duplication. Therefore, redundancy can distort the weights given to different components, which usually results in an erroneous evaluation and reduces the quality of learning. Noise and pre-processing imperfections can mask the effects of some important spectral information, e.g. weak lines. Therefore, in theory, it is possible to improve the estimates of atmospheric parameters by throwing away some data components.

We conducted an experiment to investigate this theoretical possibility. In this experiment, we estimate atmospheric parameters after deleting procedures 2 and 3 (Fig. 1) from the experiments in Section 8.1. That is to say, we estimated the atmospheric parameters using a BP network directly from a spectrum, without throwing away any information in a pooling procedure; the BP estimation network shared the same configuration as in the experiment of Section 8.1. The results are presented in Table 5, part (b) (the row with index 3). Comparing the experiments with indexes 1 and 3 in Table 5 shows no evidence that spectrum-parametrization performance is lost when some data components are discarded by the proposed scheme.

8.4 Effects of band stitching errors on parameter estimation

There is a significant oscillation in some SDSS spectra near 5580 Å (e.g. Fig. 4a), which is caused by errors in stitching the red and blue bands. Figs 4(b), 4(c) and 5 show that the response to this stitching error is clearly reduced by the convolution and pooling procedures.

We also performed an experiment to analyse the effect of this 'noise' on the performance of the proposed scheme. In this experiment, the BP estimator is trained on features that exclude the responses related to stitching errors. In the SDSS data, the stitching error appears in the fifth pool in our experiments, so we removed the pooling response of the fifth pool for every BSE. The new MAEs of the three parameters on the SDSS test set are presented in Table 5, part (c).
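As an illustration of this feature-masking step, the sketch below removes one pooled response per BSE before training. It assumes (our assumption, not a statement about the original code) that the feature vector is laid out as one contiguous block of n_pools pooled responses per BSE; the names and sizes are hypothetical.

import numpy as np

def drop_pool(features, n_bses, n_pools, bad_pool=4):
    """Remove the response of one pool (0-based index; 4 = the fifth pool,
    which contains the ~5580 A stitching region) from every BSE block."""
    blocks = features.reshape(n_bses, n_pools)   # one row of pooled responses per BSE
    kept = np.delete(blocks, bad_pool, axis=1)   # discard the affected pool
    return kept.reshape(-1)

# Hypothetical example: with 50 BSEs and 10 pools, the 500-dimensional feature
# vector becomes 450-dimensional; the BP estimator is then retrained on it.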

In the wavelength range containing the stitching error there are both useful components and disturbances from noise and stitching errors. The experiments show that, in the proposed scheme, the contribution of the useful components outweighs that of the disturbances: retaining the affected pool (index 1 in Table 5) gives smaller MAEs than removing it (index 4).

9 CONCLUSION

In this work, we propose a novel scheme for spectral feature extraction and stellar atmospheric parameter estimation. In commonly used methods such as PCA and Fourier transforms, every feature is obtained by incorporating nearly all of the fluxes of a spectrum. In contrast, the characteristics of the proposed scheme are localization and sparseness of the extracted features. 'Localization' means that each extracted feature is calculated from spectral fluxes within a local wavelength range (Fig. 4b); 'sparseness' means that the atmospheric parameters can be estimated using a small number of features (Fig. 4c). Localization and sparseness together imply that many data components are thrown away, especially redundant and weakly informative components (Section 8.3); such weakly informative components are, in any case, easily corrupted by noise and pre-processing imperfections (Section 8.3). It is shown that the proposed scheme has excellent performance in estimating atmospheric parameters (Sections 8.1 and 8.2, Table 5).

The authors thank the reviewer and editor for their instructive comments and extend their thanks to Professor Ali Luo for his support and invaluable discussions. This work is supported by the National Natural Science Foundation of China (grant Nos 61273248, 61075033 and 61202315), the Natural Science Foundation of Guangdong Province (2014A030313425, S2011010003348), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (201001060) and the high-performance computing platform of South China Normal University.

1 Here, 'tr_ae' is an abbreviation for the training set of an autoencoder.

REFERENCES

Abazajian K. N., et al., 2009, ApJS, 182, 543
Ahn C. P., et al., 2012, ApJS, 203, 21
Andrew N., Ngiam J., Foo C. Y., Mai Y., Suen C., 2010, UFLDL Tutorial
Ashish G., Murat A., Anthony M., 2013, in Dasgupta S., McAllester D., eds, Proc. 30th International Conference on Machine Learning. Microtome Publishing, Brookline, p. 987
Bailer-Jones C. A. L., 2000, A&A, 357, 197
Bailer-Jones C. A. L., Irwin M., Von Hippel T., 1998, MNRAS, 298, 361
Boureau Y., Ponce J., LeCun Y., 2010, in Fürnkranz J., Joachims T., eds, International Conference on Machine Learning. Omnipress, Madison, Wisconsin, p. 111
Castelli F., Kurucz R. L., 2003, in Piskunov N., Weiss W. W., Gray D. F., eds, IAU Symp. 210, Modelling of Stellar Atmospheres. Kluwer, Dordrecht, p. A20
Cui X., et al., 2012, Res. Astron. Astrophys., 12, 1197
Deng J., Zhang Z. X., Marchi E., Schuller B., 2013, in Guerrero J. E., ed., Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva. IEEE Computer Society, Los Alamitos, CA, p. 511
Gilmore G., et al., 2012, Messenger, 147, 25
Giridhar S., Goswami A., Kunder A., Muneer S., Kumar G. S., 2012, in Prugniel Ph., Singh H. P., eds, ASI Conf. Ser. Vol. 6. Astronomical Society of India, Bengaluru, p. 137
Gray R. O., Corbally C. J., 1994, AJ, 107, 742
Grevesse N., Sauval A. J., 1998, Space Sci. Rev., 85, 161
Gulati R. K., Gupta R., Gothoskar P., Khobragade S., 1994, ApJ, 426, 340
Guo P., Xing F., Jiang Y. G., 2004, in IEEE International Conference on Systems, Man and Cybernetics, Vol. 6. IEEE, Piscataway, NJ, p. 5894
Hinton G. E., Salakhutdinov R. R., 2006, Science, 313, 504
Hubel D. H., Wiesel T. N., 1962, J. Physiol., 160, 106
Jofre P., Panter B., Hansen C. J., Weiss A., 2010, A&A, 517, 57
Li X., Wu Q. M. J., Luo A., Zhao Y., Lu Y., Zuo F., Yang T., Wang Y., 2014, ApJ, 790, 105
Liu C. X., Zhang P. A., Lu Y., 2014, Res. Astron. Astrophys., 14, 423
Lu Y., Li C. L., Li X. R., 2012, Spectros. Spectral Anal., 32, 2583
Lu Y., Li X. R., Wang Y. J., Yang T., 2013, Spectros. Spectral Anal., 33, 2010
Luo A., et al., 2015, preprint (arXiv:1505.01570)
Mallat S. G., Zhang Z., 1993, IEEE Trans. Signal Processing, 41, 3397
Manteiga M., Ordóñez D., Dafonte C., Arcay B., 2010, PASP, 122, 608
McGurk R. C., Kimball A. E., Ivezić Ž., 2010, AJ, 139, 1261
Nagi J., et al., 2011, in ICSIPA 2011. IEEE Press, Piscataway, NJ, p. 342
Olshausen B. A., 2001, Sparse Codes and Spikes. In: Probabilistic Models of the Brain: Perception and Neural Function. MIT Press, Cambridge, MA
Ordóñez D., Dafonte C., Manteiga M., Arcay B., 2008, Lecture Notes in Computer Science, 5271, 212
Ordóñez-Blanco D., Dafonte-Vázquez C., Arcay-Varela B., Manteiga M., 2007, Lecture Notes and Essays in Astrophysics, 3, 225
Pan Y. C., Tu L. P., 2009, J. Univ. Sci. Technol. Liaoning, 32, 21
Randich S., Gilmore G., Gaia–ESO Consortium, 2013, Messenger, 154, 47
Re Fiorentin P., Bailer-Jones C. A. L., Lee Y. S., Beers T. C., Sivarani T., Wilhelm R., Allende Prieto C., Norris J. E., 2007, A&A, 467, 1373
Recio-Blanco A., Bijaoui A., De Laverny P., 2006, MNRAS, 370, 141
Schierscher F., Paunzen E., 2011, Astron. Nachr., 332, 597
Shin H.-C., Orton M., Collins D. J., Doran S., Leach M. O., 2011, in Chen X., Dillon T., Ishibuchi H., Pei J., Wang H., Wani M. A., eds, 10th International Conference on Machine Learning and Applications and Workshops, Vol. 1. IEEE Computer Society, Washington, DC, p. 259
Simoncelli E. P., Freeman W. T., Adelson E. H., Heeger D. J., 1992, IEEE Trans. Information Theory, 38, 587
Singh H. P., Manabu Y., Nawo Y., Gupta R., 2006, PASJ, 58, 177
Tan C. C., Eswaran C., 2010, Neural Computing and Applications, 19, 1069
Teh Y. W., Welling M., Osindero S., Hinton G. E., 2003, J. Machine Learning Research, 4, 1235
Vieira E. F., Pons J. D., 1995, A&AS, 111, 393
Vishnubhotla S., Fernandez R., Ramabhadran B., 2010, in Kehtarnavaz N., ed., IEEE International Conference on Acoustics, Speech and Signal Processing, Texas. IEEE Press, Piscataway, NJ, p. 4614
von Hippel T., Storrie-Lombardi L., Storrie-Lombardi M. C., Irwin M., 1994, MNRAS, 269, 97
Weaver W. B., Torres-Dodgen A. V., 1995, ApJ, 446, 300
Whitney C. A., 1983, A&AS, 51, 443
Willemsen P. G., Bailer-Jones C. A. L., Kaempf T. A., de Boer K. S., 2003, A&A, 401, 1203
Wu Y., et al., 2011, Res. Astron. Astrophys., 11, 924
Xing F., Guo P., 2006, Spectros. Spectral Anal., 26, 1368
York D. G., et al., 2000, AJ, 120, 1579
Zhang J., Wu F., Luo A., Zhao Y., 2005, Acta Astron. Sin., 46, 406
Zhang J., Wu F., Luo A., Zhao Y., 2006, Chin. Astron. Astrophys., 30, 176
Zhang J., Jiang Y., Chang K. H., Zhang S., Cai J., Hu L., 2009, Pattern Recognition Letters, 30, 1434
Zhao G., Chen Y., Shi J., Liang Y., Hou J., Chen L., Zhang H., Li A., 2006, Chin. J. Astron. Astrophys., 6, 265
Zhao J., Luo A., Zhao G., Guo M., Zhao L., Wang L., 2008, in Fourth International Conference on Natural Computation, Jinan. IEEE Computer Society, Los Alamitos, CA, p. 150