Tan Yang, Xiangru Li, An autoencoder of stellar spectra and its application in automatically estimating atmospheric parameters, Monthly Notices of the Royal Astronomical Society, Volume 452, Issue 1, 01 September 2015, Pages 158–168, https://doi.org/10.1093/mnras/stv1210
Abstract
This article investigates the problem of estimating stellar atmospheric parameters from spectra. Feature extraction is a key procedure in estimating stellar parameters automatically. We propose a scheme for spectral feature extraction and atmospheric parameter estimation using the following three procedures: firstly, learn a set of basic structure elements (BSEs) from stellar spectra using an autoencoder; secondly, extract representative features from stellar spectra based on the learned BSEs through some procedures of convolution and pooling; thirdly, estimate stellar parameters (Teff, log g, [Fe/H]) using a back-propagation (BP) network. The proposed scheme has been evaluated on both real spectra from Sloan Digital Sky Survey (SDSS)/Sloan Extension for Galactic Understanding and Exploration (SEGUE) and synthetic spectra calculated from Kurucz's new opacity distribution function (NEWODF) models. The best mean absolute errors (MAEs) are 0.0060 dex for log Teff, 0.1978 dex for log g and 0.1770 dex for [Fe/H] for the real spectra and 0.0004 dex for log Teff, 0.0145 dex for log g and 0.0070 dex for [Fe/H] for the synthetic spectra.
1 INTRODUCTION
With the development of astronomical observation technology, more and more large-scale sky survey projects have been proposed and implemented: for example, the Sloan Digital Sky Survey (SDSS: York et al. 2000; Ahn et al. 2012), the Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST/Guoshoujing Telescope: Zhao et al. 2006; Cui et al. 2012; Luo et al. 2015) and the Gaia–ESO Survey (Gilmore et al. 2012; Randich et al. 2013). Large sky survey projects can automatically observe and collect massive amounts of astronomical spectral data. However, manual processing cannot keep pace with the rapid growth of these massive, high-dimensional data. Therefore, automatic processing and analysis methods are urgently needed for astronomical data (Zhang et al. 2009). Because of this, automatic estimation of stellar atmospheric parameters from spectra is currently a hot topic (Recio-Blanco, Bijaoui & De Laverny 2006; Fiorentin et al. 2007; Jofre et al. 2010; Wu et al. 2011; Liu, Zhang & Lu 2014).
Estimating the stellar parameters (Teff, log g, [Fe/H]) from spectra can be abstracted to a process of finding a mapping from a stellar spectrum to its parameters. If there is a set of representative stellar spectra with known parameters (training data), a supervised machine-learning method can learn the mapping from training data: for example nearest neighbour (NN), artificial neural networks (ANN) and support vector machines (SVM). Based on the learned mapping, we can estimate the atmospheric parameters for a new spectrum.
An astronomical spectrum usually consists of thousands of fluxes and it is necessary to extract a small number of features to estimate stellar parameters accurately (Section 8.3). Orthogonal transforms such as wavelet transforms (WTs) and principal-component analysis (PCA) are popular choices for extracting features. They give a complete representation of the data by a set of complete orthogonal bases. By selecting a small number of transform coefficients as features, dimensionality reduction is achieved. Wavelet coefficients of a spectrum have been used to estimate stellar parameters (Manteiga et al. 2010; Lu, Li & Li 2012; Lu et al. 2013) and to classify spectra (Guo, Xing & Jiang 2004; Xing & Guo 2006). As the most popular dimension-reduction method, PCA has also been widely used in astronomical data analysis (Whitney 1983; Bailer-Jones, Irwin & Von Hippel 1998; Zhang et al. 2005, 2006; Singh et al. 2006; McGurk, Kimball & Ivezić 2010). Many parameter-estimation schemes use the projections of data/spectra on some principal components directly as spectral features and have achieved good results (Zhang et al. 2005, 2006; Singh et al. 2006; Rosalie et al. 2010).
The motivation of our research is to explore the feasibility of predicting atmospheric parameters (Teff, log g, [Fe/H]) from a small number of descriptions computed from information within a local wavelength range of a spectrum. For convenience, this kind of description is referred to as ‘local and sparse features’. This characteristic of local representation helps to trace back the potential spectral lines effective in estimation (see Section 5.3).
This work proposes a scheme based on autoencoders, convolution and pooling techniques to extract local and sparse features. Autoencoder and convolution operations give a statistical non-orthogonal decomposition, which leads to a redundant and overcomplete representation of the data. The ‘overcomplete’ representation (equation 15) of a spectrum has many more dimensions than the original spectrum (equation 1). While complete representations based on orthogonal transforms are mature and popular feature-extraction methods, redundant, ‘overcomplete’ representations (Olshausen 2001; Teh et al. 2003) have been advocated and used successfully in many fields. In the literature, it has been shown (Yee et al. 2003) that redundancy and overcompleteness help in computing some features in subsequent procedures, with improved robustness in the presence of noise (Simoncelli et al. 1992), more compactness and more interpretability (Mallat & Zhang 1993).
From the overcomplete representation, a sparse representation (equation 18) is computed using pooling and maximization operations. The pooling and maximization operations are actually a competition strategy. In this procedure, much redundancy is removed by the competition between multiple redundant components. This competition helps to restrain the copies with more noise and a robust representation is obtained. A typical advantage of this scheme is that it can express many suitable and meaningful structures in data in some applications. It is shown that this scheme does extract some meaningful local features (Section 5.3) for automatically estimating atmospheric parameters (Section 8).
The rest of this article is organized as follows: Section 2 describes the spectra used in this work. Section 3 presents the framework of the proposed scheme. Section 4 introduces the autoencoder network, the concept of basic structure elements (BSEs) and the BSE learning method using autoencoders. Sections 5 and 6 describe the proposed feature-extraction method based on BSEs and the estimation method, a back-propagation (BP) network, respectively. Section 7 investigates the optimization of the proposed scheme. Section 8 reports some experimental evaluations and discusses the rationality and robustness of the proposed scheme. Section 9 concludes this work.
2 DATA SETS AND PREPROCESSING
2.1 Real spectra
In this article, we use two spectral sets to evaluate the proposed scheme. 5000 stellar spectra selected from SDSS-DR7 (Abazajian et al. 2009) compose the real spectrum set. Each spectrum has 3000 fluxes in a logarithmic wavelength range [3.6000, 3.8999] with a sampling resolution of 0.0001. The ranges of the three parameters are [4163, 9685] K for Teff, [1.260, 4.994] dex for log g and [−3.437, 0.1820] dex for [Fe/H].
2.2 Synthetic spectra
A set of 18 969 synthetic spectra is generated with Kurucz's new opacity distribution function (NEWODF) models (Castelli & Kurucz 2003) using the package spectrum (Gray & Corbally 1994) and 830 828 atomic and molecular lines. The solar atomic abundances we used are derived from Grevesse & Sauval (1998). The grids of the synthetic stellar spectra span the parameter ranges [4000, 9750] K in Teff (45 values, step size 100 K between 4000 and 7500 K and 250 K between 7750 and 9750 K), [1, 5] dex in log g (17 values, step size 0.25 dex) and [−3.6, 0.3] dex in [Fe/H] (27 values, step size 0.2 between −3.6 and −1 dex, 0.1 between −1 and 0.3 dex).
2.3 Preprocessing and data partitioning
The proposed scheme of this work belongs to the class of statistical learning methods. The fundamental idea is to discover the predictive relationships between stellar spectra and the atmospheric parameters Teff, log g and [Fe/H] from empirical data, which constitute a training set. At the same time, the performance of the predictive relationships discovered should also be evaluated objectively. Therefore, a separate, independent set of stellar spectra is needed for this evaluation, usually called a test set in machine learning. However, most learning methods tend to overfit the empirical data. In other words, statistical learning methods can unravel some alleged relationships from the training data that do not hold in general. In order to avoid overfitting, we need a third independent spectrum set for optimizing the parameters (Section 7) of the framework that need to be adjusted objectively when investigating the potential relationships. This third spectrum set and its reference parameters constitute the validation set.
Therefore, in each experiment, we split the total spectrum samples into three subsets: a training set (60 per cent), a validation set (20 per cent) and a test set (20 per cent). The training set is the carrier of knowledge and the proposed scheme should learn from this set. The validation set is a mentor/instructor of the proposed scheme that can independently and objectively give some advice to the learning process. The training set and the validation set are used in establishing a spectral parametrization, while the test set acts as a referee to evaluate the performance of the established spectral parametrization objectively. The roles of these three subsets are listed in Table 1.
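A minimal sketch of this random 60/20/20 partition is shown below; the function name and the use of NumPy are our own illustrative choices, not part of the original pipeline.

```python
import numpy as np

def split_dataset(n_samples, seed=0):
    """Randomly split sample indices into 60% training, 20% validation and 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_tr = int(0.6 * n_samples)
    n_va = int(0.2 * n_samples)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Example: the 5000 SDSS spectra of Section 2.1 -> 3000/1000/1000 indices.
train_idx, val_idx, test_idx = split_dataset(5000)
```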
Table 1. Roles of the training, validation and test sets.

| Data set | Roles |
|---|---|
| Training set | (1) Generate spectral patches to construct a training set for an autoencoder (Subsection 4.2). (2) Learn in optimizing the configuration (Section 7). (3) Train the BP network (Section 6). |
| Validation set | Evaluate the performance of the learned spectral parametrization in optimizing the configuration (Section 7). |
| Test set | Evaluate the performance of the proposed scheme (Section 8). |
3 FRAMEWORK
There are multiple procedures in our scheme; the flowchart in Fig. 1 illustrates the end-to-end flow.

Figure 1. A flowchart showing the order of procedures in the proposed scheme. The procedure ‘Learning BSE’ is used only in the investigation/training stage. In the application/test stage, features are extracted by convolution and pooling, after normalizing the spectra, using the BSEs learned during training.
Overall, our work can be divided into two stages: (1) a research stage and (2) an application stage. These two stages can be implemented automatically based on the flowchart in Fig. 1. As shown in this flowchart, in the research stage, we can obtain an optimized configuration for the proposed scheme and a spectral parametrization by which we can map a spectrum approximately to its atmospheric parameters. In the application stage, we can compute the atmospheric parameters from a spectrum. More about optimization of the configuration is discussed in Section 7.
This work uses multiple acronyms and notations (AN). To facilitate readability, we summarize them in Table 2.
Table 2. Acronyms and notations (AN) used in this work, with their meanings and first appearances.

| AN | Meaning | First appearance | AN | Meaning | First appearance |
|---|---|---|---|---|---|
| |$a^{(l)}_j(\boldsymbol {x})$| | the output of the jth node in the lth layer for an input x | Section 4.1 | ap | an atmospheric parameter log Teff, log g or [Fe/H] | Section 7 |
| |$b^{(l)}_j$| | the bias of the jth node in the (l + 1)th layer | Section 4.1 | β | a regularization parameter in the objective function J(·, ·) | Section 4.1 |
| BSE | basic structure element | Abstract | DCP | description by convolution and pooling | Section 5.2 |
| f(·) | an estimation method | Section 6 | g(·) | an activation function of an autoencoder | Section 4.1 |
| hW, b | the mapping of an autoencoder | Section 4.1 | J(·, ·) | the objective function of an autoencoder | Section 4.1 |
| KL( · ∥ · ) | relative entropy of the average output of a hidden node and its expected output | Section 4.1 | λ | a weight decay parameter in the objective function J(·, ·) | Section 4.1 |
| |$\hat{\lambda }$| | optimized value for λ; |$\hat{\beta }$|, |$\hat{\rho }$|, |$\hat{n}$|, |$\hat{N}_p$|, |$\hat{n}^{\rm BP}_{\rm hl}$| and |$\hat{\boldsymbol {n}}^{\rm BP}_{\rm nhl}$| are defined similarly | Section 7 | m | number of nodes in the input layer of an autoencoder | Section 4.1 |
| n | number of nodes in the hidden layer of an autoencoder | Section 4.1 | |$n^{\rm BP}_{\rm hl}$| | number of hidden layers in a BP network | Section 6 |
| |$\tilde{n}^{\rm BP}_{\rm hl}$| | initialized value of |$n^{\rm BP}_{\rm hl}$| | Section 7 | |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| | number of nodes in the hidden layers of a BP network | Section 6 |
| |$\tilde{\boldsymbol {n}}^{\rm BP}_{\rm nhl}$| | an initialization of |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| | Section 7 | N | number of samples in a data set S or Str, depending on context | Section 4.1 |
| Np | number of pools | Section 5.2 | ρ | a desired activation level | Section 4.1 |
| |$\bar{\rho }_j$| | the average activation of the jth hidden node of an autoencoder | Section 4.1 | RRλ | restricted (search) range for λ; RRβ, RRρ, RRn, |$RR_{{N}_p}$|, |$RR_{{n}^{\rm BP}_{\rm hl}}$| and |$RR_ {{\boldsymbol {n}}^{\rm BP}_{\rm nhl}}$| are defined similarly | Section 7 |
| S | a data set with spectra and atmospheric parameters | Section 6 | SBSE | a set of BSEs | Section 4.2 |
| Str | a training set of stellar spectra | Section 4.2 | |$S^{\rm tr\_ae}$| | a training set for an autoencoder, consisting of spectral patches | Section 4.1 |
| |$v^{(j)}_q$| | the maximum convolution response with the jth BSE in the qth pool | Section 5.2 | w | description by convolution and pooling (DCP) | Section 5.2 |
| |$W^{(1)}_{j\cdot }$| | the jth BSE | Section 4.1 | |$W^{(l)}_{ji}$| | a weight between the ith node in the lth layer and the jth node in the (l + 1)th layer | Section 4.1 |
| x | a spectrum or a spectral patch for an autoencoder, depending on context | Section 2.3 | |$\bar{\boldsymbol {x}}$| | a normalized spectrum | Section 2.3 |
| y | output of an autoencoder | Section 4.1 | z | a convolution response of a spectrum with BSEs | Section 5.1 |
4 LEARNING BSEs USING AN AUTOENCODER
4.1 An autoencoder
An autoencoder is a special kind of ANN, initially proposed as a data dimensionality reduction scheme (Hinton & Salakhutdinov 2006) and now widely used in image analysis (Tan & Eswaran 2010; Shin et al. 2011) and speech processing (Vishnubhotla, Fernandez & Ramabhadran 2010; Deng et al. 2013).

Figure 2. A framework of an autoencoder. The number of output nodes is equal to the number of input nodes. The learning objective of the network is to make the output hW, b(x) as close as possible to the input x.
By setting the number of hidden nodes to be far smaller than that of input nodes, an autoencoder can achieve dimension reduction. For example, when an autoencoder has 200 input nodes and 20 hidden nodes (equivalently, m = 200 and n = 20 in Fig. 2), the original 200-dimensional input could be ‘reconstructed’ approximately from the 20-dimensional output of the hidden layer. If we use the output of the hidden layer as a representation of an input of the network, the autoencoder plays the role of a feature extractor.
To introduce the implementation details of the autoencoder, we utilize the notations in an online tutorial ‘UFLDL Tutorial’ (Andrew et al. 2010).
In the network of Fig. 2, let the layer labels be 1 for the input nodes, 2 for the hidden nodes and 3 for the output nodes and suppose |$W^{(l)}_{ji}$| (l = 1, 2) represents a weight between the ith node in the lth layer and the jth node in the (l + 1)th layer, |$b^{(l)}_j$| (l = 1, 2) is the bias of the jth node in the (l + 1)th layer and |$a^{(l)}_j(\boldsymbol {x})$| (l = 1, 2, 3) is the output of the jth node in the lth layer for input x = (x1, …, xm)T. Then, |$a^{(1)}_i(\boldsymbol {x}) = x_i$| for the input nodes, |$a^{(3)}_k(\boldsymbol {x}) = y_k$| for the output nodes.
In equation (8), the first term represents an empirical error evaluation between the actual output and the expected output of an autoencoder and ensures a good reconstruction performance of the network. The second term, a regularization term of |$W^{(l)}_{ji}$|, is used to overcome possible overfitting to the training set by reducing the scheme's complexity.
The third term with weighted coefficient β is a penalty term for sparsity in outputs of the hidden layer. |$\mathbf {KL}(\rho \!\parallel\! \bar{\rho }_j)$| is the relative entropy of the average output, |$\bar{\rho }_j$|, and a desired activation level ρ. |$\mathbf {KL}(\rho \!\parallel\! \bar{\rho }_j)$| increases monotonically with increasing distance between |$\bar{\rho }_j$| and ρ and encourages the average activation, |$\bar{\rho }$|, of the hidden layer to be close to a desired average activation ρ.
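Although equation (8) itself is not reproduced here, the three terms described above correspond to the standard sparse-autoencoder objective of the UFLDL tutorial; the following form is our reconstruction under that assumption, and the normalization conventions may differ slightly from the paper's equation (8):
$$
J(W,b)=\frac{1}{N}\sum_{\boldsymbol{x}\in S^{\rm tr\_ae}}\frac{1}{2}\left\Vert h_{W,b}(\boldsymbol{x})-\boldsymbol{x}\right\Vert^{2}
+\frac{\lambda}{2}\sum_{l=1}^{2}\sum_{j}\sum_{i}\left(W^{(l)}_{ji}\right)^{2}
+\beta\sum_{j=1}^{n}\mathbf{KL}\!\left(\rho\,\Vert\,\bar{\rho}_{j}\right),
$$
where
$$
\mathbf{KL}(\rho\,\Vert\,\bar{\rho}_{j})=\rho\log\frac{\rho}{\bar{\rho}_{j}}+(1-\rho)\log\frac{1-\rho}{1-\bar{\rho}_{j}}.
$$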
4.2 Learning BSEs
BSEs are a set of templates of spectral patches over a limited wavelength range. Using the BSEs, we can extract local and sparse features (Section 5). This work studies the feasibility of learning BSEs through an autoencoder for automatic estimation of atmospheric parameters. In an autoencoder, the weighted sum |$\sum ^{m}_{i=1}W^{(1)}_{ji}a^{(1)}_i(\boldsymbol {x})+b^{(1)}_j$| in the jth hidden node is essentially a projection of an input x on the vector |$W^{(1)}_{j\cdot }=(W^{(1)}_{j1},W^{(1)}_{j2},\ldots ,W^{(1)}_{jm})^{\rm T}$| of weights, analogous to the coefficient of a vector with respect to a basis vector in a coordinate system. Thus, the |$W^{(1)}_{j\cdot }$| can be regarded as ‘bases’ for the input data and in this article we call each of them a basic structure element (BSE), where j = 1, …, n.
To learn BSEs through an autoencoder, a training set |$S^{\rm tr\_ae}$| was constructed to represent the local information of a stellar spectrum. Let Str be a training set of stellar spectra (Section 2). The BSE training set |$S^{\rm tr\_ae}$| is constructed from spectral training set Str in the following way:
randomly select a spectrum, x = (x1, …, xl), from Str, where l > 0 represents the number of fluxes of a spectrum;
randomly generate an integer j satisfying 1 ≤ j ≤ l − m + 1, where m > 0 represents the dimension of a sample in |$S^{\rm tr\_ae}$| and is consistent with the number of input nodes of the autoencoder in Fig. 2;
take (xj, xj + 1, …, xj + m − 1)T as a sample of |$S^{\rm tr\_ae}$|.
By repeating the above three procedures, we obtain a BSE training set |$S^{\rm tr\_ae}$|. Therefore, |$S^{\rm tr\_ae}$| actually consists of a series of spectral patches. Considering the typical widths of lines in a stellar spectrum, m is empirically taken as 81 in this work. In the proposed scheme, 100 000 such patches are generated to constitute the BSE training set. These patches are not tied to a specific wavelength position and therefore |$S^{\rm tr\_ae}$| expresses the general structure of all spectral patches of length m.
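A minimal sketch of this sampling procedure is given below; the function name and array shapes are illustrative, assuming the training spectra are stored as the rows of a NumPy array.

```python
import numpy as np

def build_bse_training_set(spectra, m=81, n_patches=100_000, seed=0):
    """Build S^tr_ae by sampling random length-m patches from the training spectra.

    spectra : 2-D array of shape (N_spectra, l), each row a normalized spectrum.
    Returns an array of shape (n_patches, m).
    """
    rng = np.random.default_rng(seed)
    n_spectra, l = spectra.shape
    patches = np.empty((n_patches, m))
    for k in range(n_patches):
        i = rng.integers(n_spectra)           # pick a spectrum at random
        j = rng.integers(l - m + 1)           # pick a starting flux index
        patches[k] = spectra[i, j:j + m]      # the spectral patch (x_j, ..., x_{j+m-1})
    return patches
```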
5 FEATURE EXTRACTION
This work proposes to extract features by performing convolution and pooling operations on the learned BSEs and a stellar spectrum.
5.1 Convolution
In this work, a BSE structure template has m = 81 components (equation 12) and a SDSS spectrum is represented with l = 3000 fluxes. Therefore, there are l − m + 1 = 2920 convolution responses for any one BSE structure template and spectrum. From the n = 25 BSE structure templates in equation (11), we obtain (l − m + 1) × n = 2920 × 25 convolution responses (in equation 15) for every spectrum.
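The sliding-window computation can be sketched as follows; it returns the (l − m + 1) × n = 2920 × 25 responses for one SDSS spectrum. Whether the responses are additionally passed through the autoencoder activation g(·) and biases (cf. equation 14) is not restated above, so this sketch keeps only the raw sliding dot products.

```python
import numpy as np

def convolution_responses(spectrum, bses):
    """Convolution responses of one spectrum with every BSE.

    spectrum : 1-D array of l fluxes (l = 3000 for the SDSS spectra used here).
    bses     : 2-D array of shape (n, m); row j is the BSE W^(1)_{j.} (m = 81).
    Returns z with shape (n, l - m + 1): one response curve per BSE.
    """
    n, m = bses.shape
    z = np.empty((n, spectrum.shape[0] - m + 1))
    for j in range(n):
        # sliding dot product of the spectrum with the j-th BSE
        z[j] = np.correlate(spectrum, bses[j], mode='valid')
    return z
```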
5.2 Pooling
5.3 Discussion on convolution and pooling
Pooling is a non-linear downsampling operation that combines the responses of feature detectors at nearby locations into some statistical variables. The idea traces back to Hubel and Wiesel's work on complex cells in the visual cortex (Hubel & Wiesel 1962) and is now used in a large number of applications, for example computer vision and speech recognition. Theoretical analysis (Boureau, Ponce & LeCun 2010) suggests that the pooling method works well when the features are sparse. In image- and speech-processing fields, convolution and pooling based on local bases have been proven to be an effective feature-extraction method (Nagi et al. 2011; Shin et al. 2011; Ashish, Murat & Anthony 2013).
In theory, the convolution operation can reduce negative effects from noise and the pooling operation can restrain the negative influence of some imperfections (e.g. sky lines, cosmic-ray removal residuals and residual calibration defects) through competition between the multiple convolution responses of the learned BSEs with a spectrum (the maximization procedure). Some effects of convolution and pooling are investigated in the remainder of this subsection and further discussed in Sections 8.3 and 8.4 based on experimental results.
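A sketch of the pooling step described above: the responses of each BSE are divided into Np contiguous pools (equal-sized in the log-wavelength pixel index, since SDSS spectra are sampled uniformly in log wavelength) and only the maximum response per pool is kept; concatenating the n × Np maxima gives the DCP feature vector w. The exact pool boundaries used in the paper are not restated here, so the equal-sized split below is an assumption.

```python
import numpy as np

def max_pooling(z, n_pools):
    """Max-pool each BSE's convolution responses into n_pools pools.

    z : array of shape (n, l - m + 1), e.g. from convolution_responses().
    Returns the DCP feature vector w of length n * n_pools.
    """
    n, length = z.shape
    edges = np.linspace(0, length, n_pools + 1, dtype=int)   # contiguous pool boundaries
    v = np.empty((n, n_pools))
    for q in range(n_pools):
        v[:, q] = z[:, edges[q]:edges[q + 1]].max(axis=1)    # v_q^{(j)} for all j
    return v.ravel()
```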
Using the SDSS training data (Section 2.1) and the scheme proposed in Section 4.2, we learned 25 BSE structures (Fig. 3) by setting the number of hidden nodes n = 25 in Fig. 2 and equation (11); the optimality of n = 25 is investigated in Section 7. It is shown that the BSE structure templates characterize the local structure of spectra. Using the learned BSE structures and the convolution-pooling scheme described in this section, we can extract features from every spectrum.
Figure 3. The 25 BSE structure templates learned from the SDSS training spectra.
For example, Fig. 4(a) presents a spectrum from SDSS; there is a stitching error near 5580 Å. In Fig. 4(b), the curve shows the convolution responses of the third BSE structure |$W^{(1)}_{3\cdot }$| on the spectrum in Fig. 4(a); the points labelled with quadrangles are the pooling responses |$\lbrace v_q^{(3)}, q = 1, \ldots , N_{\rm p}\rbrace$| in equation (17). The results in Figs 4(b) and (c) show that the response to the spuriously strong signal of the stitching error is reduced.

Figure 4. Convolution responses of a spectrum and a histogram of the maximum convolution responses in a pooling process with pool number Np = 10. (a) A normalized SDSS spectrum. (b) Convolution responses of the spectrum shown in (a) with the third BSE shown in Fig. 3. In (b), nine vertical dashed lines mark the boundaries of the pools (the pooling is implemented in logarithmic wavelength space, so the sizes of the pools are unequal in linear wavelength) and ten quadrangles indicate the maximum convolution response in each pool. (c) The statistical histogram of the maximum convolution responses in the pooling process for the third BSE in Fig. 3 on 5000 SDSS spectra (Section 2.1).
To investigate the characteristics of the extracted features further using convolution and pooling, we calculate the statistical histogram of the maximum convolution responses on 5000 SDSS spectra (Section 2.1). Every maximum convolution response |$v_q^{(j)}$| has a specific wavelength position (Fig. 4b, equation 17). The statistical histogram of the maximum convolution responses is obtained by cumulating the number of pooling responses |$v_q^{(j)}$| of 5000 SDSS spectra at each wavelength position. Fig. 4(c) shows the statistical histogram of the maximum convolution responses corresponding to the third BSE structure |$W^{(1)}_{3\cdot }$| and more statistical histograms of the maximum convolution responses are presented in Fig. 5. The results in Fig. 5 also show that the effects of stitching errors near 5580 Å in Fig. 4(a) are negligible in the pooling response. More experimental investigations on the stitching error are conducted in Section 8.4.
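The accumulation of maximum-response positions can be sketched as follows for a single BSE; the function and variable names are illustrative, and the equal-sized pool boundaries are the same assumption as in the pooling sketch above.

```python
import numpy as np

def max_response_histogram(spectra, bse, n_pools):
    """Count how often each response position hosts a pooling maximum for one BSE
    (cf. the histograms in Figs 4c and 5).

    spectra : array of shape (N_spectra, l) of normalized spectra.
    bse     : 1-D array of length m, one BSE template.
    """
    l = spectra.shape[1]
    m = bse.shape[0]
    hist = np.zeros(l - m + 1, dtype=int)
    edges = np.linspace(0, l - m + 1, n_pools + 1, dtype=int)
    for spec in spectra:
        z = np.correlate(spec, bse, mode='valid')            # responses with this BSE
        for q in range(n_pools):
            # position of the maximum response inside pool q
            hist[edges[q] + z[edges[q]:edges[q + 1]].argmax()] += 1
    return hist
```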

Figure 5. The cumulative histogram of the local-maximum response positions for the third, fourth, sixth, 13th, 14th, 16th and 24th BSE structures. The dashed lines indicate eight significant local maxima of the cumulative responses of pooling operations on 5000 SDSS spectra.
The statistical histograms of the maximum convolution responses (Figs 4c and 5) on 5000 SDSS spectra also show that the pooling responses are statistically sparse: only at a few wavelength positions are there non-zero responses. The sparseness helps to trace back the physical interpretation of the extracted features.
That is to say, most of the cumulative responses are close to zero except for a few wavelength patches with large cumulative responses. As can be seen in Fig. 4, these local maximum responses appear near the wavelength of evident fluctuation in a spectrum. In the pooling process, the responses of these wavelengths are frequently taken as spectral features. Some of the new features characterize such local structures in spectra and they are local and sparse.
Considering that a pooling response |$v_q^{(j)}$| is obtained by a convolution calculation (equations 17 and 14), each pooling feature is actually calculated from some spectral fluxes in a wavelength range. Therefore, we present the wavelength ranges in the second column of Table 3 for those features with large cumulative responses in Fig. 5.
Table 3. The wavelengths of the eight local-maximum cumulative responses and some potential lines near them. WPT: wavelength position of a typical local-maximum cumulative response of a pooling operation on 5000 SDSS spectra; this position is represented by a three-dimensional vector (a b c), where a, b and c are respectively the starting, central and ending wavelengths and log10 b = (log10 a + log10 c)/2. PL: potential lines probably related to the description of the detected feature. Wavelength b is the position of the local-maximum cumulative response and fluxes from the range [a, c] determine the response at b in a convolution process.

| No. | WPT (Å) | PL |
|---|---|---|
| 1 | (4000 4074 4150) | Ca i, Hδ, He i |
| 2 | (4199 4276 4356) | Ca i, Hγ |
| 3 | (4233 4311 4391) | Fe ii, Na ii, O i, Fe i, Ca iii |
| 4 | (4744 4831 4922) | Fe i, O iii, Na i, Na ii, O ii, Fe ii |
| 5 | (5047 5141 5237) | Fe ii, He i, Na i, Ca iii, O iii, Fe i |
| 6 | (5757 5864 5973) | Na i, Fe i, O ii |
| 7 | (6401 6520 6641) | Hα, CaH |
| 8 | (7473 7612 7754) | Fe i, Fe ii, O iii, O i |
Although the features are not extracted by identifying spectral lines, the results in Fig. 5 do show that the detected features are near some spectral lines potentially existing in a specific stellar spectrum (the third column of Table 3).
6 ESTIMATING ATMOSPHERIC PARAMETERS
As a typical non-linear learning method, ANN has been widely used in automated estimation of stellar atmosphere parameters (Bailer-Jones 2000; Willemsen et al. 2003; Ordóñez et al. 2007, 2008; Zhao, Luo & Zhao 2008; Pan & Tu 2009; Manteiga et al. 2010; Giridhar et al. 2012) and spectrum classification (Gulati et al. 1994; Hippel et al. 1994; Vieira & Pons 1995; Weaver & Torres-Dodgen 1995; Schierscher & Paunzen 2011). For example, Manteiga et al. (2010) extracted spectral features using fast Fourier transforms (FFTs) and discrete wavelet transform (DWT) and estimated the parameters Teff, log g, [Fe/H] and [α/Fe] by an ANN with one hidden layer from FFT coefficients and wavelet coefficients. Giridhar et al. (2012) studied how to estimate atmospheric parameters directly from spectra by a BP neural network. Pan & Tu (2009) proposed to parametrize a stellar spectrum using an ANN from a few PCA features.
In this article, we use BP networks to learn the mapping from the extracted DCP features (Section 5) to the stellar parameters Teff, log g and [Fe/H]. Training a BP network is an iterative process and in each iteration the estimation errors are calculated on two sets: the training set and the validation set. A BP network optimizes its parameters by minimizing the difference between the network's output and the expected output (e.g. stellar atmospheric parameters) according to the estimation errors on the training set, and this process stops when the estimation errors on the validation set show no improvement over successive iterations. This helps to avoid overfitting.
In a BP network, there are two preset parameters: the number of hidden layers and number of nodes in each hidden layer. For convenience, the former is denoted by |$n^{\rm BP}_{\rm hl}$|, the latter |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$|. In this work, we investigated the cases |$n^{\rm BP}_{\rm hl} = 1$| and |$n^{\rm BP}_{\rm hl} = 2$| for computational feasibility. If |$n^{\rm BP}_{\rm hl} = 1$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| is a positive integer. If |$n^{\rm BP}_{\rm hl} = 2$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| is a two-dimensional row vector consisting of the numbers of nodes in the first and second hidden layers.
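As a rough illustration (not the authors' exact implementation), a two-hidden-layer regressor trained by back-propagation with validation-based early stopping can be set up, for example, with scikit-learn; the hidden-layer sizes below are the optimized values for log Teff reported in Table 4, and the variable names are ours.

```python
from sklearn.neural_network import MLPRegressor

# DCP features -> log Teff.  early_stopping holds out part of the training data
# as a validation split and stops once the validation score stops improving.
bp_teff = MLPRegressor(hidden_layer_sizes=(14, 10),
                       early_stopping=True,
                       validation_fraction=0.2,
                       max_iter=2000,
                       random_state=0)
# bp_teff.fit(W_train, logteff_train)        # W_train: DCP features of the training spectra
# logteff_pred = bp_teff.predict(W_test)     # predictions for the test spectra
```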
7 OPTIMIZING THE CONFIGURATION
The proposed scheme consists of four steps (Fig. 1, Section 3). In the second step, ‘Learning BSE’, there are four parameters, λ, β, ρ and n, to be preset, where n denotes the number of nodes in the hidden layer of an autoencoder (Fig. 2). In the third step, ‘Extract features by convolution and pooling’, there is a preset parameter, Np, representing the number of pools (Section 5). In the estimation method, BP network, there are two preset parameters |$n^{\rm BP}_{\rm hl}$| and |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| (Section 6).
To optimize the configuration of the proposed scheme, the spectrum-parametrization scheme was estimated from a training set from SDSS. The performance of the estimated spectrum-parametrization scheme was evaluated on a validation set from SDSS (Section 2).
Because there is no analytical expression for the objective function in equation (21), in theory we could obtain the optimal configuration |$(\hat{\lambda }, \hat{\beta }, \hat{\rho }, \hat{n}, \hat{N_{\rm p}}, \hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol {n}}^{\rm BP}_{\rm nhl})_{\rm ap}$| by repeating the four procedures of the proposed scheme (Fig. 1, Section 3) for every possible configuration of |$\lambda , \beta , \rho , n, N_{\rm p}, n^{\rm BP}_{\rm hl}, \boldsymbol {n}^{\rm BP}_{\rm nhl}$| and choosing the one with minimal MAE as the optimal configuration.
However, this exhaustive optimization scheme is computationally infeasible. Therefore, instead of seeking the optimal configuration, we propose to find an excellent/acceptable configuration, i.e. a suboptimal solution.
Table 4. The suboptimal configuration obtained for the proposed scheme (Fig. 1, Section 3). |$\hat{\lambda }$|, |$\hat{\beta }$|, |$\hat{\rho }$| and |$\hat{n}$| are the optimized values of the four parameters in the second step, ‘Learning BSE’ (Fig. 1), where |$\hat{n}$| denotes the number of nodes in the hidden layer (Fig. 2). |$\hat{N}_p$| represents the number of pools (Section 5) in the third step, ‘Extract features by convolution and pooling’ (Fig. 1). |$\hat{n}^{\rm BP}_{\rm hl}$| and |$\boldsymbol {\hat{n}}^{\rm BP}_{\rm nhl}$| are the optimized values of the two preset parameters of the estimation method, the BP network (Section 6).

| Parameter | |$\hat{\lambda }$| | |$\hat{\beta }$| | |$\hat{\rho }$| | |$\hat{n}$| | |$\hat{N}_p$| | |$\hat{n}^{\rm BP}_{\rm hl}$| | |$\hat{\boldsymbol {n}}^{\rm BP}_{\rm nhl}$| |
|---|---|---|---|---|---|---|---|
| log Teff | 0.0074 | 3 | 0.0050 | 25 | 10 | 2 | (14, 10) |
| log g | 0.0087 | 3 | 0.114 | 25 | 12 | 2 | (16, 4) |
| [Fe/H] | 0.0087 | 3 | 0.114 | 25 | 15 | 2 | (22, 6) |
Note, however, that because the restricted search ranges are selected empirically based on performance on the validation set, the solutions obtained are usually not global minima, but acceptable values found with a feasible computational burden. If we expand the restricted ranges, it is possible that a better configuration could be obtained at greater computational cost.
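A minimal sketch of a search over restricted ranges is given below; it assumes the ranges RR are supplied as lists of candidate values and that `evaluate(config)` runs the four procedures of Fig. 1 and returns the validation MAE for one atmospheric parameter (both names are hypothetical, and the actual strategy of Section 7 may tune the parameters in groups rather than exhaustively).

```python
import itertools

def restricted_grid_search(ranges, evaluate):
    """Try every configuration inside the restricted ranges and keep the one
    with the smallest validation MAE.

    ranges   : dict mapping parameter name -> list of candidate values, e.g.
               {'lambda': [...], 'beta': [...], 'rho': [...], 'n': [...],
                'N_p': [...], 'n_hl': [...], 'n_nhl': [...]}.
    evaluate : callable(config) -> validation MAE (smaller is better).
    """
    names = list(ranges)
    best_config, best_mae = None, float('inf')
    for values in itertools.product(*(ranges[k] for k in names)):
        config = dict(zip(names, values))
        mae = evaluate(config)   # learn BSEs, extract DCP features, train the BP network
        if mae < best_mae:
            best_config, best_mae = config, mae
    return best_config, best_mae
```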
8 EXPERIMENTS AND DISCUSSION
8.1 Performance for SDSS spectra
From the training set and the validation set consisting of SDSS spectra (Section 2), we obtain a spectral parametrization using the proposed scheme (Section 3, Fig. 1) and the proposed optimization scheme (Section 7, Table 4).
On an SDSS test set, the MAEs of this spectral parametrization are 0.0060 dex for log Teff, 0.1978 dex for log g and 0.1770 dex for [Fe/H] (the row with index 1 in Table 5). For comparison, on real spectra from SDSS, Fiorentin et al. (2007) investigated the stellar parameter estimation problem using PCA and ANN and obtained accuracies of 0.0126 dex for log Teff, 0.3644 dex for log g and 0.1949 dex for [Fe/H] on an SDSS test set based on 50 PCA features. Liu et al. (2014) studied the spectrum-parametrization problem using the least absolute shrinkage and selection operator (LASSO) algorithm and the support vector regression (SVR) method and obtained best MAEs of 0.0094 dex for log Teff, 0.2672 dex for log g and 0.2728 dex for [Fe/H].
Table 5. MAEs of the proposed scheme in the experiments of Section 8.

| Index | Data source | log Teff (dex) | log g (dex) | [Fe/H] (dex) |
|---|---|---|---|---|
| (a) Performance of the proposed scheme (Sections 8.1 and 8.2) | | | | |
| 1 | SDSS spectra | 0.0060 | 0.1978 | 0.1770 |
| 2 | synthetic spectra | 0.0004 | 0.0145 | 0.0070 |
| (b) Rationality of deleting some data components in application (Section 8.3) | | | | |
| 3 | SDSS spectra | 0.0080 | 0.2994 | 0.2163 |
| (c) Robustness to stitching error (Section 8.4) | | | | |
| 4 | SDSS spectra | 0.0063 | 0.2371 | 0.1827 |
8.2 Performance for synthetic spectra
To investigate the effectiveness of the proposed scheme further, we also evaluated it on synthetic spectra. The synthetic spectra used in this work and those used in Fiorentin et al. (2007) are all calculated from Kurucz's models. The synthetic data set is described in Section 2.2. This experiment shares the same configuration as the experiment on SDSS data (Section 8.1) and the BP estimator is trained on the synthetic training set (Section 2).
On the synthetic test set, the MAEs of the spectral parametrization learned from the synthetic training set are 0.0004 dex for log Teff, 0.0145 dex for log g and 0.0070 dex for [Fe/H] (the row with index 2 in Table 5). In Fiorentin et al. (2007), the best consistencies on synthetic spectra are obtained based on 100 principal components, with MAEs of 0.0030 dex for log Teff, 0.0251 dex for log g and 0.0269 dex for [Fe/H] (table 1 in Fiorentin et al. 2007). Using the LASSO algorithm and SVR method, Li et al. (2014) reached MAEs of 0.0008 dex for log Teff, 0.0179 dex for log g and 0.0131 dex for [Fe/H]. On synthetic spectra, the MAEs in Manteiga et al. (2010) are 0.07 dex for log g and 0.06 dex for [Fe/H].
8.3 Effective data components, unwanted influences and their balances
The proposed scheme extracts features by throwing away some information. In this procedure, it is very probable that some useful (at least in theory) components are discarded. Is this positive or negative?
Actually, besides the useful data components in spectra, there are also redundancy and/or noise and pre-processing imperfections (e.g. sky lines, cosmic-ray removal residuals and residual calibration defects). Redundancy means that some components of a system are duplicated; the multiple copies usually differ from each other in their amount of duplication. Therefore, redundancy can distort the weights given to different components, which usually results in erroneous evaluation and reduces the quality of learning. Noise and pre-processing imperfections can mask the effects of some important spectral information, e.g. weak lines. Therefore, in theory, it is possible to improve the estimates of atmospheric parameters by throwing away some data components.
We conducted an experiment to investigate this theoretical possibility. In this experiment, we estimated the atmospheric parameters after deleting procedures 2 and 3 (Fig. 1) from the experiments in Section 8.1. That is to say, we estimated the atmospheric parameters using a BP network directly from a spectrum, without throwing away any information in the pooling procedure, and the BP estimation network shared the same configuration as in the experiment of Section 8.1. The results are presented in Table 5, part (b) (the row with index 3). The results of the experiments with indices 1 and 3 in Table 5 show no evidence that spectrum-parametrization performance is lost when some data components are thrown away by the proposed scheme.
8.4 Effects of band stitching errors on parameter estimation
There is a significant oscillation in some SDSS spectra near 5580 Å (e.g. Fig. 4a), which is caused by errors in stitching the red and blue bands. Figs 4(b), 4(c) and 5 show that the response of stitching errors is evidently reduced in the procedures of convolution and pooling.
We also performed an experiment to analyse the effects of this ‘noise’ on the performance of the proposed scheme. In this experiment, the BP estimator is trained with features excluding the convolution responses related to stitching errors. For example, in the SDSS data the stitching error appears in the fifth pool in our experiments, so we removed the pooling response of the fifth pool for every BSE structure. The new MAEs of the three parameters on the SDSS test set are presented in Table 5, part (c).
Actually, in the wavelength range containing the stitching error, there are both useful components and disturbances from noise and stitching errors. However, the experiments show that the useful components outweigh the disturbances in the proposed scheme (the experiments with indices 1 and 4 in Table 5).
9 CONCLUSION
In this work, we propose a novel scheme for spectral feature extraction and stellar atmospheric parameter estimation. In commonly used methods such as PCA and Fourier transforms, every feature is obtained by incorporating nearly all of the fluxes of a spectrum. In contrast, the characteristics of our proposed scheme are the localization and sparseness of the extracted features. ‘Localization’ means that each extracted feature is calculated from some spectral fluxes within a local wavelength range (Fig. 4b). ‘Sparseness’ means that the atmospheric parameters can be estimated using a small number of features (Fig. 4c). Localization and sparseness imply that many data components are thrown away, especially redundancies and weakly informative components (Section 8.3). However, these weakly informative components are easily corrupted by noise and pre-processing imperfections (Section 8.3). It is shown that the proposed scheme has excellent performance in estimating atmospheric parameters (Sections 8.1 and 8.2, Table 5).
The authors thank the reviewer and editor for their instructive comments and extend their thanks to Professor Ali Luo for his support and invaluable discussions. This work is supported by the National Natural Science Foundation of China (grant Nos 61273248, 61075033 and 61202315), the Natural Science Foundation of Guangdong Province (2014A030313425, S2011010003348), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (201001060) and the high-performance computing platform of South China Normal University.
Here, ‘tr_ae’ is the abbreviation of a training set for an autoencoder.
REFERENCES