Abstract

This article investigates the problem of estimating stellar atmospheric parameters from spectra. Feature extraction is a key procedure in estimating stellar parameters automatically. We propose a scheme for spectral feature extraction and atmospheric parameter estimation using the following three procedures: firstly, learn a set of basic structure elements (BSEs) from stellar spectra using an autoencoder; secondly, extract representative features from stellar spectra based on the learned BSEs through some procedures of convolution and pooling; thirdly, estimate stellar parameters (Teff, log g, [Fe/H]) using a back-propagation (BP) network. The proposed scheme has been evaluated on both real spectra from Sloan Digital Sky Survey (SDSS)/Sloan Extension for Galactic Understanding and Exploration (SEGUE) and synthetic spectra calculated from Kurucz's new opacity distribution function (NEWODF) models. The best mean absolute errors (MAEs) are 0.0060 dex for log Teff, 0.1978 dex for log g and 0.1770 dex for [Fe/H] for the real spectra and 0.0004 dex for log Teff, 0.0145 dex for log g and 0.0070 dex for [Fe/H] for the synthetic spectra.

1 INTRODUCTION

With the development of astronomical observation technology, more and more large-scale sky survey projects have been proposed and implemented: for example, the Sloan Digital Sky Survey (SDSS: York et al. 2000; Ahn et al. 2012), the Large Sky Area Multi-Object Fibre Spectroscopic Telescope (LAMOST/Guoshoujing Telescope: Zhao et al. 2006; Cui et al. 2012; Luo et al. 2015) and the Gaia–ESO Survey (Gilmore et al. 2012; Randich et al. 2013). Such surveys automatically observe and collect massive amounts of astronomical spectral data. However, manual processing of these massive, high-dimensional data cannot keep pace with their rapid growth. Therefore, automatic processing and analysis methods are urgently needed for astronomical data (Zhang et al. 2009). For this reason, automatic estimation of stellar atmospheric parameters from spectra is currently an active research topic (Recio-Blanco, Bijaoui & De Laverny 2006; Fiorentin et al. 2007; Jofre et al. 2010; Wu et al. 2011; Liu, Zhang & Lu 2014).

Estimating the stellar parameters (Teff, log g, [Fe/H]) from spectra can be abstracted to a process of finding a mapping from a stellar spectrum to its parameters. If there is a set of representative stellar spectra with known parameters (training data), a supervised machine-learning method can learn the mapping from training data: for example nearest neighbour (NN), artificial neural networks (ANN) and support vector machines (SVM). Based on the learned mapping, we can estimate the atmospheric parameters for a new spectrum.

An astronomical spectrum usually consists of thousands of fluxes, and it is necessary to extract a small number of features in order to estimate stellar parameters accurately (Section 8.3). Orthogonal transforms such as wavelet transforms (WTs) and principal-component analysis (PCA) are popular choices for extracting features. They give a complete representation of the data by a set of complete orthogonal bases. By selecting a small number of transform coefficients as features, dimensionality reduction is achieved. Wavelet coefficients of a spectrum have been used to estimate stellar parameters (Manteiga et al. 2010; Lu, Li & Li 2012; Lu et al. 2013) and to classify spectra (Guo, Xing & Jiang 2004; Xing & Guo 2006). As the most popular dimension-reduction method, PCA has also been widely used in astronomical data analysis (Whitney 1983; Bailer-Jones, Irwin & Von Hippel 1998; Zhang et al. 2005, 2006; Singh et al. 2006; McGurk, Kimball & Ivezić 2010). Many parameter-estimation schemes use the projections of data/spectra on some principal components directly as spectral features and have achieved good results (Zhang et al. 2005, 2006; Singh et al. 2006; Rosalie et al. 2010).

The motivation of our research is to explore the feasibility of predicting atmospheric parameters (Teff, log g, [Fe/H]) from a small number of descriptions computed from information within a local wavelength range of a spectrum. For convenience, this kind of description is referred to as ‘local and sparse features’. This characteristic of local representation helps to trace back the potential spectral lines effective in estimation (see Section 5.3).

This work proposes a scheme based on autoencoders, convolution and pooling techniques to extract local and sparse features. Autoencoder and convolution operations give a statistical non-orthogonal decomposition, which leads to a redundant and overcomplete representation of the data. The 'overcomplete' representation (equation 15) of a spectrum has many more dimensions than the original spectrum (equation 1). While complete representations based on orthogonal transforms are mature and popular feature-extraction methods, redundant, 'overcomplete' representations (Olshausen 2001; Teh et al. 2003) have been advocated and used successfully in many fields. In the literature, researchers have shown that redundancy and overcompleteness help in computing some features in subsequent procedures (Yee et al. 2003), with improved robustness in the presence of noise (Simoncelli et al. 1992), more compactness and more interpretability (Mallat & Zhang 1993).

From the overcomplete representation, a sparse representation (equation 18) is computed using pooling and maximization operations. The pooling and maximization operations are actually a competition strategy. In this procedure, much redundancy is removed by the competition between multiple redundant components. This competition helps to restrain the copies with more noise and a robust representation is obtained. A typical advantage of this scheme is that it can express many suitable and meaningful structures in data in some applications. It is shown that this scheme does extract some meaningful local features (Section 5.3) for automatically estimating atmospheric parameters (Section 8).

The rest of this article is organized as follows: Section 2 describes the spectra used in this work. Section 3 presents the framework of the proposed scheme. Section 4 introduces the autoencoder network, the concept of basic structure elements (BSEs) and the BSE learning method using autoencoders. Sections 5 and 6 describe the proposed feature-extraction method based on BSEs and the estimation method, a back-propagation (BP) network, respectively. Section 7 investigates the optimization of the proposed scheme. Section 8 reports some experimental evaluations and discusses the rationality and robustness of the proposed scheme. Section 9 concludes this work.

2 DATA SETS AND PREPROCESSING

2.1 Real spectra

In this article, we use two spectral sets to evaluate the proposed scheme. 5000 stellar spectra selected from SDSS-DR7 (Abazajian et al. 2009) compose the real spectrum set. Each spectrum has 3000 fluxes in a logarithmic wavelength range [3.6000, 3.8999] with a sampling resolution of 0.0001. The ranges of the three parameters are [4163, 9685] K for Teff, [1.260, 4.994] dex for log g and [−3.437, 0.1820] dex for [Fe/H].

2.2 Synthetic spectra

A set of 18 969 synthetic spectra is generated with Kurucz's new opacity distribution function (NEWODF) models (Castelli & Kurucz 2003) using the package spectrum (Gray & Corbally 1994) and 830 828 atomic and molecular lines. The solar atomic abundances we used are derived from Grevesse & Sauval (1998). The grids of the synthetic stellar spectra span the parameter ranges [4000, 9750] K in Teff (45 values, step size 100 K between 4000 and 7500 K and 250 K between 7750 and 9750 K), [1, 5] dex in log g (17 values, step size 0.25 dex) and [−3.6, 0.3] dex in [Fe/H] (27 values, step size 0.2 between −3.6 and −1 dex, 0.1 between −1 and 0.3 dex).

2.3 Preprocessing and data partitioning

The proposed scheme is implemented based on ANNs. Usually an ANN requires each input component to be normalized, to eliminate the impact on the input data of range differences from flux to flux. Suppose that
$\boldsymbol{x} = (x_1, x_2, \ldots, x_l)^{\rm T} \qquad (1)$
is a spectrum. The normalized spectrum $\bar{\boldsymbol{x}}$ is calculated by
$\bar{\boldsymbol{x}} = \boldsymbol{x}/\Vert\boldsymbol{x}\Vert_2, \qquad (2)$
where $\Vert \boldsymbol{x}\Vert_2 = (\sum_{i=1}^{l}{x_i^2})^{0.5}$. In addition, to reduce the dynamic range and to better represent the uncertainties of spectral data, Teff is replaced by log Teff in both sets (Fiorentin et al. 2007; Li et al. 2014).
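As an illustration, here is a minimal Python (NumPy) sketch of this preprocessing. The array and variable names are ours, and the random arrays are placeholders that stand in for real SDSS fluxes and labels so that the sketch is runnable; this is not the authors' pipeline.

```python
import numpy as np

def normalize_spectra(fluxes):
    """L2-normalize each spectrum (one spectrum per row), as in equation (2)."""
    norms = np.sqrt((fluxes ** 2).sum(axis=1, keepdims=True))
    return fluxes / norms

# Placeholder data: 5000 spectra with 3000 fluxes each and their Teff values.
fluxes = np.random.rand(5000, 3000)          # stand-in for real SDSS fluxes
teff = np.random.uniform(4163, 9685, 5000)   # stand-in Teff values in K

x_bar = normalize_spectra(fluxes)            # normalized spectra
log_teff = np.log10(teff)                    # Teff replaced by log10(Teff)
```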

The proposed scheme of this work belongs to the class of statistical learning methods. The fundamental idea is to discover the predictive relationships between stellar spectra and the atmospheric parameters Teff, log g and [Fe/H] from empirical data, which constitute a training set. At the same time, the performance of the discovered predictive relationships should also be evaluated objectively. Therefore, a separate, independent set of stellar spectra is needed for this evaluation, usually called a test set in machine learning. However, most learning methods tend to overfit the empirical data. In other words, statistical learning methods can extract apparent relationships from the training data that do not hold in general. In order to avoid overfitting, we need a third independent spectrum set for objectively tuning the parameters of the framework (Section 7) while investigating the potential relationships. This third spectrum set, together with its reference parameters, constitutes the validation set.

Therefore, in each experiment, we split the total spectrum sample into three subsets: a training set (60 per cent), a validation set (20 per cent) and a test set (20 per cent). The training set is the carrier of knowledge from which the proposed scheme should learn. The validation set is a mentor/instructor of the proposed scheme that can independently and objectively give advice to the learning process. The training set and the validation set are used in establishing a spectral parametrization, while the test set acts as a referee to evaluate the performance of the established spectral parametrization objectively. The roles of these three subsets are listed in Table 1.
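A minimal sketch of such a 60/20/20 random partition is given below; the paper does not specify how the split was implemented, so the helper and its seed are hypothetical.

```python
import numpy as np

def split_data(n_samples, seed=0):
    """Randomly split sample indices into 60% training, 20% validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_tr = int(0.6 * n_samples)
    n_va = int(0.2 * n_samples)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

train_idx, valid_idx, test_idx = split_data(5000)   # e.g. the 5000 SDSS spectra
```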

Table 1. Roles of the three data sets.

Data set        Roles
Training set    (1) Generate spectral patches to construct a training set for an autoencoder (Subsection 4.2).
                (2) Learn in optimizing the configuration (Section 7).
                (3) Train the BP network (Section 6).
Validation set  Evaluate the performance of the learned spectral parametrization in optimizing the configuration (Section 7).
Test set        Evaluate the performance of the proposed scheme (Section 8).

3 FRAMEWORK

There are multiple procedures in our scheme; the flowchart shown in Fig. 1 illustrates the end-to-end flow.

Figure 1.

A flowchart to show the order of procedures in the proposed scheme. The procedure 'Learning BSE' is only used in the investigation/training stage. In the application/test stage, features are extracted by convolution and pooling, after normalizing the spectra, using the BSEs learned in training.

Overall, our work can be divided into two stages: (1) a research stage and (2) an application stage. These two stages can be implemented automatically based on the flowchart in Fig. 1. As shown in this flowchart, in the research stage, we can obtain an optimized configuration for the proposed scheme and a spectral parametrization by which we can map a spectrum approximately to its atmospheric parameters. In the application stage, we can compute the atmospheric parameters from a spectrum. More about optimization of the configuration is discussed in Section 7.

This work uses multiple acronyms and notations. To facilitate readability, we summarize them in Table 2.

Table 2. Acronyms and notations used in this work. AN: acronym or notation; the first appearance of each is given in parentheses.

$a^{(l)}_j(\boldsymbol{x})$: the output of the jth node in the lth layer for an input x (Section 4.1)
ap: an atmospheric parameter log Teff, log g or [Fe/H] (Section 7)
$b^{(l)}_j$: the bias of the jth node in the (l + 1)th layer (Section 4.1)
β: a regularization parameter in the objective function J(·, ·) (Section 4.1)
BSE: basic structure element (Abstract)
DCP: description by convolution and pooling (Section 5.2)
f(·): an estimation method (Section 6)
g(·): an activation function of an autoencoder (Section 4.1)
hW, b: the mapping of an autoencoder (Section 4.1)
J(·, ·): the objective function of an autoencoder (Section 4.1)
KL(·∥·): relative entropy of the average output of a hidden node and its expected output (Section 4.1)
λ: a weight decay parameter in the objective function J(·, ·) (Section 4.1)
$\hat{\lambda}$: optimized value for λ; $\hat{\beta}$, $\hat{\rho}$, $\hat{n}$, $\hat{N}_{\rm p}$, $\hat{n}^{\rm BP}_{\rm hl}$ and $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ are defined similarly (Section 7)
m: number of nodes in the input layer of an autoencoder (Section 4.1)
n: number of nodes in the hidden layer of an autoencoder (Section 4.1)
$n^{\rm BP}_{\rm hl}$: number of hidden layers in a BP network (Section 6)
$\tilde{n}^{\rm BP}_{\rm hl}$: initialized value of $n^{\rm BP}_{\rm hl}$ (Section 7)
$\boldsymbol{n}^{\rm BP}_{\rm nhl}$: number of nodes in the hidden layers of a BP network (Section 6)
$\tilde{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$: an initialization of $\boldsymbol{n}^{\rm BP}_{\rm nhl}$ (Section 7)
N: number of samples in a data set S or Str, depending on its context (Section 4.1)
Np: number of pools (Section 5.2)
ρ: a desired activation level (Section 4.1)
$\bar{\rho}_j$: the average activation of the jth hidden node of an autoencoder (Section 4.1)
RRλ: restricted (search) range for λ; RRβ, RRρ, RRn, $RR_{N_{\rm p}}$, $RR_{n^{\rm BP}_{\rm hl}}$ and $RR_{\boldsymbol{n}^{\rm BP}_{\rm nhl}}$ are defined similarly (Section 7)
S: a data set with spectra and atmospheric parameters (Section 6)
SBSE: a set of BSEs (Section 4.2)
Str: a training set of stellar spectra (Section 4.2)
$S^{\rm tr\_ae}$: a training set for an autoencoder, consisting of spectral patches (Section 4.1)
$v^{(j)}_q$: the maximum convolution response with the jth BSE in the qth pool (Section 5.2)
w: description by convolution and pooling (DCP) (Section 5.2)
$W^{(1)}_{j\cdot}$: the jth BSE (Section 4.1)
$W^{(l)}_{ji}$: a weight between the ith node in the lth layer and the jth node in the (l + 1)th layer (Section 4.1)
x: a spectrum or a spectral patch for an autoencoder, depending on its context (Section 2.3)
$\bar{\boldsymbol{x}}$: a normalized spectrum (Section 2.3)
y: output of an autoencoder (Section 4.1)
z: a convolution response of a spectrum with BSEs (Section 5.1)

4 LEARNING BSEs USING AN AUTOENCODER

4.1 An autoencoder

An autoencoder is a special kind of ANN, initially proposed as a data dimensionality reduction scheme (Hinton & Salakhutdinov 2006) and now widely used in image analysis (Tan & Eswaran 2010; Shin et al. 2011) and speech processing (Vishnubhotla, Fernandez & Ramabhadran 2010; Deng et al. 2013).

An autoencoder adopts the framework shown in Fig. 2 and is usually used to extract features by unsupervised learning. In an autoencoder, there is only one hidden layer and the number of nodes in the output layer is equal to that in the input layer. The output yi is an approximation of the corresponding input xi,
$y_i \approx x_i, \quad i = 1, \ldots, m. \qquad (3)$
The learning objective of an autoencoder is to find a set of weights, labelled with lines in Fig. 2, between the nodes in different layers based on a set of empirical data (a training set). In other words, the autoencoder tries to approximate an identity function on the training set.
Figure 2.

A framework of an autoencoder. The number of output nodes is equal to the number of input nodes. The learning objective of the network is to make the output hW, b(x) as close as possible to input x.

By setting the number of hidden nodes to be far smaller than that of input nodes, an autoencoder can achieve dimension reduction. For example, when an autoencoder has 200 input nodes and 20 hidden nodes (equivalently, m = 200 and n = 20 in Fig. 2), the original 200-dimensional input could be ‘reconstructed’ approximately from the 20-dimensional output of the hidden layer. If we use the output of the hidden layer as a representation of an input of the network, the autoencoder plays the role of a feature extractor.

To introduce the implementation details of the autoencoder, we utilize the notations in an online tutorial ‘UFLDL Tutorial’ (Andrew et al. 2010).

In the network of Fig. 2, let the layer labels be 1 for the input nodes, 2 for the hidden nodes and 3 for the output nodes and suppose |$W^{(l)}_{ji}$| (l = 1, 2) represents a weight between the ith node in the lth layer and the jth node in the (l + 1)th layer, |$b^{(l)}_j$| (l = 1, 2) is the bias of the jth node in the (l + 1)th layer and |$a^{(l)}_j(\boldsymbol {x})$| (l = 1, 2, 3) is the output of the jth node in the lth layer for input x = (x1, …, xm)T. Then, |$a^{(1)}_i(\boldsymbol {x}) = x_i$| for the input nodes, |$a^{(3)}_k(\boldsymbol {x}) = y_k$| for the output nodes.

For compact expressions in equations (4) and (8), we introduce three variables, s1, s2 and s3, respectively representing the numbers of nodes in the three layers of an autoencoder network (Fig. 2). Then s1 = m, s2 = n and s3 = m, and the relationship between the nodes in adjacent layers is
$a^{(l+1)}_j(\boldsymbol{x}) = g\Big(\sum_{i=1}^{s_l} W^{(l)}_{ji}\, a^{(l)}_i(\boldsymbol{x}) + b^{(l)}_j\Big), \quad l = 1, 2, \quad j = 1, \ldots, s_{l+1}. \qquad (4)$
In equation (4), the g(·) is referred to in the literature as an activation function. Two common choices for the activation function are a sigmoid function,
$g(z) = \dfrac{1}{1 + {\rm e}^{-z}}, \qquad (5)$
and a hyperbolic tangent function,
$g(z) = \tanh(z) = \dfrac{{\rm e}^{z} - {\rm e}^{-z}}{{\rm e}^{z} + {\rm e}^{-z}}. \qquad (6)$
This work uses the sigmoid function in equation (5).
Overall, the network in Fig. 2 implements a non-linear mapping hW, b(·) from an input x = (x1, …, xm)T to an output y = (y1, …, ym)T:
$\boldsymbol{y} = h_{W,b}(\boldsymbol{x}), \qquad (7)$
where |$W = \lbrace W^{(l)}_{ji}\rbrace$| represents the set of weights of an autoencoder network and |$b = \lbrace b^{(l)}_j\rbrace$| the set of biases.
Suppose that |$S^{\rm tr\_ae}$| is a training set for an autoencoder. To obtain the parameters W and b for an autoencoder network based on equation (3), we can minimize an objective function, J, in equation (8):
$J(W, b) = \dfrac{1}{N}\sum\limits_{\boldsymbol{x}\in S^{\rm tr\_ae}} \dfrac{1}{2}\big\Vert h_{W,b}(\boldsymbol{x}) - \boldsymbol{x}\big\Vert_2^2 + \dfrac{\lambda}{2}\sum\limits_{l=1}^{2}\sum\limits_{i=1}^{s_l}\sum\limits_{j=1}^{s_{l+1}} \big(W^{(l)}_{ji}\big)^2 + \beta\sum\limits_{j=1}^{n}\mathbf{KL}(\rho \parallel \bar{\rho}_j), \qquad (8)$
where |$\bar{\rho }_j$| is the average output of the jth hidden node,
$\bar{\rho}_j = \dfrac{1}{N}\sum\limits_{\boldsymbol{x}\in S^{\rm tr\_ae}} a^{(2)}_j(\boldsymbol{x}), \qquad (9)$
$\mathbf{KL}(\rho \parallel \bar{\rho}_j) = \rho\log\dfrac{\rho}{\bar{\rho}_j} + (1-\rho)\log\dfrac{1-\rho}{1-\bar{\rho}_j}, \qquad (10)$
N is the number of samples in the training set1 |$S^{\rm tr\_ae}$| and λ ≥ 0, β ≥ 0 and ρ > 0 are three preset parameters of the spectral parametrization. These three preset parameters control the relative importance of the three terms in equation (8). In the literature, λ is usually referred to as the weight decay parameter.

In equation (8), the first term represents an empirical error evaluation between the actual output and the expected output of an autoencoder and ensures a good reconstruction performance of the network. The second term, a regularization term of |$W^{(l)}_{ji}$|⁠, is used to overcome possible overfitting to the training set by reducing the scheme's complexity.

The third term with weighted coefficient β is a penalty term for sparsity in outputs of the hidden layer. |$\mathbf {KL}(\rho \!\parallel\! \bar{\rho }_j)$| is the relative entropy of the average output, |$\bar{\rho }_j$|⁠, and a desired activation level ρ. |$\mathbf {KL}(\rho \!\parallel\! \bar{\rho }_j)$| increases monotonically with increasing distance between |$\bar{\rho }_j$| and ρ and encourages the average activation, |$\bar{\rho }$|⁠, of the hidden layer to be close to a desired average activation ρ.
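To make the objective concrete, the following is a minimal NumPy sketch of the forward pass and the objective in equation (8) for the three-layer autoencoder of Fig. 2. It assumes the UFLDL-style formulation (mean squared reconstruction error plus weight decay plus KL sparsity penalty); training by back-propagation is omitted, and all function and array names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_objective(W1, b1, W2, b2, X, lam, beta, rho):
    """Objective J(W, b) of a sparse autoencoder (cf. equation 8).

    X  : (N, m) array of spectral patches (the training set S^tr_ae)
    W1 : (n, m) input-to-hidden weights; b1 : (n,) hidden biases
    W2 : (m, n) hidden-to-output weights; b2 : (m,) output biases
    """
    A2 = sigmoid(X @ W1.T + b1)             # hidden activations, shape (N, n)
    Y = sigmoid(A2 @ W2.T + b2)             # reconstructions, shape (N, m)

    recon = 0.5 * np.mean(np.sum((Y - X) ** 2, axis=1))        # empirical error term
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))    # weight decay term
    rho_bar = A2.mean(axis=0)               # average activation of each hidden node
    kl = np.sum(rho * np.log(rho / rho_bar)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_bar)))  # sparsity penalty
    return recon + decay + beta * kl
```

After training, the rows of W1 play the role of the BSEs introduced in Section 4.2.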

4.2 Learning BSEs

BSEs are a set of templates of spectral patches with a limited wavelength range. Using the BSEs, we can extract local and sparse features (Section 5). This work studies the feasibility of learning BSEs through an autoencoder for automatic estimation of atmospheric parameters. In an autoencoder, the weighted sum |$\sum ^{m}_{i=1}W^{(1)}_{ji}a^{(1)}_i(\boldsymbol {x})+b^{(1)}_j$| in the jth hidden node is essentially a projection of an input x on the vector |$W^{(1)}_{j\cdot }=(W^{(1)}_{j1},W^{(1)}_{j2},\ldots ,W^{(1)}_{jm})^{\rm T}$| of weights, which is similar to computing the coefficient of a vector in a coordinate system. Thus, each |$W^{(1)}_{j\cdot }$| (j = 1, …, n) can be regarded as a 'basis' for the input data, and in this article we call it a basic structure element (BSE).

To learn BSEs through an autoencoder, a training set |$S^{\rm tr\_ae}$| is constructed to represent the local information of a stellar spectrum. Let Str be a training set of stellar spectra (Section 2). The BSE training set |$S^{\rm tr\_ae}$| is constructed from the spectral training set Str in the following way:

  1. randomly select a spectrum, x = (x1, …, xl), from Str, where l > 0 represents the number of fluxes of a spectrum;

  2. randomly generate an integer j satisfying 1 ≤ j ≤ l − m + 1, where m > 0 represents the dimension of a sample in |$S^{\rm tr\_ae}$| and is consistent with the number of input nodes of the autoencoder in Fig. 2;

  3. take (xj, xj + 1, …, xj + m − 1)T as a sample of |$S^{\rm tr\_ae}$|⁠.

By repeating the above three procedures, we obtain a BSE training set |$S^{\rm tr\_ae}$|. Therefore, |$S^{\rm tr\_ae}$| actually consists of a series of spectral patches. Considering the widths of lines in a stellar spectrum, m is empirically taken as 81 in this work. In the proposed scheme, 100 000 such patches are generated to constitute the BSE training set. These patches are not generated from a specific wavelength position; therefore, |$S^{\rm tr\_ae}$| expresses the general structure of all spectral patches of length m.
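The patch-generation procedure can be sketched as follows; the helper name and the random seed are ours, and the input is assumed to be a 2D array of normalized training spectra.

```python
import numpy as np

def build_patch_set(spectra, m=81, n_patches=100_000, seed=0):
    """Randomly cut n_patches patches of length m from the training spectra.

    spectra : (N, l) array of normalized training spectra (one spectrum per row).
    """
    rng = np.random.default_rng(seed)
    n_spec, l = spectra.shape
    patches = np.empty((n_patches, m))
    for k in range(n_patches):
        s = rng.integers(n_spec)        # pick a spectrum at random
        j = rng.integers(l - m + 1)     # pick a starting flux position at random
        patches[k] = spectra[s, j:j + m]
    return patches

# Usage (with the normalized training spectra from Section 2.3):
# S_tr_ae = build_patch_set(normalized_training_spectra)
```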

To learn a set of BSEs, we input the generated training set |$S^{\rm tr\_ae}$| into an autoencoder (Section 4.1) and compute a set of spectral templates, BSEs:
$S^{\rm BSE} = \big\lbrace W^{(1)}_{j\cdot},\ j = 1, \ldots, n \big\rbrace, \qquad (11)$
where every BSE, |$W^{(1)}_{j\cdot }$|⁠, is a vector representing a basic pattern of spectral patches,
$W^{(1)}_{j\cdot} = \big(W^{(1)}_{j1}, W^{(1)}_{j2}, \ldots, W^{(1)}_{jm}\big)^{\rm T}. \qquad (12)$

5 FEATURE EXTRACTION

This work proposes to extract features by performing convolution and pooling operations on the computed BSEs and a stellar spectrum.

5.1 Convolution

Let
$\boldsymbol{x} = (x_1, x_2, \ldots, x_l)^{\rm T} \qquad (13)$
denote a spectrum. Using the extracted BSEs SBSE in equation (11) and a convolution operation, we filter the spectrum x:
$z^{(j)}_i = \sum\limits_{k=1}^{m} W^{(1)}_{jk}\, x_{i+k-1}, \qquad (14)$
and transform the spectrum x into
$\boldsymbol{z} = \big\lbrace z^{(j)}_i \big\rbrace, \qquad (15)$
where i = 1, …, l − m + 1, j = 1, …, n. Then,
$\boldsymbol{z}^{(j)} = \big(z^{(j)}_1, z^{(j)}_2, \ldots, z^{(j)}_{l-m+1}\big)^{\rm T} \qquad (16)$
is the convolution response vector of the jth BSE structure |$W^{(1)}_{j\cdot }$| in equation (11).

In this work, a BSE structure template has m = 81 components (equation 12) and a SDSS spectrum is represented by l = 3000 fluxes. Therefore, there are l − m + 1 = 2920 convolution responses for any one BSE structure template and spectrum. From the n = 25 BSE structure templates in equation (11), we obtain (l − m + 1) × n = 2920 × 25 convolution responses (equation 15) for every spectrum.
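A minimal sketch of this convolution step is given below. The excerpt does not spell out whether a bias or an activation function is applied to each response, so the sketch uses plain sliding inner products; the function and variable names are ours.

```python
import numpy as np

def convolve_with_bses(spectrum, bses):
    """Sliding inner products of a spectrum with each BSE (cf. equations 14-16).

    spectrum : (l,) array of normalized fluxes.
    bses     : (n, m) array whose rows are the learned BSEs W^(1).
    Returns an (n, l - m + 1) array of convolution responses.
    """
    n, m = bses.shape
    l = spectrum.size
    z = np.empty((n, l - m + 1))
    for i in range(l - m + 1):
        z[:, i] = bses @ spectrum[i:i + m]   # responses of all BSEs at position i
    return z

# e.g. l = 3000 fluxes, m = 81, n = 25 BSEs  ->  a 25 x 2920 response array
```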

5.2 Pooling

However, for atmospheric parameter estimation, the convolution response description z of a spectrum (equation 15) contains many redundancies. Redundancies usually result in overfitting. To overcome this problem, a 'pooling' method is adopted to reduce the dimension by merging the features at different positions. In this pooling method, we first partition the convolution response description equally into Np pools according to their wavelength positions and then choose the maximum response in each pool as the final spectral feature:
$v^{(j)}_q = \max\limits_{i\,\in\,{\rm pool}\ q}\ z^{(j)}_i, \qquad (17)$
where q = 1, …, Np, j = 1, …, n, l/Np is the length of every pool, l is the number of fluxes in a spectrum and Np is the number of pools. The |$v_q^{(j)}$| is referred to as a maximum convolution response. Thus, the spectrum x is transformed to
$\boldsymbol{w} = \big(v^{(1)}_1, \ldots, v^{(1)}_{N_{\rm p}},\ v^{(2)}_1, \ldots, v^{(2)}_{N_{\rm p}},\ \ldots,\ v^{(n)}_1, \ldots, v^{(n)}_{N_{\rm p}}\big)^{\rm T}. \qquad (18)$
For convenience, the detected features in equation (18) are named the description by convolution and pooling (DCP). For a SDSS spectrum, the dimension of its description decreases from (l − m + 1) × n = 2920 × 25 (convolution responses) to Np × n = 10 × 25 (DCP) if the pool number Np is 10.
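A sketch of the max-pooling step follows. The exact assignment of responses to pools (the pooling is done in logarithmic wavelength space, see Fig. 4) is not reproduced here, so the sketch simply splits the response vector into Np nearly equal segments; names are ours.

```python
import numpy as np

def max_pool(z, n_pools=10):
    """Maximum convolution response in each pool (cf. equations 17-18).

    z : (n, l - m + 1) array of convolution responses for one spectrum.
    Returns the DCP feature vector of length n * n_pools.
    """
    n, width = z.shape
    edges = np.linspace(0, width, n_pools + 1).astype(int)   # pool boundaries
    v = np.empty((n, n_pools))
    for q in range(n_pools):
        v[:, q] = z[:, edges[q]:edges[q + 1]].max(axis=1)    # max response per pool
    return v.ravel()

# e.g. z of shape (25, 2920)  ->  a DCP feature vector of length 25 x 10 = 250
```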

5.3 Discussion on convolution and pooling

Pooling is a form of non-linear downsampling operation, which combines the responses of feature detectors at nearby locations into some statistical variables. The pooling method was proposed in Hubel and Wiesel's work on complex cells in the visual cortex (Hubel & Wiesel 1962) and is now used in a large number of applications: for example computer vision and speech recognition. Theoretical analysis (Boureau, Ponce & LeCun 2010) suggests that the pooling method works well when the features are sparse. In image- and speech-processing fields, convolution and pooling based on local bases has been proven to be an effective feature-extraction method (Nagi et al. 2011; Shin et al. 2011; Ashish, Murat & Anthony 2013).

In theory, the convolution operation can reduce negative effects from noise, and the pooling operation can restrain the negative influence of some imperfections (e.g. sky lines and cosmic-ray removal residuals, residual calibration defects) through competition between the multiple convolution responses of the extracted BSEs with a spectrum (the maximization procedure). Some effects of convolution and pooling are investigated in the remainder of this subsection and further discussed in Sections 8.3 and 8.4 based on experimental results.

Using the SDSS training data (Section 2.1) and the scheme proposed in Section 4.2, we learned 25 BSE structures (Fig. 3) by setting the number of hidden nodes n = 25 in Fig. 2 and equation (11); the optimality of n = 25 is investigated in Section 7. It is shown that the BSE structure templates characterize the local structure of spectra. Using the learned BSE structures and the convolution-pooling scheme described in this section, we can extract features from every spectrum.

Figure 3.

A visualization of BSEs when the hidden layer has 25 nodes.

For example, Fig. 4(a) presents a spectrum from SDSS; there is a stitching error near 5580 Å. In Fig. 4(b), the curve shows the convolution response vector of the third BSE structure |$W^{(1)}_{3\cdot }$| on the spectrum in Fig. 4(a); the points labelled with quadrangles are the pooling responses |$\lbrace v_q^{(3)}, q = 1, \ldots , N_{\rm p}\rbrace$| in equation (17). The results in Figs 4(b) and (c) show that the responses to the spuriously strong signal of the stitching error are suppressed.

Figure 4.

Convolution responses of a spectrum and a histogram of the maximum convolution responses in a pooling process with pool number Np = 10. (a) A normalized SDSS spectrum. (b) Convolution responses of the spectrum shown in (a) and the third BSE shown in Fig. 3. In (b), nine vertical dashed lines are boundaries of each pool (the pooling is implemented in logarithmic wavelength space, so the sizes of multiple pools are unequal in a wavelength space) and ten quadrangles indicate the maximum convolution response in every pool. (c) The statistical histogram of the maximum convolution responses in the pooling process for the third BSE in Fig. 3 on 5000 SDSS spectra (Section 2.1).

To further investigate the characteristics of the features extracted by convolution and pooling, we calculate the statistical histogram of the maximum convolution responses on 5000 SDSS spectra (Section 2.1). Every maximum convolution response |$v_q^{(j)}$| has a specific wavelength position (Fig. 4b, equation 17). The statistical histogram of the maximum convolution responses is obtained by accumulating the number of pooling responses |$v_q^{(j)}$| of the 5000 SDSS spectra at each wavelength position. Fig. 4(c) shows the statistical histogram of the maximum convolution responses corresponding to the third BSE structure |$W^{(1)}_{3\cdot }$|, and more statistical histograms of the maximum convolution responses are presented in Fig. 5. The results in Fig. 5 also show that the effects of stitching errors near 5580 Å in Fig. 4(a) are negligible in the pooling response. More experimental investigations of the stitching error are conducted in Section 8.4.

Figure 5.

The cumulative histogram of the local maximum response position for the third, fourth, sixth, 13th, 14th, 16th and 24th BSE structures. The dashed lines indicate eight significant local maximum cumulative responses of pooling operations on 5000 SDSS spectra.

The statistical histograms of the maximum convolution responses (Figs 4c and 5) on 5000 SDSS spectra also show that the pooling responses are statistically sparse: only at a few wavelength positions are there non-zero responses. The sparseness helps to trace back the physical interpretation of the extracted features.

That is to say, most of the cumulative responses are close to zero, except for a few wavelength patches with large cumulative responses. As can be seen in Fig. 4, these local maximum responses appear near wavelengths of evident fluctuation in a spectrum. In the pooling process, the responses at these wavelengths are frequently taken as spectral features. Some of the new features characterize such local structures in spectra and they are local and sparse.
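A rough sketch of how such a cumulative histogram can be accumulated is given below: for every spectrum and every BSE, the position of the maximum response in each pool is recorded and counted. It is an unoptimized, self-contained illustration with our own names, not the authors' code.

```python
import numpy as np

def max_response_histogram(spectra, bses, n_pools=10):
    """Count, per BSE and response position, how often that position yields the
    maximum convolution response in a pool (cf. the histograms in Figs 4c and 5)."""
    n, m = bses.shape
    n_spec, l = spectra.shape
    width = l - m + 1
    counts = np.zeros((n, width), dtype=int)
    edges = np.linspace(0, width, n_pools + 1).astype(int)
    for x in spectra:
        # sliding inner products of the spectrum with every BSE (cf. Section 5.1)
        z = np.stack([bses @ x[i:i + m] for i in range(width)], axis=1)
        for q in range(n_pools):
            pos = edges[q] + z[:, edges[q]:edges[q + 1]].argmax(axis=1)
            counts[np.arange(n), pos] += 1   # one count per BSE per pool per spectrum
    return counts
```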

Considering that a pooling response |$v_q^{(j)}$| is obtained by a convolution calculation (equations 17 and 14), each pooling feature is actually calculated from some spectral fluxes in a wavelength range. Therefore, we present the wavelength ranges in the second column of Table 3 for those features with large cumulative responses in Fig. 5.

Table 3. The wavelengths of the eight local-maximum cumulative responses and some potential lines near them. WPT: wavelength position of a typical local-maximum cumulative response of a pooling operation on 5000 SDSS spectra; this position is represented by a three-dimensional vector (a b c), where a, b and c are respectively the starting wavelength, central wavelength and ending wavelength, and log10 b = (log10 a + log10 c)/2. PL: potential lines probably related to the description of the detected feature. Wavelength b is the position of the local-maximum cumulative response and fluxes from the range [a, c] determine the response at b in a convolution process.

No.  WPT (Å)             PL
1    (4000 4074 4150)    Ca i, Hδ, He i
2    (4199 4276 4356)    Ca i, Hγ
3    (4233 4311 4391)    Fe ii, Na ii, O i, Fe i, Ca iii
4    (4744 4831 4922)    Fe i, O iii, Na i, Na ii, O ii, Fe ii
5    (5047 5141 5237)    Fe ii, He i, Na i, Ca iii, O iii, Fe i
6    (5757 5864 5973)    Na i, Fe i, O ii
7    (6401 6520 6641)    Hα, CaH
8    (7473 7612 7754)    Fe i, Fe ii, O iii, O i

Although the features are not extracted by identifying spectral lines, the results in Fig. 5 do show that the detected features are near some spectral lines potentially existing in a specific stellar spectrum (the third column of Table 3).

6 ESTIMATING ATMOSPHERIC PARAMETERS

As a typical non-linear learning method, the ANN has been widely used in automated estimation of stellar atmospheric parameters (Bailer-Jones 2000; Willemsen et al. 2003; Ordóñez et al. 2007, 2008; Zhao, Luo & Zhao 2008; Pan & Tu 2009; Manteiga et al. 2010; Giridhar et al. 2012) and in spectrum classification (Gulati et al. 1994; Hippel et al. 1994; Vieira & Pons 1995; Weaver & Torres-Dodgen 1995; Schierscher & Paunzen 2011). For example, Manteiga et al. (2010) extracted spectral features using the fast Fourier transform (FFT) and the discrete wavelet transform (DWT) and estimated the parameters Teff, log g, [Fe/H] and [α/Fe] with a one-hidden-layer ANN fed with the FFT and wavelet coefficients. Giridhar et al. (2012) studied how to estimate atmospheric parameters directly from spectra with a BP neural network. Pan & Tu (2009) proposed to parametrize a stellar spectrum using an ANN fed with a few PCA features.

In this article, we use BP networks to learn the mapping from the extracted DCP features (Section 5) to the stellar parameters Teff, log g and [Fe/H]. Training a BP network is an iterative process; in each iteration the estimation errors are calculated on two sets, the training set and the validation set. A BP network optimizes its parameters by minimizing the difference between the network's output and the expected output (e.g. the stellar atmospheric parameters) according to the estimation errors on the training set, and the process stops when the estimation errors on the validation set show no improvement over successive iterations. This helps to avoid overfitting.

In a BP network, there are two preset parameters: the number of hidden layers and the number of nodes in each hidden layer. For convenience, the former is denoted by |$n^{\rm BP}_{\rm hl}$| and the latter by |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$|. In this work, we investigated the cases |$n^{\rm BP}_{\rm hl} = 1$| and |$n^{\rm BP}_{\rm hl} = 2$| for computational feasibility. If |$n^{\rm BP}_{\rm hl} = 1$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| is a positive integer. If |$n^{\rm BP}_{\rm hl} = 2$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| is a two-dimensional row vector consisting of the numbers of nodes in the first and second hidden layers.

Suppose that S = {(x, y)} is a data set, where x represents the information of a spectrum and y is an atmospheric parameter. In this work, the accuracy of a spectral parametrization f(·) on a data set S is evaluated by the mean absolute error (MAE):
${\rm MAE} = \dfrac{1}{N}\sum\limits_{(\boldsymbol{x},\, y)\in S} \big|f(\boldsymbol{x}) - y\big|, \qquad (19)$
where N is the number of samples in S.
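As a rough stand-in for the BP network described above, scikit-learn's MLPRegressor (a multilayer perceptron trained by back-propagation with early stopping on an internal validation split) can sketch the idea; it is not the authors' implementation. The hidden-layer sizes (16, 4) follow the log g configuration of Table 4, and all names here are ours.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_and_evaluate(dcp_train, y_train, dcp_test, y_test):
    """Fit an MLP on DCP features and report the MAE of equation (19)."""
    model = MLPRegressor(hidden_layer_sizes=(16, 4),   # two hidden layers, cf. Table 4
                         early_stopping=True,          # stop when validation error stalls
                         validation_fraction=0.25,
                         max_iter=2000,
                         random_state=0)
    model.fit(dcp_train, y_train)
    mae = np.mean(np.abs(model.predict(dcp_test) - y_test))
    return model, mae
```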

7 OPTIMIZING THE CONFIGURATION

The proposed scheme consists of four steps (Fig. 1, Section 3). In the second step, ‘Learning BSE’, there are four parameters, λ, β, ρ and n, to be preset, where n denotes the number of nodes in the hidden layer of an autoencoder (Fig. 2). In the third step, ‘Extract features by convolution and pooling’, there is a preset parameter, Np, representing the number of pools (Section 5). In the estimation method, BP network, there are two preset parameters |$n^{\rm BP}_{\rm hl}$| and |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| (Section 6).

To optimize the configuration of the proposed scheme, the spectrum-parametrization scheme was estimated from a training set from SDSS. The performance of the estimated spectrum-parametrization scheme was evaluated on a validation set from SDSS (Section 2).

Therefore, the optimal configuration
$\big(\hat{\lambda}, \hat{\beta}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}, \hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}\big)_{\rm ap} \qquad (20)$
can be found by
$\big(\hat{\lambda}, \hat{\beta}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}, \hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}\big)_{\rm ap} = \mathop{\arg\min}\limits_{\lambda,\, \beta,\, \rho,\, n,\, N_{\rm p},\, n^{\rm BP}_{\rm hl},\, \boldsymbol{n}^{\rm BP}_{\rm nhl}} {\rm MAE}\big(\lambda, \beta, \rho, n, N_{\rm p}, n^{\rm BP}_{\rm hl}, \boldsymbol{n}^{\rm BP}_{\rm nhl};\ {\rm ap}\big), \qquad (21)$
where ap denotes log Teff, log g or [Fe/H] and MAE is the prediction error on the SDSS validation set. The spectral parametrization is learned from a SDSS training set with a specific configuration of λ, β, ρ, n, Np, |$n^{\rm BP}_{\rm hl}$|, |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| and ap.

Because there is no analytical expression for the objective function in equation (21), we can, in theory, obtain the optimal configuration |$(\hat{\lambda }, \hat{\beta }, \hat{\rho }, \hat{n}, \hat{N_{\rm p}}, \hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol {n}}^{\rm BP}_{\rm nhl})_{\rm ap}$| by repeating the four procedures of the proposed scheme (Fig. 1, Section 3) with every possible configuration of |$\lambda , \beta , \rho , n, N_{\rm p}, n^{\rm BP}_{\rm hl}, \boldsymbol {n}^{\rm BP}_{\rm nhl}$| and choosing the one with the minimal MAE as the optimal configuration.

However, this exhaustive optimization scheme is computationally infeasible. Therefore, instead of seeking an optimal configuration, we propose to find an acceptable configuration, i.e. a suboptimal solution.

To find a suboptimal solution, we restrict the search ranges for λ, β, ρ, n and Np empirically, as follows:
(22)
(23)
(24)
(25)
and |$\hat{\beta } =3$|⁠, where RRλ represents a restricted search range for λ; the other symbols RRρ, RRn and |$RR_{N_{\rm p}}$| are defined similarly. The suboptimal solutions of λ, ρ, n and Np can be found by optimizing
$\big(\hat{\lambda}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}\big)_{\rm ap} = \mathop{\arg\min}\limits_{\lambda\in RR_{\lambda},\ \rho\in RR_{\rho},\ n\in RR_{n},\ N_{\rm p}\in RR_{N_{\rm p}}} {\rm MAE}\big(\lambda, \hat{\beta}, \rho, n, N_{\rm p}, \tilde{n}^{\rm BP}_{\rm hl}, \tilde{\boldsymbol{n}}^{\rm BP}_{\rm nhl};\ {\rm ap}\big), \qquad (26)$
based on the framework in Fig. 1, where |$n^{\rm BP}_{\rm hl}$| and |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| are initialized with |$\tilde{n}_{\rm hl} =1$| and |$\tilde{\boldsymbol {n}}_{\rm nhl} = 6$|. The |$\tilde{n}_{\rm hl}$| and |$\tilde{\boldsymbol {n}}_{\rm nhl}$| are determined empirically, balancing computational burden against estimation accuracy. The configurations obtained are presented in Table 4.
Table 4. The suboptimal configuration obtained for the proposed scheme (Fig. 1, Section 3). $\hat{\lambda}$, $\hat{\beta}$, $\hat{\rho}$ and $\hat{n}$ are the optimized values of the four parameters in the second step, 'Learning BSE' (Fig. 1), where $\hat{n}$ denotes the number of nodes in the hidden layer (Fig. 2). $\hat{N}_{\rm p}$ represents the number of pools (Section 5) in the third step, 'Extract features by convolution and pooling' (Fig. 1). $\hat{n}^{\rm BP}_{\rm hl}$ and $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ are the optimized values of the two preset parameters of the estimation method, the BP network (Section 6).

log Teff: $\hat{\lambda}$ = 0.0074, $\hat{\beta}$ = 3, $\hat{\rho}$ = 0.0050, $\hat{n}$ = 25, $\hat{N}_{\rm p}$ = 10, $\hat{n}^{\rm BP}_{\rm hl}$ = 2, $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ = (14, 10)
log g:    $\hat{\lambda}$ = 0.0087, $\hat{\beta}$ = 3, $\hat{\rho}$ = 0.114,  $\hat{n}$ = 25, $\hat{N}_{\rm p}$ = 12, $\hat{n}^{\rm BP}_{\rm hl}$ = 2, $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ = (16, 4)
[Fe/H]:   $\hat{\lambda}$ = 0.0087, $\hat{\beta}$ = 3, $\hat{\rho}$ = 0.114,  $\hat{n}$ = 25, $\hat{N}_{\rm p}$ = 15, $\hat{n}^{\rm BP}_{\rm hl}$ = 2, $\hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}$ = (22, 6)
Based on the configurations of λ, β, ρ, n and Np, the suboptimal solutions of |$n^{\rm BP}_{\rm hl}$| and |$\boldsymbol {n}^{\rm BP}_{\rm nhl}$| are computed by
$\big(\hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol{n}}^{\rm BP}_{\rm nhl}\big)_{\rm ap} = \mathop{\arg\min}\limits_{n^{\rm BP}_{\rm hl},\ \boldsymbol{n}^{\rm BP}_{\rm nhl}} {\rm MAE}\big(\hat{\lambda}, \hat{\beta}, \hat{\rho}, \hat{n}, \hat{N}_{\rm p}, n^{\rm BP}_{\rm hl}, \boldsymbol{n}^{\rm BP}_{\rm nhl};\ {\rm ap}\big), \qquad (27)$
where
If |$n^{\rm BP}_{\rm hl} = 1$|⁠, then
If |$n^{\rm BP}_{\rm hl} = 2$|⁠, then
where
(28)
represents possible numbers of nodes in a hidden layer of a network investigated in this work. The computed |$(\hat{n}^{\rm BP}_{\rm hl}, \hat{\boldsymbol {n}}^{\rm BP}_{\rm nhl})$| are presented in Table 4.

Note, however, that the restricted search ranges are selected empirically based on performance on the validation set; the solutions obtained are therefore usually not global minima, but acceptable values obtained with a feasible computational burden. If we expanded the restricted ranges, a better configuration could possibly be found at greater computational cost.
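A hedged sketch of this restricted grid search follows. The candidate lists are placeholders of our own choosing (the actual restricted ranges are given in equations 22-25 and are not reproduced here), and the evaluation callback stands for one full run of BSE learning, feature extraction and BP training on the training set, scored on the validation set.

```python
import itertools
import numpy as np

def grid_search(candidates, evaluate):
    """Pick the configuration with the smallest validation MAE.

    candidates : dict mapping parameter name -> list of candidate values
                 (placeholders for the restricted ranges RR of Section 7).
    evaluate   : callable(config dict) -> validation MAE for that configuration.
    """
    best_cfg, best_mae = None, np.inf
    keys = list(candidates)
    for values in itertools.product(*(candidates[k] for k in keys)):
        cfg = dict(zip(keys, values))
        mae = evaluate(cfg)
        if mae < best_mae:
            best_cfg, best_mae = cfg, mae
    return best_cfg, best_mae

# Placeholder candidate lists; the true restricted ranges are those of
# equations (22)-(25) in the paper, not these illustrative values.
candidates = {'lambda': [0.005, 0.0074, 0.0087, 0.01],
              'rho': [0.005, 0.05, 0.114],
              'n': [16, 25, 36],
              'N_p': [8, 10, 12, 15]}
```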

8 EXPERIMENTS AND DISCUSSION

8.1 Performance for SDSS spectra

From the training set and the validation set consisting of SDSS spectra (Section 2), we obtain a spectral parametrization using the proposed scheme (Section 3, Fig. 1) and the proposed optimization scheme (Section 7, Table 4).

On a SDSS test set, the MAEs of this spectral parametrization are 0.0060 dex for log Teff, 0.1978 dex for log g and 0.1770 dex for [Fe/H] (the row with index 1 in Table 5). Similarly, on real spectra from SDSS, Fiorentin et al. (2007) investigated the stellar parameter estimation problem using PCA and an ANN and obtained accuracies of 0.0126 dex for log Teff, 0.3644 dex for log g and 0.1949 dex for [Fe/H] on a SDSS test set based on 50 PCA features. Liu et al. (2014) studied the spectrum-parametrization problem using the least absolute shrinkage and selection operator (LASSO) algorithm and the support vector regression (SVR) method; their optimal MAEs are 0.0094 dex for log Teff, 0.2672 dex for log g and 0.2728 dex for [Fe/H].

Table 5. Performance of the proposed scheme on test data sets.

Index  Data source        log Teff (dex)  log g (dex)  [Fe/H] (dex)
(a) Performance of the proposed scheme. More details are presented in Sections 8.1 and 8.2.
1      SDSS spectra       0.0060          0.1978       0.1770
2      Synthetic spectra  0.0004          0.0145       0.0070
(b) Rationality of deleting some data components in application (Section 8.3).
3      SDSS spectra       0.0080          0.2994       0.2163
(c) Robustness to the stitching error (Section 8.4).
4      SDSS spectra       0.0063          0.2371       0.1827

8.2 Performance for synthetic spectra

To investigate the effectiveness of the proposed scheme further, we also evaluated it on synthetic spectra. The synthetic spectra used in this work and those used in Fiorentin et al. (2007) are all calculated from Kurucz's models. The synthetic data set is described in Section 2.2. This experiment shares the same configuration as the experiment on SDSS data (Section 8.1), and the BP estimator is trained on the synthetic training set (Section 2).

On the synthetic test set, the MAEs of the spectral parametrization learned from the synthetic training set are 0.0004 dex for log Teff, 0.0145 dex for log g and 0.0070 dex for [Fe/H] (the row with index 2 in Table 5). In Fiorentin et al. (2007), the best consistencies on synthetic spectra are obtained with 100 principal components, with MAEs of 0.0030 dex for log Teff, 0.0251 dex for log g and 0.0269 dex for [Fe/H] (table 1 in Fiorentin et al. 2007). Using the LASSO algorithm and the SVR method, Li et al. (2014) reached MAEs of 0.0008 dex for log Teff, 0.0179 dex for log g and 0.0131 dex for [Fe/H]. On synthetic spectra, the mean absolute errors in Manteiga et al. (2010) are 0.07 dex for log g and 0.06 dex for [Fe/H].

8.3 Effective data components, unwanted influences and their balances

The proposed scheme extracts features by throwing away some information. In this procedure, it is very probable that some useful (at least in theory) components are discarded. Is this positive or negative?

Actually, besides the useful data components in spectra, there are also redundancy and/or noise and pre-processing imperfections (e.g. sky lines and cosmic-ray removal residuals, residual calibration defects). Redundancy means that some components of the data duplicate one another, and different components usually carry different amounts of duplication. Therefore, redundancy can distort the weights given to different components, which usually results in an erroneous evaluation and reduces the quality of learning. Noise and pre-processing imperfections can mask the effects of some important spectral information, e.g. weak lines. Therefore, in theory, it is possible to improve the estimates of atmospheric parameters by throwing away some data components.

We conducted an experiment to investigate this theoretical possibility. In this experiment, we estimate atmospheric parameters after deleting procedures 2 and 3 (Fig. 1) from the experiments in Section 8.1. That is to say, we estimated the atmospheric parameters using a BP network directly from a spectrum, without throwing away any information in a pooling procedure; the BP estimation network shared the same configuration as in the experiment of Section 8.1. The results are presented in Table 5, part (b) (the row with index 3). Comparing the experiments with indexes 1 and 3 in Table 5 shows no evidence that spectrum-parametrization performance is lost when some data components are discarded by the proposed scheme.

8.4 Effects of band stitching errors on parameter estimation

There is a significant oscillation in some SDSS spectra near 5580 Å (e.g. Fig. 4a), which is caused by errors in stitching the red and blue bands. Figs 4(b), 4(c) and 5 show that the response to this stitching error is clearly reduced by the convolution and pooling procedures.

We also performed an experiment to analyse the effect of this 'noise' on the performance of the proposed scheme. In this experiment, the BP estimator is trained on features that exclude the responses related to stitching errors. In the SDSS data, the stitching error appears in the fifth pool in our experiments, so we removed the pooling response of the fifth pool for every BSE. The new MAEs of the three parameters on the SDSS test set are presented in Table 5, part (c).
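As an illustration of this feature-masking step, the sketch below removes one pooled response per BSE before training. It assumes (our assumption, not a statement about the original code) that the feature vector is laid out as one contiguous block of n_pools pooled responses per BSE; the names and sizes are hypothetical.

import numpy as np

def drop_pool(features, n_bses, n_pools, bad_pool=4):
    """Remove the response of one pool (0-based index; 4 = the fifth pool,
    which contains the ~5580 A stitching region) from every BSE block."""
    blocks = features.reshape(n_bses, n_pools)   # one row of pooled responses per BSE
    kept = np.delete(blocks, bad_pool, axis=1)   # discard the affected pool
    return kept.reshape(-1)

# Hypothetical example: with 50 BSEs and 10 pools, the 500-dimensional feature
# vector becomes 450-dimensional; the BP estimator is then retrained on it.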

In the wavelength range containing the stitching error there are both useful components and disturbances from noise and stitching errors. The experiments show that, in the proposed scheme, the contribution of the useful components outweighs that of the disturbances: retaining the affected pool (index 1 in Table 5) gives smaller MAEs than removing it (index 4).

9 CONCLUSION

In this work, we propose a novel scheme for spectral feature extraction and stellar atmospheric parameter estimation. In commonly used methods such as PCA and Fourier transforms, every feature is obtained by incorporating nearly all of the fluxes of a spectrum. In contrast, the characteristics of the proposed scheme are localization and sparseness of the extracted features. 'Localization' means that each extracted feature is calculated from spectral fluxes within a local wavelength range (Fig. 4b); 'sparseness' means that the atmospheric parameters can be estimated using a small number of features (Fig. 4c). Localization and sparseness together imply that many data components are thrown away, especially redundant and weakly informative components (Section 8.3); such weakly informative components are, in any case, easily corrupted by noise and pre-processing imperfections (Section 8.3). It is shown that the proposed scheme has excellent performance in estimating atmospheric parameters (Sections 8.1 and 8.2, Table 5).

The authors thank the reviewer and editor for their instructive comments and extend their thanks to Professor Ali Luo for his support and invaluable discussions. This work is supported by the National Natural Science Foundation of China (grant Nos 61273248, 61075033 and 61202315), the Natural Science Foundation of Guangdong Province (2014A030313425, S2011010003348), the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (201001060) and the high-performance computing platform of South China Normal University.

1 Here, 'tr_ae' is an abbreviation for the training set of an autoencoder.

REFERENCES

Abazajian K. N., et al., 2009, ApJS, 182, 543
Ahn C. P., et al., 2012, ApJS, 203, 21
Andrew N., Ngiam J., Foo C. Y., Mai Y., Suen C., 2010, UFLDL Tutorial
Ashish G., Murat A., Anthony M., 2013, in Dasgupta S., McAllester D., eds, Proc. 30th International Conference on Machine Learning. Microtome Publishing, Brookline, p. 987
Bailer-Jones C. A. L., 2000, A&A, 357, 197
Bailer-Jones C. A. L., Irwin M., Von Hippel T., 1998, MNRAS, 298, 361
Boureau Y., Ponce J., LeCun Y., 2010, in Fürnkranz J., Joachims T., eds, International Conference on Machine Learning. Omnipress, Madison, Wisconsin, p. 111
Castelli F., Kurucz R. L., 2003, in Piskunov N., Weiss W. W., Gray D. F., eds, IAU Symp. 210, Modelling of Stellar Atmospheres. Kluwer, Dordrecht, p. A20
Cui X., et al., 2012, Res. Astron. Astrophys., 12, 1197
Deng J., Zhang Z. X., Marchi E., Schuller B., 2013, in Guerrero J. E., ed., Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva. IEEE Computer Society, Los Alamitos, CA, p. 511
Gilmore G., et al., 2012, Messenger, 147, 25
Giridhar S., Goswami A., Kunder A., Muneer S., Kumar G. S., 2012, in Prugniel Ph., Singh H. P., eds, ASI Conf. Ser. Vol. 6. Astronomical Society of India, Bengaluru, p. 137
Gray R. O., Corbally C. J., 1994, AJ, 107, 742
Grevesse N., Sauval A. J., 1998, Space Sci. Rev., 85, 161
Gulati R. K., Gupta R., Gothoskar P., Khobragade S., 1994, ApJ, 426, 340
Guo P., Xing F., Jiang Y. G., 2004, in IEEE International Conference on Systems, Man and Cybernetics, Vol. 6. IEEE, Piscataway, NJ, p. 5894
Hinton G. E., Salakhutdinov R. R., 2006, Science, 313, 504
Hubel D. H., Wiesel T. N., 1962, J. Physiol., 160, 106
Jofre P., Panter B., Hansen C. J., Weiss A., 2010, A&A, 517, 57
Li X., Wu Q. M. J., Luo A., Zhao Y., Lu Y., Zuo F., Yang T., Wang Y., 2014, ApJ, 790, 105
Liu C. X., Zhang P. A., Lu Y., 2014, Res. Astron. Astrophys., 14, 423
Lu Y., Li C. L., Li X. R., 2012, Spectros. Spectral Anal., 32, 2583
Lu Y., Li X. R., Wang Y. J., Yang T., 2013, Spectros. Spectral Anal., 33, 2010
Luo A., et al., 2015, preprint (arXiv:1505.01570)
Mallat S. G., Zhang Z., 1993, IEEE Trans. Signal Processing, 41, 3397
Manteiga M., Ordóñez D., Dafonte C., Arcay B., 2010, PASP, 122, 608
McGurk R. C., Kimball A. E., Ivezić Ž., 2010, AJ, 139, 1261
Nagi J., et al., 2011, in ICSIPA 2011. IEEE Press, Piscataway, NJ, p. 342
Olshausen B. A., 2001, Sparse Codes and Spikes. In: Probabilistic Models of the Brain: Perception and Neural Function. MIT Press, Cambridge, MA
Ordóñez D., Dafonte C., Manteiga M., Arcay B., 2008, Lecture Notes in Computer Science, 5271, 212
Ordóñez-Blanco D., Dafonte-Vázquez C., Arcay-Varela B., Manteiga M., 2007, Lecture Notes and Essays in Astrophysics, 3, 225
Pan Y. C., Tu L. P., 2009, J. Univ. Sci. Technol. Liaoning, 32, 21
Randich S., Gilmore G., Gaia–ESO Consortium, 2013, Messenger, 154, 47
Re Fiorentin P., Bailer-Jones C. A. L., Lee Y. S., Beers T. C., Sivarani T., Wilhelm R., Allende Prieto C., Norris J. E., 2007, A&A, 467, 1373
Recio-Blanco A., Bijaoui A., De Laverny P., 2006, MNRAS, 370, 141
Schierscher F., Paunzen E., 2011, Astron. Nachr., 332, 597
Shin H.-C., Orton M., Collins D. J., Doran S., Leach M. O., 2011, in Chen X., Dillon T., Ishibuchi H., Pei J., Wang H., Wani M. A., eds, 10th International Conference on Machine Learning and Applications and Workshops, Vol. 1. IEEE Computer Society, Washington, DC, p. 259
Simoncelli E. P., Freeman W. T., Adelson E. H., Heeger D. J., 1992, IEEE Trans. Information Theory, 38, 587
Singh H. P., Manabu Y., Nawo Y., Gupta R., 2006, PASJ, 58, 177
Tan C. C., Eswaran C., 2010, Neural Computing and Applications, 19, 1069
Teh Y. W., Welling M., Osindero S., Hinton G. E., 2003, J. Machine Learning Research, 4, 1235
Vieira E. F., Pons J. D., 1995, A&AS, 111, 393
Vishnubhotla S., Fernandez R., Ramabhadran B., 2010, in Kehtarnavaz N., ed., IEEE International Conference on Acoustics, Speech and Signal Processing, Texas. IEEE Press, Piscataway, NJ, p. 4614
von Hippel T., Storrie-Lombardi L., Storrie-Lombardi M. C., Irwin M., 1994, MNRAS, 269, 97
Weaver W. B., Torres-Dodgen A. V., 1995, ApJ, 446, 300
Whitney C. A., 1983, A&AS, 51, 443
Willemsen P. G., Bailer-Jones C. A. L., Kaempf T. A., de Boer K. S., 2003, A&A, 401, 1203
Wu Y., et al., 2011, Res. Astron. Astrophys., 11, 924
Xing F., Guo P., 2006, Spectros. Spectral Anal., 26, 1368
York D. G., et al., 2000, AJ, 120, 1579
Zhang J., Wu F., Luo A., Zhao Y., 2005, Acta Astron. Sin., 46, 406
Zhang J., Wu F., Luo A., Zhao Y., 2006, Chin. Astron. Astrophys., 30, 176
Zhang J., Jiang Y., Chang K. H., Zhang S., Cai J., Hu L., 2009, Pattern Recognition Letters, 30, 1434
Zhao G., Chen Y., Shi J., Liang Y., Hou J., Chen L., Zhang H., Li A., 2006, Chin. J. Astron. Astrophys., 6, 265
Zhao J., Luo A., Zhao G., Guo M., Zhao L., Wang L., 2008, in Fourth International Conference on Natural Computation, Jinan. IEEE Computer Society, Los Alamitos, CA, p. 150