Mingru Zhang, Junping Gao, A-Li Luo, Xia Jiang, Liwen Zhang, Kuang Wu, Bo Qiu, A multimodal celestial object classification network based on 2D spectrum and photometric image, RAS Techniques and Instruments, Volume 2, Issue 1, January 2023, Pages 408–419, https://doi.org/10.1093/rasti/rzad026
ABSTRACT
In astronomy, classifying celestial objects from the spectral data observed by astronomical telescopes is a basic task. So far, most spectral classification work has been based on 1D spectral data. However, 2D spectral data, from which the 1D spectra are extracted, are rarely used directly for research. This paper proposes a multimodal celestial object classification network (MAC-Net) based on 2D spectra and photometric images that introduces an attention mechanism. In this work, all 2D spectral data and photometric data were obtained from LAMOST (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope) DR6 and SDSS (the Sloan Digital Sky Survey), respectively. The model extracts the features of the blue arm, the red arm, and the photometric image through three input branches, merges them at the feature level, and sends the fused features to its classifier. The 2D spectral data set used in this experiment includes 1223 galaxy spectra, 466 quasar spectra, and 1202 star spectra; the same number of photometric images constitute the photometric image data set. Experimental results show that MAC-Net classifies galaxies, quasars, and stars with precisions of 99.2 per cent, 100 per cent, and 97.6 per cent, respectively, and an overall accuracy of 98.6 per cent, meaning that the agreement between these results and those obtained by the LAMOST template matching method is 98.6 per cent. These results exceed the performance of the 1D spectrum classification network and demonstrate the feasibility and effectiveness of classifying celestial objects directly from 2D spectra with MAC-Net.
1 INTRODUCTION
In recent years, with sky survey projects such as the Sloan Digital Sky Survey (SDSS; Ahumada et al. 2020), the global astrometric interferometer for astrophysics (Gaia; Gilmore 2007), and the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST; Cui et al. 2012), the volume of observed astronomical data has increased exponentially. How to efficiently and accurately identify and classify the collected spectral data has therefore become an important issue for astronomers (Manteiga et al. 2009; Yang et al. 2021). The traditional classification methods for celestial spectra are based on the Morgan-Keenan (MK) system (Morgan & Keenan 1973), which works mainly by computing the similarity between the spectrum to be measured and the spectra of standard celestial bodies.
In addition to the above MK classification systems, which are template matching algorithms, researchers have in recent years also made many attempts to solve classification problems with machine learning algorithms. Schierscher & Paunzen (2011) introduced artificial neural networks (ANN) into astronomy as a spectral classification tool and achieved good classification accuracy (Wang, Guo & Luo 2016), but the computational complexity of the ANN algorithm was very high. Singh, Gupta & Gulati (1998) applied principal component analysis (PCA) to reduce the dimensionality of the spectral data and used a multilayer BP network (MBPN) to automate the spectral classification process. Liu et al. (2015) used a Support Vector Machine (SVM) to classify LAMOST stellar spectra, confirming the feasibility of SVM-based spectral classification. More recently, the emergence of deep learning, known for its powerful feature extraction ability, has brought new solutions to this problem (LeCun, Bengio & Hinton 2015).
Hon, Stello & Yu (2017) built a 1D convolutional neural network to classify red giants with an accuracy of 99 per cent. Liu et al. (2019) proposed the 1D SSCNN model and classified the spectra of three types of stars, F, G, and K; experiments showed that the classification accuracy of this model reaches 90 per cent and that it outperforms SVM. Zheng et al. (2020) put forward a semisupervised classification model composed of an SGAN and a CNN, in which spectra of some O-type stars are generated by the SGAN to alleviate the imbalance of the training set and thereby avoid overfitting during training. Zou et al. (2020) proposed the RAC-Net network to classify three types of celestial objects: stars, galaxies, and quasars. To solve the problems of vanishing and exploding gradients caused by deepening the network in pursuit of stronger feature extraction, the residual module (Xie et al. 2017) was introduced, and to increase the model's attention to features useful for classification, an attention mechanism was introduced; experiments showed that the classification accuracy of the model reaches 98 per cent. However, the above spectral classification methods are all based on 1D spectra. These methods do reach high accuracy, but they place high requirements on the signal-to-noise ratio (SNR) of the input 1D spectra. The current analysis of spectra by astronomers is based on 1D spectral data, which are obtained by extracting and combining 2D spectra (Perret, Balmer & Heuberger 2010; Guangwei, Haotong & Zhongrui 2015; Bai et al. 2017). From a feature point of view, the 2D spectrum, which lives in a 2D space, carries more high-dimensional features, such as spatial and texture features, and the process of extracting a 1D spectrum from it destroys this feature diversity.
2 DATA PROCESSING
2.1 Spectral data
A 2D spectral image collected by a spectrometer of the LAMOST telescope is shown in Fig. 1. Each 2D spectral image includes 250 target spectra. As shown in the figure, since the interval between neighbouring spectra is small and the spectral distribution is tight, cross-contamination between adjacent spectra occurs easily; moreover, the closer a spectrum is to the edge, the more severe the contamination.

Fig. 2 shows a cut 2D spectral image with a width of 15 pixels and a length of 150 pixels; for convenience of display, the 2D spectrum was sliced from 15*4196 to 15*150. As shown in the figure, compared to the edges, the pixels in the middle of the image are brighter, retain more information, and are less likely to be contaminated by adjacent spectra. Therefore, in this paper, the 11 central pixels are selected as the effective data in the spatial direction, and 2000 pixels are taken along the dispersion direction, comparable to the wavelength range commonly used in 1D spectrum classification networks. That is, the 2D spectrum input to the network has a size of 11*2000.

2.2 SDSS photometric image data
SDSS (the Sloan Digital Sky Survey) uses a wide-field telescope with an aperture of 2.5 m, and its photometric system is equipped with five filters, in the u, g, r, i, and z bands, to image celestial objects. In addition, astronomers selected some targets for spectroscopic observation. In this experiment, we first download the star catalogue of the observation plan corresponding to the spectral data, then cross-match with SDSS through its official channel using the 'class', 'ra', and 'dec' entries of the catalogue, and obtain the photometric image corresponding to each 2D spectrum through the ImgCutout service. The g, r, and i bands are then converted to RGB channels following the algorithm described in Lupton et al. (2004). Since increasing the image size increases the computational cost of the network, we choose a photometric image size of 64*64 pixels. A sample image is given in Fig. 3.

SDSS photometric images; from left to right: GALAXY, QSO, and STAR.
2.3 Data normalization
Deep learning models are data-driven, so the quality of the input data affects their results. As shown in Fig. 1, different 2D spectra show certain brightness differences, and such a difference is likely to be learned by the network model as a 'false feature' that affects the classification results. Therefore, the 2D spectra need to be normalized. We use the Z-score method, commonly used in machine learning, to standardize the data. This method makes the mean of the data 0 and the standard deviation 1, so as to unify the scale across the data while preserving the contrast within it. The Z-score is given in equation (1):

|$z = \frac{x - \mathrm{mean}}{\mathrm{std}}$|(1)

where mean and std represent the mean and standard deviation of the batch of data (the 11*2000 spectral array), respectively, x is the value of a specific sample in the data, and z is the normalized result.
Correspondingly, before the SDSS photometric image is input, normalization is also required to speed up model convergence and improve accuracy. In this experiment, the original image is compressed to the interval [−1, 1], as shown in equation (2) (written here assuming 8-bit pixel values in [0, 255]):

|$\bar{x} = \frac{x}{127.5} - 1$|(2)

Compared with normalizing the image data to [0, 1], this method makes the convergence curve smoother during training. In the formula, x is the value of each pixel in the image and |$\bar{x}$| is the pixel value after normalization.
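Both normalizations can be written in a few lines of NumPy. This is a minimal sketch, not the authors' code; the function names and the assumption of 8-bit photometric pixels are ours.

```python
import numpy as np

def zscore_spectrum(spec_2d):
    """Z-score normalize one 11 x 2000 2D-spectrum array (equation 1)."""
    mean, std = spec_2d.mean(), spec_2d.std()
    return (spec_2d - mean) / std

def scale_image(img_uint8):
    """Map an 8-bit RGB photometric image from [0, 255] to [-1, 1] (equation 2)."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0
```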
2.4 Data set
In this study, 1202 stellar spectra, 1223 galaxy spectra, and 466 quasar spectra were used as 2D data. An equal number of photometric images and 1D spectra for each category are acquired as image data and 1D spectral data, respectively.
We downloaded the star catalogues of the required observation plans (2013 September 14 and 2013 November 3) from the LAMOST official website. By matching the spectral IDs in the data set against the IDs recorded in the catalogue, we obtained the specific information of each spectrum, including its class, ra, and dec. The spectra were then saved by class, which gives the spectral data set used in this experiment. At the same time, the ra and dec of each spectrum were passed to the ImgCutout pipeline on the SDSS official website to obtain the photometric image of the celestial object observed by that spectrum. The photometric images were saved by class according to the class information, which gives the photometric image data set and guarantees a one-to-one correspondence between the spectra and the photometric images. All of these operations were scripted in Python to automate the process. Occasionally a photometric image could not be retrieved; in that case the spectral ID was recorded, and once all spectra had been classified and saved, the spectra whose photometric images could not be retrieved were excluded from the data set.
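As an illustration of the retrieval step, the sketch below requests one JPEG cutout for a given (ra, dec). The endpoint URL and the scale parameter are assumptions based on the public SkyServer ImgCutout interface (see Data Availability), not a record of the authors' script.

```python
import requests

# Assumed SkyServer ImgCutout endpoint for SDSS DR16; parameters are illustrative.
CUTOUT_URL = "http://skyserver.sdss.org/dr16/SkyServerWS/ImgCutout/getjpeg"

def fetch_cutout(ra, dec, size=64, scale=0.4, out_path="cutout.jpg"):
    """Download a size x size JPEG cutout centred on (ra, dec) in degrees."""
    params = {"ra": ra, "dec": dec, "scale": scale, "width": size, "height": size}
    resp = requests.get(CUTOUT_URL, params=params, timeout=30)
    resp.raise_for_status()   # a failed request marks this spectrum for exclusion
    with open(out_path, "wb") as f:
        f.write(resp.content)
    return out_path
```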
It is worth noting that, after obtaining all the data in this way, we sliced the raw 2D spectra as described in Section 2.1 when constructing the 2D spectral data set: the data used for the study were reduced from the original 15*4196 to 11*2000. The reason for slicing in the spatial direction was explained in Section 2.1. According to prior knowledge from our laboratory, an interval of 2000 pixels is enough to contain the spectral lines needed for classification, and including all of the data would add more noise rather than features. Specifically, the pixel sampling interval is 1000–3000 at the blue arm (corresponding to the wavelength range 4100–5300 Å) and 1500–3500 at the red arm (corresponding to 6590–8120 Å). The choice of sampling interval is the result of multiple experiments, which are described in detail in Section 4.5. In addition, based on the same prior knowledge we chose a photometric image size of 64*64, which keeps the computational cost of the network as low as possible while still capturing the whole target celestial body.
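A minimal sketch of this slicing step, assuming the raw frame for one fibre is available as a 15*4196 NumPy array; the variable names and the row-centring convention are ours.

```python
import numpy as np

# Pixel sampling intervals reported in the text.
BLUE_WINDOW = (1000, 3000)   # ~4100-5300 A
RED_WINDOW = (1500, 3500)    # ~6590-8120 A

def slice_spectrum(raw, window):
    """Cut a raw 15 x 4196 fibre spectrum down to the 11 central rows
    and the 2000-pixel window used for training, giving an 11 x 2000 array."""
    centre = raw.shape[0] // 2              # row index 7 of 15
    rows = slice(centre - 5, centre + 6)    # 11 central rows
    cols = slice(window[0], window[1])
    return raw[rows, cols]

# blue_input = slice_spectrum(raw_blue, BLUE_WINDOW)
# red_input = slice_spectrum(raw_red, RED_WINDOW)
```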
More importantly, the labels of all data in this study come from the LAMOST records, whose class labels were obtained by the template matching method of Luo et al. (2012).
We use these three kinds of data to make all four data sets used in this study. The specific data set settings are shown in Table 1. The ‘size’ column in the table shows the size of each data set. Each data set is divided into training set, validation set, and test set according to the ratio of 8:1:1. In addition, in order to ensure fairness, the training set, validation set, and test set used in all experiments are the same. The ‘model’ column describes the network model that takes this data set as input.
Table 1. Data set settings.

| Data set | 1D spec | 2D spec | Photometric image | Size | Split | Model |
|---|---|---|---|---|---|---|
| dataset1 | ✓ | | | 2891 × (1 × 2000) | 8:1:1 | 1D SSCNN |
| dataset2 | ✓ | | ✓ | 2891 × (1 × 2000) & 2891 × (64 × 64 × 3) | 8:1:1 | M-Net |
| dataset3 | | ✓ | | 2891 × (11 × 2000) (blue) & 2891 × (11 × 2000) (red) | 8:1:1 | C-Net |
| dataset4 | | ✓ | ✓ | 2 × 2891 × (11 × 2000) & 2891 × (64 × 64 × 3) | 8:1:1 | MC-Net & MAC-Net |
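The 8:1:1 split shared by all experiments can be reproduced with scikit-learn roughly as follows. This is a sketch under our own assumptions (a hypothetical labels.npy file holding one class label per object, stratified splitting, and a fixed random seed), not a record of the authors' procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.load("labels.npy")   # hypothetical file: one class label (0/1/2) per object
indices = np.arange(len(labels))

# First hold out 20 per cent, then split it evenly into validation and test (8:1:1 overall).
train_idx, rest_idx, _, y_rest = train_test_split(
    indices, labels, test_size=0.2, stratify=labels, random_state=0)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5, stratify=y_rest, random_state=0)
```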
3 NETWORK STRUCTURE
3.1 MAC-NET
In this paper, we propose a new multimodal celestial object classification network (MAC-Net) based on the attention mechanism. The 2D spectral data of LAMOST include red arm data and blue arm data, and different celestial objects have different fluxes at different wavelengths. Using the red arm or the blue arm alone therefore cannot meet the accuracy requirements of celestial object classification, so this paper uses the red arm and blue arm data at the same time and builds two identical spectral feature extraction networks. In traditional 1D spectral classification, the red and blue arm data are first extracted into spectra and then spliced, which is essentially splicing at the data level. In this experiment, features are instead extracted with CNNs and fused in a higher dimensional space: we extract the red and blue arm features through two identical feature extraction networks and then fuse them to obtain new features with richer information.
The structure of MAC-Net is shown in Fig. 4 and Table 2. Specifically, we built a 12-layer network to extract the red and blue arm features of the 2D spectrum, composed mainly of four convolutional layers, four batch normalization layers, and four maxpooling layers. The convolutions extract features, the batch normalization layers speed up model convergence, and the maxpooling layers downsample the feature maps. It is worth mentioning that, because of the extreme imbalance between the width and height of the spectral data, we set the convolution kernel size to 3 × 20 to extract long-range features while keeping the computational cost under control. For the photometric image input branch, we consider that the input image is small (64 × 64), which is more suited to a wide network than a deep one.

The structure diagram of MAC-Net. The first two branches extract the blue arm and red arm features, respectively; convolution extracts features, batch normalization speeds up convergence, and maxpooling downsamples the feature maps. The third branch extracts the features of the photometric image and introduces the improved Inception module. Finally, the features extracted from the blue arm, the red arm, and the photometric image branch are flattened, fused at the feature level, passed through four fully connected layers, and classified with the softmax function. Dropout is used between the fully connected layers to prevent overfitting.
Table 2. Layer configuration of MAC-Net: the blue/red arm branches (left) and the photometric image branch (right), followed by the shared classifier head.

| Blue & red branch: layer | Detail | Image branch: layer | Detail |
|---|---|---|---|
| Conv1 | 3 × 20 × 128 | Conv1 | 7 × 7 × 64 |
| Normalization | BN | MP | 3 × 3 × 64 |
| MP | 2 × 2 × 128 | Normalization | LRN |
| Dropout | 0.5 | Conv2 | 1 × 1 × 64 |
| Conv2 | 3 × 20 × 64 | Conv2 | 3 × 3 × 192 |
| Normalization | BN | Normalization | LRN |
| MP | 2 × 2 × 64 | MP | 3 × 3 × 192 |
| Dropout | 0.5 | – | – |
| Conv3 | 3 × 20 × 64 | – | – |
| Normalization | BN | – | – |
| MP | 2 × 2 × 64 | – | – |
| Dropout | 0.5 | – | – |
| CBAM | – | – | – |
| Conv4 | 3 × 20 × 64 | – | – |
| Normalization | BN | – | – |
| MP | 2 × 2 × 64 | – | – |
| Dropout | 0.5 | Inception | – |
| Flatten | 1 × 8000 | Flatten | 1 × 4096 |

Shared classifier head: Concatenate (1 × 20096), Dense (1 × 4096), Dropout (0.5), Dense (1 × 1024), Dense (1 × 3).
We therefore chose GoogLeNet, with its more uniform depth and width, to extract features from the photometric image input (Raghu et al. 2017). GoogLeNet was proposed by Szegedy et al. (2015); before the Inception module, three convolutional layers, two normalization layers, and two maxpooling layers perform preliminary feature extraction. The Inception module (Ioffe & Szegedy 2015; Szegedy et al. 2016, 2017) then broadens the width of the network, and a large number of 1 × 1 convolutions are used to speed up the computation. To ensure that the final feature map is not downsampled too heavily by pooling, we removed the last average pooling layer of the original GoogLeNet. The features extracted from the red and blue arms and from the photometric image branch are flattened and fused, passed through four fully connected layers, and finally classified by the softmax function. Dropout is used between the fully connected layers to prevent overfitting.
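To make the three-branch, feature-level fusion concrete, the following Keras sketch follows the layer sizes of Table 2. It is a simplified reconstruction rather than the released model: the Inception blocks and the CBAM module are omitted (so the flattened image feature is larger than the 1 × 4096 of Table 2), `same` padding is assumed for the spectral branches (which reproduces the 1 × 8000 flatten size), and all function and tensor names are ours.

```python
from tensorflow.keras import layers, models

def build_spectral_branch(name):
    """One arm (blue or red): an 11 x 2000 single-channel 2D spectrum."""
    inp = layers.Input(shape=(11, 2000, 1), name=name)
    x = inp
    for filters in (128, 64, 64, 64):
        x = layers.Conv2D(filters, (3, 20), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling2D((2, 2), padding="same")(x)
        x = layers.Dropout(0.5)(x)
    return inp, layers.Flatten()(x)          # flattens to 1 x 8000

# Photometric branch: GoogLeNet-style stem only (Inception blocks omitted here).
img_in = layers.Input(shape=(64, 64, 3), name="image")
y = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(img_in)
y = layers.MaxPooling2D(3, strides=2, padding="same")(y)
y = layers.Conv2D(64, 1, activation="relu")(y)
y = layers.Conv2D(192, 3, padding="same", activation="relu")(y)
y = layers.MaxPooling2D(3, strides=2, padding="same")(y)
img_feat = layers.Flatten()(y)

blue_in, blue_feat = build_spectral_branch("blue_arm")
red_in, red_feat = build_spectral_branch("red_arm")

# Feature-level fusion followed by the fully connected classifier head.
fused = layers.Concatenate()([blue_feat, red_feat, img_feat])
h = layers.Dense(4096, activation="relu")(fused)
h = layers.Dropout(0.5)(h)
h = layers.Dense(1024, activation="relu")(h)
out = layers.Dense(3, activation="softmax")(h)

model = models.Model([blue_in, red_in, img_in], out)
```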
3.2 CNN
The convolutional neural network (CNN) is currently the most widely used deep learning algorithm. The feature extraction ability of a neural network lies mainly in its hidden layers, which perform a multilevel abstraction of the input features so that the network can better separate different data non-linearly. Generally, the number of hidden layers reflects the feature extraction ability of the network to a certain extent, but more hidden layers are not always better: the number of hidden layers is mainly constrained by two factors, computational complexity and task complexity. Computational complexity increases with the number of hidden layers, and a slow system is usually not what we want; moreover, when the target task is simple, too many hidden layers will not extract more valuable information and may even cause overfitting, which negatively affects the network.
The nodes in each layer of a CNN are called neurons, and the non-linear activation functions in the neurons nest as the hidden layers are stacked, forming a complex non-linear function. The training process of a CNN can therefore be viewed as continuously fitting a non-linear function to the mapping from data to labels. The relationship between the input and output of each neuron is given in equation (3):

|$Y = f(wX + b)$|(3)

where Y represents the output of the neuron, b the bias, w the weight, X the input, and f the activation function. The activation function used in this paper is ReLU (Toth 2013), shown in equation (4):

|$f(x) = \max (0, x)$|(4)

where x represents the result of the data after a convolution operation.
3.3 CBAM (convolutional block attention module)
The attention mechanism allows MAC-Net to focus better on important features. Just as humans pay different amounts of attention to a target and to the background, different positions in the spectral data contain different information, and introducing the attention mechanism lets our model pay more attention to the positions that contribute to classification. This paper uses the CBAM attention module (Woo et al. 2018), which fuses channel attention and spatial attention; it saves parameters and computing power while remaining compatible with different networks. Fig. 5 is a block diagram of the module.

The overview of CBAM. This module has two sequential submodules: the channel module and the spatial module. The intermediate feature map is adaptively refined in the deep network by this module.
3.3.1 Channel attention
During the convolution calculation, each convolution kernel generates one channel. Standard network algorithms treat the information in each channel as equally important, but experiments have shown that features in different channels affect model performance differently, and features in some channels even have a negative impact on the model (Hu et al. 2020). It therefore makes sense to assign different weights to the features in different channels. The channel attention mechanism calculates a weight for each channel and multiplies the feature map of the corresponding channel by this weight, so that the network highlights the representation of important features. Its structure is shown in Fig. 6.

Diagram of the channel module. The module uses a shared network in the channel dimension to utilize the maxpooling output and the avgpooling output.
First, maxpooling and avgpooling are performed on the input feature map, reducing it to two vectors, which are then passed through a shared multilayer perceptron (MLP). The two MLP outputs are added, passed through the sigmoid function for activation, and multiplied with the original feature map to obtain the output of the channel attention module.
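A minimal Keras sketch of this channel attention computation, assuming channel-last feature maps; the reduction ratio of 8 and the function name are our choices, not values given in the paper.

```python
from tensorflow.keras import layers

def channel_attention(feat, reduction=8):
    """CBAM-style channel attention: shared MLP over max- and avg-pooled descriptors."""
    channels = feat.shape[-1]
    shared_hidden = layers.Dense(channels // reduction, activation="relu")
    shared_out = layers.Dense(channels)
    avg_desc = shared_out(shared_hidden(layers.GlobalAveragePooling2D()(feat)))
    max_desc = shared_out(shared_hidden(layers.GlobalMaxPooling2D()(feat)))
    weights = layers.Activation("sigmoid")(layers.Add()([avg_desc, max_desc]))
    weights = layers.Reshape((1, 1, channels))(weights)
    return layers.Multiply()([feat, weights])   # reweight each channel of the input
```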
3.3.2 Spatial attention
The spatial attention module modifies the feature map by using the spatial relationships within the input feature map. Similar to the channel attention module, it also computes a weight matrix and multiplies it with the feature map of this layer to obtain the input of the next layer; the difference is that the weight matrix is calculated from spatial relationships. The structure of the module is shown in Fig. 7.

Diagram of the spatial attention module. The spatial attention module performs a global maxpooling and a global avgpooling on the output features of the channel attention, respectively. Concatenate the two results based on the channel and forward the concatenated result to the convolutional layer.
The module takes the feature map output by the channel attention as input, performs maxpooling and avgpooling along the channel direction, and concatenates the two outputs along the channel direction. The Conv layer in Fig. 7 is a convolutional layer with a 7 × 7 kernel. After the pooled features pass through this 7 × 7 convolution, the resulting weight matrix is activated by the sigmoid function and multiplied with the original feature map to give the final spatial attention feature map.
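A companion sketch of the spatial attention step under the same assumptions (channel-last tensors, function name ours); the 7 × 7 kernel follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(feat, kernel_size=7):
    """CBAM-style spatial attention: channel-wise avg/max maps, 7 x 7 conv, sigmoid gate."""
    avg_map = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(feat)
    max_map = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(feat)
    stacked = layers.Concatenate(axis=-1)([avg_map, max_map])
    gate = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(stacked)
    return layers.Multiply()([feat, gate])      # reweight each spatial position
```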
3.4 Multimodal machine learning (MMML)
MMML aims to process and understand multisource modal information through machine learning methods. Baltrusaitis, Ahuja & Morency (2019) describe multimodal learning as combining the input information of multiple modalities and feeding it into a deep learning network to classify or regress the target. A schematic diagram of multimodal learning is shown in Fig. 8. Experiments show that networks with joint multimodal input outperform single-modal models, because a modality is essentially a carrier of information: data from multiple sources are semantically related, and combining them provides complementary information and richer semantics, which helps generate more reliable predictions (Huang et al. 2021).

The overview of MMML. Obtain features from different sources of input through different feature extraction networks, and then merged them. Use the fusion features for the final classification or regression.
For example, in this experiment the photometric image can provide morphological semantic information for the 2D spectrum, and the 2D spectrum can likewise provide spectroscopic semantic information for the photometric image. Since the 2D spectrum is a single-channel greyscale image and the photometric image is a three-channel RGB image, the two differ at the data level. Therefore, to alleviate the inconsistency between the raw data of each modality, this work first extracts a feature representation from each modality and then performs fusion at the feature level, that is, feature fusion. The new features obtained by the fusion are then fed into the network for target classification.
4 EXPERIMENT
Before training MAC-Net, some hyperparameters need to be configured in advance, such as the batch size, the learning rate (lr), and the number of epochs. Table 3 shows the hyperparameter settings of MAC-Net and the hardware environment of this experiment. For fairness, all experiments adopt the same hyperparameter configuration as MAC-Net. An early-stopping mechanism is introduced that stops training when the validation loss has not dropped for 50 consecutive epochs, which helps prevent overfitting. In addition, because the data set is seriously imbalanced, the class weights are set to 1:3:1 during training so that the model pays more attention to the quasar samples, compensating for their insufficient number and further reducing the risk of overfitting.
Table 3. MAC-Net hyperparameter settings and experimental environment.

| Configuration | MAC-Net |
|---|---|
| Optimizer | AdaGrad |
| Batch size | 4 |
| Total training epochs | 300 |
| Learning rate | 0.001 |
| Loss function | CrossEntropyLoss |
| Early-stop patience | 50 |
| Class weights | 1:3:1 |
| Operating system | MS Windows 10 |
| Graphics processing unit (GPU) | GeForce 3070 Ti 8 GB |
| Programming language | Python 3.7 |
| Development environment configuration | Keras 2.6.0 & CUDA 11.0 |
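The hyperparameters of Table 3 map on to a Keras training call roughly as below. This sketch assumes the `model` built in the sketch of Section 3 and pre-prepared training arrays with hypothetical names (`x_blue_train`, etc.); the class-weight ordering (galaxy, quasar, star) is our reading of the 1:3:1 setting.

```python
from tensorflow.keras import optimizers, callbacks

model.compile(
    optimizer=optimizers.Adagrad(learning_rate=0.001),
    loss="categorical_crossentropy",            # cross-entropy loss over the 3 classes
    metrics=["accuracy"],
)

early_stop = callbacks.EarlyStopping(
    monitor="val_loss", patience=50, restore_best_weights=True
)

history = model.fit(
    [x_blue_train, x_red_train, x_img_train], y_train,
    validation_data=([x_blue_val, x_red_val, x_img_val], y_val),
    batch_size=4,
    epochs=300,
    class_weight={0: 1.0, 1: 3.0, 2: 1.0},      # up-weight the under-represented quasar class
    callbacks=[early_stop],
)
```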
4.1 Evaluation standard
To verify the feasibility and effectiveness of this experiment on the celestial object classification task, we use the indicators common to classification tasks, namely precision, recall, F1-score, and accuracy, to evaluate the performance of the model (Lu et al. 2021). They are calculated as in equations (5)-(8):

|$\mathrm{precision} = \frac{TP}{TP + FP}$|(5)

|$\mathrm{recall} = \frac{TP}{TP + FN}$|(6)

|$\mathrm{F1} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$|(7)

|$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$|(8)

Here, a sample predicted to be positive that is in fact positive is called a TP (true positive); a sample predicted to be positive that is actually negative is an FP (false positive); an FN (false negative) is a sample predicted to be negative that is actually positive; and a TN (true negative) is a sample predicted to be negative that is actually negative. According to equations (5)-(8), the precision reflects the probability that a sample predicted to be positive really is positive; the recall reflects the probability that an actual positive sample is predicted to be positive; the F1-score is the harmonic mean of precision and recall, and the larger its value, the more comprehensive and robust the model; and the accuracy represents the proportion of correctly predicted samples in all the data. It should be noted that all evaluations in this experiment are based on agreement with labels generated by the LAMOST template matching method, and these labels may still contain some errors (Luo et al. 2012). Therefore, the accuracy mentioned below represents the similarity between the results obtained by this method and those obtained by the LAMOST template method; for example, an accuracy of 98.6 per cent means that this similarity is 98.6 per cent. For brevity, the term 'accuracy' will be used in the following text to refer to this similarity without further explanation.
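For completeness, a short sketch computing these four quantities from true and predicted labels with scikit-learn; the label ordering and the helper name are our own.

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def report_metrics(y_true, y_pred, class_names=("GALAXY", "QSO", "STAR")):
    """Per-class precision/recall/F1 (equations 5-7) and overall accuracy (equation 8)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(range(len(class_names))))
    for name, pi, ri, fi in zip(class_names, p, r, f1):
        print(f"{name}: precision={pi:.3f} recall={ri:.3f} F1={fi:.3f}")
    print(f"accuracy={accuracy_score(y_true, y_pred):.3f}")
```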
4.2 Performance of convolutional neural network only based on spectrum (C-Net & 1D SSCNN)
In this experiment, a CNN based only on 2D spectral data (C-Net) is built first for classification. Its structure is identical to the first two (spectral) branches of our MAC-Net, with the photometric image branch and the CBAM module removed, and its parameter settings are also consistent with those of MAC-Net. Dataset 3 is used for training and testing C-Net; for comparison, we train and test 1D SSCNN on dataset 1. The comparison results are shown in the first and third rows of Table 4.
Table 4. Precision (P), recall (R), and F1-score per class, and overall accuracy, of the five models on the test set.

| Model | Galaxy P | Galaxy R | Galaxy F1 | Quasar P | Quasar R | Quasar F1 | Star P | Star R | Star F1 | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| 1D SSCNN | 0.946 | 0.964 | 0.955 | 0.884 | 0.927 | 0.905 | 0.982 | 0.933 | 0.957 | 0.942 |
| M-Net | 0.964 | 0.982 | 0.973 | 0.932 | 1.0 | 0.965 | 1.0 | 0.933 | 0.966 | 0.968 |
| C-Net | 0.907 | 0.817 | 0.860 | 0.722 | 0.848 | 0.780 | 0.906 | 0.935 | 0.921 | 0.872 |
| MC-Net | 0.983 | 0.967 | 0.975 | 0.977 | 0.935 | 0.956 | 0.969 | 1.0 | 0.984 | 0.976 |
| MAC-Net | 0.992 | 0.992 | 0.992 | 1.0 | 0.935 | 0.966 | 0.976 | 1.0 | 0.988 | 0.986 |
The results show that the C-Net network, which relies solely on 2D spectra for celestial object classification, performs worse than the 1D SSCNN network, which classifies 1D spectra. We attribute this to the fact that the raw 2D spectral data used in this experiment have not undergone processing such as calibration and sky subtraction, so they suffer serious skylight interference, especially the red arm data, which are known for strong skylight contamination. The skylight masks the original feature information to a certain extent. The 1D spectra downloaded from the official website, in contrast, have been calibrated and sky-subtracted, which greatly improves their signal-to-noise ratio and makes their features more obvious. As a result, there is a large gap between the results of C-Net and 1D SSCNN, both in precision and in the robustness of the model.
4.3 Performance of convolutional network based on multimodal input (MC-Net & M-Net)
The results in Table 4 show that noise interference in the 2D spectrum has a negative impact on the classification results. Rather than trying to denoise the 2D spectral data, we took another approach and introduced the photometric data of SDSS, designing a multimodal network structure in which the morphological semantic information of the photometric image compensates for the spectroscopic semantic information lost in the un-denoised 2D spectrum. MC-Net is built on the basis of C-Net: compared with C-Net, MC-Net adds a photometric image input branch, which uses an improved GoogLeNet to extract features from the photometric images; these are fused with the features extracted from the red and blue arm data of the 2D spectrum, and the fused features are sent to the classifier for prediction. In addition, the same photometric image branch is added to the 1D SSCNN to build M-Net, in order to evaluate the performance of MC-Net more objectively. The second and fourth rows of Table 4 show the performance of M-Net and MC-Net.
The results in Table 4 show that, after the introduction of photometric image data, the classification precision for the various celestial objects improves to varying degrees. First of all, this shows that the morphological semantics of photometric images can indeed enrich the semantic information extracted by the network and play a positive role in its classification performance; the comparison between MC-Net and C-Net supports this conclusion, as does the comparison between M-Net and 1D SSCNN.

In addition, once the photometric image data were introduced, some results of MC-Net surpassed those of M-Net: for example, the precisions for galaxies and quasars were 1.9 per cent and 4.5 per cent higher, respectively, and the overall accuracy on the test set was 0.8 per cent higher than that of M-Net. We think this result indicates that the 2D spectrum is more informative than the 1D spectrum.

Secondly, we found that the introduction of morphological semantics greatly improves the classification precision for QSOs. Our analysis is that the features of QSOs are concentrated at the red end of the 2D spectrum, where there is a lot of noise in data without sky subtraction; this noise interferes with the network's extraction of QSO features. The fact that C-Net, which does not introduce morphological semantics, also has a low classification precision for QSOs supports this view.
Furthermore, we attempt to support our analysis with Class Activation Maps (CAM). CAM (Selvaraju et al. 2020) is a visualization method commonly used in computer vision to explain the features extracted by a neural network; it shows the regions the network pays more attention to, which we can regard as the features it has extracted.
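As an illustration, a gradient-weighted activation map in the spirit of the cited CAM method can be computed in Keras roughly as follows; the helper name, the choice of layer, and the implementation details are ours, not the authors' visualization code.

```python
import tensorflow as tf

def class_activation_map(model, inputs, conv_layer_name, class_index):
    """Gradient-weighted activation map for one class at a chosen conv layer."""
    grad_model = tf.keras.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(inputs)
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2), keepdims=True)   # pooled gradients per channel
    cam = tf.nn.relu(tf.reduce_sum(conv_out * weights, axis=-1))
    cam = cam / (tf.reduce_max(cam) + 1e-8)                       # normalize to [0, 1]
    return cam.numpy()[0]
```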
As shown in Fig. 9, unlike for the other two types of celestial objects, for quasars the neural network pays more attention (highlighted area) to the red arm, which means that the network considers the red arm to contain more features unique to quasars. This provides additional support for the analysis above.

Class Activation Maps (CAM) showing how the features of different celestial objects are distributed over the red arm and blue arm data (result of MAC-Net). The horizontal axis at the bottom of each subplot is the pixel position, and the axis at the top gives the corresponding wavelength (Å); since the relationship between pixel and wavelength is non-linear, only the first and last wavelengths are given. The darker the colour at a position, the greater the contribution of the features at that position to the classification: the dark red areas contribute the most, the yellow areas contribute less, and dark blue, which is close to the background colour, contributes the least. Clearly, the features of quasars are more concentrated in the red arm than those of galaxies and stars. In the blue arm, the galaxy features are concentrated at the two ends of the pixel interval, while the stellar blue arm features are concentrated in the middle.
4.4 The performance of the multimodal convolutional network with the introduction of the attention mechanism (MAC-Net)
In this part, we introduce the attention mechanism and build the MAC-Net network. Table 4 compares the classification precision, recall, F1-score, and accuracy of this network with those of the previous networks for each type of celestial object. Fig. 11 shows the performance of the above networks on the test set as confusion matrices. To demonstrate the effectiveness of the attention mechanism more directly, we compare the CAM diagrams before and after its introduction; the results are shown in Fig. 10.

Feature-distribution CAMs of the different celestial objects at the blue and red arms, before and after adding the attention mechanism. Taking 'G–B' as an example, it means that both a(1) and a(2) are blue arm spectra of a galaxy. The horizontal axis at the bottom of each subplot is the pixel position, and the axis at the top gives the corresponding wavelength (Å); since the relationship between pixel and wavelength is non-linear, only the first and last wavelengths are given.

The confusion matrix heat maps of the five models. The last subplot shows, for each model, the proportion of the total test set that is confused between classes. The darker the colour in a heat map, the larger the proportion; the darker the colour on the main diagonal, the better the model performs. Clearly MAC-Net achieves the best performance.
It can be seen from Table 4 that, after the introduction of the attention mechanism, the network's classification precision for the three types of celestial objects improves, in line with our original intention: the attention mechanism makes the network pay more attention to useful information. MAC-Net achieves, or is on par with, the best performance on most evaluation metrics, for example the highest precision for the galaxy and quasar categories, the highest F1-scores, and the best classification accuracy on the test set. This shows that the performance of MAC-Net is comprehensive and robust.
It can be seen from Fig. 10 that, after the introduction of the attention mechanism, the features extracted for the various celestial objects change significantly. For the same type of object there is a clear variation in feature importance at the same pixel location (first row), and the importance of features at different pixel locations becomes clearly distinguished (left end of the two images in the sixth row). This indicates that the attention mechanism does change the weight distribution of features in the network model.
Fig. 11 shows that the classification errors of both 1D SSCNN and C-Net are mainly concentrated in confusion between quasars and galaxies. For example, about 4 per cent of galaxies in 1D SSCNN are classified as quasars (first row, second column of the 1D SSCNN panel) and about 5 per cent of quasars as galaxies (second row, first column). This is more pronounced in C-Net, where about 10 per cent of galaxies are classified as quasars and about 11 per cent of quasars as galaxies. For MC-Net and MAC-Net most of the data lie on the main diagonal, indicating better classification. A more intuitive comparison is drawn in the last subplot of Fig. 11, where 'G-Q' in the legend denotes the proportion of the whole test set in which galaxies and quasars are confused (including galaxies mistaken for quasars and quasars mistaken for galaxies). Clearly, once the photometric image is introduced, the confusion rate of each category drops significantly. The drop is most pronounced for galaxies and quasars, because they differ greatly in morphology and the morphological features of the target are added to the model once photometric images are introduced. For stars and quasars the benefit of the photometric images is smaller, because their morphological differences are not obvious; nevertheless the confusion rate still decreases, indicating that the model has learned other, non-morphological features. In conclusion, these observations demonstrate the advantages of multimodal information fusion networks.
However, fusing multiple sources of information may affect the inference speed of the model. The inference times and parameter counts of the five models used in this work are presented in Table 5, where the inference time is the time the model takes to complete one inference. As shown in Table 5, 1D SSCNN and C-Net, which rely only on spectral data for classification, have the shortest inference times, but unfortunately their performance is not good enough, especially that of C-Net. With photometric image data introduced, the inference times of M-Net and MC-Net are 6.72 and 6.56 ms longer than those of 1D SSCNN and C-Net, respectively, but the performance of these two networks improves greatly, especially for MC-Net. The best-performing MAC-Net is 1.18 ms slower than MC-Net; at a cost of 1.18 ms, improving all indicators by nearly 1 per cent is acceptable, and astronomical data usually do not require real-time processing. Perhaps as the amount of data increases, astronomers will demand ever higher inference speeds, but for now we consider this cost worthwhile for the better performance.
Table 5. Comparison of parameter counts and inference times of the different models.

| Model | Parameters | Inference time |
|---|---|---|
| 1D SSCNN | 133 M | 9.18 ms |
| M-Net | 188 M | 15.90 ms |
| C-Net | 269 M | 9.57 ms |
| MC-Net | 356 M | 16.13 ms |
| MAC-Net | 360 M | 17.31 ms |
4.5 Exploration of 2D spectral pixel sampling interval
In the experiments of Section 4.2, the 2D spectral data input to the network have a size of 11*2000, whereas the full 2D spectrum is 4196 pixels long in the wavelength direction. We therefore set up multiple control experiments with different pixel sampling intervals to explore how the sampling interval affects the classification precision for each type of celestial object. First, we fix the blue arm pixel sampling interval at 1000–3000 and sample the red arm data over four different pixel intervals, 500–2500, 1000–3000, 1500–3500, and 2000–4000, feeding each into C-Net; the resulting classification precisions are shown in Fig. 12 (right-hand panel). Similarly, we fix the red arm pixel sampling interval at 1500–3500, sample the blue arm over the same four intervals, and input each into C-Net; the classification precisions are shown in Fig. 12 (left-hand panel).

The effect of changing the pixel sampling interval at the blue (left-hand panel) and red (right-hand panel) arms on the classification precision of each celestial object (result of C-Net). Obviously, the features of different celestial bodies are distributed in different pixel positions, and the blue and red arm data of the same celestial body are also distributed in different positions. The horizontal axis at the bottom of each sub-image is a different pixel interval, and the horizontal axis at the top is the wavelength interval (Å) corresponding to this pixel interval (one-to-one correspondence).
In Fig. 12, for galaxies, when the blue arm pixel interval is shifted from 500–2500 to 1000–3000 the precision increases significantly, which indicates that galaxy features exist in the 2500–3000 pixel interval of the blue arm (no conclusion can be drawn about the 500–1000 interval). As the window moves further right to 1500–3500 the precision drops, showing that the discarded 1000–1500 interval also contains galaxy features. Similarly, when the window moves from 1500–3500 to 2000–4000, we can infer that galaxy features also exist in the 1500–2000 interval. To sum up, galaxy features exist in the three intervals 1000–1500, 1500–2000, and 2500–3000. Comparing with Fig. 10 a(2), whose pixel interval is 1000–3000, features can indeed be seen in the three intervals mentioned.
In the red arm, when the window slides from 500–2500 to 1000–3000 the precision for galaxies drops, indicating that galaxy features exist in the 500–1000 pixel range of the red arm spectrum; when the window slides to 1500–3500 the precision rises, indicating that galaxy features also exist in 3000–3500; and when the window changes from 1500–3500 to 2000–4000 the precision drops, indicating that galaxy features also exist in the 1500–2000 interval. Comparing with Fig. 10 b(2), there are indeed galaxy features in the intervals 1500–2000 and 3000–3500. According to this analysis there should also be features in the 500–1000 pixel interval of the red arm, but because of the experimental setup we can only visualize the 1500–3500 interval that took part in the training, so the features in the 500–1000 range unfortunately cannot be displayed. In the following paragraphs, when we analyse the precision curves of quasars and stars, we therefore only mention the pixel intervals that can be seen in Fig. 10.
We can draw similar conclusions from the classification precision curves of quasars: quasar features exist in the 1500–2000 pixel interval of the blue arm spectrum, corresponding to Fig. 10 c(2). For stars, the 1000–1500 pixel range of the blue arm spectrum and the 1500–2000 and 3000–3500 pixel ranges of the red arm spectrum all contain stellar features, corresponding to Figs 10 e(2) and f(2), respectively. To sum up, the results of Figs 12 and 10 confirm each other, which to a certain extent demonstrates the correctness of the features learned by the model.
Fig. 13 shows the effect of changing the pixel sampling interval on the F1-score of each celestial object class. It can be seen that different pixel sampling intervals affect the F1-score; the higher the F1-score, the better the robustness of the model. Finally, weighing the performance of the various pixel intervals in terms of both precision and F1-score, we selected the 1000–3000 pixel interval for the blue arm and the 1500–3500 interval for the red arm.

The impact of different pixel sampling intervals on the F1 score of the three types of celestial bodies (result of C-Net). The horizontal axis at the bottom of each sub-image is a different pixel interval, and the horizontal axis at the top is the wavelength interval corresponding to this pixel interval (one-to-one correspondence).
From Fig. 12 we can see that the classification precision of the three types of celestial objects changes as the pixel sampling interval changes. There are two reasons behind this phenomenon: first, the features of different celestial objects are concentrated in different pixel sampling intervals; second, in 2D spectra without sky subtraction, the skylight intensity differs between sampling intervals, so the features in different intervals are contaminated to different degrees, which affects the classification precision.
However, the data set of this study mixes two different observation plans, 2013 September 14 and 2013 November 4, and we tried to ensure that each contributes 50 per cent of the final data set; the test set likewise mixes the two observations. What needs to be considered is how likely it is that the skylight interference in the two observation plans has the same distribution (i.e. the same skylight intensity in each pixel sampling interval). Therefore, it can only be said that Fig. 12 reflects, to a certain extent, the feature distribution of the different celestial objects over the pixel sampling intervals of the 2D spectrum. On this basis, we can see that the features of galaxies are distributed fairly evenly over the red arm and are concentrated in the 1000–3000 range of the blue arm, while the features of stars are mostly distributed in the 1000–3500 range of the red arm and concentrated at the two ends of the blue arm. In addition, even within the same pixel sampling interval, the density of the feature distribution differs between celestial object types.
5 CONCLUSIONS
In this article, we try to directly use 2D spectroscopy to perform celestial classification tasks. A Multimodal Spectral Classification Network (MAC-Net) is established by introducing photometric images and an attention mechanism into the original 2D spectral classification network. In the MAC-Net, the photometric image branch can enrich the semantic information extracted by the network model to compensate for the incomplete extraction of spectral semantic information caused by the interference of sky light in the 2D spectral data. The attention mechanism increases the MAC-Net’s attention to important information and further improves the performance of the model.
The MAC-Net model not only provides reliable spectral classification results, but also removes the need for the complicated operation of converting 2D spectra to 1D spectra, improving the utilization of the spectra.
All in all, the experimental results show that MAC-Net improves on the other network models to a certain degree, reaching state-of-the-art (SOTA) performance among them. However, because MAC-Net uses a multimodal design with multiple inputs, its parameter count is large and its operation relatively slow.
Additionally, the multimodal modelling approach in this paper has been shown to be a good way to solve image classification problems. In astronomy there are many types of images from different sky surveys, such as WISE and SDSS, so the method in this paper has the potential to handle similar problems for other telescopes. Multimodal modelling approaches should have even broader prospects. As is well known, for reasons of cost the quantity of photometric image data far exceeds that of spectral data, while spectra contain more detailed information. We can therefore adopt the multimodal method and use spectral data to assist the training of image data (Kumar et al. 2019), so as to improve the performance of related tasks (classification and parameter regression) that take image data as input. We could even extend the visual captioning task of multimodal modelling to astronomy (Bin et al. 2021) and realize the generation of spectra from photometric image data. However, multimodal modelling requires multisource data sets, and in many cases multisource data do not exist for the same celestial objects; the lack of data will then be a major limitation of this kind of method.
In future work, we will try to subtract the skylight from the 2D spectra in order to achieve high-precision classification of celestial objects using 2D spectroscopy alone, and, on the sky-subtracted 2D spectra, further explore how the features of each type of celestial object are distributed over the 2D spectrum (Section 4.5 of this article).
ACKNOWLEDGEMENTS
This work is supported by the Joint Research Fund in Astronomy, National Natural Science Foundation of China, U1931134; the Natural Science Foundation of Hebei, A2020202001; and the Cultivation Project for LAMOST Scientific Payoff and Research Achievement of CAMS-CAS.
DATA AVAILABILITY
The photometric image data underlying this article are available in SDSS DR16, at http://skyserver.sdss.org/dr16/en/help/docs/api.aspx#imgcutout and the spectrum data are available in LAMOST DR6, at http://dr6.lamost.org.