Abstract

We present a deep neural network real/bogus classifier that improves classification performance in the Tomo-e Gozen Transient survey by handling label errors in the training data. In the wide-field, high-frequency transient survey with Tomo-e Gozen, the performance of conventional convolutional neural network classifiers is not sufficient as about 10⁶ bogus detections appear every night. In need of a better classifier, we have developed a new two-stage training method. In this training method, label errors in the training data are first detected by normal supervised learning classification, and then they are unlabeled and used for training of semi-supervised learning. For actual observed data, the classifier with this method achieves an area under the curve (AUC) of 0.9998 and a false positive rate (FPR) of 0.0002 at a true positive rate (TPR) of 0.9. This training method saves relabeling effort by humans and works better on training data with a high fraction of label errors. By implementing the developed classifier in the Tomo-e Gozen pipeline, the number of transient candidates was reduced to ∼40 objects per night, which is ∼1/130 of the previous version, while maintaining the recovery rate of real transients. This enables more efficient selection of targets for follow-up observations.

1 Introduction

Time-domain astronomy has become an active area of modern astronomy. Studies of transient phenomena such as supernovae have been developing rapidly in recent years. To observe transients efficiently, transient surveys have become wider-field, more sensitive, and more frequent. As a result, the number of discovered transients has dramatically increased; the reported number of transients reaches tens of thousands per year. In the near future, hundreds of transient objects will be discovered every night with, e.g., the Vera C. Rubin Observatory (Ivezić et al. 2019).

To detect transients from large amounts of data, most transient surveys implement the image subtraction method, which detects transients by subtracting a past reference image from the observed new image. By image subtraction, only objects that change in brightness, such as transients, are extracted. The subtraction method can efficiently detect transients blended in galaxies. However, this method also has a disadvantage: it tends to generate a large number of fake detections (hereafter called bogus objects). Therefore, the development of efficient methods to remove a large number of bogus objects has become important. In order to select targets for follow-up observations, it is necessary to extract the real transients (hereafter called real objects) from the detected candidates that include bogus objects. However, with the increase in the scale of observations, the number of bogus objects has grown to a level for which visual checking by humans is not feasible. For example, in the Palomar Transient Factory (PTF; Law et al. 2009), potential candidates of the order of 10⁶ are detected per night (Brink et al. 2013). Among these, the number of bogus objects is estimated to be more than 1000 times greater than that of real ones (Mahabal et al. 2019). Thus, conventional selection methods, such as parameter cuts, are no longer able to narrow down the candidates.

Machine-learning techniques are therefore gaining attention as an alternative. In the case of real/bogus classification, by letting the machine learn the relationship between the data of detected objects and their classifications, the machine can classify transient candidates. Once trained, classification is fast and can be performed in real time for a large amount of data. Various methods for real/bogus classification by machine learning have been proposed and implemented in many transient surveys. In the early era, classification was performed by inputting features extracted from images into random forests or neural networks (e.g., Bloom et al. 2012; Brink et al. 2013; Wright et al. 2015; Morii et al. 2016). Recently, the use of convolutional neural networks (CNNs), in which image data are directly input and the machine itself learns the features, has become mainstream (e.g., Gieseke et al. 2017; Turpin et al. 2020; Killestein et al. 2021; Hosenie et al. 2021). For example, in the Zwicky Transient Facility (ZTF; Bellm et al. 2019) survey, a CNN-based classifier, braai (Duev et al. 2019), is applied.

The Tomo-e Gozen transient survey is a time-domain survey project, which utilizes the wide-field Tomo-e Gozen camera with 84 CMOS sensors covering a field of view of 20 deg² per exposure (Sako et al. 2018). The transient survey is performed with a high cadence of about 3–4 times per night with a typical sensitivity of 18 mag without filters. The survey observes at a rate of 10⁵ images day⁻¹, and as many as 10⁶ transient candidates are detected every night. Although a CNN classifier was used to sort out real transients from these candidates, which are mostly bogus, the classification performance was not sufficient. There was still a large number of false positives (of the order of 10³ day⁻¹), i.e., bogus objects classified as real ones. Thus, we needed a new classifier with higher performance.

To obtain higher classification performance, one can use more complex models. In general, training complex models requires large amounts of training data. In such cases, the training is usually done by using simulated data rather than real data. However, with millions of simulated samples, it becomes infeasible to check the quality of the simulated data manually. As a result, the training data can be contaminated by label errors, e.g., real objects mislabeled as bogus ones. When label errors are included in the training data, they have an adverse impact on performance (e.g., Ayyar et al. 2022).

This paper describes improvements in the real/bogus classification of the Tomo-e Gozen transient survey by using a complex machine-learning model and by handling label errors. The structure of this paper is as follows. We first introduce the observational dataset used in this work in section 2. Then, the design of the new classifiers is described in section 3. The performance of these classifiers is presented in section 4. In section 5, we discuss key factors for improving the performance and show improvements in the actual operation. Finally, we give conclusions in section 6.

2 Observational data

In this section, we describe our transient survey and dataset used to develop the machine-learning classifier. We use optical images from the Tomo-e Gozen camera mounted on the 1.05 m Kiso Schmidt telescope (Sako et al. 2018). The Tomo-e Gozen camera is a mosaic camera equipped with 84 CMOS sensors. Thanks to the fast CMOS readout, the survey data are taken at a rate of 2 frames s⁻¹. For the transient survey, 12 or 18 consecutive images (6 or 9 s exposure) are taken with no filter, and these images are stacked. A typical limiting magnitude of the stacked image is 18 mag.

For transient detection, image subtraction between the stacked image and the reference image is performed. For image subtraction, we use the hotpants software (Becker 2015), which is based on the method by Alard and Lupton (1998). We use Pan-STARRS1 (PS1) r-band data (Waters et al. 2020; Flewelling et al. 2020) as reference images. Since PS1 data have better sensitivity and better seeing, the sensitivity of the Tomo-e Gozen image is not degraded even after image subtraction, which enables efficient transient detection. Also, as the Tomo-e Gozen camera is a newly developed camera, there were no deep stacked images at the beginning of the operation. Thus, the use of the existing PS1 data enables the transient survey even at the early stage of the operation.

Thanks to the high observing rate, as many as 10⁵ stacked images are taken every night in the Tomo-e Gozen transient survey. The number of transient candidates can reach 10⁶ in one night. As in other transient surveys, the detected candidates are by far dominated by bogus detections. Also, image subtraction between different telescopes/instruments tends to cause more bogus objects due to differences in various properties of the data, for example, response functions or pixel scales. In this paper, we show that this difficulty can be overcome by developing a high-performance, complex machine-learning classifier (section 3).

We here describe the dataset used to develop the machine-learning classifier. For the real/bogus classification, we use a set of three images: the observed new image, the reference image, and the subtracted image. For each, a cutout image around the transient candidate (29 × 29 pixels) is used as an input to the classifier. Figure 1 shows an example of cutout images.

Fig. 1. Example of input images (new, reference, and subtracted images from left to right).

Since we need a large dataset to train a complex machine, we use artificial objects as the real samples in the training dataset. For this purpose, we constructed a point spread function (PSF) for each image by measuring the shapes of stars in the image, and then embedded the constructed PSF into the observational images with varying brightness. The artificial objects were embedded at two kinds of locations: (1) uniformly distributed around galaxies, and (2) randomly distributed over the entire region of the images. Here, (1) and (2) mimic normal transients and hostless transients, respectively. We prepared about 6 × 10⁵ samples for each case. When embedding artificial sources around galaxies, we randomly selected objects registered as extended sources in the Pan-STARRS catalog.
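As an illustration of this injection step, the sketch below embeds a PSF-shaped artificial source into an image at a given position. A Gaussian stamp stands in here for the PSF measured from stars, and all function names and parameter values are illustrative assumptions rather than the actual pipeline code.

```python
import numpy as np

def gaussian_psf(size=15, fwhm=3.0):
    """Stand-in for the measured PSF: a normalized 2D Gaussian stamp."""
    sigma = fwhm / 2.355
    y, x = np.mgrid[:size, :size] - (size - 1) / 2.0
    psf = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return psf / psf.sum()

def inject_artificial_source(image, psf, flux, x0, y0):
    """Add a PSF stamp scaled to `flux` at pixel position (x0, y0)."""
    out = image.copy()
    h, w = psf.shape
    y1, x1 = int(y0) - h // 2, int(x0) - w // 2
    out[y1:y1 + h, x1:x1 + w] += flux * psf
    return out

# Example: embed a fake transient at a random location in a toy 500x500 image.
rng = np.random.default_rng(0)
image = rng.normal(100.0, 5.0, size=(500, 500))       # toy sky background
x0, y0 = rng.integers(20, 480, size=2)
injected = inject_artificial_source(image, gaussian_psf(), flux=2000.0, x0=x0, y0=y0)
```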

For bogus samples of the training data, we used actual bogus objects detected in the subtracted images of Tomo-e Gozen. Figure 2 shows examples of bogus objects in our dataset. The total number of bogus samples is 2 × 10⁶. The majority of them are false detections due to failed subtraction (figures 2a and 2b). Other cases include false detections due to diffracted light from bright stars (figure 2c), hot pixels of the sensor (figure 2d), and artificial noise patterns due to malfunctions of the data acquisition system (figures 2e and 2f). It should be emphasized that a small number of real transients may be included in the bogus samples, since we assume that all the detected objects are bogus (i.e., label errors; see subsection 4.2).

Fig. 2. Examples of bogus detection.

For the validation dataset, we used samples extracted from the actual observational data taken from 2021 January to April. The real dataset includes 125 objects reported to the Transient Name Server (TNS)1 that were observed by Tomo-e Gozen. Since some of them were detected multiple times, the total number of real samples is 363. As bogus samples, we used bogus objects in the same images in which the real objects were detected. The total number of bogus objects is 255777. The actual bogus-to-real ratio is much higher than this sample ratio (see subsection 5.2). Nevertheless, this sample ratio was adopted because the size of the bogus dataset would exceed that of the training dataset if we adopted the actual ratio (∼1 : 10⁶). The numbers of samples in the training and validation datasets are summarized in table 1.

Table 1. Numbers of samples in the training and validation datasets.

Dataset      Class    Number     Note
Training     Real     1224773    Artificial
             Bogus    2031193    Actual
Validation   Real     363        TNS
             Bogus    255777     Actual

Finally, we test our classifiers in the actual operation of the Tomo-e Gozen transient survey (subsection 5.2). For this purpose, we use data taken over five nights, corresponding to about 5 × 10⁵ images and 5 × 10⁶ detections in total.

3 Method

To improve the performance of the conventional classifier, we modify the neural network into one with a more complex structure (subsection 3.1). In addition, to take full advantage of a neural network with a complex structure, we propose a new training method2 with devised objective functions and dataset handling (subsection 3.2).

3.1 Model architecture

This subsection describes the model structure of a simple conventional model, which serves as a baseline, and of the more complex model that we propose. Hereafter, we call them the “Simple model” and the “Complex model,” respectively. Both models perform binary classification (real or bogus) of a detected object by using the three input images. Since the real class is of primary interest here, the real and bogus classes are defined as the positive and negative classes, respectively.

3.1.1 Simple model

As shown in figure 3, the Simple model consists of two convolutional layers in the first half and three fully connected layers in the second half. This structure follows the VGG model (Simonyan & Zisserman 2014), which is a basic model structure for image-classification tasks using deep learning.

Fig. 3. Architecture diagram of the Simple model. Since no padding is performed in the convolution layers, the spatial size of the features decreases with each pass through a convolution layer. The model adopts kernel_size = 5 for the first convolutional layer and kernel_size = 3 for the second convolutional layer. The flatten layer collapses the input (6, 6, 64) features into a vector with a size of 2304. In the dropout layer, the probability of a value being set to 0 is 0.3. This figure was generated by PlotNeuralNet3 and subsequently modified.
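As a concrete illustration, below is a minimal PyTorch sketch of the Simple model. The channel counts, hidden-layer widths, and pooling configuration are not fully specified in the text; they are assumptions chosen here so that the flattened feature size matches the 2304 quoted in figure 3, and the sketch is not the actual pipeline code.

```python
import torch.nn as nn

class SimpleModel(nn.Module):
    """Sketch of the two-convolution + three-fully-connected Simple model.

    Input: a (3, 29, 29) cutout stack (new, reference, subtracted images).
    Output: a two-dimensional vector (softmax gives the real probability).
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5),    # 29x29 -> 25x25, no padding
            nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),    # 25 -> 13 (pooling is an assumption)
            nn.Conv2d(32, 64, kernel_size=3),   # 13 -> 11, no padding
            nn.ReLU(),
            nn.MaxPool2d(2, ceil_mode=True),    # 11 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                       # (64, 6, 6) -> 2304
            nn.Linear(2304, 256), nn.ReLU(),
            nn.Dropout(p=0.3),                  # probability of zeroing = 0.3
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 2),                   # real/bogus output vector
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```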

In order to keep the range of values inside the network consistent, we normalize the input images. The normalization is performed on the original image |$\mathbf {u}_{i,c} \in \mathcal {R}^{H \times W}$| of the cth channel in the ith sample (H and W are the height and width of each image) as follows:
|$\mathbf {x}_{i,c} = \dfrac{\mathbf {u}_{i,c} - \min \left( \mathbf {u}_{i,c} \right)}{\max \left( \mathbf {u}_{i,c} \right) - \min \left( \mathbf {u}_{i,c} \right)},$|  (1)
where max ( ) and min ( ) are functions that return the maximum and minimum values of the image, respectively. The output of the network is a two-dimensional vector. By normalizing the output vector with the softmax function, it can be interpreted as the probability that the object is real.
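A minimal NumPy sketch of this per-channel min-max normalization, assuming the target range is [0, 1] (variable names are illustrative):

```python
import numpy as np

def minmax_normalize(cutouts):
    """Normalize each channel of one sample to [0, 1], as in equation (1).

    `cutouts` has shape (3, H, W): the new, reference, and subtracted images.
    """
    out = np.empty_like(cutouts, dtype=float)
    for c, u in enumerate(cutouts):
        out[c] = (u - u.min()) / (u.max() - u.min())
    return out

sample = np.random.default_rng(1).normal(size=(3, 29, 29))
normalized = minmax_normalize(sample)
assert normalized.min() >= 0.0 and normalized.max() <= 1.0
```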

3.1.2 Complex model

To achieve higher performance than the Simple model, we increase the number of layers in the new Complex model. However, increasing the number of layers makes training more difficult. One famous model that addresses this problem is ResNet (He et al. 2016a, 2016b), which has various extended versions. Among the many extensions of ResNet, we have adopted SE-ResNet, which includes a channel and spatial squeeze and excitation (csSE) layer (Roy et al. 2018). Figure 4 shows an architecture diagram of the Complex model.

Fig. 4. Architecture diagram of the Complex model. In the convolution layers, the kernel size is 5 only in the first layer after the input and 3 in the remaining layers. Also, padding is performed to make the size in the spatial direction the same for the input and output. Each region with a light blue background is a residual block. The third residual block from the input side has a convolution layer with stride = 2 in the skip connection to adjust the feature size. The global average pooling calculates averages in the spatial direction for each channel and downsamples features of size (4, 4, 256) into a 256-dimensional vector. This figure was generated by PlotNeuralNet and subsequently modified.

ResNet is a network structure consisting of stacked residual blocks, each of which consists of convolutional layers and a skip connection between the input and the output of the block. The skip connection mitigates the vanishing gradient problem and thus facilitates training. Because it is difficult to learn the identity map with convolutional layers alone, stacking too many convolutional layers degrades performance; the skip connection makes it easier for the residual block as a whole to learn the identity map. Therefore, high performance can be achieved even in a deep network of stacked residual blocks. Since the csSE layer emphasizes the parts of the features that are effective for classification, SE-ResNet is expected to improve the classification performance compared to the original ResNet.

One component of the residual block is the batch normalization layer (Ioffe & Szegedy 2015). Batch normalization stabilizes and speeds up training, which contributes to the success in learning deep networks. However, it has several weaknesses, one of which is performance degradation when the samples in a batch are highly correlated. This is a concern for astronomical data, where many similar, highly correlated images appear. Therefore, instead of batch normalization, we use filter response normalization (Singh & Krishnan 2020), which is not affected by the correlation between samples because the normalization is done per channel and per sample. The input to the network, its normalization method, and the format of the output are the same as for the Simple model.
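The following is a minimal sketch of a filter response normalization layer with its thresholded linear unit, following the formulation of Singh & Krishnan (2020); the exact placement inside our residual blocks and any learnable-epsilon variant are omitted here.

```python
import torch
import torch.nn as nn

class FilterResponseNorm2d(nn.Module):
    """Minimal filter response normalization (FRN) + thresholded linear unit (TLU).

    The normalization uses only the spatial mean of x**2 of each channel of each
    sample, so it does not depend on other samples in the batch.
    """
    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        shape = (1, num_channels, 1, 1)
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))
        self.tau = nn.Parameter(torch.zeros(shape))      # TLU threshold
        self.eps = eps

    def forward(self, x):
        nu2 = x.pow(2).mean(dim=(2, 3), keepdim=True)    # per sample, per channel
        x = x * torch.rsqrt(nu2 + self.eps)
        return torch.max(self.gamma * x + self.beta, self.tau)
```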

3.2 Training methods

This subsection describes the training methods for the conventional and new classifiers. We test multiple methods for three phases of training: (1) how to treat the training data, (2) which objective function is used, and (3) how to handle label errors.

3.2.1 Treatment of training data

For the treatment of training data, two methods are tested: the first is to prepare the training data for each CMOS sensor and train a classifier for each sensor; the other is to combine the training data of all the sensors and train a single classifier to classify the data of all the sensors. These tests are performed because it is not obvious which option gives a better performance, i.e., having a classifier specialized for each sensor to incorporate sensor diversity or having a unified classifier for all the sensors.

3.2.2 Objective functions

For training, three types of objective functions are used: the cross-entropy function, the exp-Cross-hinge function (Kurora et al. 2020), and the local distributional smoothness function (Miyato et al. 2016). They serve as a loss function for individual samples, a loss function for the entire dataset, and a loss function that makes the classifier robust to input perturbations, respectively. Each training image |$\mathbf {x}$| prepared in section 2 is paired with a teacher label y indicating whether the image belongs to the positive class or the negative class. In this work, we intentionally ignore the labels of some of the training data (we call this procedure “unlabeling”). We describe why we ignore the labels and how the corresponding samples are selected from the dataset in sub-subsection 3.2.3. The objective function is defined as follows:
|$L\left(\left\lbrace \mathbf {x}_l, y\right\rbrace, \left\lbrace \mathbf {x}_u\right\rbrace ; \boldsymbol{\theta }\right) = \lambda _{\rm ce} L_{\rm ce}\left(\left\lbrace \mathbf {x}_l, y\right\rbrace ; \boldsymbol{\theta }\right) + \lambda _{\rm ech} L_{\rm ech}\left(\left\lbrace \mathbf {x}_l, y\right\rbrace ; \boldsymbol{\theta }\right) + \lambda _{\rm lds} L_{\rm lds}\left(\left\lbrace \mathbf {x}_l\right\rbrace \cup \left\lbrace \mathbf {x}_u\right\rbrace ; \boldsymbol{\theta }\right).$|  (2)
Here, |$\left\lbrace \mathbf {x}_l, y\right\rbrace$| is the set of pairs of labeled images and their labels, |$\left\lbrace \mathbf {x}_u\right\rbrace$| is the set of unlabeled images, |$\boldsymbol{\theta }$| is the set of trainable variables of the neural network, and Lce, Lech, and Llds are the cross-entropy loss, the exp-Cross-hinge loss, and the local distributional smoothness (LDS) loss, respectively. The exp-Cross-hinge function is related to area under the curve (AUC) maximization. The local distributional smoothness function is a key component of virtual adversarial training (VAT). The details of each term are described in the Appendix.

Three scalar hyper-parameters λce, λech, and λlds control the effect of each term. By setting one or two of the λ values in equation (2) to 0, we can create several variations of the objective function. The objective function of the Simple model corresponds to the case (λce, λech, λlds) = (1, 0, 0), which is equal to the cross-entropy function. In subsection 4.2, we test four patterns of objective functions for training the Complex model, where λce is always non-zero and λech and λlds can be zero or non-zero. The non-zero elements of (λce, λech, λlds) need to be tuned. This tuning is performed by a grid search, and multiple trials are done at each grid point to avoid possible variations in the results due to model initial values and other factors. The best combination of λ values is the one that produces the model with the highest performance among the results of these trials.

In training, |$\boldsymbol{\theta }$| is updated using stochastic gradient descent to minimize the value of the objective function defined above.
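As an illustration of how equation (2) could be minimized in practice, the following sketch shows one stochastic-gradient-descent update on the combined objective. The individual loss functions (ce_loss, ech_loss, lds_loss) are placeholders assumed to implement the terms described in the Appendix; they are not the actual pipeline code.

```python
import torch

def training_step(model, optimizer, x_labeled, y, x_unlabeled,
                  ce_loss, ech_loss, lds_loss, lambdas=(0.9, 0.05, 0.05)):
    """One SGD update on the combined objective of equation (2).

    `ce_loss`, `ech_loss`, and `lds_loss` are assumed callables implementing
    the cross-entropy, exp-Cross-hinge, and LDS terms, respectively.
    """
    lam_ce, lam_ech, lam_lds = lambdas
    optimizer.zero_grad()
    logits = model(x_labeled)
    loss = lam_ce * ce_loss(logits, y) + lam_ech * ech_loss(logits, y)
    # The LDS term needs no labels, so labeled and unlabeled images are pooled.
    loss = loss + lam_lds * lds_loss(model, torch.cat([x_labeled, x_unlabeled]))
    loss.backward()
    optimizer.step()
    return loss.item()
```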

3.2.3 Handling of label errors

The training data that we prepare are not always perfect. Some of them have incorrect labels or are difficult to label. It would be better to correct these label errors and train the classifiers with a clean training dataset. However, for large datasets, it is not practical to manually check and correct all of these samples as it requires an enormous amount of human effort. Therefore, we split the training into two stages to handle label errors. In the first stage of training, the machine finds samples that are likely to be mislabeled. Then, the second stage of training is performed by handling the samples found in the first stage. In both stages, the Complex model is used.

To find label errors, we first train the classifier with the original training data, utilizing the fact that the fraction of label errors in the training data is sufficiently small. The classifier then classifies the training data themselves and identifies label errors. Specifically, the training dataset is divided into five subsets: one is used for evaluation and the remaining four are used to train the classifier. Since there are five different ways to select the evaluation subset, all the training data can be evaluated with five training and evaluation cycles in total. This method is the same as that in Northcutt, Jiang, and Chuang (2021).

After classifying the training data themselves, we determine which samples are likely to have label errors based on the output of the classifier. Unlike Northcutt, Jiang, and Chuang (2021), we simply set the classification boundary to a probability of 0.5 and regard all misclassified samples as samples with potential label errors. Although this threshold of 0.5 flags more samples as potential label errors than the criterion used by Northcutt, Jiang, and Chuang (2021), the number of potential label errors is small (∼1%) relative to the entire dataset in our case (subsection 4.2). Since the samples with potential label errors are used in semi-supervised learning as unlabeled samples, the effect of this overestimation is minor in the second stage of training described below.
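For illustration, the first-stage search for potential label errors could be organized as in the following sketch, where train_classifier and predict_proba are placeholders for training and evaluating the Complex model; the 5-fold splitting and the 0.5 threshold follow the description above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def find_potential_label_errors(images, labels, train_classifier, predict_proba):
    """Flag samples whose out-of-fold prediction disagrees with their label.

    Each sample is evaluated by a classifier that never saw it during training
    (5-fold splitting), with a fixed decision threshold of 0.5.
    """
    suspect = np.zeros(len(labels), dtype=bool)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, eval_idx in skf.split(images, labels):
        clf = train_classifier(images[train_idx], labels[train_idx])
        p_real = predict_proba(clf, images[eval_idx])      # P(real) per sample
        predicted = (p_real > 0.5).astype(int)
        suspect[eval_idx] = predicted != labels[eval_idx]  # misclassified => suspect
    return suspect
```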

In the second stage of training, we try two different methods to handle label errors. The first method is simply to remove the samples with potential label errors from the training dataset. The second method is setting the samples with potential label errors as “unlabeled” and then performing semi-supervised learning. The VAT does not use the labels in the computation of the objective function, which allows semi-supervised learning. Semi-supervised learning avoids the adverse effects of samples with label errors, while effectively utilizing them as training data.

4 Results

In this section, we summarize the real/bogus classification performance for the validation dataset with the various combinations of the models and training methods described in the previous sections.

4.1 Effects of training data treatments

We compare the performance of three cases of the Simple model: (1) a model trained for each CMOS sensor (Simple-each), (2) a model that has the same total number of samples as (1) but is trained with samples from all sensors (Simple-mix), and (3) a model trained with all of the data from all sensors (Simple-all).

First, we examine the impact of the sample diversity from multiple sensors on performance. Conventional classifiers adopt the Simple-each approach and are trained using only cross-entropy loss as the objective function. Classification based on a dataset from a single sensor can cause overfitting, in which the unique “habits” of the dataset are used to classify the data. By training on datasets from multiple sensors, it is possible to learn more essential features that do not depend on unique “habits,” and thus to obtain the effect of data augmentation.

Secondly, we study the effect of the size of the training dataset. In the Simple-all case, the total number of actual training data is larger than that of the Simple-each case by a factor of 84 (the number of sensors).

The performance of the classifier can be evaluated by the receiver operating characteristic (ROC) curve.4 The AUC of the ROC curve indicates the overall performance. For the Simple-each case, since a classifier is prepared and tested for each sensor, we measure the performance by combining the results of all sensors. Figures 5 and 6 show the AUC and FPR for each Simple model, respectively. To see the variation of the results, we plot the results of five training runs with different initial seed values for each case. The FPR in figure 6 is defined at the threshold where the TPR is 0.9. Comparison between the Simple-each case and the Simple-mix case shows that mixing data from different sensors gives better results thanks to the data augmentation effect for the same amount of training data. Furthermore, the Simple-all case achieves even better results. This means that both the larger amount of training data and the mixing of data from multiple sensors contribute to the improvement.
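For illustration, the two metrics used throughout this section (AUC, and FPR at TPR = 0.9) can be computed from validation scores as in the sketch below; scikit-learn is used only for convenience, and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def evaluate(y_true, p_real, tpr_target=0.9):
    """Return the AUC and the FPR at the threshold where the TPR reaches `tpr_target`."""
    fpr, tpr, _ = roc_curve(y_true, p_real)
    fpr_at_tpr = fpr[np.searchsorted(tpr, tpr_target)]   # tpr is non-decreasing
    return auc(fpr, tpr), fpr_at_tpr
```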

Fig. 5. Comparison of AUC for the Simple models. The five points in each model represent the performance variation with different initial seed values.

Fig. 6. Same as figure 5, but for FPR at TPR = 0.9.

4.2 Effects of label error handling

We here investigate whether treatment of label errors improves the performance of the Complex model. For the baseline of the performance comparison, we use the performance of the Simple-all case. When multiple objective functions are combined, optimal values for the weight of the functions, i.e., λce, λech, and λlds, are obtained from the parameter search for each case. The optimal values in each case are summarized in table 2.

Table 2. Summary of classification performance. The CE, AUC, and VAT columns indicate whether the cross-entropy, exp-Cross-hinge (AUC), and LDS (VAT) terms are used in the objective function.

Model type   Data type   CE   AUC   VAT   λce   λech   λlds   AUC      FPR @ TPR = 0.9
Simple       Each        ✓    –     –     1.0   0.0    0.0    0.9916   9.868 × 10⁻³
Simple       Mix         ✓    –     –     1.0   0.0    0.0    0.9964   5.028 × 10⁻³
Simple       All         ✓    –     –     1.0   0.0    0.0    0.9997   3.323 × 10⁻⁴
Complex      All         ✓    –     –     1.0   0.0    0.0    0.9993   7.272 × 10⁻⁴
Complex      Removed     ✓    –     –     1.0   0.0    0.0    0.9993   8.367 × 10⁻⁴
Complex      All         ✓    ✓     –     0.3   0.7    0.0    0.9992   1.306 × 10⁻³
Complex      Removed     ✓    ✓     –     0.4   0.6    0.0    0.9996   5.083 × 10⁻⁴
Complex      All         ✓    –     ✓     0.4   0.0    0.6    0.9997   1.603 × 10⁻⁴
Complex      Unlabeled   ✓    –     ✓     0.6   0.0    0.4    0.9997   1.838 × 10⁻⁴
Complex      All         ✓    ✓     ✓     0.3   0.3    0.4    0.9997   2.541 × 10⁻⁴
Complex      Unlabeled   ✓    ✓     ✓     0.9   0.05   0.05   0.9998   1.994 × 10⁻⁴

First of all, we estimate the fraction of label errors in the training data. For the training data, we use the total amount of data from all the sensors, as in the Simple-all case. The left panel of figure 7 shows the score distribution when the training data themselves are classified using a Complex model trained with cross-entropy loss for the label error identification. The score here is defined by the score function in equation (A5), not a probability. We pay particular attention to samples that the classifier misclassifies, i.e., those in the tails of the real and bogus distributions. For these samples, we check the images by visual inspection. Since there are a large number of samples even in the tails, we conducted a sample survey to estimate the fraction of label errors. The fraction is evaluated by visually counting the label errors among randomly selected samples in each score bin. The right panel of figure 7 shows the distribution of the scores and the fraction of label errors in each score bin. We find that the fraction of label errors indeed increases toward the edges of the distributions. Based on the estimated fractions, the contamination ratio of label errors in the training data is about 0.6% for bogus objects and about 1% for real ones.

Fig. 7. Left: Score distribution of the training dataset. Right: Score distribution (left axis) and fraction of label errors estimated from a sample survey by human eyes (right axis). The sample survey is performed at the edges of the score distribution.

Figure 8 shows examples of samples with label errors at the edges of the score distribution. Among these, the samples mislabeled as bogus clearly show objects that appear to be transients. On the other hand, the samples mislabeled as real do contain embedded artificial stars, but the detections themselves are clearly bogus. In other words, the labels are incorrect for these samples, and the machine actually classifies them correctly.

Fig. 8. Examples of samples with label errors.

Then, we investigate the effect of two different ways of handling label errors on the classification performance. First, we examine the method in which the samples with potential label errors are simply removed. The AUC and FPR with the potential label errors retained (indices 2 and 4) and with them removed (indices 3 and 5) are shown in figures 9 and 10, respectively. Two classifiers are used in the comparison: one with CE only and the other with CE + AUC as the objective function. In figures 9 and 10, the CE, AUC, and VAT columns indicate whether Lce, Lech, and Llds are used, respectively. Both classifiers perform better when the samples with potential label errors are removed (indices 3 and 5). The FPR is significantly lower when the samples with potential label errors are removed, while the AUC shows no significant difference because of the large variation among seed values. However, in all cases, the classifiers trained without the potential label errors do not perform better than the Simple-all case.

Fig. 9. Summary of AUC for all the classifiers. Five points in each case represent the performance variation with different initial seed values. The filled points indicate the training with the highest AUC for each case. The CE, AUC, and VAT columns indicate whether or not each term of the objective function is used.

Fig. 10. Same as figure 9, but for FPR at TPR = 0.9.

Next, we examine the semi-supervised learning method in which all samples with potential label errors are unlabeled. The AUC and FPR for the model with CE + AUC + VAT as the objective function in each case are shown as indices 8 and 9 in figures 9 and 10. In this comparison, the method that handles potential label errors by unlabeling and by performing semi-supervised learning (index 9) shows better results than the supervised learning with potential label errors remaining (index 8). Furthermore, the semi-supervised learning method (index 9) yields a lower FPR at TPR = 0.9 than the method that removes potential label errors (indices 3 and 5) and the Simple-all case (index 1). This means that the semi-supervised learning method can achieve good performance even if the training data contain label errors.

Finally, we compare the classification performance of all the cases with different objective function combinations and with/without handling of label errors. The AUC and FPR for all the cases are summarized in figures 9 and 10. Comparing all the cases, the case combining the three objective functions and using semi-supervised learning achieves the best results (index 9). For comparison, the prediction distributions and confusion matrices for the Simple-each case, the Simple-all case, and the best classifier are shown in figure 11. In the best classifier, the number of bogus objects misclassified as real is further reduced compared with the Simple-all case, and the number of false positives in the confusion matrix is 1/23 of that of the Simple-each case. The ROC curve of the best classifier is shown with the green line in figure 12, where the AUC reaches 0.9998. Similarly, the relationship between FPR and false negative rate (FNR = 1 − TPR) is plotted in figure 13. The FPR decreases to 0.0002 at FNR = 0.1 (TPR = 0.9).

Fig. 11. Top: Distribution of output probability (of being real) for each classifier. Red and blue colors show the histograms for real and bogus samples, respectively. Bottom: Confusion matrix of each classifier. The predicted labels in all the classifiers are the ones with a threshold of 0.5. The ratio in each row is normalized to 1. The numbers in parentheses represent the raw numbers.

Fig. 12. ROC curves of the Simple-each classifier and the best classifier.

Fig. 13. FNR against FPR for the Simple-each classifier and the best classifier.

5 Discussion

In this section, we review the performance improvement in our real/bogus classification and discuss the key factors of the improvement, as well as the actual performance after implementing the best classifier in the Tomo-e Gozen transient pipeline.

5.1 Performance analysis

The following is a summary of the improvement in the Tomo-e Gozen real/bogus classification performance. As shown in table 2, the best performance is achieved when all of the following conditions are satisfied:

  • Training data for all the sensors are combined.

  • All three objective functions are used.

  • Semi-supervised learning is applied.

Compared to the Simple-each case, the AUC improves from 0.9916 to 0.9998 and the FPR at TPR = 0.9 decreases from 0.0099 to 0.0002. Comparing the performance with the ZTF real/bogus classifier (FPR = 0.017 at TPR = 0.983; Duev et al. 2019), our best classifier gives FPR = 0.003 at the same TPR. Although exact comparisons cannot be made between Tomo-e Gozen and ZTF because the instruments, pipelines, and real/bogus ratios are different, we achieve an FPR comparable to that of the ZTF classifier.

We then discuss the key factors in improving the classification performance. First, combining training data from different sensors significantly improves performance. This improvement is a result of the data augmentation effect of mixing data from multiple sensors with different characteristics. Also, increasing the amount of training data improves the performance, as expected. In addition, incorporating the LDS loss into the objective function improves the classification performance of the Complex model compared to the Simple-all case. The Complex model tends to overfit the training data because it can handle complex representations, resulting in lower performance on the validation data. When the LDS loss is included in the objective function, the LDS-based regularization avoids overfitting and improves performance. We also show that the performance is further improved by setting the samples with label errors as unlabeled samples and performing semi-supervised learning.

We here investigate whether the proposed method also works when there are more label errors in training data. The fraction of label errors in our original training dataset is about 1%. We artificially increase the fraction of label errors by inverting the labels based on the estimated fraction of label errors (figure 7). We then classify the data with our best method, semi-supervised learning with data containing unlabeled samples, and compare the results with those obtained by training with label errors remaining. Figures 14 and 15 compare the AUC and FPR when the ratio of label errors is increased to about 5% and 10%. It is found that the degree of improvement is higher as the fraction of label errors increases. When the fraction of label errors is 1%, by handling label errors, the average FPR decreases from 0.0004 to 0.0002, i.e., an improvement by a factor of about 2. On the other hand, when the fraction of label errors is 10%, the average FPR decreases from 0.0039 to 0.0003, which corresponds to an improvement by a factor of 13. This means that our method is more effective for datasets with higher fractions of label errors.
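For illustration, the following is a simplified sketch of such a label-inversion experiment. Unlike the procedure above, which inverts labels according to the score-dependent error fractions of figure 7, this sketch flips labels uniformly at random; it is only meant to show the mechanics of injecting a target fraction of label errors.

```python
import numpy as np

def inject_label_errors(labels, target_fraction, seed=0):
    """Flip randomly chosen labels so that roughly `target_fraction` are wrong.

    Labels are 0 (bogus) or 1 (real); flipping inverts the class.
    """
    rng = np.random.default_rng(seed)
    n_flip = int(target_fraction * len(labels))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    noisy = labels.copy()
    noisy[idx] = 1 - noisy[idx]
    return noisy

labels = np.random.default_rng(3).integers(0, 2, size=100000)
noisy_labels = inject_label_errors(labels, target_fraction=0.10)
print((noisy_labels != labels).mean())   # ~0.10
```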

Fig. 14. AUC for different fractions of label errors.

Fig. 15. Same as figure 14 but for FPR at TPR = 0.9.

5.2 Performance in actual operations

Finally, we discuss the actual performance of our best classifier when implemented in the data analysis pipeline of Tomo-e Gozen. Prior to implementation, we determined the threshold for the classification in the actual operation. Figure 16 shows the variation of each metric as a function of the threshold. We set the threshold score at 0.85, which gives the best precision while keeping TPR above 0.9.
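For illustration, such a threshold could be selected from validation scores as in the sketch below: among all thresholds that keep the TPR at or above 0.9, the one with the highest precision is chosen (scikit-learn is used here only for convenience; variable names are illustrative).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def choose_threshold(y_true, p_real, min_tpr=0.9):
    """Pick the threshold with the best precision subject to TPR >= min_tpr."""
    precision, recall, thresholds = precision_recall_curve(y_true, p_real)
    # precision/recall have one more element than thresholds; drop the last point.
    ok = recall[:-1] >= min_tpr          # recall equals TPR for binary labels
    best = np.argmax(np.where(ok, precision[:-1], -np.inf))
    return thresholds[best]
```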

Fig. 16. Variation of each metric as a function of threshold for the best classifier.

We investigate the changes in the number of transient candidates registered in the Tomo-e Gozen transient database before and after the implementation. Figure 17 shows the number of registrations to the database before and after implementation over a five-day period. As a rule for registration in the database, objects detected for the first time by the classifier are registered as “variable.” Among the variable candidates, those detected twice at the same coordinates are registered as “transient.” After the implementation of the new classifier, the average numbers of variable and transient registrations were reduced to 1/160 and 1/130, respectively.

Fig. 17. Number of objects registered in the Tomo-e transient database before (left) and after (right) implementation of the best classifier. The conventional classifier is trained in the same way as the Simple-each classifier. Horizontal dotted lines show five-day averages.

To confirm that the new classifier does not miss real objects, we examined the recovery rate of real objects. By matching registered transients with TNS objects, we confirmed that the ratio of the number of matches to the number of TNS objects is comparable before and after implementation. Furthermore, the fraction of TNS objects among the registered transient candidates is 86 times higher after the implementation. This indicates that bogus objects that were incorrectly registered due to misclassification are greatly reduced. The number of final transient candidates is reduced from about 5000 to about 40 objects per day. This rate is at a level that humans can check visually and enables effective target selection for follow-up observations in a short time.

6 Conclusions

In this paper, we have presented a new real/bogus classification scheme that handles label errors in the training data for the Tomo-e Gozen transient survey. In the wide-field, high-frequency survey with Tomo-e Gozen, which mainly targets early supernovae and rapid transients, the performance of conventional CNN classifiers was not sufficient to extract follow-up targets, because the number of bogus objects passing the classifier was of the order of 10³ per day. Therefore, we developed a two-stage training method: (1) normal supervised learning to detect label errors in the training data, and (2) semi-supervised learning with training data that include potential label errors as unlabeled samples.

The best classifier with this method achieves an AUC of 0.9998 and FPR of 0.0002 at TPR = 0.9 for validation data prepared from actual observations. Our training method does not require human effort to relabel the samples with potential label errors. We also show that our method gives a greater performance improvement when the fraction of label errors is higher. Finally, we implemented the developed classifier in the Tomo-e Gozen pipeline. After implementation, the number of registered transient candidates was reduced by a factor of about 100, to 40 candidates per day, while maintaining the recovery rate of real transients. This enables more efficient selection of follow-up targets.

Acknowledgements

We thank Yasuhiro Imoto for his significant contributions to the development of the classifiers. We are grateful to the anonymous referee for insightful suggestions.

This work has been supported by Japan Science and Technology Agency (JST) AIP Acceleration Research Grant Number JP20317829 and the Japan Society for the Promotion of Science (JSPS) KAKENHI grants 21H04491, 18H05223, and 17H06363.

This work is supported in part by the Optical and Near-Infrared Astronomy Inter-University Cooperation Program.

The Pan-STARRS1 Surveys (PS1) and the PS1 public science archive have been made possible through contributions by the Institute for Astronomy, the University of Hawaii, the Pan-STARRS Project Office, the Max-Planck Society and its participating institutes, the Max Planck Institute for Astronomy, Heidelberg and the Max Planck Institute for Extraterrestrial Physics, Garching, The Johns Hopkins University, Durham University, the University of Edinburgh, the Queen’s University Belfast, the Harvard-Smithsonian Center for Astrophysics, the Las Cumbres Observatory Global Telescope Network Incorporated, the National Central University of Taiwan, the Space Telescope Science Institute, the National Aeronautics and Space Administration under Grant No. NNX08AR22G issued through the Planetary Science Division of the NASA Science Mission Directorate, the National Science Foundation Grant No. AST-1238877, the University of Maryland, Eotvos Lorand University (ELTE), the Los Alamos National Laboratory, and the Gordon and Betty Moore Foundation.

Appendix. Details of objective function

We here describe the details of the objective functions proposed in sub-subsection 3.2.2.

  1. Cross-entropy: The cross-entropy function is an objective function commonly used in training classifiers. It works to match the output of the neural network to the teacher label of each sample. The cross-entropy loss is the average of a per-sample loss over the labeled samples:
    |$L_{\rm ce}\left(\left\lbrace \mathbf {x}_l, y\right\rbrace ; \boldsymbol{\theta }\right) = \dfrac{1}{N_l} \sum _{i=1}^{N_l} \ell _{\rm ce}\left(\mathbf {x}_i, y_i; \boldsymbol{\theta }\right),$|  (A1)
    |$\ell _{\rm ce}\left(\mathbf {x}, y; \boldsymbol{\theta }\right) = -I(y = c^+) \log f(y = c^+ \mid \mathbf {x}; \boldsymbol{\theta }) - I(y = c^-) \log f(y = c^- \mid \mathbf {x}; \boldsymbol{\theta }).$|  (A2)
    Here Nl is the number of labeled samples, I( ) is the indicator function, c+ and c− are the labels of positive and negative examples, respectively, and |$f(y=c\mid \mathbf {x};\boldsymbol{\theta })$| is the output value corresponding to the label c (|$\mathbf {x}$| is the input to the neural network).
  2. exp-Cross-hinge loss (AUC maximization): If the sample ratio is highly biased, the apparent performance could be improved by always predicting the dominant class regardless of the input data. For example, in transient surveys, bogus objects are always dominant over real ones. In such a case, a classifier that classifies all the inputs as bogus can achieve a high score. However, obviously, such a classifier is not useful to detect real transients. Approaches to handling imbalanced datasets include down-sampling of the majority class (e.g., Hosenie et al. 2019) and giving extra weight to the minority class (e.g., van Roestel et al. 2021). In our proposed method, we handle imbalanced data by incorporating exp-Cross-hinge loss (Kurora et al. 2020) into the objective function, as described below.

    The exp-Cross-hinge loss is a loss function for pairs of positive and negative examples in the dataset. When the score of the negative sample becomes larger than the score of the positive sample, a loss corresponding to the difference occurs. In contrast, when the score of the positive example is larger than the score of the negative example, the hinge function prevents the loss of the set from falling below a certain level. In addition, expanding the difference between the negative and positive scores with an exponential function enables learning even when the difference is small. For all pairs of positive and negative samples, this loss function is minimized when the score of the positive sample is greater than the score of the negative sample including the margin. The definition is as follows:
    |$L_{\rm ech}\left(\left\lbrace \mathbf {x}_l, y\right\rbrace ; \boldsymbol{\theta }\right) = \dfrac{1}{N^+ N^-} \sum _{i=1}^{N^+} \sum _{j=1}^{N^-} \left[ \exp \left( s(\mathbf {x}_j^-; \boldsymbol{\theta }) - s(\mathbf {x}_i^+; \boldsymbol{\theta }) + \xi \right) - 1 \right]_+.$|  (A3)
    Here, s( ) is the score function parametrized by |$\boldsymbol{\theta }$|, |$\mathbf {x}^+$| and |$\mathbf {x}^-$| are the positive and negative samples, N+ and N− are the numbers of positive and negative samples, respectively, and ξ is the margin between the positive and negative sample scores. In the equation, [•]+ is the hinge function, defined as
    |$[z]_+ = \max (z, 0).$|  (A4)
    We define the score function using the outputs of the neural network as follows:
    |$s(\mathbf {x}; \boldsymbol{\theta }) = f(y = c^+ \mid \mathbf {x}; \boldsymbol{\theta }) - f(y = c^- \mid \mathbf {x}; \boldsymbol{\theta }).$|  (A5)
    The AUC is a measure of the performance of a binary classifier when we want to maximize the true positive rate and minimize the false positive rate. AUC is generally defined as the area under the ROC curve. Alternatively, it can be defined as the ratio of pairs in which the score of the positive sample is greater than the score of the negative sample among all pairs of positive and negative samples in the dataset. Since AUC is a discontinuous function, it is difficult to maximize it directly. On the other hand, since the exp-Cross-hinge function is a relaxation of the AUC function to a continuous function, we expect to obtain an approximate solution for AUC maximization by minimizing the exp-Cross-hinge function.
  3. Virtual adversarial training (VAT): We perform virtual adversarial training (VAT; Miyato et al. 2016). VAT is a training method that uses the local distributional smoothness (LDS) as a regularization term. The LDS is a notion of smoothness for the outputs of models. In VAT, special perturbations that maximize the changes in the outputs of the neural network are added to the input images, and the neural network is trained to minimize the resulting change in the outputs, which acts as a smoothing regularization. Therefore, the classifier is expected to be robust to input perturbations. The objective function is as follows:
    |$L_{\rm lds}\left(\left\lbrace \mathbf {x}_l\right\rbrace \cup \left\lbrace \mathbf {x}_u\right\rbrace ; \boldsymbol{\theta }\right) = \dfrac{1}{N_l + N_u} \sum _{i=1}^{N_l + N_u} \mathrm{KL}\left[ f(y \mid \mathbf {x}_i; \boldsymbol{\theta }) \,\Vert \, f(y \mid \mathbf {x}_i + \mathbf {r}_i; \boldsymbol{\theta }) \right],$|  (A6)
    where Nl and Nu are the numbers of labeled samples and unlabeled samples, respectively, |$\mathrm{KL} \left[ \bullet | | \bullet \right]$| is KL divergence, which is a measure of the difference between two distributions, and |$\mathbf {r}_i$| is the virtual adversarial perturbation of the ith sample. Here, |$\mathbf {r}_i$| is a tiny perturbation that maximizes the change in the classifier output and it is defined as
    |$\mathbf {r}_i = \mathop {\rm arg\,max}_{\mathbf {r};\, \Vert \mathbf {r}\Vert _2 \le \epsilon } \mathrm{KL}\left[ f(y \mid \mathbf {x}_i; \boldsymbol{\theta }) \,\Vert \, f(y \mid \mathbf {x}_i + \mathbf {r}; \boldsymbol{\theta }) \right],$|  (A7)
    where ε is the size of the perturbation. It is not practical to obtain |$\mathbf {r}_i$| exactly because it is computationally expensive. Instead, we efficiently obtain the approximate virtual adversarial perturbation by using the method shown in algorithm (1) of Miyato et al. (2016). Because no label information is needed to compute |$\mathbf {r}_i$|⁠, the VAT objective function can be used even for samples without labels.
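For illustration, the LDS term can be computed with one step of the power-iteration approximation, as in the following hedched PyTorch sketch; the values of ε and ξ and the single power iteration are illustrative choices in the spirit of Miyato et al. (2016), not the actual implementation.

```python
import torch
import torch.nn.functional as F

def lds_loss(model, x, eps=2.0, xi=1e-6, n_power=1):
    """Local distributional smoothness term of VAT.

    The virtual adversarial perturbation r is approximated with power
    iteration; no labels are needed, so x may mix labeled and unlabeled data.
    """
    with torch.no_grad():
        p = F.softmax(model(x), dim=1)                 # reference output distribution

    # Random unit direction as the starting point of the power iteration.
    d = torch.randn_like(x)
    d = d / d.flatten(1).norm(dim=1).view(-1, 1, 1, 1)

    for _ in range(n_power):
        d.requires_grad_(True)
        q = F.log_softmax(model(x + xi * d), dim=1)
        adv_distance = F.kl_div(q, p, reduction="batchmean")
        grad = torch.autograd.grad(adv_distance, d)[0]
        d = grad / grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
        d = d.detach()

    q = F.log_softmax(model(x + eps * d), dim=1)       # perturb by r = eps * d
    return F.kl_div(q, p, reduction="batchmean")
```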

Footnotes

2 The source code of our proposed method is available at 〈https://github.com/ichiro-takahashi/tomoe-realbogus〉.

4 The ROC curve shows the true positive rate (TPR) and false positive rate (FPR) values measured with different thresholds. The better the classification performance is, the closer the ROC curve is to the upper left corner of the plot.

References

Alard C., Lupton R. H. 1998, ApJ, 503, 325
Ayyar V., Knop R., Jr., Awbrey A., Anderson A., Nugent P. 2022, arXiv:2203.09908
Becker A. 2015, Astrophysics Source Code Library, ascl:1504.004
Bellm E. C. et al. 2019, PASP, 131, 018002
Bloom J. S. et al. 2012, PASP, 124, 1175
Brink H., Richards J. W., Poznanski D., Bloom J. S., Rice J., Negahban S., Wainwright M. 2013, MNRAS, 435, 1047
Duev D. A. et al. 2019, MNRAS, 489, 3582
Flewelling H. A. et al. 2020, ApJS, 251, 7
Gieseke F. et al. 2017, MNRAS, 472, 3101
He K., Zhang X., Ren S., Sun J. 2016a, in Proc. 2016 IEEE Conf. Computer Vision and Pattern Recognition (CVPR) (Piscataway: IEEE), 770
He K., Zhang X., Ren S., Sun J. 2016b, in Computer Vision – ECCV 2016, ed. B. Leibe et al. (Cham: Springer), 630
Hosenie Z., Lyon R. J., Stappers B. W., Mootoovaloo A. 2019, MNRAS, 488, 4858
Hosenie Z. et al. 2021, Exp. Astron., 51, 319
Ioffe S., Szegedy C. 2015, in Proc. 32nd Int. Conf. Machine Learning, ed. F. Bach, D. Blei (JMLR), 448
Ivezić Ž. et al. 2019, ApJ, 873, 111
Killestein T. L. et al. 2021, MNRAS, 503, 4838
Kurora S., Hachiya H., Shimada U., Ueda N. 2020, IEICE Tech. Rep., 119, 79
Law N. M. et al. 2009, PASP, 121, 1395
Mahabal A. et al. 2019, PASP, 131, 038002
Miyato T., Maeda S., Koyama M., Nakae K., Ishii S. 2016, arXiv:1507.00677
Morii M. et al. 2016, PASJ, 68, 104
Northcutt C. G., Jiang L., Chuang I. L. 2021, J. Artificial Intelligence Res., 70, 1373
Roy A. G., Navab N., Wachinger C. 2018, in Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, ed. A. F. Frangi et al. (Cham: Springer), 421
Sako S. et al. 2018, in Proc. SPIE, 10702, Ground-based and Airborne Instrumentation for Astronomy VII, ed. C. J. Evans et al. (Bellingham, WA: SPIE), 107020J
Simonyan K., Zisserman A. 2014, arXiv:1409.1556
Singh S., Krishnan S. 2020, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR) (Piscataway: IEEE), 11234
Turpin D. et al. 2020, MNRAS, 497, 2641
van Roestel J. et al. 2021, AJ, 161, 267
Waters C. Z. et al. 2020, ApJS, 251, 4
Wright D. E. et al. 2015, MNRAS, 449, 451
