Jiangping Xiao, Xiangru Li, Haitao Lin, Kaibin Qiu, Pulsar candidate selection using pseudo-nearest centroid neighbour classifier, Monthly Notices of the Royal Astronomical Society, Volume 492, Issue 2, February 2020, Pages 2119–2127, https://doi.org/10.1093/mnras/stz3539
ABSTRACT
A typical characteristic of the pulsar candidate classification task is the class imbalance between true pulsars and false candidates. This imbalance has negative effects on traditional classification methods. In this study, we introduce a strategy using a scatter matrix-based class separability measure to estimate the harmfulness of class imbalance for pulsar candidate classification. The measure quantitatively describes the damage done by imbalanced situations to the pulsar candidate classification problem and provides a priori information that guides the selection of an appropriate data processing method and the construction of an effective classifier. We then present a non-parametric data exploration technique, the pseudo-nearest centroid neighbour classifier (PNCN), to identify credible pulsar candidates from pulsar survey data sets. The PNCN algorithm can effectively resolve the class imbalance problem and is applicable to data streams. The proposed algorithm is tested on High Time Resolution Universe Pulsar Survey (HTRU) 2 (obtained by an analysis of HTRU Medium Latitude data) and LOTAAS 1 (obtained from the LOFAR Tied-Array All-Sky Survey). The experimental results show that the proposed classifier identifies pulsars with high performance: the precision and recall on HTRU 2 are 92.3 per cent and 83.1 per cent, and those on LOTAAS 1 are 97.4 per cent and 95.6 per cent, respectively; the false positive rate (FPR) is 0.7 per cent on HTRU 2 and 0.03 per cent on LOTAAS 1, an order of magnitude lower than the corresponding FPRs obtained by Lyon et al. (2016) and Tan et al. (2018).
1 INTRODUCTION
Pulsars are rapidly rotating neutron stars with strong magnetic fields. They are ideal astrophysical laboratories with extreme physical properties that cannot be reproduced on Earth, for instance extremely stable periodic radiation pulses. The observation and study of pulsars are of great significance in astronomy, astrophysics, particle physics, plasma physics, general relativity, gravitational waves, navigation, and so on (Lorimer et al. 1998; Lyne et al. 2004; Cordes et al. 2005; Hobbs et al. 2009). Since the discovery of the first pulsar in 1967 (Hewish et al. 1968), a large number of radio telescopes have been applied to pulsar searching. Currently, about 2700 pulsars have been found (Wang et al. 2018).
The discovery of pulsars typically involves identifying periodic signals in the observed data and then reducing each periodic signal to a group of diagnostic values and graphical representations, referred to as a ‘candidate’ (Morello et al. 2014). Since most candidates are caused by radio frequency interference (RFI) and noise that look like pulsars, searching for pulsars is not a straightforward task. Until recently, candidate selection depended mainly on manual inspection. The continuous improvement of survey specifications has expanded the number and volume of pulsar candidates exponentially, so manual methods have become impractical, and many machine learning methods have attracted attention for automatically handling the ‘candidate selection problem’ (Lyon et al. 2016).
In pulsar candidate classification tasks, class imbalance occurs frequently: the number of non-pulsars is much larger than that of pulsars. This imbalance can severely degrade the performance of pulsar candidate selection. During the past decade, many effective methods for this problem have been developed, including sampling, feature selection, artificial neural networks (ANNs), ensemble learning, and so on.
Sampling is a data pre-processing technique that changes the sample distribution of the training set to reduce or eliminate imbalance. Morello et al. (2014) designed the Straightforward Pulsar Identification using Neural Networks (SPINN) classifier to improve the performance of ANNs; they oversampled the pulsars to obtain a 4:1 ratio of non-pulsars to pulsars, decreasing the class imbalance in the training set. More recently, Bethapudi & Desai (2018) used supervised machine learning algorithms to separate pulsar signals from noise, employing the synthetic minority oversampling technique to deal with class imbalance. Zhang et al. (2019) presented a VGG-net-based learning algorithm for pulsar recognition; they resampled the original data set at different sampling rates to deal with the imbalanced data set.
Feature selection is another data pre-processing method widely used in classification. It chooses features with distinguishing characteristics, so as to improve the classification accuracy of minority classes. Eatough et al. (2010) designed 12 features and used an ANN on the Parkes Multibeam Pulsar Survey (Manchester et al. 2001). Soon afterwards, Bates et al. (2012) increased the number of features to 22, extracted from candidates obtained in the High Time Resolution Universe Pulsar Survey (HTRU; Keith et al. 2010), to train a new ANN classifier. Morello et al. (2014) extracted six features from candidates obtained in the HTRU Medlat survey as input to the SPINN classifier. These earlier features were based on particular experiences and assumptions, and therefore depended strongly on the data sets, making them impossible to use directly on other surveys without modification. To solve this problem, Lyon et al. (2016) designed eight new features based on simple statistics. A tree-based machine learning classifier, the GH-VFDT algorithm, was used to test these new features on three different survey data sets. Experiments showed that these statistical features were superior to the previous empirical features, and they have been used as a reference by later researchers (Mohamed 2018).
Apart from the data pre-processing stage, many researchers focus on the algorithms of the classification stage. By analysing the shortcomings of an algorithm in handling the imbalance problem and the characteristics of imbalanced data, the recognition rate of minority classes can be increased through appropriate algorithmic improvements. More and more ensemble methods and ANN-based methods have been applied. For example, Guo et al. (2017) proposed a DCGAN+L2-SVM algorithm to address the class imbalance problem; Zhang et al. (2019) presented a VGG-net-based learning algorithm for pulsar recognition; and Wang et al. (2019) presented a new ensemble model, PICS-ResNet, for the pulsar candidate selection problem on FAST, the Five-hundred-meter Aperture Spherical radio Telescope.
The class imbalance learning methods mentioned above can effectively reduce the negative effects of imbalance on classification in some cases. However, some studies found that not all applications are affected by this imbalance (Ho & Basu 2002; Prati et al. 2004; Khoshgoftaar et al. 2011; Jain & Bhatnagar 2015). Applying class imbalance learning to tasks unaffected by the imbalance not only yields little improvement, but can even degrade classification performance (Liu et al. 2009), a problem first identified by Liu et al. (2009). Yu et al. (2014) systematically analysed the causes of the harmfulness produced by class imbalance and presented a novel and simple strategy using a scatter matrix-based class separability measure to pre-estimate that harmfulness. The estimation is quantitative and automatic, using only the training data. It is therefore meaningful to pre-estimate the harmfulness of the imbalance in pulsar survey data sets before tackling the pulsar candidate classification task.
With the increasing number of pulsar candidates and growing data volumes, not all parametric machine learning methods remain applicable, and online classification of data streams is a challenge. For example, ANNs are parametric machine learning methods widely used in pulsar candidate classification tasks. Their main advantage is that the input provided to the network can be raw data, so features need not be designed manually. However, their lack of interpretability and large number of trainable parameters make such models hard to transfer. Hence, it is worthwhile to build a non-parametric machine learning method.
The K-nearest neighbour classifier (KNN) is one of the most widely used non-parametric machine learning methods, and has been theoretically proved to be asymptotically optimal in the Bayes sense (Cover & Hart 1967; Wagner 1971). Moreover, the appeal of KNN is that only a single integer parameter k needs to be tuned and no particular statistical distribution of the training data needs to be assumed, which is what makes it suitable for data streams. However, the classification performance of KNN and its variants is still degraded by three main issues: sparseness, imbalance, and noise. To handle outliers well, especially in sparse, imbalanced, and noisy situations, a reliable KNN-based approach, the pseudo-nearest centroid neighbour rule (PNCN), was designed by Ma et al. (2016).
In this paper, an evaluation criterion using the scatter matrix-based class separability measure is applied to estimate the harmfulness of the imbalance on two pulsar survey data sets, and the non-parametric machine learning algorithm PNCN is applied to the pulsar candidate classification task. The proposed algorithm is tested on HTRU 2 and LOTAAS 1, and its superiority over other classifiers used in the literature is shown.
The rest of this paper is organized as follows: Section 2 presents the necessary background about the two survey data sets. Section 3 introduces the theory of scatter matrix-based class separability measure. Section 4 gives the proposed algorithm. The methodology is placed in Section 5. Some experimental investigations and discussions are conducted in Section 6. Finally, the paper is concluded in Section 7.
2 PULSAR DATA SET AND FEATURE SELECTION
2.1 Survey data sets
Two survey data sets from different telescopes are used in our study; they are summarized in Table 1. The first, HTRU 2, was obtained during an analysis of HTRU Medium Latitude data by Thornton (2013). The pipeline searched for pulsar signals with DMs from 0 to 2000 cm−3 pc. This data set consists of 1639 pulsars and 16 259 spurious candidates caused by RFI and noise; all samples in this data set have been checked by human annotators. Its class imbalance ratio is 9.92 : 1. It is a standard, open telescope data set widely used to evaluate the performance of machine learning algorithms.
Basic information of survey data sets used in this study. R denotes the class imbalance ratio of non-pulsars to pulsars.
| Data set | Samples | Pulsars | Non-pulsars | R |
|---|---|---|---|---|
| HTRU 2 | 17 898 | 1639 | 16 259 | 9.92 : 1 |
| LOTAAS 1 | 5053 | 66 | 4987 | 75.56 : 1 |
The second data set is LOTAAS 1, which was obtained from LOTAAS, the LOFAR Tied-Array All-Sky Survey (Lofar Working Group 2013; Cooper 2014). This data set consists of 66 pulsar and 4987 non-pulsar candidates, with a class imbalance ratio of 75.56 : 1. These pulsars represent the first set found during the LOTAAS survey; they are among the easier ones to find, not necessarily the difficult ones, such as pulsars with very low S/N or unusual pulse shapes. The LOTAAS survey is an ongoing all-Northern-sky pulsar survey whose data are currently private.
2.2 Feature selection
Lyon et al. (2016) designed eight new features based on simple statistics. Four statistics (mean, standard deviation, excess kurtosis, and skewness) are extracted from both the integrated pulse profile P and the dispersion measure–signal-to-noise ratio (DM–SNR) curve D, giving the eight unbiased statistical features shown in Table 2. These statistical features maximize the separation between pulsars and spurious candidates. The feature data were extracted from these data sets using the Pulsar Feature Lab (https://doi.org/10.6084/m9.figshare.1536472.v1).
| Feature | Description |
|---|---|
| Prof.μ | Mean of the integrated profile P. |
| Prof.σ | Standard deviation of the integrated profile P. |
| Prof.k | Excess kurtosis of the integrated profile P. |
| Prof.s | Skewness of the integrated profile P. |
| DM.μ | Mean of the DM–SNR curve D. |
| DM.σ | Standard deviation of the DM–SNR curve D. |
| DM.k | Excess kurtosis of the DM–SNR curve D. |
| DM.s | Skewness of the DM–SNR curve D. |
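The four statistics in Table 2, applied to both the profile and the DM–SNR curve, can be computed directly from the two arrays. The sketch below is an illustrative implementation; the function name `lyon_features` and the synthetic input arrays are our own, not part of the Pulsar Feature Lab:

```python
# Sketch of the eight statistical features of Lyon et al. (2016).
# `profile` and `dm_curve` stand in for a candidate's integrated
# pulse profile P and DM-SNR curve D.
import numpy as np
from scipy.stats import kurtosis, skew


def lyon_features(profile, dm_curve):
    """Return [Prof.mu, Prof.sigma, Prof.k, Prof.s, DM.mu, DM.sigma, DM.k, DM.s]."""
    feats = []
    for series in (np.asarray(profile, float), np.asarray(dm_curve, float)):
        feats += [series.mean(),
                  series.std(),
                  kurtosis(series),  # excess kurtosis (Fisher definition)
                  skew(series)]
    return np.array(feats)


# Example with synthetic data: a Gaussian-like pulse on a noisy baseline.
rng = np.random.default_rng(0)
bins = np.arange(64)
profile = np.exp(-0.5 * ((bins - 32) / 3.0) ** 2) + 0.05 * rng.standard_normal(64)
dm_curve = np.exp(-0.5 * ((bins - 20) / 8.0) ** 2) + 0.05 * rng.standard_normal(64)
print(lyon_features(profile, dm_curve).round(3))
```

A narrow pulse on a flat baseline yields a strongly leptokurtic, right-skewed profile, which is exactly the regime these features are designed to separate from noise.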
These statistical features are effective and superior to the empirical features used previously. This conclusion was confirmed again by the work of Tan et al. (2018). Since the eight statistical features designed by Lyon et al. (2016) are prone to misclassifying wide-profile pulsars in practical classification, Tan et al. (2018) improved on them by designing eight new features using time-phase and frequency-phase information. At the same time, pulsar-like RFI is classified separately, so the pulsar candidate classification task can be regarded either as a binary classification problem (pulsars, non-pulsars) or as a three-class problem (pulsars, noise, RFI). Multiple decision trees are constructed to improve the performance. On the new LOTAAS data set (called LOTAAS 2 to distinguish it from the earlier one), the recall is 98.7 per cent, an increase of 2.5 per cent, and the FPR is reduced from 2.8 per cent to 0.5 per cent. However, increasing the number of statistical features invites the curse of dimensionality and makes overfitting more likely. In this study, we use the eight statistical features designed by Lyon et al. (2016) on HTRU 2 and LOTAAS 1.
As mentioned above, although these statistical features have been proved effective, the range of the same feature varies between telescopes, because different telescopes have different geographical locations and observation angles. Even for the signal from the same pulsar, different telescopes receive different data, which makes the resulting survey data sets differ. One obvious consequence is that different data sets show different degrees of class overlap under the same feature, even a statistical one. This matters because the degree of overlap affects the performance of machine learning classification algorithms.
3 SCATTER MATRIX-BASED CLASS SEPARABILITY MEASURE
Yu et al. (2014) systematically analysed the reasons for the harmfulness produced by class imbalance. The fundamental factors are the degree of class overlap and the class imbalance ratio. Fig. 1 shows the relationship between sample distribution and the harmfulness of class imbalance in a more general situation.
Figure 1. Relationship between sample distribution and the harmfulness of class imbalance: (a) a non-overlapping balanced classification task; (b) an overlapping balanced classification task; (c) and (d) non-overlapping and overlapping imbalanced classification tasks, respectively. • and ★ represent samples belonging to different classes, the circled regions denote misclassified examples, and the black curve is the boundary between the two classes.
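Yu et al.'s exact formula is not reproduced in this section, but a typical scatter matrix-based separability measure compares between-class to within-class scatter. The sketch below uses the trace-ratio form as an assumed stand-in for their measure; larger values indicate better-separated (less harmful) class structure:

```python
# Hedged sketch of a scatter matrix-based class separability measure.
# The trace-ratio form tr(S_b)/tr(S_w) is one common instance, not
# necessarily the exact formula of Yu et al. (2014).
import numpy as np


def separability(X, y):
    """Trace ratio tr(S_b)/tr(S_w); larger means better-separated classes."""
    X = np.asarray(X, float)
    mu = X.mean(axis=0)                       # global mean
    S_w = np.zeros((X.shape[1], X.shape[1]))  # within-class scatter
    S_b = np.zeros_like(S_w)                  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        D = Xc - mu_c
        S_w += D.T @ D
        d = (mu_c - mu)[:, None]
        S_b += len(Xc) * (d @ d.T)
    return np.trace(S_b) / np.trace(S_w)


# Well-separated classes score higher than overlapping ones (cf. Fig. 1).
rng = np.random.default_rng(1)
y = np.array([0] * 50 + [1] * 50)
X_far = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))]
X_near = np.r_[rng.normal(0, 1, (50, 2)), rng.normal(1, 1, (50, 2))]
print(separability(X_far, y) > separability(X_near, y))  # True
```

Because the measure depends only on the training samples and their labels, it can be evaluated before any classifier is trained, which is exactly how it serves as a pre-estimate of the harmfulness of imbalance.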
4 PSEUDO-NEAREST CENTROID NEIGHBOUR CLASSIFICATION
4.1 The proposed algorithm
The non-parametric machine learning algorithm PNCN (Ma et al. 2016) is applied to the pulsar candidate classification task. In what follows, for ease of presentation of the general recognition problem, suppose there is a training set |$T=\lbrace x_{i}\in R^{d}\rbrace ^{N}_{i=1}$| with N training samples and M classes in a d-dimensional feature space, with corresponding class labels c1, c2, ..., cM. The subset of T from class cl is |$T_{l}=\lbrace x_{lj}\in R^{d}\rbrace ^{N_{l}}_{j=1}$|, containing Nl training samples. In this study, the algorithm operates on two data sets with eight input features and one output variable, i.e. M = 2 and d = 8.
The nearest centroid neighbours of a query x are found iteratively:
- Find the first nearest centroid neighbour |$x_{1}^{\mathrm{ NCN}}$| of x, which is simply its nearest neighbour.
- Find the ith nearest centroid neighbour |$x_{i}^{\mathrm{ NCN}} (i \ge 2)$|, chosen such that the centroid of |$x_{i}^{\mathrm{ NCN}}$| and all previously selected centroid neighbours, i.e. |$x_{1}^{\mathrm{ NCN}}, x_{2}^{\mathrm{ NCN}}, ... , x_{i\text{-}1}^{\mathrm{ NCN}}$|, is the closest to x.
The PNCN decides the class label of x as follows:
Search k-nearest centroid neighbours from Tl of each class cl for the query pattern x in the training set T, say |$T^{\mathrm{ NCN}}_{lk}(x) = \lbrace x^{\mathrm{ NCN}}_{lj}\in R^{d} \rbrace ^{k}_{j=1}$| .
- Compute the local mean vector |$\bar{u}_{lj}^{\mathrm{ NCN}}(x)$| of the first j nearest centroid neighbours of a query x from class cl. Let |$\bar{U}_{lk}^{\mathrm{ NCN}}(x)=\lbrace \bar{u}_{lj}^{\mathrm{ NCN}}(x)\in R^{d}\rbrace _{j=1}^{k}$| represent the set of the k local mean vectors corresponding to k-nearest centroid neighbours in the class cl, and the corresponding Euclidean distances to x are |$d(x, \bar{u}_{l1}^{\mathrm{ NCN}}(x)), d(x, \bar{u}_{l2}^{\mathrm{ NCN}}(x)), . . . , d(x, \bar{u}_{lk}^{\mathrm{ NCN}}(x))$|, where(7)$$\begin{eqnarray} \bar{u}_{lj}^{\mathrm{ NCN}}(x)=\frac{1}{j}\sum _{m=1}^j x_{lm}^{\mathrm{ NCN}}. \end{eqnarray}$$
- Assign different weights to k categorical local mean vectors. The weight |$\bar{W}_{lj}^{\mathrm{ NCN}}$| of the jth local mean vector |$\bar{u}_{lj}^{\mathrm{ NCN}}(x)$| for the class cl is given as(8)$$\begin{eqnarray} \bar{W}_{lj}^{\mathrm{ NCN}}=\frac{1}{j}, \ \ j = 1, ... , k. \end{eqnarray}$$
- Design the pseudo-nearest centroid neighbour |$\bar{x}_{l}^{\mathrm{ PNCN}}(x)$| of the query pattern x from class cl, and cl can be viewed as the class label of |$\bar{x}_{l}^{\mathrm{ PNCN}}(x)$|. The distance |$\bar{d}(x,\bar{x}_{l}^{\mathrm{ PNCN}}(x))$| between x and |$\bar{x}_{l}^{\mathrm{ PNCN}}(x)$| can be defined by the weighted sum of distances of k categorical local mean vectors to x as follows:(9)$$\begin{eqnarray} \bar{d}\left(x, \bar{x}_{l}^{\mathrm{ PNCN}}(x)\right) &=& \bar{W}_{ l1}^{\mathrm{ NCN}}\times d\left(x,\bar{u}_{l1}^{\mathrm{ NCN}}(x)\right) \nonumber\\ &&+\,... + \bar{W}_{lk}^{\mathrm{ NCN}}\times d(x,\bar{u}_{ lk}^{\mathrm{ NCN}}(x)). \end{eqnarray}$$
- Classify the query pattern x into the class c, as follows:(10)$$\begin{eqnarray} c = \arg \min _{c_{l}} \bar{d}\left(x, \bar{x}_{l}^{\mathrm{ PNCN}}(x)\right). \end{eqnarray}$$
According to the procedure of the PNCN above, we summarize it in Algorithm 1 by means of the pseudo-codes. The development code of the PNCN algorithm can be found at https://github.com/xiaojiangping/PNCN.
Algorithm 1. The PNCN classification rule.
|$\mathbf {Input: }$| A query pattern x, the number of nearest neighbours k, a training set T.
|$\mathbf {Output:}$| The class label assigned to x.
|$\mathbf {Steps:}$|
1. Calculate the Euclidean distances from the training samples in each class cl to x.
2. Search the k-nearest centroid neighbours of x in each class cl, say |$T_{lk}^{\mathrm{ NCN}}(x)=\lbrace x_{lj}^{\mathrm{ NCN}}\in R^{d}\rbrace ^{k}_{j=1}$|.
3. Compute the local mean vector |$\bar{u}_{lj}^{\mathrm{ NCN}}(x)$| of the first j nearest centroid neighbours of x using |$T^{\mathrm{ NCN}}_{lk}(x)$|, and the corresponding distance |$d(x, \bar{u}_{lj}^{\mathrm{ NCN}}(x))$| between |$\bar{u}_{lj}^{\mathrm{ NCN}}(x)$| and x.
4. Allocate the weight |$\bar{W}_{lj}^{\mathrm{ NCN}}$| to the jth local mean vector |$\bar{u}_{lj}^{\mathrm{ NCN}}(x)$| in the set |$\bar{U}_{lk}^{\mathrm{ NCN}}(x)$|.
5. Construct the pseudo-nearest centroid neighbour |$\bar{x}_{l}^{\mathrm{ PNCN}}(x)$| using |$\bar{W}_{lk}^{\mathrm{ NCN}}$| and |$d(x,\bar{u}_{lk}^{\mathrm{ NCN}}(x))$|.
6. Assign to x the class c of the closest pseudo-nearest centroid neighbour.
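The steps above can be sketched compactly in Python. This is an illustration of the procedure as described in equations (7)–(10), not the reference implementation linked in the text; the function names and toy data are our own:

```python
# Illustrative sketch of the PNCN rule (Ma et al. 2016).
import numpy as np


def nearest_centroid_neighbours(x, X, k):
    """Greedy NCN search: each new neighbour minimizes the distance from
    the centroid of the neighbours chosen so far (plus itself) to x."""
    remaining = list(range(len(X)))
    chosen = []
    for _ in range(k):
        best = min(remaining,
                   key=lambda i: np.linalg.norm(X[chosen + [i]].mean(axis=0) - x))
        chosen.append(best)
        remaining.remove(best)
    return X[chosen]


def pncn_predict(x, X_train, y_train, k=5):
    x = np.asarray(x, float)
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        ncn = nearest_centroid_neighbours(x, X_train[y_train == c], k)
        # Weighted sum over local means of the first j NCNs, weight 1/j
        # (equations 7-9); classify by the minimum pseudo-distance (eq. 10).
        pseudo_dist = sum((1.0 / j) * np.linalg.norm(ncn[:j].mean(axis=0) - x)
                          for j in range(1, k + 1))
        if pseudo_dist < best_dist:
            best_class, best_dist = c, pseudo_dist
    return best_class


# Toy usage: two well-separated 2-D classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6]], float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(pncn_predict([0.6, 0.4], X, y, k=3))  # 0
```

The 1/j weighting means the first, most local mean dominates the pseudo-distance, while later means contribute progressively less, which is what makes the rule robust to outliers among the farther neighbours.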
The PNCN classification algorithm has a relatively high computational cost; details are given in Table 3. Keeping only the leading-order terms, the computational cost of KNN is O(Nd + Nk) and that of PNCN is O(2Ndk + Mdk2/2). When the nearest neighbour number k and the number of classes M are small, there is not much difference between the KNN and PNCN classifiers in computational cost.
Details of the computational cost of the KNN and PNCN classification algorithms. k denotes the nearest neighbour number, M the number of classes, and N and d the number of training samples and the dimension of the feature space, respectively.
| Classifier | Computational cost | Description |
|---|---|---|
| KNN | O(Nd + Nk + k) | For each class, KNN is used to determine the class of the sample. |
| PNCN | O(2Ndk + Nk) | Search the k-nearest centroid neighbours in each class. |
| | O(Mdk(k + 1)/2) | Calculate the k local mean vectors corresponding to the k-nearest centroid neighbours for each class. |
| | O(Mk) | Assign weights to the local mean vectors of each class. |
| | O(Md + M) | For each class, PNCN is used to determine the class of the sample. |
The classification performance of the standard KNN algorithm is degraded by three main issues: sparseness, imbalance, and noise. The presence of noise and RFI makes the class overlap severe, which is why standard KNN has performed poorly on pulsar data in the past. The PNCN classification algorithm is less susceptible to this problem because it takes into account not only the spatial distribution of the samples (the k-nearest centroid neighbours of a query x in each class), but also their local means (the local mean points corresponding to the nearest centroid neighbours), and it weights the contribution of these local means according to their distance from the query. Therefore, theoretically, the PNCN classification algorithm is effective and robust in more practical situations.
4.2 Data streams
Data streams refer to large, continuous, fast-arriving, potentially infinite, and time-varying data sequences. More information about data streams can be found in Gaber et al. (2005) and Lyon et al. (2013, 2014).
Continuous improvement of survey specifications is leading to rapid data generation, especially the exponential rise in pulsar candidate numbers and data volumes. For example, the Square Kilometre Array (SKA) is currently under development by an international consortium (Carilli & Rawlings 2004). In the near future, the SKA data rate for pulsar detection is expected to be between 0.43 and 1.45 TB s−1 (Smits et al. 2009). This is many orders of magnitude greater than previous surveys, making online classification of data streams a challenge.
Since labelled reference data and the real-world data to be processed are often not available at the same time, a common practice is to train a supervised candidate classifier on some off-line data and then apply it to a data stream. This is reasonable, but it requires that the training data be i.i.d. with the data in the stream. However, some data streams are known to exhibit distributional shifts over varying time periods. For example, a changing RFI environment can exhibit shifts over short (minutes/hours) and/or long (days/weeks/years) time-scales. In either case, the shifts violate the i.i.d. assumption, a phenomenon known as concept drift (Widmer & Kubat 1996; Gaber et al. 2005). In recent years, increasing attention has been paid to machine learning methods suited to pulsar classification from data streams, for example the GH-VFDT algorithm (Lyon et al. 2016).
Theoretically, the PNCN classification algorithm can be applied to data streams. On the one hand, like the KNN classifier, the PNCN classifier does not build a model in advance, which makes it suitable for data streams. When a new sample arrives, the PNCN classifier finds the k-nearest centroid neighbours of the new sample in the training space, calculates the pseudo-nearest centroid neighbours, and finally assigns the class label of the closest pseudo-nearest centroid neighbour to the new sample. On the other hand, as described in the discussion of computational cost in Section 4.1, the computational cost of the PNCN algorithm is high, so special hardware such as a graphics processing unit (GPU) may be required when applying the method to data streams.
5 STRATIFICATION CROSS-VALIDATION AND EVALUATION MEASURES
5.1 Stratification cross-validation
In our study, we use stratified cross-validation to evaluate the performance of the PNCN algorithm. Stratified cross-validation is a variant of cross-validation, a resampling method widely used in machine learning for model evaluation, especially on imbalanced data sets.
The basic idea of cross-validation is to divide the reference data into two subsets: a training set used to learn the parameters of a model (for example, the connection weights of an ANN) and a test set. There are two main reasons for using cross-validation. First, it can be used to evaluate the prediction performance of a model on fresh data, which reduces overfitting to a certain extent. Secondly, it extracts as much useful information as possible from limited data. In ordinary cross-validation, the training and test sets are divided randomly, which does not necessarily preserve the imbalanced class proportions of the original data; failing to preserve these proportions can lead to overly optimistic evaluation results. Stratified cross-validation therefore makes the method more suitable for imbalanced data by maintaining the proportion of each category of the original data in every fold.
The process of K-fold stratified cross-validation is as follows. The data set is divided into K subsets, each of which maintains the class proportions of the original data. Cross-validation then repeats K training-test procedures: in each procedure, a single subset is used as the test set and the remaining K-1 subsets constitute the training set, so that each subset serves as a test set exactly once. K models are therefore learned, K test results are computed, and their average is taken as the final performance measure. In practice, five-fold and ten-fold cross-validation are the most common configurations. In our study, we used five-fold stratified cross-validation; the flow chart is shown in Fig. 3.
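The stratified splitting described above can be sketched as a minimal NumPy implementation (in practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used; the function names here are ours):

```python
import numpy as np

def stratified_kfold(y, k=5, seed=0):
    """Partition sample indices into k folds, each preserving the
    class proportions of the label array y."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        # deal this class's shuffled indices round-robin into folds,
        # so every fold receives ~1/k of each class
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    return [np.array(f) for f in folds]

def cv_splits(y, k=5):
    """Yield the K training-test index pairs; each fold is the
    test set exactly once."""
    folds = stratified_kfold(y, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

For a data set with a 9:1 class ratio, every fold produced this way keeps that same 9:1 ratio, which is the property that prevents the overly optimistic evaluations mentioned above.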

5.2 Evaluation measures
Since the pulsar data sets are extremely imbalanced, researchers are more concerned with the classification accuracy on pulsars. In this kind of scenario, recall, precision, and FPR are three commonly used performance metrics. Recall is the proportion of positive samples (pulsars) correctly classified; the higher the recall, the more pulsars are correctly identified. Precision is the proportion of actual positives among the samples the classifier labels as positive. FPR is the proportion of negative samples (non-pulsars) mistakenly recognized as positives by the classifier. The higher the precision or the lower the FPR, the fewer non-pulsar signals are misclassified.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
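Given the confusion-matrix entries, the three metrics can be computed as in this short sketch (positive class = pulsar, encoded as label 1; the function name is ours):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Recall, precision, and FPR from the binary confusion matrix."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    recall = tp / (tp + fn)     # fraction of real pulsars recovered
    precision = tp / (tp + fp)  # fraction of flagged candidates that are pulsars
    fpr = fp / (fp + tn)        # fraction of non-pulsars misclassified
    return recall, precision, fpr
```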
6 RESULTS AND DISCUSSIONS
6.1 The harmful test for the pulsar candidate classification tasks
In our study, there are two pulsar candidate classification tasks, using the survey data sets HTRU 2 and LOTAAS 1, respectively. The class imbalance ratio of HTRU 2 is 9.92 and that of LOTAAS 1 is 75.56, so both are highly imbalanced. Before addressing the classification tasks, a harm test is conducted: the evaluation criterion Q, a scatter matrix-based class separability measure, is applied to estimate the harmfulness of the class imbalance.
In our experiment, we set the threshold TQ = 0.1 (Yu 2017) and adopt five-fold stratified cross-validation, i.e. 80 per cent of each data set is used as training samples. The value of Q is averaged over 10 independent runs. The results are QHTRU2 = 0.040413 and QLOTAAS1 = 0.002981, both less than TQ. As described in Section 3, this means the class imbalance is harmful for both the HTRU 2 and LOTAAS 1 classification tasks, so directly applying traditional machine learning methods will not necessarily achieve good classification results. Indeed, the accuracy and precision of gh-vfdt are higher, and its FPR lower, than those of four traditional classifiers (C4.5, MLP, NB, and SVM) on HTRU 2 (Table 5; Lyon et al. 2016), which is consistent with the harm test. It is therefore worthwhile to pre-estimate the harmfulness of imbalance in pulsar survey data sets before processing the classification tasks.
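The exact form of Q follows Yu (2017) and is not reproduced here. As a hedged illustration, the sketch below computes a related and commonly used scatter-matrix separability score, the trace ratio tr(S_b)/tr(S_w) of between-class to within-class scatter, which likewise grows as the classes become more separable; the function name is ours.

```python
import numpy as np

def trace_ratio_separability(X, y):
    """tr(S_b) / tr(S_w): between-class over within-class scatter.
    Larger values indicate better-separated classes. This is one
    common scatter-matrix measure, not necessarily the exact Q of
    Yu (2017)."""
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    sw = np.zeros((d, d))  # within-class scatter matrix
    sb = np.zeros((d, d))  # between-class scatter matrix
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        dev = Xc - mc
        sw += dev.T @ dev
        diff = (mc - mean_all).reshape(-1, 1)
        sb += len(Xc) * (diff @ diff.T)
    return np.trace(sb) / np.trace(sw)
```

Two well-separated Gaussian clusters score much higher than two heavily overlapping ones, which is the behaviour a harm-test criterion of this family relies on.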
The performance of different ML classifiers. LC1 and LC2 are two ensemble classifiers used on LOTAAS 1 and LOTAAS 2, respectively (Tan et al. 2018). Bold type indicates the best performance on HTRU 2 and LOTAAS 1. aIndicates the best performance compared between PNCN and LC2.

| Data set | Classifier | Accuracy | Recall | Precision | G-mean | F-score | FPR |
|---|---|---|---|---|---|---|---|
| HTRU 2 | C4.5 | 0.946 | 0.904 | 0.635 | 0.926 | 0.740 | 0.051 |
| HTRU 2 | MLP | 0.947 | 0.913 | 0.650 | 0.931 | 0.752 | 0.050 |
| HTRU 2 | NB | 0.937 | 0.863 | 0.579 | 0.902 | 0.692 | 0.057 |
| HTRU 2 | SVM | 0.871 | 0.901 | 0.723 | 0.919 | 0.789 | 0.031 |
| HTRU 2 | gh-vfdt | 0.978 | 0.829 | 0.899 | 0.907 | 0.862 | 0.008 |
| HTRU 2 | KNN | 0.978 | 0.825 | 0.930 | 0.906 | 0.875 | 0.006 |
| HTRU 2 | PNCN | 0.978 | 0.831 | 0.923 | 0.908 | 0.874 | 0.007 |
| LOTAAS 1 | C4.5 | 0.990 | 0.948 | 0.494 | 0.969 | 0.623 | 0.009 |
| LOTAAS 1 | MLP | 0.997 | 0.979 | 0.753 | 0.988 | 0.846 | 0.002 |
| LOTAAS 1 | NB | 0.996 | 0.959 | 0.673 | 0.977 | 0.782 | 0.004 |
| LOTAAS 1 | SVM | 0.999 | 0.901 | 0.966 | 0.949 | 0.932 | 0.001 |
| LOTAAS 1 | gh-vfdt | 0.998 | 0.789 | 0.875 | 0.888 | 0.830 | 0.001 |
| LOTAAS 1 | KNN | 0.999 | 0.947 | 0.978 | 0.973 | 0.961 | 0.0003 |
| LOTAAS 1 | PNCN | 0.999a | 0.956 | 0.974 | 0.977 | 0.964 | 0.0003a |
| LOTAAS 1 | LC1 | 0.968 | 0.962 | 0.961 | 0.967 | 0.961 | 0.028 |
| LOTAAS 2 | LC2 | 0.992 | 0.987a | 0.993a | 0.991a | 0.990a | 0.005 |
6.2 Some evaluations of the PNCN algorithm
In this study, a non-parametric machine learning algorithm, PNCN, is proposed to solve the pulsar candidate classification tasks. The PNCN algorithm is an improvement of KNN that can effectively handle class-imbalanced classification problems.
In this experiment, we adopt five-fold stratified cross-validation. Six performance metrics, accuracy, recall, precision, G-Mean, F-Score, and FPR, are calculated by averaging the results of 10 independent runs. The experimental results are compared with the KNN classifier and six other classifiers. These experiments are conducted on HTRU 2, LOTAAS 1, and LOTAAS 2 (Lyon et al. 2016; Tan et al. 2018), and the evaluation results are presented in Table 5.
The HTRU 2 data set is a standard telescope data set widely used to evaluate the performance of machine learning algorithms. Compared with the classification algorithms applied to HTRU 2 in Lyon et al. (2016), the performance recorded in Table 5 shows that the PNCN classifier outperforms all five classifiers (C4.5, MLP, NB, SVM, and gh-vfdt) in four evaluation metrics: accuracy, precision, F-Score, and FPR.
Next, we compare the performance of the classification algorithms on HTRU 2 and LOTAAS 1. On the one hand, the recall of the PNCN is higher than that of the KNN on both data sets, which means the PNCN classifier is superior to the traditional KNN classifier at identifying pulsars. On the other hand, for this binary classification problem our proposed classifier outperforms all seven compared classifiers on most of the remaining evaluation metrics. Specifically, the proposed classifier identifies pulsars with high accuracy: 97.8 per cent on HTRU 2 and 99.9 per cent on LOTAAS 1. The precision and the recall on HTRU 2 are 92.3 per cent and 83.1 per cent, and those on LOTAAS 1 are 97.4 per cent and 95.6 per cent, respectively. The FPR on HTRU 2 is 0.7 per cent; the FPR on LOTAAS 1 is 0.03 per cent, an order of magnitude lower than previously reported values. Although the proposed classifier exhibits slightly lower recall and precision than LC2 (Tan et al. 2018) on LOTAAS 1, its FPR is an order of magnitude lower than that of LC2. Finally, comparing the PNCN results on HTRU 2 and LOTAAS 1, the performance on HTRU 2 is noticeably weaker. The main reason is that the distribution of data collected by different telescopes is very different, owing, for example, to differences in observing frequency and angle. Therefore, although the same feature formulas are used, the range of each feature varies with the data set, which leads to different degrees of overlap of the same features in different data sets. As mentioned in Section 3, the degree of class overlap affects the classification performance of machine learning methods.
Furthermore, one significant advantage of the PNCN algorithm is its robustness to k. On the one hand, the results in Figs 4 and 5 show the effect of k on classification performance: as k varies, there is no significant change in any of the performance metrics, and Figs 6 and 7 show that the FPR flattens for larger k. All of this suggests that the PNCN algorithm is robust to k. Because the data sets are imbalanced, researchers pay more attention to the classification accuracy on pulsars, so recall, precision, and FPR are generally used to assess the algorithm; considering the changes in these three metrics, the best choice of k in PNCN is 7 for HTRU 2 and 4 for LOTAAS 1. On the other hand, comparing PNCN and KNN, both classifiers achieve high accuracy and precision and low FPR. Although the precision of the PNCN is slightly lower than that of the KNN on both data sets, the PNCN retains its robustness to k. This is important because robustness to k means the performance of the PNCN algorithm is more stable as the data change. This stability suggests that the PNCN classification algorithm is less susceptible to noise and RFI than the empirical KNN classification algorithm, and it makes the method promising for application to data streams. Also, as discussed in Section 4.1, the computational costs of the two algorithms are similar, since the number of categories M is 2, which is not large.

Performance evaluations on HTRU 2 using accuracy, recall, precision, G-Mean, and F-Score.

Performance evaluations on LOTAAS 1 using accuracy, recall, precision, G-Mean, and F-Score.


Overall, the PNCN algorithm can effectively solve the pulsar candidate classification task.
7 CONCLUSIONS
In this paper, an evaluation criterion using a scatter matrix-based class separability measure is applied to estimate the harmfulness of class imbalance in pulsar survey data sets. The results of this harm test give a quantitative measure of the damage caused by imbalance in pulsar survey data sets, and they provide prior information that guides the selection of appropriate data processing approaches and the construction of effective classifiers. This is the first time such a pre-evaluation strategy has been used for pulsar candidate classification tasks.
We proposed a PNCN classifier for pulsar candidate selection. The PNCN classifier is a non-parametric machine learning method used here for the first time for the pulsar candidate selection problem. For data features, we used the statistical features designed by Lyon et al. (2016). The results show that the PNCN algorithm can effectively solve the class imbalance problem in pulsar candidate classification tasks, and our work demonstrates once again that pulsar candidate selection based on statistical features is effective.
Although the FPR of the PNCN classifier on LOTAAS 1 is an order of magnitude lower than the corresponding FPRs obtained in Lyon et al. (2016) and Tan et al. (2018), its recall and precision are not superior to those of the LC2 classifier. Whether combining the classifier with appropriate data resampling methods can improve recall and precision is a question we plan to explore in future work. We also plan to apply this non-parametric machine learning method to other pulsar survey data sets.
ACKNOWLEDGEMENTS
The authors are grateful for the support from the National Natural Science Foundation of China (grant no. 11973022), the Joint Research Fund in Astronomy (U1531242) under a cooperative agreement between the National Natural Science Foundation of China (NSFC) and the Chinese Academy of Sciences (CAS), the Major Projects of the Joint Fund of Guangdong and the National Natural Science Foundation (U1811464), the Natural Science Foundation of Guangdong Province (2014A030313425, S2011010003348), and the China Scholarship Council (201706755006).