SUMMARY

Signal and noise classification can add an extra level of constraint for earthquake phase picking by pinpointing the signal waveforms in continuous seismic data, enabling more accurate arrival picking. However, the ever-growing volume of data collected by worldwide stations exceeds the capacity of manual analysis. Moreover, manual earthquake data analysis depends on seismologists' expert knowledge, resulting in inconsistent analysis results. To address this, we propose a generalized deep learning (DL) network architecture to discriminate between earthquake signal and noise waveforms. The proposed DL framework is a novel architecture comprising a feature extractor, a classifier and two hybrid attention modules. It utilizes different kernel sizes for more detailed feature extraction, and the hybrid attention modules guide the network to focus on waveform characteristics. To illustrate the power of the proposed DL network, we applied it to classify the earthquake signals and noise of the 3-C Texas Earthquake Dataset. The results demonstrate that the accuracy of the proposed method on the testing set reaches 99.83 per cent. We further utilize a transfer learning strategy to demonstrate the transferability of the proposed network to the STanford EArthquake Dataset, achieving an encouraging classification accuracy of 95.03 per cent. Additionally, we conducted an experiment on arrival picking by integrating decoder blocks into the classification network, which achieves remarkable P- and S-wave arrival picking accuracy.

1 INTRODUCTION

Classification between earthquake signal and noise is an inevitable problem in real-time earthquake monitoring. The seismic records obtained from continuous monitoring over long periods contain various information, which supports passive source location, near-surface ground-motion prediction and ground-motion metric inversion. However, these continuous monitoring records contain numerous interferences, degrading the precision of P- and S-wave phase picking, magnitude estimation and passive source location. Furthermore, the sheer volume of continuous seismograms obtained from thousands of stations surpasses the capacity for manual analysis. Hence, the development of automated detection and classification methods employing machine-learning algorithms is crucial. Nonetheless, microearthquake events, characterized by their low magnitude and high attenuation, pose significant challenges to detection and classification tasks.

Prior work on earthquake classification and detection can be divided into conventional physics-driven approaches and data-driven methods represented by deep learning. Physics-driven approaches employ specific mathematical equations to represent signal and noise waveforms and classify them within the physical domain. Among the many seismicity detection and classification techniques, the short-term average/long-term average (STA/LTA) method (Vaezi & Van der Baan 2015), which computes the ratio between the short-term and long-term averages of time-series earthquake data, is a prominent physics-driven approach to earthquake signal identification. However, it cannot effectively separate signal from noise when the seismic records have weak amplitudes (Seydoux et al. 2020). Beyreuther & Wassermann (2008) investigated earthquake detection and classification in continuous 3-C records using discrete hidden Markov models (DHMMs); their single-station experiments showed promising classification results and demonstrated the superiority of DHMMs over benchmark methods. The Stockwell transform (Cheng et al. 2019; Sharma & Nanda 2020; Saad et al. 2023a) can extract time-frequency features from signals, thereby minimizing classification errors; its high time-frequency resolution and effective time-frequency clustering contribute to improved classification accuracy. However, the aforementioned conventional methods struggle to handle seismic records from multiple stations, often yielding incorrect predictions, especially when the earthquake records exhibit large amplitude variations or low signal-to-noise ratios.
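To make the STA/LTA criterion concrete, a minimal numpy sketch is given below; the window lengths and trigger threshold are illustrative assumptions, not values from the cited studies.

```python
import numpy as np

def sta_lta(trace, sta_win=50, lta_win=500):
    """Characteristic function: ratio of short- to long-term average energy.

    Window lengths are in samples (e.g. 0.5 s and 5 s at 100 Hz)."""
    energy = trace ** 2
    sta = np.convolve(energy, np.ones(sta_win) / sta_win, mode="same")
    lta = np.convolve(energy, np.ones(lta_win) / lta_win, mode="same")
    return sta / (lta + 1e-12)   # guard against division by zero in quiet segments

# An event is declared wherever the ratio exceeds a user-chosen threshold.
cf = sta_lta(np.random.randn(6000))
onsets = np.where(cf > 3.0)[0]
```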

Recent years have witnessed the emergence of data-driven machine-learning algorithms, offering a novel approach to processing vast amounts of data with minimal prior knowledge (Seydoux et al. 2020). Machine-learning algorithms, for example decision trees (Chin et al. 2019), multilayer perceptrons (Giacco et al. 2009), random forests (Provost et al. 2017; Wenner et al. 2021), logistic regression classifiers (Wang et al. 2023) and support vector machines (Kim et al. 2020), have found widespread application in earthquake classification and other seismological tasks. These trained algorithms can implicitly discern between signal and noise waveforms (Kong et al. 2016). However, in both the physics-driven and classical machine-learning approaches, features are manually extracted and subjectively chosen in an attempt to characterize the waveforms comprehensively.

Deep learning (DL) methods offer a novel data-driven approach to automatic feature extraction, leading to notable advancements in seismological data processing and interpretation. These models are trained on manual annotations to automatically extract relevant features and decision rules from extensive labelled data sets, facilitating comprehensive analysis across a broad spectrum of input data. Recently, a group of neural networks has found promising applications in the seismology community, enabling more precise and efficient feature extraction, for example in event detection (Zhu & Beroza 2019; Saad et al. 2023b, a), event classification (Gulia & Wiemer 2019; Mousavi et al. 2019b; Saad et al. 2020; Ma et al. 2023), data reconstruction (Wang et al. 2022), first-motion polarity classification (Mousavi et al. 2019b), magnitude estimation (McBrearty & Beroza 2022), P- and S-wave phase picking (Zhu & Beroza 2019; Saad et al. 2023b), single-station source location (Zhang et al. 2019), earthquake correlation (Zhang et al. 2019) and focal mechanism solutions (Zhang et al. 2022).

Minimizing the false negative and false positive rates is the main goal in earthquake classification (Mousavi et al. 2020). DL can learn the essential features of earthquake waveforms and provide high-order feature extraction through complex nonlinear mapping, which makes deep neural networks well suited to applications in seismology, that is event detection, phase picking and earthquake location. Seydoux et al. (2020) proposed an unsupervised machine-learning framework for real-time earthquake signal/noise detection and clustering. They utilized a deep scattering network for automatic feature extraction and a Gaussian mixture model to cluster signal and noise in the feature domain. Classification performance on continuous seismic records and blind detection show that this approach can achieve relatively high-precision detection in seismogenic regions in a ground-truth-free way. Saad et al. (2022b) introduced a capsule neural network to discriminate earthquakes from quarry blasts. Because the time-series of earthquakes and quarry explosions are highly similar, manual identification can easily lead to misclassification; their capsule neural network learns the dependencies between the feature maps extracted by the network to classify scalograms of the two event types, and it achieves remarkable discrimination results from limited training samples. Jiang et al. (2023) built a multiclassifier architecture that utilizes a convolutional neural network (CNN) to classify earthquakes, rockfalls and quakes with weak signals. Inspired by the Visual Geometry Group network (VGGNet) (Simonyan & Zisserman 2014), they exploited 6-C time-series, short-time Fourier transform (STFT) maps and continuous wavelet transform (CWT) maps as input data and explored the network's discrimination performance on each. The results show that the CNN achieves the highest accuracy when using CWT maps as input. However, the CWT-based training set generation consumes more time than simply using the time-series as input.

Supervised deep learning methods require a large amount of labelled data for robust model training. Our investigation considers two approaches for generating input data in earthquake waveform classification and first-arrival picking tasks. One method transforms 1-D time-series into 2-D spectrograms using techniques such as the CWT (Rioul & Flandrin 1992) and STFT (Griffin & Lim 1984; Liu et al. 2020). These 2-D training samples provide enhanced frequency resolution and improved classification accuracy compared to 1-D time-series input. However, utilizing 2-D convolutional networks increases the time and memory requirements for label generation and network parametrization. Alternatively, employing the original 3-C time-series as input preserves the fundamental characteristics of seismic records and reduces training time. In this paper, we leverage a vast data set collected by TexNet and adopt 3-C time-series as input, which mitigates the computational burden compared to using 2-D spectrograms. Moreover, using time-series as input enhances the model's adaptability for real-time classification and first-arrival picking tasks.

Bearing in mind the above advantages and disadvantages, we propose a novel network incorporating multiscale feature fusion and hybrid attention modules for 3-C time-series earthquake signal and noise classification. The proposed network comprises three blocks: the feature extractor, the classifier and the hybrid attention blocks. To ensure the reproducibility of our work, we trained and tested the proposed DL framework on publicly available data sets: the Texas Earthquake Dataset (TXED) (Chen et al. 2024) and the STanford EArthquake Dataset (STEAD) (Mousavi et al. 2019a). A transfer learning strategy (Torrey & Shavlik 2010) is adopted to showcase the network's generalizability on STEAD. Moreover, based on the above classification network, a multitask network was built to perform both classification and phase picking. Compared with previous pioneering works, our main contributions to DL-based earthquake signal and noise classification are as follows:

  • We propose a novel DL framework for high-precision signal and noise discrimination in the TXED. The proposed method can still achieve high discrimination accuracy when faced with few training samples and unbalanced training data.

  • We utilize convolutional layers with different kernel sizes to extract multiscale features from the input 3-C time-series. The extracted feature maps are fused (along the last dimension) to enable the network to balance waveform details and overall characteristics (a minimal sketch of this module follows the list below).

  • We introduce a hybrid attention module (Woo et al. 2018), in which the channel attention mechanism allows the network to focus on the waveform characteristics of a single component, while the spatial attention combines waveform information across the three components, enabling better dimensionality reduction and feature extraction.

  • In the classifier module, we employ a convolutional neural network-gated recurrent unit (CNN-GRU) module (Chung et al. 2014) to enhance the model’s temporal modelling capabilities, thereby achieving high-precision earthquake signal and noise classification.

  • The transfer learning strategy is introduced to test the proposed DL network’s generalizability by applying it to the STEAD classification task.
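As referenced above, here is a minimal Keras sketch of one multiscale feature fusion (MSFF) module; the kernel sizes, filter counts and pooling are illustrative assumptions, not the exact configuration of Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def msff_block(x, filters=16, kernel_sizes=(3, 7, 11), linear=False):
    """One MSFF module: parallel Conv1D branches with different kernel
    sizes, fused (concatenated) along the channel (last) dimension."""
    branches = [layers.Conv1D(filters, k, padding="same")(x) for k in kernel_sizes]
    x = layers.Concatenate(axis=-1)(branches)
    # The first module keeps a linear activation so that negative-valued
    # samples of the raw waveform are not suppressed.
    if not linear:
        x = layers.LeakyReLU()(x)
    return layers.MaxPooling1D(pool_size=2)(x)   # downsample in time

inputs = layers.Input(shape=(6000, 3))           # 60 s at 100 Hz, 3 components
x = msff_block(inputs, linear=True)              # first module: linear activation
for _ in range(3):                               # three further MSFF modules
    x = msff_block(x)
```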

The remaining part of this paper is arranged as follows: we first review relevant prior works on earthquake classification using deep learning methods. Next, we detail the proposed deep learning network structure and provide insights into the training and testing procedures. Afterward, we apply the trained model to discriminate between earthquake signals and noise on the TXED. Additionally, we introduce a multitask deep learning network that integrates the classification network with additional decoder and output blocks to achieve both classification and arrival-picking tasks. Furthermore, we implement a transfer learning strategy for classifying STEAD signals and noise (M $>$ 3.0) and delve into the specific architecture of the proposed network in the "DISCUSSION" section. Finally, we present several key conclusions drawn from this study.

2 RELATED WORKS

In Table 1, we summarize previous pioneering applications of DL methods to earthquake signal and noise classification, analysing the discrimination and detection performance of different methods in terms of network structure, data types, special modules and their respective advantages.

Table 1. Prior work on earthquake detection and classification across different DL architectures and data sets.

| Reference | Objective | Network architecture | Data format | Data set |
| --- | --- | --- | --- | --- |
| Seydoux et al. (2020) | Earthquake signal and noise classification | Scattering network with Gaussian mixture model | Time-series | Nuugaatsiaq landslide |
| Linville et al. (2019) | Earthquake and quarry blast classification | CNN with long short-term memory | 3-C spectrograms | UUSS^a |
| Tibi et al. (2019) | Tectonic, mining-induced and mining earthquake classification | CNN | 3-C spectrograms | UUSS and UUEB^b |
| Mousavi et al. (2019c) | Earthquake event detection | Residual network | 3-C spectrograms | North California |
| Ku et al. (2020) | Earthquake detection, microearthquake and noise classification | Attention-based CNN | 3-C time-series | NECIS^c and IRIS^d |
| Mousavi et al. (2020) | Seismic detection and phase picking | Multitask learning | 3-C time-series | STEAD^e |
| Shakeel et al. (2021) | Earthquake magnitude estimation | 3D-CNN and RNN | 1-C log-Mel spectrograms | STEAD |
| Duan et al. (2021) | Microseismic event classification | CNN | 22-C seismograms | Synthetic data |
| Saad et al. (2021) | Earthquake event detection | U-net | 3-C CWT spectrograms | Texas, North California, Japan and Egypt |
| Liu et al. (2021) | Microseismic classification | CNN | 6-C seismograms | Underground coal mine |
| Saad et al. (2022b) | Earthquake and quarry blast classification | Capsule neural network | Seismograms | ENSN^f |
| Trani et al. (2022) | Earthquake, other event and noise classification | CNN | 3-C time-series and seismograms | KNMI^g |
| Saad et al. (2022a) | Event detection and magnitude estimation | Vision Transformer | 3-C time-series | STEAD |
| Jiang et al. (2023) | Earthquake, quake, rockfall and noise classification | CNN | 3-C time-series, STFT maps and CWT maps | Résif |
| Ma et al. (2023) | Microseismic, blasting and noise classification | Modified Visual Geometry Group 13-layer network (VGG13) with attention | 1-C STFT maps | — |
| Our model | Earthquake signal and noise classification, and arrival picking | MSFF with hybrid attention | 3-C time-series | TXED^h |
a. UUSS: University of Utah Seismograph Stations.
b. UUEB: Unconstrained Utah Event Bulletin.
c. NECIS: National Earthquake Comprehensive Information System.
d. IRIS: Incorporated Research Institutions for Seismology.
e. STEAD: STanford EArthquake Dataset.
f. ENSN: Egyptian National Seismic Network.
g. KNMI: Royal Netherlands Meteorological Institute.
h. TXED: Texas Earthquake Dataset for AI.


Prior works (Ku et al. 2020; Mousavi et al. 2020; Saad et al. 2022a; Trani et al. 2022; Jiang et al. 2023) utilized 3-C time-series as model input, which exploits most of the waveform characteristics in the time domain. Other methods (Linville et al. 2019; Tibi et al. 2019; Mousavi et al. 2019c; Shakeel et al. 2021; Saad et al. 2021; Jiang et al. 2023) used spectrograms generated by the STFT, CWT and other transforms as input, which provides the network with more frequency-resolution information and hence high precision in detection and classification tasks. However, transform-based data generation consumes more time than taking the time-series directly as input. Compared with the above pioneering works, we introduce a multiscale feature fusion model for automatic dimensionality reduction and feature extraction, which enables 1-D convolutional (Conv1-D) layers to take into account both local details and global features; we choose the 3-C time-series as the input to our DL architecture. Seydoux et al. (2020) used unsupervised learning for real-time earthquake classification and detection. Unlike supervised methods, it does not require labelled training pairs, which avoids the bias of manual training set annotation. However, unsupervised learning methods depend strongly on the specific network structure and carefully chosen hyperparameters, which limits their further application to diverse data sets. Earthquake event detection and phase picking share similarities, but their objectives differ; therefore, Mousavi et al. (2020) proposed a multitask DL architecture that trains a single neural network for seismic event detection and P- and S-wave picking. Compared to using only a CNN, attention mechanisms can guide the network to focus on essential waveform characteristics, while recurrent neural networks (RNNs) enhance the network's ability to model time-series data (Ku et al. 2020; Shakeel et al. 2021).

3 METHODOLOGY

This section is divided into three subsections. The first subsection details the proposed DL architecture. The second subsection describes the training and testing of the proposed method. Finally, we introduce objective evaluation metrics to assess the network's classification performance.

3.1 Network architecture

We propose a novel DL framework (see Figs 1 and 2) for discriminating signal and noise in 3-C earthquake time-series. The network primarily consists of a feature extractor, a classifier operating in the feature-space domain and two hybrid attention blocks. The feature extractor, composed of four multiscale feature fusion (MSFF) modules, enables automatic dimensionality reduction and feature extraction; we explore the impact of the number of MSFF modules on classification accuracy in the "DISCUSSION" section. The classifier estimates the noise and signal probabilities of 3-C waveforms within the feature space. The convolutional block attention module (CBAM) enhances the network's focus on waveform characteristics, thereby further improving the feature mapping. The feature extractor contains four submodules, each with three Conv1-D layers whose different kernel sizes extract feature information at different scales simultaneously. The feature maps with different receptive field sizes are then concatenated (Huang et al. 2017) to enhance the model's ability to extract features at various scales: the different kernel sizes capture both local (detailed) and global (overall) characteristics of earthquake signal and noise. Note that the feature fusion modules shown in Figs 1 and 2 share a common structure except for the activation function. Given that the input time-series consists of 1-D earthquake data with both positive and negative values, employing nonlinear activation functions directly may suppress neurons with negative values, potentially losing key waveform characteristics; consequently, we employ a linear activation function within the initial feature fusion module to maintain robustness. Because of the temporal dependencies in time-series earthquake data, we introduce gated recurrent units (GRUs) with different hidden sizes in the classifier block to enhance the network's temporal modelling capabilities. Finally, we employ a flatten layer and a fully connected layer to reshape the network output, and a softmax activation function ensures that the output conforms to a probability distribution.
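As a concrete illustration, below is a minimal Keras sketch of a 1-D CBAM-style hybrid attention block in the spirit of Woo et al. (2018); the reduction ratio and spatial kernel size are illustrative assumptions, not the exact settings of our network.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_1d(x, reduction=4):
    """Hybrid (channel + spatial) attention for 1-D feature maps."""
    c = x.shape[-1]
    # Channel attention: squeeze the time axis, re-weight each channel.
    shared_mlp = tf.keras.Sequential([layers.Dense(c // reduction, activation="relu"),
                                      layers.Dense(c)])
    avg_pool = layers.GlobalAveragePooling1D()(x)
    max_pool = layers.GlobalMaxPooling1D()(x)
    ca = layers.Activation("sigmoid")(layers.Add()([shared_mlp(avg_pool),
                                                    shared_mlp(max_pool)]))
    x = layers.Multiply()([x, layers.Reshape((1, c))(ca)])
    # Spatial attention: pool across channels, re-weight each time step.
    stats = layers.Lambda(lambda t: tf.concat(
        [tf.reduce_mean(t, axis=-1, keepdims=True),
         tf.reduce_max(t, axis=-1, keepdims=True)], axis=-1))(x)
    sa = layers.Conv1D(1, kernel_size=7, padding="same", activation="sigmoid")(stats)
    return layers.Multiply()([x, sa])
```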

Figure 1. The diagram of the proposed classification network. It includes a feature extractor, a classifier and a hybrid attention module. 3-C waveforms are the input data, and the output data are the probability values corresponding to signal and noise waveforms.

Figure 2. Details of the proposed DL architecture shown in Fig. 1. Each feature fusion module contains one fully connected layer with a ReLU activation function, three Conv1-D layers with different kernel sizes, one feature-map concatenation layer and one activation function. The classifier module has two GRU layers, each followed by a tanh activation function. Moreover, we add a flatten layer and a fully connected layer to reshape the output into a probability distribution. CBAM consists of two main parts: the spatial and channel attention mechanisms. The Add layer fuses the feature maps from these two blocks.

Additionally, because of the interdependence between earthquake classification and arrival picking, we developed an extended multitask network to address both tasks effectively. The multitask network is illustrated in Fig. 3. Based on the earthquake classification network shown in Fig. 1, we added four decoder blocks and one output block. As shown in Fig. 3, each decoder block consists of a fully connected layer with a rectified linear unit (ReLU) activation function, an MSFF module with a LeakyReLU activation function, a batch normalization (BN) layer and an upsampling layer. Notably, we employ skip connections to combine the output of each hybrid attention module with the output of the BN layer in each decoder block. These skip connections allow the network to merge features from deeper layers with those from shallower layers, enhancing the network's generalizability and mitigating overfitting during training. Moreover, we used a Conv1-D layer with a sigmoid activation function to reshape the feature maps in the output block.
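A minimal sketch of one such decoder block follows; the filter counts and kernel sizes are illustrative assumptions, and the skip tensor (output of the matching hybrid attention block) must share the shape of the BN output.

```python
from tensorflow.keras import layers

def decoder_block(x, skip, filters=16, kernel_sizes=(3, 7, 11)):
    """Dense + multiscale Conv1D fusion + BN, a skip connection from the
    matching hybrid attention block, then upsampling."""
    x = layers.Dense(filters, activation="relu")(x)
    branches = [layers.Conv1D(filters, k, padding="same")(x) for k in kernel_sizes]
    x = layers.LeakyReLU()(layers.Concatenate(axis=-1)(branches))
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, skip])        # skip and x must have matching shapes
    return layers.UpSampling1D(size=2)(x)
```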

Figure 3. An illustration of the proposed multitask deep neural network workflow. The network contains four encoders/decoders, two output blocks and four hybrid attention blocks. 3-C time-series act as the input of the multitask network, and the outputs are the probability of signal/noise waveforms and the arrival times of the P and S waves, respectively.

3.2 Training details

As we can see from Fig. 1, our proposed DL network has three main blocks. The feature extractor, which is followed by a classifier, uses four MSFF modules to capture nonlinear representations in the feature space. The classifier uses the feature representations to discriminate between earthquake signal and noise. During the training process, we use the gradient information of a hybrid loss function (L) in eq. (1) to determine the direction for updating model parameters, thereby gradually steering the model toward the optimal solution.

$L = L_{\mathrm{bce}} + \lambda L_{\mathrm{kld}},$  (1)

where $L_{\mathrm{bce}}$ represents the binary cross-entropy (BCE) loss function, which is commonly used in binary classification problems to measure the discrepancy between the probability distribution predicted by the model and the true distribution. $L_{\mathrm{kld}}$ stands for the Kullback–Leibler divergence (KLD) loss, which compares two discrete probability distributions and can alleviate the vanishing-gradient issue that arises with cross-entropy loss functions as the number of network layers increases. $\lambda \in [0,1]$ is a hyperparameter that balances $L_{\mathrm{bce}}$ and $L_{\mathrm{kld}}$. In our work, feature extraction and classification in the feature domain are jointly performed by setting the balance factor to 0.1, following the numerical setting of Xie et al. (2016). The KLD loss $L_{\mathrm{kld}}$ in eq. (1) is defined as:

$\mathrm{KL}(P||Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)},$  (2)

where $P$ and $Q$ represent two probability distributions, and $i$ indexes the elements within these distributions. A smaller value of $\mathrm{KL}(P||Q)$ indicates a closer similarity between the two probability distributions. The BCE loss $L_{\mathrm{bce}}$ in eq. (1) is defined as:

$L_{\mathrm{bce}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],$  (3)

where $y$ represents the true labels, $\hat{y}$ stands for the probability values predicted by the model and $N$ denotes the number of samples. The BCE loss is primarily used for binary classification problems, where the label $y$ is either 0 or 1 and $\hat{y}$ is the predicted probability of the positive class.
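A minimal Keras sketch of the hybrid loss, assuming the weighted-sum form of eq. (1) with balance factor $\lambda = 0.1$:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
kld = tf.keras.losses.KLDivergence()

def hybrid_loss(y_true, y_pred, lam=0.1):
    # Eq. (1): BCE plus lambda-weighted KL divergence between the
    # one-hot labels and the softmax output.
    return bce(y_true, y_pred) + lam * kld(y_true, y_pred)

# Usage with Keras: model.compile(optimizer="adam", loss=hybrid_loss)
```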

3.3 Evaluation metrics

We objectively evaluate the classification performance of the proposed DL architecture with the accuracy (Acc), precision, recall and F1-score, which are defined as follows:

$\mathrm{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}},$  (4)

where TP, TN, FP and FN are true positive, true negative, false positive and false negative, respectively.

$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$  (5)

where Precision represents the proportion of correctly identified positive instances out of all predicted positive instances. Recall measures the proportion of samples that are correctly predicted to be positive among all samples that are actually positive. The F1-score is the harmonic mean of precision and recall, providing a single value to evaluate the model’s performance by balancing both metrics.
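For concreteness, a small Python sketch computing eqs (4) and (5) from confusion-matrix counts; the example counts are hypothetical, only loosely in the range of one session in Fig. 7.

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluation metrics of eqs (4) and (5) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return acc, precision, recall, f1

# Hypothetical counts for one testing session.
print(classification_metrics(tp=992, tn=996, fp=4, fn=8))
```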

To ensure that we obtain a relatively stable and convincing classification result for the 3-C time-series, we trained the proposed network five times and calculated the average accuracy of the five predictions as the final output. As shown in eq. (6):

$\mathrm{Acc}_{\mathrm{avg}} = \frac{1}{5} \sum_{i=1}^{5} \mathrm{Acc}_i,$  (6)

where $\mathrm{Acc}_i$ denotes the accuracy of the $i$-th training run.

Furthermore, we introduce kurtosis, a statistical measure that describes the sharpness or peakedness of the input sample distribution. Typically, a high kurtosis value suggests a sharp peak, while a low one indicates less dramatic fluctuations in the input time-series. Given a 3-C time-series, we calculate the kurtosis value of each channel in the "Numerical experiment results" section.
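A minimal scipy sketch of the per-channel kurtosis computation; we assume scipy's default (Fisher) convention, which scores a Gaussian near 0 and appears consistent with the near-zero noise kurtoses reported in Fig. 9.

```python
import numpy as np
from scipy.stats import kurtosis

def channel_kurtosis(waveform_3c):
    """Kurtosis of each component of a (samples, 3) array. Under the Fisher
    convention a Gaussian scores 0, so impulsive earthquake onsets stand
    out as strongly positive values."""
    return [kurtosis(waveform_3c[:, i]) for i in range(3)]

noise = np.random.randn(6000, 3)            # stationary noise: kurtosis near 0
print(np.mean(channel_kurtosis(noise)))
```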

4 DATA GENERATION

In this paper, we utilize two representative data sets to demonstrate the network's performance across data from different regions: TXED is used to test the earthquake discrimination and arrival picking tasks, and STEAD is used in the transfer learning experiments. TXED is a high-quality data set with 519 689 sixty-second 3-C seismograms, consisting of 312 231 earthquake 3-C waveforms and 207 458 noise 3-C waveforms. It is worth noting that the signal waveforms of TXED all go through a robust manual picking process and are of relatively high quality. We randomly select 10 000 signal waveforms and 10 000 noise waveforms from TXED as the training data. The signal waveforms were pre-processed with several common stages, for example detrending, instrument response removal, resampling to 100 Hz, band-pass filtering and interpolation of data gaps. We use band-pass filtering between 1 and 45 Hz to remove high- and low-frequency noise, for example ambient noise, random noise, heavy-machinery noise and vehicle noise. Each 3-C waveform has a fixed window size of 6000 samples, and most of the earthquake signals were segmented such that the P-wave arrival falls within 0–10 s of the window start. Fig. 4 shows several representative signal waveforms: Figs 4(a) and (b) display microearthquake waveforms with different signal-to-noise ratios (SNRs; Chen & Fomel 2015), Fig. 4(c) shows a very minor earthquake with Ml = 2.5 whose waveform has a relatively low SNR, and Fig. 4(d) is a moderate earthquake with Ml = 5.4. In Fig. 5, several representative noise waveforms are shown: random, unknown and vehicle noise. TXED includes various types of waveforms, covering the range of cases the classifier may encounter. STEAD is a global earthquake data set containing 1.2 million labelled earthquake records from 2613 stations worldwide. Both TXED and STEAD contain enough labelled data to train a robust network model. However, transferring a trained model between the two data sets remains a challenge because of differences in regions and acquisition periods and variations in ambient noise, amplitude and frequency.
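A minimal ObsPy sketch of this pre-processing chain; the file name is a placeholder, and response removal (commented out) requires station metadata not shown here.

```python
from obspy import read

st = read("waveform.mseed")          # placeholder 3-C record
st.detrend("linear")
# st.remove_response(inventory=inv, output="VEL")  # 'inv': station metadata
st.resample(100.0)                   # resample to 100 Hz
st.filter("bandpass", freqmin=1.0, freqmax=45.0)
# Trim to a 60-s (6000-sample) window starting at the record onset.
st.trim(st[0].stats.starttime, st[0].stats.starttime + 60)
```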

Figure 4. Example of signal waveforms with different magnitudes and SNR in TXED. (a) The waveform of a microearthquake with high SNR. (b) A microearthquake waveform with relatively low SNR. (c) A moderate earthquake waveform with relatively high SNR. (d) The signal waveform of one of the largest earthquakes in TXED.

Figure 5. Example of different types of noise waveforms in TXED. (a) Regular noise with unknown cause. (b) and (c) Random noise with low and high amplitudes and different frequencies, respectively. (d) Ambient noise.

5 RESULTS

This section begins with a hyperparameter setup. Then, we utilize two data sets from TXED to demonstrate the effectiveness of the proposed classification network. Finally, we present the results of P- and S-wave arrival picking using the multitask deep neural network.

5.1 Hyperparameter setup

The goal of the proposed classification network is to classify the signal and noise waveforms using the 3-C time-series as input. The network's output consists of two columns: the first column represents the probability of the signal waveform and the second the probability of the noise waveform. We introduce a threshold strategy in eq. (7) to evaluate the performance: when the probability $y_{\mathrm{ph}}$ of the input signal or noise is greater than 0.5, the value is assigned to 1; otherwise, the value is assigned to 0.

$y = \begin{cases} 1, & y_{\mathrm{ph}} > 0.5 \\ 0, & \text{otherwise} \end{cases}$  (7)

To ensure the reproducibility of the experimental results, we set the same hyperparameters (shown in Table 2) for each numerical experiment. We use 50 epochs for each training run of the classification tasks. The Adam optimizer (Kingma & Ba 2014) is applied to train the network, employing a learning-rate step-down strategy that enables the network to reach a more accurate optimal solution. The linear activation function is used in the first MSFF block to avoid the partial loss of neurons with negative values, and LeakyReLU (Chen et al. 2019) is employed for nonlinear feature representations. Additionally, to prevent overfitting, we employed an early stopping strategy that monitors the loss on the validation set: if the validation loss does not decrease significantly over five consecutive epochs, we halt the training.
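A minimal Keras sketch of this training configuration; the stand-in model, the ReduceLROnPlateau settings and restore_best_weights are assumptions made to keep the example self-contained, and the data arrays (x_train, etc.) come from the split described in Section 5.2.

```python
import tensorflow as tf

# Stand-in model; the actual network is the one detailed in Figs 1 and 2.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(6000, 3)),
    tf.keras.layers.Conv1D(8, 7, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy", metrics=["accuracy"])  # or hybrid_loss

callbacks = [
    # Halt training once the validation loss stalls for 5 consecutive epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Step the learning rate down towards 2e-4 when progress plateaus.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=2, min_lr=2e-4),
]
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=1024, epochs=50, callbacks=callbacks)
```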

Table 2. Hyperparameters of the proposed DL network.

| Hyperparameter | Specification |
| --- | --- |
| Input size | (None, 6000, 3) |
| Output size | (None, 2) |
| Parameters | 125 394 |
| Optimizer | Adam |
| Loss | KLD & binary cross-entropy |
| Activation function | Linear & LeakyReLU |
| Batch size | 1024 |
| Epochs | 50 |
| Learning rate | [2e-04, 1e-03] |

5.2 Numerical experiment results

After obtaining training samples randomly selected from TXED, we divided them into training, testing and validation sets with proportions of 0.85:0.1:0.05. To ensure we select the same data in each training session, we set a fixed random seed when selecting the training data; we then set a different random seed to choose a new data set for model inference. Similarly, we introduced an extensive testing data set extracted from a specific part of TXED to further evaluate the generalizability of the classification network. We performed five training sessions for each testing data set to make the results reliable and calculated the corresponding confusion matrix using the average values. Additionally, we compared the evaluation metrics of our proposed method with a state-of-the-art approach (Jiang et al. 2023) on the above data sets.
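A minimal numpy sketch of such a seed-controlled split; the seed value and placeholder arrays are arbitrary.

```python
import numpy as np

def seeded_split(x, y, seed=42, ratios=(0.85, 0.10, 0.05)):
    """Reproducible train/test/validation split driven by a fixed seed."""
    rng = np.random.default_rng(seed)       # fixed seed -> identical split
    idx = rng.permutation(len(x))
    n_train = int(ratios[0] * len(x))
    n_test = n_train + int(ratios[1] * len(x))
    train, test, val = idx[:n_train], idx[n_train:n_test], idx[n_test:]
    return (x[train], y[train]), (x[test], y[test]), (x[val], y[val])

x = np.random.randn(200, 6000, 3).astype("float32")  # placeholder waveforms
y = np.random.randint(0, 2, 200)                     # placeholder labels
(x_train, y_train), (x_test, y_test), (x_val, y_val) = seeded_split(x, y)
```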

Fig. 6 shows the accuracy and loss curves of one training session of the proposed DL network. From epoch 0 to epoch 10, the loss curve decreases rapidly and the Acc curve rises to around 98 per cent; after epoch 10, both curves converge gradually. We compare the classification outcomes of the proposed method with the benchmark method on the two data sets mentioned earlier, as shown in Table 3. All evaluation indicators show that our DL framework achieves more than 99 per cent classification accuracy both on randomly selected data and on the data set extracted from a fixed part of TXED, exceeding the benchmark method in both experiments. Moreover, the proposed method exhibits lower standard deviations in Acc, precision and F1-score across the multiple tests, indicating that it provides more stable and generalized predictions when the trained model is applied to unseen data.

Figure 6. Acc and loss curves of the training process. The network is trained with 10 000 signal waveforms and 10 000 noise waveforms randomly selected from the TXED. The ratio of training, testing and validation samples is 0.8:0.15:0.05.

Table 3. Classification result comparisons of different data sets using the proposed and benchmark methods. All values are in per cent.

| Method | Data set | Accuracy (mean, σ) | Precision (mean, σ) | Recall (mean, σ) | F1-score (mean, σ) |
| --- | --- | --- | --- | --- | --- |
| Proposed method | Part of TXED data | 99.78, 0.27 | 99.93, 0.03 | 99.90, 0.04 | 99.91, 0.03 |
| Proposed method | Randomly selected data | 99.83, 0.05 | 99.81, 0.03 | 99.85, 0.03 | 99.83, 0.05 |
| Benchmark method^a | Part of TXED data | 98.03, 1.21 | 96.50, 4.25 | 99.72, 0.02 | 98.07, 1.10 |
| Benchmark method^a | Randomly selected data | 98.42, 0.50 | 97.00, 1.76 | 99.94, 0.00 | 98.44, 0.47 |

a. Benchmark method of Jiang et al. (2023).

Fig. 7 shows the confusion matrices of five training sessions on the randomly selected data from TXED, which contain 10 000 signal and 10 000 noise waveforms. The number of wrong predictions on signal waveforms ranges from 3 to 14 (false negatives), and on noise waveforms from 0 to 7 (false positives), indicating that the proposed DL framework achieves high precision on TXED discrimination. We calculated the distributions of these predictions (shown in Fig. 7): the probabilities of true negatives (TN) and true positives (TP) exceed 0.98 in the model's predictions, demonstrating that the proposed method yields promising outcomes for earthquake classification. In Figs 8 and 9, we visualize eight incorrect predictions each for signal and noise waveforms (drawn from Fig. 7). For comparison, we calculated the kurtosis coefficient of each component of every 3-C time-series predicted incorrectly by the trained model; kurtosis characterizes the peakedness of a curve about its mean value. Typically, when the kurtosis value is greater than 3, the component is more likely to represent a signal; conversely, if the kurtosis value is less than 3, the component is more likely noise. By calculating the kurtosis value, we can assess whether a waveform mispredicted by the network model is trustworthy. Figs 8(a), (c), (g) and (h) show relatively low kurtosis coefficients, indicating that they suffer from noise interference; as a result, it is challenging to determine noise or signal with the trained model. The errors in Figs 8(c) and (h) are caused by label errors, whereas the incorrect predictions in Figs 8(b), (d), (e) and (f) are attributed to the limitations of the network model. For the wrong predictions on the noise 3-C time-series shown in Fig. 9, all mean kurtoses are below 3, indicating that our proposed network fails to predict these waveforms correctly. In summary, the proposed method demonstrates encouraging performance on the TXED data.

Figure 7. Confusion matrices of the discrimination results on the TXED. (a)–(e) The confusion matrices corresponding to five training sessions.

Figure 8. Visualization of incorrect signal predictions (signal predicted wrongly as noise). Note that we calculated each component's kurtosis in the 3-C waveforms and displayed them in the subfigures. From (a)–(h), the mean kurtoses of the 3-C time-series are 3.359, 28.481, 0.086, 8.511, 14.485, 26.832, 4.922 and 0.412, respectively.

Figure 9. Visualization of incorrect noise predictions (noise predicted wrongly as signal). Note that we calculated each component's kurtosis in the 3-C waveforms and displayed them in the subfigures. From (a)–(h), the mean kurtoses of the 3-C time-series are 0.072, 0.131, 0.253, 0.186, 0.208, 0.107, 0.308 and 0.093, respectively.

5.3 Multitask neural network for seismic arrival picking and classification

The tasks of event discrimination and arrival picking in earthquake data are closely interdependent: a more accurate classification model typically enhances the efficiency of P- and S-wave arrival picking. Since classification models are generally easier to develop than arrival-picking models, we designed a multitask neural network based on the classification network shown in Fig. 1 for event classification and first-arrival picking, utilizing the modified network shown in Fig. 3 to perform the P- and S-wave arrival picking. The proposed network is trained using multistation 3-C signal waveforms with a 60-s time window. All signal waveforms are band-pass filtered from 1 to 45 Hz, and each training sample is normalized by subtracting its mean. After obtaining the arrival times of the P and S waves, we use a Gaussian function to generate the labels for the P- and S-wave arrivals. We then fed 100 000 labelled training samples into the proposed network, allocating 85 per cent to the training set, 10 per cent to the testing set and 5 per cent to the validation set. The network is trained for 200 epochs using a BCE loss function, with an early stopping strategy that halts the training if the validation loss does not decrease for five consecutive epochs, and the Adam optimizer with a learning rate of 0.001. Once trained, the model is applied to predict the arrivals of the P and S waves simultaneously on the testing set. As illustrated in Fig. 10, the mean absolute error (MAE) and standard deviation ($\sigma$) for P-wave picking are 0.19 and 0.53 s, respectively; the corresponding MAE and $\sigma$ for S-wave picking are 0.23 and 0.50 s. Fig. 10 shows that the majority of prediction differences fall within ±1 s (i.e. 35 samples) for both P- and S-wave arrivals. Further evaluation demonstrates that the proposed method predicts P and S arrivals with accuracies of 95.07 and 94.54 per cent, respectively. We visualize representative picking results for the training and testing sets in Figs 11 and 12, where we take the index of the maximum of the predicted P- and S-wave probabilities as the arrival time. The dashed lines of different colours in the 3-C waveforms represent the predicted P- and S-wave arrivals, while the solid lines indicate the actual arrivals. The visualizations show that the trained model predicts the P and S waves with negligible error using the multistation waveforms.
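A minimal numpy sketch of the Gaussian label generation; the width sigma (in samples) and the example arrival indices are illustrative assumptions.

```python
import numpy as np

def gaussian_label(arrival_idx, n_samples=6000, sigma=20):
    """Gaussian-shaped picking target centred on a phase arrival
    (sample index); sigma in samples is an illustrative assumption."""
    t = np.arange(n_samples)
    return np.exp(-0.5 * ((t - arrival_idx) / sigma) ** 2)

p_label = gaussian_label(850)     # hypothetical P arrival at 8.5 s (100 Hz)
s_label = gaussian_label(1420)    # hypothetical S arrival at 14.2 s
# At inference, the arrival is recovered as np.argmax of the predicted curve.
```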

Figure 10. Error distributions for the P- and S-wave arrival picking using the testing set. (a) The error distribution of P-wave arrival picking. (b) The error distribution of S-wave arrival picking.

Figure 11. Visualization of the P- and S-wave arrival picking results on the training set. (a)–(d) Randomly selected samples from the predictions of the training set. Note that from top to bottom, each subfigure shows the 3-C waveform, the predicted probabilities of the P and S waves, and the labelled arrivals of the P and S waves, respectively.

Figure 12. Visualization of the P- and S-wave arrival picking results on the testing set. (a)–(d) Randomly selected samples from the predictions of the testing set. Note that from top to bottom, each subfigure shows the 3-C waveform, the predicted probabilities of the P and S waves, and the labelled arrivals of the P and S waves, respectively.

6 DISCUSSION

This section provides a deeper evaluation of the proposed DL model across various experiments. First, we investigate the impact of the number of MSFF modules and the composition of the training data through control experiments. Then, we explore the stability of the network when trained with fewer samples and unbalanced signal and noise waveforms. Finally, STEAD is introduced to gain further insight into the transferability of the classification network.

6.1 The influence of the number of feature fusion modules on network performance

To guide the choice of the number of MSFF modules, we compare in Fig. 13 the loss and accuracy curves of training runs with different numbers of MSFF modules, each trained for 50 epochs. As the number of epochs increases, the loss curves of all training sessions quickly converge and the corresponding training Acc curves continue to increase; both change rapidly between epochs 0 and 8 and then flatten out. We zoom in on epochs 5 to 20 for a clearer comparison. Comparing the zoomed sections of each subfigure, there is no significant difference in the convergence speed of the loss curves for different numbers of MSFF modules; we ultimately chose four MSFF modules as a good balance between convergence speed and accuracy.

Figure 13. A comparison of classification performance using different numbers of MSFF modules. A zoomed-in section is provided in each subfigure for better comparison.

6.2 Training with unbalanced data

We have demonstrated the effectiveness of the proposed method by training with balanced data sets. Here, we perform several experiments on unbalanced data sets to explore the robustness of the network. Additionally, all hyperparameters are consistent with those used for the aforementioned classification experiments shown in Table 2.

The first experiment was conducted with 10 000 signal waveforms and fewer noise waveforms (see Fig. 14a), with signal-to-noise waveform proportions of 1:1, 2:1, 3:1, 4:1 and 5:1. The Acc and loss curves of the five training processes converge quickly within the first 20 epochs. However, when the signal-to-noise waveform ratio is 5:1, the Acc curve rises from approximately 0.6 to 0.9 and the loss curve drops from 0.07 to 0.005 over roughly 15 epochs; the other ratios converge more quickly. The descent rate of the loss curve matches the ascent rate of the accuracy curve, indicating the model's effectiveness during training. Comparing convergence times, the 1:1 ratio of signal and noise waveforms converges fastest. Figs 15(a)–(d) show the confusion matrices for different ratios of signal and noise waveforms in the validation sets. The proposed network achieves an accuracy above 97 per cent even when the data set is very unbalanced.

Figure 14. Classification performance comparison with unbalanced training samples. (a) The impact on Acc and loss curves using different proportions of signal and noise waveforms (num. of signal waveforms $>$ num. of noise waveforms). (b) The impact on Acc and loss curves using different proportions of signal and noise waveforms (num. of signal waveforms $<$ num. of noise waveforms).

Figure 15. Confusion matrices of the discrimination results (tested in Fig. 14) with unbalanced training samples. (a)–(d) Confusion matrices for signal-to-noise waveform ratios of 1:5, 2:5, 3:5 and 4:5 in the training samples, respectively. (e)–(h) Confusion matrices for signal-to-noise waveform ratios of 5:1, 5:2, 5:3 and 5:4 in the training samples, respectively.

Fig. 14(b) shows the Acc and loss curves using 10 000 noise waveforms and different numbers of signal waveforms as training samples, with the same set of ratios as in the previous experiment. When the ratio is 1:1, the Acc and loss curves converge fastest. However, the network obtains similar evaluation curves at a proportion of 1:2, which differs from the previous experiment. At a ratio of 1:5, the Acc and loss curves are worse than in the other runs and take the longest to converge. Across these experiments, only a few wrong predictions appear (see Figs 15e–h), demonstrating that the proposed network performs well on such challenging tasks.

We further explore the network's robustness by training it with fewer samples and with unbalanced data. The results are shown in Figs 16(a) and (b). When the numbers of signal and noise waveforms are 4000 and 5000, respectively, the Acc and loss curves are almost identical, indicating that the proposed method can achieve strong performance on balanced data with 8000 training samples. However, the network's performance degrades as the number of training samples drops from 6000 to 2000 in the balanced setting. Fig. 16(b) demonstrates that the trained model still achieves an accuracy of 95.43 per cent at a signal-to-noise waveform ratio of 5:1, despite the small and extremely unbalanced training set.

Figure 16. Comparisons of Acc and loss curves using different training samples. (a) The impact on the Acc and loss curves of different numbers of training data, where ts stands for training samples. (b) With fewer waveforms in the training set, the impact on the Acc and loss curves of different proportions of signal and noise waveforms (number of signal waveforms > number of noise waveforms).

6.3 Transfer learning to the STEAD

We have trained the proposed DL model on the TXED data set, achieving high-precision classification. For transfer learning, we froze all layers of the pre-trained model except the last two and then trained the remaining fully connected layers on the STEAD. The hyperparameters in the training process are set according to Table 2. To compare waveform classification performance before and after transfer learning, we first applied the TXED pre-trained model directly to the STEAD signal detection task: the accuracy, precision, recall and F1-score are 90.63 per cent, 98.14 per cent, 90.61 per cent and 94.20 per cent, respectively. After transfer learning, the average classification metrics on the STEAD improve to 95.03 per cent, 96.53 per cent, 93.70 per cent and 95.14 per cent, respectively. These results indicate that, with transfer learning, the proposed network achieves higher signal and noise classification accuracy, demonstrating that the DL architecture generalizes better once the transfer learning strategy is introduced.
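
The freezing strategy itself is straightforward to express in code. Below is a minimal sketch assuming TensorFlow/Keras; the model file name, the `stead_x`/`stead_y` arrays and the optimizer settings are illustrative placeholders, not part of the released code.

```python
import tensorflow as tf

# Load the model pre-trained on TXED (file name is illustrative).
model = tf.keras.models.load_model('txed_classifier.h5')

# Freeze every layer except the last two, keeping the TXED feature
# extractor fixed, as described above.
for layer in model.layers[:-2]:
    layer.trainable = False

# Fine-tune the remaining layers on STEAD waveforms and labels.
# stead_x and stead_y are assumed NumPy arrays; the hyperparameter
# values here are placeholders standing in for Table 2.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(stead_x, stead_y, validation_split=0.1, epochs=50, batch_size=64)
```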

6.4 Weight matrix visualization

Here, we present visualizations of feature maps from different Conv1-D and Add layers. Visualizing these feature maps gives a deeper view of the feature information that a specific filter extracts from the input 3-C waveform data. In Fig. 17, each subfigure shows 14 feature maps corresponding to the Conv1-D layers in the MSFF blocks and the outputs of the hybrid attention modules of the proposed DL architecture. In the first MSFF block, the input 3-C time-series is transformed into earthquake-like waveforms by the three Conv1-D layers with different kernel sizes. The feature maps in the second MSFF module capture the waveform features in more detail. The feature maps in the third MSFF block and the second hybrid attention module contain more pronounced first-arrival waveform information, which helps the classifier achieve better waveform detection performance. Figs 17(a) and (b) show that the proposed DL architecture can capture decisive waveform characteristics even from a noisy input 3-C record, while Figs 17(c) and (d) demonstrate that the network easily extracts waveform features from a record with a relatively high SNR.
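
Such feature-map visualizations can be generated with a probe model that exposes intermediate outputs. The sketch below assumes a trained Keras `model` in scope and uses illustrative layer names (the real names can be listed with `model.summary()`); it is not the plotting code used for Fig. 17.

```python
import matplotlib.pyplot as plt
import tensorflow as tf

# Build a probe model that returns the outputs of selected layers.
layer_names = ['conv1d_2', 'add_1']          # illustrative layer names
probe = tf.keras.Model(inputs=model.input,
                       outputs=[model.get_layer(n).output for n in layer_names])

# waveform is one 3-C record shaped (n_samples, 3); add a batch axis.
feature_maps = probe.predict(waveform[None, ...])

for fmap, name in zip(feature_maps, layer_names):
    plt.figure()
    plt.title(name)
    # Plot a few channels of the feature map against sample index.
    for ch in range(min(4, fmap.shape[-1])):
        plt.plot(fmap[0, :, ch])
plt.show()
```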

Figure 17. Feature maps of different Conv1-D and Add layers for randomly selected waveforms. (a) and (b) show the feature maps of the 574th and 5336th noise waveforms in the testing set. (c) and (d) display the feature maps of the 742nd and 2867th signal waveforms in the testing set. Note that the length of the feature maps of each layer is plotted to the left of each subfigure.

6.5 The role of hybrid attention module

The hybrid attention module, which combines spatial and channel attention to enhance the feature extraction capability of the convolutional neural network, is employed by the proposed method to improve the classification and arrival-picking accuracy for 3-C earthquake data. To quantify its role in the earthquake classification task, we conducted an ablation experiment. Table 4 presents the classification results with and without the attention blocks. The results indicate that incorporating the hybrid attention modules improves the accuracy, precision and F1-score of the proposed method, owing to the enhanced feature extraction capability of the attention mechanism.
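
The (mean, σ) entries in Table 4 come from repeating each training run and aggregating the metrics. A minimal sketch of that bookkeeping, assuming scikit-learn, is given below; the function and variable names are ours, not from the released code.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

def summarize_runs(y_true_runs, y_pred_runs):
    """Report mean and standard deviation of the four metrics over
    repeated training runs (inputs are lists of 0/1 label arrays)."""
    metrics = {'Accuracy': accuracy_score, 'Precision': precision_score,
               'Recall': recall_score, 'F1-score': f1_score}
    for name, fn in metrics.items():
        vals = [fn(t, p) for t, p in zip(y_true_runs, y_pred_runs)]
        print(f'{name}: mean={100 * np.mean(vals):.2f} per cent, '
              f'sigma={100 * np.std(vals):.2f} per cent')
```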

Table 4. Ablation experiment on the role of the hybrid attention module. Metrics are reported as (mean, σ) in per cent.

                    Data set                  Accuracy       Precision      Recall         F1-score
With attention      Randomly selected data    99.83, 0.05    99.81, 0.03    99.85, 0.03    99.83, 0.05
Without attention   Randomly selected data    99.01, 0.01    98.17, 0.03    99.89, 0.00    99.23, 0.10

7 CONCLUSION

In this paper, we proposed a generalized neural network for automatic earthquake detection. The proposed method contains a feature extractor, a classifier and two hybrid attention modules. The input of the network is a 3-C time-series, and the output consists of probabilities corresponding to signal and noise. Additionally, by incorporating decoder layers into the classification network, we built a multitask deep learning framework for P- and S-wave arrival picking. We explored how the number of MSFF modules affects the network's training performance and settled on four MSFF modules as the feature extractor. The classifier accurately separates the nonlinear features extracted by the feature extractor in the feature space. The hybrid attention mechanism effectively guides the network to focus on waveform features of the time-series, with only a slight increase in parameters. Extensive numerical experiments on TXED demonstrate that our network achieves classification accuracy exceeding 99 per cent for both signal and noise with balanced training samples. Furthermore, even with a limited number of highly unbalanced training samples, it still discriminates earthquake signal and noise waveforms well. We successfully transferred the pre-trained model to the STEAD by employing a transfer learning strategy, achieving commendable results. In summary, our proposed DL framework leverages multiscale feature extraction and hybrid attention mechanisms to achieve high-precision classification between earthquake signals and noise, exhibiting strong generalization across diverse data sets.

ACKNOWLEDGMENTS

We would like to express our gratitude to four reviewers and the editor for their constructive suggestions, which have greatly helped to improve the quality of this paper. We also extend our thanks to the providers of the open-source data. This research was supported by the National Natural Science Foundation of China (grant nos. 42174159 and 41904110), the Natural Science Foundation of Hubei Province (grant no. 2021CFB498), and the Open Fund of Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education (grant no. KPI2021-01).

DATA AVAILABILITY

The codes related to this paper are available via https://github.com/cuiyang512/Multi-task-EQDetection. The TXED data set can be downloaded from https://github.com/chenyk1990/txed. The DEMO scripts to extract signal and noise waveforms can be found at https://github.com/chenyk1990/txed/tree/main/demos. The STEAD data set is available from https://github.com/smousavi05/STEAD.

REFERENCES

Beyreuther M., Wassermann J., 2008. Continuous earthquake detection and classification using discrete hidden Markov models, Geophys. J. Int., 175(3), 1055–1066.
Chen Y., Fomel S., 2015. Random noise attenuation using local signal-and-noise orthogonalization, Geophysics, 80(6), WD1–WD9.
Chen Y., Mai Y., Xiao J., Zhang L., 2019. Improving the antinoise ability of DNNs via a bio-inspired noise adaptive activation function rand softplus, Neural Comput., 31(6), 1215–1233.
Chen Y. et al., 2024. TXED: the Texas Earthquake Dataset for AI, Seismol. Res. Lett., 95(3), 2013–2022.
Cheng Y., Li Y., Zhang C., 2019. First-break picking for microseismic data based on cascading use of shearlet and Stockwell transforms, Geophys. Prospect., 67(1), 85–96.
Chin T.-L., Huang C.-Y., Shen S.-H., Tsai Y.-C., Hu Y.H., Wu Y.-M., 2019. Learn to detect: improving the accuracy of earthquake detection, IEEE Trans. Geosci. Remote Sens., 57(11), 8867–8878.
Chung J., Gulcehre C., Cho K., Bengio Y., 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling, preprint (arXiv:1412.3555).
Duan Y., Shen Y., Canbulat I., Luo X., Si G., 2021. Classification of clustered microseismic events in a coal mine using machine learning, J. Rock Mech. Geotech. Eng., 13(6), 1256–1273.
Giacco F., Esposito A., Scarpetta S., Giudicepietro F., Marinaro M., 2009. Support vector machines and MLP for automatic classification of seismic signals at Stromboli volcano, in Proc. 19th Italian Workshop on Neural Nets, pp. 116–123, eds Apolloni B., Bassis S., Morabito C.F., IOS Press.
Griffin D., Lim J., 1984. Signal estimation from modified short-time Fourier transform, IEEE Trans. Acoust. Speech Signal Process., 32(2), 236–243.
Gulia L., Wiemer S., 2019. Real-time discrimination of earthquake foreshocks and aftershocks, Nature, 574(7777), 193–199.
Huang G., Liu Z., Van Der Maaten L., Weinberger K.Q., 2017. Densely connected convolutional networks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Honolulu, HI, USA, pp. 4700–4708.
Jiang J., Stankovic V., Stankovic L., Parastatidis E., Pytharouli S., 2023. Microseismic event classification with time-, frequency-, and wavelet-domain convolutional neural networks, IEEE Trans. Geosci. Remote Sens., 61, 1–14.
Kim S., Lee K., You K., 2020. Seismic discrimination between earthquakes and explosions using support vector machine, Sensors, 20(7), 1879, doi:10.3390/s20071879.
Kingma D.P., Ba J., 2014. Adam: a method for stochastic optimization, preprint (arXiv:1412.6980).
Kong Q., Allen R.M., Schreier L., Kwon Y.-W., 2016. MyShake: a smartphone seismic network for earthquake early warning and beyond, Sci. Adv., 2(2), e1501055, doi:10.1126/sciadv.1501055.
Ku B., Min J., Ahn J.-K., Lee J., Ko H., 2020. Earthquake event classification using multitasking deep learning, IEEE Geosci. Remote Sens. Lett., 18(7), 1149–1153.
Linville L., Pankow K., Draelos T., 2019. Deep learning models augment analyst decisions for event discrimination, Geophys. Res. Lett., 46(7), 3643–3651.
Liu G., Li C., Rao Y., Chen X., 2020. Oriented pre-stack inverse Q filtering for resolution enhancements of seismic data, Geophys. J. Int., 223(1), 488–501.
Liu L., Song W., Zeng C., Yang X., 2021. Microseismic event detection and classification based on convolutional neural network, J. Appl. Geophys., 192, 104 380, doi:10.1016/j.jappgeo.2021.104380.
Ma C. et al., 2023. Fine classification method for massive microseismic signals based on short-time Fourier transform and deep learning, Remote Sens., 15(2), 502, doi:10.3390/rs15020502.
McBrearty I.W., Beroza G.C., 2022. Earthquake location and magnitude estimation with graph neural networks, in 2022 IEEE International Conference on Image Processing (ICIP), IEEE, pp. 3858–3862.
Mousavi S.M., Sheng Y., Zhu W., Beroza G.C., 2019a. STanford EArthquake Dataset (STEAD): a global data set of seismic signals for AI, IEEE Access, 7, 179 464–179 476.
Mousavi S.M., Zhu W., Ellsworth W., Beroza G., 2019b. Unsupervised clustering of seismic signals using deep convolutional autoencoders, IEEE Geosci. Remote Sens. Lett., 16(11), 1693–1697.
Mousavi S.M., Zhu W., Sheng Y., Beroza G.C., 2019c. CRED: a deep residual network of convolutional and recurrent units for earthquake signal detection, Sci. Rep., 9(1), 10267, doi:10.1038/s41598-019-45748-1.
Mousavi S.M., Ellsworth W.L., Zhu W., Chuang L.Y., Beroza G.C., 2020. Earthquake transformer—an attentive deep-learning model for simultaneous earthquake detection and phase picking, Nat. Commun., 11(1), 3952, doi:10.1038/s41467-020-17591-w.
Provost F., Hibert C., Malet J.-P., 2017. Automatic classification of endogenous landslide seismicity using the random forest supervised classifier, Geophys. Res. Lett., 44(1), 113–120.
Rioul O., Flandrin P., 1992. Time-scale energy distributions: a general class extending wavelet transforms, IEEE Trans. Signal Process., 40(7), 1746–1757.
Saad O.M., Hafez A.G., Soliman M.S., 2020. Deep learning approach for earthquake parameters classification in earthquake early warning system, IEEE Geosci. Remote Sens. Lett., 18(7), 1293–1297.
Saad O.M., Huang G., Chen Y., Savvaidis A., Fomel S., Pham N., Chen Y., 2021. SCALODEEP: a highly generalized deep learning framework for real-time earthquake detection, J. Geophys. Res.: Solid Earth, 126(4), e2020JB021473, doi:10.1029/2020JB021473.
Saad O.M., Chen Y., Savvaidis A., Fomel S., Chen Y., 2022a. Real-time earthquake detection and magnitude estimation using vision transformer, J. Geophys. Res.: Solid Earth, 127(5), e2021JB023657, doi:10.1029/2021JB023657.
Saad O.M., Soliman M.S., Chen Y., Amin A.A., Abdelhafiez H., 2022b. Discriminating earthquakes from quarry blasts using capsule neural network, IEEE Geosci. Remote Sens. Lett., 19, 1–5.
Saad O.M. et al., 2023a. Earthquake forecasting using big data and artificial intelligence: a 30-week real-time case study in China, Bull. seism. Soc. Am., 113(6), 2461–2478.
Saad O.M. et al., 2023b. EQCCT: a production-ready earthquake detection and phase-picking method using the compact convolutional transformer, IEEE Trans. Geosci. Remote Sens., 61, 1–15.
Seydoux L., Balestriero R., Poli P., Hoop M.D., Campillo M., Baraniuk R., 2020. Clustering earthquake signals and background noises in continuous seismic data with unsupervised deep learning, Nat. Commun., 11(1), 3972, doi:10.1038/s41467-020-17841-x.
Shakeel M., Itoyama K., Nishida K., Nakadai K., 2021. EMC: earthquake magnitudes classification on seismic signals via convolutional recurrent networks, in 2021 IEEE/SICE International Symposium on System Integration (SII), IEEE, pp. 388–393.
Sharma A., Nanda S.J., 2020. Timely detection of seismic waves in ground motion data using improved S-transform, in 2020 5th IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE), IEEE, pp. 1–9.
Simonyan K., Zisserman A., 2014. Very deep convolutional networks for large-scale image recognition, preprint (arXiv:1409.1556).
Tibi R., Linville L., Young C., Brogan R., 2019. Classification of local seismic events in the Utah region: a comparison of amplitude ratio methods with a spectrogram-based machine learning approach, Bull. seism. Soc. Am., 109(6), 2532–2544.
Torrey L., Shavlik J., 2010. Transfer learning, in Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, IGI Global, pp. 242–264.
Trani L., Pagani G.A., Zanetti J.P.P., Chapeland C., Evers L., 2022. DeepQuake—an application of CNN for seismo-acoustic event classification in The Netherlands, Comput. Geosci., 159, 104 980.
Vaezi Y., Van der Baan M., 2015. Comparison of the STA/LTA and power spectral density methods for microseismic event detection, Geophys. Suppl. MNRAS, 203(3), 1896–1908.
Wang T., Bian Y., Zhang Y., Hou X., 2023. Classification of earthquakes, explosions and mining-induced earthquakes based on XGBoost algorithm, Comput. Geosci., 170, 105 242, doi:10.1016/j.cageo.2022.105242.
Wang Z., Liu G., Du J., Li C., Qi J., 2022. Low-frequency extrapolation of prestack viscoacoustic seismic data based on dense convolutional network, IEEE Trans. Geosci. Remote Sens., 60, 1–13.
Wenner M., Hibert C., van Herwijnen A., Meier L., Walter F., 2021. Near-real-time automated classification of seismic signals of slope failures with continuous random forests, Nat. Hazards Earth Syst. Sci., 21(1), 339–361.
Woo S., Park J., Lee J.-Y., Kweon I.S., 2018. CBAM: convolutional block attention module, in Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
Xie J., Girshick R., Farhadi A., 2016. Unsupervised deep embedding for clustering analysis, in International Conference on Machine Learning, PMLR, pp. 478–487.
Zhang M., Ellsworth W.L., Beroza G.C., 2019. Rapid earthquake association and location, Seismol. Res. Lett., 90(6), 2276–2284.
Zhang X., Reichard-Flynn W., Zhang M., Hirn M., Lin Y., 2022. Spatiotemporal graph convolutional networks for earthquake source characterization, J. Geophys. Res.: Solid Earth, 127(11), e2022JB024401, doi:10.1029/2022JB024401.
Zhu W., Beroza G.C., 2019. PhaseNet: a deep-neural-network-based seismic arrival-time picking method, Geophys. J. Int., 216(1), 261–273.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.