Abstract

The advanced diagnosis of faults in railway point machines is crucial for ensuring the smooth operation of the turnout conversion system and the safe functioning of trains. Signal processing and deep learning-based methods have been extensively explored in the realm of fault diagnosis. While these approaches effectively extract fault features and facilitate the creation of end-to-end diagnostic models, they often demand considerable expert experience and manual intervention in feature selection, structural construction and parameter optimization of neural networks. This reliance on manual effort can result in weak generalization performance and a lack of intelligence in the model. To address these challenges, this study introduces an intelligent fault diagnosis method based on deep reinforcement learning (DRL). Initially, a one-dimensional convolutional neural network agent is established, leveraging the specific characteristics of point machine fault data to automatically extract diverse features across multiple scales. Subsequently, a deep Q network (DQN) is incorporated as the central component of the diagnostic framework: the fault classification interactive environment is meticulously designed, and the agent training network is optimized. Through extensive interaction between the agent and the environment using fault data, satisfactory cumulative rewards and effective fault classification strategies are achieved. Experimental results demonstrate the proposed method's high efficacy, with a training accuracy of 98.9% and a test accuracy of 98.41%. Notably, applying DRL to the fault diagnosis of railway point machines enhances the intelligence of the diagnostic process, particularly through its excellent independent exploration capability.

1. Introduction

China's railway is currently experiencing a phase of rapid development. The collaborative synergy between various pieces of key equipment contributes significantly to the secure and steady operation of trains [1–3]. The turnout conversion system holds a pivotal role in high-speed and heavy-haul railway transportation. Serving as a crucial component within this system, railway point machines precisely manage the train's running direction by adjusting the position of the turnout [4, 5]. The normal operation of point machines is therefore imperative: any failure can bring the turnout conversion system to a complete halt, preventing high-speed trains from changing their operating lines as required. In more severe scenarios, such malfunctions can result in major accidents such as train derailments, posing incalculable risks to personnel safety and economic well-being [6–9].

Research into methods for diagnosing faults in railway point machines has stimulated in-depth discussions among scholars. These studies primarily utilize data-driven approaches combined with signal-processing and supervised-learning methods. Huang et al. [10] employed the Fréchet distance and similarity function to measure the similarity between the standard current curve and the curve under examination, providing an initial assessment of the operational status of point machines. Wei et al. [11] introduced the ensemble empirical mode decomposition (EEMD) and fuzzy entropy theory to process the power signal of point machines, combining the grey correlation algorithm for swift diagnosis. However, the former can only determine the occurrence of a fault and cannot provide information regarding the specific fault type. The latter is susceptible to limitations, particularly when faced with a large dataset, as fault curves with high similarity may result in poor discrimination performance. The rapid advancement of machine learning (ML) and deep learning (DL) has propelled intelligent fault diagnosis to new heights [12]. Supported by substantial data, the algorithm is anticipated to overcome the previously mentioned issues. Chen et al. [13] proposed a deep residual convolutional neural network to extract local features from power curves, integrating a multi-head self-attention mechanism to focus on key features. The combination of these approaches facilitates efficient diagnosis of faults. Wang and Li [14] devised a composite model of neural network; by shallowly extracting and deeply mining the power signal characteristics, the classifier produces ideal recognition results. In contrast to the above diagnosis methods that rely on electrical signals, Cao et al. [15] focused on extracting sound signals from point machines, employing two-stage feature selection and ensemble learning to effectively classify faults. In a similar vein, Sun et al. [16] applied variational mode decomposition (VMD) to process the vibration signals associated with the fault states, synthesized multiscale features to construct a feature set and utilized support vector machine (SVM) for fault identification. The exploration of both sound and vibration signals introduces a novel perspective to the fault diagnosis of railway point machines. Particularly, the vibration signal, with its inherent stability and rich feature information, holds representative value in fault identification research.

A review of the literature reveals that the fusion of signal processing and ML can serve as an effective method for fault identification. However, when dealing with high-dimensional fault data, signal decomposition is not only time-consuming but also heavily reliant on manual expertise for feature selection and classifier parameter tuning. DL-based diagnostic methods can adaptively extract fault features and comprehensively analyse the intrinsic relationships between these features, leading to satisfactory outcomes [17, 18]. Yet, these methods have their limitations: (1) When addressing specific tasks, optimizing network structures and parameters often depends on substantial human intervention and effort, which can result in poor generalization capability. (2) The essence of the supervised-learning mechanism is repeated training within a given data distribution to identify a strong classification model; because no additional incentive shapes the learning process, the independent exploration ability of DL methods in fault diagnosis still needs improvement. Fortunately, another product of artificial intelligence (AI), reinforcement learning (RL), has been effectively combined with DL, harnessing its remarkable self-exploration ability alongside the strong perceptual capabilities of DL. The result, known as deep reinforcement learning (DRL), is further propelling the evolution of intelligent fault diagnosis [19, 20].

To date, DRL has found extensive application in areas such as automatic control, intelligent scheduling and unmanned driving [21–23]. More recently, researchers have begun to employ it for mechanical fault identification, marking a novel approach in this domain. Ding et al. [24] achieved fault diagnosis in rotating machinery by integrating a stacked autoencoder (SAE) network with DQN (Deep Q Network), providing substantial evidence for the effectiveness of the DRL model in this domain. Wang and Xuan [25] proposed a diagnosis method for bearings and tools, leveraging one-dimensional convolution and an enhanced Actor-Critic algorithm to achieve superior recognition outcomes. Wang et al. [26] integrated DQN and transfer learning to achieve fault diagnosis of planetary gearboxes under varying operating conditions. Their findings validate that DRL possesses distinct advantages over some DL methods. Nevertheless, when dealing with various types of fault data, finding a harmonious blend between DL's network structure and RL's reward mechanisms and strategies still requires further exploration. To this end, this research introduces an intelligent fault diagnosis framework for railway point machines, utilizing a one-dimensional convolutional neural network (1DCNN) and an optimized DQN algorithm. The primary contributions of this work are as follows:

  1. Combining the characteristics of the current data from point machines, a fault diagnosis agent based on the 1DCNN model is thoughtfully crafted to automatically extract multiscale features, mitigating the need for excessive dependence on manual design.

  2. The DQN algorithm serves as the core element in the proposed diagnosis framework and is further optimized. This optimization encompasses the establishment of a classification interactive environment and the exploration of an agent training network to obtain more gratifying cumulative rewards.

  3. The meticulously designed reward mechanism is closely tailored to the fault diagnosis scenario, which promotes the agent to generate excellent classification policy after extensive interaction with the data environment, ultimately leading to more favourable diagnostic outcomes.

The remaining sections of this study are outlined as follows. Section 2 introduces the relevant foundational theory of RL. Section 3 describes the data processing and the construction of the DRL fault diagnosis framework. A detailed analysis of the experimental configuration and diagnostic results is provided in Section 4. Section 5 offers a comprehensive summary of the research.

2. Foundational theory of RL

2.1. Principle of RL

RL is a computational approach wherein an agent engages with the environment iteratively to achieve a desired objective [27]. Fig. 1 illustrates the detailed interaction process. Guided by a policy |$\pi $|⁠, the agent takes action |${{a}_t}$| to alter the present environmental state |${{s}_t}$|⁠. Simultaneously, the environment provides corresponding reward |${{r}_t}$| based on the action execution, and the feedback on the environmental state |${{s}_{t + 1}}$| is conveyed to the agent in the subsequent moment, creating a closed-loop process. In the field of RL, the environment is essentially conceptualized as a Markov decision process (MDP), detailing the evolution of state information and the mechanism of reciprocal transitions between states [28]. The interplay between policy |$\pi $|⁠, discount factor |$\gamma $| and reward function is considered to optimize the expectation of cumulative reward |${{R}_t}$|⁠, as in Eq. (1).

|$R_t = \mathbb{E}\big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \big]$| (1)

where |${{r}_{t + k}}$| denotes the reward received at moment |$t + k$|.
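
As a quick numerical illustration of Eq. (1), the short Python sketch below accumulates discounted rewards over a trajectory; the reward values are hypothetical and the snippet is only a didactic aid, not part of the proposed framework.

```python
# Minimal sketch of Eq. (1): R_t = sum_k gamma^k * r_{t+k}
def discounted_return(rewards, gamma):
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Hypothetical per-step rewards with the small discount factor (gamma = 0.1)
# used later for the classification task.
print(discounted_return([1, 0, 1, 1], gamma=0.1))  # 1.011
```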

Fig. 1. The dynamic interaction between the agent and the environment.

2.2. DQN algorithm

DQN represents a novel network architecture rooted in Q-Learning [29]. Its primary methodology involves leveraging a neural network to model the action value function, denoted as |$Q( {s,a} )$|⁠. This neural network approximates the action values |$Q( {{{s}_t},{{a}_t}} )$|⁠, associated with all feasible actions |${{a}_t}$|⁠, within each state |${{s}_t}$|⁠. Consequently, the neural network employed for this purpose is often referred to as the Q network. DQN is particularly well-suited for addressing scenarios involving discrete actions. The algorithm efficiently processes the state |${{s}_t}$| by inputting it into the network, obtaining the corresponding action values for each possible action. The specific procedure of this algorithm is shown in Fig. 2.

Fig. 2. The execution process of the DQN algorithm.

To ensure superior and stable performance of the Q network during training, the algorithm incorporates an experience replay module and a target network structure [30]. In the experience replay module, the agent continuously collects trajectory data |$\tau = ( {{{s}_t},{{a}_t},{{r}_t},{{s}_{t + 1}}} )$| while interacting with the environment, guided by a greedy strategy. Once the data volume in the buffer exceeds a predefined threshold, batches of data are randomly sampled for training the Q network. The target network has a structure identical to that of the earlier trained network |${{Q}_\omega }( {{{s}_t},{{a}_t}} )$| but differs in the frequency of parameter updates: the parameters |$\varpi $| of the target network |${{Q}_\varpi }( {{{s}_t},{{a}_t}} )$| are synchronized with the parameters |$\omega $| of the earlier network every C steps, enhancing the overall stability of Q network training. Noteworthy is that the update of network parameters is achieved by minimizing the objective loss function through gradient descent. The loss function is defined as Eq. (2).

|$L( \omega ) = \frac{1}{N}\sum_{i=1}^{N} \big[ r_i + \gamma \max_a Q_\varpi ( s_{i+1}, a_{i+1} ) - Q_\omega ( s_i, a_i ) \big]^2$| (2)

where N signifies the aggregate count of trajectory data, |${{Q}_\omega }( {{{s}_i},{{a}_i}} )$| denotes the earlier training network, |${{Q}_\varpi }( {{{s}_{i + 1}},{{a}_{i + 1}}} )$| represents the target network, |$\omega $| and |$\varpi $| stand for the training parameters of the respective networks, and |${{r}_i} + \gamma {{\max }_a}{{Q}_\varpi }( {{{s}_{i + 1}},{{a}_{i + 1}}} )$| is the action value derived from the output of the target network.
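
For concreteness, the following PyTorch-style sketch shows one way the loss of Eq. (2) and the periodic parameter synchronization could be computed; the function names, the batch layout and the use of a mean-squared error are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma):
    """Temporal-difference loss of Eq. (2) for a sampled batch (s, a, r, s')."""
    states, actions, rewards, next_states = batch
    # Q_omega(s_i, a_i): value of the action actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r_i + gamma * max_a Q_varpi(s_{i+1}, a): target from the target network
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_values, targets)

def sync_target(q_net, target_net):
    """Copy the training parameters omega to the target parameters every C steps."""
    target_net.load_state_dict(q_net.state_dict())
```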

3. Methodology

3.1. Data processing

Due to the discernible disparity in the overall action duration of railway point machines arising from diverse faults, the extracted experimental data falls short of meeting the dimension and numerical-range criteria for direct application of the diagnosis model. To address these shortcomings, the processing consists of dimension adjustment and data normalization. On the one hand, the data dimension is fixed by setting a threshold: if the dimension of the experimental data exceeds this threshold, redundant data is eliminated; conversely, if the dimension is less than the threshold, vacant positions are filled with zeros. At the same time, min-max normalization is used to mitigate distribution differences among the data, ensuring compatibility with the proposed model. The normalization calculation is shown in Eq. (3).

|$x_n = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$| (3)

where |${{x}_n}$| represents the processed data, |${{x}_i}$| denotes the data to be processed, |${{x}_{\max }}$| is the maximum value in the data to be processed and |${{x}_{\min }}$| represents the minimum value.
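
A minimal sketch of this preprocessing step is shown below, assuming the dimension threshold of 165 points adopted later in section 4.1; the function name and the small constant guarding against a zero range are illustrative choices.

```python
import numpy as np

def preprocess(signal, threshold=165):
    """Truncate or zero-pad a current curve to a fixed length, then min-max normalize (Eq. (3))."""
    x = np.asarray(signal, dtype=float)
    if len(x) >= threshold:
        x = x[:threshold]                            # eliminate redundant points
    else:
        x = np.pad(x, (0, threshold - len(x)))       # fill vacant points with zeros
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + 1e-12)     # guard against a constant signal
```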

3.2. Developing the DRL framework for a fault diagnosis task

3.2.1. 1DCNN model

Within the proposed DRL diagnosis framework, the 1DCNN serves as the DL module primarily employed for constructing a fault diagnosis agent. Conventionally, convolutional neural network (CNN) has gained widespread recognition and application for processing two-dimensional data, with images being a typical example [31, 32]. Notably, 1DCNN demonstrates better processing capabilities when dealing with one-dimensional sequence data, such as current or vibration signals. As illustrated in Fig. 3, this convolution operation offers the flexibility to dynamically adjust the convolution kernel size, facilitating the rapid extraction of features.

Fig. 3. One-dimensional convolution operation.

Building upon the aforementioned description, the 1DCNN model designed in this study is depicted in Fig. 4. It comprises five key components: input, convolution layers, pooling layers, a fully connected layer and output. The input module encompasses data representing various faults of point machines acquired through data processing. The convolutional structure consists of two layers, capturing diverse multiscale fault features by employing different numbers of convolution kernels. The two pooling layers utilize max pooling to reduce feature dimensions and mitigate the impact of irrelevant features. The fully connected layer consolidates all features following the convolution and pooling operations, computing the corresponding Q value for each fault type through a function mapping relationship. Hence, the established 1DCNN model can proficiently attain precise estimation of the agent's action value.

Fig. 4. The proposed 1DCNN model in this study (assuming the length of input data is 165).
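
The PyTorch sketch below illustrates one possible realization of this 1DCNN Q network. The kernel and pooling sizes follow the settings given in section 4.2.2; the channel count of the second convolutional layer (24) and the seven output classes are assumptions made for illustration where the paper is not explicit.

```python
import torch
import torch.nn as nn

class QNetwork1DCNN(nn.Module):
    """Sketch of the 1DCNN agent: two convolution/pooling stages and a fully
    connected head that outputs one Q value per fault class."""
    def __init__(self, input_len=165, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Conv1d(16, 24, kernel_size=3, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy pass
            flat = self.features(torch.zeros(1, 1, input_len)).numel()
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 128), nn.ReLU(),
            nn.Linear(128, num_classes),  # Q value for each fault class/action
        )

    def forward(self, x):  # x: (batch, 1, input_len)
        return self.head(self.features(x))
```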

3.2.2. Improved DQN algorithm

When applying DRL to the realm of fault diagnosis, it is imperative to select an appropriate algorithm tailored to the specific task [33]. Presently, DRL algorithms predominantly fall into two categories: those based on the value function and those based on the policy, the latter including typical on-policy policy-gradient algorithms such as REINFORCE, Actor-Critic and PPO (Proximal Policy Optimization). Because on-policy algorithms rarely reuse training samples, they struggle to achieve satisfactory results, particularly in classification problems characterized by limited sample data. In contrast, an off-policy algorithm such as DQN stands out for its experience replay buffer design. This design enables the collection of ample samples, facilitating sufficient learning by the agent; it effectively enhances the utilization of training samples while mitigating the correlation between them. However, the standard DQN algorithm often falls short of achieving the desired results across different problems. In this research, we design a classification interactive environment and an agent training network based on the fault data type, sample size and distribution characteristics. The aim is to achieve greater stability and satisfactory cumulative rewards.

  • Interactive environment for fault classification

The core of establishing the interactive environment lies in employing an MDP to articulate the fault classification task and delineate each component [34]. Assume the fault dataset used for training is denoted as |$D = \{ {{{k}_1},{{k}_2},{{k}_3}, \cdots ,{{k}_n}} \}$| and the set of label categories to which the fault samples belong is |$L = \{ {{{l}_1},{{l}_2},{{l}_3}, \cdots ,{{l}_m}} \}$|; each sample |${{k}_i}$| is associated with a corresponding label |${{l}_t}$|, |$i \in [ {1,n} ],t \in [ {1,m} ]$|. The breakdown of each module in the MDP is as follows:

State: there exists a specific correlation between the state set S and the training dataset D; each state |${{s}_t}$| at every time step corresponds to a sample |${{k}_i}$| in the dataset. Consequently, we can define the state set as

|$S = \{ s_1, s_2, s_3, \cdots, s_n \} = \{ k_1, k_2, k_3, \cdots, k_n \}$| (4)

where S represents the set of state, i.e. the data sample set. At the initiation of the interaction, the state |${{s}_1}$| corresponds to the sample |${{k}_1}$|⁠. Clearly, when the environment state is reset, it means that the fault sample is also randomly initialized.

Action: the classification actions executed by the agent are associated with fault categories, and these fault categories are consistent with the label categories. Consequently, the set of classification actions is defined as

|$A = \{ a_1, a_2, a_3, \cdots, a_m \}$| (5)

where m represents the total number of classification actions, which is also equivalent to the total number of fault types and label categories.

Reward: the formulation of the reward is the key link in the MDP; a well-designed reward function frequently yields optimal training outcomes [35]. In this study, a reward mechanism is devised based on the peculiarities of the fault classification task. If the action executed by the agent in the current time step aligns with the sample label, a reward of 1 is granted; conversely, if the action contradicts the label, the reward is 0. The rules governing rewards are designed as

|$r_t = \begin{cases} 1, & a_t = l_t \\ 0, & a_t \ne l_t \end{cases}$| (6)

where |${{a}_t}$| denotes the action taken under the current state |${{s}_t}$|⁠, and |${{a}_t} \in A,t \in [ {1,m} ]$|⁠.

Discount factor: the discount factor |$\gamma $| typically falls within the range of |$[ {0,1} )$|⁠. The closer the value is to 1, the greater emphasis is placed on long-term rewards. Conversely, the closer it is to 0, the more focus is directed towards recent rewards. In the fault classification problem, where instant rewards hold more significance, a smaller value for |$\gamma $| is employed.
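
A minimal sketch of this classification MDP as an interactive environment is given below; the class name, the shuffling at reset and the episode termination rule are simplifying assumptions consistent with the description above.

```python
import numpy as np

class FaultClassificationEnv:
    """Each state is one preprocessed fault sample; the action is a predicted fault
    label; the reward follows Eq. (6): 1 for a correct prediction, 0 otherwise."""
    def __init__(self, samples, labels):
        self.samples = samples              # array of shape (n, 165)
        self.labels = labels                # array of shape (n,)
        self.order = np.arange(len(samples))
        self.t = 0

    def reset(self):
        np.random.shuffle(self.order)       # random initialization of the fault sample
        self.t = 0
        return self.samples[self.order[self.t]]

    def step(self, action):
        reward = 1.0 if action == self.labels[self.order[self.t]] else 0.0
        self.t += 1
        done = self.t >= len(self.samples)  # one pass over the training set per epoch
        next_state = None if done else self.samples[self.order[self.t]]
        return next_state, reward, done
```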

  • Agent training network

Once the interactive environment is successfully established, the agent and the environment engage in interactive learning. This is achieved by optimizing the training network to acquire the ideal action value |${{Q}^*}$|⁠. Subsequently, a satisfactory fault classification policy |${{\pi }^*}$| is derived, facilitating improved classification action selection. In this research, we will delve into two key aspects, enhancing the Q network and optimizing the greedy strategy.

Enhanced Q network: the Q network in the standard DQN algorithm comprises a basic fully connected layer structure, proving inadequate for robust data feature extraction. Therefore, we adopt the proposed 1DCNN model as a new Q network to better fit the action value. The preprocessed fault data serves as input, treated as a discrete state within the network. Following convolution, pooling and so on, the network yields corresponding Q values for various classification actions. The fitting relationship of the neural network is defined as

|$Q = f_\theta ( s_t )$| (7)

where |${{s}_t}$| represents the state of the current time step, |${{f}_\theta }$| is the function mapping relationship of neural network, |$\theta $| is all the parameters of 1DCNN model and Q denotes the output action value.

Optimized greedy strategy: the greedy strategy is commonly employed to facilitate exploration and exploitation during the interaction, aiding action selection to maximize cumulative rewards. Nevertheless, in the standard DQN, the greedy factor |$\varepsilon $| remains static, lacking a balance between exploration and exploitation. In this study, |$\varepsilon $| is engineered to decay linearly with the number of interaction steps. The specific expressions are defined as

|$\varepsilon = \max ( e_{\min},\, e_0 - \eta u )$| (8)
|$a_t = \begin{cases} \arg \max_{a \in A} \hat{Q}( a ), & \text{with probability } 1 - \varepsilon \\ \text{random } a \in A, & \text{with probability } \varepsilon \end{cases}$| (9)

where |${{e}_{\min }}$| is the lower threshold of the greedy factor, |${{e}_0}$| is its initial value, |$\eta $| signifies the decay rate of the greedy factor per time step, u stands for the current interaction step, |$\hat{Q}( a )$| is the estimate of the cumulative reward expectation of action a, |$\arg {{\max }_{a \in A}}\hat{Q}( a )$| selects the action with the highest estimated value (chosen with probability |$1 - \varepsilon $|), and a random action |$a \in A$| is selected from the action set A with probability |$\varepsilon $|.
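
The linear decay of Eqs. (8) and (9) can be sketched as follows, using the numerical values adopted in section 4.2.2; the function names are illustrative.

```python
import numpy as np

def epsilon(step, e0=0.02, e_min=0.008, eta=0.001):
    """Eq. (8): the greedy factor decays linearly with the interaction step,
    bounded below by e_min (values follow section 4.2.2)."""
    return max(e_min, e0 - eta * step)

def select_action(q_values, step, rng=np.random.default_rng()):
    """Eq. (9): exploit argmax Q with probability 1 - epsilon, otherwise explore."""
    if rng.random() < epsilon(step):
        return int(rng.integers(len(q_values)))   # random action from A
    return int(np.argmax(q_values))               # greedy action
```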

Building upon the analysis of each module, algorithm 1 elucidates the specific implementation of the improved DQN algorithm.

Algorithm 1: Improved DQN algorithm
Input: Training sample state set S and label set L, experience replay buffer capacity threshold |${{R}_1},{{R}_2}$|⁠, batch size B for training data extracted from the buffer, learning rate |$\alpha $|⁠, discount factor |$\gamma $|⁠, lower limit value |${{e}_{\min }}$| of greedy factor, initial value |${{e}_0}$| and attenuation |$\eta $|⁠, update frequency C of target network, the number of interaction steps T in a single epoch and the total number of epoch I
Output: The Q network parameters |$\theta = \{ {\omega ,\bar{\omega }} \}$| corresponding to the optimal action value |${{Q}^*}$|
1: Randomly initialize the parameters |$\omega $| of the earlier training network
2: Duplicate the parameters from |$\omega $| to |$\bar{\omega }$| and initialize the target network
3: for epoch = 1 to I  do
4: Initialize the environment state set S
5: for t = 1 to T  do
6: Based on the current state |${{s}_t}$|⁠, select action |${{a}_t}$| using the optimized greedy strategy
7: Utilizing the action |${{a}_t}$| in conjunction with the predefined reward mechanism, obtain the reward through environmental feedback, denoted as |${{r}_t}$|
8: After the completion of the agent's action, the environmental state transitions from |${{s}_t}$| to |${{s}_{t + 1}}$|
9: The trajectory data |$\tau = ( {{{s}_t},{{a}_t},{{r}_t},{{s}_{t + 1}}} )$| is stored in the experience replay buffer
10: if the total amount of data reaches the lower threshold |${{R}_1}$| of the buffer capacity then
11: According to the batch size B to extract the trajectory data |$\tau $| for Q network training
12: The training parameters |$\omega $| are updated by using the gradient descent and loss function
13: The target network parameters |$\bar{\omega }$| undergo an update every time step C
14: end for
15: end for
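Putting the pieces together, the condensed Python sketch below mirrors Algorithm 1 with the hyperparameters of Table 3. It reuses the hypothetical helpers sketched earlier (QNetwork1DCNN, FaultClassificationEnv, select_action, dqn_loss, sync_target) and the optimizer choice is an assumption; it is an illustrative reading of the pseudocode, not the authors' implementation.

```python
import random
from collections import deque
import numpy as np
import torch

def train(env, q_net, target_net, epochs=600, steps_per_epoch=504,
          batch_size=32, gamma=0.1, lr=1e-3, r1=100, r2=10000, c=10):
    buffer = deque(maxlen=r2)                                 # R2: upper capacity threshold
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)   # optimizer is an assumption
    target_net.load_state_dict(q_net.state_dict())            # step 2: initialize target network
    step_count = 0
    for epoch in range(epochs):                               # step 3
        state = env.reset()                                   # step 4
        for t in range(steps_per_epoch):                      # step 5
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).view(1, 1, -1))[0]
            action = select_action(q.numpy(), step_count)             # step 6
            next_state, reward, done = env.step(action)               # steps 7-8
            if not done:
                buffer.append((state, action, reward, next_state))    # step 9
            if len(buffer) >= r1:                                      # step 10: R1 reached
                s, a, r, s2 = zip(*random.sample(buffer, batch_size))  # step 11
                batch = (torch.as_tensor(np.stack(s), dtype=torch.float32).unsqueeze(1),
                         torch.as_tensor(a),
                         torch.as_tensor(np.asarray(r), dtype=torch.float32),
                         torch.as_tensor(np.stack(s2), dtype=torch.float32).unsqueeze(1))
                loss = dqn_loss(q_net, target_net, batch, gamma)       # step 12
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if step_count % c == 0:
                    sync_target(q_net, target_net)                     # step 13
            step_count += 1
            if done:
                break
            state = next_state
```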

3.3. Key steps of the proposed method

A novel method utilizing DRL for the fault diagnosis of railway point machines is introduced. The implementation process of this idea is illustrated in Fig. 5. The following steps provide a concise summary of this method.

Fig. 5. The intricate structure of the fault diagnosis method proposed in this study.

Step 1: Extract the current data of railway point machines and ensure uniform sample distribution for each fault type. Employ data processing to standardize dimensions and normalize sample values, creating a fault dataset. Subsequently, divide the dataset into training and test samples according to an appropriate ratio.

Step 2: Formulate the fault classification task as a MDP and adeptly design an interactive environment. This environment encompasses essential modules such as state, action, reward and discount factors.

Step 3: Optimize the Q network and the greedy strategy to establish the agent training network. Integrate the components developed in step 2, synergizing the efforts to construct a comprehensive DRL framework.

Step 4: Following the intricate details of reward mechanism in the MDP and algorithm 1, engage in interactive learning between the agent and the environment. Explore the execution of optimal actions to obtain cumulative rewards, ultimately realizing the generation of an excellent classification policy.

Step 5: The training process of the network halts upon reaching the maximum number of epochs. Subsequently, test samples are reintroduced to evaluate the agent's classification performance, enabling a thorough assessment.

4. Experiment and result analysis

4.1. Experimental set-up and data description

Among the AC series railway point machines, the ZDJ9 is a notable example, encompassing key components such as an electric motor, reducer, friction clutch, ball screw, throwing rod, indication rod and switch circuit controller. Renowned for its ample conversion power and high reliability, the ZDJ9 point machine finds widespread usage in China's high-speed railway lines [36, 37].

In pursuit of intelligent fault diagnosis for railway point machines, we conducted experiments utilizing current data recorded during operation under various fault conditions. The data was sourced from a centralized signal monitoring system within the Shenyang Railway Bureau, China. The current data is collected by sensors at a sampling interval of 40 ms. Taking the ZDJ9 point machine as an example, its normal operation lasts approximately 5.8 s, corresponding to about 145 sampling points. Given the variance in current data caused by different fault types, we applied the data-processing method outlined in section 3.1, setting a data dimension threshold of 165 points and normalizing the data to the range between 0 and 1. Fig. 6 shows the standard movement stages of the ZDJ9 point machine, encompassing unlocking (A), conversion (B), locking (C) and slow-down (D). During unlocking, 1DQJ and 2DQJ sequentially activate, establishing circuits for switch control. The electric motor propels the conversion process, reaching the maximum output current. Subsequently, as the external locking device successfully opens, the current gradually decreases. Transitioning into the conversion and locking stages, the current curve generally remains below 2 A. Upon completion of the locking process, the turnout is secured, resulting in another decline in the current value. The final stage involves the disconnection of the 1DQJ self-closing circuit, initiating the slow-down state. Simultaneously, the outdoor indication circuits establish a connection, maintaining the current value at approximately 0.5 A. After a defined duration, this phase concludes. Information on the other types of fault data is shown in Table 1.

Fig. 6. The current curve of normal movement for ZDJ9.

Table 1. Common fault types and phenomena in railway point machines.

Type | Fault phenomena | Sample size
F1 | During the unlocking, the current value experiences continuous fluctuations before reaching its peak. | 90
F2 | At around 1.8 s, the current value undergoes a rapid decline, plummeting from 2 A to 0. | 90
F3 | The current continues to output, failing to decrease to its normal level at the conclusion of the locking. | 90
F4 | After entering the slow release stage of 1DQJ, the current exhibits anomalies and subsequently vanishes. | 90
F5 | During the deceleration process, the current of 1DQJ significantly exceeds 0.5 A and consistently hovers between 0.7 A and 0.8 A. | 90
F6 | The duration of slowing for 1DQJ extends beyond the typical time range. | 90

Fig. 7 illustrates the correlation between current and time for various faults, primarily categorized into mechanical issues (F1, F3, F4) and suboptimal device performance (F2, F5, F6). For example, fault F1 arises from mechanical jamming in the turning shaft of the switch circuit controller, preventing the action of the moving contact group during the initial startup. Fault F2 results from the poor performance of an open-phase protector (DBQ), which interrupts the output current, cutting off the 1DQJ self-closing circuit and producing an instantaneous current value of 0. Notably, the abnormal performance of certain electronic devices can subtly create a high similarity between the fault current curve and the normal current curve. For instance, the current curve corresponding to F6 closely mirrors the waveform of the normal F0 curve. Analysis reveals that this similarity is due to inherent characteristics of electronic devices causing a short delay in the slow-down process. Therefore, incorporating experimental data of this kind enhances the persuasiveness of the proposed method in terms of classification performance.

Fig. 7. The correlation between current and time in typical faults.

4.2. Experimental verification and discussion

4.2.1. Selecting the right DRL algorithm

Presently, a multitude of algorithms tailored to task-specific demands exist, and astute selection often yields exceptional training outcomes. This study employs the fault data to experiment with both on-policy and off-policy algorithms, aiming to identify the most suitable fault identification algorithm. Initially, the 630 extracted experimental samples are split into 504 training samples and 126 test samples, adhering to an 8:2 ratio, as detailed in Table 2. Subsequently, the comparison experiment includes the policy-based algorithms REINFORCE, Actor-Critic, PPO and SAC (Soft Actor-Critic), while the value-based off-policy algorithm is DQN. The policy or value networks within these algorithms are configured as fully connected layers, with additional crucial parameter information provided in Table 3.

Table 2. The division of experimental data.

Type | Size of training/testing data | Label
F0 | 72/18 | 0
F1 | 72/18 | 1
F2 | 72/18 | 2
F3 | 72/18 | 3
F4 | 72/18 | 4
F5 | 72/18 | 5
F6 | 72/18 | 6

Table 3. Setting of key parameters in different algorithms.

Parameter | Symbol | Value
Learning rate | |$\alpha $| | 0.001
Discount factor | |$\gamma $| | 0.1
Greedy factor | |$\varepsilon $| | 0.02
Capacity thresholds of the experience replay buffer | |${{R}_1},{{R}_2}$| | 100, 10000
Batch size | B | 32
Number of neurons in the fully connected layer | H | 128
Epochs | I | 600
Interaction steps in a single epoch | T | 504
Update frequency of the target network | C | 10

Note: The learning rate of the policy network in Actor-Critic, PPO and SAC is consistent with Table 3, and the learning rate of the value network is 1e-2.

Fig. 8 illustrates the rewards generated by the five algorithms over 600 epochs. As the number of interaction epochs increases, the cumulative reward of the SAC algorithm rises gradually, yet it remains below 300 even at the maximum epoch. Both REINFORCE and Actor-Critic, when trained up to approximately 450 epochs, achieve stable reward values hovering around 432; notably, the reward curves of the two approaches nearly coincide throughout the training phase. Although PPO achieves a slightly higher cumulative reward than DQN once training stabilizes, it experiences a sharp decline around epoch 510, compromising its stability. In contrast, DQN converges rapidly and maintains a more stable cumulative reward. The reward curves indirectly validate that the off-policy DQN algorithm effectively utilizes the training data and excels in the fault classification task. This success establishes a solid foundation for further exploration of a DRL framework suitable for railway point machine fault identification.

Fig. 8. The cumulative reward curves derived from various algorithms.

4.2.2. Constructing the DL module and designing the greedy strategy

To ensure that the constructed DL module optimally fits the action value and that the greedy strategy effectively balances exploration and exploitation, this study conducts experiments incorporating various DL network structures and greedy strategies. Combination 1 consists of a fully connected layer paired with a deterministic strategy (i.e. a fixed greedy factor). Combination 2 integrates a convolutional layer, a pooling layer and a fully connected layer with the same deterministic strategy. Combination 3 utilizes a convolutional layer, a pooling layer and a fully connected layer with an exponential decay strategy. In these combinations, the convolutional layer's output channels are set to 16, utilizing a convolution kernel size of |$1 \times 3$| with a stride of 1. The pooling layer features a kernel size of |$1 \times 2$| with a stride of 2, and the fully connected layer contains 128 neurons. In the deterministic strategy, we specify the greedy factor |$\varepsilon = 0.02$|. For the exponential decay strategy, the exponential decay factor |$\lambda = 0.9$| and the initial greedy factor |${{e}_0} = 0.1$| are defined. Our selected combination involves two convolutional layers, two pooling layers and a fully connected layer, paired with the linear decay strategy. The number of output channels in the second convolutional layer is increased by 8, while the remaining parameters remain consistent with the aforementioned settings. In the linear decay strategy, we set the initial greedy factor |${{e}_0} = 0.02$|, the lower limit |${{e}_{\min }} = 0.008$| and the linear decay factor |$\eta = 0.001$|. Once a combination is finalized, it is integrated into the DQN for training. The training reward curves under the four combinations are shown in Fig. 9, and the optimal combination is determined from the analysis of cumulative rewards.

Fig. 9. The cumulative training rewards under different combinations.

After a comprehensive consideration of the number of training samples and the reward mechanism outlined in section 3.2.2, the cumulative reward per training epoch ranges from 0 to 504; that is, the maximum reward value is 504. As shown in Fig. 9 and the local view of the reward curve, under the same greedy strategy, combination 2 displays a slower convergence speed and a lower reward value than combination 1 in the initial stages of training, indicating a higher incidence of misclassification. Although the two reward curves converge in the later stages, it is noteworthy that the design of the DL module exerts a discernible influence on the overall training outcome. Combination 3, which employs the exponential decay strategy, reaches 504 in certain local segments, but its overall curve exhibits pronounced fluctuations, signifying a lack of stability in the training network. The curve obtained with the proposed combination not only closely approaches the maximum reward curve but also shows reduced fluctuation, indirectly reflecting superior stability.

To gauge the disparity in rewards between the combinations, we computed the average reward throughout the training process, as detailed in Table 4. A clear trend emerges: the proposed combination attains a higher average reward than the other three combinations, with an average improvement of 1.44%. These findings not only align with the reward curves in Fig. 9 but also show that the designed 1DCNN and linear-decay greedy strategy obtain satisfactory results. They affirm that the agent network exhibits excellent classification decision-making, signifying the development of a proficient DRL framework for diagnosing faults.

Table 4. The overall average reward under various combinations.

Combination | Overall average reward | Maximum cumulative reward
Combination 1 | 493.37 | 504
Combination 2 | 485.38 | 504
Combination 3 | 495.54 | 504
Our combination | 498.45 | 504

4.2.3. Evaluation of the superiority of the proposed method

In this study, we compare the proposed diagnostic method with classification algorithms grounded in DL and DRL, and conduct a thorough evaluation by computing diverse performance indicators to substantiate the proposed method's superiority. The specific parameter configurations of the methods employed in the experiment are as follows:

  1. Method 1 is the Backpropagation (BP) neural network. The hidden layer comprises 32 neurons, with a learning rate set at 0.01, a momentum factor of 0.9 and a maximum iteration limit of 50.

  2. Method 2 employs a Long Short-Term Memory (LSTM) model. The hidden layer consists of 32 neurons, with a single network layer. The learning rate is set to 0.01, the maximum number of iterations is 100 and the batch size is 32. The network parameters are updated using the cross-entropy loss function and the stochastic gradient descent optimizer.

  3. Method 3 is the 1DCNN model. This model comprises a convolutional layer, a pooling layer and a fully connected layer. The parameter configurations for the convolution layer and pooling layer are detailed in section 4.2.2. The fully connected layer consists of 32 neurons, with a maximum iteration limit of 50. The other parameters are updated in the same way as in the LSTM model.

  4. Method 4 entails a diagnosis model rooted in DRL, as introduced in reference [26]. The agent is constructed with a multiscale residual convolutional neural network (MRCNN). Three independent one-dimensional convolution modules are utilized, featuring output channels of 32, 64 and 128, respectively. The convolution kernel size is set to |$1 \times 3$|, with a stride of 1, and the output channel of the residual unit is 224. In the reward mechanism, a reward of 1 is assigned for correct classification, while the reward for incorrect classification is set to −1. The greedy factor |$\varepsilon $| follows a linear decay from 0.01 to 0.008. The discount factor is 0.1 and the number of training epochs is 600.

  5. Method 5 introduces a diagnosis model of SAE-based DRL, as proposed in reference [24]. The SAE comprises a network structure with four layers. The input layer's unit count is determined by the input sample's dimensionality. The two encoder layers consist of 32 and 16 units, respectively. The output layer's unit count is contingent upon the dimension of the Q value. The configuration of the reward mechanism, discount factor and training epochs align with the approaches detailed in method 4. The update of the greedy factor |$\varepsilon $| is consistent with the research in this paper.

  6. The proposed method. Within the DRL fault diagnosis framework in this study, the 1DCNN model is incorporated as the DL component and the structural specifics and parameter details of the model align with the optimal combination outlined in section 4.2.2. The design of the greedy strategy within the RL component mirrors that elucidated in optimal combination of section 4.2.2. Additionally, other training parameters remain in accordance with those presented in Table 3.

Based on the designed comparative experiments, we input the 504 sets of training data into the various classification models to derive the training outcomes depicted in Fig. 10. Subsequently, upon completion of model training, the additional 126 sets of test data are employed to assess the classification performance; the test results are presented in Table 5. In combination with the reward mechanism in this study, the classification accuracy is calculated as

|$\mathrm{Accuracy} = \frac{Y_1}{Y_2} \times 100\% $| (10)

where |${{Y}_1}$| is the average cumulative reward in the training or testing process, that is, the number of correctly classified samples, and |${{Y}_2}$| denotes the maximum cumulative reward achievable during the training or testing, that is, the total number of training samples or test samples.
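
For example, in the single test pass reported later in Table 6, the agent is rewarded for |${{Y}_1} = 124$| of the |${{Y}_2} = 126$| test samples, so Eq. (10) gives an accuracy of |$124/126 \times 100\% \approx 98.41\% $|.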

Fig. 10. The training accuracy under various classification methods.

Table 5. The test results across different classification methods.

Classification method | Testing accuracy (%)
Method 1 | 88.09
Method 2 | 92.06
Method 3 | 96.83
Method 4 | 97.62
Method 5 | 97.62
Our method | 98.41

Fig. 10 shows the accuracy of the various methods during the training phase. The BP neural network, a classical classification method, achieves an accuracy of 90.87% through continuous parameter optimization. LSTM and 1DCNN, two currently popular intelligent algorithms, achieve 4.93% and 7.35% higher accuracy than BP, respectively. Methods 4 and 5 further enhance the training accuracy, underscoring the advantages of DRL in the domain of fault diagnosis. Meanwhile, compared with the aforementioned methods, the proposed approach achieves a remarkable training accuracy of 98.9%.

This outcome indicates substantial progress in model enhancement, suggesting that an optimal fault classification strategy can be obtained more readily through the proposed study. Simultaneously, referencing the findings in Table 5, the proposed method achieves an impressive classification accuracy of 98.41% on the test data, showing a high level of consistency with the performance observed during the training phase. This strongly indicates that when employing the DRL method for the railway point machine fault diagnosis task, its inherent independent exploration capability not only enhances the intelligence of diagnosis but also yields more satisfactory results.

Upon analysing the training and test results across the various methods mentioned above, the superiority of the proposed method is substantiated to a certain extent. Next, we will employ classification result visualization techniques and construct a multi-classification confusion matrix to comprehensively illustrate the classification performance of different methods.

Fig. 11 intuitively reflects the classification outcomes of the different methods on the test data. It is evident that the BP neural network misclassifies the majority of F6 fault data as F5, leading to a notable decrease in overall testing accuracy. The LSTM model shows a slight improvement in the classification of F6, but some F0 data is misclassified, resulting in only a marginal enhancement in testing accuracy. Similarly, the 1DCNN model misclassifies some F0 samples, although its classification results for the other fault data are ideal. The overall number of misclassifications for methods 4 and 5 is significantly reduced, with the remaining errors concentrated on fault types F2 and F6. In contrast, the proposed method misclassifies only two samples, showcasing superior classification performance.

Fig. 11. The schematic diagram depicting correct and incorrect classifications: (a) method 1; (b) method 2; (c) method 3; (d) method 4; (e) method 5; (f) proposed method.

To quantitatively analyse the classification results for the different faults, we generated the multi-class confusion matrices illustrated in Fig. 12. Each matrix provides insight into the probability of each fault being correctly classified, based on the relationship between the actual label and the predicted label. Both BP and LSTM exhibit misclassification in two fault types, with particularly poor accuracy for F6 fault data at only 22% and 67%, respectively. The results for 1DCNN show relative improvement; a detailed analysis of the current curves for F0 and F6 reveals highly similar waveforms, underscoring a limitation in 1DCNN's ability to distinguish such similar samples. While methods 4 and 5 also misclassify several fault types, their overall accuracy surpasses the other approaches. The proposed method, in contrast, misclassifies samples of only a single fault type and still maintains a commendable accuracy of 88% for that type, demonstrating more stable classification performance.

Fig. 12. The confusion matrix diagram for multi-class classification: (a) method 1; (b) method 2; (c) method 3; (d) method 4; (e) method 5; (f) proposed method.

4.2.4. Evaluation of the reliability of the proposed method

The various comparative experiments conducted effectively validate the efficacy of the fault diagnosis method, built upon the DRL framework. This is evidenced by its strong performance in both training cumulative reward and classification test accuracy. However, it remains crucial to thoroughly validate the reliability of the proposed method. In our approach, we opted to vary the iterations, enabling the agent to execute multiple runs on the test data. This observation aimed to determine whether the accuracy demonstrated substantial differences across iterations. The classification test results across these diverse iterations are as follows.

Fig. 13 illustrates the reward distribution of test data after a single iteration, which represents the test process chosen by the proposed method in this study. Notably, only two out of 126 samples received the reward 0, indicating a misclassification count of 2.

Fig. 13. The distribution of reward at iteration 1.

Figs. 14 and 15 display the distribution of cumulative reward when the test data are iterated 50 and 100 times, respectively. The cumulative reward values cluster predominantly in the range of 123 to 126. Combined with Table 6, the average cumulative rewards for these iteration counts are 124.82 and 124.97, corresponding to accuracies of 99.06% and 99.18%. Compared with the 98.41% accuracy achieved in a single iteration, the classification accuracy improves and tends towards stability as the number of iterations increases. This indicates that, guided by the optimal classification policy obtained during training, the agent becomes increasingly familiar with the data environment over repeated iterations and attains more stable and precise recognition results. Moreover, in DL-based fault classification, test outcomes depend strongly on the trained model, and results may fluctuate randomly across repeated training sessions. By contrast, the DRL method introduced in this study integrates DL's perceptual capability with RL's autonomous exploration in the railway point machine fault diagnosis task, thereby achieving a better and more consistent classification effect.
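
Since each correctly classified sample contributes a reward of 1, the reported accuracies follow directly from the average cumulative reward divided by the 126 test samples:

\[
\text{Accuracy} = \frac{\bar{R}}{N_{\text{test}}} \times 100\%,
\]

so that \(\bar{R} = 124.00,\ 124.82,\ 124.97\) with \(N_{\text{test}} = 126\) give \(98.41\%,\ 99.06\%\) and \(99.18\%\), respectively.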

Fig. 14. The distribution of cumulative reward at 50 iterations.

Fig. 15. The distribution of cumulative reward at 100 iterations.

Table 6. The classification results under various iterations.

Iteration    Average cumulative reward    Maximum cumulative reward    Accuracy (%)
1            124.00                       126                          98.41
50           124.82                       126                          99.06
100          124.97                       126                          99.18

5. Conclusions

Considering the constraints posed by the current level of intelligence in railway point machine fault diagnosis, we refine the 1DCNN model to form the agent. Simultaneously, we optimize the DQN algorithm to enhance its independent exploration capability, ultimately creating a robust DRL-based diagnostic framework.
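
To illustrate how a 1DCNN can serve as the Q network of such an agent, a minimal PyTorch sketch is given below; the layer sizes, kernel widths and the assumption of a single-channel 1-D signal input with seven fault classes are illustrative and do not reproduce the exact configuration used in this study.

import torch
import torch.nn as nn

class Conv1dQNet(nn.Module):
    """1-D CNN that maps a fault-signal segment to Q values,
    one per candidate fault class (action)."""
    def __init__(self, n_actions=7, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),            # fixed-length feature map
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8, 64), nn.ReLU(),
            nn.Linear(64, n_actions),           # one Q value per fault class
        )

    def forward(self, x):
        # x: (batch, in_channels, signal_length)
        return self.head(self.features(x))

# Greedy action selection by the trained agent (illustrative usage):
# q_net = Conv1dQNet(); action = q_net(signal_batch).argmax(dim=1)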

Importantly, extensive comparative experiments underscore the strong performance of the proposed method; the infusion of independent exploration allows fault diagnosis to be approached in a human-like manner, thereby substantially enhancing diagnostic intelligence. Furthermore, the flexibility of the Q network in the DQN allows it to be designed with reference to other DL models. This adaptability enables the proposed DRL framework to be extended effectively to diverse types of railway point machines, such as the S700K and ZYJ7. It is also important to explore improved reward mechanisms that better match the requirements of fault diagnosis scenarios; in future work, we intend to investigate this aspect in depth. We are also committed to applying innovative fault diagnosis methods to other critical railway equipment, so as to systematically evaluate the health status of these components.

Acknowledgements

This research was supported by the Transportation Science and Technology Project of the Liaoning Provincial Department of Education (Grant No. 202243), the Provincial Key Laboratory Project (Grant No. GJZZX2022KF05) and the Natural Science Foundation of Liaoning Province (Grant No. 2019-ZD-0094).

Author contributions statement

Shuai Xiao designed the methodology. Shuai Xiao and Qingsheng Feng carried out data analysis and wrote the manuscript. Xue Li and Hong Li helped organize the manuscript.

Conflict of interest statement

There was no conflict of interest in the submission of the manuscript, and all authors agreed to publish it.

