Shuai Xiao, Qingsheng Feng, Xue Li, Hong Li, Research on intelligent fault diagnosis for railway point machines using deep reinforcement learning, Transportation Safety and Environment, Volume 6, Issue 4, October 2024, tdae007, https://doi.org/10.1093/tse/tdae007
Abstract
The advanced diagnosis of faults in railway point machines is crucial for ensuring the smooth operation of the turnout conversion system and the safe functioning of trains. Signal processing and deep learning-based methods have been extensively explored in the realm of fault diagnosis. While these approaches effectively extract fault features and facilitate the creation of end-to-end diagnostic models, they often demand considerable expert experience and manual intervention in feature selection, structural construction and parameter optimization of neural networks. This reliance on manual efforts can result in weak generalization performance and a lack of intelligence in the model. To address these challenges, this study introduces an intelligent fault diagnosis method based on deep reinforcement learning (DRL). Initially, a one-dimensional convolutional neural network agent is established, leveraging the specific characteristics of point machine fault data to automatically extract diverse features across multiple scales. Subsequently, a deep Q network is incorporated as the central component of the diagnostic framework. The fault classification interactive environment is meticulously designed, and the agent training network is optimized. Through extensive interaction between the agent and the environment using fault data, satisfactory cumulative rewards and effective fault classification strategies are achieved. Experimental results demonstrate the proposed method's high efficacy, with a training accuracy of 98.9% and a commendable test accuracy of 98.41%. Notably, the utilization of DRL in addressing the fault diagnosis challenge for railway point machines enhances the intelligence of the diagnostic process, particularly through its excellent independent exploration capability.
1. Introduction
China's railway is currently experiencing a phase of rapid development. The collaborative synergy between various pieces of key equipment significantly contributes to the secure and steady operation of trains [1–3]. The turnout conversion system holds a pivotal role in high-speed railway and heavy-haul railway transportation. Serving as a crucial component within the system, railway point machines precisely manage the train's running direction by adjusting the position of the turnout [4, 5]. In the event of a malfunction, the normal operation of point machines is imperative. Any failure in their operation can lead to a complete halt in the turnout conversion system, preventing high-speed trains from changing their operating lines as required. In more severe scenarios, such malfunctions can result in major accidents like train derailments, posing incalculable risks to personnel safety and economic well-being [6–9].
Research into methods for diagnosing faults in railway point machines has stimulated in-depth discussions among scholars. These studies primarily utilize data-driven approaches combined with signal-processing and supervised-learning methods. Huang et al. [10] employed the Fréchet distance and similarity function to measure the similarity between the standard current curve and the curve under examination, providing an initial assessment of the operational status of point machines. Wei et al. [11] introduced the ensemble empirical mode decomposition (EEMD) and fuzzy entropy theory to process the power signal of point machines, combining the grey correlation algorithm for swift diagnosis. However, the former can only determine the occurrence of a fault and cannot provide information regarding the specific fault type. The latter is susceptible to limitations, particularly when faced with a large dataset, as fault curves with high similarity may result in poor discrimination performance. The rapid advancement of machine learning (ML) and deep learning (DL) has propelled intelligent fault diagnosis to new heights [12]. Supported by substantial data, the algorithm is anticipated to overcome the previously mentioned issues. Chen et al. [13] proposed a deep residual convolutional neural network to extract local features from power curves, integrating a multi-head self-attention mechanism to focus on key features. The combination of these approaches facilitates efficient diagnosis of faults. Wang and Li [14] devised a composite model of neural network; by shallowly extracting and deeply mining the power signal characteristics, the classifier produces ideal recognition results. In contrast to the above diagnosis methods that rely on electrical signals, Cao et al. [15] focused on extracting sound signals from point machines, employing two-stage feature selection and ensemble learning to effectively classify faults. In a similar vein, Sun et al. [16] applied variational mode decomposition (VMD) to process the vibration signals associated with the fault states, synthesized multiscale features to construct a feature set and utilized support vector machine (SVM) for fault identification. The exploration of both sound and vibration signals introduces a novel perspective to the fault diagnosis of railway point machines. Particularly, the vibration signal, with its inherent stability and rich feature information, holds representative value in fault identification research.
A review of the literature reveals that the fusion of signal processing and ML can serve as an effective method for fault identification. However, when dealing with high-dimensional fault data, signal decomposition is not only time-consuming but also heavily reliant on manual expertise for feature selection and classifier parameter tuning. DL-based diagnostic methods can adaptively extract fault features and comprehensively analyse the intrinsic relationships between these features, leading to satisfactory outcomes [17, 18]. Yet, these methods have their limitations: (1) When addressing specific tasks, optimizing network structures and parameters often depends on substantial human intervention and effort, which can result in a network's poor generalization capability. (2) The essence of supervised learning mechanism lies in conducting multiple trainings within a given data distribution to identify an excellent classification model. The absence of additional incentives influencing the entire learning process also underscores the need for improvement in the independent exploration ability of DL methods in fault diagnosis. Fortunately, another product of artificial intelligence (AI), reinforcement learning (RL), has been effectively combined with DL, harnessing its remarkable self-exploration ability alongside the strong perceptual capabilities of DL, forming what is known as deep reinforcement learning (DRL) and further propelling the evolution of intelligent fault diagnosis [19, 20].
To date, DRL has found extensive application in areas such as automatic control, intelligent scheduling and unmanned driving [21–23]. However, a group of researchers has embarked on employing it to tackle the issue of mechanical fault identification, marking a novel approach in this domain. Ding et al. [24] successfully achieved fault diagnosis in rotating machinery by integrating a stacked automatic encoder (SAE) network with DQN (Deep Q Network), thereby providing substantial evidence for the effectiveness of the DRL model in this domain. Wang and Xuan [25] innovatively proposed a diagnosis method for bearings and tools, leveraging one-dimensional convolution and enhanced Actor-Critic algorithm to achieve superior recognition outcomes. Wang et al. [26] integrated DQN and transfer learning to achieve fault diagnosis of planetary gearboxes under varying operating conditions. Their findings validate that DRL possesses distinct advantages over some DL methods. However, when dealing with various types of fault data, delving deeper into finding a harmonious blend between DL's network structure and RL's reward mechanisms and strategies remains an area that requires further exploration to yield superior outcomes. To this end, this research introduces an intelligent fault diagnosis framework for railway point machines, utilizing a one-dimensional convolutional neural network (1DCNN) and an optimized DQN algorithm. The primary contributions in this work are as follows:
Combining the characteristics of the current data from point machines, a fault diagnosis agent based on the 1DCNN model is thoughtfully crafted to automatically extract multiscale features, mitigating the need for excessive dependence on manual design.
The DQN algorithm serves as the core element in the proposed diagnosis framework and is further optimized. This optimization encompasses the establishment of a classification interactive environment and the exploration of an agent training network to obtain more gratifying cumulative rewards.
The meticulously designed reward mechanism is closely tailored to the fault diagnosis scenario, which promotes the agent to generate excellent classification policy after extensive interaction with the data environment, ultimately leading to more favourable diagnostic outcomes.
The remaining sections of this study are outlined as follows. Section 2 introduces the relevant foundational theory of RL. Section 3 proposes the data processing and the construction of DRL fault diagnosis framework. A detailed analysis of the experimental configuration and diagnostic results is provided in section 4. Section 5 offers a comprehensive summary of the research.
2. Foundational theory of RL
2.1. Principle of RL
RL is a computational approach wherein an agent engages with the environment iteratively to achieve a desired objective [27]. Fig. 1 illustrates the detailed interaction process. Guided by a policy |$\pi $|, the agent takes action |${{a}_t}$| to alter the present environmental state |${{s}_t}$|. Simultaneously, the environment provides corresponding reward |${{r}_t}$| based on the action execution, and the feedback on the environmental state |${{s}_{t + 1}}$| is conveyed to the agent in the subsequent moment, creating a closed-loop process. In the field of RL, the environment is essentially conceptualized as a Markov decision process (MDP), detailing the evolution of state information and the mechanism of reciprocal transitions between states [28]. The interplay between policy |$\pi $|, discount factor |$\gamma $| and reward function is considered to optimize the expectation of cumulative reward |${{R}_t}$|, as in Eq. (1):

|${{R}_t} = \sum_{k = 0}^{\infty } {{\gamma }^{k}}{{r}_{t + k}}$| (1)

where |${{r}_{t + k}}$| denotes the reward values at different moments.
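To make the closed-loop interaction and the cumulative reward of Eq. (1) concrete, the following minimal Python sketch iterates an agent-environment loop and accumulates the discounted return. Here `env` and `agent` are placeholders standing in for any MDP environment and policy, not objects defined in this paper; the default values mirror the discount factor and step count later listed in Table 3.

```python
def discounted_return(rewards, gamma):
    """Cumulative reward R_t of Eq. (1): sum over k of gamma^k * r_{t+k}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def run_episode(env, agent, gamma=0.1, max_steps=504):
    """One closed-loop interaction: s_t -> a_t -> (r_t, s_{t+1}) -> ..."""
    state = env.reset()                          # initial environment state
    rewards = []
    for _ in range(max_steps):
        action = agent.act(state)                # action a_t chosen by the policy pi
        state, reward, done = env.step(action)   # environment feedback r_t and s_{t+1}
        rewards.append(reward)
        if done:
            break
    return discounted_return(rewards, gamma)
```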

2.2. DQN algorithm
DQN represents a novel network architecture rooted in Q-Learning [29]. Its primary methodology involves leveraging a neural network to model the action value function, denoted as |$Q( {s,a} )$|. This neural network approximates the action values |$Q( {{{s}_t},{{a}_t}} )$|, associated with all feasible actions |${{a}_t}$|, within each state |${{s}_t}$|. Consequently, the neural network employed for this purpose is often referred to as the Q network. DQN is particularly well-suited for addressing scenarios involving discrete actions. The algorithm efficiently processes the state |${{s}_t}$| by inputting it into the network, obtaining the corresponding action values for each possible action. The specific procedure of this algorithm is shown in Fig. 2.

To ensure superior and stable performance of the Q network during training, the algorithm incorporates an experience replay module and a target network structure [30]. The experience replay module involves the continuous collection of trajectory data |$\tau = ( {{{s}_t},{{a}_t},{{r}_t},{{s}_{t + 1}}} )$| by the agent interacting with the environment, guided by a greedy strategy. Once the data volume in the buffer exceeds a predefined threshold, batches of data are randomly sampled for training the Q network. The target network is designed with a structure identical to the earlier trained network |${{Q}_\omega }( {{{s}_t},{{a}_t}} )$| but differs in the frequency of parameter updates. The parameters |$\varpi $| of the target network |${{Q}_\varpi }( {{{s}_t},{{a}_t}} )$| are synchronized with the parameters |$\omega $| of the earlier network every C steps, enhancing the overall stability of the Q network training. Noteworthy is that the update of network parameters is achieved through a combination of minimizing the objective loss function and utilizing gradient descent. The computation of the loss function is defined as Eq. (2):

|$L(\omega ) = \frac{1}{N}\sum_{i = 1}^{N} {\left[ {{r}_i} + \gamma \max_a {{Q}_\varpi }( {{s}_{i + 1}},{{a}_{i + 1}} ) - {{Q}_\omega }( {{s}_i},{{a}_i} ) \right]}^2$| (2)

where N signifies the aggregate count of trajectory data, |${{Q}_\omega }( {{{s}_i},{{a}_i}} )$| denotes the earlier training network, |${{Q}_\varpi }( {{{s}_{i + 1}},{{a}_{i + 1}}} )$| represents the target network, |$\omega $| and |$\varpi $| stand for the training parameters of the respective networks, and |${{r}_i} + \gamma {{\max }_a}{{Q}_\varpi }( {{{s}_{i + 1}},{{a}_{i + 1}}} )$| is the action value derived from the output of the target network.
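As a concrete illustration of Eq. (2) and the target-network update, the sketch below computes the mean-squared TD loss over a sampled batch and copies |$\omega $| into the target parameters. It uses PyTorch purely for illustration and assumes `q_net` and `target_net` are any networks mapping a batch of states to one Q value per action; neither function is part of the authors' released code.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.1):
    """Loss of Eq. (2): squared error between Q_omega(s_i, a_i) and
    r_i + gamma * max_a Q_varpi(s_{i+1}, a), averaged over the batch."""
    states, actions, rewards, next_states = batch       # actions must be an int64 tensor
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                # the target network is not trained directly
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_values, targets)

def sync_target(q_net, target_net):
    """Copy the training parameters omega into the target network (done every C steps)."""
    target_net.load_state_dict(q_net.state_dict())
```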
3. Methodology
3.1. Data processing
Due to the discernible disparity in the overall action duration of railway point machines arising from diverse faults, the extracted experimental data falls short of meeting the dimension and numerical range criteria for the direct application of the diagnosis model. Addressing these shortcomings, the specific process consists of dimension adjustment and data normalization. On the one hand, the data dimension is reasonably determined by setting a threshold. If the experimental data dimension exceeds this threshold, redundant data is eliminated. Conversely, if the dimension is less than the threshold, vacant data is filled with zeros. At the same time, the min-max normalization method is used to mitigate distribution differences among various data, ensuring compatibility with the proposed model. The normalization calculation is shown in Eq. (3):

|${{x}_n} = \frac{{{x}_i} - {{x}_{\min }}}{{{x}_{\max }} - {{x}_{\min }}}$| (3)

where |${{x}_n}$| represents the processed data, |${{x}_i}$| denotes the data to be processed, |${{x}_{\max }}$| is the maximum value in the data to be processed and |${{x}_{\min }}$| represents the minimum value.
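A minimal sketch of the two preprocessing steps is given below, assuming the dimension threshold of 165 reported later in Section 4.1. The ordering of padding before normalization and the small epsilon guarding against constant signals are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def preprocess(sample, threshold=165):
    """Truncate or zero-pad the current curve to `threshold` points,
    then apply min-max normalization as in Eq. (3)."""
    x = np.asarray(sample, dtype=float)
    if x.size >= threshold:
        x = x[:threshold]                           # eliminate redundant data
    else:
        x = np.pad(x, (0, threshold - x.size))      # fill vacant data with zeros
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + 1e-12)    # epsilon avoids division by zero
```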
3.2. Developing the DRL framework for a fault diagnosis task
3.2.1. 1DCNN model
Within the proposed DRL diagnosis framework, the 1DCNN serves as the DL module primarily employed for constructing a fault diagnosis agent. Conventionally, convolutional neural network (CNN) has gained widespread recognition and application for processing two-dimensional data, with images being a typical example [31, 32]. Notably, 1DCNN demonstrates better processing capabilities when dealing with one-dimensional sequence data, such as current or vibration signals. As illustrated in Fig. 3, this convolution operation offers the flexibility to dynamically adjust the convolution kernel size, facilitating the rapid extraction of features.

Building upon the aforementioned description, the 1DCNN model designed in this study is depicted in Fig. 4. It comprises five key components: input, convolution layer, pooling layer, fully connected layer and output. The input module encompasses data representing various faults of point machines acquired through data processing. The convolutional structure consists of two layers, capturing diverse multiscale fault features by employing different numbers of convolution kernels. The two pooling layers utilize the max pooling method to diminish feature dimensions and mitigate the impact of irrelevant features. The fully connected layer consolidates all features following convolution and pooling operations, computing the corresponding Q value for different fault types through a function mapping relationship. Hence, the established 1DCNN model can proficiently attain precise estimation of the agent's action value.

The proposed 1DCNN model in this study (assuming the length of input data is 165).
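For readers who prefer code to a schematic, a possible PyTorch realization of the described structure is sketched below. The input length of 165, the 128 fully connected neurons and the seven output Q values (one per fault class F0 to F6) follow the text, while the exact channel counts (16 and 24) and the padding are assumptions loosely based on Section 4.2.2 rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PointMachine1DCNN(nn.Module):
    """Sketch of the 1DCNN agent: two convolution + max-pooling blocks followed by a
    fully connected head that outputs one Q value per fault class."""

    def __init__(self, input_len=165, n_classes=7, hidden=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Conv1d(16, 24, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        flat = 24 * (input_len // 4)              # each pooling layer halves the sequence length
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),         # Q value for each classification action
        )

    def forward(self, x):                         # x shape: (batch, 1, input_len)
        return self.head(self.features(x))
```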
3.2.2. Improved DQN algorithm
When applying DRL to the realm of fault diagnosis, it is imperative to select an appropriate algorithm tailored to the specific task [33]. Presently, DRL algorithms predominantly fall into two categories: those based on the value function and those based on the policy; typical on-policy policy-gradient algorithms include REINFORCE, Actor-Critic and PPO (Proximal Policy Optimization). In application, these algorithms reuse training samples sparingly, posing challenges in achieving satisfactory results, particularly in classification problems characterized by limited sample data. In contrast, an off-policy algorithm such as DQN stands out for its experience replay buffer design. This design enables the collection of ample samples, facilitating sufficient learning by the agent; it effectively enhances the utilization of training samples while mitigating the correlation between them. However, the standard DQN algorithm often falls short of achieving the desired results across different problems. In this research, we design a classification interactive environment and an agent training network based on the fault data type, sample size and distribution characteristics. The aim is to achieve greater stability and satisfactory cumulative rewards.
Interactive environment for fault classification
The core of establishing the interactive environment lies in employing MDP to vividly articulate the fault classification task and delineate each component [34]. Assuming the fault dataset used for training is denoted as |$D = \{ {{{k}_1},{{k}_2},{{k}_3}, \cdot \cdot \cdot ,{{k}_n}} \}$|, the set of label categories to which the fault sample belongs is |$L = \{ {{{l}_1},{{l}_2},{{l}_3}, \cdot \cdot \cdot ,{{l}_m}} \}$|, and each sample can be associated with the corresponding label, |${{k}_i} = {{l}_t},i \in [ {1,n} ],t \in [ {1,m} ]$|. The breakdown of each module in MDP is as follows:
State: there exists a specific correlation between the state set S and the training dataset D; each state |${{s}_t}$| at every time step corresponds to a sample |${{k}_i}$| in the dataset. Consequently, we can define the state set as

|$S = \{ {{{s}_1},{{s}_2},{{s}_3}, \cdot \cdot \cdot ,{{s}_n}} \} = \{ {{{k}_1},{{k}_2},{{k}_3}, \cdot \cdot \cdot ,{{k}_n}} \}$|

where S represents the set of states, i.e. the data sample set. At the initiation of the interaction, the state |${{s}_1}$| corresponds to the sample |${{k}_1}$|. Clearly, when the environment state is reset, it means that the fault sample is also randomly initialized.
Action: the classification actions executed by the agent are associated with fault categories, and these fault categories are consistent with the label categories. Consequently, the set of classification actions is defined as

|$A = \{ {{{a}_1},{{a}_2},{{a}_3}, \cdot \cdot \cdot ,{{a}_m}} \}$|

where m represents the total number of classification actions, which is also equivalent to the total number of fault types and label categories.
Reward: the formulation of reward is the key link in MDP; a well-designed reward function frequently yields optimal training outcomes [35]. In this study, a reward mechanism is devised based on the peculiarities of the fault classification task. If the action executed by the agent in the current time step aligns with the sample label, a reward is granted. Conversely, if the action contradicts the label, the reward is 0. The rules governing rewards are designed as

|${{r}_t} = \left\{ \begin{array}{ll} 1, & {{a}_t} = {{l}_t} \\ 0, & {{a}_t} \ne {{l}_t} \end{array} \right.$|

where |${{a}_t}$| denotes the action taken under the current state |${{s}_t}$|, and |${{a}_t} \in A,t \in [ {1,m} ]$|.
Discount factor: the discount factor |$\gamma $| typically falls within the range of |$[ {0,1} )$|. The closer the value is to 1, the greater emphasis is placed on long-term rewards. Conversely, the closer it is to 0, the more focus is directed towards recent rewards. In the fault classification problem, where instant rewards hold more significance, a smaller value for |$\gamma $| is employed.
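Taken together, these four components fully specify the interactive environment. A minimal Python sketch of such a classification environment is given below; the class name and the convention that an episode ends once the shuffled samples are exhausted are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

class FaultClassificationEnv:
    """Sketch of the interactive environment: states are fault samples, actions are
    predicted fault labels, and the reward is 1 for a correct label, 0 otherwise."""

    def __init__(self, samples, labels):
        self.samples = np.asarray(samples)    # training dataset D
        self.labels = np.asarray(labels)      # label categories L
        self.order = np.arange(len(labels))
        self.t = 0

    def reset(self):
        np.random.shuffle(self.order)         # random initialization of the fault samples
        self.t = 0
        return self.samples[self.order[self.t]]

    def step(self, action):
        reward = 1.0 if action == self.labels[self.order[self.t]] else 0.0
        self.t += 1
        done = self.t >= len(self.order)
        next_state = None if done else self.samples[self.order[self.t]]
        return next_state, reward, done
```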
Agent training network
Once the interactive environment is successfully established, the agent and the environment engage in interactive learning. This is achieved by optimizing the training network to acquire the ideal action value |${{Q}^*}$|. Subsequently, a satisfactory fault classification policy |${{\pi }^*}$| is derived, facilitating improved classification action selection. In this research, we will delve into two key aspects, enhancing the Q network and optimizing the greedy strategy.
Enhanced Q network: the Q network in the standard DQN algorithm comprises a basic fully connected layer structure, proving inadequate for robust data feature extraction. Therefore, we adopt the proposed 1DCNN model as a new Q network to better fit the action value. The preprocessed fault data serves as input, treated as a discrete state within the network. Following convolution, pooling and so on, the network yields corresponding Q values for various classification actions. The fitting relationship of the neural network is defined as

|$Q = {{f}_\theta }( {{s}_t} )$|

where |${{s}_t}$| represents the state of the current time step, |${{f}_\theta }$| is the function mapping relationship of the neural network, |$\theta $| denotes all the parameters of the 1DCNN model and Q denotes the output action value.
Optimized greedy strategy: the greedy strategy is commonly employed to facilitate exploration and exploitation during the interaction, aiding action selection to maximize cumulative rewards. Nevertheless, in the standard DQN, the greedy factor |$\varepsilon $| remains static, lacking a balance between exploration and exploitation. In this study, |$\varepsilon $| is engineered to linearly decay in proportion to the number of interaction steps. The specific expression is defined as

|$\varepsilon = \max ( {{{e}_{\min }},{{e}_0} - \eta u} )$|

|${{a}_t} = \left\{ \begin{array}{ll} \arg {{\max }_{a \in A}}\hat{Q}( a ), & \text{with probability}\ 1 - \varepsilon \\ \text{random}\ a \in A, & \text{with probability}\ \varepsilon \end{array} \right.$|

where |${{e}_{\min }}$| is the lower threshold of the greedy factor, |${{e}_0}$| is the initial value, |$\eta $| signifies the decay rate of the greedy factor per single time step, u stands for the current interaction step, |$\hat{Q}( a )$| is the assessment of cumulative reward expectation, |$\arg {{\max }_{a \in A}}\hat{Q}( a )$| is the selection of an action from the reward expectation evaluation set with the highest value, based on the probability |$1 - \varepsilon $|, and |$a \in A$| is the random selection of an action from the action set A, according to the probability |$\varepsilon $|, used to calculate the expected reward value upon execution.
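A small sketch of the linearly decaying greedy strategy follows, using the values |${{e}_0} = 0.02$|, |${{e}_{\min }} = 0.008$| and |$\eta = 0.001$| given in Section 4.2.2; the module-level NumPy random generator is an implementation convenience, not part of the paper.

```python
import numpy as np

_rng = np.random.default_rng()

def epsilon_at(step, e0=0.02, e_min=0.008, eta=0.001):
    """Linearly decayed greedy factor: e0 - eta * u, clipped from below at e_min."""
    return max(e_min, e0 - eta * step)

def select_action(q_values, step):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise exploit
    the action whose estimated Q value is largest."""
    if _rng.random() < epsilon_at(step):
        return int(_rng.integers(len(q_values)))   # random action a in A
    return int(np.argmax(q_values))                # argmax_a Q_hat(a)
```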
Building upon the analysis of each module, algorithm 1 elucidates the specific implementation of the improved DQN algorithm.
Algorithm 1: Improved DQN algorithm
Input: Training sample state set S and label set L, experience replay buffer capacity thresholds |${{R}_1},{{R}_2}$|, batch size B for training data extracted from the buffer, learning rate |$\alpha $|, discount factor |$\gamma $|, lower limit |${{e}_{\min }}$| of the greedy factor, initial value |${{e}_0}$| and decay rate |$\eta $|, update frequency C of the target network, the number of interaction steps T in a single epoch and the total number of epochs I
Output: The Q network parameters |$\theta = \{ {\omega ,\bar{\omega }} \}$| corresponding to the optimal action value |${{Q}^*}$|
1: Randomly initialize the parameters |$\omega $| of the earlier training network
2: Duplicate the parameters from |$\omega $| to |$\bar{\omega }$| and initialize the target network
3: for epoch = 1 to I do
4: Initialize the environment state set S
5: for t = 1 to T do
6: Based on the current state |${{s}_t}$|, select action |${{a}_t}$| using the optimized greedy strategy
7: Execute the action |${{a}_t}$| and, in conjunction with the predefined reward mechanism, obtain the reward |${{r}_t}$| through environmental feedback
8: After the completion of the agent's action, the environmental state transitions from |${{s}_t}$| to |${{s}_{t + 1}}$|
9: Store the trajectory data |$\tau = ( {{{s}_t},{{a}_t},{{r}_t},{{s}_{t + 1}}} )$| in the experience replay buffer
10: if the total amount of data reaches the lower threshold |${{R}_1}$| of the buffer capacity then
11: Sample trajectory data |$\tau $| of batch size B from the buffer for Q network training
12: Update the training parameters |$\omega $| using gradient descent on the loss function in Eq. (2)
13: Update the target network parameters |$\bar{\omega }$| every C time steps
14: end if
15: end for
16: end for
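Tying the previous sketches together, the following condensed Python rendition of Algorithm 1 is offered as an illustration only. It reuses the hypothetical helpers sketched earlier (FaultClassificationEnv, PointMachine1DCNN as `q_net` and `target_net`, select_action, dqn_loss and sync_target) and the parameter values of Table 3; none of these names come from the authors' code.

```python
import random
from collections import deque
import numpy as np
import torch

def train(env, q_net, target_net, epochs=600, steps_per_epoch=504,
          batch_size=32, r1=100, r2=10_000, gamma=0.1, lr=1e-3, c=10):
    buffer = deque(maxlen=r2)                          # experience replay buffer (capacity R2)
    optimizer = torch.optim.SGD(q_net.parameters(), lr=lr)
    target_net.load_state_dict(q_net.state_dict())     # initialize the target network from omega
    step = 0
    for _ in range(epochs):
        state = env.reset()
        for _ in range(steps_per_epoch):
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).view(1, 1, -1))[0]
            action = select_action(q.numpy(), step)     # optimized greedy strategy
            next_state, reward, done = env.step(action) # reward from the classification environment
            if next_state is not None:
                buffer.append((state, action, reward, next_state))
            if len(buffer) >= r1:                       # lower capacity threshold R1 reached
                s, a, r, s2 = zip(*random.sample(buffer, batch_size))
                batch = (torch.as_tensor(np.stack(s), dtype=torch.float32).unsqueeze(1),
                         torch.as_tensor(a, dtype=torch.int64),
                         torch.as_tensor(r, dtype=torch.float32),
                         torch.as_tensor(np.stack(s2), dtype=torch.float32).unsqueeze(1))
                loss = dqn_loss(q_net, target_net, batch, gamma)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            if step % c == 0:
                sync_target(q_net, target_net)          # target parameter update every C steps
            step += 1
            if done:
                break
            state = next_state
```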
3.3. Key steps of the proposed method
A novel method utilizing DRL for the fault diagnosis of railway point machines is introduced. The implementation process of this idea is illustrated in Fig. 5. The following steps provide a concise summary of this method.

The intricate structure of the fault diagnosis method proposed in this study.
Step 1: Extract the current data of railway point machines and ensure uniform sample distribution for each fault type. Employ data processing to standardize dimensions and normalize sample values, creating a fault dataset. Subsequently, divide the dataset into training and test samples according to an appropriate ratio.
Step 2: Formulate the fault classification task as a MDP and adeptly design an interactive environment. This environment encompasses essential modules such as state, action, reward and discount factors.
Step 3: Optimize the Q network and the greedy strategy to establish the agent training network. Integrate the components developed in step 2, synergizing the efforts to construct a comprehensive DRL framework.
Step 4: Following the details of the reward mechanism in the MDP and algorithm 1, engage in interactive learning between the agent and the environment. Explore the execution of optimal actions to obtain cumulative rewards, ultimately realizing the generation of an excellent classification policy.
Step 5: The training process of the network halts upon reaching the maximum number of epochs. Subsequently, test samples are reintroduced to evaluate the agent's classification performance, enabling a thorough assessment.
4. Experiment and result analysis
4.1. Experimental set-up and data description
Among the AC series railway point machines, the ZDJ9 is a notable example, encompassing key components such as an electric motor, reducer, friction clutch, ball screw, throwing rod, indication rod and switch circuit controller. Renowned for its ample conversion power and high reliability, the ZDJ9 point machine finds widespread usage in China's high-speed railway lines [36, 37].
In pursuit of intelligent fault diagnosis for railway point machines, we conducted experiments utilizing current data recorded during the operation under various fault conditions. This data was sourced from a centralized signal monitoring system within the Shenyang Railway Bureau, China. The current data is systematically collected by sensors at a sampling interval of 40 ms. Taking the ZDJ9 point machine as an example, its normal operation duration is approximately 5.8 s, corresponding to about 145 sampling points. Given the variance in current data due to different fault types, we applied the data-processing method outlined in section 3.1 to set a data dimension threshold of |$T = 165$| and normalized the data to maintain a numerical range between 0 and 1. Fig. 6 shows the standard movement stages of the ZDJ9 point machine, encompassing unlocking (A), conversion (B), locking (C) and the slow-down phases (D). During unlocking, 1DQJ and 2DQJ sequentially activate, establishing circuits for switch control. The electric motor propels the conversion process, reaching maximum output current. Subsequently, as the external locking device successfully opens, the current gradually decreases. Transitioning into the conversion and locking stages, the current curve generally remains below 2 A. Upon completion of the locking process, the turnout is secured, resulting in another decline in current value. The final stage involves the disconnection of the 1DQJ self-closing circuit, initiating the slow-down state. Simultaneously, the outdoor indication circuits establish a connection, maintaining the current value at approximately 0.5 A. After a defined duration, this phase concludes. The other types of fault data information are shown in Table 1.

Table 1. Fault data information.

Type | Fault phenomena | Sample size
---|---|---
F1 | During the unlocking, the current value experiences continuous fluctuations before reaching its peak. | 90
F2 | Around 1.8 s, the current value undergoes a rapid decline, plummeting from 2 A to 0. | 90
F3 | The current continues to output, failing to decrease to its normal level at the conclusion of the locking. | 90
F4 | After entering the slow release stage of 1DQJ, the current exhibits anomalies and subsequently vanishes. | 90
F5 | During the deceleration process, the current of 1DQJ significantly exceeds 0.5 A and consistently hovers between 0.7 A and 0.8 A. | 90
F6 | The duration of slowing for 1DQJ extends beyond the typical time range. | 90
Fig. 7 illustrates the correlation between current and time for various faults, primarily categorized into mechanical issues (F1, F3, F4) and suboptimal device performance (F2, F5, F6). For example, fault F1 arises from mechanical jamming in the turning shaft of the switch circuit controller, preventing the action of the moving contact group during the initial startup. Fault F2, caused by the poor performance of an open-phase protector (DBQ), leads to an interruption in output current, causing the 1DQJ self-closing circuit to be cut off and resulting in an instantaneous current value of 0. Notably, the abnormal performance of certain electronic devices can subtly create a high similarity between the fault current curve and the normal current curve. For instance, the current curve corresponding to F6 closely mirrors the waveform of the normal F0 curve. Analysis reveals that this similarity is due to inherent characteristics of electronic devices causing a short delay in the slow-down process. Therefore, incorporating experimental data akin to this enhances the persuasiveness of the proposed method in terms of classification performance.

4.2. Experimental verification and discussion
4.2.1. Selecting the right DRL algorithm
Presently, a multitude of algorithms tailored to task-specific demands exist, and astute selection often yields exceptional training outcomes. This study employs the fault data to experiment with both on-policy and off-policy algorithms, aiming to identify the most suitable fault identification algorithm. Initially, the 630 extracted samples are split into 504 training samples and 126 test samples according to an 8:2 ratio, as detailed in Table 2. Subsequently, the comparison experiment includes the policy-based algorithms REINFORCE, Actor-Critic, PPO and SAC (Soft Actor-Critic), while DQN serves as the value-based off-policy algorithm. The policy or value networks within these algorithms are configured as fully connected layers, with additional crucial parameter information provided in Table 3.
Table 2. Division of training and test data for each fault type.

Type | Size of training/testing data | Label
---|---|---
F0 | 72/18 | 0
F1 | 72/18 | 1
F2 | 72/18 | 2
F3 | 72/18 | 3
F4 | 72/18 | 4
F5 | 72/18 | 5
F6 | 72/18 | 6
Table 3. Key training parameters.

Parameter | Symbol | Value
---|---|---
Learning rate | |$\alpha $| | 0.001
Discount factor | |$\gamma $| | 0.1
Greedy factor | |$\varepsilon $| | 0.02
Capacity thresholds of the experience replay buffer | |${{R}_1},{{R}_2}$| | 100, 10000
Batch size | B | 32
Number of neurons in the fully connected layer | H | 128
Epochs | I | 600
Interaction steps in a single epoch | T | 504
Update frequency of the target network | C | 10
Note: The learning rate of the policy network in Actor-Critic, PPO and SAC is consistent with Table 3, and the learning rate of the value network is 1e-2.
Fig. 8 illustrates the reward generated by the five algorithms under the condition of 600 epochs. After meticulous analysis, as the number of interaction epochs increases, the cumulative reward of the SAC algorithm experiences a gradual rise, yet it remains below 300 even at the maximum epoch. Both REINFORCE and Actor-Critic, when trained up to approximately 450 epochs, achieve stable reward values hovering around 432. Notably, the reward curves of the two approaches nearly coincide during the entire training phase. In comparison to other algorithms, while PPO achieves a slightly higher cumulative reward than DQN when the training process is stable, it experiences a sharp decline around epoch 510, impacting algorithm stability. Conversely, DQN exhibits a rapid convergence speed and maintains a more stable cumulative reward. The reward curves indirectly validate that the off-policy algorithm DQN effectively utilizes training data and excels in fault classification tasks. This success establishes a solid foundation for further exploration of a DRL framework suitable for railway point machine fault identification.

4.2.2. Constructing the DL module and designing the greedy strategy
To ensure that the constructed DL module optimally fits the action value, and that the greedy strategy effectively balances the trade-off between exploration and exploitation, this study conducts experiments incorporating various DL network structures and greedy strategies. Combination 1 consists of a fully connected layer paired with a deterministic strategy. Combination 2 integrates a convolutional layer, a pooling layer and a fully connected layer with a deterministic strategy. Combination 3 utilizes a convolutional layer, a pooling layer and a fully connected layer with an exponential decay strategy. In these combinations, the convolutional layer's output channels are set to 16, utilizing a convolution kernel size of |$1 \times 3$| with a stride of 1. The pooling layer features a kernel size of |$1 \times 2$| with a stride of 2. Additionally, the fully connected layer is established with 128 neurons. In the deterministic strategy, we specify the greedy factor |$\varepsilon = 0.02$|. For the exponential decay strategy, both the exponential decay factor |$\lambda = 0.9$| and the initial greedy factor |${{e}_0} = 0.1$| are defined. Our selected combination involves two convolutional layers, two pooling layers and a fully connected layer, paired with a linear decay strategy. The number of output channels in the second convolutional layer is increased by 8, while the remaining parameters remain consistent with the aforementioned settings. In the linear decay strategy, we set the initial greedy factor |${{e}_0} = 0.02$|, the lower limit |${{e}_{\min }} = 0.008$| and the linear decay factor |$\eta = 0.001$|. Once the combination configuration is finalized, it is integrated into the DQN for training. The training reward curves under the four different combinations are shown in Fig. 9, and the determination of the optimal combination is based on the analysis of cumulative rewards.

After a comprehensive consideration involving the number of training samples and the reward mechanism outlined in section 3.2.2, it is evident that the cumulative reward range per training epoch spans from 0 to 504. This signifies that the maximum reward value is 504. As depicted in the results in Fig. 9 and the local view of the reward curve, under the consistent application of the greedy strategy, in contrast to combination 1, combination 2 displays a relatively slower convergence speed and a lower reward value in the initial stages of training, indicating a higher incidence of error recognition. While the reward curves of the two converge in the later stages, it is noteworthy that the design of the DL module exerts a discernible influence on the overall training outcomes. On the contrary, while combination 3 employs the exponential decay strategy, reaching 504 in certain local segments, the overall curve exhibits pronounced fluctuations, signifying a lack of stability in the training network. The curve obtained through the proposed combination not only closely aligns with the maximum reward curve but also indirectly reflects superior stability performance through reduced fluctuation levels.
To gauge the disparity in rewards between different combinations, we computed the average reward throughout the training process, as detailed in Table 4, revealing a clear trend: the proposed combination exhibits a higher average reward value compared to the other three combinations. Moreover, the average growth rate of rewards reaches 1.44%. These findings not only align with the reward curve depicted in Fig. 9 but also show that the designed 1DCNN and linear decay greedy strategy can obtain satisfactory results. These insights affirm that the agent network showcases excellent prowess in classification decision-making. This success signifies the development of a proficient DRL framework for diagnosing faults.
Table 4. Reward statistics for different combinations.

Combination | Overall average reward | Maximum cumulative reward
---|---|---
Combination 1 | 493.37 | 504
Combination 2 | 485.38 | 504
Combination 3 | 495.54 | 504
Our combination | 498.45 | 504
4.2.3. Evaluation of the superiority of the proposed method
In this study, we juxtapose the proposed diagnostic methods with classification algorithms grounded in DL and DRL, and conduct a thorough evaluation by computing diverse performance indicators to substantiate the proposed method's superiority. The specific parameter configurations for these methods employed in the experiment are as follows:
Method 1 is the Backpropagation (BP) neural network. The hidden layer comprises 32 neurons, with a learning rate set at 0.01, a momentum factor of 0.9 and a maximum iteration limit of 50.
Method 2 employs a Long Short-Term Memory (LSTM) model. The hidden layer consists of 32 neurons, with a single network layer. The learning rate is set to 0.01, the maximum number of iterations is 100 and the batch size is 32. The network parameters are updated using the cross-entropy loss function and the stochastic gradient descent optimizer.
Method 3 is the 1DCNN model. This model comprises a convolutional layer, a pooling layer and a fully connected layer. The parameter configurations for the convolution layer and pooling layer are detailed in section 4.2.2. The fully connected layer consists of 32 neurons, with a maximum iteration limit of 50. The other parameter updates follow the same way as those in the LSTM model.
Method 4 entails a diagnosis model rooted in DRL, as introduced in reference [26]. The construction of the agent involves employing a multiscale residual convolutional neural network (MRCNN). Three independent one-dimensional convolution modules are utilized, featuring output channels of 32, 64 and 128, respectively. The convolution kernel size is set to |$1 \times 3$|, with a stride of 1, and the output channel of the residual unit is 224. In the reward mechanism, a reward of 1 is assigned for correct classification, while for incorrect classification the reward is set to −1. The greedy factor |$\varepsilon $| follows a linear decay, transitioning from 0.01 to 0.008. The discount factor is 0.1 and the number of training epochs is 600.
Method 5 introduces an SAE-based DRL diagnosis model, as proposed in reference [24]. The SAE comprises a network structure with four layers. The input layer's unit count is determined by the input sample's dimensionality. The two encoder layers consist of 32 and 16 units, respectively. The output layer's unit count is contingent upon the dimension of the Q value. The configuration of the reward mechanism, discount factor and training epochs aligns with that detailed in method 4. The update of the greedy factor |$\varepsilon $| is consistent with the research in this paper.
The proposed method. Within the DRL fault diagnosis framework in this study, the 1DCNN model is incorporated as the DL component and the structural specifics and parameter details of the model align with the optimal combination outlined in section 4.2.2. The design of the greedy strategy within the RL component mirrors that elucidated in optimal combination of section 4.2.2. Additionally, other training parameters remain in accordance with those presented in Table 3.
Based on the designed comparative experiments, we input 504 sets of training data into various classification models to derive training outcomes, as depicted in Fig. 10. Subsequently, upon completion of model training, an additional 126 sets of test data are employed to assess the classification performance. The test results are presented in Table 5. Combined with the reward mechanism in this study, the formula for calculating classification accuracy is defined as

|$\text{Accuracy} = \frac{{{Y}_1}}{{{Y}_2}} \times 100\% $|

where |${{Y}_1}$| is the average cumulative reward in the training or testing process, that is, the number of correctly classified samples, and |${{Y}_2}$| denotes the maximum cumulative reward achievable during the training or testing, that is, the total number of training samples or test samples.
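As a quick numerical check of this formula, the snippet below reproduces the single-iteration test accuracy reported later in Table 6 (an average cumulative reward of 124 against a maximum of 126); the function name is chosen here purely for illustration.

```python
def classification_accuracy(avg_cumulative_reward, max_cumulative_reward):
    """Accuracy as defined above: Y1 / Y2, expressed as a percentage."""
    return 100.0 * avg_cumulative_reward / max_cumulative_reward

print(round(classification_accuracy(124, 126), 2))  # 98.41
```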

Table 5. Testing accuracy of different classification methods.

Classification method | Testing accuracy (%)
---|---
Method 1 | 88.09
Method 2 | 92.06
Method 3 | 96.83
Method 4 | 97.62
Method 5 | 97.62
Our method | 98.41
Fig. 10 shows the accuracy of various methods during the training phase. The BP neural network, a classical classification method, achieves an accuracy of 90.87% through continuous parameter optimization. As currently popular intelligent algorithms, LSTM and 1DCNN achieve 4.93% and 7.35% higher accuracy than BP, respectively. Methods 4 and 5 further enhance training accuracy, underscoring the advantages of DRL in the domain of fault diagnosis. Concurrently, when compared to the aforementioned methods, the proposed approach achieves a remarkable training accuracy of 98.9%.
This outcome represents significant progress in model enhancement, indicating that an optimal fault classification strategy can be obtained more readily through the proposed approach. Simultaneously, referencing the findings in Table 5, it becomes apparent that the proposed method achieves an impressive classification accuracy of 98.41% on the test data, showcasing a high level of consistency with the performance observed during the training phase. This strongly indicates that when employing the DRL method for the railway point machine fault diagnosis task, its inherent independent exploration capability not only enhances the intelligence of diagnosis but also yields more satisfactory results.
Upon analysing the training and test results across the various methods mentioned above, the superiority of the proposed method is substantiated to a certain extent. Next, we will employ classification result visualization techniques and construct a multi-classification confusion matrix to comprehensively illustrate the classification performance of different methods.
Fig. 11 intuitively reflects the classification outcomes of different methods on the test data. It is evident that the BP neural network misclassifies the majority of F6 fault data as F5, leading to a notable decrease in overall testing accuracy. The LSTM model shows a slight improvement in the classification of F6, but some F0 fault data is misclassified, resulting in marginal enhancement in testing accuracy. Similarly, the 1DCNN model exhibits F0 error recognition, but its classification results for other fault data are ideal. The overall number of misclassifications for methods 4 and 5 has been significantly reduced, with a primary focus on addressing fault types F2 and F6. In contrast, the proposed method only misclassifies two samples, showcasing superior classification performance.

The schematic diagram depicting correct and incorrect classifications: (a) method 1; (b) method 2; (c) method 3; (d) method 4; (e) method 5; (f) proposed method.
To quantitatively analyse the classification results for different faults, we have generated the multi-classification confusion matrix illustrated in Fig. 12. This matrix provides insights into the probability of each fault being correctly classified, based on the relationship between the actual label and the predicted label. Both BP and LSTM exhibit misclassification in two fault types, with particularly poor accuracy for F6 fault data at only 22% and 67%. The results for 1DCNN show relative improvement, and a detailed analysis of the current curves for F0 and F6 reveals highly similar waveforms, underscoring a limitation in 1DCNN's ability to classify such similar samples. While methods 4 and 5 also encompass error recognition for various fault types, their overall accuracy surpasses other approaches. Fortunately, the proposed method misclassifies samples of only a single fault type, and even for that type maintains a commendable accuracy of 88%, demonstrating more stable classification performance.

The confusion matrix diagram for multi-class classification: (a) method 1; (b) method 2; (c) method 3; (d) method 4; (e) method 5; (f) proposed method.
4.2.4. Evaluation of the reliability of the proposed method
The various comparative experiments conducted effectively validate the efficacy of the fault diagnosis method, built upon the DRL framework. This is evidenced by its strong performance in both training cumulative reward and classification test accuracy. However, it remains crucial to thoroughly validate the reliability of the proposed method. In our approach, we opted to vary the iterations, enabling the agent to execute multiple runs on the test data. This observation aimed to determine whether the accuracy demonstrated substantial differences across iterations. The classification test results across these diverse iterations are as follows.
Fig. 13 illustrates the reward distribution of test data after a single iteration, which represents the test process chosen by the proposed method in this study. Notably, only two out of 126 samples received a reward of 0, indicating a misclassification count of two.

Figs. 14 and 15 display the distribution of cumulative reward when iterating the test data 50 and 100 times, respectively. The cumulative reward values predominantly cluster within the range of 123 to 126. Combined with Table 6, the average reward values for these iterations are 124.82 and 124.97, corresponding to accuracies of 99.06% and 99.18%. Compared with the 98.41% accuracy achieved through a single iteration, the classification accuracy demonstrates improvement and tends towards stability with increased iterations. This underscores that, guided by the optimal classification policy obtained during the training phase, the agent depends on its increasing familiarity with the data environment through multiple iterations to attain more stable and precise recognition results. Moreover, in the context of fault classification using DL, test outcomes exhibit a notable reliance on the trained model. Following multiple training sessions, test results might exhibit random fluctuations. Consequently, the DRL method introduced in this study adeptly integrates DL's perceptual capabilities with RL's autonomous exploration prowess in the railway point machine fault diagnosis task, thereby achieving a superior classification effect.


Table 6. Test results under different numbers of iterations.

Iteration | Average cumulative reward | Maximum cumulative reward | Accuracy (%)
---|---|---|---
1 | 124.00 | 126 | 98.41
50 | 124.82 | 126 | 99.06
100 | 124.97 | 126 | 99.18
5. Conclusions
Considering the constraints posed by the current level of intelligence in railway point machines fault diagnosis, we refine the 1DCNN model to form the agent. Simultaneously, we optimize the DQN algorithm to enhance independent exploration capabilities, ultimately creating a robust DRL-based diagnostic framework.
Importantly, many comparative experiments underscore the strong performance of the proposed method; the infusion of independent exploration ability enables fault diagnosis to be addressed in a human-like thought process, thereby substantially enhancing diagnostic intelligence. Furthermore, the flexibility of the Q network in the DQN allows for its adaptable design with reference to other DL models. This adaptability enables the proposed DRL framework to be effectively extended to diverse types of railway point machines, such as S700K, ZYJ7 and beyond. Notably, it is also crucial to explore enhanced reward mechanisms to align with the requirements of fault diagnosis scenarios. Looking to the future, we aim to engage in profound contemplation and detailed analysis from this standpoint. We are also committed to applying innovative fault diagnosis methods to delve into the research of additional crucial railway equipment, so as to systematically evaluate the health status of these components for enhanced effectiveness.
Acknowledgements
This research was supported by the Transportation Science and Technology Project of the Liaoning Provincial Department of Education (Grant No. 202243), the Provincial Key Laboratory Project (Grant No. GJZZX2022KF05) and the Natural Science Foundation of Liaoning Province (Grant No. 2019-ZD-0094).
Author contributions statement
Shuai Xiao designed the methodology. Shuai Xiao and Qingsheng Feng carried out data analysis and wrote the manuscript. Xue Li and Hong Li helped organize the manuscript.
Conflict of interest statement
There was no conflict of interest in the submission of the manuscript, and all authors agreed to publish it.