Abstract

The advanced diagnosis of faults in railway point machines is crucial for ensuring the smooth operation of the turnout conversion system and the safe functioning of trains. Signal processing and deep learning-based methods have been extensively explored in the realm of fault diagnosis. While these approaches effectively extract fault features and facilitate the creation of end-to-end diagnostic models, they often demand considerable expert experience and manual intervention in feature selection, structural construction and parameter optimization of neural networks. This reliance on manual effort can result in weak generalization performance and a lack of intelligence in the model. To address these challenges, this study introduces an intelligent fault diagnosis method based on deep reinforcement learning (DRL). Initially, a one-dimensional convolutional neural network agent is established, leveraging the specific characteristics of point machine fault data to automatically extract diverse features across multiple scales. Subsequently, a deep Q network (DQN) is incorporated as the central component of the diagnostic framework: the fault classification interactive environment is meticulously designed, and the agent training network is optimized. Through extensive interaction between the agent and the environment using fault data, satisfactory cumulative rewards and effective fault classification strategies are achieved. Experimental results demonstrate the proposed method's high efficacy, with a training accuracy of 98.9% and a test accuracy of 98.41%. Notably, applying DRL to the fault diagnosis of railway point machines enhances the intelligence of the diagnostic process, particularly through its excellent independent exploration capability.

1. Introduction

China's railway is currently experiencing a phase of rapid development. The collaborative synergy between various pieces of key equipment contributes significantly to the secure and steady operation of trains [1–3]. The turnout conversion system holds a pivotal role in high-speed and heavy-haul railway transportation. Serving as a crucial component within this system, railway point machines precisely manage the train's running direction by adjusting the position of the turnout [4, 5]. The normal operation of point machines is therefore imperative: any failure can bring the turnout conversion system to a complete halt, preventing high-speed trains from changing their operating lines as required. In more severe scenarios, such malfunctions can result in major accidents such as train derailments, posing incalculable risks to personnel safety and economic well-being [6–9].

Research into methods for diagnosing faults in railway point machines has stimulated in-depth discussions among scholars. These studies primarily utilize data-driven approaches combined with signal-processing and supervised-learning methods. Huang et al. [10] employed the Fréchet distance and similarity function to measure the similarity between the standard current curve and the curve under examination, providing an initial assessment of the operational status of point machines. Wei et al. [11] introduced the ensemble empirical mode decomposition (EEMD) and fuzzy entropy theory to process the power signal of point machines, combining the grey correlation algorithm for swift diagnosis. However, the former can only determine the occurrence of a fault and cannot provide information regarding the specific fault type. The latter is susceptible to limitations, particularly when faced with a large dataset, as fault curves with high similarity may result in poor discrimination performance. The rapid advancement of machine learning (ML) and deep learning (DL) has propelled intelligent fault diagnosis to new heights [12]. Supported by substantial data, the algorithm is anticipated to overcome the previously mentioned issues. Chen et al. [13] proposed a deep residual convolutional neural network to extract local features from power curves, integrating a multi-head self-attention mechanism to focus on key features. The combination of these approaches facilitates efficient diagnosis of faults. Wang and Li [14] devised a composite model of neural network; by shallowly extracting and deeply mining the power signal characteristics, the classifier produces ideal recognition results. In contrast to the above diagnosis methods that rely on electrical signals, Cao et al. [15] focused on extracting sound signals from point machines, employing two-stage feature selection and ensemble learning to effectively classify faults. In a similar vein, Sun et al. [16] applied variational mode decomposition (VMD) to process the vibration signals associated with the fault states, synthesized multiscale features to construct a feature set and utilized support vector machine (SVM) for fault identification. The exploration of both sound and vibration signals introduces a novel perspective to the fault diagnosis of railway point machines. Particularly, the vibration signal, with its inherent stability and rich feature information, holds representative value in fault identification research.

A review of the literature reveals that the fusion of signal processing and ML can serve as an effective method for fault identification. However, when dealing with high-dimensional fault data, signal decomposition is not only time-consuming but also heavily reliant on manual expertise for feature selection and classifier parameter tuning. DL-based diagnostic methods can adaptively extract fault features and comprehensively analyse the intrinsic relationships between these features, leading to satisfactory outcomes [17, 18]. Yet, these methods have their limitations: (1) When addressing specific tasks, optimizing network structures and parameters often depends on substantial human intervention and effort, which can result in poor generalization capability. (2) The essence of the supervised-learning mechanism is repeated training within a given data distribution to identify a strong classification model; because no additional incentive shapes the learning process, the independent exploration ability of DL methods in fault diagnosis still needs improvement. Fortunately, another product of artificial intelligence (AI), reinforcement learning (RL), has been effectively combined with DL, harnessing its remarkable self-exploration ability alongside the strong perceptual capabilities of DL. The result, known as deep reinforcement learning (DRL), is further propelling the evolution of intelligent fault diagnosis [19, 20].

To date, DRL has found extensive application in areas such as automatic control, intelligent scheduling and unmanned driving [21–23]. More recently, researchers have begun to employ it for mechanical fault identification, marking a novel approach in this domain. Ding et al. [24] achieved fault diagnosis in rotating machinery by integrating a stacked autoencoder (SAE) network with DQN (Deep Q Network), providing substantial evidence for the effectiveness of the DRL model in this domain. Wang and Xuan [25] proposed a diagnosis method for bearings and tools, leveraging one-dimensional convolution and an enhanced Actor-Critic algorithm to achieve superior recognition outcomes. Wang et al. [26] integrated DQN and transfer learning to achieve fault diagnosis of planetary gearboxes under varying operating conditions. Their findings validate that DRL possesses distinct advantages over some DL methods. Nevertheless, when dealing with various types of fault data, finding a harmonious blend between DL's network structure and RL's reward mechanisms and strategies still requires further exploration. To this end, this research introduces an intelligent fault diagnosis framework for railway point machines, utilizing a one-dimensional convolutional neural network (1DCNN) and an optimized DQN algorithm. The primary contributions of this work are as follows:

  1. Combining the characteristics of the current data from point machines, a fault diagnosis agent based on the 1DCNN model is thoughtfully crafted to automatically extract multiscale features, mitigating the need for excessive dependence on manual design.

  2. The DQN algorithm serves as the core element in the proposed diagnosis framework and is further optimized. This optimization encompasses the establishment of a classification interactive environment and the exploration of an agent training network to obtain more gratifying cumulative rewards.

  3. The meticulously designed reward mechanism is closely tailored to the fault diagnosis scenario, which promotes the agent to generate excellent classification policy after extensive interaction with the data environment, ultimately leading to more favourable diagnostic outcomes.

The remaining sections of this study are outlined as follows. Section 2 introduces the relevant foundational theory of RL. Section 3 describes the data processing and the construction of the DRL fault diagnosis framework. A detailed analysis of the experimental configuration and diagnostic results is provided in Section 4. Section 5 offers a comprehensive summary of the research.

2. Foundational theory of RL

2.1. Principle of RL

RL is a computational approach wherein an agent engages with the environment iteratively to achieve a desired objective [27]. Fig. 1 illustrates the detailed interaction process. Guided by a policy |$\pi $|⁠, the agent takes action |${{a}_t}$| to alter the present environmental state |${{s}_t}$|⁠. Simultaneously, the environment provides corresponding reward |${{r}_t}$| based on the action execution, and the feedback on the environmental state |${{s}_{t + 1}}$| is conveyed to the agent in the subsequent moment, creating a closed-loop process. In the field of RL, the environment is essentially conceptualized as a Markov decision process (MDP), detailing the evolution of state information and the mechanism of reciprocal transitions between states [28]. The interplay between policy |$\pi $|⁠, discount factor |$\gamma $| and reward function is considered to optimize the expectation of cumulative reward |${{R}_t}$|⁠, as in Eq. (1).

|$R_t = \mathbb{E}\big[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \big]$| (1)

where |${{r}_{t + k}}$| denotes the reward received at moment |$t + k$|.
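
As a quick numerical illustration of Eq. (1), the short Python sketch below accumulates discounted rewards over a trajectory; the reward values are hypothetical and the snippet is only a didactic aid, not part of the proposed framework.

```python
# Minimal sketch of Eq. (1): R_t = sum_k gamma^k * r_{t+k}
def discounted_return(rewards, gamma):
    total = 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
    return total

# Hypothetical per-step rewards with the small discount factor (gamma = 0.1)
# used later for the classification task.
print(discounted_return([1, 0, 1, 1], gamma=0.1))  # 1.011
```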

Fig. 1. The dynamic interaction between the agent and the environment.

2.2. DQN algorithm

DQN represents a novel network architecture rooted in Q-Learning [29]. Its primary methodology involves leveraging a neural network to model the action value function, denoted as |$Q( {s,a} )$|⁠. This neural network approximates the action values |$Q( {{{s}_t},{{a}_t}} )$|⁠, associated with all feasible actions |${{a}_t}$|⁠, within each state |${{s}_t}$|⁠. Consequently, the neural network employed for this purpose is often referred to as the Q network. DQN is particularly well-suited for addressing scenarios involving discrete actions. The algorithm efficiently processes the state |${{s}_t}$| by inputting it into the network, obtaining the corresponding action values for each possible action. The specific procedure of this algorithm is shown in Fig. 2.

Fig. 2. The execution process of the DQN algorithm.

To ensure superior and stable performance of the Q network during training, the algorithm incorporates an experience replay module and a target network structure [30]. In the experience replay module, the agent continuously collects trajectory data |$\tau = ( {{{s}_t},{{a}_t},{{r}_t},{{s}_{t + 1}}} )$| while interacting with the environment, guided by a greedy strategy. Once the data volume in the buffer exceeds a predefined threshold, batches of data are randomly sampled for training the Q network. The target network has a structure identical to that of the earlier trained network |${{Q}_\omega }( {{{s}_t},{{a}_t}} )$| but differs in the frequency of parameter updates: the parameters |$\varpi $| of the target network |${{Q}_\varpi }( {{{s}_t},{{a}_t}} )$| are synchronized with the parameters |$\omega $| of the earlier network every C steps, enhancing the overall stability of Q network training. Noteworthy is that the update of network parameters is achieved by minimizing the objective loss function through gradient descent. The loss function is defined as Eq. (2).

|$L( \omega ) = \frac{1}{N}\sum_{i=1}^{N} \big[ r_i + \gamma \max_a Q_\varpi ( s_{i+1}, a_{i+1} ) - Q_\omega ( s_i, a_i ) \big]^2$| (2)

where N signifies the aggregate count of trajectory data, |${{Q}_\omega }( {{{s}_i},{{a}_i}} )$| denotes the earlier training network, |${{Q}_\varpi }( {{{s}_{i + 1}},{{a}_{i + 1}}} )$| represents the target network, |$\omega $| and |$\varpi $| stand for the training parameters of the respective networks, and |${{r}_i} + \gamma {{\max }_a}{{Q}_\varpi }( {{{s}_{i + 1}},{{a}_{i + 1}}} )$| is the action value derived from the output of the target network.
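
For concreteness, the following PyTorch-style sketch shows one way the loss of Eq. (2) and the periodic parameter synchronization could be computed; the function names, the batch layout and the use of a mean-squared error are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma):
    """Temporal-difference loss of Eq. (2) for a sampled batch (s, a, r, s')."""
    states, actions, rewards, next_states = batch
    # Q_omega(s_i, a_i): value of the action actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r_i + gamma * max_a Q_varpi(s_{i+1}, a): target from the target network
        targets = rewards + gamma * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_values, targets)

def sync_target(q_net, target_net):
    """Copy the training parameters omega to the target parameters every C steps."""
    target_net.load_state_dict(q_net.state_dict())
```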

3. Methodology

3.1. Data processing

Due to the discernible disparity in the overall action duration of railway point machines arising from diverse faults, the extracted experimental data falls short of meeting the dimension and numerical-range criteria for direct application of the diagnosis model. To address these shortcomings, the processing consists of dimension adjustment and data normalization. On the one hand, the data dimension is fixed by setting a threshold: if the dimension of the experimental data exceeds this threshold, redundant data is eliminated; conversely, if the dimension is less than the threshold, vacant positions are filled with zeros. At the same time, min-max normalization is used to mitigate distribution differences among the data, ensuring compatibility with the proposed model. The normalization calculation is shown in Eq. (3).

|$x_n = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$| (3)

where |${{x}_n}$| represents the processed data, |${{x}_i}$| denotes the data to be processed, |${{x}_{\max }}$| is the maximum value in the data to be processed and |${{x}_{\min }}$| represents the minimum value.
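
A minimal sketch of this preprocessing step is shown below, assuming the dimension threshold of 165 points adopted later in section 4.1; the function name and the small constant guarding against a zero range are illustrative choices.

```python
import numpy as np

def preprocess(signal, threshold=165):
    """Truncate or zero-pad a current curve to a fixed length, then min-max normalize (Eq. (3))."""
    x = np.asarray(signal, dtype=float)
    if len(x) >= threshold:
        x = x[:threshold]                            # eliminate redundant points
    else:
        x = np.pad(x, (0, threshold - len(x)))       # fill vacant points with zeros
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min + 1e-12)     # guard against a constant signal
```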

3.2. Developing the DRL framework for a fault diagnosis task

3.2.1. 1DCNN model

Within the proposed DRL diagnosis framework, the 1DCNN serves as the DL module primarily employed for constructing a fault diagnosis agent. Conventionally, convolutional neural network (CNN) has gained widespread recognition and application for processing two-dimensional data, with images being a typical example [31, 32]. Notably, 1DCNN demonstrates better processing capabilities when dealing with one-dimensional sequence data, such as current or vibration signals. As illustrated in Fig. 3, this convolution operation offers the flexibility to dynamically adjust the convolution kernel size, facilitating the rapid extraction of features.

Fig. 3. One-dimensional convolution operation.

Building upon the aforementioned description, the 1DCNN model designed in this study is depicted in Fig. 4. It comprises five key components: input, convolution layers, pooling layers, a fully connected layer and output. The input module encompasses data representing various faults of point machines acquired through data processing. The convolutional structure consists of two layers, capturing diverse multiscale fault features by employing different numbers of convolution kernels. The two pooling layers utilize max pooling to reduce feature dimensions and mitigate the impact of irrelevant features. The fully connected layer consolidates all features following the convolution and pooling operations, computing the corresponding Q value for each fault type through a function mapping relationship. Hence, the established 1DCNN model can proficiently attain precise estimation of the agent's action value.

Fig. 4. The proposed 1DCNN model in this study (assuming the length of input data is 165).
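
The PyTorch sketch below illustrates one possible realization of this 1DCNN Q network. The kernel and pooling sizes follow the settings given in section 4.2.2; the channel count of the second convolutional layer (24) and the seven output classes are assumptions made for illustration where the paper is not explicit.

```python
import torch
import torch.nn as nn

class QNetwork1DCNN(nn.Module):
    """Sketch of the 1DCNN agent: two convolution/pooling stages and a fully
    connected head that outputs one Q value per fault class."""
    def __init__(self, input_len=165, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Conv1d(16, 24, kernel_size=3, stride=1), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        with torch.no_grad():  # infer the flattened feature size from a dummy pass
            flat = self.features(torch.zeros(1, 1, input_len)).numel()
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 128), nn.ReLU(),
            nn.Linear(128, num_classes),  # Q value for each fault class/action
        )

    def forward(self, x):  # x: (batch, 1, input_len)
        return self.head(self.features(x))
```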

3.2.2. Improved DQN algorithm

When applying DRL to the realm of fault diagnosis, it is imperative to select an appropriate algorithm tailored to the specific task [33]. Presently, DRL algorithms predominantly fall into two categories: those based on the value function and those based on the policy, the latter including typical on-policy policy-gradient algorithms such as REINFORCE, Actor-Critic and PPO (Proximal Policy Optimization). Because on-policy algorithms rarely reuse training samples, they struggle to achieve satisfactory results, particularly in classification problems characterized by limited sample data. In contrast, an off-policy algorithm such as DQN stands out for its experience replay buffer design. This design enables the collection of ample samples, facilitating sufficient learning by the agent; it effectively enhances the utilization of training samples while mitigating the correlation between them. However, the standard DQN algorithm often falls short of achieving the desired results across different problems. In this research, we design a classification interactive environment and an agent training network based on the fault data type, sample size and distribution characteristics. The aim is to achieve greater stability and satisfactory cumulative rewards.

  • Interactive environment for fault classification

The core of establishing the interactive environment lies in employing an MDP to articulate the fault classification task and delineate each component [34]. Assume the fault dataset used for training is denoted as |$D = \{ {{{k}_1},{{k}_2},{{k}_3}, \cdots ,{{k}_n}} \}$| and the set of label categories to which the fault samples belong is |$L = \{ {{{l}_1},{{l}_2},{{l}_3}, \cdots ,{{l}_m}} \}$|; each sample |${{k}_i}$| is associated with a corresponding label |${{l}_t}$|, |$i \in [ {1,n} ],t \in [ {1,m} ]$|. The breakdown of each module in the MDP is as follows:

State: there exists a specific correlation between the state set S and the training dataset D; each state |${{s}_t}$| at every time step corresponds to a sample |${{k}_i}$| in the dataset. Consequently, we can define the state set as

|$S = \{ s_1, s_2, s_3, \cdots, s_n \} = \{ k_1, k_2, k_3, \cdots, k_n \}$| (4)

where S represents the set of state, i.e. the data sample set. At the initiation of the interaction, the state |${{s}_1}$| corresponds to the sample |${{k}_1}$|⁠. Clearly, when the environment state is reset, it means that the fault sample is also randomly initialized.

Action: the classification actions executed by the agent are associated with fault categories, and these fault categories are consistent with the label categories. Consequently, the set of classification actions is defined as

|$A = \{ a_1, a_2, a_3, \cdots, a_m \}$| (5)

where m represents the total number of classification actions, which is also equivalent to the total number of fault types and label categories.

Reward: the formulation of the reward is the key link in the MDP; a well-designed reward function frequently yields optimal training outcomes [35]. In this study, a reward mechanism is devised based on the peculiarities of the fault classification task. If the action executed by the agent in the current time step aligns with the sample label, a reward of 1 is granted; conversely, if the action contradicts the label, the reward is 0. The rules governing rewards are designed as

|$r_t = \begin{cases} 1, & a_t = l_t \\ 0, & a_t \ne l_t \end{cases}$| (6)

where |${{a}_t}$| denotes the action taken under the current state |${{s}_t}$|⁠, and |${{a}_t} \in A,t \in [ {1,m} ]$|⁠.

Discount factor: the discount factor |$\gamma $| typically falls within the range of |$[ {0,1} )$|⁠. The closer the value is to 1, the greater emphasis is placed on long-term rewards. Conversely, the closer it is to 0, the more focus is directed towards recent rewards. In the fault classification problem, where instant rewards hold more significance, a smaller value for |$\gamma $| is employed.
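
A minimal sketch of this classification MDP as an interactive environment is given below; the class name, the shuffling at reset and the episode termination rule are simplifying assumptions consistent with the description above.

```python
import numpy as np

class FaultClassificationEnv:
    """Each state is one preprocessed fault sample; the action is a predicted fault
    label; the reward follows Eq. (6): 1 for a correct prediction, 0 otherwise."""
    def __init__(self, samples, labels):
        self.samples = samples              # array of shape (n, 165)
        self.labels = labels                # array of shape (n,)
        self.order = np.arange(len(samples))
        self.t = 0

    def reset(self):
        np.random.shuffle(self.order)       # random initialization of the fault sample
        self.t = 0
        return self.samples[self.order[self.t]]

    def step(self, action):
        reward = 1.0 if action == self.labels[self.order[self.t]] else 0.0
        self.t += 1
        done = self.t >= len(self.samples)  # one pass over the training set per epoch
        next_state = None if done else self.samples[self.order[self.t]]
        return next_state, reward, done
```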

  • Agent training network

Once the interactive environment is successfully established, the agent and the environment engage in interactive learning. This is achieved by optimizing the training network to acquire the ideal action value |${{Q}^*}$|⁠. Subsequently, a satisfactory fault classification policy |${{\pi }^*}$| is derived, facilitating improved classification action selection. In this research, we will delve into two key aspects, enhancing the Q network and optimizing the greedy strategy.

Enhanced Q network: the Q network in the standard DQN algorithm comprises a basic fully connected layer structure, proving inadequate for robust data feature extraction. Therefore, we adopt the proposed 1DCNN model as a new Q network to better fit the action value. The preprocessed fault data serves as input, treated as a discrete state within the network. Following convolution, pooling and so on, the network yields corresponding Q values for various classification actions. The fitting relationship of the neural network is defined as

|$Q = f_\theta ( s_t )$| (7)

where |${{s}_t}$| represents the state of the current time step, |${{f}_\theta }$| is the function mapping relationship of neural network, |$\theta $| is all the parameters of 1DCNN model and Q denotes the output action value.

Optimized greedy strategy: the greedy strategy is commonly employed to facilitate exploration and exploitation during the interaction, aiding action selection to maximize cumulative rewards. Nevertheless, in the standard DQN, the greedy factor |$\varepsilon $| remains static, lacking a balance between exploration and exploitation. In this study, |$\varepsilon $| is engineered to decay linearly with the number of interaction steps. The specific expressions are defined as

|$\varepsilon = \max ( e_{\min},\, e_0 - \eta u )$| (8)
|$a_t = \begin{cases} \arg \max_{a \in A} \hat{Q}( a ), & \text{with probability } 1 - \varepsilon \\ \text{random } a \in A, & \text{with probability } \varepsilon \end{cases}$| (9)

where |${{e}_{\min }}$| is the lower threshold of the greedy factor, |${{e}_0}$| is its initial value, |$\eta $| signifies the decay rate of the greedy factor per time step, u stands for the current interaction step, |$\hat{Q}( a )$| is the estimate of the cumulative reward expectation of action a, |$\arg {{\max }_{a \in A}}\hat{Q}( a )$| selects the action with the highest estimated value (chosen with probability |$1 - \varepsilon $|), and a random action |$a \in A$| is selected from the action set A with probability |$\varepsilon $|.
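
The linear decay of Eqs. (8) and (9) can be sketched as follows, using the numerical values adopted in section 4.2.2; the function names are illustrative.

```python
import numpy as np

def epsilon(step, e0=0.02, e_min=0.008, eta=0.001):
    """Eq. (8): the greedy factor decays linearly with the interaction step,
    bounded below by e_min (values follow section 4.2.2)."""
    return max(e_min, e0 - eta * step)

def select_action(q_values, step, rng=np.random.default_rng()):
    """Eq. (9): exploit argmax Q with probability 1 - epsilon, otherwise explore."""
    if rng.random() < epsilon(step):
        return int(rng.integers(len(q_values)))   # random action from A
    return int(np.argmax(q_values))               # greedy action
```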

Building upon the analysis of each module, algorithm 1 elucidates the specific implementation of the improved DQN algorithm.

Algorithm 1: Improved DQN algorithm
Input: Training sample state set S and label set L, experience replay buffer capacity threshold |${{R}_1},{{R}_2}$|⁠, batch size B for training data extracted from the buffer, learning rate |$\alpha $|⁠, discount factor |$\gamma $|⁠, lower limit value |${{e}_{\min }}$| of greedy factor, initial value |${{e}_0}$| and attenuation |$\eta $|⁠, update frequency C of target network, the number of interaction steps T in a single epoch and the total number of epoch I
Output: The Q network parameters |$\theta = \{ {\omega ,\bar{\omega }} \}$| corresponding to the optimal action value |${{Q}^*}$|
1: Randomly initialize the parameters |$\omega $| of the earlier training network
2: Duplicate the parameters from |$\omega $| to |$\bar{\omega }$| and initialize the target network
3: for epoch = 1 to I  do
4: Initialize the environment state set S
5: for t = 1 to T  do
6: Based on the current state |${{s}_t}$|⁠, select action |${{a}_t}$| using the optimized greedy strategy
7: Utilizing the action |${{a}_t}$| in conjunction with the predefined reward mechanism, obtain the reward through environmental feedback, denoted as |${{r}_t}$|
8: After the completion of the agent's action, the environmental state transitions from |${{s}_t}$| to |${{s}_{t + 1}}$|
9: The trajectory data |$\tau = ( {{{s}_t},{{a}_t},{{r}_t},{{s}_{t + 1}}} )$| is stored in the experience replay buffer
10: if the total amount of data reaches the lower threshold |${{R}_1}$| of the buffer capacity then
11: According to the batch size B to extract the trajectory data |$\tau $| for Q network training
12: The training parameters |$\omega $| are updated by using the gradient descent and loss function
13: The target network parameters |$\bar{\omega }$| undergo an update every time step C
14: end for
15: end for
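Putting the pieces together, the condensed Python sketch below mirrors Algorithm 1 with the hyperparameters of Table 3. It reuses the hypothetical helpers sketched earlier (QNetwork1DCNN, FaultClassificationEnv, select_action, dqn_loss, sync_target) and the optimizer choice is an assumption; it is an illustrative reading of the pseudocode, not the authors' implementation.

```python
import random
from collections import deque
import numpy as np
import torch

def train(env, q_net, target_net, epochs=600, steps_per_epoch=504,
          batch_size=32, gamma=0.1, lr=1e-3, r1=100, r2=10000, c=10):
    buffer = deque(maxlen=r2)                                 # R2: upper capacity threshold
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)   # optimizer is an assumption
    target_net.load_state_dict(q_net.state_dict())            # step 2: initialize target network
    step_count = 0
    for epoch in range(epochs):                               # step 3
        state = env.reset()                                   # step 4
        for t in range(steps_per_epoch):                      # step 5
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).view(1, 1, -1))[0]
            action = select_action(q.numpy(), step_count)             # step 6
            next_state, reward, done = env.step(action)               # steps 7-8
            if not done:
                buffer.append((state, action, reward, next_state))    # step 9
            if len(buffer) >= r1:                                      # step 10: R1 reached
                s, a, r, s2 = zip(*random.sample(buffer, batch_size))  # step 11
                batch = (torch.as_tensor(np.stack(s), dtype=torch.float32).unsqueeze(1),
                         torch.as_tensor(a),
                         torch.as_tensor(np.asarray(r), dtype=torch.float32),
                         torch.as_tensor(np.stack(s2), dtype=torch.float32).unsqueeze(1))
                loss = dqn_loss(q_net, target_net, batch, gamma)       # step 12
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if step_count % c == 0:
                    sync_target(q_net, target_net)                     # step 13
            step_count += 1
            if done:
                break
            state = next_state
```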

3.3. Key steps of the proposed method

A novel method utilizing DRL for the fault diagnosis of railway point machines is introduced. The implementation process of this idea is illustrated in Fig. 5. The following steps provide a concise summary of this method.

Fig. 5. The intricate structure of the fault diagnosis method proposed in this study.

Step 1: Extract the current data of railway point machines and ensure uniform sample distribution for each fault type. Employ data processing to standardize dimensions and normalize sample values, creating a fault dataset. Subsequently, divide the dataset into training and test samples according to an appropriate ratio.

Step 2: Formulate the fault classification task as a MDP and adeptly design an interactive environment. This environment encompasses essential modules such as state, action, reward and discount factors.

Step 3: Optimize the Q network and the greedy strategy to establish the agent training network. Integrate the components developed in step 2, synergizing the efforts to construct a comprehensive DRL framework.

Step 4: Following the intricate details of reward mechanism in the MDP and algorithm 1, engage in interactive learning between the agent and the environment. Explore the execution of optimal actions to obtain cumulative rewards, ultimately realizing the generation of an excellent classification policy.

Step 5: The training process of the network halts upon reaching the maximum number of epochs. Subsequently, test samples are reintroduced to evaluate the agent's classification performance, enabling a thorough assessment.

4. Experiment and result analysis

4.1. Experimental set-up and data description

Among the AC series railway point machines, the ZDJ9 is a notable example, encompassing key components such as an electric motor, reducer, friction clutch, ball screw, throwing rod, indication rod and switch circuit controller. Renowned for its ample conversion power and high reliability, the ZDJ9 point machine finds widespread usage in China's high-speed railway lines [36, 37].

In pursuit of intelligent fault diagnosis for railway point machines, we conducted experiments utilizing current data recorded during operation under various fault conditions. The data was sourced from a centralized signal monitoring system within the Shenyang Railway Bureau, China. The current data is collected by sensors at a sampling interval of 40 ms. Taking the ZDJ9 point machine as an example, its normal operation lasts approximately 5.8 s, corresponding to about 145 sampling points. Given the variance in current data caused by different fault types, we applied the data-processing method outlined in section 3.1, setting a data dimension threshold of 165 points and normalizing the data to the range between 0 and 1. Fig. 6 shows the standard movement stages of the ZDJ9 point machine, encompassing unlocking (A), conversion (B), locking (C) and slow-down (D). During unlocking, 1DQJ and 2DQJ sequentially activate, establishing circuits for switch control. The electric motor propels the conversion process, reaching the maximum output current. Subsequently, as the external locking device successfully opens, the current gradually decreases. Transitioning into the conversion and locking stages, the current curve generally remains below 2 A. Upon completion of the locking process, the turnout is secured, resulting in another decline in the current value. The final stage involves the disconnection of the 1DQJ self-closing circuit, initiating the slow-down state. Simultaneously, the outdoor indication circuits establish a connection, maintaining the current value at approximately 0.5 A. After a defined duration, this phase concludes. Information on the other types of fault data is shown in Table 1.

Fig. 6. The current curve of normal movement for ZDJ9.

Table 1. Common fault types and phenomena in railway point machines.

Type | Fault phenomena | Sample size
F1 | During the unlocking, the current value experiences continuous fluctuations before reaching its peak. | 90
F2 | At around 1.8 s, the current value undergoes a rapid decline, plummeting from 2 A to 0. | 90
F3 | The current continues to output, failing to decrease to its normal level at the conclusion of the locking. | 90
F4 | After entering the slow release stage of 1DQJ, the current exhibits anomalies and subsequently vanishes. | 90
F5 | During the deceleration process, the current of 1DQJ significantly exceeds 0.5 A and consistently hovers between 0.7 A and 0.8 A. | 90
F6 | The duration of slowing for 1DQJ extends beyond the typical time range. | 90

Fig. 7 illustrates the correlation between current and time for various faults, primarily categorized into mechanical issues (F1, F3, F4) and suboptimal device performance (F2, F5, F6). For example, fault F1 arises from mechanical jamming in the turning shaft of the switch circuit controller, preventing the action of the moving contact group during the initial startup. Fault F2 results from the poor performance of an open-phase protector (DBQ), which interrupts the output current, cutting off the 1DQJ self-closing circuit and producing an instantaneous current value of 0. Notably, the abnormal performance of certain electronic devices can subtly create a high similarity between the fault current curve and the normal current curve. For instance, the current curve corresponding to F6 closely mirrors the waveform of the normal F0 curve. Analysis reveals that this similarity is due to inherent characteristics of electronic devices causing a short delay in the slow-down process. Therefore, incorporating experimental data of this kind enhances the persuasiveness of the proposed method in terms of classification performance.

Fig. 7. The correlation between current and time in typical faults.

4.2. Experimental verification and discussion

4.2.1. Selecting the right DRL algorithm

Presently, a multitude of algorithms tailored to task-specific demands exist, and astute selection often yields exceptional training outcomes. This study employs the fault data to experiment with both on-policy and off-policy algorithms, aiming to identify the most suitable fault identification algorithm. Initially, the 630 extracted experimental samples are split into 504 training samples and 126 test samples, adhering to an 8:2 ratio, as detailed in Table 2. Subsequently, the comparison experiment includes the policy-based algorithms REINFORCE, Actor-Critic, PPO and SAC (Soft Actor-Critic), while the value-based off-policy algorithm is DQN. The policy or value networks within these algorithms are configured as fully connected layers, with additional crucial parameter information provided in Table 3.

Table 2. The division of experimental data.

Type | Size of training/testing data | Label
F0 | 72/18 | 0
F1 | 72/18 | 1
F2 | 72/18 | 2
F3 | 72/18 | 3
F4 | 72/18 | 4
F5 | 72/18 | 5
F6 | 72/18 | 6

Table 3. Setting of key parameters in different algorithms.

Parameter | Symbol | Value
Learning rate | |$\alpha $| | 0.001
Discount factor | |$\gamma $| | 0.1
Greedy factor | |$\varepsilon $| | 0.02
Capacity thresholds of the experience replay buffer | |${{R}_1},{{R}_2}$| | 100, 10000
Batch size | B | 32
Number of neurons in the fully connected layer | H | 128
Epochs | I | 600
Interaction steps in a single epoch | T | 504
Update frequency of the target network | C | 10

Note: The learning rate of the policy network in Actor-Critic, PPO and SAC is consistent with Table 3, and the learning rate of the value network is 1e-2.

Fig. 8 illustrates the rewards generated by the five algorithms over 600 epochs. As the number of interaction epochs increases, the cumulative reward of the SAC algorithm rises gradually, yet it remains below 300 even at the maximum epoch. Both REINFORCE and Actor-Critic, when trained up to approximately 450 epochs, achieve stable reward values hovering around 432; notably, the reward curves of the two approaches nearly coincide throughout the training phase. Although PPO achieves a slightly higher cumulative reward than DQN once training stabilizes, it experiences a sharp decline around epoch 510, compromising its stability. In contrast, DQN converges rapidly and maintains a more stable cumulative reward. The reward curves indirectly validate that the off-policy DQN algorithm effectively utilizes the training data and excels in the fault classification task. This success establishes a solid foundation for further exploration of a DRL framework suitable for railway point machine fault identification.

Fig. 8. The cumulative reward curves derived from various algorithms.

4.2.2. Constructing the DL module and designing the greedy strategy

To ensure that the constructed DL module optimally fits the action value and that the greedy strategy effectively balances exploration and exploitation, this study conducts experiments incorporating various DL network structures and greedy strategies. Combination 1 consists of a fully connected layer paired with a deterministic strategy (i.e. a fixed greedy factor). Combination 2 integrates a convolutional layer, a pooling layer and a fully connected layer with the same deterministic strategy. Combination 3 utilizes a convolutional layer, a pooling layer and a fully connected layer with an exponential decay strategy. In these combinations, the convolutional layer's output channels are set to 16, utilizing a convolution kernel size of |$1 \times 3$| with a stride of 1. The pooling layer features a kernel size of |$1 \times 2$| with a stride of 2, and the fully connected layer contains 128 neurons. In the deterministic strategy, we specify the greedy factor |$\varepsilon = 0.02$|. For the exponential decay strategy, the exponential decay factor |$\lambda = 0.9$| and the initial greedy factor |${{e}_0} = 0.1$| are defined. Our selected combination involves two convolutional layers, two pooling layers and a fully connected layer, paired with the linear decay strategy. The number of output channels in the second convolutional layer is increased by 8, while the remaining parameters remain consistent with the aforementioned settings. In the linear decay strategy, we set the initial greedy factor |${{e}_0} = 0.02$|, the lower limit |${{e}_{\min }} = 0.008$| and the linear decay factor |$\eta = 0.001$|. Once a combination is finalized, it is integrated into the DQN for training. The training reward curves under the four combinations are shown in Fig. 9, and the optimal combination is determined from the analysis of cumulative rewards.

Fig. 9. The cumulative training rewards under different combinations.

After a comprehensive consideration of the number of training samples and the reward mechanism outlined in section 3.2.2, the cumulative reward per training epoch ranges from 0 to 504; that is, the maximum reward value is 504. As shown in Fig. 9 and the local view of the reward curve, under the same greedy strategy, combination 2 displays a slower convergence speed and a lower reward value than combination 1 in the initial stages of training, indicating a higher incidence of misclassification. Although the two reward curves converge in the later stages, it is noteworthy that the design of the DL module exerts a discernible influence on the overall training outcome. Combination 3, which employs the exponential decay strategy, reaches 504 in certain local segments, but its overall curve exhibits pronounced fluctuations, signifying a lack of stability in the training network. The curve obtained with the proposed combination not only closely approaches the maximum reward curve but also shows reduced fluctuation, indirectly reflecting superior stability.

To gauge the disparity in rewards between the combinations, we computed the average reward throughout the training process, as detailed in Table 4. A clear trend emerges: the proposed combination attains a higher average reward than the other three combinations, with an average improvement of 1.44%. These findings not only align with the reward curves in Fig. 9 but also show that the designed 1DCNN and linear-decay greedy strategy obtain satisfactory results. They affirm that the agent network exhibits excellent classification decision-making, signifying the development of a proficient DRL framework for diagnosing faults.

Table 4. The overall average reward under various combinations.

Combination | Overall average reward | Maximum cumulative reward
Combination 1 | 493.37 | 504
Combination 2 | 485.38 | 504
Combination 3 | 495.54 | 504
Our combination | 498.45 | 504

4.2.3. Evaluation of the superiority of the proposed method

In this study, we compare the proposed diagnostic method with classification algorithms grounded in DL and DRL, and conduct a thorough evaluation by computing diverse performance indicators to substantiate the proposed method's superiority. The specific parameter configurations of the methods employed in the experiment are as follows:

  1. Method 1 is the Backpropagation (BP) neural network. The hidden layer comprises 32 neurons, with a learning rate set at 0.01, a momentum factor of 0.9 and a maximum iteration limit of 50.

  2. Method 2 employs a Long Short-Term Memory (LSTM) model. The hidden layer consists of 32 neurons, with a single network layer. The learning rate is set to 0.01, the maximum number of iterations is 100 and the batch size is 32. The network parameters are updated using the cross-entropy loss function and the stochastic gradient descent optimizer.

  3. Method 3 is the 1DCNN model. This model comprises a convolutional layer, a pooling layer and a fully connected layer. The parameter configurations for the convolution layer and pooling layer are detailed in section 4.2.2. The fully connected layer consists of 32 neurons, with a maximum iteration limit of 50. The other parameters are updated in the same way as in the LSTM model.

  4. Method 4 entails a diagnosis model rooted in DRL, as introduced in reference [26]. The agent is constructed with a multiscale residual convolutional neural network (MRCNN). Three independent one-dimensional convolution modules are utilized, featuring output channels of 32, 64 and 128, respectively. The convolution kernel size is set to |$1 \times 3$|, with a stride of 1, and the output channel of the residual unit is 224. In the reward mechanism, a reward of 1 is assigned for correct classification, while the reward for incorrect classification is set to −1. The greedy factor |$\varepsilon $| follows a linear decay from 0.01 to 0.008. The discount factor is 0.1 and the number of training epochs is 600.

  5. Method 5 introduces a diagnosis model of SAE-based DRL, as proposed in reference [24]. The SAE comprises a network structure with four layers. The input layer's unit count is determined by the input sample's dimensionality. The two encoder layers consist of 32 and 16 units, respectively. The output layer's unit count is contingent upon the dimension of the Q value. The configuration of the reward mechanism, discount factor and training epochs align with the approaches detailed in method 4. The update of the greedy factor |$\varepsilon $| is consistent with the research in this paper.

  6. The proposed method. Within the DRL fault diagnosis framework in this study, the 1DCNN model is incorporated as the DL component and the structural specifics and parameter details of the model align with the optimal combination outlined in section 4.2.2. The design of the greedy strategy within the RL component mirrors that elucidated in optimal combination of section 4.2.2. Additionally, other training parameters remain in accordance with those presented in Table 3.

Based on the designed comparative experiments, we input the 504 sets of training data into the various classification models to derive the training outcomes depicted in Fig. 10. Subsequently, upon completion of model training, the additional 126 sets of test data are employed to assess the classification performance; the test results are presented in Table 5. In combination with the reward mechanism in this study, the classification accuracy is calculated as

|$\mathrm{Accuracy} = \frac{Y_1}{Y_2} \times 100\% $| (10)

where |${{Y}_1}$| is the average cumulative reward in the training or testing process, that is, the number of correctly classified samples, and |${{Y}_2}$| denotes the maximum cumulative reward achievable during the training or testing, that is, the total number of training samples or test samples.
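
For example, in the single test pass reported later in Table 6, the agent is rewarded for |${{Y}_1} = 124$| of the |${{Y}_2} = 126$| test samples, so Eq. (10) gives an accuracy of |$124/126 \times 100\% \approx 98.41\% $|.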

Fig. 10. The training accuracy under various classification methods.

Table 5. The test results across different classification methods.

Classification method | Testing accuracy (%)
Method 1 | 88.09
Method 2 | 92.06
Method 3 | 96.83
Method 4 | 97.62
Method 5 | 97.62
Our method | 98.41

Fig. 10 shows the accuracy of the various methods during the training phase. The BP neural network, a classical classification method, achieves an accuracy of 90.87% through continuous parameter optimization. LSTM and 1DCNN, two currently popular intelligent algorithms, achieve 4.93% and 7.35% higher accuracy than BP, respectively. Methods 4 and 5 further enhance the training accuracy, underscoring the advantages of DRL in the domain of fault diagnosis. Meanwhile, compared with the aforementioned methods, the proposed approach achieves a remarkable training accuracy of 98.9%.

This outcome indicates substantial progress in model enhancement, suggesting that an optimal fault classification strategy can be obtained more readily through the proposed study. Simultaneously, referencing the findings in Table 5, the proposed method achieves an impressive classification accuracy of 98.41% on the test data, showing a high level of consistency with the performance observed during the training phase. This strongly indicates that when employing the DRL method for the railway point machine fault diagnosis task, its inherent independent exploration capability not only enhances the intelligence of diagnosis but also yields more satisfactory results.

Upon analysing the training and test results across the various methods mentioned above, the superiority of the proposed method is substantiated to a certain extent. Next, we will employ classification result visualization techniques and construct a multi-classification confusion matrix to comprehensively illustrate the classification performance of different methods.

Fig. 11 intuitively reflects the classification outcomes of the different methods on the test data. It is evident that the BP neural network misclassifies the majority of F6 fault data as F5, leading to a notable decrease in overall testing accuracy. The LSTM model shows a slight improvement in the classification of F6, but some F0 data is misclassified, resulting in only a marginal enhancement in testing accuracy. Similarly, the 1DCNN model misclassifies some F0 samples, although its classification results for the other fault data are ideal. The overall number of misclassifications for methods 4 and 5 is significantly reduced, with the remaining errors concentrated on fault types F2 and F6. In contrast, the proposed method misclassifies only two samples, showcasing superior classification performance.

Fig. 11. The schematic diagram depicting correct and incorrect classifications: (a) method 1; (b) method 2; (c) method 3; (d) method 4; (e) method 5; (f) proposed method.

To quantitatively analyse the classification results for the different faults, we generated the multi-class confusion matrices illustrated in Fig. 12. Each matrix provides insight into the probability of each fault being correctly classified, based on the relationship between the actual label and the predicted label. Both BP and LSTM exhibit misclassification in two fault types, with particularly poor accuracy for F6 fault data at only 22% and 67%, respectively. The results for 1DCNN show relative improvement; a detailed analysis of the current curves for F0 and F6 reveals highly similar waveforms, underscoring a limitation in 1DCNN's ability to distinguish such similar samples. While methods 4 and 5 also misclassify several fault types, their overall accuracy surpasses the other approaches. The proposed method, in contrast, misclassifies samples of only a single fault type and still maintains a commendable accuracy of 88% for that type, demonstrating more stable classification performance.

Fig. 12. The confusion matrix diagram for multi-class classification: (a) method 1; (b) method 2; (c) method 3; (d) method 4; (e) method 5; (f) proposed method.

4.2.4. Evaluation of the reliability of the proposed method

The various comparative experiments conducted effectively validate the efficacy of the fault diagnosis method, built upon the DRL framework. This is evidenced by its strong performance in both training cumulative reward and classification test accuracy. However, it remains crucial to thoroughly validate the reliability of the proposed method. In our approach, we opted to vary the iterations, enabling the agent to execute multiple runs on the test data. This observation aimed to determine whether the accuracy demonstrated substantial differences across iterations. The classification test results across these diverse iterations are as follows.

Fig. 13 illustrates the reward distribution of test data after a single iteration, which represents the test process chosen by the proposed method in this study. Notably, only two out of 126 samples received the reward 0, indicating a misclassification count of 2.

Fig. 13. The distribution of reward at iteration 1.

Figs. 14 and 15 display the distribution of cumulative reward when the test data are iterated 50 and 100 times, respectively. The cumulative reward values cluster predominantly in the range of 123 to 126. Combined with Table 6, the average cumulative rewards for these iteration counts are 124.82 and 124.97, corresponding to accuracies of 99.06% and 99.18%. Compared with the 98.41% accuracy achieved in a single iteration, the classification accuracy improves and tends towards stability as the number of iterations increases. This indicates that, guided by the optimal classification policy obtained during training, the agent becomes increasingly familiar with the data environment over repeated iterations and attains more stable and precise recognition results. Moreover, in DL-based fault classification, test outcomes depend strongly on the trained model, and results may fluctuate randomly across repeated training sessions. By contrast, the DRL method introduced in this study integrates DL's perceptual capability with RL's autonomous exploration in the railway point machine fault diagnosis task, thereby achieving a better and more consistent classification effect.
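
Since each correctly classified sample contributes a reward of 1, the reported accuracies follow directly from the average cumulative reward divided by the 126 test samples:

\[
\text{Accuracy} = \frac{\bar{R}}{N_{\text{test}}} \times 100\%,
\]

so that \(\bar{R} = 124.00,\ 124.82,\ 124.97\) with \(N_{\text{test}} = 126\) give \(98.41\%,\ 99.06\%\) and \(99.18\%\), respectively.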

Fig. 14. The distribution of cumulative reward at 50 iterations.

Fig. 15. The distribution of cumulative reward at 100 iterations.

Table 6. The classification results under various iterations.

Iteration    Average cumulative reward    Maximum cumulative reward    Accuracy (%)
1            124.00                       126                          98.41
50           124.82                       126                          99.06
100          124.97                       126                          99.18

5. Conclusions

Considering the constraints posed by the current level of intelligence in railway point machine fault diagnosis, we refine the 1DCNN model to form the agent. Simultaneously, we optimize the DQN algorithm to enhance its independent exploration capability, ultimately creating a robust DRL-based diagnostic framework.
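
To illustrate how a 1DCNN can serve as the Q network of such an agent, a minimal PyTorch sketch is given below; the layer sizes, kernel widths and the assumption of a single-channel 1-D signal input with seven fault classes are illustrative and do not reproduce the exact configuration used in this study.

import torch
import torch.nn as nn

class Conv1dQNet(nn.Module):
    """1-D CNN that maps a fault-signal segment to Q values,
    one per candidate fault class (action)."""
    def __init__(self, n_actions=7, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),            # fixed-length feature map
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8, 64), nn.ReLU(),
            nn.Linear(64, n_actions),           # one Q value per fault class
        )

    def forward(self, x):
        # x: (batch, in_channels, signal_length)
        return self.head(self.features(x))

# Greedy action selection by the trained agent (illustrative usage):
# q_net = Conv1dQNet(); action = q_net(signal_batch).argmax(dim=1)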

Importantly, extensive comparative experiments underscore the strong performance of the proposed method; the infusion of independent exploration allows fault diagnosis to be approached in a human-like manner, thereby substantially enhancing diagnostic intelligence. Furthermore, the flexibility of the Q network in the DQN allows it to be designed with reference to other DL models. This adaptability enables the proposed DRL framework to be extended effectively to diverse types of railway point machines, such as the S700K and ZYJ7. It is also important to explore improved reward mechanisms that better match the requirements of fault diagnosis scenarios; in future work, we intend to investigate this aspect in depth. We are also committed to applying innovative fault diagnosis methods to other critical railway equipment, so as to systematically evaluate the health status of these components.

Acknowledgements

This research was supported by the Transportation Science and Technology Project of the Liaoning Provincial Department of Education (Grant No. 202243), the Provincial Key Laboratory Project (Grant No. GJZZX2022KF05) and the Natural Science Foundation of Liaoning Province (Grant No. 2019-ZD-0094).

Author contributions statement

Shuai Xiao designed the methodology. Shuai Xiao and Qingsheng Feng carried out data analysis and wrote the manuscript. Xue Li and Hong Li helped organize the manuscript.

Conflict of interest statement

There was no conflict of interest in the submission of the manuscript, and all authors agreed to publish it.

