Algorithm 1: Improved DQN algorithm
Input: Training sample state set S and label set L, experience replay buffer capacity thresholds $R_1, R_2$, batch size B of training data sampled from the buffer, learning rate $\alpha$, discount factor $\gamma$, greedy factor lower bound $e_{\min}$, initial value $e_0$ and decay rate $\eta$, target network update frequency C, number of interaction steps T per epoch, and total number of epochs I
Output: The Q-network parameters $\theta = \{ \omega ,\bar{\omega } \}$ corresponding to the optimal action value $Q^*$
1: Randomly initialize the parameters $\omega$ of the training network
2: Copy the parameters $\omega$ to $\bar{\omega}$ to initialize the target network
3: for epoch = 1 to I do
4: Initialize the environment state set S
5: for t = 1 to T do
6: Based on the current state $s_t$, select action $a_t$ using the optimized greedy strategy
7: Execute action $a_t$ and obtain the reward $r_t$ from environmental feedback according to the predefined reward mechanism
8: After the agent's action is completed, the environment state transitions from $s_t$ to $s_{t+1}$
9: Store the trajectory data $\tau = ( s_t, a_t, r_t, s_{t+1} )$ in the experience replay buffer
10: if the amount of stored data reaches the lower buffer-capacity threshold $R_1$ then
11: Sample a batch of trajectory data $\tau$ of size B from the buffer to train the Q-network
12: Update the training parameters $\omega$ by gradient descent on the loss function
13: Update the target network parameters $\bar{\omega}$ every C time steps
14: end if
15: end for
16: end for
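As a concrete illustration, the following is a minimal PyTorch sketch of the training loop in Algorithm 1. The network architecture, the hyperparameter values, and the DummyEnv interface (reset()/step()) are illustrative assumptions, not specifications from the paper; only the loop structure, the decaying greedy factor, the buffer threshold $R_1$, and the periodic target-network update every C steps mirror the pseudocode above.

```python
# Minimal sketch of the training loop in Algorithm 1, written with PyTorch.
# The network layout, hyperparameter values, and DummyEnv are illustrative
# assumptions; only the loop structure mirrors the pseudocode above.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

STATE_DIM, N_ACTIONS = 8, 4           # placeholder state/action dimensions
R1, R2 = 1_000, 50_000                # lower/upper buffer-capacity thresholds
B, ALPHA, GAMMA = 64, 1e-3, 0.99      # batch size, learning rate, discount factor
E0, E_MIN, ETA = 1.0, 0.05, 0.995     # greedy factor: initial, lower bound, decay
C, T, I = 200, 500, 100               # target-update period, steps per epoch, epochs


class DummyEnv:
    """Placeholder environment exposing the interface assumed by the loop."""

    def reset(self):
        return [0.0] * STATE_DIM

    def step(self, action):
        next_state = [random.random() for _ in range(STATE_DIM)]
        reward = 1.0 if action == 0 else 0.0      # stand-in reward mechanism
        done = random.random() < 0.01
        return next_state, reward, done


def make_q_net():
    # Placeholder architecture; Algorithm 1 does not fix the network layout.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))


env = DummyEnv()
q_net = make_q_net()                              # training network, parameters omega
target_net = make_q_net()                         # target network, parameters omega_bar
target_net.load_state_dict(q_net.state_dict())    # step 2: copy omega -> omega_bar
optimizer = optim.Adam(q_net.parameters(), lr=ALPHA)
buffer = deque(maxlen=R2)                         # experience replay buffer
epsilon, global_step = E0, 0


def select_action(state, eps):
    # Step 6: epsilon-greedy action selection with a decaying greedy factor.
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())


for epoch in range(I):                            # step 3
    state = env.reset()                           # step 4
    for t in range(T):                            # step 5
        action = select_action(state, epsilon)
        next_state, reward, done = env.step(action)               # steps 7-8
        buffer.append((state, action, reward, next_state, done))  # step 9
        state = next_state
        global_step += 1
        epsilon = max(E_MIN, epsilon * ETA)       # decay the greedy factor

        if len(buffer) >= R1:                     # step 10: wait for R1 samples
            batch = random.sample(buffer, B)      # step 11: draw a batch of size B
            s, a, r, s2, d = zip(*batch)
            s = torch.as_tensor(s, dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = r + GAMMA * target_net(s2).max(1).values * (1 - d)
            loss = nn.functional.mse_loss(q, target)   # step 12: gradient descent
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if global_step % C == 0:              # step 13: periodic target update
                target_net.load_state_dict(q_net.state_dict())
        if done:
            break
```

In this sketch the greedy factor is decayed once per interaction step and clipped at $e_{\min}$; if the paper decays it per epoch or on another schedule, the single line updating epsilon is the only one that changes.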