Algorithm 1: Improved DQN algorithm
Input: Training sample state set S and label set L, experience replay buffer capacity thresholds $R_1, R_2$, batch size B of training data sampled from the buffer, learning rate $\alpha$, discount factor $\gamma$, greedy factor lower bound $e_{\min}$, initial value $e_0$ and decay rate $\eta$, target network update frequency C, number of interaction steps T per epoch, and total number of epochs I
Output: The Q-network parameters $\theta = \{ \omega ,\bar{\omega } \}$ corresponding to the optimal action value $Q^*$
1: Randomly initialize the parameters $\omega$ of the training network
2: Copy the parameters $\omega$ to $\bar{\omega}$ to initialize the target network
3: for epoch = 1 to I do
4: Initialize the environment state set S
5: for t = 1 to T do
6: Based on the current state $s_t$, select action $a_t$ using the optimized greedy strategy
7: Execute action $a_t$ and obtain the reward $r_t$ from environmental feedback according to the predefined reward mechanism
8: After the agent's action is completed, the environment state transitions from $s_t$ to $s_{t+1}$
9: Store the trajectory data $\tau = ( s_t, a_t, r_t, s_{t+1} )$ in the experience replay buffer
10: if the amount of stored data reaches the lower buffer-capacity threshold $R_1$ then
11: Sample a batch of trajectory data $\tau$ of size B from the buffer to train the Q-network
12: Update the training parameters $\omega$ by gradient descent on the loss function
13: Update the target network parameters $\bar{\omega}$ every C time steps
14: end if
15: end for
16: end for
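As a concrete illustration, the following is a minimal PyTorch sketch of the training loop in Algorithm 1. The network architecture, the hyperparameter values, and the DummyEnv interface (reset()/step()) are illustrative assumptions, not specifications from the paper; only the loop structure, the decaying greedy factor, the buffer threshold $R_1$, and the periodic target-network update every C steps mirror the pseudocode above.

```python
# Minimal sketch of the training loop in Algorithm 1, written with PyTorch.
# The network layout, hyperparameter values, and DummyEnv are illustrative
# assumptions; only the loop structure mirrors the pseudocode above.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim

STATE_DIM, N_ACTIONS = 8, 4           # placeholder state/action dimensions
R1, R2 = 1_000, 50_000                # lower/upper buffer-capacity thresholds
B, ALPHA, GAMMA = 64, 1e-3, 0.99      # batch size, learning rate, discount factor
E0, E_MIN, ETA = 1.0, 0.05, 0.995     # greedy factor: initial, lower bound, decay
C, T, I = 200, 500, 100               # target-update period, steps per epoch, epochs


class DummyEnv:
    """Placeholder environment exposing the interface assumed by the loop."""

    def reset(self):
        return [0.0] * STATE_DIM

    def step(self, action):
        next_state = [random.random() for _ in range(STATE_DIM)]
        reward = 1.0 if action == 0 else 0.0      # stand-in reward mechanism
        done = random.random() < 0.01
        return next_state, reward, done


def make_q_net():
    # Placeholder architecture; Algorithm 1 does not fix the network layout.
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))


env = DummyEnv()
q_net = make_q_net()                              # training network, parameters omega
target_net = make_q_net()                         # target network, parameters omega_bar
target_net.load_state_dict(q_net.state_dict())    # step 2: copy omega -> omega_bar
optimizer = optim.Adam(q_net.parameters(), lr=ALPHA)
buffer = deque(maxlen=R2)                         # experience replay buffer
epsilon, global_step = E0, 0


def select_action(state, eps):
    # Step 6: epsilon-greedy action selection with a decaying greedy factor.
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())


for epoch in range(I):                            # step 3
    state = env.reset()                           # step 4
    for t in range(T):                            # step 5
        action = select_action(state, epsilon)
        next_state, reward, done = env.step(action)               # steps 7-8
        buffer.append((state, action, reward, next_state, done))  # step 9
        state = next_state
        global_step += 1
        epsilon = max(E_MIN, epsilon * ETA)       # decay the greedy factor

        if len(buffer) >= R1:                     # step 10: wait for R1 samples
            batch = random.sample(buffer, B)      # step 11: draw a batch of size B
            s, a, r, s2, d = zip(*batch)
            s = torch.as_tensor(s, dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)
            q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = r + GAMMA * target_net(s2).max(1).values * (1 - d)
            loss = nn.functional.mse_loss(q, target)   # step 12: gradient descent
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if global_step % C == 0:              # step 13: periodic target update
                target_net.load_state_dict(q_net.state_dict())
        if done:
            break
```

In this sketch the greedy factor is decayed once per interaction step and clipped at $e_{\min}$; if the paper decays it per epoch or on another schedule, the single line updating epsilon is the only one that changes.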