Algorithm 1: Improved DQN algorithm
Input: Training sample state set S and label set L, experience replay buffer capacity thresholds $R_1, R_2$, batch size B of training data sampled from the buffer, learning rate $\alpha$, discount factor $\gamma$, lower limit $e_{\min}$, initial value $e_0$, and decay rate $\eta$ of the greedy factor, target network update frequency C, number of interaction steps T per epoch, and total number of epochs I
Output: The Q-network parameters $\theta = \{\omega, \bar{\omega}\}$ corresponding to the optimal action value $Q^*$
1: Randomly initialize the parameters $\omega$ of the online training network
2: Copy the parameters $\omega$ to $\bar{\omega}$ to initialize the target network
3: for epoch = 1 to I do
4:   Initialize the environment state from the state set S
5:   for t = 1 to T do
6:     Select action $a_t$ for the current state $s_t$ using the optimized greedy strategy
7:     Execute action $a_t$ and obtain the reward $r_t$ from the environment according to the predefined reward mechanism
8:     Observe the environment state transition from $s_t$ to $s_{t+1}$ once the agent's action is completed
9:     Store the trajectory data $\tau = (s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer
10:    if the amount of stored data reaches the lower capacity threshold $R_1$ then
11:      Sample a minibatch of B trajectories $\tau$ from the buffer for Q-network training
12:      Update the training parameters $\omega$ by gradient descent on the loss function
13:      Update the target network parameters $\bar{\omega}$ every C time steps
14:    end if
15:  end for
16: end for
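For concreteness, the following is a minimal PyTorch sketch of the training loop described in Algorithm 1. The environment interface (env.reset and env.step returning a (next_state, reward, done) triple), the fully connected network architecture, and the default hyperparameter values are illustrative assumptions rather than details specified here; the multiplicative decay from $e_0$ toward $e_{\min}$ with rate $\eta$ is likewise one plausible reading of the optimized greedy strategy.

# Minimal sketch of Algorithm 1; environment, network, and defaults are assumptions.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class QNetwork(nn.Module):
    """Small fully connected Q-network (architecture is an assumption)."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)


def train_improved_dqn(env, state_dim, n_actions,
                       R1=1_000, B=64, alpha=1e-3, gamma=0.99,
                       e0=1.0, e_min=0.05, eta=0.995, C=100,
                       T=200, I=500):
    q_net = QNetwork(state_dim, n_actions)          # training network (parameters omega)
    target_net = QNetwork(state_dim, n_actions)     # target network (parameters omega-bar)
    target_net.load_state_dict(q_net.state_dict())  # step 2: copy omega to omega-bar

    optimizer = optim.Adam(q_net.parameters(), lr=alpha)
    buffer = deque(maxlen=100_000)                  # experience replay buffer
    epsilon, step = e0, 0

    for epoch in range(I):                          # step 3
        state = env.reset()                         # step 4 (hypothetical env interface)
        for t in range(T):                          # step 5
            # Step 6: decaying epsilon-greedy action selection.
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
                    action = int(q_values.argmax().item())
            epsilon = max(e_min, epsilon * eta)     # decay toward the lower limit e_min

            # Steps 7-9: act, observe reward and next state, store the transition.
            next_state, reward, done = env.step(action)
            buffer.append((state, action, reward, next_state, done))
            state = next_state
            step += 1

            # Step 10: train only once the buffer holds at least R1 transitions.
            if len(buffer) >= R1:
                batch = random.sample(buffer, B)    # step 11: sample a minibatch of size B
                s, a, r, s2, d = map(list, zip(*batch))
                s = torch.as_tensor(s, dtype=torch.float32)
                a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
                r = torch.as_tensor(r, dtype=torch.float32)
                s2 = torch.as_tensor(s2, dtype=torch.float32)
                d = torch.as_tensor(d, dtype=torch.float32)

                # Step 12: TD target from the target network, gradient step on omega.
                q_sa = q_net(s).gather(1, a).squeeze(1)
                with torch.no_grad():
                    target = r + gamma * (1.0 - d) * target_net(s2).max(dim=1).values
                loss = nn.functional.mse_loss(q_sa, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                # Step 13: refresh the target network every C steps.
                if step % C == 0:
                    target_net.load_state_dict(q_net.state_dict())

            if done:
                break
    return q_net, target_net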