Abstract

Spontaneous synchronization is ubiquitous in natural and man-made systems. It underlies emergent behaviors such as neuronal response modulation and is fundamental to the coordination of robot swarms and autonomous vehicle fleets. Owing to its simplicity and physical interpretability, the pulse-coupled oscillator model has emerged as one of the standard models for synchronization. However, existing analytical results for this model assume ideal conditions, including homogeneous oscillator frequencies and negligible coupling delays, as well as strict requirements on the initial phase distribution and the network topology. Using reinforcement learning, we obtain an optimal pulse-interaction mechanism (encoded in a phase response function) that maximizes the probability of synchronization even in the presence of nonideal conditions. For small oscillator heterogeneities and propagation delays, we propose a heuristic formula for highly effective phase response functions that applies to general networks and unrestricted initial phase distributions. This allows us to bypass the need to relearn the phase response function for every new network.

Significance Statement

Owing to its simplicity and physical interpretability, the pulse-coupled oscillator model has emerged as one of the standard models for studying synchronization in both biological and engineered networks. However, finding an optimal pulse-interaction mechanism is challenging, and generally intractable in practical scenarios involving random delays and frequency differences. By utilizing reinforcement learning strategies, we obtain pulse-interaction mechanisms that optimize both the speed and the probability of synchronization even in the presence of random delays and frequency differences. The results yield a general formula for the optimal interaction mechanism for arbitrary network structures, and hence enable predicting the optimal interaction mechanism for every new network without re-running the reinforcement learning process.

Introduction

Recent research has turned to the pulse-coupled oscillator (PCO) model (1–10), which was initially proposed to describe biological neuronal networks (11–22) and cardiac pacemakers (23–26) but has subsequently found applications in many other systems, including artificial neural networks (27), social self-organization (28), and clock coordination in wireless sensor networks (29, 30). The PCO model explicitly incorporates the hybrid nature of network dynamics (in contrast to models such as the Kuramoto oscillators (31)) and holds great promise for synchronizing engineered networks (29). For example, PCO-based strategies have proven highly successful in achieving motion coordination in robot swarms (32–34). Since only simple, content-free pulses are exchanged between agents, PCO-based synchronization is naturally resilient to message corruption in communications and incurs little communication overhead, leading to improved network robustness and reduced communication latency (35, 36). These advantages make PCO-based synchronization particularly appealing for coordination in engineered networks, such as robot swarms and vehicle fleets, that are subject to stringent reliability and real-time constraints.

The synchronizability of PCO networks depends crucially on the phase response function (PRF), which characterizes how an oscillator adjusts its phase when a pulse is received from a neighboring oscillator (37); the amount of adjustment depends on the current state of the receiving oscillator. Many analytical results on PRFs have been reported. For example, Klinglmayr et al. (38) and Lyu (39) investigated the synchronization of PCOs with stochastic PRFs, Lyu (40) further analyzed the synchronization time of PCOs on tree networks, and Wang and Doyle (41) proposed a PRF that maximizes the speed of synchronization when the initial oscillator phases are distributed within a half-cycle. However, these results are obtained under ideal conditions, including zero time delays and identical oscillator frequencies. In fact, nonideal factors such as propagation delays and heterogeneous oscillator frequencies render the analytical design of PRFs extremely difficult, if at all possible (9, 42–52).

In this article, we propose a reinforcement learning (RL) approach to determine a highly effective PRF under both ideal and nonideal conditions. The interaction strategy found by our RL approach improves the probability of synchronization compared with previously proposed PRFs (41–43, 47). Moreover, our results reveal a general design principle for PRFs that can be adapted to general network topologies. Finally, the flexibility of our RL framework allows oscillators to adapt to changing network structures and environmental noise, ensuring robust synchronization under real-world conditions.

Results

Pulse-coupled oscillators

Let us consider a network of N PCOs, where ϕi ∈ S1 = [0, 2π) is the phase of oscillator i ∈ V = {1, 2, …, N}. Each oscillator's phase evolves at its intrinsic frequency ωi. When ϕi reaches the threshold value 2π, oscillator i fires a pulse and resets its phase to zero. Neighboring oscillators receive this pulse after some (random) time delay τij. This delay is primarily due to the finite propagation speed of the pulse, but it may also include the time a node needs to process the incoming pulse.

An oscillator responds to a received pulse by updating its phase ϕi according to

ϕi+ = ϕi− + ψi (mod 2π), with ψi = l·F(ϕi−),  (1)

where ϕi− and ϕi+ represent the phase of oscillator i immediately before and after receiving a pulse, respectively. The function F(ϕ), which determines how much an oscillator adjusts its phase as a function of its phase at the moment the pulse is received, is called the phase response function (PRF). The jump ψi in the value of the phase is determined not only by the PRF, but also by the coupling strength l ∈ (0, 1), which is introduced to facilitate the analysis and design of PCO-based synchronization in engineered systems (41, 42, 49, 53). It is worth noting that the PRF is related to the phase transition curve (PTC) in (43) via PTC(ϕ) = ϕ + l·F(ϕ).
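For concreteness, the following is a minimal Python sketch of the update in Eq. (1); the delay-advance PRF used for illustration is of the kind proposed in (41), and the function names are ours:

```python
import numpy as np

TWO_PI = 2 * np.pi

def apply_pulse(phi, prf, l=0.5):
    # Eq. (1): shift the phase by l * F(phi) and wrap back onto [0, 2*pi)
    return (phi + l * prf(phi)) % TWO_PI

def delay_advance_prf(phi):
    # Illustrative delay-advance PRF: delay toward 0 in the first
    # half-cycle, advance toward 2*pi in the second
    return -phi if phi <= np.pi else TWO_PI - phi

print(apply_pulse(0.9 * TWO_PI, delay_advance_prf, l=0.5))  # pulled toward the threshold
```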

To quantify the degree of synchronization of an oscillator network, we define the containing arc Λ(ϕ) as the smallest interval in S1 that contains all oscillator phases in the network:

Λ(ϕ) = 2π − max_{i∈V} υi(ϕ).  (2)

Following (54), we define an arc as a connected subset of the one-dimensional torus S1. Thus, υi(ϕ) = min_{j≠i} {(ϕj − ϕi) mod 2π} in Eq. (2) denotes the length of the arc along S1 from oscillator i to the first oscillator ahead of it in phase. Note that Σ_{i∈V} υi(ϕ) = 2π always holds. Hence, Λ(ϕ) = 2π − max_{i∈V} υi(ϕ) is the smallest arc containing all oscillators and can be used to quantify the degree of synchronization.
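The containing arc of Eq. (2) can be computed directly from the sorted phases; a minimal sketch, assuming the phases are stored in a NumPy array:

```python
import numpy as np

TWO_PI = 2 * np.pi

def containing_arc(phases):
    # Eq. (2): 2*pi minus the largest gap between circularly sorted phases
    phi = np.sort(np.asarray(phases) % TWO_PI)
    gaps = np.diff(np.concatenate([phi, [phi[0] + TWO_PI]]))  # the arc lengths v_i
    return TWO_PI - gaps.max()

print(containing_arc([0.1, 0.3, 6.2]))  # tight cluster straddling 0: small arc
```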

We use RL to determine an optimal PRF F(ϕ) that maximizes the probability of synchronization for a given number of oscillators:

F* = arg max_F  Pr[ lim_{t→∞} Λ(ϕ(t)) = 0 ],  (3)

where G is an underlying graph with edges drawn randomly according to some distribution, ϕ(0) is an initial phase configuration drawn uniformly at random, and ϕ(t) is the resulting phase trajectory of all oscillators under the given PRF and initial phase distribution.
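The objective in Eq. (3) can be estimated by Monte Carlo sampling. The sketch below assumes a user-supplied simulate routine (not specified here) that draws a random graph and random initial phases, runs the PCO dynamics under the given PRF, and returns the final containing arc:

```python
import numpy as np

def sync_probability(simulate, prf, trials=1000, tol=1e-2, seed=0):
    # Monte Carlo estimate of the objective in Eq. (3): the fraction of
    # random (graph, initial phase) draws whose final containing arc is
    # below a small tolerance
    rng = np.random.default_rng(seed)
    hits = sum(simulate(prf, rng) < tol for _ in range(trials))
    return hits / trials
```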

Reinforcement learning

Several recent works use learning methods to investigate the dynamics of coupled oscillators (see, e.g., 55–58). However, all of these results consider smoothly interacting oscillators such as Kuramoto oscillators. Recently, the authors of (59) proposed using learning methods to predict whether an oscillator network will synchronize, considering both smooth and pulse interactions; however, the interaction mechanisms in (59) are given and predetermined. In this work, we leverage the exploration and adaptation properties of RL to optimize the interaction mechanism of PCOs themselves. More specifically, we use RL to determine an optimal PRF F(ϕ) that maximizes the probability of synchronization under both ideal and nonideal conditions. It is worth noting that formal analysis of PCO networks under general initial phase distributions and practical nonideal conditions, such as coupling delays and frequency heterogeneity, remains out of reach for current analytic techniques (47, 60). With RL, we can model the nonideal factors directly, let the oscillators evolve naturally in the network, and gradually optimize their responses to maximize the probability of synchronization.

A schematic of the approach is given in Fig. 1. Specifically, for each oscillator i ∈ V, the state is its phase value s = ϕi. When a pulse is received from a neighboring oscillator, oscillator i changes its phase value by an action a = F(ϕi). The value of each state-action pair is stored in a matrix Q(s, a). Based on the current state-action values, oscillator i chooses a policy πi in its environment, which consists of its neighboring oscillators along with the network dynamics. Each oscillator receives a set of reward values, which are used to update the state-action values. The RL process repeats until an optimal policy πi* is obtained. By averaging πi* across experiments, we find the best piecewise-linear fit as the optimal PRF. The detailed RL protocol, including the design of the reward values and the Q-value update, is described in the “Materials and Methods” section.

Fig. 1. Schematic of the reinforcement learning framework proposed for pulse-coupled oscillators.

Optimal phase response functions under ideal conditions

We first consider the ideal case where all oscillators have identical frequency ωi = 2π and pulses are received instantaneously, with no delay. We begin with a PCO network of N = 2 oscillators, for which the containing arc is always within half a cycle. Fig. 2 shows the average learned policy over 10 experiments, which closely approximates the PRF proposed in (41), obtained analytically under the assumption that the containing arc is less than half a cycle:

F(ϕ) = −ϕ for 0 ≤ ϕ ≤ π,  F(ϕ) = 2π − ϕ for π < ϕ < 2π.  (4)
Fig. 2. Average optimal policy learned using a network of two oscillators with identical frequencies and zero delay. The dotted lines show the maximum variations in the learned policies. The learned PRF closely approximates the analytical solution in (41).

We next consider N = 6 oscillators with all-to-all coupling (for results with N = 3, 4, 5 and with ER random graphs, see Figs. S1–S6 in the supplementary materials). Fig. 3 shows the optimal average learned policy over 18 experiments, which differs from the analytical result in (41), primarily because the initial phase distribution is unrestricted. During the training phase, the randomness in policy selection does not guarantee that the oscillators stay within a half-cycle, even if the phase values were initialized within one. Thus, the assumptions behind the analytical result in (41) do not hold in this case.

Fig. 3. Average optimal policy learned using six oscillators in an all-to-all topology with identical frequencies and zero delay. The dotted lines show the maximum variations in the learned policies. Since the oscillators can have an unrestricted distribution of initial phases, the learned PRF is significantly different from the analytical prediction in (41) and is better modeled using Eq. (5).

Design principle of phase response functions

The average learned policies from both Figs. 2 and 3 can be modeled using a simple form

(5)

where c1 < 2π and c2 < 2π are positive constants that will be determined later. The best fits to the learned policies using Eq. (5) are shown in Figs. 2 and 3.

This PRF model offers important insight into the design principles of a highly effective phase response policy. When its phase is close to the start or end of a cycle, an oscillator learns to take the maximum phase adjustment toward the threshold value, which has been shown analytically to decrease the synchronization time (41). When the phase is near the middle of the cycle, the oscillator instead adjusts its phase proportionally to its distance from the end of the cycle, similar to the strategy used in (43), which has been shown to almost always lead to synchronization. Combining these two strategies yields a phase response function that achieves synchronization efficiently.
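For illustration, one plausible piecewise realization of this principle is sketched below; it is not the fitted form of Eq. (5) itself, and the breakpoints and middle slope are placeholders:

```python
import numpy as np

def combined_prf(phi, c1=1.0, c2=4.5, slope=0.5):
    # Near the cycle boundaries: maximum adjustment toward the threshold
    # (0 and 2*pi coincide on S^1); in the middle: an advance proportional
    # to the distance to the end of the cycle, as in (43)
    if phi < c1:
        return -phi                      # jump back to 0 (= 2*pi mod 2*pi)
    if phi > c2:
        return 2 * np.pi - phi           # jump forward to the threshold
    return slope * (2 * np.pi - phi)     # proportional advance
```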

We now use Erdős–Rényi–Gilbert (ERG) networks with randomly chosen parameters to determine the best parameter values for Eq. (5). Fig. 4 shows the best-fit values of the parameters c1 and c2, which are well described by exponential functions of the oscillator indegree δ:

ci = bi,1 exp(bi,2 δ), i = 1, 2,  (6)

where bi,1 and bi,2 are constants. This reduces the optimization of an entire function to the much simpler problem of optimizing two scalar parameters. Equation (5) combined with Eq. (6) provides a powerful heuristic formula that can be used to predict the optimal PRF for oscillators in general networks without repeating the RL process.
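A sketch of how the heuristic could be applied in practice, assuming the exponential fits of Eq. (6) take the form ci = bi,1·exp(bi,2·δ); the b-values below are placeholders rather than the constants fitted in Fig. 4:

```python
import numpy as np

def predict_prf_params(indegree, b1=(5.0, -0.10), b2=(4.0, -0.05)):
    # Eq. (6): exponential dependence of the PRF parameters on indegree;
    # the b-values here are placeholders, not the constants fitted in Fig. 4
    c1 = b1[0] * np.exp(b1[1] * indegree)
    c2 = b2[0] * np.exp(b2[1] * indegree)
    return c1, c2

print(predict_prf_params(indegree=4))
```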

Fig. 4. Training and test data from best-fit values of learned phase response functions. The training data show the average and range of the best-fit values of the function parameters c1 and c2 in Eq. (5) for the optimal average learned policy of all oscillators on Erdős–Rényi–Gilbert (ERG) networks G(n, p), with n and p generated randomly in the intervals [6, 25] and (0, 1), respectively (discarding network realizations that are not connected); these data are also used to determine the best-fit exponential curve of Eq. (6) for each function parameter. The test data show the best-fit values of the function parameters for each oscillator in Watts–Strogatz networks G(N, K, β), where N = 50, β = 1, and K is varied from 1 to 25. Oscillators with identical indegree learn similar PRFs of the form given by Eq. (5), with learned parameter values that are well predicted by Eq. (6), regardless of network size or topology.

Optimal phase response functions under nonideal conditions

In experiments with nonidentical oscillator frequencies (differing by up to 10% of the nominal frequency) and nonzero propagation delays (up to 10% of an oscillation cycle), we find that the learned PRFs are very similar to those obtained in ideal environments (see Figs. S7–S14 in the supplementary materials for details). Thus, we conclude that the same PRFs remain optimal for achieving synchronization when the network is subject to moderate nonideal conditions.

Next, we compare our learned PRF based on Eq. (5) to the previously proposed PRFs in (41–43, 47) under nonideal conditions. Since the synchronization strategies of Mirollo and Strogatz (43), Nishimura and Friedman (47), and Klinglmayr et al. (42) do not include an explicit coupling-strength parameter, we modify these algorithms to scale the phase adjustment by the coupling strength l, as in Eq. (1). The original algorithms are recovered at l = 1.0.

For our test, we consider an ER random graph with N = 10 oscillators and edge probability p = 0.3, as illustrated in Fig. 5. The oscillator frequencies ωi are drawn uniformly from [1.9π, 2.1π], and the propagation delays τij uniformly from [0.01, 0.08] cycles. Initial oscillator phases are assigned uniformly at random in S1.
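This experimental setup can be reproduced schematically as follows (in the reported experiments a single fixed connected realization, shown in Fig. 5, is used; the seed and the graph draw below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)                 # illustrative seed
N, p = 10, 0.3
upper = np.triu(rng.random((N, N)) < p, k=1)   # ER(N, p): undirected, no self-loops
adj = upper | upper.T
omega = rng.uniform(1.9 * np.pi, 2.1 * np.pi, size=N)  # heterogeneous frequencies
tau = rng.uniform(0.01, 0.08, size=(N, N))             # propagation delays (cycles)
phi0 = rng.uniform(0.0, 2 * np.pi, size=N)             # random initial phases on S^1
```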

Fig. 5. The Erdős–Rényi–Gilbert (ERG) graph used in Figs. 6 and 7 for the comparison between our learned PRF and previously proposed PRFs.

Fig. 6 shows the average steady-state value of the containing arc over 2,000 runs across a broad range of coupling strengths. Our learned PRF, Wang and Doyle's delay-advance PRF (41), and Nishimura and Friedman's “strong type II” PRF (47), which are similar in shape, synchronize the network more tightly than the other strategies. The coupling-strength parameter allows additional tuning to achieve a greater degree of synchronization, especially in nonideal environments.

Fig. 6. Average of the steady-state containing arc for ten oscillators in an ER random topology (Fig. 5) with nonidentical oscillator frequencies and random coupling delays.

Moreover, in terms of how often the network synchronizes, our learned PRF synchronizes more often at low coupling strengths than any of the other strategies, as shown in Fig. 7. Similar results are obtained when we vary the network size and topology, the amount of oscillator frequency heterogeneity, and the amount of coupling delay (see Figs. S15–S25 in the supplementary materials for details).

Fig. 7. Probability of synchronizing ten oscillators in an ER random topology (Fig. 5) with nonidentical frequencies and random delays. The learned PRF is able to synchronize more frequently than the other strategies, especially for smaller coupling strengths.

Discussion

The proposed reinforcement learning framework provides a simple and versatile method for optimizing synchronization in pulse-coupled oscillator networks, maximizing both the degree and the resilience of synchronization. Given that maintaining synchronization in the presence of message corruption and communication delays is crucial for numerous systems and processes, the results are expected to find broad application in biological and engineered systems. For example, biological systems such as cardiac pacemakers and neuronal networks can be effectively modeled by PCOs (61). In addition, the proposed method is well positioned to synchronize clocks in wireless sensor networks (41, 42, 53) and to coordinate motion in robot networks (32, 34). Furthermore, to the best of our knowledge, this paper is the first to use a distributed reinforcement learning approach to optimize synchronization under noncontinuous, pulse-based interactions, in contrast to the continuous-time smooth interactions of the Kuramoto model (62). It has direct ramifications for deploying reinforcement learning in general multiagent systems to optimize dynamical processes on networks.

Future work includes minimizing the synchronization time of PCOs and incorporating mechanisms that allow oscillators to adapt their intrinsic frequencies to handle large frequency heterogeneity. Further improvements in the reward design, along with an expanded action space (e.g. allowing adjustments of both oscillator phases and frequencies), may lead to better control of both synchronization and desynchronization in networks of sensors and robots with discrete-time interactions.

Materials and methods

Reinforcement learning protocol

State, action, and Q-value

Since the oscillator's state and action values evolve on continuous intervals, parameterization and approximation decisions must be made to implement RL. We parameterize the continuous state interval with P + 1 evenly spaced parameters, s0, s1, …, sP, where the state parameter sp corresponds to the phase value 2πp/P in S1. Additionally, we discretize the actions into A + 1 values, a0, a1, …, aA, such that the actions are limited to phase changes that keep the oscillator within its current cycle; the possible actions at sp are ak = −sp + 2πk/A for k = 0, 1, …, A. We represent the value of each state-action pair with a (P+1) × (A+1) matrix Q(s, a), each element of which estimates the expected reward from taking action a in state s.
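In code, the discretization amounts to a phase grid, a per-state action grid, and a zero-initialized value matrix; a minimal sketch with our variable names:

```python
import numpy as np

P, A = 100, 200                                # P+1 state parameters, A+1 actions
states = np.linspace(0.0, 2 * np.pi, P + 1)    # s_p = 2*pi*p/P

def actions_for_state(s_p):
    # Actions a_k = -s_p + 2*pi*k/A span [-s_p, 2*pi - s_p], i.e. exactly
    # the phase changes that keep the oscillator within its current cycle
    return -s_p + 2 * np.pi * np.arange(A + 1) / A

Q = np.zeros((P + 1, A + 1))                   # zero-initialized value estimates
```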

We implement episodic RL using an on-policy temporal-difference technique (see, e.g., 63); according to (64), off-policy techniques such as Q-learning tend to perform worse when continuous state and action spaces must be approximated. A policy π for our Markov decision process consists of a set of actions, one for each state parameter sp, such that π(sp) = ap; this policy represents a piecewise straight-line approximation of the oscillator's PRF F(ϕ). For on-policy learning, we choose a policy before each episode. To avoid confusion with the constant π appearing in phase values, we denote policies with subscripts, so the policy used for episode t is πt. The choice of policy is based on the current state-action value estimates Q(s, a), using a soft-max (Boltzmann) distribution over actions. Initially, all state-action values are set to zero, so the initial policy is equally likely to choose any action.
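A minimal sketch of the soft-max policy draw, assuming a temperature parameter (its value is an implementation choice not specified here):

```python
import numpy as np

def softmax_policy(Q, temperature=1.0, seed=None):
    # Draw one action index per state parameter from a Boltzmann (soft-max)
    # distribution over the Q estimates; with all-zero initial Q this is a
    # uniform choice over actions
    rng = np.random.default_rng(seed)
    z = Q / temperature
    prob = np.exp(z - z.max(axis=1, keepdims=True))
    prob /= prob.sum(axis=1, keepdims=True)
    return np.array([rng.choice(Q.shape[1], p=row) for row in prob])
```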

Reward design

Our goal is to have the oscillators synchronize their phases. Due to the dynamics of PCO networks, the choice of reward and how the values of state-action pairs are updated are both critical. Since the length of the containing arc measures how well the network is synchronized, we reward actions that decrease its length and penalize actions that increase it. Thus, the reward includes the decrease in the length of the containing arc, ΔΛ(ϕ) = Λ(ϕ)− − Λ(ϕ)+, where Λ(ϕ)− and Λ(ϕ)+ are the lengths of the containing arc immediately before and after the action, respectively.

Individual oscillators in a PCO network do not know the true state of the network. However, each oscillator can approximately estimate that state by keeping track of the phase differences between itself and other oscillators whenever a pulse is received. We therefore use the oscillator's estimated state to approximate the change in the containing arc and, thus, to determine the reward value.

Multiple actions can result in the same decrease in the containing arc. To encourage efficient synchronization, we penalize the oscillator by f(ak) = ak²/(2π), based on the magnitude of the action ak. Therefore, when an oscillator receives its kth pulse and takes an action ak that decreases the (estimated) containing arc by ΔΛk(ϕ), the total reward is given by

Rk = wΛ ΔΛk(ϕ) − wa f(ak),  (7)

where wΛ and wa are positive weights.
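The reward of Eq. (7) is a one-liner; a sketch with our argument names:

```python
import math

def reward(delta_arc_k, a_k, w_arc, w_a):
    # Eq. (7): reward the (estimated) decrease in the containing arc and
    # penalize the squared action magnitude, f(a) = a**2 / (2*pi)
    return w_arc * delta_arc_k - w_a * a_k ** 2 / (2 * math.pi)
```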

During a training episode t, we let the network evolve for a fixed amount of time following a given policy πt. When an oscillator receives a pulse, it uses πt to determine its action, i.e. its phase adjustment, based on the given coupling strength l and its current phase ϕi.

Let us denote the two state parameters closest to ϕi at the kth received pulse as sL,k and sH,k, respectively, where sL,k ≤ ϕi ≤ sH,k holds. The corresponding actions from policy πt at these state parameters are denoted aL,k and aH,k, respectively. To determine the action ψi that the oscillator takes, we weight the actions of the two nearest state parameters by the proximity of the current phase to those parameters. That is, the phase adjustment ψi in Eq. (1) is

ψi = ρL,k aL,k + ρH,k aH,k,  (8)

with ρL,k = (sH,k − ϕi)/(sH,k − sL,k) and ρH,k = (ϕi − sL,k)/(sH,k − sL,k).
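A sketch of the proximity-weighted action of Eq. (8), assuming the state grid and the per-state policy actions are stored as arrays:

```python
import numpy as np

def interpolated_action(phi, states, policy_actions):
    # Eq. (8): blend the policy's actions at the two state parameters
    # bracketing phi, weighted by proximity (rho_L and rho_H)
    h = np.searchsorted(states, phi)
    h = min(max(h, 1), len(states) - 1)        # clamp to a valid bracket
    s_lo, s_hi = states[h - 1], states[h]
    rho_lo = (s_hi - phi) / (s_hi - s_lo)
    rho_hi = (phi - s_lo) / (s_hi - s_lo)
    return rho_lo * policy_actions[h - 1] + rho_hi * policy_actions[h]
```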

Before an oscillator adjusts its phase, it records its current state as Sk = ϕi, where k indexes the pulses received during the episode. The action taken under policy πt is recorded as Ak = ψi. The oscillator then calculates its reward Rk from Eq. (7) and evolves freely until another pulse is received.

Q-value update

Once an episode for the network is complete, we use the resulting state-action-reward sequences to perform a batch update of the state-action value matrix Q(s, a). The update is based on the Sarsa algorithm in (64), where the reward Rk is applied to the state-action values of the two state parameters nearest Sk and their corresponding actions under the episode policy πt. The update for sL,k is

Q(sL,k, aL,k) ← Q(sL,k, aL,k) + α ρL,k [Rk + γ QE,k+1 − Q(sL,k, aL,k)],  (9a)

and the update for sH,k is

Q(sH,k, aH,k) ← Q(sH,k, aH,k) + α ρH,k [Rk + γ QE,k+1 − Q(sH,k, aH,k)],  (9b)

where α is the learning rate and γ is the discount rate. Here, QE,k+1 denotes the average estimated value of the next state-action pair (Sk+1,Ak+1) and is calculated as

QE,k+1 = ρL,k+1 Q(sL,k+1, aL,k+1) + ρH,k+1 Q(sH,k+1, aH,k+1).  (10)

This update is performed for every state-action pair except the final one, which corresponds to the terminal state.
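A sketch of the batch update, assuming the proximity weights of Eq. (8) carry over to Eqs. (9a)-(9b) and (10); the episode bookkeeping format is ours:

```python
def batch_sarsa_update(Q, episode, alpha=0.1, gamma=0.9):
    # episode: one (lo, hi, a_lo, a_hi, rho_lo, rho_hi, R) tuple per pulse,
    # where lo/hi index the bracketing state parameters, a_lo/a_hi the
    # action indices under the episode policy, and rho_* the Eq. (8) weights
    for k in range(len(episode) - 1):          # the final pair is terminal
        lo, hi, a_lo, a_hi, r_lo, r_hi, R = episode[k]
        nlo, nhi, na_lo, na_hi, nr_lo, nr_hi, _ = episode[k + 1]
        # Eq. (10): proximity-weighted value of the next state-action pair
        q_next = nr_lo * Q[nlo, na_lo] + nr_hi * Q[nhi, na_hi]
        # Eqs. (9a)-(9b): apply the TD target to both bracketing entries
        Q[lo, a_lo] += alpha * r_lo * (R + gamma * q_next - Q[lo, a_lo])
        Q[hi, a_hi] += alpha * r_hi * (R + gamma * q_next - Q[hi, a_hi])
```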

After all updates have been completed for each oscillator's state-action-reward sequence, an episode of training is complete. With the updated state-action value matrix Q(s, a), a new policy is chosen for the next episode, and the process repeats. After all training episodes are complete, we use Q(s, a) to determine the optimal policy π*, where π*(sp) = arg max_a Q(sp, a) for each state parameter sp. We note that different oscillators in a network can have different optimal policies π*.

In our implementation, for each case we let the network evolve for 15 cycles per episode, with initial phases selected uniformly at random in [0, 2π) and a coupling strength l = 1.0. We parameterize the state with 101 evenly spaced parameters (P = 100) and discretize the policy actions into 201 evenly spaced values (A = 200). For each experiment, we simulate the network for 100,000 episodes. In Eq. (7), we use wΛ = N/(N − 1) and wa = 1/l.

Acknowledgments

We thank Steven Strogatz for helpful discussions.

Supplementary Material

Supplementary material is available at PNAS Nexus online.

Funding

The work was supported in part by the National Science Foundation under Grant 2012075.

Authors’ Contribution

T.A. and Y.W. conceived the project. Z.C., T.A., and Y.W. performed the research. Z.C., T.A., Y.Z., and Y.W. discussed the results. Z.C., T.A., Y.Z., and Y.W. wrote the paper.

Data Availability

All data are included in the manuscript and/or supporting information.

References

1. Abbott LF, van Vreeswijk C. 1993. Asynchronous states in networks of pulse-coupled oscillators. Phys Rev E. 48:1483.
2. Timme M, Wolf F, Geisel T. 2002. Coexistence of regular and irregular dynamics in complex networks of pulse-coupled oscillators. Phys Rev Lett. 89:258701.
3. Timme M, Wolf F, Geisel T. 2002. Prevalence of unstable attractors in networks of pulse-coupled oscillators. Phys Rev Lett. 89:154105.
4. Timme M, Wolf F, Geisel T. 2004. Topological speed limits to network synchronization. Phys Rev Lett. 92:074101.
5. Pazó D, Montbrió E. 2014. Low-dimensional dynamics of populations of pulse-coupled oscillators. Phys Rev X. 4:011009.
6. Dörfler F, Bullo F. 2014. Synchronization in complex networks of phase oscillators: a survey. Automatica. 50:1539–1564.
7. O'Keeffe KP, Krapivsky PL, Strogatz SH. 2015. Synchronization as aggregation: cluster kinetics of pulse-coupled oscillators. Phys Rev Lett. 115:064101.
8. Kannapan D, Bullo F. 2016. Synchronization in pulse-coupled oscillators with delayed excitatory/inhibitory coupling. SIAM J Control Optim. 54:1872–1894.
9. Vogell A, Schilcher U, Bettstetter C. 2020. Deadlocks in the synchronization of pulse-coupled oscillators on star graphs. Phys Rev E. 102:062211.
10. Righetti L, Buchli J, Ijspeert AJ. 2006. Dynamic Hebbian learning in adaptive frequency oscillators. Physica D. 216:269–281.
11. Bartos M, Vida I, Jonas P. 2007. Synaptic mechanisms of synchronized gamma oscillations in inhibitory interneuron networks. Nat Rev Neurosci. 8:45–56.
12. Pervouchine DD, et al. 2006. Low-dimensional maps encoding dynamics in entorhinal cortex and hippocampus. Neural Comput. 18:2617–2650.
13. Timme M, Wolf F. 2008. The simplest problem in the collective dynamics of neural networks: is synchrony stable? Nonlinearity. 21:1579.
14. Canavier CC, Achuthan S. 2010. Pulse coupled oscillators and the phase resetting curve. Math Biosci. 226:77–96.
15. Ding Y, Ermentrout B. 2021. Traveling waves in non-local pulse-coupled networks. J Math Biol. 82:1–20.
16. Bolhasani E, Valizadeh A. 2015. Stabilizing synchrony by inhomogeneity. Sci Rep. 5:13854.
17. Smeal RM, Ermentrout GB, White JA. 2010. Phase-response curves and synchronized neural networks. Philos Trans R Soc Lond B Biol Sci. 365:2407–2422.
18. Stiefel KM, Ermentrout GB. 2016. Neurons as oscillators. J Neurophysiol. 116:2950–2960.
19. Burton SD, Ermentrout GB, Urban NN. 2012. Intrinsic heterogeneity in oscillatory dynamics limits correlation-induced neural synchronization. J Neurophysiol. 108:2115–2133.
20. Canavier CC, Wang S, Chandrasekaran L. 2013. Effect of phase response curve skew on synchronization with and without conduction delays. Front Neural Circuits. 7:194.
21. Abouzeid A, Ermentrout B. 2009. Type-II phase resetting curve is optimal for stochastic synchrony. Phys Rev E. 80:011911.
22. Ermentrout B. 1991. An adaptive model for synchrony in the firefly Pteroptyx malaccae. J Math Biol. 29:571–585.
23. Bell-Pedersen D, et al. 2005. Circadian rhythms from multiple oscillators: lessons from diverse organisms. Nat Rev Genet. 6:544–556.
24. Peskin CS. 1975. Mathematical aspects of heart physiology. New York: Courant Institute of Mathematical Sciences, New York University.
25. Ly C, Weinberg SH. 2018. Analysis of heterogeneous cardiac pacemaker tissue models and traveling wave dynamics. J Theor Biol. 459:18–35.
26. Nakano K, Nanri N, Tsukamoto Y, Akashi M. 2021. Mechanical activities of self-beating cardiomyocyte aggregates under mechanical compression. Sci Rep. 11:15159.
27. Vidal J, Haggerty J. 1987. Synchronization in neural nets. Proceedings of the IEEE Conference on Neural Information Processing Systems — Natural and Synthetic (NIPS-87), Denver, CO.
28. Néda Z, Ravasz E, Brechet Y, Vicsek T, Barabási A-L. 2000. The sound of many hands clapping. Nature. 403:849–850.
29. Sundararaman B, Buy U, Kshemkalyani AD. 2005. Clock synchronization for wireless sensor networks: a survey. Ad Hoc Netw. 3:281–323.
30. Silvestre D, Hespanha J, Silvestre C. 2019. Desynchronization for decentralized medium access control based on Gauss–Seidel iterations. Proceedings of the Annual American Control Conference (ACC). p. 4049–4054.
31. O'Keeffe KP, Hong H, Strogatz SH. 2017. Oscillators that sync and swarm. Nat Commun. 8:1–13.
32. Gao H, Wang Y. 2018. A pulse-based integrated communication and control design for decentralized collective motion coordination. IEEE Trans Automat Control. 63:1858–1864.
33. Barciś A, Bettstetter C. 2020. Sandsbots: robots that sync and swarm. IEEE Access. 8:218752–218764.
34. Anglea T, Wang Y. 2019. Decentralized heading control with rate constraints using pulse-coupled oscillators. IEEE Trans Control Netw Syst. 7:1090–1102.
35. Wang Z, Wang Y. 2018. Pulse-coupled oscillators resilient to stealthy attacks. IEEE Trans Signal Process. 66:3086–3099.
36. Wang Z, Wang Y. 2020. An attack-resilient pulse-based synchronization strategy for general connected topologies. IEEE Trans Automat Control. 65:3784–3799.
37. Stankovski T, Pereira T, McClintock PV, Stefanovska A. 2017. Coupling functions: universal insights into dynamical interaction mechanisms. Rev Mod Phys. 89:045001.
38. Klinglmayr J, Kirst C, Bettstetter C, Timme M. 2012. Guaranteeing global synchronization in networks with stochastic interactions. New J Phys. 14:073031.
39. Lyu H. 2015. Synchronization of finite-state pulse-coupled oscillators. Physica D. 303:28–38.
40. Lyu H. 2018. Global synchronization of pulse-coupled oscillators on trees. SIAM J Appl Dyn Syst. 17:1521–1559.
41. Wang Y, Doyle III FJ. 2012. Optimal phase response functions for fast pulse-coupled synchronization in wireless sensor networks. IEEE Trans Signal Process. 60:5583–5588.
42. Klinglmayr J, Bettstetter C, Timme M, Kirst C. 2017. Convergence of self-organizing pulse-coupled oscillator synchronization in dynamic networks. IEEE Trans Automat Control. 62:1606–1619.
43. Mirollo R, Strogatz S. 1990. Synchronization of pulse-coupled biological oscillators. SIAM J Appl Math. 50:1645–1662.
44. Ernst U, Pawelzik K, Geisel T. 1995. Synchronization induced by temporal delays in pulse-coupled oscillators. Phys Rev Lett. 74:1570.
45. Goel P, Ermentrout B. 2002. Synchrony, stability, and firing patterns in pulse-coupled oscillators. Physica D. 163:191–216.
46. Zeitler M, Daffertshofer A, Gielen CCAM. 2009. Asymmetry in pulse-coupled oscillators with delay. Phys Rev E. 79:065203.
47. Nishimura J, Friedman EJ. 2011. Robust convergence in pulse-coupled oscillators with delays. Phys Rev Lett. 106:194101.
48. Nishimura J, Friedman EJ. 2012. Probabilistic convergence guarantees for type-II pulse-coupled oscillators. Phys Rev E. 86:025201.
49. Núñez F, Wang Y, Doyle III FJ. 2015. Synchronization of pulse-coupled oscillators on (strongly) connected graphs. IEEE Trans Automat Control. 60:1710–1715.
50. Hata S, Arai K, Galán RF, Nakao H. 2011. Optimal phase response curves for stochastic synchronization of limit-cycle oscillators by common Poisson noise. Phys Rev E. 84:016229.
51. Harada T, Tanaka HA, Hankins MJ, Kiss IZ. 2010. Optimal waveform for the entrainment of a weakly forced oscillator. Phys Rev Lett. 105:088301.
52. Pfeuty B, Thommen Q, Lefranc M. 2011. Robust entrainment of circadian oscillators requires specific phase response curves. Biophys J. 100:2557–2565.
53. Wang Y, Núñez F, Doyle FJ. 2012. Energy-efficient pulse-coupled synchronization strategy design for wireless sensor networks through reduced idle listening. IEEE Trans Signal Process. 60:5293–5306.
54. Núñez F, Wang Y, Teel AR, Doyle III FJ. 2016. Synchronization of pulse-coupled oscillators to a global pacemaker. Syst Control Lett. 88:75–80.
55. Fan H, Kong LW, Lai YC, Wang X. 2021. Anticipating synchronization with machine learning. Phys Rev Res. 3:023237.
56. Guth S, Sapsis TP. 2019. Machine learning predictors of extreme events occurring in complex dynamical systems. Entropy. 21:925.
57. Chowdhury SN, Ray A, Mishra A, Ghosh D. 2021. Extreme events in globally coupled chaotic maps. J Phys Complex. 2:035021.
58. Thiem TN, Kooshkbaghi M, Bertalan T, Laing CR, Kevrekidis IG. 2020. Emergent spaces for coupled oscillators. Front Comput Neurosci. 14:36.
59. Bassi H, et al. 2022. Learning to predict synchronization of coupled oscillators on randomly generated graphs. Sci Rep. 12:15056.
60. Canavier CC, Tikidji-Hamburyan RA. 2017. Globally attracting synchrony in a network of oscillators with all-to-all inhibitory pulse coupling. Phys Rev E. 95:032215.
61. Anglea TB. 2017. Phase desynchronization in pulse-coupled oscillator networks: a new algorithm and approach [PhD thesis]. Clemson University.
62. Mitchell B, Petzold L. 2018. Control of neural systems at multiple scales using model-free, deep reinforcement learning. Sci Rep. 8:10721.
63. Sewak M. 2019. Temporal difference learning, SARSA, and Q-learning. Singapore: Springer.
64. Sutton RS, Barto AG. 2018. Reinforcement learning: an introduction. Cambridge, MA: MIT Press.

Author notes

Competing Interest: The authors declare no competing interest.
