Accelerated FDPS: Algorithms to use accelerators with FDPS
Masaki Iwasawa, Daisuke Namekata, Keigo Nitadori, Kentaro Nomura, Long Wang, Miyuki Tsubouchi, Junichiro Makino
Publications of the Astronomical Society of Japan, Volume 72, Issue 1, February 2020, 13, https://doi.org/10.1093/pasj/psz133
Abstract
We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the available degree of parallelism of the interaction function and that of the hardware. We have modified the interface of the user-provided interaction functions so that accelerators are used more efficiently. We also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS, and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.
1 Introduction
In this paper we describe new algorithms implemented in FDPS (Framework for Developing Particle Simulators: Iwasawa et al. 2016; Namekata et al. 2018), to make efficient use of accelerators such as GPGPUs (general-purpose computing on graphics processing units). FDPS is designed to make it easy for researchers to develop their own programs for particle-based simulations. To develop efficient parallel programs for particle-based simulations requires a very large amount of work, comparable with the work of a large team of people for many years. This is of course true not only for particle-based simulations, but for any large-scale parallel applications in computational science. The main cause of this problem is that modern high-performance computing (HPC) platforms have become very complex, requiring a lot of effort to develop complex programs to make efficient use of such platforms.
Typical modern HPC systems are actually a cluster of computing nodes connected through a network, each with typically one or two processor chips. The largest systems at present consist of around 10^5 nodes, and we will see even larger systems soon. This extremely large number of nodes has made the design of the inter-node network very difficult, and the design of parallel algorithms has also become much harder. The calculation times of the nodes must be accurately balanced. The time necessary for communication must be small enough that the use of large systems is meaningful. The communication bandwidth between nodes is much lower than the main memory bandwidth, which itself is very small compared to the calculation speed of CPUs. Thus, it is crucial to avoid communications as much as possible. The calculation time can show an increase, instead of decreasing as it should, when we use a large number of nodes, unless we are careful to achieve good load balance between nodes and to minimize communication.
In addition, the programming environments available on present-day parallel systems are not easy to use. What is most widely used is MPI (Message-Passing Interface), which requires one to write explicitly how each node communicates with others in the system. Just to write and debug such a program is difficult, and it has become nearly impossible for any single person or even for a small group of people to develop large-scale simulation programs which run efficiently on modern HPC systems.
Moreover, this extremely large number of nodes is just one of the many difficulties of using modern HPC systems, since even within one node there are many levels of parallelism to be taken care of by the programmer. To make matters even more complicated, these multiple levels of parallelism are interwoven with multiple levels of memory hierarchy with varying bandwidth and latency. For example, the supercomputer Fugaku, which is under development in Japan at the time of writing, will have 48 CPU cores in one chip. These 48 cores are divided into four groups, each with 12 cores. Cores in one group share one level-2 cache memory. The cache memories in different groups communicate with each other through a cache coherency protocol. Thus, access by one core to data that happens to be in its own level-2 cache is fast, but access to data in the cache of another group can be very slow. Also, access to the main memory is much slower, while access to the local level-1 cache is much faster. Thus, we need to take into account the number of cores and the size and speed of caches at each level to achieve acceptable performance. To make matters even worse, many modern microprocessors have level-3 and even level-4 caches.
As a result of these difficulties, only a small number of researchers (or groups of researchers) can develop their own simulation programs. In the case of cosmological and galactic N-body and smoothed particle hydrodynamics (SPH) simulations, Gadget (Springel 2005) and pkdgrav (Stadel 2001) are the most widely used. For star cluster simulations, NBODY6++ and NBODY6++GPU (Nitadori & Aarseth 2012) are effectively the standard. For planetary ring dynamics, REBOUND (Rein & Liu 2012) has been available. There has been no public code for simulations of the planetary formation process until recently.
This situation is clearly unhealthy. In many cases the physics that needs to be modeled is quite simple: particles interact through gravity, and with some other interactions such as physical collisions. Even so, almost all researchers are now forced to use existing programs developed by someone else, simply because HPC platforms have become too difficult to use. To add new functionality to existing programs can be very difficult. In order to make it possible for researchers to develop their own parallel code for particle-based simulations, we have developed FDPS (Iwasawa et al. 2016).
The basic idea of FDPS is to separate the code for parallelization and that for interaction calculation and numerical integration. FDPS provides the library functions necessary for parallelization, and using them researchers write programs very similar to what they would write for a single CPU. Parallelization on multiple nodes and on multiple cores in a single node are taken care of by FDPS.
FDPS provides three sets of functions. One is for domain decomposition. Given the data of particles in each node, FDPS performs the decomposition of the computational domain. The decomposed domains are assigned to MPI processes. The second set of functions lets MPI processes exchange particles, so that each particle is sent to the appropriate MPI process. The third set of functions performs the interaction calculation. FDPS uses a parallel version of the Barnes–Hut algorithm for both long-range interactions such as gravitational interactions and short-range interactions such as intermolecular forces or fluid interactions. The application program supplies the function to perform interaction calculations for two groups of particles (one group exerting forces on the other), and FDPS calculates the interaction using that function.
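To make this division of labor concrete, the sketch below shows what the per-step loop of an FDPS-based application typically looks like. The class and method names (ParticleSystem, DomainInfo, decomposeDomainAll, exchangeParticle, calcForceAllAndWriteBack) follow FDPS conventions, but the declarations and argument lists are simplified, and CalcGravityEpEp, CalcGravityEpSp, kick_drift, n_step, and dt are hypothetical user-side names.

```cpp
// Illustrative skeleton of an FDPS-based N-body code (simplified sketch; not
// the verbatim FDPS API). The three FDPS services appear as calls (1)-(3).
PS::ParticleSystem<FPGrav> system;   // user-defined full-particle type
PS::DomainInfo dinfo;
PS::TreeForForceLong<ForceGrav, EPGrav, EPGrav>::Monopole tree;

for (PS::S64 step = 0; step < n_step; step++) {
    dinfo.decomposeDomainAll(system);      // (1) domain decomposition
    system.exchangeParticle(dinfo);        // (2) move each particle to the process
                                           //     that owns its subdomain
    tree.calcForceAllAndWriteBack(CalcGravityEpEp(),  // (3) interaction calculation:
                                  CalcGravityEpSp(),  //     user-supplied functions for
                                  system, dinfo);     //     particles and superparticles
    kick_drift(system, dt);                // user-side time integration
}
```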
FDPS offers very good performance on large-scale parallel systems consisting of “homogeneous” multi-core processors, such as the K computer and Cray systems based on x86 processors. On the other hand, the architecture of large-scale HPC systems is moving from homogeneous multi-core processors to accelerator-based systems and heterogeneous multi-core processors.
GPGPUs are the most widely used accelerators, and are available on many large-scale systems. They offer price–performance ratios and performance per watt numbers significantly better than those of homogeneous systems, primarily by integrating a large number of relatively simple processors on a single accelerator chip. On the other hand, accelerator-based systems have two problems. One is that, for many applications, the communication bandwidth between CPUs and accelerators becomes the bottleneck. The second is that because CPUs and accelerators have separate memory spaces, the programming is complicated and we cannot use existing programs.
Though in general it is difficult to use accelerators, for particle-based simulations the efficient use of accelerators is not so difficult, and that is why the GRAPE families of accelerators specialized for gravitational N-body simulations have been successful (Makino et al. 2003). GPGPUs are also widely used both for collisional (Gaburov et al. 2009) and collisionless (Bédorf et al. 2012) gravitational N-body simulations. Thus, it is clearly desirable for FDPS to support accelerator-based architectures.
Though gravitational N-body simulation programs have achieved very good performance on large clusters of GPGPUs, achieving high efficiency for particle systems with short-range interactions is difficult. For example, there exist many high-performance implementations of SPH algorithms on a single GPGPU, or on a relatively small number of multiple GPGPUs (around six), but there are not many high-performance SPH programs for large-scale parallel GPGPU systems. Practically all the efficient GPGPU implementations of the SPH algorithm use GPGPUs to run the entire simulation code, in order to eliminate the communication overhead between GPGPU and CPU. The calculation cost of particle–particle interactions dominates the total calculation cost of SPH simulations. Thus, as far as the calculation cost is concerned, it is sufficient to let GPGPUs evaluate the interactions, and let CPUs perform the rest of the calculation. However, because of the relatively low communication bandwidth between CPUs and GPGPUs we need to avoid data transfer between them, and if we let GPGPUs do all the calculations it is clear that we can minimize the communication.
On the other hand, it is more difficult to develop programs for GPGPUs than for CPUs, and to develop MPI parallel programs for multiple GPGPUs is clearly even more difficult. To make such an MPI parallel program for GPGPUs run on a large cluster is close to impossible.
In order to add support for GPGPUs and other accelerators to FDPS, we decided to take a different approach. We kept the simple model in which accelerators do the interaction calculation only, and CPUs do all the rest. However, we try to minimize the communication between CPUs and accelerators as much as possible, without making the calculation on the accelerator side very complicated.
In this paper we describe our strategy of using accelerators, how application programmers can use FDPS to efficiently use accelerators, and the performance achieved. The paper is organized as follows: In section 2 we present an overview of FDPS. In section 3 we discuss the traditional approach of using accelerators for interaction calculation, and its limitations. In section 4 we summarize our new approach. In section 5 we show how users can use new APIs of FDPS to make use of accelerators. In section 6 we give the results of performance measurement for GPGPU-based systems, along with a performance prediction for hypothetical systems. Section 7 provides a summary and discussion.
2 Overview of FDPS
The basic idea (or the ultimate goal) of FDPS is to make it possible for researchers to develop their own high-performance, highly parallel particle-based simulation programs without spending too much time writing, debugging, and performance tuning the code. In order to achieve this goal, we have designed FDPS so that it provides all necessary functions for efficient parallel programming of particle-based simulations. FDPS uses MPI for inter-node parallelization and OpenMP for intra-node parallelization. In order to reduce the communication between computing nodes, the computational domain is divided using the recursive multisection algorithm (Makino 2004), but with weights for particles to achieve optimal load balancing (Ishiyama et al. 2009). The number of subdomains is equal to the number of MPI processes, and one subdomain is assigned to one MPI process.
Initially, particles are distributed to MPI processes in an arbitrary way. It is not necessary that the initial distribution is based on spatial decomposition, and it is even possible that initially just one process has all the particles, if it has sufficient memory. After the spatial coordinates of subdomains are determined, for each particle, the MPI process to which it belongs is determined, and it is sent to that process; this can all be achieved just by calling FDPS library functions. In order for the FDPS functions to get information on particles and copy or move them, they need to know the data structure of the particles. This is made possible by making FDPS “template based,” so that at compile time the FDPS library functions know the data structure of the particles.
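To illustrate what “template based” means in practice, the following is a minimal sketch of user-defined data types for a gravity-only code. The accessor names (getPos, getCharge, copyFromFP, copyFromForce, clear) follow the conventions FDPS expects, but the exact set of required methods depends on the interaction type and FDPS version, so treat the listing as illustrative only.

```cpp
// Minimal user-defined particle types for gravity (illustrative sketch).
// FDPS only touches these classes through accessor methods, so it can copy,
// sort, and exchange particles without knowing anything else about them.
struct ForceGrav {                 // result of the interaction calculation
    PS::F64vec acc;
    PS::F64    pot;
    void clear() { acc = 0.0; pot = 0.0; }
};

struct FPGrav {                    // everything the application keeps per particle
    PS::F64vec pos, vel, acc;
    PS::F64    mass, pot;
    PS::F64vec getPos() const { return pos; }
    void setPos(const PS::F64vec &p) { pos = p; }
    void copyFromForce(const ForceGrav &f) { acc = f.acc; pot = f.pot; }
};

struct EPGrav {                    // minimal data used in the force calculation
    PS::F64vec pos;
    PS::F64    mass;
    PS::F64vec getPos()    const { return pos; }
    PS::F64    getCharge() const { return mass; }
    void copyFromFP(const FPGrav &fp) { pos = fp.pos; mass = fp.mass; }
};
```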
After the particles are moved to their new locations, the interaction calculation is done through the parallel Barnes–Hut algorithm based on the local essential tree (Makino 2004). In this method, each MPI process first constructs a tree structure from its local particles (local tree). Then, it sends to all other MPI processes the information necessary for that MPI process to evaluate the interaction with its particles. This necessary information is called the local essential tree (LET).
After one process has received all the LETs from all other nodes, it constructs the global tree by merging the LETs. In FDPS, merging is not actually done but LETs are first reduced to arrays of particles and superparticles (hereafter called “SPJ”), and a new tree is constructed from the combined list of all particles. Here, an SPJ represents a node of the Barnes–Hut tree.
Finally, the interaction calculation is done by traversing the tree for each particle. Using Barnes’ vectorization algorithm (Barnes 1990), we traverse the tree for a group of local particles and create the “interaction list” for that group. Then, FDPS calculates the interaction exerted by particles and superparticles in this interaction list on particles in the group, by calling the user-supplied interaction function.
In the case of long-range interaction we use the standard Barnes–Hut scheme for treewalk. In the case of short-range interaction such as SPH interaction between particles, we still use treewalk but with the cell-opening criterion different from the standard opening angle.
Thus, users of FDPS can use the functions for domain decomposition, particle migration, and interaction calculation by passing their own particle data class and interaction calculation function to FDPS at compile time. The interaction calculation function should be designed to receive two arrays of particles, one exerting the “force” on the other.
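As an example, a user-supplied gravity kernel for the CPU might look like the sketch below (building on the illustrative types shown earlier). It is written as a templated functor so that the same code can serve both particle–particle and particle–superparticle lists in the monopole case; the functor form, names, and softening value are illustrative, not the verbatim FDPS sample code.

```cpp
#include <cmath>

// User-supplied interaction function (CPU sketch): FDPS calls it with one group
// of force-receiving particles (ep_i) and that group's interaction list (ep_j),
// which may contain either particles or monopole superparticles.
template <class TParticleJ>
struct CalcGravity {
    void operator()(const EPGrav *ep_i, const PS::S32 n_i,
                    const TParticleJ *ep_j, const PS::S32 n_j,
                    ForceGrav *force) {
        const PS::F64 eps2 = 1.0e-6;   // gravitational softening (illustrative value)
        for (PS::S32 i = 0; i < n_i; i++) {
            PS::F64vec ai = 0.0;
            PS::F64    pi = 0.0;
            const PS::F64vec xi = ep_i[i].getPos();
            for (PS::S32 j = 0; j < n_j; j++) {
                PS::F64vec dr = ep_j[j].getPos() - xi;
                PS::F64 r2    = dr * dr + eps2;          // operator* = inner product
                PS::F64 rinv  = 1.0 / std::sqrt(r2);
                PS::F64 mr3   = ep_j[j].getCharge() * rinv * rinv * rinv;
                ai += mr3 * dr;
                pi -= ep_j[j].getCharge() * rinv;
            }
            force[i].acc += ai;
            force[i].pot += pi;
        }
    }
};
```

With the templated form, the same functor can be passed to the top-level force-calculation call twice, once instantiated for EPJs and once for the monopole superparticle type, since only positions and masses are accessed.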
3 Traditional approach to using accelerators and its limitation
As we have already stated in the introduction, accelerators have been used for gravitational N-body simulations, both on single and parallel machines, with and without Barnes–Hut treecode (Barnes & Hut 1986). In the case of the tree algorithm, the idea is to use Barnes’ vectorization algorithm, which is what we defined as the interface between the user-defined interaction function and FDPS. Thus, in principle we can use accelerators just by replacing the user-defined interaction function with one that uses the accelerators. In the case of GRAPE processors, that would be the only thing we need to do. At the same time, this would be the only thing we can do.
On modern GPGPUs, however, we need to modify the interface and algorithm slightly. There are two main reasons for this modification. The first is that the software overhead of GPGPUs for data transfer and kernel startup is much larger than for GRAPE processors. Another difference is in the architecture. GRAPE processors consist of a relatively small number of highly pipelined, application-specific pipeline processors for interaction calculation, with hardware support for the fast summation of results from multiple pipelines. On the other hand, GPGPUs consist of a very large number of programmable processors, with no hardware support for summation of the results obtained on multiple processors. Thus, to make efficient use of GPGPUs we need to calculate interactions on a large number of particles by a single call to the GPGPU computing kernel. The vectorization algorithm has one adjustable parameter, ngrp, the number of particles which share one interaction list, and it is possible to make efficient use of GPGPUs by making ngrp large. However, using an excessively large ngrp causes an increase in the total calculation cost, and is thus not desirable.
Hamada et al. (2009) introduced an efficient way to use GPGPUs that they called the “multiwalk” method. In their method, the CPU first constructs multiple interaction lists for multiple groups of particles, and then sends them to the GPGPU in a single kernel call. The GPGPU performs the calculations for multiple interaction lists in parallel, and returns all the results in a single data transfer. In this way we can tolerate the large overhead of invoking computing kernels on GPGPUs and the lack of support for fast summation.
Even though this multiwalk method is quite effective, there is still room for improvement: on modern accelerators, the efficiency we can achieve with the multiwalk method alone is rather limited.
The biggest remaining inefficiency comes from the fact that with the multiwalk method we send interaction lists for each particle group. One interaction list is an array of physical quantities (at least positions and masses) of particles. Typically, the number of particles in an interaction list is ∼10 times more than the number of particles for which that interaction list is constructed, and thus the transfer time of the interaction list is around 10 times longer than that of the particles which receive the force. This means that we are sending the same particles (and superparticles) multiple times when we send multiple interaction lists.
In the next section we discuss how we can reduce the amount of communication and also further reduce the calculation cost for the parts other than the force calculation kernel.
4 New algorithms
As described in the previous section, to send all the particles in the interaction list to accelerators is inefficient because we send the same particles and SPJs multiple times. In subsection 4.1 we will describe new algorithms to overcome this inefficiency. In subsection 4.2 we will also describe new algorithms to further reduce the calculation cost for the parts other than the force calculation kernel. In subsection 4.3 we will describe the actual procedures with and without the new algorithms.
4.1 Indirect addressing of particles
When we use the interaction list method on systems with accelerators, in the simplest implementation, for each group of particles and its interaction list, we send the physical quantities necessary for the interaction calculation, such as positions and masses in the case of gravitational force calculation. Roughly speaking, the number of particles in the interaction list is around ten times larger than that in one group. Thus, we are sending around 10n particles, where n is the number of particles per MPI process, at each timestep. Since there are only n local particles and the number of particles and tree nodes in an LET is generally much smaller than n, this means that we are sending the same data many times, and that we should be able to reduce the communication by sending particle and tree node data only once. Some GRAPE processors, including GRAPE-2A, MDGRAPE-x, and GRAPE-8, have hardware support for this indirect addressing (Makino & Daisaka 2012).
In the case of programmable accelerators, this indirect addressing can be achieved by first sending arrays of particles and tree nodes, and then sending the interaction list (here the indices of particles and tree nodes indicating their location in the arrays). The user-defined interaction calculation function should be modified so that it uses indirect addressing to access particles. Examples of such code are included in the current FDPS distribution (version 4.0 and later), and we plan to develop template routines which can be used to generate code on multiple platforms from user-supplied code for the interaction calculation.
The interaction list is usually an array of 32-bit integers (four bytes), and one item of particle data is at least 16 bytes (when positions and masses are all in single precision numbers), but can be much larger in the case of SPH and other methods. Thus, with this method we can reduce the communication cost by a large factor.
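As a sketch of the accelerator side of this scheme, the hypothetical CUDA kernel below assumes that the array of j-particles and tree nodes has already been copied to device memory once, so each group only needs its i-particles and a list of 4-byte indices rather than a packed copy of 16-byte (or larger) j-particle data. The struct layouts and names are assumptions for illustration.

```cuda
// Illustrative CUDA kernel using indirect addressing: the j-particle data
// (epj_all) resides in device memory, and the interaction list of each group
// is just an array of 32-bit indices into it.
struct EpjDev   { float x, y, z, m; };          // 16-byte j-particle (assumed layout)
struct ForceDev { float ax, ay, az, pot; };

__global__ void gravity_index_kernel(const EpjDev *epj_all,   // all particles + SPJs
                                     const int    *id_list,   // indices for this group
                                     int  n_list,
                                     const EpjDev *epi,       // force receivers
                                     int  n_epi,
                                     float eps2,
                                     ForceDev *force) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_epi) return;
    float xi = epi[i].x, yi = epi[i].y, zi = epi[i].z;
    float ax = 0.f, ay = 0.f, az = 0.f, pot = 0.f;
    for (int jj = 0; jj < n_list; jj++) {
        const EpjDev p = epj_all[id_list[jj]];  // indirect load: 4-byte index -> 16-byte data
        float dx = p.x - xi, dy = p.y - yi, dz = p.z - zi;
        float r2   = dx*dx + dy*dy + dz*dz + eps2;
        float rinv = rsqrtf(r2);
        float mr3  = p.m * rinv * rinv * rinv;
        ax += mr3 * dx;  ay += mr3 * dy;  az += mr3 * dz;
        pot -= p.m * rinv;
    }
    force[i].ax = ax; force[i].ay = ay; force[i].az = az; force[i].pot = pot;
}
```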
One limitation of this indirect addressing method is that all the particles in one process should fit in the memory of the accelerator. Most accelerators have relatively small memories. In such cases we can still use this method, by dividing the particles into blocks small enough to fit in the accelerator memory. For each block, we construct a “global” tree structure similar to that for all particles in the process, and interaction lists for all groups under the block.
4.2 Reuse of interaction lists
For both long-range and short-range interactions, FDPS constructs the interaction lists for groups of particles. It is possible to keep using the same interaction lists for multiple timesteps if particles do not move large distances in a single timestep. In the case of SPH or molecular dynamics simulations, it is guaranteed that particles move only a small fraction of the interparticle distance in a single timestep, since the size of the timestep is limited by the stability condition. Thus, in such cases we can safely use the interaction lists for several timesteps.
Even in the case of gravitational many-body simulations, there are cases where the change of the relative distance between particles in a single timestep is small. For example, in simulations of planetary formation processes or planetary rings, the random velocities of particles are very small, and thus, even though particles move large distances, there is no need to reconstruct the tree structure at each timestep because the changes in the relative positions of particles are small.
In the case of galaxy formation simulation using the Nbody+SPH technique, generally the timestep for the SPH part is much smaller than that for the gravity part, and thus we should be able to use the same tree structure and interaction lists for multiple SPH steps.
If we use this algorithm (hereafter the reuse algorithm), the procedures for the interaction calculation differ between steps with and without tree construction. The procedure for the tree construction step is:
1. Construct the local tree.
2. Construct the LETs for all other processes. These LETs should be lists of indices of particles and tree nodes, so that they can be used later.
3. Exchange LETs. Here, the physical information for tree nodes and particles should be exchanged.
4. Construct the global tree.
5. Construct the interaction lists.
6. Perform the interaction calculation for each group using the constructed list.
The procedure for reusing steps is:
1. Update the physical information of the local tree.
2. Exchange LETs.
3. Update the physical information of the global tree.
4. Perform the interaction calculation for each group using the constructed list.
In many cases we can keep using the same interaction list for around 10 timesteps. In the case of planetary ring simulation, using the same list for a much larger number of timesteps is possible, because the stepsize of planetary ring simulation using the “soft-sphere” method (Iwasawa et al. 2018) is limited by the hardness of the “soft” particles and is thus much smaller than the usual timescale determined by the local velocity dispersion and interparticle distance.
With this reuse algorithm, we can reduce the cost of the following steps: (a) tree construction, (b) LET construction, and (c) interaction list construction. The calculation costs of steps (a) and (c) are O(N) and O(N log N), respectively. Thus they are rather large for simulations with a large number of particles. Moreover, by reducing the cost of step (c), we can make the group size ngrp small, which reduces the extra calculation cost incurred by sharing one interaction list among all particles in a group. Thus, the overall improvement in efficiency is quite significant.
The construction and maintenance of interaction lists and other necessary data structures are all done within FDPS. Therefore, user-developed application programs can use this reuse algorithm just by calling the FDPS interaction calculation function with one additional argument indicating reuse/construction. The necessary change for an application program is very small.
4.3 Procedures with or without the new algorithms
In this section we describe the actual procedures of simulations using FDPS with or without the new algorithms. Before describing the procedures, let us introduce the four particle data types FDPS uses: FP (Full Particle), EPI (Essential Particle I), EPJ (Essential Particle J), and FORCE. FP is the data structure containing all the information on a particle, EPI (EPJ) holds the minimal data of particles that receive (give) the force, and the FORCE type stores the calculated interaction. FDPS uses these three additional data types to minimize memory access during the interaction calculation. We first describe the procedure for the calculation without the reuse algorithm and then describe that for the reuse algorithm.
At the beginning of one timestep, the computational domains assigned to MPI processes are determined and all processes exchange particles so that all particles belong to their appropriate domains. Then, the coordinates of the root cell of the tree are determined using the positions of all particles. After the determination of the root cell, each MPI process constructs its local tree. The local tree construction consists of the following four steps:
1. Generate Morton keys for all particles.
2. Sort key–index pairs in Morton order by radix sort.
3. Reorder FPs in Morton order referencing the key–index pairs and copy the particle data from FPs to EPIs and EPJs.
4. For each level of the tree, from top to bottom, allocate tree cells and link their child cells. In each level we use binary search to find the cell boundaries.
In the case of the reusing step, these steps are skipped.
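For reference, steps 1 and 2 can be implemented along the following lines: each coordinate is quantized relative to the root cell and its bits are interleaved to form the Morton key, and the key–index pairs are then sorted. The 10-bit-per-dimension key and the use of std::sort below are simplifications for illustration; FDPS itself uses longer keys and a radix sort.

```cpp
// Simplified Morton-key generation (10 bits per dimension), illustrating
// steps 1-2 of the local-tree construction. Not the actual FDPS implementation.
#include <cstdint>
#include <algorithm>
#include <vector>

static inline uint32_t expand_bits(uint32_t v) {
    // Spread the lower 10 bits of v so that two zero bits separate each bit.
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Coordinates are assumed to be normalized to [0,1) relative to the root cell.
static inline uint32_t morton_key(double px, double py, double pz) {
    uint32_t x = (uint32_t)(px * 1024.0);
    uint32_t y = (uint32_t)(py * 1024.0);
    uint32_t z = (uint32_t)(pz * 1024.0);
    return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z);
}

struct KeyIndex { uint32_t key; int index; };

// Step 2: sort key-index pairs in Morton order (std::sort here for brevity).
void sort_in_morton_order(std::vector<KeyIndex> &pairs) {
    std::sort(pairs.begin(), pairs.end(),
              [](const KeyIndex &a, const KeyIndex &b) { return a.key < b.key; });
}
```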
After the construction of the local tree, multipole moments of all local tree cells are calculated, from the bottom to the top of the tree. The calculation of the multipole moments is performed even at the reusing step, because the physical quantities of particles are updated at each timestep.
After the calculation of the multipole moments of the local tree, each MPI process constructs LETs and sends them to other MPI processes. When the reusing algorithm is used, at the tree construction step each MPI process saves the LETs and their destination processes.
After the exchange of LETs, each MPI process constructs the global tree from received LETs and its local tree. The procedure is almost the same as that for the local tree construction.
After the construction of the global tree, each MPI process calculates the multipole moments of all cells of the global tree. The procedure is the same as that for the local tree.
After the calculation of the moments of the global tree, each MPI process constructs the interaction lists and uses them to perform the force calculation. If we do not use the multiwalk method, each MPI process makes the interaction lists for one particle group and then the user-defined force kernel calculates the forces from EPJs and SPJs in the interaction list on the EPIs in the particle group.
When we use the multiwalk method, each MPI process makes multiple interaction lists for multiple particle groups. When the indirect addressing method is not used, each MPI process gives multiple groups and multiple interaction lists to the interaction kernel on the accelerator. Thus we can summarize the force calculation procedure without the indirect addressing method for multiple particle groups as follows:
1. Construct the interaction list for multiple particle groups.
2. Copy EPIs and the interaction lists to the send buffer for the accelerator. Here, the interaction list consists of EPJs and SPJs.
3. Send particle groups and their interaction lists to the accelerator.
4. Let the accelerator calculate interactions on the particle groups sent at step 3.
5. Receive the results calculated at step 4 and copy them back to the FPs, integrate the orbits of FPs, and copy the data from FPs to EPIs and EPJs.
To calculate the forces on all particles, the above steps are repeated until all particle groups are processed. Note that the construction of the interaction list (step 1), sending the data to the accelerator (step 3), the actual calculation (step 4), and receiving the calculated result (step 5) can all be overlapped.
On the other hand, when the indirect addressing method is used, each MPI process sends the data of all cells of the global tree to the accelerator at the beginning of the interaction calculation, before the construction of the interaction lists. After that, the accelerator receives the data of the particle groups and their interaction lists, but here an interaction list contains the indices of EPJs and SPJs and not their physical quantities. Thus, the calculation procedure with the indirect addressing method is the same as that without it, except that all the global tree data are sent at the beginning of the calculation and the interaction lists sent during the calculation contain only indices of tree cells and EPJs.
We can use the reusing method both with and without the indirect addressing method. For the construction step, the procedures are the same. For the reusing steps, we can skip the construction of the interaction lists (step 1). When we use the indirect addressing method, we can also skip sending the lists, since the lists of indices are unchanged during reuse.
5 APIs for using accelerators
In this section we describe the APIs (application programming interfaces) of FDPS for using accelerators and how to use them by showing sample code developed for NVIDIA GPGPUs. Part of the user kernel is written in CUDA.
FDPS has high-level APIs to perform all the procedures for the interaction calculation in a single API call. For the multiwalk method, FDPS provides calcForceAllAndWriteBackMultiWalk or calcForceAllAndWriteBackMultiWalkIndex. The difference between these two functions is that the former does not use the indirect addressing method. These two API functions can be used to replace calcForceAllAndWriteBack, which is another top-level function provided by the FDPS distribution since version 1.0. A user must provide two force kernels: the “dispatch” and “retrieve” kernels. The “dispatch” kernel is used to send EPIs, EPJs, and SPJs to accelerators and call the force kernel. The “retrieve” kernel is used to collect FORCEs from accelerators. The reason FDPS needs two kernels is to allow overlap of the calculation on the CPU with the force calculation on the accelerator, as described in the previous section.
The reusing method can be used with all three top-level API calls described above. The only thing users need to do to use the reusing method is to pass an appropriate FDPS-provided enumerated value to these functions so that the reusing method is enabled. The values FDPS provides are MAKE_LIST, MAKE_LIST_FOR_REUSE, and REUSE_LIST. At the construction step the application program should pass MAKE_LIST_FOR_REUSE to the top-level API function so that FDPS constructs the trees and the interaction lists and saves them. At the reusing step, the application program should pass REUSE_LIST so that FDPS skips the construction of the trees and reuses the interaction lists constructed at the last construction step. In the case of MAKE_LIST, FDPS also constructs the trees and the interaction lists but does not save them, so they cannot be reused. Figure 1 shows an example of how to use the reusing method. Here, the trees and the interaction lists are constructed once every eight steps. While the same list is being reused, particles should remain in the same MPI process as at the time of list construction. Thus, exchangeParticle should be called only just before the tree construction step.

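Since the listing in figure 1 is not reproduced here, the following sketch conveys the same structure: trees and interaction lists are rebuilt every eight steps, exchangeParticle is called only just before a construction step, and the mode flag selects between construction and reuse. The enumerated values MAKE_LIST_FOR_REUSE and REUSE_LIST are the FDPS-provided ones named above; the surrounding names are simplified application code, and the argument list of calcForceAllAndWriteBackMultiWalkIndex is abbreviated (its exact signature depends on the FDPS version).

```cpp
// Sketch of a main loop using the reuse method (illustrative).
const PS::S32 n_reuse = 8;                       // rebuild trees and lists every 8 steps
for (PS::S64 step = 0; step < n_step; step++) {
    const bool construction = (step % n_reuse == 0);
    if (construction) {
        dinfo.decomposeDomainAll(system);        // domains may change only here
        system.exchangeParticle(dinfo);          // particles must stay on the same
    }                                            // process while the lists are reused
    const auto mode = construction ? PS::MAKE_LIST_FOR_REUSE : PS::REUSE_LIST;
    tree.calcForceAllAndWriteBackMultiWalkIndex(DispatchKernelIndex, RetrieveKernel,
                                                system, dinfo, mode
                                                /* further arguments omitted */);
    kick_drift(system, dt);                      // user-side time integration
}
```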
Figure 2 shows an example of the dispatch kernel without the indirect addressing method. FDPS gives the dispatch kernel the arrays of pointers to the EPIs, EPJs, and SPJs as arguments of the kernel (lines 3, 5, and 7). Each pointer points to the address of the first element of the arrays of EPIs, EPJs, and SPJs for one group and its interaction list. The sizes of these arrays are given by n_epi (line 4), n_epj (line 6), and n_spj (line 8). FDPS also passes “tag” (the first argument) and “n_walk” (the second argument). The argument “tag” is used to specify individual accelerators if multiple accelerators are available. However, in the current version of FDPS, “tag” is disabled and FDPS always passes 0. The argument “n_walk” is the number of particle groups and interaction lists.

Example of the dispatch kernel without the indirect addressing method. (Color online)
To overlap the actual force calculation on the GPGPU with the data transfer between the GPGPU and the CPU we use a CUDA stream, which is a sequence of operations executed on the GPGPU. In this example we used N_STREAM CUDA streams. In this paper we used eight CUDA streams because the performance of our simulations does not improve even if we use more. In each stream, n_walk / N_STREAM interaction lists are handled. The particle data types, EPIs (lines 28–33), EPJs (lines 36–41), and SPJs (lines 48–48), are copied to the send buffers for GPGPUs. Here, we use the same buffer for EPJs and SPJs because the EPJ and SPJ types are the same. In lines 55 and 56, the EPIs and EPJs are sent to the GPGPU. Then the force kernel is called in line 60.
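Because the listing in figure 2 is likewise not reproduced here, the heavily simplified sketch below shows the structure such a dispatch kernel might have: for each walk, the i-particles and the packed interaction list (EPJs followed by SPJs) are copied to send buffers, the host-to-device transfers and the force-kernel launch are queued on one of the CUDA streams, and the function returns without synchronizing. Buffer allocation, per-walk offsets, and error handling are omitted, and all helper names (pack_epi, pack_list, gravity_kernel, the buffer arrays, N_STREAM, eps2) are hypothetical.

```cuda
// Heavily simplified dispatch kernel WITHOUT indirect addressing (illustrative).
// The argument layout follows the description in the text: arrays of pointers to
// the EPIs, EPJs, and SPJs of each walk, together with their sizes.
PS::S32 DispatchKernel(const PS::S32 tag, const PS::S32 n_walk,
                       const EPGrav *epi[], const PS::S32 n_epi[],
                       const EPGrav *epj[], const PS::S32 n_epj[],
                       const PS::SPJMonopole *spj[], const PS::S32 n_spj[]) {
    for (int w = 0; w < n_walk; w++) {
        const int s = w % N_STREAM;                   // round-robin over CUDA streams
        // 1) pack i-particles and the interaction list (EPJs + SPJs) into pinned
        //    host buffers, converting to the single-precision device format
        pack_epi (epi[w], n_epi[w], h_epi[w]);
        pack_list(epj[w], n_epj[w], spj[w], n_spj[w], h_epj[w]);
        const int n_list = n_epj[w] + n_spj[w];
        // 2) queue the host-to-device copies on stream s
        cudaMemcpyAsync(d_epi[w], h_epi[w], n_epi[w] * sizeof(EpjDev),
                        cudaMemcpyHostToDevice, stream[s]);
        cudaMemcpyAsync(d_epj[w], h_epj[w], n_list * sizeof(EpjDev),
                        cudaMemcpyHostToDevice, stream[s]);
        // 3) queue the force kernel on the same stream and return immediately,
        //    so the CPU can construct the next interaction lists meanwhile
        const int nthread = 128;
        const int nblock  = (n_epi[w] + nthread - 1) / nthread;
        gravity_kernel<<<nblock, nthread, 0, stream[s]>>>(
            d_epj[w], n_list, d_epi[w], n_epi[w], eps2, d_force[w]);
    }
    return 0;
}
```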
Figure 3 shows an example of the dispatch kernel with the indirect addressing method. This kernel is almost the same as that without the indirect addressing method except for two differences. One difference is that at the beginning of one timestep, all data of the global tree (EPJs and SPJs) are sent to the GPGPU (lines 15 and 16). Whether or not the application program sends EPJs and SPJs to the GPGPU is specified by the 13th argument, “send_flag.” If “send_flag” is true, the application program sends all EPJs and SPJs. The other difference is that the indices of EPJs and SPJs are sent (lines 48–58 and 67–69) instead of their physical quantities. Here, we use the user-defined global variable CONSTRUCTION_STEP to specify whether the current step is a construction or a reusing step. At the construction step CONSTRUCTION_STEP is set to unity, and the user program sends the interaction lists to the GPGPU, where they are saved. On the other hand, at the reusing step, the user program does not send the lists and reuses the interaction lists saved on the GPGPU.

Example of the dispatch kernel with the indirect addressing method. The boxes with a solid line indicate the differences from the kernel without the indirect addressing method. (Color online)
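Again for reference, the sketch below highlights the two differences just described: when send_flag is true, all EPJs and SPJs of the global tree are copied to the device once and the function returns; otherwise only the i-particles are sent per walk, and the 4-byte index lists are sent only at construction steps, with the saved device-side copies reused otherwise. Helper names and buffers are hypothetical, as in the previous sketch.

```cuda
// Heavily simplified dispatch kernel WITH indirect addressing (illustrative).
// send_flag (the 13th argument) requests the upload of the whole global tree.
PS::S32 DispatchKernelIndex(const PS::S32 tag, const PS::S32 n_walk,
                            const EPGrav *epi[],     const PS::S32 n_epi[],
                            const PS::S32 *id_epj[], const PS::S32 n_epj[],
                            const PS::S32 *id_spj[], const PS::S32 n_spj[],
                            const EPGrav *epj_all,          const PS::S32 n_epj_all,
                            const PS::SPJMonopole *spj_all, const PS::S32 n_spj_all,
                            const bool send_flag) {
    if (send_flag) {
        // Difference 1: the whole global tree (all EPJs and SPJs) is sent once.
        pack_all(epj_all, n_epj_all, spj_all, n_spj_all, h_all);
        cudaMemcpy(d_epj_all, h_all, (n_epj_all + n_spj_all) * sizeof(EpjDev),
                   cudaMemcpyHostToDevice);
        return 0;
    }
    for (int w = 0; w < n_walk; w++) {
        const int s = w % N_STREAM;
        pack_epi(epi[w], n_epi[w], h_epi[w]);
        cudaMemcpyAsync(d_epi[w], h_epi[w], n_epi[w] * sizeof(EpjDev),
                        cudaMemcpyHostToDevice, stream[s]);
        const int n_list = n_epj[w] + n_spj[w];
        if (CONSTRUCTION_STEP) {
            // Difference 2: only 4-byte indices are transferred, and only when the
            // lists have just been rebuilt; at reusing steps the index lists kept
            // in device memory are used as they are.
            pack_index(id_epj[w], n_epj[w], id_spj[w], n_spj[w], h_id[w]);
            cudaMemcpyAsync(d_id[w], h_id[w], n_list * sizeof(PS::S32),
                            cudaMemcpyHostToDevice, stream[s]);
        }
        const int nthread = 128;
        const int nblock  = (n_epi[w] + nthread - 1) / nthread;
        gravity_index_kernel<<<nblock, nthread, 0, stream[s]>>>(
            d_epj_all, d_id[w], n_list, d_epi[w], n_epi[w], eps2, d_force[w]);
    }
    return 0;
}
```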
Figure 4 shows an example of the retrieve kernel. The same retrieve kernel can be used with and without the indirect addressing method. In line 12, the GPGPU sends the interaction results to the receive buffer of the host (D2H copy). To let the CPU wait until all functions in the same stream on the GPGPU are completed, cudaStreamSynchronize is called in line 13. Finally, the interaction results are copied to FORCEs. Note that this retrieve kernel could be optimized by launching all D2H copies at once and then processing stream 1 once it is finished, while D2H copies for other streams are still in flight. However, in most particle simulations the execution time for the force calculation is much longer than that for D2H copies. Thus we think that using this optimized kernel would not change the performance.

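As with the other listings, figure 4 is not reproduced here; the sketch below shows the corresponding structure. Its argument layout (tag, number of walks, group sizes, and the FORCE arrays to fill) follows the retrieve-kernel convention described in the text, while the buffer names are the same hypothetical ones used in the dispatch sketches.

```cuda
// Heavily simplified retrieve kernel (illustrative); the same function works with
// and without indirect addressing, since only FORCE results come back from the GPGPU.
PS::S32 RetrieveKernel(const PS::S32 tag, const PS::S32 n_walk,
                       const PS::S32 n_epi[], ForceGrav *force[]) {
    for (int w = 0; w < n_walk; w++) {
        const int s = w % N_STREAM;
        // D2H copy of the interaction results of this walk
        cudaMemcpyAsync(h_force[w], d_force[w], n_epi[w] * sizeof(ForceDev),
                        cudaMemcpyDeviceToHost, stream[s]);
        // wait until everything queued on this stream (kernel + copy) has finished
        cudaStreamSynchronize(stream[s]);
        // copy the results into the FORCE arrays handed over by FDPS
        for (int i = 0; i < n_epi[w]; i++) {
            force[w][i].acc.x = h_force[w][i].ax;
            force[w][i].acc.y = h_force[w][i].ay;
            force[w][i].acc.z = h_force[w][i].az;
            force[w][i].pot   = h_force[w][i].pot;
        }
    }
    return 0;
}
```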
6 Performance
In this section we present the measured performance and the performance model of a simple N-body simulation code implemented using FDPS on CPU-only and CPU + GPGPU systems.
6.1 Measured performance
To measure the performance of FDPS, we performed simple gravitational N-body simulations both with and without the accelerator. The initial model is a cold uniform sphere. This system will collapse in a self-similar way, so we can use the reusing method. The number of particles (per process) n is 2^22 (4M). The opening criterion of the tree, θ, is between 0.25 and 1.0, the maximum number of particles in a leaf cell is 16, and the maximum number of particles in the EPI group, ngrp, is between 64 and 16384. We performed these simulations with the four different methods listed in table 1. In the case of the reusing method, the number of reusing steps between the construction steps, nreuse, is between 2 and 16. For all simulations we used a monopole-only kernel.
Table 1.

Method | System | Multiwalk | Indirect addressing | Reusing
---|---|---|---|---
C1 | CPU | No | No | No
G1 | CPU + GPGPU | Yes | No | No
G2 | CPU + GPGPU | Yes | Yes | No
G3 | CPU + GPGPU | Yes | Yes | Yes
We used an NVIDIA TITAN V as an accelerator. Its peak performance is 13.8 Tflops for single precision calculation. The host CPU is a single Intel Xeon E5-2670 v3 with a peak speed of 883 Gflops for single precision calculation. The GPGPU is connected to the host CPU through a PCI Express 3.0 bus with 16 lanes. The main memory of the host computer is DDR4-2133. The theoretical peak bandwidth is 68 GB s−1 for the host main memory and 15.75 GB s−1 for the data transfer between the host and GPGPU. All particle data is double precision. The force calculation on the GPGPU and the data transfer between the CPU and GPGPU are performed in single precision.
Figure 5 shows the average elapsed time per step for methods C1, G1, G2, and G3 with reuse intervals nreuse of 2, 4, and 16 plotted against the average number of particles which share one interaction list 〈ni〉. We also plot the elapsed times for method G3 for the reusing step (i.e., this corresponds to the time with nreuse = ∞). The opening angle is θ = 0.5.

Averaged elapsed time per step against 〈ni〉, in the case of θ = 0.5 and n = 222. Open symbols indicate the results of the runs with method G3. Open squares, open upward triangles, and open inverted triangles indicate the results with nreuse of 2, 4, and 16, respectively. Open circles and open diamonds indicate the elapsed times of the construction step and the reusing step, respectively. Filled circles and filled squares indicate the elapsed times for methods C1 and G1, respectively. (Color online)
We can see that in the case of method G3, the elapsed time becomes smaller as the reuse interval nreuse increases, and approaches the time of the reusing step. The minimum time of method G3 at the reusing step is ten times smaller than that of method C1 and four times smaller than that of method G1. The performance of method G3 with nreuse of 16 is 3.7 Tflops (27% of the theoretical peak).
We can also see that the optimal value of 〈ni〉 becomes smaller as nreuse increases. When we do not use the reuse method, the tree construction and traversal are done at every step. Their costs are thus large, and to make them relatively small we should increase 〈ni〉. In order to do so, we need to use a large ncrit, the maximum number of particles in a particle group. If we make 〈ni〉 too large, however, the calculation cost of the force kernel increases (Makino 1991). Thus there is an optimal 〈ni〉. When we use the reuse method, the relative cost of tree traversal becomes smaller. Thus, the optimal 〈ni〉 becomes smaller and the calculation cost also becomes smaller. We give a more detailed analysis in subsection 6.2.
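The trade-off can be summarized by a rough cost estimate (a simplified version of the argument in Makino 1991; the symbols T_tree, T_walk, n_list, and t_int are introduced here only for illustration and are not part of the model in subsection 6.2):

$$
T_{\rm step} \;\sim\; \frac{T_{\rm tree} + T_{\rm walk}(\langle n_i\rangle)}{n_{\rm reuse}} \;+\; n\, n_{\rm list}(\langle n_i\rangle)\, t_{\rm int},
$$

where T_tree is the tree-construction time, T_walk the tree-traversal (interaction-list construction) time, which grows as 〈ni〉 decreases, n_list(〈ni〉) the average interaction-list length, which grows as 〈ni〉 increases, and t_int the time per interaction. A larger nreuse suppresses the first term, so the optimal 〈ni〉, which balances the two terms, shifts to smaller values where the force-calculation term is also smaller.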
Table 2 shows the breakdown of the calculation time for different methods in the case of θ = 0.5. For the runs with C1 and G1, we show the breakdown at the optimal values of 〈ni〉, which are 230 and 1500, respectively. For the runs with G2 and G3 at the reusing step, we show the breakdowns for 〈ni〉 = 230. We can see that the calculation time for G3 at the reusing step is four times smaller than for G2. Thus, if nreuse ≫ 4, the contribution of the construction step to the total calculation time is small.
Table 2. Breakdown of the total time per step (in seconds) for the runs with various methods in the case of θ = 0.5 and n = 2^22.*

Method | C1 | G1 | G2 | G3 (reusing step)
---|---|---|---|---
〈ni〉 | 230 | 1500 | 230 | 230
Set root cell | 0.0064 | 0.0066 | 0.0066 | —
Make local tree | 0.091 | 0.092 | 0.095 | —
Calculate key | 0.0084 | 0.0085 | 0.0093 | —
Sort key | 0.042 | 0.043 | 0.044 | —
Reorder ptcl | 0.030 | 0.030 | 0.030 | —
Link tree cell | 0.011 | 0.010 | 0.011 | —
Calculate multipole moment of local tree | 0.0053 | 0.0058 | 0.0053 | 0.0062
Make global tree | 0.094 | 0.094 | 0.096 | 0.0071
Calculate key | 0.0 | 0.0 | 0.0 | —
Sort key | 0.040 | 0.041 | 0.042 | —
Reorder ptcl | 0.036 | 0.036 | 0.037 | 0.0071
Link tree cell | 0.011 | 0.011 | 0.011 | —
Calculate multipole moment of global tree | 0.0066 | 0.0064 | 0.0065 | 0.0068
Calculate force | 0.76 | 0.15 | 0.21 | 0.072
Make interaction list (EPJ and SPJ) | (0.16) | 0.063 | — | —
Make interaction list (id) | — | — | 0.13 | —
Copy all particles and tree cells | — | — | (0.0065) | (0.0065)
Copy EPI | — | (0.0037) | (0.0037) | (0.0037)
Copy interaction list (EPJ and SPJ) | — | (0.020) | — | —
Copy interaction list (id) | — | — | (0.013) | —
Copy FORCES | — | (0.0060) | (0.0060) | (0.0060)
Force kernel | (0.43) | (0.11) | (0.043) | (0.043)
H2D all particles and tree cells | — | — | (0.0073) | (0.0073)
H2D EPI | — | (0.0042) | (0.0042) | (0.0042)
H2D interaction list (EPJ and SPJ) | — | (0.034) | — | —
H2D interaction list (id) | — | — | (0.022) | —
D2H FORCE | — | (0.0067) | (0.0067) | (0.0067)
Write back + integration (+ copy ptcl) | 0.015 | 0.016 | 0.021 | 0.025
Total | 0.98 | 0.36 | 0.42 | 0.095

*The second row shows the 〈ni〉 used in this measurement. The times in parentheses are the estimated times using the performance model discussed in the following sections, because the calculations corresponding to these times are hidden by other calculations and we cannot measure them.
Figure 6 shows the average elapsed time at optimal 〈ni〉 plotted against θ for methods C1, G1, G2, and G3 with nreuse of 2, 4, and 16. We also plot the elapsed times for the reusing step of method G3. The slope for method C1 is steeper than for the other methods. This is because the time for the force kernel dominates the total time in the case of method C1 and it strongly depends on θ.

Total times for various methods at the optimal 〈ni〉 against θ, in the case of n = 2^22. The symbols are as in figure 5. (Color online)
6.2 Performance model on a single node
6.2.1 Model of T_root
Symbol . | Value . | Explanation . |
---|---|---|
αconst list | 19 | Slowdown factor for constructing the interaction list. |
αcp all | 0.87 | Slowdown factor for copying EPJs and SPJs to the send buffer. |
αcp EPI | 1.0 | Slowdown factor for copying EPIs to the send buffer. |
αcp FORCE | 1.2 | Slowdown factor for copying FORCEs from the receive buffer. |
αcp list | 1.0 | Slowdown factor for copying the interaction lists to the send buffer. |
αD2H FORCE | 1.2 | Slowdown factor for sending FORCEs from the GPGPU to the host. |
αdc sort | 7.7 | Slowdown factor for determining domains. |
α(all)gather | 0.62 | Slowdown factor for MPI_(All)gather. |
αGPU calc | 1.7 | Slowdown factor for calculating interactions on the GPGPU. |
αGPU transfer | 2.7 | Slowdown factor for sending EPJs from the main memory of the GPGPU. |
αH2D all | 1.0 | Slowdown factor for sending EPJs and SPJs from host to GPGPU. |
αH2D EPI | 1.0 | Slowdown factor for sending EPIs from host to GPGPU. |
αH2D list | 0.86 | Slowdown factor for sending the interaction lists from host to GPGPU. |
αkey | 0.85 | Slowdown factor for constructing the key–index pairs. |
αlink | 3.4 | Slowdown factor for linking tree cells. |
αmom | 1.8 | Slowdown factor for calculating multipole moments of tree cells. |
αp2p | 1.0 | Slowdown factor for communicating neighboring nodes. |
αreorder lt | 1.1 | Slowdown factor for reordering particles for the local tree. |
|$\alpha ^{\rm const}_{\rm reorder\, gt}$| | 5.0 | Slowdown factor for reordering particles for the global tree at the construction step. |
|$\alpha ^{\rm reuse}_{\rm reorder\, gt}$| | 1.0 | Slowdown factor for reordering particles for the global tree at the reuse step. |
αroot | 0.70 | Slowdown factor for searching the root cell. |
αsort | 1.1 | Slowdown factor for sorting the key–index pairs. |
αwb+int+cp | 1.4 | Slowdown factor for writing back FORCEs, integrating orbits, and copying variables |
from FP to EPI and EPJ. | ||
b EPI | 24 bytes | Size of an EPI. |
b EPI buf | 12 bytes | Size of an EPI in the receive buffer. |
b EPJ | 32 bytes | Size of an EPJ. |
b EPJ buf | 16 bytes | Size of an EPJ in the receive buffer. |
b FORCE | 32 bytes | Size of a FORCE. |
b FORCE buf | 16 bytes | Size of a FORCE in the receive buffer. |
b FP | 88 bytes | Size of an FP. |
b index | 4 bytes | Size of an array index for EPJ and SPJ. |
b ID | 4 bytes | Size of an interaction list index for EPJ and SPJ. |
b ID buf | 4 bytes | Size of a receive buffer index for EPJ and SPJ. |
b key | 16 bytes | Size of a key–index pair. |
b pos | 24 bytes | Size of the position of an FP. |
n op | 23 | The number of operations per interaction. |
τ(all)gather, startup | 1.2 × 10−5 s | Start-up time for MPI_(All)gather of the K computer. |
τp2p,startup | 1.0 × 10−5 s | Start-up time for communicating with neighboring nodes of the K computer. |
Symbol . | Value . | Explanation . |
---|---|---|
αconst list | 19 | Slowdown factor for constructing the interaction list. |
αcp all | 0.87 | Slowdown factor for copying EPJs and SPJs to the send buffer. |
αcp EPI | 1.0 | Slowdown factor for copying EPIs to the send buffer. |
αcp FORCE | 1.2 | Slowdown factor for copying FORCEs from the receive buffer. |
αcp list | 1.0 | Slowdown factor for copying the interaction lists to the send buffer. |
αD2H FORCE | 1.2 | Slowdown factor for sending FORCEs from the GPGPU to the host. |
αdc sort | 7.7 | Slowdown factor for determining domains. |
α(all)gather | 0.62 | Slowdown factor for MPI_(All)gather. |
αGPU calc | 1.7 | Slowdown factor for calculating interactions on the GPGPU. |
αGPU transfer | 2.7 | Slowdown factor for sending EPJs from the main memory of the GPGPU. |
αH2D all | 1.0 | Slowdown factor for sending EPJs and SPJs from host to GPGPU. |
αH2D EPI | 1.0 | Slowdown factor for sending EPIs from host to GPGPU. |
αH2D list | 0.86 | Slowdown factor for sending the interaction lists from host to GPGPU. |
αkey | 0.85 | Slowdown factor for constructing the key–index pairs. |
αlink | 3.4 | Slowdown factor for linking tree cells. |
αmom | 1.8 | Slowdown factor for calculating multipole moments of tree cells. |
αp2p | 1.0 | Slowdown factor for communicating neighboring nodes. |
αreorder lt | 1.1 | Slowdown factor for reordering particles for the local tree. |
|$\alpha ^{\rm const}_{\rm reorder\, gt}$| | 5.0 | Slowdown factor for reordering particles for the global tree at the construction step. |
|$\alpha ^{\rm reuse}_{\rm reorder\, gt}$| | 1.0 | Slowdown factor for reordering particles for the global tree at the reuse step. |
αroot | 0.70 | Slowdown factor for searching the root cell. |
αsort | 1.1 | Slowdown factor for sorting the key–index pairs. |
αwb+int+cp | 1.4 | Slowdown factor for writing back FORCEs, integrating orbits, and copying variables |
from FP to EPI and EPJ. | ||
b EPI | 24 bytes | Size of an EPI. |
b EPI buf | 12 bytes | Size of an EPI in the receive buffer. |
b EPJ | 32 bytes | Size of an EPJ. |
b EPJ buf | 16 bytes | Size of an EPJ in the receive buffer. |
b FORCE | 32 bytes | Size of a FORCE. |
b FORCE buf | 16 bytes | Size of a FORCE in the receive buffer. |
b FP | 88 bytes | Size of an FP. |
b index | 4 bytes | Size of an array index for EPJ and SPJ. |
b ID | 4 bytes | Size of an interaction list index for EPJ and SPJ. |
b ID buf | 4 bytes | Size of a receive buffer index for EPJ and SPJ. |
b key | 16 bytes | Size of a key–index pair. |
b pos | 24 bytes | Size of the position of an FP. |
n op | 23 | The number of operations per interaction. |
τ(all)gather, startup | 1.2 × 10−5 s | Start-up time for MPI_(All)gather of the K computer. |
τp2p,startup | 1.0 × 10−5 s | Start-up time for communicating with neighboring nodes of the K computer. |
6.2.2 Model of Tconst lt
The reason why αlink is much larger than unity is that, in the binary search, each address to be accessed depends on the value that has just been read, so successive memory accesses cannot be overlapped.
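The following minimal sketch (placeholder names, not FDPS code) illustrates this data-dependent access pattern: every new array index is derived from the value returned by the previous load, so the loop runs at memory latency rather than memory bandwidth.

```cpp
// Minimal illustration of a data-dependent binary search over sorted keys.
// All names are placeholders; this is not taken from FDPS.
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the first key that is >= the search key.
std::size_t lower_bound_index(const std::vector<std::uint64_t>& keys,
                              std::uint64_t key) {
    std::size_t lo = 0, hi = keys.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        // The next value of lo or hi -- and therefore the next load address --
        // depends on keys[mid], which has only just been read.
        if (keys[mid] < key) lo = mid + 1;
        else                 hi = mid;
    }
    return lo;
}
```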
6.2.3 Model of Tmom lt
6.2.4 Model of Tconst gt
6.2.5 Model of Treorder gt
In table 3 we can see that |$\alpha ^{\rm reuse}_{\rm reorder\, gt}$| is much smaller than |$\alpha ^{\rm const}_{\rm reorder\, gt}$| because we use the array indices for EPJ and SPJ saved at the construction step to reorder the particles.
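As an illustration, a reuse-step reordering of this kind reduces to a single gather with the saved permutation. The sketch below uses placeholder types and names and is not the FDPS implementation.

```cpp
// Illustrative sketch of the reuse-step reordering: instead of rebuilding and
// sorting key-index pairs, particles are gathered with the permutation saved
// at the construction step.  The type and the function are placeholders.
#include <cstddef>
#include <vector>

struct EPJ { double x, y, z, m; };  // stand-in for a 32-byte j-particle

void reorder_with_saved_indices(const std::vector<EPJ>&         src,
                                const std::vector<std::size_t>& saved_index,
                                std::vector<EPJ>&               dst) {
    dst.resize(saved_index.size());
    // A single streaming gather: no key generation and no sort, which is
    // consistent with the slowdown factor dropping from 5.0 to 1.0 in table 3.
    for (std::size_t i = 0; i < saved_index.size(); ++i) {
        dst[i] = src[saved_index[i]];
    }
}
```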
6.2.6 Model of Tmom gt
6.2.7 Models of Tforce,const and Tforce,reuse
In table 3 we can see that αconst list and αGPU transfer are much larger than unity because random access of the main memory is slow for both the host and the GPGPU.
6.2.8 Model of Tstep,const and Tstep,reuse
In order to check whether the model we constructed is reasonable, in figure 7 we compare the times predicted by equations (31) and (32) with the times measured on systems consisting of an Intel Xeon E5-2670 v3 with an NVIDIA TITAN V (left panel) and an Intel Core i9-9820X with an NVIDIA RTX 2080 (right panel). The predicted times agree very well with the measured times on both systems.

Total elapsed time for both the construction and reusing steps on systems consisting of an Intel Xeon E5-2670 v3 with an NVIDIA TITAN V (left panel) and an Intel Core i9-9820X with an NVIDIA RTX 2080 (right panel). The solid (dashed) curve and filled circles (triangles) indicate, respectively, the time predicted by the model and the measured time for the construction (reusing) step. (Color online)
In the following, we analyze the performance of the N-body simulations on hypothetical systems with various Bhost, Btransfer, FGPU, and BGPU by using the performance model. Unless otherwise noted, we assume a hypothetical system with Bhost = 100 GB s⁻¹, Btransfer = 10 GB s⁻¹, BGPU = 500 GB s⁻¹, and FGPU = 10 Tflops as a reference system. This reference system can be regarded as a modern HPC system with a high-end accelerator.
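To make the structure of such an estimate concrete, the sketch below implements a deliberately simplified cost model, not equations (31) and (32): the per-step time is split into a host-memory term, a host–accelerator transfer term, and compute and memory terms on the accelerator. All byte counts and the interactions-per-particle value are rough placeholders chosen only for illustration.

```cpp
// A deliberately simplified cost model (not the paper's equations 31 and 32),
// used only to show how the four hardware parameters enter such an estimate.
#include <cstdio>

struct HwSystem {
    double Fgpu;      // accelerator compute throughput [flop/s]
    double Bhost;     // host main-memory bandwidth     [byte/s]
    double Bgpu;      // accelerator memory bandwidth   [byte/s]
    double Btransfer; // host-accelerator bus bandwidth [byte/s]
};

// Per-step time for n particles, assuming n_int interactions per particle.
double step_time(const HwSystem& s, double n, double n_int) {
    const double host_bytes     = 15.0 * 64.0 * n;   // ~15n particles copied in host memory (placeholder 64 B each)
    const double transfer_bytes =  3.0 * 32.0 * n;   // ~3n particles over the bus (placeholder 32 B each)
    const double gpu_flops      = 23.0 * n_int * n;  // n_op = 23 operations per interaction
    const double gpu_bytes      = 32.0 * n_int * n;  // crude estimate of j-data traffic on the device
    return host_bytes / s.Bhost + transfer_bytes / s.Btransfer
         + gpu_flops / s.Fgpu   + gpu_bytes / s.Bgpu;
}

int main() {
    const HwSystem reference{10e12, 100e9, 500e9, 10e9};  // the reference system above
    std::printf("T_step ~ %.3e s\n", step_time(reference, 4.0e6, 2000.0));
    return 0;
}
```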
Figure 8 shows the calculation time per timestep on the reference system for 4M particles and θ = 0.5, with nreuse = 1, 4, and 16. The difference in performance between nreuse = 4 and nreuse = 16 is relatively small. In the rest of this section we use nreuse = 16 and θ = 0.5.

Calculation time per timestep on the reference system for various nreuse. (Color online)
We consider four different types of hypothetical systems: GPU2X, GPU4X, LINK4X, and LINK0.5X. Their parameters are listed in table 4. Figure 9 shows the calculation times per timestep for the four hypothetical systems. We can see that increasing the bandwidth between CPU and accelerator (LINK4X) has a relatively small effect on the performance. On the other hand, increasing the overall accelerator performance has a fairly large impact.

Mean execution time per step, Tstep,single, on various hypothetical systems against 〈ni〉. (Color online)
Table 4.

System | FGPU | Bhost | BGPU | Btransfer
---|---|---|---|---
Reference | 10 Tflops | 100 GB s⁻¹ | 500 GB s⁻¹ | 10 GB s⁻¹
GPU2X | 20 Tflops | 100 GB s⁻¹ | 1 TB s⁻¹ | 10 GB s⁻¹
GPU4X | 40 Tflops | 100 GB s⁻¹ | 2 TB s⁻¹ | 10 GB s⁻¹
LINK4X | 10 Tflops | 100 GB s⁻¹ | 500 GB s⁻¹ | 40 GB s⁻¹
LINK0.5X | 10 Tflops | 100 GB s⁻¹ | 500 GB s⁻¹ | 5 GB s⁻¹
Figure 10 shows the relative execution time of hypothetical systems in the two-dimensional plane of Bhost and Btransfer. The contours indicate the execution time relative to the reference system. In other words, systems on the contour with a value of X are X times slower than the reference system.

Relative execution time of hypothetical systems in the two-dimensional plane of relative Bhost and relative Btransfer. The contours indicate the execution time relative to the reference system. The values of the contours and the x- and y-axes are normalized by those of the reference system.
We can see that an increase in Bhost or Btransfer, or even in both, would not give a significant performance improvement, while an increase in the accelerator performance gives a significant speedup. Even when both Bhost and Btransfer are reduced by a factor of ten, the overall speed is reduced by a factor of four. Thus, if the speed of the accelerator is improved by a factor of ten, the overall speedup is four. We can therefore conclude that the current implementation of FDPS can provide good performance not only on the current generation of GPGPUs but also on future generations of GPGPUs and other accelerator-based systems, which will have relatively poor data-transfer bandwidth compared to their calculation performance.
6.3 Performance model on multiple nodes
In this section we discuss the performance model of the parallelized N-body simulation with method G3. Here, we assume the network is the same as that of the K computer. Detailed communication algorithms are given in Iwasawa et al. (2016).
In the original implementation of FDPS, MPI_Alltoallv was used for the exchange of LETs, and it was the main performance bottleneck for large np (Iwasawa et al. 2016). We therefore recently developed a new algorithm for the exchange of LETs that avoids MPI_Alltoallv (Iwasawa et al. 2018). The new algorithm proceeds as follows (a minimal sketch is given after the list):
1. Each process sends the multipole moment of the top-level cell of its local tree to all processes using MPI_Allgather.
2. Each process calculates its distance to every other domain.
3. If the distance between processes i and j is large enough that the domain of process i can be treated as a single cell from process j, process j already has the necessary information from step 1. If not, process i constructs a LET for process j and sends it to process j.
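The sketch below shows only the communication pattern of this algorithm: an allgather of the top-level moments, followed by a distance test that decides which processes still need an explicitly constructed LET. The Moment struct, the opening criterion, and all names are placeholders, and the LET construction and the point-to-point sends themselves are omitted, so this is not the FDPS implementation.

```cpp
// Communication-pattern sketch of the LET exchange described above.
#include <mpi.h>
#include <cmath>
#include <vector>

struct Moment { double x, y, z, mass; };  // stand-in for the top-level multipole moment

// Hypothetical opening criterion: a remote domain of the given size may be
// treated as a single cell if it is far enough away for the opening angle theta.
bool far_enough(const Moment& a, const Moment& b, double size, double theta) {
    const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return size < theta * std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Returns the ranks to which this process must still send a LET (step 3).
std::vector<int> let_targets(Moment my_moment, double domain_size, double theta,
                             MPI_Comm comm) {
    int rank = 0, nproc = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);

    // Step 1: gather the top-level moment of every process.
    std::vector<Moment> all(nproc);
    MPI_Allgather(&my_moment, static_cast<int>(sizeof(Moment)), MPI_BYTE,
                  all.data(), static_cast<int>(sizeof(Moment)), MPI_BYTE, comm);

    // Steps 2-3: distant processes already have what they need from step 1;
    // only the nearby ones require an explicitly built LET.
    std::vector<int> targets;
    for (int j = 0; j < nproc; ++j) {
        if (j != rank && !far_enough(all[rank], all[j], domain_size, theta))
            targets.push_back(j);  // a LET for process j would be built and sent here
    }
    return targets;
}
```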
Figure 11 shows the weak-scaling performance for N-body simulations in the case of the number of particles per process n = 2²⁰. Here we assume that nreuse = ndc = 16, nsmp = 30, 〈ni〉 = 250, and θ = 0.5. We can see that Tstep,para is almost constant for np ≲ 10⁵. For np ≳ 10⁵, Tstep,para increases slightly because TLET,allgather becomes large. Roughly speaking, the parallel efficiency is high when n is much larger than np.

Weak-scaling calculation time plotted as a function of np. The number of particles per node is 2²⁰. (Color online)
Figure 12 shows the strong-scaling performance in the cases of a total number of particles N = 2³⁰ (left panel) and N = 2⁴⁰ (right panel). In the case of N = 2⁴⁰, Tstep,para scales well up to np = 10⁶. On the other hand, in the case of N = 2³⁰, the slope of Tstep,para becomes shallower for np ≳ 3000 because TLET,p2p becomes relatively large, and Tstep,para increases for np ≳ 50000 because TLET,allgather becomes relatively large. We can also see that Tstep,single increases linearly with np for np ≳ 10⁵. This is because Tstep,single depends on nLET, which is proportional to np for large np. Thus, to improve the scalability further we need to reduce TLET,allgather and TLET,p2p. We will discuss ideas for reducing them in subsections 7.1 and 7.2.

Strong-scaling calculation time plotted as a function of np. The left and right panels show the results for total numbers of particles N of 2³⁰ and 2⁴⁰, respectively. (Color online)
7 Discussion and summary
7.1 Further reduction of communication
For simulations on multiple nodes, the communication of the LETs with neighboring nodes (TLET,p2p) becomes a bottleneck for very large np. Thus, it is important to reduce the data size for this communication.
An obvious way to reduce the amount of data transfer is to apply some data compression techniques. For example, the physical coordinates of neighboring particles are similar, and thus there is clear room for compression. However, for many particle-based simulations, compression in the time domain might be more effective. In the time domain we can make use of the fact that the trajectories of most particles are smooth, so we can construct fitting polynomials from several previous timesteps. When we send the data of one particle at a new timestep, we send only the difference between the prediction from the fitting polynomial and the actual value. Since this difference is many orders of magnitude smaller than the absolute value of the coordinate itself, we should be able to use a much shorter word format for the same accuracy. We probably need some clever way to use a variable-length word format. We can apply compression in the time domain not only for coordinates but for any physical quantities of particles, including the calculated interactions such as gravitational potential and acceleration. We can also apply the same compression technique to communication between the CPU and the accelerator.
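As a concrete, if simplified, illustration of this idea, the sketch below predicts each coordinate by linear extrapolation from the two previous steps and transmits only the quantized residual. The predictor order, the 16-bit residual width, and the single shared scale factor are arbitrary choices for the sketch, not values from the paper; a real implementation would likely use higher-order fits and a variable-length format, as discussed above.

```cpp
// Toy predictor-based compression of one coordinate array.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Encoded {
    std::vector<std::int16_t> residual;  // quantized (actual - predicted)
    double scale;                        // quantization step shared by the message
};

// Sender side: predict by linear extrapolation from the two previous steps and
// quantize the residual so that it fits into 16 bits.
Encoded encode(const std::vector<double>& prev2, const std::vector<double>& prev1,
               const std::vector<double>& actual) {
    std::vector<double> res(actual.size());
    double max_abs = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const double predicted = 2.0 * prev1[i] - prev2[i];
        res[i] = actual[i] - predicted;
        max_abs = std::max(max_abs, std::fabs(res[i]));
    }
    Encoded out;
    out.scale = (max_abs > 0.0) ? max_abs / 32767.0 : 1.0;
    out.residual.resize(res.size());
    for (std::size_t i = 0; i < res.size(); ++i)
        out.residual[i] = static_cast<std::int16_t>(std::lround(res[i] / out.scale));
    return out;
}

// Receiver side: the receiver keeps its own copy of the two previous steps,
// so only the residuals and one scale factor need to be sent.
std::vector<double> decode(const std::vector<double>& prev2,
                           const std::vector<double>& prev1, const Encoded& enc) {
    std::vector<double> out(enc.residual.size());
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = 2.0 * prev1[i] - prev2[i] + enc.residual[i] * enc.scale;
    return out;
}
```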
7.2 Tree of domains
As we have seen in figure 12, for very large np the total elapsed time increases linearly with np, because the elapsed times for the exchange of LETs and the construction of the global tree are proportional to np in that regime. To remove this linear dependence on np we can introduce a tree structure over the computational domains (a tree of domains; Iwasawa et al. 2018). By using the tree of domains and exchanging locally combined multipole moments between distant nodes, we can eliminate the MPI_Allgather over all processes used to exchange LETs, so that the times for the exchange of LETs and the construction of the global tree no longer increase linearly with np.
7.3 Further improvement in single-node performance
Considering the trends in HPC systems, the overall performance of the accelerator (FGPU and BGPU) increases faster than the bandwidths of the host main memory (Bhost) and the data bus between the host and the accelerator (Btransfer). Therefore, in the near future the main performance bottleneck could be Bhost and Btransfer.
The amounts of data copied in the host main memory and transferred between the host and the accelerator in the reusing step are summarized in tables 5 and 6, respectively. The amounts of data copied and transferred are about 15n and 3n. One reason for this large amount of data access is that there are four different particle data types (FP, EPI, EPJ, and FORCE) and data are copied between them. If we used only one particle data type, we could avoid the data copying in procedures (e) and (g) of table 5 and procedure (B) of table 6; the amount of data copying in the main memory and of data transfer between host and accelerator could then be reduced by 40% and 33%, respectively. A possible unified layout is sketched below.
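The sketch below shows one way such a single particle type could look: one struct carries the fields that would otherwise be split across FP, EPI, EPJ, and FORCE, so the interaction kernel and the integrator operate on the same array and the copies in procedures (e), (g), and (B) are no longer needed. The field layout is a guess for illustration and does not reproduce the sizes listed in table 3.

```cpp
// Hypothetical unified particle type replacing separate FP/EPI/EPJ/FORCE types.
#include <vector>

struct Particle {
    double x, y, z;     // position     (read as EPI and EPJ)
    double vx, vy, vz;  // velocity     (used only by the integrator)
    double mass;        // interaction charge (read as EPJ)
    double ax, ay, az;  // acceleration (written back as FORCE)
    double pot;         // potential    (written back as FORCE)
};

// With a single type the kick step reads the accelerations in place; there is
// no separate FORCE receive buffer to copy from.
void kick(std::vector<Particle>& particles, double dt) {
    for (auto& p : particles) {
        p.vx += p.ax * dt;
        p.vy += p.ay * dt;
        p.vz += p.az * dt;
    }
}
```

The quoted reductions of 40% and 33% follow directly from the tables: procedures (e) and (g) account for 6n of the 15n copied in the host memory, and procedure (B) for n of the 3n transferred over the bus.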
Table 5. Amount of particle data copied in the host main memory for the specified procedures (left column).

Procedure | Amount of data
---|---
(a) Calculate multipole moments of local tree | n
(b) Reorder particles for global tree | 3n
(c) Calculate multipole moments of global tree | n
(d) Copy EPJ and SPJ to the send buffer | 2n
(e) Copy EPI to the send buffer | 2n
(f) Copy FORCE from the receive buffer | 2n
(g) Write back FORCEs to FPs, integrate orbits of FPs, and copy FPs to EPIs and EPJs | 4n
Table 6. Amount of particle data transfer between host and accelerator for the specified procedures (left column).

Procedure | Amount of data
---|---
(A) Send EPJ and SPJ | n
(B) Send EPI | n
(C) Receive FORCE | n
Another way to improve the performance is to implement all procedures on the accelerator, because the bandwidth of the device memory (BGPU) is much higher than Bhost and Btransfer. In this case, the performance would be determined by the overall performance of the accelerator.
7.4 Summary
In this paper we have described the algorithms we implemented in FDPS to allow efficient and easy use of accelerators. Our algorithm is based on Barnes’ vectorization, which has been used both on general-purpose computers (and thus in previous versions of FDPS) and on GRAPE special-purpose processors and GPGPUs. However, we have minimized the amount of communication between the CPU and the accelerator by an indirect addressing method, and we have further reduced the amount of calculation on the CPU side by reusing the interaction lists. The performance improvement over the simple method based on Barnes’ vectorization on the CPU can be as large as a factor of ten on the current generation of accelerator hardware. We can also expect a fairly large performance improvement on future hardware, even though its relative communication performance is expected to degrade.
The version of FDPS described in this paper is available at 〈https://github.com/FDPS/FDPS〉.
Acknowledgements
We are indebted to an anonymous referee for valuable comments. Numerical computations were in part carried out on the K computer at RIKEN Center for Computational Science through the HPCI System Research project (Project ID:ra000008) and on the Cray XC50 at the Center for Computational Astrophysics, National Astronomical Observatory of Japan. Part of the research covered in this paper was funded by MEXT’s program for the Development and Improvement for the Next Generation Ultra High-Speed Computer System, under its Subsidies for Operating the Specific Advanced Large Research Facilities. M.I. is supported by JSPS KAKENHI Grant Numbers JP18K11334 and JP18H04596.
Appendix 1. Performance model of force kernel on GPGPU

Elapsed time per interaction for the interaction calculation on the GPGPU. The circles, squares, and triangles indicate the results for θ = 1.0, 0.5, and 0.25, respectively. The solid curve indicates the fitted model. (Color online)
To determine the coefficients for equation (A1), we assume that FGPU is 13.8 Tflops and that BGPU is about 550 GB s⁻¹, as measured with bandwidthTest in the NVIDIA SDK. These coefficients are listed in table 3.
Appendix 2. Performance of MPI_Gather and MPI_Allgather on the K Computer
Here, we construct performance models of MPI_Gather and MPI_Allgather on the K computer. On this computer the performance of MPI_Gather is almost the same as that of MPI_Allgather. Thus, in the following we only consider MPI_Allgather.

Elapsed time for MPI_Allgather to send a two-byte message against the number of processes. The circles indicate the measurement values and the solid curve indicates our fitting model.
Figure 15 shows the measured and predicted times for MPI_Allgather against the message size. Here, we assume Binj = 4.8 GB s⁻¹ from the results of a point-to-point communication test. The parameters in equation (A5) are listed in table 3. Our model agrees reasonably well with the measured data. Note that there is a discontinuity between b = 100 and b = 1000; we think it is caused by a change of algorithm inside the MPI library.

Elapsed times for MPI_Allgather plotted against the message size. The circle, square, and triangle indicate the results with np = 4096, 512, and 64, respectively. The solid, dashed, and dotted curves indicate our fitting models for np = 4096, 512, and 64, respectively.
References