# \$105/Gflops Astrophysical *N*-Body Simulation with Reconfigurable Add-in Card and Hierarchical Tree Algorithm

Atsushi Kawai\* Saitama Institue of Technology

## Abstract

As an entry for the 2006 Gordon Bell price/performance prize, we report an astrophysical N-body simulation performed with reconfigurable computing systems and a hierarchical tree algorithm. Each of our systems consists of a host PC and a reconfigurable add-in card attached via high-speed interface, PCI Express or PCI-X. The reconfigurable add-in card houses one FPGA chip, into which we integrated 16 pipeline processors specialized for gravitational force calculation. Other operations, such as tree construction, tree traverse and time integration, are performed on the host PC. On our system, we performed a cosmological N-body simulation with 2.1 million particles, which sustained a performance of 15.39 Gflops averaged over 4.33 hours. The price of our system is \$2,384 in total and, price per performance obtained is \$158/Gflops, which is 44 times better than that obtained by GRAPE-5 special-purpose hardware, the 1999 Gordon Bell winner.

Note: after the deadline of paper submission (July 2006), we brushed up the system further, and obtained price per performance figure of **\$105/Gflops**.

**Keywords:** reconfigurable processor, many-body simulation, cosmological simulation, tree algorithm

# 1 Introduction

Astrophysical *N*-body simulation is one of the most widely used technique to investigate formation and evolution of astronomical objects, such as galaxies, galaxy clusters and large scale structures of the universe. In such simulations, we calculate gravitational force on each particle from all other particles, and integrate the orbit of each particle according to

SC2006 November 2006, Tampa, Florida, USA 0-7695-2700-0/06 \$20.00 ©2006 IEEE Toshiyuki Fukushige<sup>†</sup> University of Tokyo K&F Computing Research Co.

Newton's equation of motion. We investigate structural and dynamical properties of the simulated object.

The astrophysical *N*-body simulation has been one of grand challenge problems in computational sciences. The Gordon Bell prizes were awarded to cosmological *N*-body simulations [1][2][3][4][5] in years 1992, 96, 97, 98, and 99, to *N*-body simulation of a black hole systems in years 1995, 2000 and 2001[6][7][8], and to simulation of a proto-planets system in 2003[9].

The calculation cost of the astrophysical *N*-body simulation rapidly increases for large *N*, because it is proportional to  $N^2$ if we use a straightforward approach. The gravity is a longrange attractive force. A particle feel the forces from all other particles, no matter how they are far away. We cannot use a cutoff technique which is widely used in MD simulation (e.g. [10]). In order to reduce the calculation time, various fast algorithms have been developed.

Hierarchical tree algorithm [11] is one of fast algorithms which reduce the calculation cost from  $O(N^2)$  to  $O(N \log N)$ . In this algorithm, particle are organized in the form of a tree, and each node of the tree represents a group of particles. The force from a distance node is replaced by the force from its center of mass. The Gordon Bell prizes of years 1992, 97, 98, and 99 were awarded to *N*-body simulations with this tree algorithm [1][3][4][5], which were performed on Intel Touchstone Delta, ASCI-Red, PC cluster, an Alpha cluster, and GRAPE-5. Recently, very large-scale cosmological simulations (up to 10 billion particles) using the tree algorithm on massively parallel supercomputer (IBM Regatta [12]) and Beowulf-type PC-cluster [13] [14] were reported.

In this paper, we report astronomical *N*-body simulations performed on reconfigurable computing systems with the tree algorithm. The reconfigurable computing systems using FPGA (Field-Programmable Gate Array) chips have become an attractive option for scientific computing. FPGA is a mass-produced LSI chip, consisting of a large number of logic elements and a switching network. By programming them we can implement an arbitrary logic design. Such reconfigurable computing system could offer both flexibility and high performance in some applications. The FPGA-based reconfigurable system has been an active area of research since Splash [15], and several groups have tried to apply the idea to astronomical *N*-body simulations [16] [17]. With rapid advance of semiconductor technol-

<sup>\*</sup>e-mail: kawai@sit.ac.jp

<sup>&</sup>lt;sup>†</sup>e-mail: fukushig@kfcr.jp

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

ogy, recent computing power of FPGA chips have dramatically increased. Several vendors have developed or are developing high performance reconfigurable computer system. Such systems include Cray XD1, SRC MAPstation, and SGI RASC.

The reconfigurable computing systems on which we performed simulations are composed of a Pentium-based host PC and a reconfigurable add-in card. The reconfigurable add-in card contains one recent large-scale FPGA chip and high-speed interfaces to the host PC, PCI-X and PCI Express. We implement a design of specialized pipelines for the gravitational-force calculation, which is the most expensive part of the tree algorithm, to the FPGA chip. Actually, the logic design implemented is the same as that used for specialized hardware for gravitational interaction, GRAPE (GRAvity piPE)[18][19]. The reconfigurable addin card calculates forces among particles using the dedicated pipeline. Other operations, such as tree construction, tree traverse and time integration, are performed on the host PC. Both are connected via a high speed interface and the add-in card serves as a hardware accelerator for the force calculation.

We performed a 2.1 million-particle cosmological simulation using the tree algorithm on two different reconfigurable systems. The best sustained performance achieved is 15.39 Gflops, and the price per performance is **\$158/Gflops**. The price per performance is 44 times better than that obtained in our 1999 Gordon Bell entry run which won the first prize. The run was performed on the GRAPE-5 special-purpose hardware. The simulation settings, such as the initial particle distribution and calculation parameters, was the same as that of our new runs. The figure \$158/Gflops also exceeds the performance record \$246/Gflops in the past Gordon Bell price/performance entry, which were obtained for calculations of neural network training experiments [20].

In the rest of this paper, we describe our reconfigurable systems and the tree algorithm on them, and report price per performance of our simulations.

# 2 Reconfigurable Computing Systems

We briefly describe our reconfigurable systems used for the simulations reported in this paper. We performed simulations on two different reconfigurable computing system, namely, System-A and System-B. These two systems are different mainly in the reconfigurable add-in card.

#### 2.1 System-A

Figure 1 illustrates the configuration of System-A. It consists of a host PC and a reconfigurable add-in card, PXPD-KSXGX40 (PCI Express Design Kit) by PLD Applications Co., France[21] (hereafter PLDA PXPDKSXGX40). Figure 3 and 4 show photographs of the add-in card PLDA PX-PDKSXGX40 and System-A. PLDA PXPDKSXGX40 is attached to the PCI Express 4-lane slot of the host PC. It houses one FPGA chip, Altera Stratix GX (EP1SGX40GF1020C5), into which 10 gravitational-force pipeline, particle memory, and PCI Express interface logic are integrated. Each pipeline calculates gravitational forces between two particles in a single clock cycle. The particle memory stores the position coordinates and mass of up to 8k particles which exert the force, and supplies them to the pipelines. The interface logic is for communication with the host. As the interface logic, we used PCI Express EZ IP Core also developed by PLD Applications Co.. The host PC performs everything except force calculation, such as tree construction, tree traverse and time integration. In Table 1, specification of System-A is summarized. For the host computer, we used a selfassembled PC.

The logic design of the pipeline is basically the same as that used in the custom LSI for GRAPE-5 [5] [22] hardware. The pipeline is designed to calculate a pair-wise force with a relative error of about 0.3%. This might sound rather low, but detailed theoretical analysis [23] and numerical experiment [24] have shown that it is more than enough. The average error of the force in our simulation is around 0.1%, which is dominated by the approximation made in the tree algorithm, not by the accuracy of the hardware. The relative accuracy was practically the same, when we performed the same force calculation using standard 64-bit floating point arithmetic.

PLDA PXPDKSXGX40 operates at 125MHz clock cycle. Each pipeline performs 38 operations in a single clock cycle, if we use the same counting convention as used in [3][4]. And thus the theoretical peak speed of the card is 47.5 Gflops (= $10 \times 125$  MHz  $\times 38$  flo).

Prices of constituent elements for System-A are summarized in table 2. PLDA PXPDKSXGX40 is currently available at the price of 1,990 USD, which includes free licenses of the IP core. Although we bought it at 1,119 USD in July 2005 from Japanese distributor, we use the current price. Parts for the self-assembled host PC cost 52k JYE. Taking into account the present exchange rate of 1 USD = 118 JYE (as of April 2006), the total price of System-A is calculated as 2,429 USD.

#### 2.2 System-B

Figure 2 illustrates the configuration of System-B. It consists of a host PC and another reconfigurable add-in card,

GRAPE-7 model 100 by K&F Computing Research Co., Japan[25] (hereafter KFCR GRAPE-7). Figure 3 and 4 show photographs of the add-in card KFCR GRAPE-7 and System-B. KFCR GRAPE-7 is attached to the PCI-X 64bit/133MHz slot of the host PC. It houses one FPGA chip, Altera CycloneII (EP2C70F672C6), into which 16 gravitational-force pipeline, particle memory, and PCI-X interface logic are integrated. The logic design of the pipeline is the same as that in System-A, except for the number. As the interface logic, PCI-X IP Core developed by PLD Applications Co.. are used. In Table 1, specification of System-B is summarized. KFCR GRAPE-7 operates at 133MHz clock cycle. Thus, the theoretical peak speed of the card is 80.9 Gflops (= $16 \times 133$  MHz  $\times 38$  flo).

Prices of constituent elements for System-B are summarized in table 2. KFCR GRAPE-7 is available at the price of 186k JYE, which includes configuration data for logic design of FPGA chip. The price of the host PC is 94k JYE. The overall cost is 2,384 USD. Note that among constituent elements of System-B, Pentium D and 2GB memory may be too much since one halves of them are idle during the simulations.

Table 2: Price of two systems (JYE).

|                   | System-A        | System-B |
|-------------------|-----------------|----------|
| Add-in Card       | (\$1990)234,820 | 187,619  |
| Host              |                 |          |
| CPU               | 19,046          | 29,504   |
| Mother board      | 12,857          | 37,123   |
| Main memory       | 9,486           | 16,152   |
| Hard disc drive   | 5,219           | 6,172    |
| Power supply unit | 5,219           | 4,742    |
| Total             | 286,622         | 281,312  |
| (\$1 = 118JYE)    | \$2429          | \$2384   |

### 3 Hierarchical Tree Algorithm

Our code [26] is based on the Barnes' modified tree algorithm [27]. The code has been used for actual astrophysical research [28]. The implementation of the modified tree algorithm on a system with dedicated force-calculation pipelines were discussed in [29]. Using this algorithm, the calculation cost on the host computer is greatly reduced from that of the original algorithm and the forces exerted on multiple particles can be calculated in parallel. In the original algorithm, the interaction list is created for each particle. In the modified tree algorithm, neighboring particles are grouped and one interaction list is shared among the particles in the same group. Forces from particles in the same group is directly calculated.

The modified tree algorithm reduces the calculation cost of the host computer by roughly a factor of  $n_g$ , where  $n_g$  is the



Figure 1: Block diagram of System-A.



Figure 2: Block diagram of System-B.

|                                | System-A                          | System-B                       |
|--------------------------------|-----------------------------------|--------------------------------|
| Add-in Card                    | PLDA PXPDKSXGX40                  | KFCR GRAPE-7                   |
| # of pipeline processors       | 10                                | 16                             |
| Clock cycle of the pipelines   | 125 MHz                           | 133 MHz                        |
| Force-calculation speed (peak) | 47.5 Gflops                       | 80.9 Gflops                    |
| Communication interface        | PCI Express (4-lane)              | PCI-X (64-bit, 133 MHz)        |
| Communication speed (peak)     | 2.00 GB/s (full duplex)           | 1.07 GB/s                      |
| Particle memory size           | 8k particles                      | 2k particles                   |
| Host PC                        |                                   |                                |
| CPU                            | Pentium4 630, 3.0GHz              | PentiumD 920, 2.8GHz           |
| Mother board                   | GIGA-BYTE GA-8I945G               | ASUS P5WDG2-WS                 |
| Main memory                    | DDR2, 533MHz, $0.5$ GB $\times$ 2 | DDR2, 533MHz, 1.0GB $\times$ 2 |

Table 1: Specification of two reconfigurable computing systems.



Figure 3: Photograph of the add-in card for System-A (PLDA PXPDKSXGX40, top), and System-B (KFCR GRAPE-7, bottom).



Figure 4: Photograph of System-A (top) and System-B (bottom).

average number of particles in a group. On the other hand, the amount of work on the force pipelines increases as we increase  $n_g$ , since the interaction list becomes longer. There is, therefore, an optimal  $n_g$  at which the total computation time is minimum. The optimal  $n_g$  strongly depends on the ratio of the speed of the host computer, force pipelines, and communication. For the present configuration of both System-A and System-B, the optimal  $n_g$  is around 500.

Note that our modified tree algorithm performs larger number of operations than the tree algorithm on a general purpose computer. When we will estimate the performance in Section 4, we will make correction. Note also that the our modified tree algorithm is more accurate than the original tree algorithm for the same accuracy parameter, as shown in [27] [30].

# 4 Cosmological *N*-body Simulation

We report the performance statistics of the astrophysical *N*body simulations performed with the tree algorithm on two reconfigurable systems. The performance numbers are based on the wall-clock time obtained from system timer on the host computers.

We performed a cosmological *N*-body simulation of a sphere of radius 50Mpc (mega parsec) with 2,159,038 particles for 999 timesteps. We assigned the initial position and velocities to particles in a spherical region selected from a discrete realization of density contrast field based on a standard cold dark matter scenario using COSMICS package [31]. A particle represents  $1.7 \times 10^{10}$  solar masses. We integrated the particle orbits from z = 24, where z is redshift, to the present time. The total number of the particle-particle interactions is  $1.49 \times 10^{13}$ . This implies that the average length of the interaction list is 6,916. These simulation settings are identical to those of our 1999 Gordon Bell prize entry[5]. Figure 5 shows snapshots of the simulation.

We performed the same run on both systems, and found System-B gives slightly better performance. Table 3 summarizes the performance statistics. On System-A and System-B, the whole simulation including file I/O took 18,232 and 15,590 seconds (5.06 and 4.33 hours), respectively, resulting in the average computing speed of 31.06 and 36.31 Gflops. Here we use the operation count of 38 per interaction[3][4].

As we described in section 3, our modified tree algorithm performs larger number of operations than the tree algorithm on a general purpose computer. In order to make correction, we estimated the operation count necessary to perform the same simulation without dedicated hardware. Using four snapshots, we measured the computation time without force pipelines for various values of  $n_g$ . We found  $n_g$  around 10 gives the minimal computation time. Then, for the four snapshots, we measured the length of the interaction list at



Figure 5: Snapshots of the simulation at z = 18.8 (top), z = 4.5 (second), z = 2.6 (third), z = 0.0 (bottom, present time).

 $n_g \approx 10$ . We averaged them over, and estimated the interaction count of the overall simulation as  $6.31 \times 10^{12}$ . Based on this interaction count, we obtained the corrected sustained speed 13.16 and **15.39 Gflops**, and the price/performance \$185/Gflops and **\$158/Gflops**, respectively.

Table 3: Performance statistics for cosmological simulation.

|                             | System-A              | System-B              |
|-----------------------------|-----------------------|-----------------------|
| Interaction count           | $1.49 \times 10^{13}$ | $1.49 \times 10^{13}$ |
| Corrected interaction count | $6.31	imes10^{12}$    | $6.31	imes10^{12}$    |
| Elapsed time                | 5.06 hours            | 4.33 hours            |
| Performance (sustained)     | 31.06 Gflops          | 36.31 Gflops          |
| Performance (corrected)     | 13.16 Gflops          | 15.39 Gflops          |
| Price                       | \$2,429               | \$2,384               |
| Price per performance       | \$185/Gflops          | \$158/Gflops          |

# 5 Comparison with 1999 Gordon Bell Entry

We compare our results with the 1999 Gordon Bell entry, which was performed with special-purpose hardware, GRAPE-5[5]. Table 4 summarizes the differences among three systems. Performance statistics are for the cosmological simulation reported above. Note that we performed exactly same simulation on all three systems. We used the same initial particle distribution, the same integration scheme and timestep, and the same accuracy parameter.

The high efficiency of System-A and System-B are mainly due to the improved speed of communication and the host PC. Table 4 shows that the peak performance of the GRAPE-5 system is the highest. Even though, the simulation on the System-A, and B run 65% and 93% faster, respectively. The speed of them are not limited by the speed of communication, nor by the host, but by the pipelines (table 5). In order to achieve better performance, we can simply increase the number of pipelines. In other words, we only need to replace the FPGA chip with the larger one, so that more pipelines can be integrated.

Table 5: Breakdown of the computation time for the initial timestep.

|               | System-A | System-B |
|---------------|----------|----------|
| Host          |          |          |
| Tree creation | 1.4      | 1.4      |
| Tree traverse | 2.9      | 2.7      |
| Pipeline      | 12.7     | 7.5      |
| Communication | 2.0      | 4.6      |
| Total         | 19.0     | 16.2     |

### 6 A Small Parallel System

We also constructed a small parallel system using the addin cards. Figure 6 shows its structure. The cluster system consists of two PCs (same as those used for System-B), each of which has two KFCR GRAPE-7s. The two PCs are connected to each other via a Gigabit Ethernet. Prices of the add-in cards (\$1590 each) and PCs (\$729 each) are the same as those shown in table 2 for System-B. The total cost of the cluster is 7,903 USD, including additional cost for the Gigabit network (\$85).

On the cluster, we performed a cosmological run similar to that described in section 4, but with a larger number of particles, 5,894,159. The treecode we used is parallelized by essentially the same scheme as Hashed Oct-Tree algorithm [32] (with Peano-Hilbert ordering). The run took 1,279 timesteps and 21,379 seconds (5.94 hours). The total number of the particle-particle interactions is  $4.42 \times 10^{13}$ , the average length of the interaction list is 5,871, and the resulting computing speed is 79.02 Gflops. Following the procedure given in section 4, the corrected speed 39.10 Gflops, and the price/performance \$202/Gflops, are estimated. This result implies that our system can be scaled up with a modest increase of price/performance.

### 7 Postdeadline Achievement

We have been refining the system-B, even after the submission deadline of this paper (July 2006). At present (Oct. 2006), the system marked the price per performance of **\$105/Gflops**, which is obtained as a result of the improvements described in the following sections.

The performance of the cluster system given in Section 6 should also be improved, though we could not have enough time to construct a new cluster by the end of Oct. 2006.

#### 7.1 Fine-tuned pipeline

We optimized the design of the pipeline processor, so that it can take full advantage of the FPGA chip we used (Altera CycloneII EP2C70F672C6). In the FPGA, 300 DSP units are embedded, and each of them can be configured as a 9-bit fixed-point multiplier. We modified the numerical expression used inside the pipeline processor, so that it can utilize these dedicated multipliers.

With the modification, we succeeded to save the Logic Elements (combinational logic and registers) and the memory resources of the pipeline processor about 30% and 80%, respectively.

As a result, we successfully integrated additional 6 pipeline processors (22 in total), and additional 2k particle memory

| Table 4: Comparison of entries in 2006 and 1999. |                     |                         |                       |
|--------------------------------------------------|---------------------|-------------------------|-----------------------|
| Entry year                                       | 2006                | 2006                    | 1999                  |
|                                                  | System-A            | System-B                | GRAPE-5 system        |
| System configuration                             |                     |                         |                       |
| Processor technology                             | FPGA (90nm)         | FPGA (90nm)             | ASIC (500nm)          |
| # of pipeline processors                         | 10                  | 16                      | 32                    |
| Clock cycle of the pipelines                     | 125 MHz             | 133 MHz                 | 90 MHz                |
| Communication interface                          | PCI Express(4-lane) | PCI-X (64-bit, 133 MHz) | PCI (32-bit,33MHz)    |
| Host CPU                                         | Pentium 4/3.0GHz    | Pentium D/2.8GHz        | Alpha 21264/466MHz    |
| Performance statistics                           |                     |                         |                       |
| Force calculation (peak)                         | 47.5 Gflops         | 80.9 Gflops             | 109.4 Gflops          |
| Force calculation (effective)                    | 13.6 Gflops         | 15.39 Gflops            | 5.92 Gflops           |
| Communication (peak)                             | 2.00 GB/s           | 1.07 GB/s               | 133 MB/s              |
| Communication (effective)                        | $\sim$ 500 MB/s     | $\sim$ 500 MB/s         | $\sim 40 \text{MB/s}$ |
| Elapsed time                                     | 5.06 hours          | 4.33 hours              | 8.37 hours            |
| Cost                                             | \$2,429             | \$2,384                 | \$40,900              |
| Price per performance                            | \$185/Gflops        | \$158/Gflops            | \$7000/Gflops         |



Figure 6: A small parallel system.

(4k in total). The extra pipeline processors reduce the calculation time, and the extra particle memory reduces the amount of communication between the host and the add-in card. See table 6 for the timing details.

#### 7.2 Faster host computer

We replaced the Pentium-D based host computer (PentiumD 920, ASUS P5WDG2-WS) with a Core 2 Duo based one (Core 2 Duo E6400, ASUS P5WDG2-WS PRO). The new host peforms tree creation 35% faster, and tree traverse 20% (table 6). The cost of the new host is about 4% cheaper.

|                                | System-B in Oct. 2006          | Improvement since July 2006  |
|--------------------------------|--------------------------------|------------------------------|
| Specification                  |                                |                              |
| Add-in Card                    | KFCR GRAPE-7                   |                              |
| # of pipeline processors       | 22                             | +6                           |
| Clock cycle of the pipelines   | 133 MHz                        |                              |
| Force-calculation speed (peak) | 111.2 Gflops                   | +30.3 Gflops                 |
| Communication interface        | PCI-X (64-bit, 133 MHz)        | -                            |
| Communication speed (peak)     | 1.07 GB/s                      |                              |
| Particle memory size           | 4k particles                   | +2k particles                |
| Host PC                        | -                              | -                            |
| CPU                            | Core 2 Duo E6400               | replaced from PentiumD 920   |
| Mother board                   | ASUS P5WDG2-WS PRO             | replaced from ASUS P5WDG2-WS |
| Main memory                    | DDR2, 533MHz, 1.0GB $\times$ 2 | -                            |
|                                |                                |                              |
| Price (JYE)                    |                                |                              |
| Add-in Card                    | 187,619                        |                              |
| Host                           |                                |                              |
| CPU                            | 28,000                         | -1,504                       |
| Mother board                   | 36,172                         | -951                         |
| Main memory                    | 16,152                         |                              |
| Hard disc drive                | 6,172                          |                              |
| Power supply unit              | 4,742                          |                              |
| Total                          | 278,857                        | -2,455                       |
| (\$1 = 118JYE)                 | \$2363                         | -\$21                        |
| Dorformence                    |                                |                              |
| Letonation                     | 1.401013                       |                              |
| Interaction count              | $1.49 \times 10^{12}$          |                              |
| Corrected interaction count    | $6.31 \times 10^{12}$          | 1 20 1                       |
| Elapsed time                   | 2.95 hours                     | -1.38 hours                  |
| Performance (sustained)        | 53.16 Ghops                    | +16.85 Gnops                 |
| Performance (corrected)        | 22.59 Gflops                   | +7.2 Gflops                  |
| Price per performance          | \$105/Ghops                    | -\$53/Gnops                  |
| Timing (s)                     |                                |                              |
| Host                           |                                |                              |
| Tree creation                  | 0.9                            | -0.5                         |
| Tree traverse                  | 2.2                            | -0.5                         |
| Pipeline                       | 5.4                            | -2.1                         |
| Communication                  | 2.5                            | -2.1                         |
| Total                          | 11.0                           | -5.2                         |
|                                |                                |                              |

Table 6: Summary of the the System-B improvement

### References

[1] M. S. Warren and J. K. Salmon, in *Proceedings of Supercomputing '92*, IEEE Computer Society Press, Los Alamitos, 1992, p 570-576.

[2] T. Fukushige and J. Makino, in *Proceedings of Supercomputing* '96, IEEE Computer Society Press, Los Alamitos, 1996.

[3] M. S. Warren, J. K. Salmon, D. J. Becker, M. P. Goda, T. Sterling, and G. S. Winckelmans, in *Proceedings of Supercomputing '97*, IEEE Computer Society Press, Los Alamitos, 1997.

[4] M. S. Warren, T. C. Germann, P. S. Lamdahl, D. M. Beazley, J. K. Salmon, in *Proceedings of Supercomputing* '98, IEEE Computer Society Press, Los Alamitos, 1998.

[5] A. Kawai, T. Fukushige, and J. Makino, in *Proceedings* of SC99, IEEE Computer Society Press, Los Alamitos, 1999.

[6] J. Makino and M. Taiji, in *Proceedings of Supercomputing* '95, IEEE Computer Society Press, Los Alamitos, 1995.

[7] J. Makino, T. Fukushige, and M. Koga, in *Proceedings of SC2000*, IEEE Computer Society Press, Los Alamitos, 2000.

[8] J. Makino and T. Fukushige in *Proceedings of SC2001*, IEEE Computer Society Press, Los Alamitos, 2001.

[9] J. Makino, H. Daisaka, E. Kokubo, and T. Fukushige, in *Proceedings of SC2003*, IEEE Computer Society Press, Los Alamitos, 2003.

[10] Y. Duan and P. A. Kollman, *Science*, Vol. 282, p. 740, 1998.

[11] J. E. Barnes and P. Hut, Nature, Vol. 324, p. 446, 1986.

[12] V. Springel et al., Nature, Vol. 435, p. 629, 2005.

[13] J. Dubinski, J. Kim, C. Park, and R. Humble, *New Astronomy*, Vol. 9, p. 111, 2004.

[14] B. Moore et al., Mon. Not. Royal Astron. Soc., in press.

[15] D. A. Buell, J. M. Arnold, and W. J. Kleinfelder, *Splash2*, IEEE Computer Society Press, Los Alamitos, 1996.

[16] T. Hamada, T. Fukushige, A. Kawai, and J. Makino, *Publ. Astron. Soc. Japan*, Vol. 52 p. 943, 2000.

[17] R. Spurzem et al., in *Proceedings of Intr. Symp. Comp. Sci. and Eng.* 2002, JSPS, , 2002.

[18] D. Sugimoto, Y. Chikada, J. Makino, T. Ito, T. Ebisuzaki, and M. Umemura, *Nature* Vol. 345, p. 33, 1990.

[19] J. Makino and M. Taiji, *Scientific Simulations with Special-Purpose Computers* — *The GRAPE Systems*, John Wiley and Sons, Chichester, 1998.

[20] S. Kim, C. S. Lee, and J. S. Hwang, in *Proceedings of SC2001*, IEEE Computer Society Press, Los Alamitos, 2001.

[21] PLD Applications Co., http://www.plda.com/

[22] A. Kawai, T. Fukushige, J. Makino, and M. Taiji, in *Publ. Astron. Soc. Japan*, Vol. 52, p. 659, 2000

[23] J. Makino, T. Ito, and T. Ebisuzaki, *Publ. Astron. Soc. Japan*, Vol. 42 p. 717, 1990.

[24] L. Hernquist, P. Hut, and J. Makino, *Astrophysical Journal*, Vol. 402, p. L85, 1993.

[25] K & F Computing Research Co., http://www.kfcr.jp/product.html

[26] A. Kawai, http://www.sit.ac.jp/~user/kawai/pkg/

[27] J. Barnes, J. Comp. Phys., Vol 87, p. 161, 1990.

[28] T. Fukushige, A. Kawai, J. Makino, *Astrophysical Journal*, Vol. 606, p. 625, 2004.

[29] J. Makino, Publ. Astron. Soc. Japan, Vol. 43 p. 621, 1991.

[30] A. Kawai, J. Makino, and T. Ebisuzaki, *Astrophysical J. Supplement*, Vol. 151, p13, 2004

[31] E. Bertschinger, http://arcturus.mit.edu:80/cosmics/

[32] M. S. Warren and J. K. Salmon, in *Proceedings of Supercomputing '93*, IEEE Computer Society Press, Los Alamitos, 1993, p 12-21.