Porting the 3D Gyrokinetic Particle-in-cell Code GTC to the CRAY/NEC SX-6 Vector Architecture: Perspectives and Challenges

by

S. Ethier and Z. Lin

September 2003
PPPL Reports Disclaimer

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Availability


DOE and DOE Contractors can obtain copies of this report from:

U.S. Department of Energy
Office of Scientific and Technical Information
DOE Technical Information Services (DTIS)
P.O. Box 62
Oak Ridge, TN 37831
Telephone: (865) 576-8401
Fax: (865) 576-5728
Email: reports@adonis.osti.gov

This report is available to the general public from:

National Technical Information Service
U.S. Department of Commerce
5285 Port Royal Road
Springfield, VA 22161
Telephone: 1-800-553-6847 or (703) 605-6000
Fax: (703) 321-8547
Internet: http://www.ntis.gov/ordering.htm
The impressive performance achieved in 2002 by the Japanese Earth Simulator computer (26.58 Tflops, 64.9% of peak) [1] has revived the interest in vector processors. Having been the flagship of high performance computing for more than two decades, vector computers were gradually replaced, at least in the US, by much cheaper multi-processor super-scalar machines, such as the IBM SP and the SGI Origin series. Although the theoretical peak performance of the super-scalar processors rivals their vector counterparts, most codes cannot take advantage of this and can run only at a few percent of peak performance (< 10%). When properly vectorized, the same codes can, however, reach over 30 or even 40% of peak performance on a vector processor. Not all codes can achieve such performance. The purpose of this study is to evaluate the work/reward ratio involved in vectorizing our particle-in-cell code on the latest parallel vector machines, such as the CRAY/NEC SX-6, which is the building block of the very large Earth Simulator system in Japan [2]. This evaluation was carried out on the single node (8 cpus) SX-6 located at the Arctic Region Supercomputing Center (ARSC). Early performance results are compared to the same tests performed on the IBM SP Power 3 and Power 4 machines, on which our particle-in-cell code, GTC, is normally run.

The Gyrokinetic Toroidal Code (GTC) [3] was developed to study the dominant mechanism for energy transport in fusion devices, namely, plasma microturbulence. Being highly nonlinear, plasma turbulence is well described by particle codes for which all nonlinearities are naturally included. GTC solves the gyroaveraged Vlasov-Poisson system of equations (gyrokinetic equations [4]) using the particle-in-cell (PIC) approach. This method makes use of particles to sample the distribution function of the plasma system under study. The particles interact with each other only through a self-consistent field described on a grid such that no binary forces need to be calculated. This saves a great deal of computation, since it scales as $N$ instead of $N^2$, where $N$ is the number of particles. Also, the equations of motion to be solved for the particles are simple ordinary differential equations and can be easily solved with a Runge-Kutta algorithm or similar method. The main tasks of the PIC method at each time step are as follows: The charge of each particle is distributed among its nearest grid points according to the current position of that particle; this is called the scatter operation. The Poisson equation is then solved on the grid to obtain the electrostatic potential at each point. The force acting on each particle is then calculated from the potential at the nearest grid points; this is the “gather” operation. Next, the particles are “moved” by using the equations of motion. These steps are repeated until the end of the simulation.
GTC has been highly optimized for cache-based super-scalar machines such as the IBM SP. The data structure and loop ordering have been arranged for maximum cache reuse, which is the most important method of achieving higher performance on this type of processor. In GTC, the main bottleneck is the charge deposition, or scatter operation, mentioned above, and this is also true for most particle codes. The classic scatter algorithm consists of a loop over the particles, finding the nearest grid points surrounding each particle position. A fraction of the particle’s charge is assigned to the grid points proportionally to their distance from the particle’s position. The charge fractions are accumulated in a grid array. The scatter algorithm in GTC is more complex since one is dealing with fast gyrating particles for which motion is described by charged rings being tracked by their guiding center [5]. This results in a larger number of operations, since several points are picked on the rings and each of them has its own neighboring grid points.

Initial porting of the GTC code to the SX-6 was straightforward. Without any modifications to the code, the initial single processor test run was only 19% faster than on the Power3 processor, which has a peak of 1.5 Gflops compared to the 8 Gflops SX-6 processor. The compiler on the SX-6 includes very useful analysis tools, allowing a quick identification of the code’s bottlenecks. Also, a source listing clearly indicates the loops that have not been vectorized and the reasons why. With these tools in hand, the scatter operation in the charge depositing routine was quickly identified as the most time consuming part of the code, and was not vectorized because of memory dependencies.

The challenge with the scatter operation in all PIC codes, and on all types of processor, is the continuous non-sequential writes to memory. The particle array is being accessed sequentially with a fixed stride, but each of its elements, representing the position of a particle, corresponds to a random location in the simulation volume. This signifies that the grid array accumulating the charges gets written to in a random fashion, resulting in a poor cache reuse on a cache-based super-scalar processor. This problem gets amplified on a vector processor, since many particles will end up depositing some charge on the same grid points, thus giving rise to a classic memory dependence that prevents vectorization. Fundamentally, the requirement for vectorization is that all the individual operations making up a single vector operation must be independent from each other. The required number of independent individual operations is equal to the number of elements in the vector registers, or “vector length”. The simplest method to avoid the intrinsic memory conflicts of the scatter operation, and achieve vectorization, is to have each element in the vector register write to its own temporary copy of the charge accumulating array. One needs as many copies as the number of elements in the vector register, which is 256 for the SX-6. When the loop is completed, the information in the 256 temporary arrays must be merged into the real charge array. The increase in memory generated by this method is of at least VLEN*Ng, where VLEN is the processor’s vector length and Ng is the number of grid points in the domain. This can be a very large number considering that GTC uses tens of million of grid points for an average simulation. However, the
<table>
<thead>
<tr>
<th>Processor</th>
<th>Max speed (Gflops)</th>
<th>GTC test (Mflops)</th>
<th>Efficiency (real/max)</th>
<th>Relative speed (user time)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Power3</td>
<td>1.5</td>
<td>173.6</td>
<td>12%</td>
<td>1</td>
</tr>
<tr>
<td>Power4</td>
<td>5.2</td>
<td>304.5</td>
<td>6%</td>
<td>1.9</td>
</tr>
<tr>
<td>SX6</td>
<td>8.0</td>
<td>715.7</td>
<td>9%</td>
<td>5.2</td>
</tr>
</tbody>
</table>

Table 1: Single processor performance of GTC test run on IBM SP Power3 and Power4, and on SX-6 vector processor.

increased speed gained by the vectorization of the scatter operation compensates for the use of the extra memory. This algorithm was included in the GTC code, leading to the vectorization of the two most compute-intensive loops. Further modifications, mainly adding compiler directives, helped achieve the vectorization of the second most compute-intensive routine in the code.

Table 1 summarizes the results of the single processor test that compares the overall performance on the Power3, Power4, and SX-6 processors. One notes from the results that the Power3 processor gives the highest efficiency at 12% of the maximum theoretical speed. The fact that the Power4 runs GTC only at half the efficiency of the Power3 agrees with the extensive benchmarks done by the CSM group at the Oak Ridge National Laboratory [6], and which showed that the memory bandwidth of the Power 4 was not sufficient to keep up with its high-clocked CPU. With the code modifications made so far, the SX-6 runs the test case at 715.7 Mflops with 96.7% of vector operation ratio and an average vector length of 180.3 (the ideal being 256). Although this is only 9% of the maximum 8 Gflops that the vector processor can deliver, the SX-6 already runs 5.2 times faster than the Power3 processor and 2.7 times faster than the Power4.

More can be done to further improve the efficiency of the PIC method on the vector processor. The modifications that have been performed on GTC, so far, are just the beginning. A deeper analysis of the charge accumulating loop shows that even though the vector operation ratio is 99.89% with a perfect vector length of 256, the loop runs only at 7% of peak speed. However, the analysis also shows that a large number of scalar operations are performed in that same loop, most likely arising from the indirect array indexing characteristic of the random writes done in this algorithm. By simplifying the array indexing and sorting the particles according to their position, one might expect to achieve a much better performance on the SX-6. The goal is to reduce, as much as possible, the number of scalar operations, while maximizing the vector operations.

Thus far, the work/reward ratio of porting the GTC code to the SX-6 computer is good. We showed that a few small changes plus a more involved but straightforward method involving tempo-
rary arrays already increases the performance by a factor of 5. However, more code modifications need to be made in order to achieve the high efficiency obtained by other codes on the SX-6 computer. One also needs to study the parallel scaling of the code, which is a very important aspect since GTC usually runs on up to 1,024 processors.

This work was supported by US DOE Contract no. DE-AC02-76-CH03073 and in part by the DOE SciDAC Plasma Microturbulence Project.

References


External Distribution

Plasma Research Laboratory, Australian National University, Australia
Professor I.R. Jones, Flinders University, Australia
Professor João Canalle, Instituto de Fisica DEQ/IF - UERJ, Brazil
Mr. Gerson O. Ludwig, Instituto Nacional de Pesquisas, Brazil
Dr. P.H. Sakanaka, Instituto Fisica, Brazil
The Librarian, Culham Laboratory, England
Mrs. S.A. Hutchinson, JET Library, England
Professor M.N. Bussac, Ecole Polytechnique, France
Librarian, Max-Planck-Institut für Plasmaphysik, Germany
Jolan Moldvai, Reports Library, Hungarian Academy of Sciences, Central Research Institute for Physics, Hungary
Dr. P. Kaw, Institute for Plasma Research, India
Ms. P.J. Pathak, Librarian, Institute for Plasma Research, India
Ms. Clelia De Palo, Associazione EURATOM-ENEA, Italy
Dr. G. Grosso, Instituto di Fisica del Plasma, Italy
Librarian, Naka Fusion Research Establishment, JAERI, Japan
Library, Laboratory for Complex Energy Processes, Institute for Advanced Study, Kyoto University, Japan
Research Information Center, National Institute for Fusion Science, Japan
Dr. O. Mitarai, Kyushu Tokai University, Japan
Dr. Jiangang Li, Institute of Plasma Physics, Chinese Academy of Sciences, People’s Republic of China
Professor Yuping Huo, School of Physical Science and Technology, People’s Republic of China
Library, Academia Sinica, Institute of Plasma Physics, People’s Republic of China
Librarian, Institute of Physics, Chinese Academy of Sciences, People’s Republic of China
Dr. S. Mirnov, TRINITI, Troitsk, Russian Federation, Russia
Dr. V.S. Strelkov, Kurchatov Institute, Russian Federation, Russia
Professor Peter Lukac, Katedra Fyziky Plazmy MFF UK, Mlynska dolina F-2, Komenskeho Univerzita, SK-842 15 Bratislava, Slovakia
Dr. G.S. Lee, Korea Basic Science Institute, South Korea
Institute for Plasma Research, University of Maryland, USA
Librarian, Fusion Energy Division, Oak Ridge National Laboratory, USA
Librarian, Institute of Fusion Studies, University of Texas, USA
Librarian, Magnetic Fusion Program, Lawrence Livermore National Laboratory, USA
Library, General Atomics, USA
Plasma Physics Group, Fusion Energy Research Program, University of California at San Diego, USA
Plasma Physics Library, Columbia University, USA
Alkesh Punjabi, Center for Fusion Research and Training, Hampton University, USA
Dr. W.M. Stacey, Fusion Research Center, Georgia Institute of Technology, USA
Dr. John Willis, U.S. Department of Energy, Office of Fusion Energy Sciences, USA
Mr. Paul H. Wright, Indianapolis, Indiana, USA
The Princeton Plasma Physics Laboratory is operated by Princeton University under contract with the U.S. Department of Energy.

Information Services
Princeton Plasma Physics Laboratory
P.O. Box 451
Princeton, NJ 08543

Phone: 609-243-2750
Fax: 609-243-2751
e-mail: pppl_info@pppl.gov
Internet Address: http://www.pppl.gov