Research Group of Prof. Dr. M. Griebel
Institute for Numerical Simulation

Title  Parnass: A Cluster of Workstations 
Participant  Gerhard Zumbusch 
Key words  Parallel Computing, Message Passing Interface (MPI), Parallel Virtual Machine (PVM), Distributed Memory, Myrinet, Gigabit/s High Speed Networks, Local Area/System Area Networks (LAN/SAN), Zero-Copy Protocol 
Description  Parnass is a parallel computer built of single- and quadruple-R10k-processor Silicon Graphics workstations and servers, connected by a high-performance Gbit/s network and additionally by switched fast Ethernet. 

All workstations have been given the names of poets and philosophers by our system administrator. The resulting meta-computer is named after the mythological poets' place of assembly, the mountain Parnass. The similarity to the names of some parallel computing projects and companies is accidental. 

Parnass consists of a total of 49 (37) processors; some of them are single processors and some are shared-memory dual processors, clustered into four-processor shared-memory machines. The single-processor O2 workstations are equipped with 64-256 Mbytes of main memory, the dual-processor Origin 200 servers with 192 Mbytes. Each processor has 0.5-1 Mbyte of second-level cache and 32+32 kbytes of first-level instruction and data cache. 


The nodes are connected by a high-performance Myrinet network at a speed of 1.28 Gbit/s duplex. The network adapters are connected by a two-plane fat-tree structure of full-crossbar 8-port Myricom switches. Additionally, the nodes are connected by fast Ethernet. The peak performance of Parnass is 22.4 GFlop/s. Parnass has 4.2 Gbytes of main memory and about 150 Gbytes of hard disk storage. Parnass is used for research on and education in parallel and distributed computing, in projects on fluid dynamics (Navier-Stokes, turbulence simulation), astrophysics (MHD), geology (surface flow), molecular dynamics (multipole methods) and parallel algorithms (multilevel methods). 

Network

The computing nodes are connected by three separate networks: A fast-Ethernet network (nominal 100 Mbit/s) with TCP/IP connects all nodes and serves as an uplink to the internet. It is used for standard services like ftp, rsh, NFS, NIS and the network graphics protocols X11 and DGL/OpenGL. Some of the nodes are connected by a common bus (nominal 10240 Mbit/s, two-processor systems) or a CrayLink (nominal 5760 Mbit/s, four-processor systems) for shared-memory communication. The third, high-performance network (nominal 1280 Mbit/s) is in preparation and will mainly be used for message passing (MPI, PVM) in parallel computing between all nodes not connected via shared memory. Here both low-latency and high-bandwidth communication are essential. In order to maintain these properties in larger networks as well, the network topology plays an important role. 

We chose a fat-tree topology for maximum bandwidth. Depicted is a two-stage tree built of 8-port switches. Each Myricom switch is a full crossbar (duplex) connecting all ports, which means a nominal switching capacity of 10.24 Gbit/s per switch. We show a configuration with 12 8-port switches, which connects up to 32 nodes with a backplane of 40.96 Gbit/s total capacity. Each message has to pass either one switch (neighbouring node) or exactly three switches on its way from one node to another. This means optimal scalability of the network and allows collisionless all-to-all routing; the counting sketch below makes the quoted figures explicit. 
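
The quoted figures follow directly from the switch radix. As an illustration (this is a sketch, not code used on Parnass), a small C program making the counting explicit, with its parameters set to match the configuration above:

/* Counting sketch for a two-stage fat tree of k-port crossbar switches.
 * Each leaf switch uses k/2 ports for hosts and k/2 uplinks to the spine,
 * and each spine switch connects one port to every leaf switch. */
#include <stdio.h>

int main(void)
{
    const int    k         = 8;      /* ports per switch          */
    const double link_gbps = 1.28;   /* duplex link rate, Gbit/s  */

    int leaves  = k;                  /* 8 leaf switches           */
    int spines  = k / 2;              /* 4 spine switches          */
    int hosts   = leaves * (k / 2);   /* 32 nodes                  */
    int uplinks = leaves * (k / 2);   /* 32 leaf-to-spine links    */

    printf("switches: %d, hosts: %d\n", leaves + spines, hosts);
    printf("backplane: %.2f Gbit/s\n", uplinks * link_gbps);
    printf("hops: 1 switch (same leaf), 3 switches (via spine)\n");
    return 0;
}

With k = 8 this yields 12 switches, 32 nodes and a 40.96 Gbit/s backplane, matching the configuration described above.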

The fast-Ethernet network is used for ordinary TCP/IP traffic and for the connection to other computers and terminals of the lab. The older FDDI ring connects some file servers, while an ATM uplink connects the installation to the university network backbone, which is built of ATM OC-3 components. Each fast-Ethernet link runs at a nominal rate of 100 Mbit/s. A non-blocking crossbar Ethernet switch (in preparation) therefore has to have a backplane of at least 3.2 Gbit/s (32 ports at 100 Mbit/s each).   

Myrinet Software

We are interested in using the Myrinet on Parnass. In order to do message passing over a Myrinet on MIPS-based machines, we have to port an operating system kernel driver, a control program for the NIC, and a message passing application interface for the user program. Such components are available in source code for Linux PCs and for Suns, some even for PowerPCs and for Windows NT on Intel and DEC Alpha based machines. However, no such drivers had been written for Silicon Graphics workstations so far.

The data flow is depicted in the next figure. The memory copy operations performed by the host CPU in the original driver software are substituted by faster DMA operations performed by the NIC; we implement a zero-copy protocol. As soon as the destination address is known, the DMA transfer can be initiated. On single-processor O2 machines this requires a cache-invalidate call for the destination region (and a cache write-back on the sender). The destination address usually does not point into physically contiguous memory: the memory is contiguous only from the user's point of view, an illusion created by the processor's memory management unit (MMU), while from the point of view of an I/O device the block looks fragmented. Hence the block has to be split into memory pages, whose addresses have to be translated. This gives a list of addresses and lengths, which can be programmed into the DMA unit of the NIC. Fortunately, we do not need any interaction with the NIC processor; register access is sufficient. Before we are allowed to initiate the DMA, we have to lock the pages in memory to avoid paging. This also means that we have to wait for the DMA to terminate before we can unlock the pages and return to the user context, so the DMA operation becomes blocking.   
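
The resulting blocking send path can be summarized in the following sketch. All function names (lock_pages, virt_to_phys_list, dcache_writeback, nic_dma_enqueue, ...) are hypothetical placeholders for the corresponding IRIX kernel services and Myrinet NIC register accesses, not actual API calls; the sketch only illustrates the order of operations described above.

/* Sketch of the blocking zero-copy send path described in the text.
 * All functions declared below are hypothetical placeholders. */
#include <stddef.h>

#define MAX_SEGS 64                      /* illustrative limit */

typedef struct { unsigned long phys; size_t len; } seg_t;

/* hypothetical placeholders, not real IRIX or Myrinet API names */
extern int  lock_pages(void *buf, size_t len);
extern void unlock_pages(void *buf, size_t len);
extern int  virt_to_phys_list(void *buf, size_t len, seg_t *segs, int max);
extern void dcache_writeback(void *buf, size_t len);
extern void nic_dma_enqueue(unsigned long phys, size_t len);
extern void nic_dma_start(void);
extern void nic_dma_wait(void);

int zero_copy_send(void *user_buf, size_t len)
{
    seg_t segs[MAX_SEGS];
    int nsegs, i;

    /* 1. Pin the user pages so they cannot be paged out during the DMA. */
    if (lock_pages(user_buf, len) != 0)
        return -1;

    /* 2. The buffer is contiguous only in the user's virtual address
     *    space; split it at page boundaries and translate each page,
     *    giving a scatter/gather list of physical addresses and lengths. */
    nsegs = virt_to_phys_list(user_buf, len, segs, MAX_SEGS);

    /* 3. On the sender, write cached data back to main memory so the
     *    NIC's DMA engine sees the current contents (the receiver
     *    invalidates its cache for the destination region instead). */
    dcache_writeback(user_buf, len);

    /* 4. Program the NIC's DMA unit with the address/length list via
     *    register access; no interaction with the NIC processor is needed. */
    for (i = 0; i < nsegs; i++)
        nic_dma_enqueue(segs[i].phys, segs[i].len);
    nic_dma_start();

    /* 5. Wait for DMA completion before unpinning the pages and
     *    returning to user context -- the operation is blocking. */
    nic_dma_wait();
    unlock_pages(user_buf, len);
    return 0;
}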

Comparison

We compare the performance of all MPI implementations available on Parnass, based on the portable MPICH implementation. We use a ping-pong test to measure latencies and bandwidths between single-processor workstations. The shared-memory tests were run on a quadruple-processor Origin 200. 
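
A minimal ping-pong kernel of the kind used for such measurements might look as follows; this is a sketch rather than the exact benchmark code run on Parnass, and the message size and repetition count are merely illustrative.

/* Minimal MPI ping-pong sketch: rank 0 sends a buffer to rank 1 and waits
 * for the echo; half the round-trip time gives the one-way time, and the
 * message size divided by that time gives the bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, i, reps = 1000, nbytes = 1 << 16;   /* illustrative values */
    char *buf;
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = (char *) malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0) {
        double one_way = (t1 - t0) / (2.0 * reps);            /* seconds */
        printf("one-way time: %g us, bandwidth: %g Mbit/s\n",
               one_way * 1e6, 8.0 * nbytes / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}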

pink line: MPI using the shared memory device on a multiprocessor machine. 
light blue line: MPICH using BullDog on Myrinet, modified version using DMA transfers for packets longer than 512 bytes. 
red line: MPICH using BullDog on Myrinet, original version using PIO memcpy. 
green line: MPICH using p4 on fast Ethernet (100 Mbit/s). 
blue line: MPICH using p4 on switched Ethernet (10 Mbit/s).   

The ping-pong one-way latencies vary from 16 µs for shared memory and 70 µs for BullDog on Myrinet to 0.63 ms for fast Ethernet and 1.15 ms for Ethernet. The peak bandwidths are 7 Mbit/s for Ethernet, 40-47 Mbit/s for fast Ethernet, up to 220 Mbit/s for Myrinet and 1 Gbit/s for shared memory. The high latencies of Ethernet are in part due to the TCP/IP overhead involved. Fast Ethernet starts paying off compared to Ethernet for packets longer than 1.5 kbytes: while it is only a factor of 2 faster for small messages, its peak bandwidth is 6-7 times higher. 
We compare two versions of BullDog for Myrinet: the DMA version is used for packets longer than 512 bytes, while the `memcopy' version achieves only 10% of the DMA version's bandwidth. Both numbers have to be compared to Myrinet's theoretical peak bandwidth of 1.28 Gbit/s. 

The latencies for the shared-memory device mainly reflect OS and MPI administration overhead and the time for system calls. The peak bandwidth is limited by some `memcopy' operations through shared memory. The bandwidth even decreases, down to 610 Mbit/s, for packets larger than the second-level cache. The theoretical peak bandwidth of the shared-memory connection between two dual-processor boards (CrayLink) is 5.76 Gbit/s, and the on-board connection is even faster. 
To summarize, one has to admit that message passing on a shared-memory system achieves only a fraction of the high theoretical memory performance. It is still faster than shared-memory implementations by other computer vendors. Generally, the faster a network is, the higher are the losses due to software overhead and management. This rule of thumb also holds, in part, for other high-speed networks, like Myrinet on SGI and fast Ethernet. However, the difference between the two fastest MPI versions, shared memory and BullDog, shrinks to a factor of 2.5 for long packets. 

A gap between the performance of the shared-memory communication and the performance of the Myrinet communication can be observed, the former available only between a few processors and the latter between all processors. However, one has to keep in mind that the Myrinet network performance scales better than shared-memory links. This is also true for higher numbers of processors if we compare the Origin 2000 network fabric with Myrinet. Furthermore, workstations equipped with Myrinet possess a much more attractive price/performance ratio than large shared-memory machines. 
 

Bibliography 
  • M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra. MPI: The Complete Reference. MIT Press, 1996. 
  • W. Gropp, E. Lusk, N. Doss, A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Technical Report MCS-P567-0296, MCS Division, Argonne National Laboratory, 1996. 
  • G. Henley, N. Doss, T. McMahon, A. Skjellum. BDM: A multiprotocol program and host application programmer interface. Technical Report MSSU-EIRS-ERC-97-3, Mississippi State University, Dept. Computer Science, 1997. 
  • M. Griebel, G. Zumbusch. Parnass: Porting gigabit-LAN components to a workstation cluster. In W. Rehm, editor, Proceedings of the 1st Workshop Cluster-Computing, Chemnitz, November 6-7, 1997, Chemnitzer Informatik Berichte CSR-97-05, pp. 101-124. 
 
Related projects 
  • Parallel multilevel particle methods combined with space-filling curves 
  • Adaptive Parallel Multigrid Methods with Space-Filling Curves and Hash Storage
  • Further development of the flow package NAST3D 
  • The combination technique for eigenvalue problems for the Schrödinger equation 
 
In cooperation with 
  • SFB 256 "Nonlinear Partial Differential Equations" 
  • Myricom, Inc. 