Introduction

I took CSE 260 Parallel Computing with Bryan Chin in my first quarter at UCSD. Advertised as a practical alternative to CSE 240B Parallel Computer Architecture, it looked like a good complement to CSE 240A Computer Architecture, and it did not disappoint! The course was fully hands-on, built around three projects on the practical implementation of parallel computing. The content ranged from vector processing on CPUs and graphics accelerators for arithmetic, to using multiple independent systems that process large amounts of data via message passing, along with the theoretical concepts behind them such as Amdahl's law for speedup and Little's Law for pipelines.
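As a quick refresher (my wording, not the course's): Amdahl's law bounds the speedup from parallelizing a fraction p of a program across N processors, and Little's Law relates the concurrency a pipeline must sustain to its throughput and latency.

$$S(N) = \frac{1}{(1 - p) + p/N}, \qquad L = \lambda W$$

Here L is the average number of in-flight operations, \lambda the throughput, and W the latency per operation.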

CPU Vector Extensions

Advanced Vector Extensions (AVX) from Intel and the Scalable Vector Extension (SVE) from Arm are additional instructions in the CPU's Instruction Set Architecture (ISA) for processing data as vectors. They help with many computationally intensive tasks, but using them well requires careful consideration of the memory hierarchy to keep the compute units fed with data: not just the latency of data transfers, but also the available bandwidth determines how much speedup can be achieved.

GotoBLAS structure (courtesy of blislab)

In the programming assignment, we started with a modified version of BLISLab, a framework for optimizing BLAS, specifically GEneral Matrix Multiply (GEMM) using the GotoBLAS algorithm. We focused on double-precision floating-point matrix multiplication on the Arm Neoverse V1 microarchitecture found in the AWS Graviton series of processors, on a c7g.medium EC2 instance. This was a very good exercise for understanding CPU and memory architecture in depth, since we had to eke out every bit of performance to hit the extended target. That meant cache optimizations: deciding where pieces of the matrices live in the memory hierarchy, and packing and padding the matrices so that the access patterns do not cause cache misses, as shown in the image above. The multiplication kernel itself was rewritten in C with SVE intrinsics to achieve more than 20 GFLOPS and complete the assignment.
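To give a flavour of what SVE code looks like, here is a minimal sketch (my own toy routine, not the BLISLab micro-kernel): a single predicated column update of the kind that a GEMM inner kernel is built from, assuming the arm_sve.h ACLE intrinsics and a column stored contiguously.

```c
// Minimal SVE sketch (illustration only, not the assignment kernel):
// c[0..m) += a[0..m) * b, where m need not be a multiple of the
// hardware-dependent vector length. Build with e.g. -O3 -march=armv8.2-a+sve.
#include <arm_sve.h>

static void axpy_col_sve(int64_t m, const double *a, double b, double *c)
{
    for (int64_t i = 0; i < m; i += svcntd()) {
        svbool_t pg = svwhilelt_b64(i, m);        // predicate covers the tail
        svfloat64_t va = svld1_f64(pg, &a[i]);    // load a slice of the A column
        svfloat64_t vc = svld1_f64(pg, &c[i]);    // load the matching C slice
        vc = svmla_n_f64_x(pg, vc, va, b);        // vc += va * b (fused multiply-add)
        svst1_f64(pg, &c[i], vc);                 // store the result back
    }
}
```

The real micro-kernel keeps a whole tile of C in registers and loops over the shared dimension, but the load/predicate/FMA/store pattern is the same.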

GPU CUDA Programming

Graphics Processing Units, commonly known as GPUs or video cards, provide hardware acceleration for video processing and graphics workloads. Since graphics work relies heavily on matrix operations (to the computer, an image is a matrix of pixel values), the same hardware can be used to speed up general arithmetic as well. Because GPUs are specialized hardware, they do not have the binary compatibility of the more general-purpose CPUs. NVIDIA's Compute Unified Device Architecture (CUDA) is an extension of C/C++ and a device-independent layer for writing programs that target a variety of NVIDIA GPUs. There are many differences between a CPU and a GPU to keep in mind while programming. The GPU has a very simple scheduler, but it can schedule a huge number of threads at the same time, while CPUs are limited to a handful. The memory hierarchy is more explicit: the programmer can place different data in different levels of on-chip memory, and a large number of registers are available too. This means dividing the work across all these cores becomes the job of the programmer and the compiler, to make sure utilization stays high.
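A toy CUDA kernel (not from the assignment) shows how that division of work looks in code: each thread derives its own index from the block and thread IDs, and the launch configuration decides how many threads cover the data.

```cpp
// Toy CUDA example: each lightweight GPU thread handles one element,
// with the grid/block split chosen by the programmer at launch time.
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partial last block
        y[i] = a * x[i] + y[i];
}

// Launch: enough blocks of 256 threads to cover n elements.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```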

GPU GEMM structure (courtesy of cutlass)

The programming assignment for this involved single-precision GEMM, slightly different from before and now in CUDA/C++. We worked with an NVIDIA Turing architecture T4 GPU provided by the g4dn.xlarge EC2 instances on AWS. Optimizations such as blocking, tiling and loop unrolling were made while making maximum use of the available on-chip memory to reach the necessary performance. Since the threads in a block share the same on-chip memory, they can cooperatively load data into shared memory, synchronize, and then proceed to the computation. These, combined with inspiration from CUTLASS, NVIDIA's templates for CUDA GEMM implementations, helped us achieve the required performance of over 4 TFLOPS!
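A stripped-down sketch of that cooperative-load-then-compute pattern is below. It is nowhere near the tuned assignment kernel (no unrolling, no register blocking, and the TILE size is an arbitrary choice of mine), but it shows the shared-memory staging and synchronization.

```cpp
// Minimal shared-memory tiling sketch for C = A * B (row-major, N x N):
// each thread block stages TILE x TILE tiles of A and B into shared memory,
// synchronizes, then accumulates partial products.
#define TILE 16

__global__ void sgemm_tiled(int N, const float *A, const float *B, float *C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // Cooperative loads: each thread brings in one element of each tile.
        As[threadIdx.y][threadIdx.x] = (row < N && t + threadIdx.x < N)
                                         ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (col < N && t + threadIdx.y < N)
                                         ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();                    // wait until both tiles are staged

        for (int k = 0; k < TILE; ++k)      // multiply the staged tiles
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                    // don't overwrite tiles still in use
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}
```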

Message Passing Interface

Message Passing Interface (MPI) is another way to approach parallel computing: instead of tightly coupled processors, loosely coupled independent systems share data by communicating (hence the name message passing) rather than through shared memory. This is a totally different approach to programming as well, since communication adds overhead and the programmer needs to minimize it. MPI offers many ways to send and receive messages, including synchronous and asynchronous variants and RDMA (Remote Direct Memory Access). Stencil methods are commonly implemented with MPI using ghost cell exchanges, where nearby data that resides on another processor is shared through a message.
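The smallest possible illustration of the send/receive idea (just an example, not course code): rank 0 sends a value to rank 1, which posts a non-blocking receive so it could overlap other work with the communication.

```c
// Tiny MPI example: blocking send on one rank, non-blocking receive on the other.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double value = 3.14;
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Request req;
        MPI_Irecv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        /* ...do other useful work while the message is in flight... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }
    MPI_Finalize();
    return 0;
}
```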

Stencil Mesh and Ghost Cells structure (courtesy of Simula)

In the programming assignment, we were tasked with implementing the two-variable Aliev-Panfilov model of cardiac excitation using MPI on the San Diego Supercomputer Center's Expanse cluster, which happens to be at UCSD :). With over 90,000 CPU cores under its wing, it is a hefty beast. The starter code already contained the implementation of the ordinary and partial differential equations the model needs, running on a single core. Our task was to get it running on more than one core using MPI and perform weak and strong scaling experiments on the implementation. The extra credit involved using AVX vectorization for the kernel and gathering the data to plot the progress of the simulation while it ran in parallel. MPI Tutorial and RookieHPC had a very good collection of resources explaining how the MPI functions work. By partitioning the data to split the work across multiple nodes and implementing ghost cell exchanges in a generic fashion to share the data at the edges with neighboring processors, the solution can scale to any number of processors. Dedicating a separate core to gathering and plotting data speeds up the whole run, since the compute cores no longer wait on one overloaded core that also has to plot. Running our program on 4,096 processor cores (32 nodes with 128 cores per node) and getting 6 TFLOPS was truly a fulfilling experience.
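For a feel of what a ghost cell exchange looks like, here is a simplified sketch for a 1-D row decomposition (the assignment is more general, and the names here are hypothetical): each rank posts non-blocking receives into its ghost rows and sends its boundary rows to its neighbours before running the stencil update.

```c
// Simplified ghost-row exchange for a 1-D row decomposition.
// u is a local (rows+2) x cols block: row 0 and row rows+1 are ghost rows.
// up/down are neighbour ranks, or MPI_PROC_NULL at the physical boundary.
#include <mpi.h>

void exchange_ghost_rows(double *u, int rows, int cols, int up, int down)
{
    MPI_Request req[4];

    // Receive into the ghost rows; send our first and last interior rows.
    MPI_Irecv(&u[0],                 cols, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[(rows + 1) * cols], cols, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1 * cols],          cols, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[rows * cols],       cols, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    // The interior stencil update can now safely read the ghost rows.
}
```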

Conclusion

This was a very fun course with a very good set of practical programming assignments that make you understand each of the concepts involved in great detail. Where else would you see the escalation from programming a CPU to a GPU and then to a supercomputer?! If nothing else, just getting access to and playing around with a supercomputer makes this course worth it. If you are looking to get some hands-on experience with parallel computing and even computer architecture, CSE 260 by Bryan is a well-rounded course to do just that. 'Twas a good time brushing up my C/C++ programming skills as well!