Saturday, February 4, 2017

Accelerating Conjugate Gradient Solver: Temporal Versus Spatial Data

Abstract: Simulating an object in a wind tunnel is a long-running process and therefore an ideal candidate for parallelization.
The complexity of such simulations still strains today's computers: as the number of processes grows, computation time falls, but communication time rises. Memory can also become a problem.
Existing solutions designate one process as the master, which communicates with all other processes. This causes both time-consuming communication, since the other processes must wait for the master, and a memory problem, since one process holds all the data at one moment unless a special technique is applied.
INTRODUCTION

Because making a physical car model and measuring its air resistance in a wind tunnel is expensive in both time and money, the idea of replacing it with a simulation has become reality. It is expected that the first BMW designed without building a single physical model will appear soon. To evaluate air resistance, one needs to discretize a volume, set up the PDEs, and solve them. Solving PDEs with a huge number of unknowns is not possible without a mathematical algorithm, except in some special cases. Discretization yields a system of n linear equations in n unknowns, most conveniently represented with matrices; in this case the matrix is sparse. The Conjugate Gradient (CG) method solves such a system iteratively: each step is supposed to bring us closer to the exact solution.
This paper deals with optimizing the Conjugate Gradient algorithm for parallel execution on a sparse matrix that results from discretizing a volume and setting up the corresponding PDEs. The code was tested on several multiprocessor computer architectures suitable for running MPI programs.

PROBLEM STATEMENT

In simulations where the volume is divided into small cells, it is obvious that dividing it into smaller pieces leads to more precise results. Of course, that also means longer execution time. It is therefore natural to try to run the simulation code in parallel.
Since the majority of the computation in the CG algorithm is the matrix-vector multiplication, the execution time is easy to estimate, and the multiplication can easily be divided among many processors.
However, when the result must be spread to all of the processors, it is sometimes faster to run the whole simulation on a single-processor computer than to run it on many processors and then deal with the communication. Even if the communication lines are very fast, for every message one processor has to form a message header and body and send it, the receiving side has to do the inverse operations, and direct communication between arbitrary processes is not easy to establish and usually not useful.
For example, with the matrix-vector multiplication divided among many processors, each of them needs to send its result to all of the others. When modeling a volume, the resulting matrix is sparse, which will be of great interest in the calculations that follow.

PROPOSED SOLUTION AND WHY IT IS EXPECTED TO BE BETTER THAN OTHERS

To make this paper more interesting and understandable, the main idea will not be presented by walking through each part of the algorithm/code. Instead, each idea that guided the author to the final solution is explained in turn: its main characteristics, a figure demonstrating what is achieved by implementing it, and the problem that the next idea solves. But first, the serial implementation is discussed.
The most important thing to notice about running the simulation on many processors is that there are approximately 200 iterations per time step, and the matrix-vector multiplication is the most processor-demanding operation in each iteration. Besides it, every iteration computes scalar (dot) products of vectors and multiplications of a vector by a constant. The most promising step is therefore to split the matrix-vector multiplication across many processors. The other operations are not split at this stage, since sending and receiving their results would take more time than computing them on a single processor. A serial sketch of one iteration is given below.
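The helper names in the following sketch (matvec, dot, axpy) are illustrative, not taken from the original code; the sketch only makes visible that the matrix-vector product is the dominant operation, while everything else is a cheap O(n) vector update.

/* One iteration of the Conjugate Gradient method (serial sketch).
   x: current solution, r: residual, d: search direction,
   q: scratch vector, rr_old: r.r from the previous iteration. */

#include <stddef.h>

void   matvec(size_t n, const double *d, double *q);             /* q = A*d (dominant cost) */
double dot(size_t n, const double *a, const double *b);
void   axpy(size_t n, double alpha, const double *a, double *b); /* b += alpha*a */

void cg_iteration(size_t n, double *x, double *r, double *d,
                  double *q, double *rr_old)
{
    matvec(n, d, q);                        /* the expensive step   */
    double alpha = *rr_old / dot(n, d, q);  /* step length          */
    axpy(n,  alpha, d, x);                  /* x = x + alpha*d      */
    axpy(n, -alpha, q, r);                  /* r = r - alpha*q      */
    double rr_new = dot(n, r, r);
    double beta = rr_new / *rr_old;
    for (size_t i = 0; i < n; i++)          /* d = r + beta*d       */
        d[i] = r[i] + beta * d[i];
    *rr_old = rr_new;
}

With roughly 200 such iterations per time step, whatever is saved in the matrix-vector product is saved about 200 times over.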

Dividing the calculation among many processes

Because of the nature of the problem, the matrix is divided into non-zero blocks of the same size. The number of row blocks is divided by the number of available processors, so each processor is responsible for multiplying approximately the same number of row blocks with the appropriate part of the vector. Figure 1 depicts the non-zero blocks as black circles, while the remaining blocks are zero blocks. The acceleration obtained by this basic principle is obvious. Still, there is a problem to be solved: after each matrix-vector multiplication, the result must be collected and then delivered to all processes, which requires sending and receiving a huge amount of data.
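A minimal sketch of this distribution logic, with illustrative names: each process derives from its rank the contiguous range of row blocks it is responsible for, and when the division is not even, the lowest-ranked processes take one extra block each.

/* Sketch (names illustrative, not from the original code): evenly
   distribute num_blocks row blocks over p processes; ranks below the
   remainder take one extra block. */
void my_row_blocks(int rank, int p, int num_blocks, int *first, int *count)
{
    int base = num_blocks / p;
    int rem  = num_blocks % p;
    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}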
Compared to the previous case it is perhaps less obvious, but the communication necessary for the matrix-vector multiplication is reduced almost n/2 times, where n can be as large as 10000. However, a problem remains, somewhat hidden but easy to notice: the whole communication goes through the root process, which exchanges data with every other process. On a computer architecture made of equal nodes, this means every process has to wait until all of the data has been received by the root process and sent back out to all other processes. If each process runs on a single processor, all processors have to wait for this communication to finish.
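In code, that root-centric pattern might look like the following sketch (buffer and count names are illustrative): every partial result is funneled through rank 0 and then broadcast back, so all other ranks idle while the root communicates.

#include <mpi.h>

/* Sketch of the root-centric exchange after a local matvec (names
   illustrative): counts[i] and displs[i] describe rank i's share of
   the length-n result vector y. */
void exchange_via_root(const double *local_y, int local_n,
                       double *y, int n,
                       const int *counts, const int *displs, MPI_Comm comm)
{
    /* Step 1: all partial results converge on rank 0. */
    MPI_Gatherv(local_y, local_n, MPI_DOUBLE,
                y, counts, displs, MPI_DOUBLE, 0, comm);
    /* Step 2: rank 0 redistributes the full vector to everyone. */
    MPI_Bcast(y, n, MPI_DOUBLE, 0, comm);
}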

Making processes root independent

Now that we know which part of the vector each process needs for its calculation, we can determine who the "owner" of the requested data is. Even though all processes are equal under MPI, we can mark process 0 as the root and all others as slaves. Similarly, we can force the process of any rank to work with the corresponding row blocks, and therefore know the rank of the process working with any given part of the vector. This way, as shown in Figure 3, every process only needs to send n real numbers to its upper neighbor (the process whose rank is one less than its own) and n real numbers to its lower neighbor, and to receive the same amount of data from the same processes. Since many computer architectures support parallel communication between pairs of processes, this can be almost p times faster than the previous case, where all communication went through the root process.
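A sketch of that neighbor-only exchange (names illustrative; n boundary values per direction, as in the text): MPI_PROC_NULL turns the send/receive into a no-op at the ends of the process chain.

#include <mpi.h>

/* Sketch: x is this rank's local_n-long part of the vector; its first
   and last n entries are the boundaries the neighbors need, and
   halo_up/halo_down receive the matching values from them. */
void exchange_with_neighbors(const double *x, int local_n, int n,
                             double *halo_up, double *halo_down,
                             MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < p - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send top boundary up, receive the lower halo from below. */
    MPI_Sendrecv(x, n, MPI_DOUBLE, up, 0,
                 halo_down, n, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send bottom boundary down, receive the upper halo from above. */
    MPI_Sendrecv(x + local_n - n, n, MPI_DOUBLE, down, 1,
                 halo_up, n, MPI_DOUBLE, up, 1,
                 comm, MPI_STATUS_IGNORE);
}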
Last but not least, the whole communication can be done in parallel with the calculation, which means that for big problem sizes the communication time is close to zero. This is achieved in three stages: first, the send and receive operations are started; second, each row block that is independent of other processes is multiplied; third, the process checks whether the communication has finished. Only for small data sets would the processors have to wait for the communication to complete. A sketch of these stages follows.
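The sketch below uses nonblocking MPI calls; the helpers multiply_interior_blocks and multiply_boundary_blocks are hypothetical placeholders for the local computation, not functions from the original code.

#include <mpi.h>

void multiply_interior_blocks(void); /* hypothetical: rows needing only local data */
void multiply_boundary_blocks(void); /* hypothetical: rows needing the halo values */

/* Sketch: overlap the neighbor exchange with the local multiplication.
   up/down are neighbor ranks (MPI_PROC_NULL at the chain ends). */
void matvec_overlapped(double *send_up, double *send_down,
                       double *recv_up, double *recv_down,
                       int n, int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Stage 1: post the sends and receives without blocking. */
    MPI_Irecv(recv_up,   n, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(recv_down, n, MPI_DOUBLE, down, 1, comm, &req[1]);
    MPI_Isend(send_up,   n, MPI_DOUBLE, up,   1, comm, &req[2]);
    MPI_Isend(send_down, n, MPI_DOUBLE, down, 0, comm, &req[3]);

    /* Stage 2: multiply the row blocks independent of other processes. */
    multiply_interior_blocks();

    /* Stage 3: only now wait; for large problems the halos have long
       since arrived and the wait costs nothing. */
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    multiply_boundary_blocks();
}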

CONDITIONS AND ASSUMPTIONS OF THE RESEARCH TO FOLLOW

In this chapter, a brief introduction to the computer architecture suitable for the program is given. The main testing was done on the Mozart cluster of the SGS department at IPVS in Stuttgart, Germany. It consisted of 64 nodes, each containing two processors with 1 GB of RAM and, which mattered for code-development time, caches of considerable size. Still, any cluster on which MPI can be installed would do. Communication lines were also of interest, but the research is set up so that on slower networks one only needs a large enough problem size to see the advantage of the parallel version over the serial one. The special sparse matrix was produced by Ionel Muntean's code and was given as 9 vectors for the 2D case and 27 vectors for the 3D case; these vectors represent the non-zero data in the matrix. This is much more efficient than storing whole non-zero blocks in memory, because most of the elements in those blocks are zeros. A sketch of a matrix-vector product over such storage follows.
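The sketch assumes the 9 vectors hold the coefficients of a 9-point stencil on an nx-by-ny grid; the offsets and names are illustrative, not taken from Muntean's code, and the wrap-around between grid rows is ignored for brevity.

/* Sketch (illustrative): y = A*x with A stored as 9 vectors, one per
   stencil neighbor; a[k][i] is the coefficient linking unknown i to
   its k-th neighbor. Edge wrap-around between grid rows is ignored. */
void matvec_9vec(int nx, int ny, double *a[9], const double *x, double *y)
{
    int off[9] = { -nx - 1, -nx, -nx + 1,   /* one grid row up   */
                       -1,    0,       1,   /* same grid row     */
                    nx - 1,  nx,  nx + 1 }; /* one grid row down */
    int n = nx * ny;
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = 0; k < 9; k++) {
            int j = i + off[k];
            if (j >= 0 && j < n)            /* skip entries outside grid */
                s += a[k][i] * x[j];
        }
        y[i] = s;
    }
}

This stores only 9n coefficients instead of whole blocks, which is exactly why the format is memory-efficient.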

ANALYTICAL PERFORMANCE ANALYSIS

In this chapter, the analysis considers memory and time aspects, and for each of them a comparison between the serial and the parallel version is given. To make the following text more understandable, let n be the dimension of the matrix and the vectors, and p the number of processes in the parallel version of the program.
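As a rough sketch of the quantities such an analysis compares, under the storage and distribution scheme described above (the constants are illustrative, not the paper's final derivation; c counts the length-n vectors CG keeps, t_f is the time per floating-point operation, and T_halo the neighbor-exchange time):

% Illustrative sketch, not the paper's derivation.
M_{\text{serial}} \approx (9 + c)\,n, \qquad
M_{\text{parallel}} \approx \frac{(9 + c)\,n}{p} \text{ per process}

T_{\text{serial}} \approx 9\,n\,t_f, \qquad
T_{\text{parallel}} \approx \frac{9\,n\,t_f}{p} + T_{\text{halo}}

where T_halo is hidden behind computation for large n, as described earlier.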
