Another important concept introduced here is the task parallel granularity, defined as the ratio between the time a process spends doing useful computation and the time it spends waiting for communication from other processes. An easy way to increase (i.e. to coarsen) the parallel granularity is to decrease the cell size, thereby increasing the total number of grid points each process works on. These concepts are further explained in Appendix B.
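In symbols, writing $T_{compute}$ for the useful computation time and $T_{communic}$ for the time spent waiting for communication in a given time step (the notation of equations (9.1) and (9.2) below), the granularity can be expressed roughly as the ratio

$$ G \;=\; \frac{T_{compute}}{T_{communic}}, $$

where $G$ is simply a label chosen here for convenience; a coarse grain means $G \gg 1$, i.e. the processes spend most of their time computing rather than waiting.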
In order to obtain a higher speedup with the domain decomposition in the PTidal program (Kaplan, 1997), we attempted to coarsen the parallel granularity. The higher granularity is achieved by increasing the computation in each block through a reduction of the cell size, since the communication delay grows at a slower rate than the computation time. We designed a first grid (called th5 in the following) and a second, th20, with half the cell size of the previous one and therefore four times the number of grid points. The test setup is further explained in Section 7.2.1.
If both models use the same time step and a similar domain decomposition over an $n \times n$ mesh of processors, as in figure 5.2, the computation time per time step, $T_{compute}$ in (9.1), is multiplied by 4 when we switch from grid th5 to th20, while the message size is increased by a factor of 2. Taking the latency delay (defined in B.3) into account, the communication time, $T_{communic}$ in (9.2), is increased by a factor smaller than 2, $R_c < 2$ in (9.3).
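With each process holding an $(M/n) \times (N/n)$ block of the mesh, and assuming the 247 operations quoted below are counted per grid point and per time step, equation (9.1) presumably takes a form like

$$ T_{compute} \;=\; \frac{247}{F_m \cdot 10^{6}} \, \frac{M}{n} \, \frac{N}{n} \qquad (9.1) $$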
where the mesh being worked on is $M \times N$ cells, 247 is the number of operations needed to build and solve the tri-diagonal matrix, and $F_m$ is the computer performance measured in MFLOPS.
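Writing, say, $t_L$ for the latency delay and $B_w$ for the bandwidth (the symbols are a choice made here; the quantities themselves are defined below), the communication time of equation (9.2) should then have a form close to

$$ T_{communic} \;=\; 8 \left( t_L \;+\; \frac{8 \cdot 2\left(\frac{M}{n} + \frac{N}{n}\right)}{B_w \cdot 10^{6}} \right) \qquad (9.2) $$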
where the first 8 stands for the 8 times communication is employed, $t_L$ [s] is the latency delay, $B_w$ [Mbyte/s] is the bandwidth, the second 8 stands for the 8 bytes in each double precision number, and $2(M/n + N/n)$ is the number of values being sent.
The ratio $R_c$ of the communication times for grid th20 relative to grid th5 is:
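Under the same assumptions, halving the cell size doubles the number of boundary values exchanged while leaving the latency untouched, so equation (9.3) would read

$$ R_c \;=\; \frac{T_{communic}^{\,th20}}{T_{communic}^{\,th5}} \;=\; \frac{t_L + \dfrac{8 \cdot 4\left(\frac{M}{n}+\frac{N}{n}\right)}{B_w \cdot 10^{6}}}{t_L + \dfrac{8 \cdot 2\left(\frac{M}{n}+\frac{N}{n}\right)}{B_w \cdot 10^{6}}} \;<\; 2. \qquad (9.3) $$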
In the case of a cluster of workstations, the latency accounts for most of the communication delay and the $R_c$ ratio is close to 1. Therefore the communication delay is not significantly increased when we switch to the th20 grid, and the increase in the run time is due mainly to the computations, which increases the parallel granularity.
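To make the argument concrete, the small sketch below evaluates the timing model of (9.1)-(9.3) for purely illustrative figures; the mesh size, performance, latency and bandwidth values used here are hypothetical placeholders, not measurements from the PTidal runs. It shows the factor-of-4 growth of the computation time against an $R_c$ only slightly above 1 when latency dominates:

```python
# Illustrative evaluation of the timing model in (9.1)-(9.3).
# All figures below (M, N, Fm, t_lat, bw) are hypothetical placeholders.

def t_compute(M, N, n, Fm_mflops, ops_per_point=247):
    """Computation time per time step for one (M/n) x (N/n) block [s]."""
    return ops_per_point * (M / n) * (N / n) / (Fm_mflops * 1e6)

def t_communic(M, N, n, t_lat, bw_mbytes, n_comms=8, bytes_per_value=8):
    """Communication time per time step for one block [s]."""
    values_sent = 2 * (M / n + N / n)          # boundary values exchanged
    per_exchange = t_lat + bytes_per_value * values_sent / (bw_mbytes * 1e6)
    return n_comms * per_exchange

# Hypothetical figures: 10 MFLOPS per node, 2 ms latency, 10 Mbyte/s bandwidth.
M, N, n = 100, 100, 2
Fm, t_lat, bw = 10.0, 2e-3, 10.0

comm_th5 = t_communic(M, N, n, t_lat, bw)
comm_th20 = t_communic(2 * M, 2 * N, n, t_lat, bw)    # half the cell size
print(f"T_compute th5 : {t_compute(M, N, n, Fm):.3f} s")
print(f"T_compute th20: {t_compute(2 * M, 2 * N, n, Fm):.3f} s")  # 4x larger
print(f"Rc = {comm_th20 / comm_th5:.3f}")             # only slightly above 1
```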
In the utilization diagrams, run time is plotted on the x axis and the processes on the y axis (4 slaves plus 1 master). The optimum is to have the 4 slave processes working simultaneously, shown in green (gray in the black-and-white version). Red (black) represents process time wasted waiting for communication, and yellow (light gray) represents overhead, which is also wasted time.
Working with the coarser-granularity mesh, the utilization parameter increases and a higher speedup is obtained. Compare the numerically obtained utilization diagrams for the th5 grid (figure 9.1) and for the coarser parallel grain th80 grid (figure 10.1): the larger green area in the th80 simulation clearly shows its better utilization value.
The use of higher resolutions is the way to obtain more accurate approximations when CFD restrictions apply. The increased speedup associated with the coarsening of the parallel grain is an indirect positive result of this grid refinement.