Q: What optimisations would you immediately make to this program to
improve performance and why?
A: I did some quick tests on a system and came up with a few suggestions:
 Using gcc w/o optimization, the program took about 5 seconds to
run on a laptop I have access to. The command line was:
gcc -o a -g a.c
Changing the command line to:
gcc -o a -O2 -g a.c
and running the application reduces run time to about 2.2 seconds.
Based on these results, I suggest:
a. unrolling the inner loop (duplicating the code / reducing number
of iterations of the loop)
b. finding / using common subexpressions (e.g., assign OldValue - A[i][j]
to a temporary value & multiply it by itself)
c. moving some values to registers (could hint by using register in
the declarations, but not all C compilers will comply with that hint)
d. depending on your system, reorder operations to keep all the
processor units (e.g., if you have multiple FPUs) busy - note, it is
HARD to do this by hand and is not portable between system types.
e. counting i and j down in the loops (testing against zero) MAY be
faster - again, this is processor dependent. (caching effects from
moving forward through memory may negate this)
f. eliminating the zeroing of the array (near the start) - many
systems do this by default so the code doesn't have to do so.
g. using vector instructions when available.
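Suggestions (a)-(c) can be sketched together on a hypothetical inner loop. The original source isn't reproduced here, so the Gauss-Seidel style stencil and the error accumulator below are assumptions about what the program computes:

```c
#define N 8   /* small grid for illustration; the real array is larger */

/* Hypothetical relaxation sweep illustrating (b) and (c): the difference
 * old - A[i][j] is computed once into a temporary (a common subexpression)
 * and multiplied by itself, and the local temporaries are natural
 * candidates for registers. Returns the summed squared change. */
double relax_step(double A[N][N])
{
    double error = 0.0;
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            double old = A[i][j];
            /* average of the four neighbours (assumed update rule) */
            A[i][j] = (A[i - 1][j] + A[i + 1][j] +
                       A[i][j - 1] + A[i][j + 1]) / 4.0;
            double diff = old - A[i][j];   /* computed once, reused */
            error += diff * diff;          /* cheaper than pow(diff, 2) */
        }
    }
    return error;
}
```

Unrolling (a) would duplicate the loop body for j and j+1; compilers at -O2 and above usually do this for you, so measure before hand-unrolling.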
See the man page for gcc (for example) for more information (scroll
down about a third of the way to the section on controlling
optimization). For a more readable explanation of some of these
optimizations, see the various references on C and C++ source code
optimization.
 The nested loops (j outer, then i) can be reversed (i outer, then j)
for another 10% reduction in CPU time. This is because of cache
efficiency, but it varies by system type - test to be sure.
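The cache argument can be made concrete by looking at element addresses. Assuming a square array with 129 floats per row (the dimension noted below), putting the j loop innermost steps one element at a time, while putting i innermost jumps a whole row per iteration:

```c
#include <stddef.h>

#define ROWS 129
#define COLS 129   /* assumed square, per the "129 floats per row" note */

static float A[ROWS][COLS];

/* elements between A[i][j] and A[i][j+1]: 1, i.e., adjacent in memory */
ptrdiff_t inner_stride(int i, int j) { return &A[i][j + 1] - &A[i][j]; }

/* elements between A[i][j] and A[i+1][j]: a whole row apart */
ptrdiff_t outer_stride(int i, int j) { return &A[i + 1][j] - &A[i][j]; }
```

The unit-stride access pattern is what lets the hardware prefetcher keep the next iteration's data in cache.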
 Changing the size of the array "A" so that each row is aligned on a
page-sized boundary (which varies by CPU type) may also help - there
are 129 floats per row the way the array is currently defined.
Something like float A[129][256] (which allocates 256 floats per row)
may run better. Again, you need to run the test to be sure.
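A minimal sketch of that padding (the 129 comes from the current declaration; 256 is the suggested padded width):

```c
/* Pad each row from 129 to 256 floats (1024 bytes), so each row begins a
 * power-of-two number of bytes after the previous one; the computation
 * still uses only the first 129 columns of each row. */
static float A[129][256];
```

The unused columns cost memory but make the address of A[i][j] computable with a shift instead of a multiply, and keep rows from overlapping in cache sets.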
 It MAY also be better to define A as double instead of float. This
is due to the way C does floating point arithmetic (it defaults to
double precision), which eliminates a LOT of float-to-double and
double-to-float conversions. Test to be sure.
 It would look ugly, but use array constants to initialize array
"A". This would typically increase the size of the executable, but
move the initialization of A to compile time instead of run time.
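A small sketch of the idea, using a tiny grid and a hypothetical boundary row of ones:

```c
#define N 5   /* tiny size for illustration; the real array is larger */

/* An array constant moves initialization to compile time: the listed
 * boundary row is baked into the executable's data, and C guarantees
 * that unlisted elements are zero, so no run-time zeroing loop is
 * needed at all. */
static float A[N][N] = {
    { 1.0f, 1.0f, 1.0f, 1.0f, 1.0f },  /* assumed top boundary values */
};
```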
 Revising the algorithm (e.g., using "over relaxation") would
likely help for the relatively simple example provided. There is a
pretty concise explanation of this which suggests using factors
between 1.25 and 1.5 for the adjustment in each cycle (you have to
avoid oscillation, but it reduces the number of iterations (Iter)
significantly).
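A sketch of successive over-relaxation (SOR) under the same assumed stencil as before: each point is pushed past the neighbour average by a factor omega, and factors in the suggested 1.25-1.5 range typically need far fewer iterations than plain Gauss-Seidel (omega = 1.0):

```c
#define N 8   /* small grid for illustration */

/* Iterate until the summed squared update falls below tol; returns the
 * iteration count. omega = 1.0 is plain Gauss-Seidel; 1.0 < omega < 2.0
 * over-relaxes (too large a value oscillates or diverges). */
int relax_sor(double A[N][N], double omega, double tol)
{
    int iter = 0;
    double error;
    do {
        error = 0.0;
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++) {
                double avg = (A[i - 1][j] + A[i + 1][j] +
                              A[i][j - 1] + A[i][j + 1]) / 4.0;
                double delta = avg - A[i][j];
                A[i][j] += omega * delta;   /* overshoot toward the average */
                error += delta * delta;
            }
        iter++;
    } while (error > tol && iter < 10000);
    return iter;
}
```

The best omega depends on the grid size, which is why the reference gives a range rather than a single value.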
Q: Rewrite the program to allow it to perform well on a shared memory
system.
A: I will provide some general guidelines for this - note that if you
want this code rewritten for a specific shared memory system - you
MUST make a request for clarification including:
- the type of system being used
- the type of shared memory interconnect (e.g., Myrinet, an SMP local bus)
- the run time library being used to communicate between the CPUs
(e.g., Posix threads, MPI)
and so on. The type of shared memory interconnect is ESPECIALLY
IMPORTANT since some have extremely variable performance (reads much
slower than writes, some do not enforce cache coherency).
 Depending on the run time library, start the programs on each CPU
and synchronize the start.
 The code up to the first executable line (the do statement, if you
use array constants) is the same on each CPU. (start MPI or map memory
to be shared, if necessary)
 Break the array into "M" pieces (for M cpus). There is a good
explanation of this at
which describes using OpenMP for this type of application. The page
titled "8. Sharing Pattern" illustrates the specific layout of storage
and the memory to be used by a single CPU or by multiple ("adjacent")
CPUs.
 The main loop now looks something like the code at "7. Example
Program" in the previous reference. There is a sequence of operations
performed in parallel, with sequential steps at the end of each inner
loop (computing / testing the error bounds, etc.). Specific
synchronization steps will vary by the run time library.
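As a concrete sketch of the pattern (assuming OpenMP, one of the possible run time libraries): rows are split among the threads ("M pieces for M CPUs"), the update is done Jacobi-style into a second array so the threads stay independent, and the error reduction is the sequential step at the end of each sweep. Without -fopenmp the pragma is ignored and the code simply runs serially:

```c
#include <string.h>

#define N 256
static float A[N][N], Anew[N][N];

/* One parallel sweep: each thread gets a contiguous band of rows, and
 * the reduction combines the per-thread error sums at the end. */
float sweep(void)
{
    float err = 0.0f;
    #pragma omp parallel for reduction(+:err) schedule(static)
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++) {
            float avg = (A[i - 1][j] + A[i + 1][j] +
                         A[i][j - 1] + A[i][j + 1]) * 0.25f;
            float d = avg - A[i][j];
            Anew[i][j] = avg;   /* write to the second array: no races */
            err += d * d;
        }
    /* copy the interior back (boundary rows/columns are never written) */
    for (int i = 1; i < N - 1; i++)
        memcpy(&A[i][1], &Anew[i][1], (N - 2) * sizeof(float));
    return err;
}
```

The Jacobi-style double buffer trades some convergence speed for thread independence; a true Gauss-Seidel parallelization needs red-black ordering or explicit band synchronization.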
 The end of the program cleans up / generates the output.
For more information on parallel computing, search with phrases such as:
laplace relaxation SMP
MPI message passing interface
scalable coherent interface
and so on.
Good luck with your work. If some part of the answer is incomplete or
unclear, please make a clarification request so I can correct the
answer prior to rating.
Clarification of Answer by
03 Jun 2006 16:42 PDT
I would be glad to explain those points.
 Changing the direction of movement through the array allows a couple
of additional optimizations. In C (unlike FORTRAN), arrays are stored
in what's called "Row Major Order", where the right-most index varies
most quickly. By reversing the order of the loop indices, the accesses
to A[i][j] in the main loop become consecutive through memory -
allowing the compiler to generate more efficient code. There should
also be fewer cache misses, as the CPU often prefetches subsequent
data, so the next iteration (computing A[i][j+1]) should refer to data
already in the cache.
 One technique used to get the maximum performance out of an
algorithm (or the system running that algorithm) is to ensure the data
is aligned on boundaries that are efficient for loading / storing
information. The array of floats (or doubles if you make that change)
should be aligned for FPU loads and stores, but there are some
secondary effects that can be optimized as well. Going back to the
shared memory case, the proper alignment of the unshared and shared
regions can improve performance. The rationale goes something like this:
- CPU A tries to access a value at address A1
- CPU B tries to access a value at address A2 which is either
o near A1 OR
o part of the same cache line as A1 (often the same as the previous condition)
In these cases, the memory sharing equipment must sequence accesses to
the memory at locations A1 and A2, slowing down one or both systems.
I suggested increasing the dimensions to 256 elements per row (1024
bytes for floats, 2048 bytes for doubles) to improve the chance that
these effects will NOT take place (which is also why you have to run
the test). The actual size of each row may have to be much larger to
get the optimal effect - a review of the page size and cache design of
the CPU and shared memory system will help determine the "best" size
for this value.
About the difference between floating point multiply and divide
operations: a "good" compiler will choose the best method
automatically. This is called strength reduction - see a compiler
optimization reference for a good explanation of the operations
performed in this step (it explicitly includes replacing division by a
constant with multiplication by the inverse). The difference between a
floating point divide and multiply is about 3:1 or more (and varies by
the size of the operands, etc.). Look at the number of cycles for FDIV
and FMUL respectively.
For the divide by 2 and divide by 8 operations, the most efficient
method could be to adjust the binary exponent (reducing it by 1 and 3
respectively); if that is feasible on the system you are running on,
the compiler will generate better code for the formulas as written.
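For powers of two the reciprocal is exact in binary floating point, so all three spellings below give identical results; scalbn() is the standard library's way of expressing the exponent adjustment directly (the function names are just illustrative):

```c
#include <math.h>

/* Three equivalent ways to divide by 8: the division, the
 * strength-reduced multiply by the exact reciprocal, and an explicit
 * binary-exponent adjustment. */
double div8(double x)    { return x / 8.0; }
double mulinv8(double x) { return x * 0.125; }      /* 1/8 is exact */
double scale8(double x)  { return scalbn(x, -3); }  /* exponent - 3 */
```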
On the caching question (what would I prefer to cache or not), the
obvious answer for the single CPU solution is
- all the data AND
- all the code
Using a size of 256 x 256 x 4 bytes (for floats) for A, the array
should fit comfortably within the cache of a modern CPU. The code is
quite small, so it should fit within the cache as well.
If you have a system with extremely limited cache, the obvious items
to cache (or keep in registers in lieu of cache) are:
- the loop indices (or more properly, the offsets used to access the
five values in each update)
- the total error value
- the number of iterations
- if possible, the values fetched in the previous iteration (next
row/ column values)
- the code for the innermost loop
On the multiple CPU solution, the caching question is much more
complicated and varies by the architecture of the shared memory
system. An SMP system does its best to keep caches consistent on all
the CPUs to mitigate the problem. In those cases, rewriting the code
for caching effects will not provide much of a performance boost.
On other shared memory systems, a cache miss on read is FAR more
expensive than one on write. I have seen ratios of about 20:1 on real
systems for this kind of operation. If you have one of these systems,
the generating CPU should copy the values to memory local to each
"using CPU" to minimize overall CPU time. In this case, you write the
code to apparently "do more work" (N writes to remote memory, then the
using CPU does a read from local memory) but the end result is faster