Google Answers: Laplace relaxation program

View Question

Q: Laplace relaxation program ( Answered 5 out of 5 stars

Question

Subject: Laplace relaxation program
Category: Computers > Programming
Asked by: echan80-ga
List Price: $20.00

Posted: 31 May 2006 11:22 PDT
Expires: 30 Jun 2006 11:22 PDT
Question ID: 734096

The fragment below compes from a Laplace relaxation program.

Iter = 0;
do {
TotalError = 0.0;
for (j = 1; j < N; j++) {
for (i = 1; i < N; i++) {
OldValue = A[i][j];
A[i][j] = A[i][j] /2.0+ (A[i - 1][j] + A[i + 1][j] + A[i]
[j - 1] + A[i][j + 1]) /8.0;
TotalError += (OldValue - A[i][j]) * (OldValue - A[i][j]);
}
}
Iter++;
} while (TotalError >= Norm);

What optimisations would you immediately make to this program to
improve performance and why?

Rewrite the program to allow it to perform well on a shared memory multiprocessor.

Clarification of Question by echan80-ga on 02 Jun 2006 08:16 PDT

It is a C program.
The full program is here:

main(){  double Norm = 0.05;  int N=127;  long Iter, i, j;  float
OldValue, TotalError;  float A[N + 1][N + 1];  for (i = 0; i <= N;
i++)     for (j = 0; j <= N; j++)      A[i][j] = 0.0;   for (i = 0; i
<= N; i++)    A[i][0] = 100.0;  for (j = 0; j <= N; j++)    A[0][j] =
100.0;  Iter = 0;  do {     TotalError = 0.0;     for (j = 1; j < N;
j++) {       for (i = 1; i < N; i++) {	       OldValue = A[i][j];	    
  A[i][j] = A[i][j] /2.0+ (A[i - 1][j] + A[i + 1][j] + A[i]				   [j -
1] + A[i][j + 1]) /8.0;	   TotalError += (OldValue - A[i][j]) *
(OldValue - A[i][j]);      }    }    Iter++;  } while (TotalError >=
Norm);  //printf("%10ld%10.6f\n", Iter);}

Clarification of Question by echan80-ga on 02 Jun 2006 08:18 PDT

main()
{  double Norm = 0.05;  
int N=127;  
long Iter, i, j;  
float OldValue, 
TotalError;  
float A[N + 1][N + 1]; 
 for (i = 0; i <= N; i++)     
for (j = 0; j <= N; j++)    
  A[i][j] = 0.0;   
for (i = 0; i <= N; i++)    
A[i][0] = 100.0;  
for (j = 0; j <= N; j++)    
A[0][j] = 100.0;  Iter = 0;  
do {     TotalError = 0.0;     
for (j = 1; j < N; j++) 
{       
for (i = 1; i < N; i++) 
{	       OldValue = A[i][j];	       
A[i][j] = A[i][j] /2.0+ (A[i - 1][j] + A[i + 1][j] + A[i][j - 1] +
A[i][j + 1]) /8.0;
TotalError += (OldValue - A[i][j]) * (OldValue - A[i][j]);      
}    }    
Iter++;  } 
while (TotalError >= Norm);  
//printf("%10ld%10.6f\n", Iter);}

Answer

Subject: Re: Laplace relaxation program
Answered By: maniac-ga on 02 Jun 2006 18:55 PDT
Rated: 5 out of 5 stars

Hello Echan80, Q: What optimisations would you immediately make to this program to improve performance and why? A: I did some quick tests on a system and came up with a few suggestions: [1] Using gcc w/o optimization, the program took about 5 seconds to run on a laptop I have access to. The command line was: gcc -o a -g a.c Changing the command line to: gcc -o a -O2 -g a.c and running the application reduces run time to about 2.2 seconds. From this, it suggests: a. unrolling the inner loop (duplicating the code / reducing number of iterations of the loop) b. find / use common sub expressions (e.g., assign OldValue-A[i][j] to a temporary value & multiply by itself) c. moving some values to registers (could hint by using register in the declarations, but not all C compilers will comply with that hint) d. depending on your system, reorder operations to keep all the processor units (e.g., if you have multiple FPUs) busy - note, it is HARD to do this by hand and is not portable between system types. e. going down with i and j MAY be faster (testing against zero) on the loops - again this is processor dependent. (caching effects on moving forward through memory may negate this) f. eliminating the zeroing of the array (near the start) - many systems do this by default so the code doesn't have to do so. g. using vector instructions when available. See the man page for gcc (for example) http://www.die.net/doc/linux/man/man1/gcc.1.html for more information (scroll down about 1/3rd to the section about controlling optimization). For a more readable explanation of some of the optimizations, see http://linuxgazette.net/issue71/joshi.html or see http://www.programmersheaven.com/zone3/articles/article559.htm for several references to C and C++ source code optimizations. [2] The nested loop (j, then i) can be reversed (i, then j) and get another 10% reduction in CPU time. This is because of cache efficiencies but varies by system type - test to be sure. [3] Changing the size of the array "A" to align each row on a page sized boundary (varies by CPU type) - there are 129 floats per row the way the array is currently defined. Something like A[255][255] (which allocates 256 floats per row / column) may run better. Again, you need to run the test to be sure. [4] It MAY also be better to define A as double instead of float. This is due to the way C does floating point arithmetic (defaults to double precision), eliminating a LOT of float to double & double to float conversions. Test to be sure. [5] It would look ugly, but use array constants to initialize array "A". This would typically increase the size of the executable, but move the initialization of A to compile time instead of run time. [6] Revising the algorithm (e.g., use of "over relaxation") would likely help for the relatively simple example provided. There is a pretty concise explanation of this at http://www.triumf.ca/relax/doc/relax3d/node11.html which suggests using factors between 1.25 and 1.5 for the adjustment in each cycle (have to avoid oscillation, but reduces the number of iterations (Iter) significantly). Q: Rewrite the program to allow it to perform well on a shared memory multiprocessor. I will provide some general guidelines for this - note that if you want this code rewritten for a specific shared memory system - you MUST make a request for clarification including: - the type of system being used - the type of shared memory interconnect (e.g., myrinet, SMP local memory, craylink) - the run time library being used to communicate between the CPUs (e.g., Posix threads, MPI) and so on. The type of shared memory interconnect is ESPECIALLY IMPORTANT since some have extremely variable performance (reads much slower than writes, some do not enforce cache coherency). [0] Depending on the run time library, start the programs on each CPU and synchronize the start. [1] The code up to the first executable line (the do statement if you use array constants) is the same in each CPU. (start MPI or map memory to be shared if necessary) [2] Break the array into "M" pieces (for M cpus). There is a good explanation of this at http://www.epcc.ed.ac.uk/ewomp2000/Presentations/SatohSlides.pdf which describes using Open MP for this type of application. The page titled "8. Sharing Pattern" illustrates the specific layout of storage and the memory to be used by a single CPU or by multiple ("adjacent") CPUs. [3] The main loop now looks something like the code at "7. Example Program" in the previous reference. There are a sequence of operations performed in parallel with sequential steps at the end of each inner loop (computing / testing the error bounds, etc.). Specific synchronization steps will vary by the run time library. [4] The end of the program cleans up / generates the output (currently commented out). For more information on parallel computing, search with phrases such as: laplace relaxation SMP posix threads MPI message passing interface myrinet scalable coherent interface infiniband and so on. Good luck with your work. If some part of the answer is incomplete or unclear, please make a clarification request so I can correct the answer prior to rating. --Maniac
Request for Answer Clarification by echan80-ga on 02 Jun 2006 19:48 PDT Can you carify the following two points? [2] The nested loop (j, then i) can be reversed (i, then j) and get another 10% reduction in CPU time. This is because of cache efficiencies but varies by system type - test to be sure. [3] Changing the size of the array "A" to align each row on a page sized boundary (varies by CPU type) - there are 129 floats per row the way the array is currently defined. Something like A[255][255] (which allocates 256 floats per row / column) may run better. Again, you need to run the test to be sure. And also changing the code from A[i][j]/8 to A[i][j]*0.125, does it also help with performance and why?
Request for Answer Clarification by echan80-ga on 02 Jun 2006 20:16 PDT Also, If you had control over what was cached and what was not cached in the problem what choices would you make and why? Thanks
Clarification of Answer by maniac-ga on 03 Jun 2006 16:42 PDT Hello Echan80, I would be glad to explain those points. [2] Changing movement through the array allows for a couple of additional optimizations. In C (not FORTRAN), arrays are stored in what's called "Row Major Order" where the right most index varies most rapidly. See http://en.wikipedia.org/wiki/Row-major_order for an explanation of this concept. By reversing the order of loop indicies, the accesses to A[i][j] in the main loop are consecutive through memory - allowing the compiler to generate more efficient code. There should also be reduced cache misses as the CPU often prefetches subsequent data, so the next iteration (computing A[i][j+1]) should refer to data already in the cache. [3] One technique used to get the maximum performance out of an algorithm (or the system running that algorithm) is to ensure the data is aligned on boundaries that are efficient for loading / storing information. The array of floats (or doubles if you make that change) should be aligned for FPU loads and stores, but there are some secondary effects that can be optimized as well. Going back to the description at http://www.epcc.ed.ac.uk/ewomp2000/Presentations/SatohSlides.pdf the proper alignment of the unshared and shared regions can improve performance. The rationale goes something like this: - CPU A tries to access a value at address A1 - CPU B tries to access a value at address A2 which is either o near A1 OR o part of the same cache line as A1 (often the same as the previous condition) In these cases, the memory sharing equipment must sequence accesses to the memory at locations A1 and A2, slowing down one or both systems. I suggested increasing the dimensions to 256 elements per row (1024 bytes for floats, 2048 bytes for doubles) to improve the chance that these effects will NOT take place (and why you have to run the test). The actual size of each row may actually have to be much larger to get the optimal effect - a review of the page size of the CPU / and cache design of the CPU and shared memory system will help determine the "best" size for this value. About the difference between floating point multiply and divide operations, a "good" compiler will choose the best method for that automatically. This is called strength reduction - see http://en.wikipedia.org/wiki/Compiler_optimization (scroll near the bottom) for a good explanation of the types of operations that are performed in this step (it explicitly mentions multiplication by the inverse replacing the division by a constant). According to http://tdenis.tripod.com/download/fpuopcode.txt the difference between floating point divide and multiply is about 3:1 or more (and varies by size of operations, etc.). Look at the number of cycles for FDIV and FMUL respectively. For the divide by 2 and 8 operations, the most efficient method could be to adjust the binary exponent (reduce by 1 and 3 respectively) and if feasible on the system you are running on, the compiler would generate better code for the formulas as written. On the caching question (what would I prefer to cache or not), the obvious answer for the single CPU solution is - all the data AND - all the code Using the size of 256 x 256 x 4 (for floats) of A, the array should fit comfortably within the cache of a modern CPU. The code is quite small so it should fit within to cache as well. If you have a system with extremely limited cache, the obvious items to cache (or keep in registers in lieu of cache) are: - the loop indicies (or more properly the offsets used to access the five values in each update) - the total error value - the number of iterations - if possible, the values fetched in the previous iteration (next row/ column values) - the code for the innermost loop On the multiple CPU solution, the caching question is much more complicated and varies by the architecture of the shared memory system. An SMP system does its best to keep caches consistent on all the CPUs to mitigate the problem. In those cases, rewriting the code for caching effects will not provide much of a performance boost. On other shared memory systems, a cache miss on read is FAR more expensive than one on write. I have seen ratios of about 20:1 on real systems for this kind of operation. If you have one of these systems, the generating CPU should copy the values to memory local to each "using CPU" to minimize overall CPU time. In this case, you write the code to apparently "do more work" (N writes to remote memory, then the using CPU does a read from local memory) but the end result is faster execution. --Maniac

echan80-ga rated this answer: 5 out of 5 stars

and gave an additional tip of: $10.00

Thanks for your help!
It is really great help for me!!!!

Comments

Subject: Re: Laplace relaxation program
From: leeach_3652-ga on 02 Jun 2006 01:43 PDT

I dont think you have given enough information for anyone to really help.

what language? (im assuming C/++ or Java)
where is the rest of the relevant code? [The initialization of the
matrix, the initialization of Norm etc...]
any idea what type of data it will take? [its difficult to do any
clever optimization without knowing the data]
why doesn't it crash when you reach you the end of the array? || how
does it check the last row and column?

that said, from the information provided i cant really see any
possible scope for optimization.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy