

10 Performance

Our example suite includes two kinds of examples implemented in C, C++ and F90. The first kind is pure computation and does not include language wrapping or communication. The second kind is more complex, includes communication, and is wrapped with C++ and F90 interfaces.

10.1 Single-Language Computation

Code for the example is in computation_example/mandelbrot.

Three programs that compute the Mandelbrot set were written, one each in F90, C and C++. The code is line-for-line comparable across the three programs. The test was performed with various compilers on various platforms. The set was calculated on a grid of 1024x768 points, with the view into the complex plane running from -2.0 to 2.0 on both the real and imaginary axes. Each run computed the set 20 times, and each run was performed twice to make sure the results were comparable from run to run; the results of the two runs were averaged. All examples were compiled with -O3. The average times per run, in seconds, are:

Platform (compiler)       C (s)    C++ (s)    F90 (s)
Linux (pgCC, pgf90)       26.5     26.5       26.6
SGI (vendor compiler)     49.5     49.6       40.9
IBM rs6000 (vendor)       34.2     21.0       20.9
Sun (vendor compiler)     46.0     32.1       51.3

The conclusion is that performance is generally very close. F90 was the fastest on the SGI and, narrowly, on the IBM; it was essentially tied with C and C++ on Linux and was the slowest on the Sun. A much larger difference appeared when the GNU compilers were put head to head with the Portland Group compilers: the GNU builds were almost three times as slow. There is also a large difference between C/C++ and F90 when compiling in debug mode, which should not be a problem.
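For readers who want a feel for the kernel being timed, the following is a minimal C++ sketch of an escape-time Mandelbrot computation on a grid covering -2.0 to 2.0 on both axes. It is not the code from computation_example/mandelbrot. The function names, the iteration cap max_iter, and the escape radius of 2 are assumptions made for illustration, since the text above does not state them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Escape-time iteration count for one point c = (cr, ci).
// max_iter is an assumed parameter; the benchmark text does not give one.
int mandel_iterations(double cr, double ci, int max_iter) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    // Iterate z = z*z + c until |z| > 2 or the cap is reached.
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        const double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++n;
    }
    return n;
}

// Fill a width x height grid (row-major) covering [-2, 2] on both axes.
std::vector<int> mandelbrot_grid(int width, int height, int max_iter) {
    assert(width > 1 && height > 1);
    const double lo = -2.0, hi = 2.0;
    std::vector<int> counts(static_cast<std::size_t>(width) * height);
    for (int row = 0; row < height; ++row) {
        const double ci = lo + (hi - lo) * row / (height - 1);
        for (int col = 0; col < width; ++col) {
            const double cr = lo + (hi - lo) * col / (width - 1);
            counts[static_cast<std::size_t>(row) * width + col] =
                mandel_iterations(cr, ci, max_iter);
        }
    }
    return counts;
}
```

Calling mandelbrot_grid(1024, 768, max_iter) twenty times inside a timer would roughly mirror the shape of the test described above, although absolute times will depend on the chosen max_iter.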

10.2 Large-Scale Example

Code for the example is in large_scale_example/mandelbrot.

An example drawn from the NASA DAO PILGRIM library demonstrates the performance of various implementations of a procedure involving regular data transfers. The exercise takes a 2048x5 array of gridpoints and creates a 2D square decomposition of the points over 9 (3x3) processors. The gridpoints are then paraded around the decomposition in a circular manner 100 times using a transpose, until they finally reach their original position. A comparison of buffers at the end of the program verifies that the gridpoints made it back to their original position correctly. This example uses a larger code base than the computation example and involves MPI communication.
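As an illustration of what a 3x3 decomposition of a 2048x5 array can look like, the sketch below computes the sub-block owned by each of the 9 ranks. This is one common block-distribution scheme and is not necessarily how PILGRIM lays out the data. The names Range, Block, block_range and block_for_rank, the row-major rank ordering, and the rule that leftover points go to the lowest-numbered blocks are all assumptions.

```cpp
#include <cassert>

// Half-open index range [begin, end).
struct Range { int begin; int end; };

// Split n items over `parts` blocks as evenly as possible.
// Assumption: the first (n % parts) blocks each receive one extra item.
Range block_range(int n, int parts, int index) {
    assert(parts > 0 && index >= 0 && index < parts);
    const int base = n / parts;
    const int extra = n % parts;
    const int begin = index * base + (index < extra ? index : extra);
    const int size = base + (index < extra ? 1 : 0);
    return Range{begin, begin + size};
}

// Sub-block of an n_rows x n_cols array owned by one rank.
struct Block { Range rows; Range cols; };

// Map a rank in a grid_dim x grid_dim processor grid to its sub-block.
// Assumption: ranks are numbered row-major across the processor grid.
Block block_for_rank(int n_rows, int n_cols, int grid_dim, int rank) {
    assert(rank >= 0 && rank < grid_dim * grid_dim);
    return Block{block_range(n_rows, grid_dim, rank / grid_dim),
                 block_range(n_cols, grid_dim, rank % grid_dim)};
}
```

Under these assumptions, block_for_rank(2048, 5, 3, rank) for rank 0 through 8 covers every gridpoint exactly once, with block sizes differing by at most one point in each dimension.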

The following table shows times in seconds for C++ wrapped in F90 and for the original F90 PILGRIM library implementation. The C++ implementation is considerably faster.

Platform (compiler)       F90 (s)    C++ (s)
SGI (vendor compiler)     41.0       5.5
IBM rs6000 (vendor)       7.2        3.1
Sun (vendor compiler)     35.0       4.2

The codes are not line-for-line comparable. A number of optimizations were easy to make in the C++ code and contributed to its speed. For instance, the C++ code performs a permutation by simply creating an indexing array, whereas PILGRIM copies the entire array.
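The indexing-array idea can be sketched as follows. This is not the code from the example suite: the class name PermutedRows and its methods are hypothetical, and the sketch assumes the permutation acts on whole rows of a row-major array. Reordering rows then touches only a small index vector, while the gridpoint data stays where it is.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Row-major 2D array whose rows can be permuted without moving the data.
// Hypothetical illustration only; all names here are assumptions.
class PermutedRows {
public:
    PermutedRows(std::vector<double> data, std::size_t n_rows, std::size_t n_cols)
        : data_(std::move(data)), n_cols_(n_cols), row_index_(n_rows) {
        assert(data_.size() == n_rows * n_cols);
        // Start with the identity permutation: logical row i is stored row i.
        std::iota(row_index_.begin(), row_index_.end(), std::size_t{0});
    }

    // Exchange two logical rows by swapping two indices, not two rows of data.
    void swap_rows(std::size_t a, std::size_t b) {
        std::swap(row_index_[a], row_index_[b]);
    }

    // Read an element through the indexing array.
    double at(std::size_t row, std::size_t col) const {
        return data_[row_index_[row] * n_cols_ + col];
    }

private:
    std::vector<double> data_;            // never reordered
    std::size_t n_cols_;
    std::vector<std::size_t> row_index_;  // logical row -> stored row
};
```

The trade-off in this sketch is one extra index lookup per element access in exchange for permutations whose cost is independent of the row length.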
