Our example suite includes two kinds of examples, implemented in C, C++ and F90. The first kind is pure computation and involves no language wrapping or communication. The second kind is more complex: it includes communication and is wrapped with C++ and F90 interfaces.
Code for the example is in computation_example/mandelbrot.
Three programs, written in F90, C and C++, compute the Mandelbrot
set. The code is line-for-line comparable across the three programs. The test
was performed with various compilers on various
platforms. The set was calculated on a grid of 1024x768 points. The view into
the complex plane is from -2.0 to 2.0 on both the real and imaginary axes. Each
run created the set 20 times, and each run was performed twice to make
sure the results were comparable from run to run. The results were averaged.
All examples were compiled with -O3. The averages per run are:
|Linux (pgCC, pgf90)||26.5||26.5||26.6|
|SGI (vendor compiler)||49.5||49.6||40.9|
|IBM rs6000 (vendor)||34.2||21.0||20.9|
|Sun (vendor compiler)||46.0||32.1||51.3|
The conclusion is that performance is generally very close. On most of the platforms F90 was slightly faster, but on one (the Sun) it was slower. A much larger difference appeared when the GNU compilers were put head to head with the Portland Group compilers: the GNU tools were almost three times as slow. A large difference between C/C++ and F90 is also seen when compiling in debug mode (which should not be a problem).
Code for the example is in large_scale_example/mandelbrot.
An example drawn from the NASA DAO PILGRIM library demonstrates the
performance of various implementations of a procedure involving regular data transfers.
The exercise takes a 2048x5 array of gridpoints and creates a 2D square
decomposition of the points over 9 (3x3) processors. The gridpoints are then paraded
around the decomposition in a circular manner 100 times using a transpose, until they
finally reach their original position. A comparison of buffers at the end of the
programs verifies that the gridpoints made it correctly back to their original
position. This example utilizes a larger code base than the computation
example and involves MPI communication.
The following table shows times in seconds for C++ wrapped in F90 and for the original F90 PILGRIM library implementation. The C++ implementation is considerably faster.
|SGI (vendor compiler)||41.0||5.5|
|IBM rs6000 (vendor)||7.2||3.1|
|Sun (vendor compiler)||35.0||4.2|
The codes are not line-to-line comparable. There were a number of optimizations that were easily done in the C++ code that contributed to its speed. For instance, a permutation is done by simply creating an indexing array, whereas PILGRIM copies the entire array.