|
Component Overhead in the NCEP SSI Package
Peggy Li/JPL, John Wolfe/NCAR, Jim Edwards/IBM, Weiyu Yang/NCEP

Introduction

The NCEP Spectral Statistical Interpolation (SSI) analysis system was selected as one of the ESMF Data Assimilation Applications included in ESMF Milestone F: First Code Improvement. The Milestone F report stated that the ESMF version of the SSI code showed a 12% overhead relative to the baseline (non-ESMF) version. That timing came from a 16-processor run on the NCAR IBM cluster blackforest, where the baseline code took 855 seconds of wall-clock time and the ESMF code took 960 seconds. The objectives of this task are to verify the timing results reported in Milestone F, to conduct a more thorough timing analysis that locates the source of the overhead, and finally to recommend optimization strategies that reduce the overhead to below 10% of the total run time. ESMF version 2.0.0rp2 was used.

NCEP SSI Overview

The NCEP Spectral Statistical Interpolation (SSI) is a three-dimensional variational analysis of observations, used in the National Weather Service data assimilation system to initialize the global atmospheric model. The physical domain of SSI is the global atmosphere from the surface to the stratopause. Instead of physics parameterizations, the SSI is composed of forward model elements and their adjoints. Some major forward model elements are radiative transfer algorithms for each satellite instrument, convective and large-scale precipitation, spectral transform, grid interpolation, the balance equation, and the divergence tendency equation. The analysis minimizes a combination of fits to observations, fits to the model background, and a set of dynamical constraints. The minimization is performed in spectral space. The forward models and their adjoints must be computed at every iteration. The background error covariance computation relates every variable to every other variable in the model domain.
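As a reading aid, the minimization described in the overview has the shape of the generic three-dimensional variational cost function. The form below is the standard textbook one and is not taken from the SSI documentation. The symbols (analysis state x, background x_b, background error covariance B, observations y, forward model H, observation error covariance R, and constraint term J_c) are introduced here for illustration only, and the actual SSI formulation may differ in its scaling and in how the dynamical constraints enter.

```latex
J(\mathbf{x}) =
  \tfrac{1}{2}\,(\mathbf{x}-\mathbf{x}_b)^{\mathsf T}\,\mathbf{B}^{-1}\,(\mathbf{x}-\mathbf{x}_b)
  + \tfrac{1}{2}\,\bigl(H(\mathbf{x})-\mathbf{y}\bigr)^{\mathsf T}\,\mathbf{R}^{-1}\,\bigl(H(\mathbf{x})-\mathbf{y}\bigr)
  + J_c(\mathbf{x})
```

The first term is the fit to the model background, the second is the fit to observations through the forward model, and the third stands for the dynamical constraints. Evaluating the gradient of the second term is what requires the adjoint of the forward model at every iteration.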
The input and output files are in both binary and BUFR formats. The analysis reads or writes about 800 Mb of data per day.

SSI ESMF Code Overhead

The ESMF version of the standalone SSI code incurs two types of overhead: the ESMF overhead and the algorithm overhead.

1. The ESMF Overhead

The SSI ESMF code only uses the ESMF superstructure classes. It is not coupled with any other ESMF components, nor does it use any ESMF infrastructure layer services.

2. The Algorithm Overhead

The ESMF version of the SSI code carries additional data conversion overhead because ESMF lacks support for spectral fields. In the Initialize routine, the code converts the spectral fields stored in an input file into Gaussian grids and saves the grid data into an ESMF Grid object of type ESMF_GridType_LatLon. The grid object is then passed to the Run routine for processing. When the data is used in the main computation, the gridded data has to be converted back to spectral fields. At the end of the computation, the spectral data is once again converted to the Gaussian grid and stored in the ESMF Grid object. The result is not written to a file or passed to other ESMF components in this standalone run. In summary, there are three extra data conversions in the ESMF code.

Results

We ran the timing tests on NCAR's IBM cluster blackforest. We compiled the SSI baseline code and the SSI ESMF code using the IBM Fortran 95 compiler xlf95 with the compiler switches -q32 -qrealsize=8 -O3, i.e., with 32-bit addressing, 8 bytes for the REAL type, and optimization level O3. We ran both versions of the code on 8, 16, 32, 64, and 128 processors, using 2 processors per node for all runs. The main computation of the SSI code has two parts: the data assimilation part assimilates observation data from nine input files, and the computation part performs the analysis for 200 iterations. Each code was run twice for each configuration, and the total run time is the average of the two runs.
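The timing procedure above (average the repeated runs, then compare the ESMF time against the baseline) can be sketched in a few lines of Python. This is a minimal illustration and not the scripts used for the study. The function names are hypothetical, and the per-run times in the second example are invented placeholders. Only the 855 s and 960 s figures come from the Milestone F numbers quoted in the Introduction.

```python
from statistics import mean


def average_run_time(run_times_s):
    """Average wall-clock time (seconds) over repeated runs of one configuration."""
    return mean(run_times_s)


def relative_difference_pct(baseline_s, esmf_s):
    """ESMF time minus baseline time, as a percentage of the baseline time."""
    return 100.0 * (esmf_s - baseline_s) / baseline_s


# Milestone F figures quoted in the Introduction (16 processors).
milestone_f_pct = relative_difference_pct(855.0, 960.0)
print(f"Milestone F overhead: {milestone_f_pct:.1f}%")  # prints 12.3%

# Hypothetical per-run times for one configuration, two runs per code version.
# These values are placeholders, NOT measurements from this report.
baseline_avg = average_run_time([500.0, 504.0])
esmf_avg = average_run_time([506.0, 508.0])
print(f"Example difference: {relative_difference_pct(baseline_avg, esmf_avg):.2f}%")
```

Taking the difference relative to the baseline time reproduces the roughly 12% figure from Milestone F, which suggests the baseline is the intended denominator.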
The total run times for the baseline SSI code and the ESMF SSI code are shown in Table 1 and Figure 1.
Table 1. The Total Run Time for the Baseline SSI and the ESMF SSI code
Figure 1. The Timing Results Comparison

The timing difference between the baseline code and the ESMF code is between 0.22% and 3.81%, which is within the range of measurement error. In other words, the timing difference between the two versions of the code is statistically insignificant. In the following, we analyze the ESMF overhead in the SSI ESMF code.

ESMF Code Analysis

We measured the total run time, the time for each data conversion, and the time to run the main routine in the ESMF code. Table 2 shows the overhead caused by the extra data conversions, the overhead caused by ESMF function calls, and the total overhead in the code. As the table shows, the ESMF overhead remains roughly constant regardless of the number of processors used. The data conversion overhead is about 2/3 of the total overhead, and the total overhead does not exceed 2% of the total run time for up to 128 processors.
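The breakdown reported in Table 2 can be sketched as follows. This is an illustration only. The function name and all input values are hypothetical placeholders chosen so that the conversion share comes out near the 2/3 stated above. The report does not say how the time spent in ESMF function calls was isolated, so the sketch simply takes it as an input.

```python
def overhead_breakdown(conversion_times_s, esmf_call_time_s, total_run_time_s):
    """Split the overhead into data-conversion and ESMF-call parts (seconds)
    and express the total as a percentage of the total run time."""
    conversion_s = sum(conversion_times_s)
    total_overhead_s = conversion_s + esmf_call_time_s
    return {
        "conversion_s": conversion_s,
        "esmf_s": esmf_call_time_s,
        "total_overhead_s": total_overhead_s,
        # Fraction of the overhead that is due to the extra data conversions.
        "conversion_share": conversion_s / total_overhead_s,
        # Total overhead relative to the whole run.
        "overhead_pct_of_run": 100.0 * total_overhead_s / total_run_time_s,
    }


# Hypothetical inputs: three data conversions, ESMF call time, total run time.
# These are placeholders, NOT the measurements behind Table 2.
result = overhead_breakdown([1.0, 0.5, 1.0], 1.25, 1500.0)
print(f"total overhead: {result['total_overhead_s']:.2f} s")
print(f"conversion share: {result['conversion_share']:.2f}")
print(f"overhead: {result['overhead_pct_of_run']:.2f}% of run time")
```

Because the overhead stays roughly constant in seconds while the total run time shrinks as processors are added, the percentage in the last field grows with processor count even though the absolute cost does not.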
Table 2. The Overhead in the ESMF SSI Code

Conclusion

The 12% overhead of the SSI/ESMF code reported in Milestone F was mainly caused by compiling it in 64-bit addressing mode instead of the 32-bit mode used by the baseline code. After recompiling the ESMF code with 32-bit addressing, the run times of the two versions are almost identical for 8, 16, 32, 64, and 128 processors on NCAR’s IBM cluster blackforest (see Table 1 and Figure 1). The total ESMF overhead, including the extra data conversions from the spectral fields into the Gaussian grid and vice versa, is about 4 seconds (increasing slightly with the number of processors), or 0.26% of the total run time for 8 processors to 1.82% for 128 processors (Table 2). The SSI ESMF code only uses the ESMF superstructure classes. It is not coupled with any other ESMF components, nor does it use any ESMF infrastructure layer services except for the ESMF Physical Grid class. Therefore, the overhead introduced by the additional ESMF layers for the SSI code should be minimal, and the timing analysis is consistent with this expectation. |
