L10 Performance Assessment: MOSART One-way Coupling


Summary

A model timing test was performed to compare MOSART against RTM, both running on the 0.5 degree global grid. MOSART supports three different decompositions, selected by the namelist option decomp_option. Compared to RTM, MOSART costs 3x-8x more, but it is still relatively inexpensive compared to other model components.


Performance Test 1

Performance Test 1: Testing of model timing

Date last modified: 

Contributors: Anthony Craig; Hongyi Li

Provenance: (Run provenance Link, Code Tag, etc:)

 

Results:

Model timing was tested on Constance, a Linux cluster at PNNL. The tests ran for 90 days with restarts turned off, used the Intel compiler, and had comp_run_barriers turned on to isolate component timing. They used I cases (IMCLM45 and ICLM45), with mosart running on the half degree global grid on 24 tasks. The rof run time (multiple runs shown) is:

- rtm: 0.06, 0.06 seconds/day

- mosart: 0.47, 0.52 seconds/day

Independent testing in CESM on Yellowstone suggests a difference in cost between rtm and mosart of 3x-5x.  This is expected given the difference in complexity between the rtm and mosart models.

Even with the 3x-8x increase in cost compared to rtm, mosart is still relatively inexpensive compared to other models in the system.  In the tests on Constance, clm was run at T31 resolution, which has about 56x fewer grid points than mosart at half degree, and the clm run time was 1.2 seconds/day (about 2.5x higher than mosart).  On a per grid point basis, mosart is therefore about 140x less expensive than clm45.
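The per grid point comparison can be checked with a few lines of arithmetic. This is a sketch only: the grid sizes (96 x 48 for T31, 720 x 360 for half degree) and the 0.5 seconds/day midpoint for mosart are assumptions made for illustration, and the function name is hypothetical.

```python
# Assumed grid sizes (not stated in the text above).
T31_POINTS = 96 * 48          # T31 Gaussian grid: 96 longitudes x 48 latitudes
HALF_DEG_POINTS = 720 * 360   # 0.5 degree global grid

CLM_SECONDS_PER_DAY = 1.2     # clm at T31, from the Constance tests
MOSART_SECONDS_PER_DAY = 0.5  # assumed midpoint of the 0.47 and 0.52 runs


def per_point_cost_ratio(cost_a, points_a, cost_b, points_b):
    """Return how many times more expensive model A is than model B per grid point."""
    return (cost_a / points_a) / (cost_b / points_b)


resolution_ratio = HALF_DEG_POINTS / T31_POINTS
runtime_ratio = CLM_SECONDS_PER_DAY / MOSART_SECONDS_PER_DAY
clm_vs_mosart = per_point_cost_ratio(
    CLM_SECONDS_PER_DAY, T31_POINTS, MOSART_SECONDS_PER_DAY, HALF_DEG_POINTS
)

print(f"grid point ratio: {resolution_ratio:.2f}")
print(f"run time ratio (clm / mosart): {runtime_ratio:.2f}")
print(f"per grid point cost ratio (clm / mosart): {clm_vs_mosart:.0f}")
```

With unrounded inputs the ratio comes out slightly below the rounded 56 x 2.5 = 140 estimate, which is consistent with the "about 140x" figure.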

Scaling tests were also done on Constance with the same setup: 90 days, no restarts, the Intel compiler, comp_run_barriers turned on, IMCLM45, and mosart on the half degree grid. The rof run time (multiple runs shown) versus pe count is:

 

pe counts    Run1 (s/day)    Run2 (s/day)
12           1.08            1.09
24           0.47            0.52
48           0.25            0.25
96           0.14            0.14
192          0.11            0.11
384          0.11
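The scaling numbers can be turned into speedup and parallel efficiency relative to the smallest run. The sketch below is illustrative only: it uses the Run1 column, takes the second row as the 24-task run reported earlier (0.47 s/day), and the function name is hypothetical.

```python
# Run1 rof run time in seconds/day, keyed by pe count.
RUN1_SECONDS_PER_DAY = {12: 1.08, 24: 0.47, 48: 0.25, 96: 0.14, 192: 0.11, 384: 0.11}


def scaling_summary(timings):
    """Return {pes: (speedup, efficiency)} relative to the smallest pe count."""
    base_pes = min(timings)
    base_time = timings[base_pes]
    summary = {}
    for pes in sorted(timings):
        speedup = base_time / timings[pes]
        # Ideal speedup equals the increase in pe count.
        efficiency = speedup / (pes / base_pes)
        summary[pes] = (speedup, efficiency)
    return summary


summary = scaling_summary(RUN1_SECONDS_PER_DAY)
for pes, (speedup, efficiency) in summary.items():
    print(f"{pes:4d} pes: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

Under these numbers the efficiency stays close to or above 1 through 96 pes and falls to roughly 0.3 at 384 pes, where the run time no longer improves.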

 

 

Additional information:

MOSART currently supports three different decompositions: basin, 1d, and roundrobin.  These are set by a namelist option, decomp_option.  The basin decomposition ensures there is no off-pe communication in downstream advection, the roundrobin decomposition generally provides good load balance, and the 1d decomposition is an alternative.  The cost of mosart varies by gridcell because the subcycling timestep depends on input parameters on a gridcell-by-gridcell basis.

In testing, roundrobin performed best because it provides the best internal load balance, and communication cost is generally only about 10% of the total cost at moderate pe counts at half degree resolution.  The roundrobin decomposition is the default.  All three decompositions produce bit-for-bit identical results.

Additional decompositions could be implemented that trade off load balance and communication cost further, for example a basin-driven decomposition that estimates cost on a gridcell-by-gridcell basis and allows basins to be subdivided to balance cost across pes while minimizing communication.  This would be most beneficial at higher pe counts, where the relative cost of communication is higher, and would almost certainly improve scaling there.
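The trade-off between load balance and communication can be illustrated with a toy river network. This is not the mosart implementation: the assignment rules, the greedy basin placement, the use of -1 to mark an outlet, and all names below are assumptions chosen only to show the idea.

```python
from collections import defaultdict


def decompose(n_cells, n_pes, option, basin_of=None):
    """Return a list giving the owning pe of each cell (toy versions of each option)."""
    if option == "1d":
        block = -(-n_cells // n_pes)  # ceiling division: contiguous blocks
        return [cell // block for cell in range(n_cells)]
    if option == "roundrobin":
        return [cell % n_pes for cell in range(n_cells)]
    if option == "basin":
        cells_in_basin = defaultdict(list)
        for cell, basin in enumerate(basin_of):
            cells_in_basin[basin].append(cell)
        owner = [0] * n_cells
        pe_cells = [0] * n_pes
        # Assumed rule: place whole basins, largest first, on the emptiest pe.
        for basin in sorted(cells_in_basin, key=lambda b: -len(cells_in_basin[b])):
            pe = pe_cells.index(min(pe_cells))
            for cell in cells_in_basin[basin]:
                owner[cell] = pe
            pe_cells[pe] += len(cells_in_basin[basin])
        return owner
    raise ValueError(f"unknown option: {option}")


def imbalance(owner, cost, n_pes):
    """Maximum pe load divided by mean pe load (1.0 is perfectly balanced)."""
    load = [0.0] * n_pes
    for cell, pe in enumerate(owner):
        load[pe] += cost[cell]
    return max(load) / (sum(load) / n_pes)


def off_pe_links(owner, downstream):
    """Count downstream links whose two ends live on different pes."""
    return sum(1 for c, d in enumerate(downstream) if d >= 0 and owner[c] != owner[d])


# Toy network: basin 0 is the chain 0->1->2->3->4->5, basin 1 is 6->7.
downstream = [1, 2, 3, 4, 5, -1, 7, -1]
basin_of = [0, 0, 0, 0, 0, 0, 1, 1]
cost = [1, 1, 1, 1, 4, 4, 1, 1]  # per-cell cost varies, as with subcycling

results = {}
for option in ("basin", "1d", "roundrobin"):
    owner = decompose(len(cost), 2, option, basin_of)
    results[option] = (imbalance(owner, cost, 2), off_pe_links(owner, downstream))
    print(option, results[option])
```

In this toy case the basin option has no off-pe links but the worst load balance, while roundrobin is perfectly balanced but every link crosses pes, which mirrors the trade-off described above.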

The downstream communication of water in mosart has been implemented using an mct sparse matrix multiply.  To ensure bit-for-bit results, the multiply is performed with an "Xonly" strategy, which eliminates partial sums.  This option is controllable via a namelist variable, smat_option.
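Partial sums matter for bit-for-bit reproducibility because floating-point addition is not associative. The sketch below does not use mct; it only shows, with hypothetical function names and made-up values, how summing per-pe partial results can change the answer when the decomposition changes, while a single ordered sum does not.

```python
def sum_in_order(values):
    """Sum all contributions in one place, in a fixed order."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_with_partial_sums(values, owner):
    """Form one partial sum per pe, then add the partial sums together."""
    n_pes = max(owner) + 1
    partial = [0.0] * n_pes
    for value, pe in zip(values, owner):
        partial[pe] += value
    return sum_in_order(partial)


# Three upstream cells contributing water to the same downstream cell.
contributions = [0.1, 0.2, 0.3]

# Same contributions, two different assignments of cells to pes.
result_a = sum_with_partial_sums(contributions, owner=[0, 1, 1])
result_b = sum_with_partial_sums(contributions, owner=[0, 0, 1])

print(repr(result_a), repr(result_b), result_a == result_b)
print(repr(sum_in_order(contributions)))  # independent of any decomposition
```

The two partial-sum results differ in the last bit even though the inputs are identical, which is the kind of decomposition dependence that avoiding partial sums removes.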

Tests were performed with settings of "Xonly", "Yonly", and "opt" (which is "XandY"). Performance varied minimally, with "Xonly" performing best.  There are actually two sparse matrix multiplies in mosart. The first is for downstream advection, and the second is for transport of water directly to outlet points.  Both are derived at initialization from the "downstream index" field of the mosart input file.  Water that is passed into mosart can either be transported directly to the outlet points immediately or be transported downstream incrementally via the mosart solver.
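Both mappings can be derived from a single downstream index array. The following is a minimal sketch, not the mosart or mct code: it assumes each cell stores the index of its downstream neighbour, that -1 marks an outlet, that outlets keep their own water, and that every link has weight 1. All names are hypothetical.

```python
def downstream_links(dnstream):
    """One-step map: each cell sends to its downstream cell (outlets keep their water)."""
    return [(d if d >= 0 else c, c) for c, d in enumerate(dnstream)]


def outlet_links(dnstream):
    """Direct map: each cell sends straight to the outlet of its basin."""
    links = []
    for cell in range(len(dnstream)):
        dest = cell
        steps = 0
        while dnstream[dest] >= 0:
            dest = dnstream[dest]
            steps += 1
            if steps > len(dnstream):
                raise ValueError("cycle in downstream index")
        links.append((dest, cell))
    return links


def apply_links(links, water):
    """Sparse matrix multiply with unit weights: y[dest] += x[src]."""
    result = [0.0] * len(water)
    for dest, src in links:
        result[dest] += water[src]
    return result


# Toy basin: 0 -> 1 -> 2 (outlet), and 3 -> 2.
dnstream = [1, 2, -1, 2]
water = [1.0, 2.0, 4.0, 8.0]

one_step = apply_links(downstream_links(dnstream), water)
direct = apply_links(outlet_links(dnstream), water)
print(one_step)  # water moved one cell downstream
print(direct)    # all water delivered to the outlet
```

Applying the one-step map repeatedly moves water toward the outlet incrementally, while the direct map delivers it in a single multiply; total water is conserved in both cases.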