W16 Physics Grid Performance Phase 1

This page should describe the Performance Assessment Tests performed for this standalone feature and should provide links to all the result pages.

Summary

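Two 1-month runs on Cori-KNL compare ne30np4 with ne30pg2. With -pecount S, the physics timer sums speed up by 2.18x (CAM_run1) and 1.90x (CAM_run2) against an ideal of 2.25x. With the default -pecount, CAM_run1 speeds up by 1.68x by timer sum, and the RUN_LOOP max time improves by 1.43x.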


Performance Test 1

To see how close physgrid gets to the ideal speedup of 9/4 in the
physics computations (np4 has on average 9 unique physics columns per
element, pg2 has 4), I did a 1-month run with -pecount S on Cori-KNL,
which gives a reasonable number of columns per core. The relevant
high-level timers are as follows:
 
name              procs  threads   call-count            sum       max (  proc   thrd)       min (  proc   thrd)

ne30np4
"CPL:RUN_LOOP"      693      693 1.031184e+06   2.330065e+06  3365.724 (   296      0)  3361.610 (   544      0)
"CPL:OCNT_RUN"      693      693 1.030491e+06   6.395224e+02     2.122 (     0      0)     0.901 (   657      0)
"CPL:ICE_RUN"       693      693 1.031184e+06   5.078033e+03     9.215 (   319      0)     5.647 (   666      0)
"CPL:LND_RUN"       693      693 1.031184e+06   2.091669e+04    34.030 (   282      0)    27.238 (   669      0)
"CPL:ATM_RUN"       693      693 1.031184e+06   1.965188e+06  3226.064 (   296      0)  2668.105 (   420      0)
"a:CAM_run1"        693      693 1.031877e+06   1.307641e+06  2267.392 (   296      0)  1713.941 (   377      0)
"a:CAM_run2"        693      693 1.031877e+06   2.497877e+05   382.013 (   296      0)   349.990 (   151      0)
"a:CAM_run3"        693      693 1.031877e+06   3.823804e+05   567.488 (   372      0)   526.157 (   182      0)
"a:CAM_run4"        693      693 1.031877e+06   2.388912e+04    36.367 (     0      0)    34.452 (   401      0)
"a:UniquePoints"    693      693 1.031877e+06   2.399564e+03     4.179 (   296      0)     2.732 (   562      0)
"a:putUniquePoints" 693      693 1.031877e+06   5.007935e+03     8.052 (   296      0)     6.124 (   562      0)
 
ne30pg2
"CPL:RUN_LOOP"      693      693 1.031184e+06   1.197055e+06  1727.589 (   145      0)  1727.089 (   532      0)
"CPL:OCNT_RUN"      693      693 1.030491e+06   6.345620e+02     2.469 (     0      0)     0.880 (   518      0)
"CPL:ICE_RUN"       693      693 1.031184e+06   4.448382e+03     7.500 (   523      0)     4.345 (   648      0)
"CPL:LND_RUN"       693      693 1.031184e+06   1.419479e+04    23.461 (     0      0)    18.509 (   585      0)
"CPL:ATM_RUN"       693      693 1.031184e+06   1.119652e+06  1649.384 (    26      0)  1541.688 (   448      0)
"a:CAM_run1"        693      693 1.031877e+06   5.988779e+05   901.520 (   692      0)   787.083 (   370      0)
"a:CAM_run2"        693      693 1.031877e+06   1.315582e+05   193.712 (   396      0)   185.811 (   676      0)
"a:CAM_run3"        693      693 1.031877e+06   3.717056e+05   543.277 (   369      0)   528.668 (   545      0)
"a:CAM_run4"        693      693 1.031877e+06   1.642631e+04    25.760 (     0      0)    23.682 (   644      0)
"a:dyn_to_fv_phys"  693      693 1.031877e+06   8.753453e+03    12.869 (   396      0)    12.479 (   654      0)
"a:fv_phys_to_dyn"  693      693 1.031877e+06   2.451420e+04    40.740 (    73      0)    32.149 (   640      0)
                                                ^ timer sum
 
The speedups based on the timer sum column are as follows:
    ideal speedup: (/ 9.0 4.0) 2.25
    run1, before coupler: (/ 1.307641e+06 5.988779e+05) 2.1834851478072577
    run2, after  coupler: (/ 2.497877e+05 1.315582e+05) 1.8986859047934677
Thus run1 reaches about 97% of the ideal speedup and run2 about 84%,
so there's a little room for improvement in run2, but not much in run1.
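The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the timer sums are copied from the two tables for the -pecount S run, and the helper name speedup_report is made up for illustration.

```python
# Timer sums copied from the -pecount S tables (timer sum column).
IDEAL = 9.0 / 4.0  # ideal physics speedup quoted in the text

np4 = {"a:CAM_run1": 1.307641e+06, "a:CAM_run2": 2.497877e+05}
pg2 = {"a:CAM_run1": 5.988779e+05, "a:CAM_run2": 1.315582e+05}


def speedup_report(before, after, ideal=IDEAL):
    """Return {timer: (speedup, fraction_of_ideal)} for timers in both runs.

    Hypothetical helper, not part of any model tooling.
    """
    report = {}
    for name, t_before in before.items():
        if name in after:
            s = t_before / after[name]
            report[name] = (s, s / ideal)
    return report


report = speedup_report(np4, pg2)
for name, (s, frac) in report.items():
    print(f"{name}: speedup {s:.3f}, {100.0 * frac:.1f}% of ideal")
```

Expressing each speedup as a fraction of the ideal makes it easier to compare timers than reading the raw ratios.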
 
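Comparing the fv_phys timers in the pg2 run (dyn_to_fv_phys,
fv_phys_to_dyn) with the UniquePoints timers in the np4 run
(UniquePoints, putUniquePoints) shows the cost of the high-order remap.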
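As a rough way to quantify that comparison, the sketch below adds up the two timers on each side and relates the pg2 remap total to the pg2 CPL:ATM_RUN sum. The values are copied from the tables above. Pairing the timers this way is an assumption based on the sentence above, and the variable names are illustrative.

```python
# Timer sums copied from the -pecount S tables.
np4_unique_points = {
    "a:UniquePoints": 2.399564e+03,
    "a:putUniquePoints": 5.007935e+03,
}
pg2_remap = {
    "a:dyn_to_fv_phys": 8.753453e+03,
    "a:fv_phys_to_dyn": 2.451420e+04,
}
pg2_atm_run = 1.119652e+06  # "CPL:ATM_RUN" sum for ne30pg2

unique_points_total = sum(np4_unique_points.values())
remap_total = sum(pg2_remap.values())

# How much more the pg2 remap costs than the np4 UniquePoints calls.
remap_vs_unique_points = remap_total / unique_points_total
# Share of the pg2 atmosphere run time spent in the remap.
remap_share_of_atm = remap_total / pg2_atm_run

print(f"remap / UniquePoints: {remap_vs_unique_points:.2f}")
print(f"remap share of ATM_RUN: {100.0 * remap_share_of_atm:.1f}%")
```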

Performance Test 2

Default -pecount on Cori-KNL, 1-month run.

          name        ranks   call-count            sum       max       min

  ne30np4
    CPL:RUN_LOOP       1350 2.008800e+06   2.241619e+06  1660.485  1660.455
    CPL:OCNT_RUN       1200 1.784400e+06   1.201508e+03     4.542     0.968
    CPL:ICE_RUN        1200 1.785600e+06   1.789948e+04    18.119     9.421
    CPL:LND_RUN        1350 2.008800e+06   3.711972e+04    32.732    24.977
    CPL:ATM_RUN        1350 2.008800e+06   2.046370e+06  1576.281  1496.246
    a:CAM_run1         1350 2.010150e+06   1.299927e+06  1016.440   940.340
    a:CAM_run2         1350 2.010150e+06   2.450479e+05   186.443   177.676
    a:CAM_run3         1350 2.010150e+06   4.454110e+05   339.305   322.815
    a:CAM_run4         1350 2.010150e+06   5.284136e+04    40.104    39.119
    a:UniquePoints     1350 2.010150e+06   2.426363e+03     1.912     1.464
    a:putUniquePoints  1350 2.010150e+06   4.627142e+03     3.993     2.843

  ne30pg2
    CPL:RUN_LOOP       1350 2.008800e+06   1.571943e+06  1164.625  1164.173
    CPL:OCNT_RUN       1200 1.784400e+06   1.139701e+03     1.945     0.919
    CPL:ICE_RUN        1200 1.785600e+06   7.574052e+03     7.080     5.905
    CPL:LND_RUN        1350 2.008800e+06   2.970938e+04    24.901    19.429
    CPL:ATM_RUN        1350 2.008800e+06   1.442209e+06  1083.946  1049.537
    a:CAM_run1         1350 2.010150e+06   7.752476e+05   593.253   552.322
    a:CAM_run2         1350 2.010150e+06   1.689466e+05   127.287   122.445
    a:CAM_run3         1350 2.010150e+06   4.514902e+05   340.133   328.524
    a:CAM_run4         1350 2.010150e+06   4.366816e+04    33.281    32.323
    a:dyn_to_fv_phys   1350 2.010150e+06   1.895656e+04    14.433    13.922
    a:fv_phys_to_dyn   1350 2.010150e+06   2.259800e+04    17.194    16.566

  RUN_LOOP timer max: (/ 1660.5 1165) 1.4253218884120171 speedup
  run1 timer sum: (/ 1.299927e+06 7.752476e+05) 1.6767894541047275 speedup
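The same ratios can be computed for any timer by parsing the table rows directly. The sketch below assumes the six-column layout given by the header of this test (name, ranks, call-count, sum, max, min) and includes only a few rows copied from the tables. The function names parse_rows and speedup are hypothetical.

```python
# A few rows copied from the default -pecount tables above.
NP4_ROWS = """\
CPL:RUN_LOOP       1350 2.008800e+06   2.241619e+06  1660.485  1660.455
a:CAM_run1         1350 2.010150e+06   1.299927e+06  1016.440   940.340
a:CAM_run2         1350 2.010150e+06   2.450479e+05   186.443   177.676
"""

PG2_ROWS = """\
CPL:RUN_LOOP       1350 2.008800e+06   1.571943e+06  1164.625  1164.173
a:CAM_run1         1350 2.010150e+06   7.752476e+05   593.253   552.322
a:CAM_run2         1350 2.010150e+06   1.689466e+05   127.287   122.445
"""


def parse_rows(text):
    """Parse 'name ranks call-count sum max min' rows into a dict by timer name."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank lines and anything not in the 6-column layout
        name = fields[0]
        call_count, total, t_max, t_min = (float(x) for x in fields[2:])
        table[name] = {
            "ranks": int(fields[1]),
            "call_count": call_count,
            "sum": total,
            "max": t_max,
            "min": t_min,
        }
    return table


def speedup(before, after, timer, column):
    """Ratio of one timer column between two parsed tables."""
    return before[timer][column] / after[timer][column]


np4 = parse_rows(NP4_ROWS)
pg2 = parse_rows(PG2_ROWS)

print(f"RUN_LOOP max: {speedup(np4, pg2, 'CPL:RUN_LOOP', 'max'):.3f}")
print(f"run1 sum:     {speedup(np4, pg2, 'a:CAM_run1', 'sum'):.3f}")
print(f"run2 sum:     {speedup(np4, pg2, 'a:CAM_run2', 'sum'):.3f}")
```

The RUN_LOOP ratio computed here uses the unrounded max values from the tables, so it differs slightly in the fourth digit from the figure quoted above, which uses rounded inputs.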