To see how close physgrid gets to the ideal speedup of 9/4 in the
physics computations (9 unique GLL columns per element with np4 vs. 4
physics columns per element with pg2), I did a 1-month run with
-pecount S on cori-knl, which gives a reasonable number of columns per
core. The relevant high-level timers are as follows:
ne30np4
"CPL:RUN_LOOP" 693 693 1.031184e+06 2.330065e+06 3365.724 ( 296 0) 3361.610 ( 544 0)
"CPL:OCNT_RUN" 693 693 1.030491e+06 6.395224e+02 2.122 ( 0 0) 0.901 ( 657 0)
"CPL:ICE_RUN" 693 693 1.031184e+06 5.078033e+03 9.215 ( 319 0) 5.647 ( 666 0)
"CPL:LND_RUN" 693 693 1.031184e+06 2.091669e+04 34.030 ( 282 0) 27.238 ( 669 0)
"CPL:ATM_RUN" 693 693 1.031184e+06 1.965188e+06 3226.064 ( 296 0) 2668.105 ( 420 0)
"a:CAM_run1" 693 693 1.031877e+06 1.307641e+06 2267.392 ( 296 0) 1713.941 ( 377 0)
"a:CAM_run2" 693 693 1.031877e+06 2.497877e+05 382.013 ( 296 0) 349.990 ( 151 0)
"a:CAM_run3" 693 693 1.031877e+06 3.823804e+05 567.488 ( 372 0) 526.157 ( 182 0)
"a:CAM_run4" 693 693 1.031877e+06 2.388912e+04 36.367 ( 0 0) 34.452 ( 401 0)
"a:UniquePoints" 693 693 1.031877e+06 2.399564e+03 4.179 ( 296 0) 2.732 ( 562 0)
"a:putUniquePoints" 693 693 1.031877e+06 5.007935e+03 8.052 ( 296 0) 6.124 ( 562 0)
ne30pg2
"CPL:RUN_LOOP" 693 693 1.031184e+06 1.197055e+06 1727.589 ( 145 0) 1727.089 ( 532 0)
"CPL:OCNT_RUN" 693 693 1.030491e+06 6.345620e+02 2.469 ( 0 0) 0.880 ( 518 0)
"CPL:ICE_RUN" 693 693 1.031184e+06 4.448382e+03 7.500 ( 523 0) 4.345 ( 648 0)
"CPL:LND_RUN" 693 693 1.031184e+06 1.419479e+04 23.461 ( 0 0) 18.509 ( 585 0)
"CPL:ATM_RUN" 693 693 1.031184e+06 1.119652e+06 1649.384 ( 26 0) 1541.688 ( 448 0)
"a:CAM_run1" 693 693 1.031877e+06 5.988779e+05 901.520 ( 692 0) 787.083 ( 370 0)
"a:CAM_run2" 693 693 1.031877e+06 1.315582e+05 193.712 ( 396 0) 185.811 ( 676 0)
"a:CAM_run3" 693 693 1.031877e+06 3.717056e+05 543.277 ( 369 0) 528.668 ( 545 0)
"a:CAM_run4" 693 693 1.031877e+06 1.642631e+04 25.760 ( 0 0) 23.682 ( 644 0)
"a:dyn_to_fv_phys" 693 693 1.031877e+06 8.753453e+03 12.869 ( 396 0) 12.479 ( 654 0)
"a:fv_phys_to_dyn" 693 693 1.031877e+06 2.451420e+04 40.740 ( 73 0) 32.149 ( 640 0)
(The timer sum is the fourth numeric column, e.g. 2.330065e+06 for CPL:RUN_LOOP in ne30np4.)
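For anyone who wants to pull the timer sum out of lines like these programmatically, here is a minimal sketch. The column meanings in the comment (processes, threads, count, sum, max, min) are my reading of the output, not something stated above, and `parse_timer_line` is a hypothetical helper name.

```python
import shlex


def parse_timer_line(line):
    """Split one timer line into its name and numeric fields.

    Assumed column order (an interpretation, not confirmed by the source):
    name, processes, threads, count, sum, max (proc thread), min (proc thread).
    """
    # Drop the parentheses so the (proc thread) pairs become plain tokens;
    # shlex.split also strips the double quotes around the timer name.
    fields = shlex.split(line.replace("(", " ").replace(")", " "))
    return {
        "name": fields[0],
        "processes": int(fields[1]),
        "threads": int(fields[2]),
        "count": float(fields[3]),
        "sum": float(fields[4]),   # the "timer sum" column
        "max": float(fields[5]),   # fields[6:8] are the (proc thread) of the max
        "min": float(fields[8]),   # fields[9:11] are the (proc thread) of the min
    }


line = '"a:CAM_run1" 693 693 1.031877e+06 1.307641e+06 2267.392 ( 296 0) 1713.941 ( 377 0)'
record = parse_timer_line(line)
print(record["name"], record["sum"])
```

Splitting on whitespace after removing the parentheses keeps the sketch independent of exact column widths, which vary between timing files.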
The speedups based on the timer sum column are as follows:
ideal speedup: (/ 9.0 4.0) 2.25
run1, before coupler: (/ 1.307641e+06 5.988779e+05) 2.1834851478072577
run2, after coupler: (/ 2.497877e+05 1.315582e+05) 1.8986859047934677
Thus run1 reaches about 97% of the ideal speedup (2.18 of 2.25), leaving little room for improvement, while run2 reaches about 84% (1.90 of 2.25), leaving a bit more.
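The same arithmetic, written as a small script so it can be rerun with other timer values. This is only a sketch: the numbers are copied from the timer sum column above, and the names `IDEAL`, `timer_sum`, and `speedup` are made up for illustration.

```python
# Ideal speedup: 9 columns per element (np4) vs. 4 per element (pg2).
IDEAL = 9.0 / 4.0

# Timer sums copied from the tables above: (ne30np4, ne30pg2).
timer_sum = {
    "run1": (1.307641e+06, 5.988779e+05),  # a:CAM_run1, before coupler
    "run2": (2.497877e+05, 1.315582e+05),  # a:CAM_run2, after coupler
}


def speedup(np4, pg2):
    """Ratio of the np4 timer sum to the pg2 timer sum."""
    return np4 / pg2


for name, (np4, pg2) in timer_sum.items():
    s = speedup(np4, pg2)
    print(f"{name}: speedup {s:.3f}, {100 * s / IDEAL:.1f}% of ideal")
```

With the values above this reports roughly 97% of ideal for run1 and roughly 84% for run2.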
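The fv_phys timers (a:dyn_to_fv_phys plus a:fv_phys_to_dyn, about 3.3e4 in the timer sum column) vs the UniquePoints timers (a:UniquePoints plus a:putUniquePoints, about 7.4e3) show the cost of high-order remap.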
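To put a number on the remap cost, one can compare the two pairs of timer sums directly. The sketch below does that. Setting the result against the ne30pg2 CPL:ATM_RUN sum is my own choice of reference, not something the comparison above specifies, and the variable names are illustrative.

```python
# Timer sums copied from the tables above.
pg2_remap = 8.753453e+03 + 2.451420e+04  # a:dyn_to_fv_phys + a:fv_phys_to_dyn
np4_copy = 2.399564e+03 + 5.007935e+03   # a:UniquePoints + a:putUniquePoints
pg2_atm = 1.119652e+06                   # CPL:ATM_RUN, ne30pg2

# How much more the pg2 dynamics/physics data transfer costs than the np4 one.
ratio = pg2_remap / np4_copy

# Share of the ne30pg2 atmosphere run time spent in the remap
# (assumed reference: the CPL:ATM_RUN timer sum).
share = pg2_remap / pg2_atm

print(f"remap vs. UniquePoints: {ratio:.2f}x")
print(f"remap share of ATM_RUN: {100 * share:.1f}%")
```

By this measure the remap costs about 4.5 times as much as the np4 point copies, but it is only about 3% of the ne30pg2 atmosphere time.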