Summary of Budget Analysis in CESM/ACME

The analysis focuses on water and heat.  There is no analysis yet available for
energy/momentum.


Validating the coupled system budget is tricky for several reasons:

- Models are coupling fluxes but we are trying to conserve quantities.  The appropriate 
  areas have to be applied to any budget analysis.  In addition, those areas are critical 
  in the mapping and merging operations.  Those areas are also associated with both
  static fractions (like land cover on the atm grid or land grid) and time-varying
  fractions (like sea ice cover on the ocean and atm grids or ocean cover on the
  atm grid).

- The budget analysis ultimately comes down to showing that the system conserves.  In
  other words, the sum of the budget should be "zero".  But the individual terms that
  make up that budget are generally fairly large and of varying signs, so the sum is a
  small residual of large numbers.  A 1% error in the computation of any one budget term
  in the diagnostics can therefore have a huge impact on the overall budget.

- Ultimately, the budget analysis over a coupling period, over a day, over a month, 
  over a season, over a year, and over a multi-year period will be different.  There
  should be diurnal, seasonal, and interannual variability in the system that can
  be diagnosed if the budgets are computed correctly.  Care must also be taken when
  looking at a budget for any given time period to not read "too much" into that budget.

- Because the model is discretized in time and coupling periods vary between components, 
  there are inherent system lags and averaging.  In other words, it will not generally be 
  possible to balance the budget exactly over a coupling period.  The error introduced
  from the lags should decrease from a budget analysis perspective as the budget averaging
  period increases.  But the implication is that there is no way to ever get "zero", and
  as a developer, there is always a question of whether the budget diagnostics represent
  the system accurately or whether there is still a bug in the budget diagnostics.  It can
  be very difficult to tell the difference.

- There are multiple areas in the system.  Each component model has its own area either
  explicitly or implicitly specified by the internal numerical discretization.  There
  are also areas associated with the conservative mapping weights.  Those weights are
  only as conservative as the area computation associated with the weights generation.
  The coupler uses the mapping areas as its internal area values because those are
  generally computed using the same algorithm in an offline setting.  The sum of the
  mapping areas should be very close to the surface area of the earth.  In CESM, the fluxes 
  are area corrected as they are sent to/from the coupler.  In other words, the fluxes are 
  multiplied by the ratio of the model and coupler areas to convert from model areas
  to coupler areas.  While this might seem non-conservative, it is actually required
  for conservation.  This approach conserves quantity between the model and coupler
  for gridcell areas that do not agree exactly between the coupler and model.  It allows
  for differences in areas in different models inherent in different model numerics.

- Fluxes must be mapped conservatively in the coupler and the mapping weights must be
  conservative.  These are generated offline and must be checked carefully for 
  conservation.  These will never be exactly conservative, unfortunately, due to 
  small errors in the computation of the areas and the overlaps and other reasons.

- It's not clear to how many digits the model should conserve over long periods.  But it
  should be at least several (5 or more?).

- Budgets are most valuable in a fully coupled system.  In any configuration where a 
  model is driven by data models, the overall budget is meaningless.  However, in that case, 
  the net fluxes into and out of an active component should be reasonable.

- All coupling fields in the budget MUST BE accounted for correctly.  As noted above, any
  small error in the diagnostics, any missing field, any incorrect sign, any units problem,
  any error in the applied area and fraction will wreak havoc on the overall budget.

- The sign convention in the budget analysis is DIFFERENT than the sign convention
  in the coupler.  The sign convention in the coupler has been positive downward
  where the models are considered in the following order, top to bottom; atm, lnd, 
  rof, ice, ocn.  The sign convention for the budgets is positive for that component
  to gain heat/water and negative for that component to lose heat/water.  So the
  signs must be considered carefully in the analysis.

- The state of the model is critical in understanding the budget diagnostics.  The budget
  of a model that is spinning up is going to be very, very different than a model that is
  stable.  Until models spin up, they will generally be sources or sinks of heat or water
  in the overall system from the coupler perspective and the budgets should show that.
  That net exchange of heat and water might be very different in the system once it's
  spun up.  Ultimately, the net flux of heat and water in the coupler has to be consistent
  with the change in heat and water in all components.  If a component is a big source
  or sink (over some medium term, like years, decades or centuries, depending on how the
  spin up goes), that is going to be reflected in the budget.

- In a spun up and stable system, there should be no long term net heat or water flux from
  any particular component.  In practice, this will not be the case in a complex climate modeling
  system because there are always some lingering long term trends, but in theory, that 
  defines a stable spun up state.
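
The area correction described in the list above can be illustrated with a small
sketch.  The areas and flux are made-up numbers and the function names are hypothetical
(they are not coupler routines).  The only point is that scaling a flux by the ratio of
the model area to the coupler area leaves flux*area, the conserved quantity, unchanged.

```python
# Illustrative numbers only; the function names are hypothetical, not coupler routines.

def model_to_coupler(flux_model, area_model, area_coupler):
    """Scale a flux leaving a model so that flux * area is unchanged."""
    return flux_model * area_model / area_coupler


def coupler_to_model(flux_coupler, area_model, area_coupler):
    """Inverse scaling for a flux the coupler sends to a model."""
    return flux_coupler * area_coupler / area_model


area_model = 1.0000e10    # m2, gridcell area implied by the model numerics
area_coupler = 1.0003e10  # m2, gridcell area from the mapping file
flux_model = 250.0        # W/m2

flux_coupler = model_to_coupler(flux_model, area_model, area_coupler)

# The flux changes slightly, but the quantity exchanged (W) does not.
power_model = flux_model * area_model
power_coupler = flux_coupler * area_coupler
print(flux_model, flux_coupler)
print(power_model, power_coupler)
```

Comparing the two fluxes directly would suggest a small loss; comparing the two
flux*area products shows that nothing was lost.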


There are really three parts of the budget that need to be understood.

- First, at the model level.  Each model should basically be consistent with
     D(Heat) /dt = sum(external_forcing)
     D(Water)/dt = sum(external_forcing)
  That means the model is conserving.  Most models will have some internal lags
  that can make this analysis difficult.  But ultimately, any internal lags,
  sources, or sinks should be understood well enough that every term in
     D(Heat) /dt = sum(external_forcing) + sinks + sources + lags
     D(Water)/dt = sum(external_forcing) + sinks + sources + lags
  should be identified, the equation should balance, sinks and sources should be
  zero (hopefully), and lags should be much smaller than the other terms over long
  time scales for every individual component.

- Second, within the coupler.  The coupler can only diagnose the net heat and
  water flux between components.  This can be broken down in various terms and
  by component in a number of different ways.  But ultimately, the coupler should
  also be conserving heat and water with only lags creating an inability to balance
  exactly.  The coupler can diagnose the net heat and fresh water flux into any
  component.

- Third, between the coupler and the model component.  While this might seem trivial
  and should be trivial, it's worth checking that the net flux of heat or water into
  or out of any component as diagnosed by the coupler agrees with the sum(external_forcing) 
  in every model.  That verifies that the coupling between the coupler and the model is 
  working correctly.  For instance, the first and second parts above can be checked and 
  everything can look fine even in a case where a model is reading external forcing 
  from a file and not receiving it from the coupler.  This third step closes the budget 
  and ensures the coupler and model are interacting correctly.  Care must be taken to
  assess the quantity of heat and water, not the flux.  And the appropriate areas must
  be applied on the coupler side and the model side (these might be different areas).
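
As a rough sketch of the third check, the comparison between coupler and model can be
written in terms of integrated quantities.  All values and names below are invented for
illustration.  The sketch assumes the coupler sees area-corrected fluxes on its own
areas, and it also shows what happens when the wrong areas are applied.

```python
# Illustrative only: gridcell values and helper names are invented for this sketch.

def integrated_quantity(fluxes, areas, dt):
    """Sum flux * area over gridcells, times dt: W/m2 * m2 * s = J."""
    return sum(f * a for f, a in zip(fluxes, areas)) * dt


def relative_mismatch(a, b):
    """Relative difference between two quantities, zero if both are zero."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale


dt = 1800.0                                  # s, one coupling period
model_areas = [1.00e10, 2.00e10, 1.50e10]    # m2
coupler_areas = [1.01e10, 1.98e10, 1.50e10]  # m2
model_fluxes = [12.0, -3.5, 7.25]            # W/m2, as computed inside the model

# Fluxes as seen by the coupler after area correction.
coupler_fluxes = [f * am / ac
                  for f, am, ac in zip(model_fluxes, model_areas, coupler_areas)]

heat_model = integrated_quantity(model_fluxes, model_areas, dt)
heat_coupler = integrated_quantity(coupler_fluxes, coupler_areas, dt)

# Wrong: model fluxes combined with coupler areas.
heat_wrong = integrated_quantity(model_fluxes, coupler_areas, dt)

print(relative_mismatch(heat_model, heat_coupler))  # round-off only
print(relative_mismatch(heat_model, heat_wrong))    # about one percent
```

With these made-up areas, mixing the model fluxes with the coupler areas produces an
error of about one percent, which is more than enough to swamp a budget residual.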


The coupler can only address the second part of the budget above.  That coupler budget 
analysis is carried out in 

  cime/driver_cpl/driver/seq_diag_mct.F90

in subroutines

  public seq_diag_atm_mct
  public seq_diag_lnd_mct
  public seq_diag_rof_mct
  public seq_diag_glc_mct
  public seq_diag_ocn_mct
  public seq_diag_ice_mct

There is a different subroutine to analyze the terms from each component.  In practice,
this is done to separate the analysis for each component to make each clearer.  It also
allows each diagnostic subroutine to be called at a different part of the run sequence,
both to support extra concurrency for performance and to allow each diagnostic
subroutine to be called at the correct place in the run sequence where the coupling
fields and dynamically varying fractions have consistent values for application.  The
ice fraction is updated at one particular location in the run sequence and which diagnostics
are called before or after this fraction update is important.  Also, in many of the interfaces,
there is a "do_" logical flag that allows the interface to be called multiple times in
the run sequence in order to compute different terms at different places in the model.

In each seq_diag_* interface, the gridcell areas and fractions are available and all fields
are multiplied by the appropriate area and fraction.  Also, the first time in each
subroutine, field index values are computed and stored to save time for character string
look-ups in subsequent calls.  As these subroutines are called, local data (by MPI task)
is summed into a local array called budg_dataL.  All data is stored locally until it's
written either to the log file or to the restart file.
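
A minimal sketch of the local accumulation, written in Python rather than Fortran for
brevity.  The index names, array sizes, and the accumulate helper are hypothetical and
much smaller than the real ones.  It shows each field being multiplied by area and
fraction, and the budget sign convention (positive means the component gains) being
applied to a field that is positive downward in the coupler.

```python
# Hypothetical index names and sizes; the real module defines its own.
F_AREA, F_HNETSW = 0, 1        # field index
C_ATM, C_OCN = 0, 1            # component index
P_INST = 0                     # period index
F_SIZE, C_SIZE, P_SIZE = 2, 2, 1


def new_budget():
    """Zero-filled local budget array indexed as [field][component][period]."""
    return [[[0.0] * P_SIZE for _ in range(C_SIZE)] for _ in range(F_SIZE)]


def accumulate(budg, field, comp, period, values, areas, fracs, sign):
    """Add sign * sum(value * area * frac) over this task's gridcells."""
    total = sum(v * a * f for v, a, f in zip(values, areas, fracs))
    budg[field][comp][period] += sign * total


areas = [1.0, 2.0]         # gridcell areas on this task
ofrac = [0.5, 1.0]         # ocean fraction of each gridcell
swnet = [100.0, 200.0]     # net shortwave, positive downward (coupler convention)
ones = [1.0, 1.0]

budg_dataL = new_budget()
# Budget convention: the ocean gains (+), the atmosphere loses (-).
accumulate(budg_dataL, F_HNETSW, C_OCN, P_INST, swnet, areas, ofrac, +1.0)
accumulate(budg_dataL, F_HNETSW, C_ATM, P_INST, swnet, areas, ofrac, -1.0)
# The area "field" is accumulated the same way, with a value of one per gridcell.
accumulate(budg_dataL, F_AREA, C_OCN, P_INST, ones, areas, ofrac, +1.0)
accumulate(budg_dataL, F_AREA, C_ATM, P_INST, ones, areas, ofrac, -1.0)

print(budg_dataL)
```

Because the same area and fraction are applied on both sides of the exchange, the atm
and ocn entries for each field cancel exactly in this sketch.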

While the diagnostics are broken up into several pieces by component, the budget really is 
a comprehensive set of diagnostics defined by the arrays

   real(r8),public :: budg_dataL(f_size,c_size,p_size) ! local sum, valid on all pes
   real(r8),public :: budg_dataG(f_size,c_size,p_size) ! global sum, valid only on root pe
   real(r8),public :: budg_ns   (f_size,c_size,p_size) ! counter, valid only on root pe

budg_dataL is where the locally accumulated budgets are stored for each component.  budg_dataG
is where the global budgets are summed before writing to the log or restart file.  budg_ns
is a counter that accumulates the number of times the budget is accumulated.  Ultimately,
the average (not accumulated) budgets are written to the log file and budg_ns is needed
to average.

The arrays are three dimensional
  f_size = the number of different fields that are accumulated independently, currently
    this is 17 and includes area, 9 heat terms, and 7 water terms.
  c_size = the number of components terms that are accumulated independently, currently
    this is 22 and includes inputs and outputs to each of the 6 components plus
    an additional pair that separates the northern and southern hemispheres of the
    ice model.  That accounts for the first 14 terms.  The other 8 terms are similar
    terms but computed on the atm grid.  This is for a separate diagnostic in the model.
    That diagnostic looks at the budget solely on the atmosphere grid.
  p_size = the different time periods supported in the budget accumulation, currently this 
    is 5 and the diagnostics support instantaneous, daily, monthly, annual, and long term
    budgets separately.

By leveraging a single three-dimensional array, all the budget information for any MPI
task can be stored by field, component, and time in a single array.  That array can then
be quickly summed to generate the global diagnostics.  These arrays are also written
to the coupler restart files to support proper accumulations over different time periods
even when the model is stopped and restarted.
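
The reduction from per-task arrays to a global, averaged budget can be sketched as
follows.  This is a simplified Python stand-in, not the Fortran implementation: the
list of per-task arrays replaces the MPI reduction, the helper names are hypothetical,
and the array sizes are far smaller than the real f_size, c_size, and p_size.

```python
def zeros(f_size, c_size, p_size):
    """Zero-filled array indexed as [field][component][period]."""
    return [[[0.0] * p_size for _ in range(c_size)] for _ in range(f_size)]


def global_sum(local_arrays):
    """Element-wise sum of per-task arrays; stands in for a reduction to the root task."""
    f_size = len(local_arrays[0])
    c_size = len(local_arrays[0][0])
    p_size = len(local_arrays[0][0][0])
    total = zeros(f_size, c_size, p_size)
    for arr in local_arrays:
        for f in range(f_size):
            for c in range(c_size):
                for p in range(p_size):
                    total[f][c][p] += arr[f][c][p]
    return total


def average(budg_global, budg_ns):
    """Divide each accumulated sum by its counter; untouched entries stay zero."""
    return [[[(x / n if n > 0 else 0.0) for x, n in zip(row_x, row_n)]
             for row_x, row_n in zip(plane_x, plane_n)]
            for plane_x, plane_n in zip(budg_global, budg_ns)]


# Two tasks, one field, two components, two periods (sizes chosen for brevity).
task0 = [[[3.0, 30.0], [-1.0, -10.0]]]
task1 = [[[5.0, 50.0], [-7.0, -70.0]]]
budg_ns = [[[4.0, 40.0], [4.0, 40.0]]]   # number of accumulations per entry

budg_dataG = global_sum([task0, task1])
budg_avg = average(budg_dataG, budg_ns)
print(budg_dataG)
print(budg_avg)
```

Keeping the sums and the counter separate until output is what allows the accumulation
to continue across a restart: both can be saved and reloaded without losing information.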

The budget data is written to restart in seq_rest_mod.F90.  (NOTE: We should check that 
budg_dataG is being computed BEFORE the restarts are written in all cases.  Looking at the 
code quickly, I'm not convinced it is.  This would affect restartability of budgets.  The 
issue is that before a restart is written, seq_diag_sum0_mct has to be called and that's 
only called by the seq_diag_print_mct routine and only under certain circumstances.  I think 
this must be OK, but we should verify.)  

The budget diagnostics are written to the coupler log file in seq_diag_print_mct.  There are 
3 levels of budget output.  plev>=1 is the standard net summary budgets by term and by 
component.  plev>=2 provides diagnostics of the balance between each surface component and 
the atmosphere.  plev>=3 details the atm, lnd, ocean, and ice budgets on the atm grid.  
plev=1 is really the most basic budget and the budget that should be the main focus.

The budget diagnostics are controlled by a set of coupler namelist inputs.

  logical    :: do_budgets      ! do heat/water budgets diagnostics
  integer    :: budget_inst     ! instantaneous budget level
  integer    :: budget_daily    ! daily budget level
  integer    :: budget_month    ! monthly budget level
  integer    :: budget_ann      ! annual budget level
  integer    :: budget_ltann    ! long term budget level written at end of year
  integer    :: budget_ltend    ! long term budget level written at end of run

do_budgets must be set to true to get any budgets.  The other budget integer flags
set the level of the budget diagnostics for that time period (associated with plev 
above).  Budgets for different time periods are controlled, accumulated, and written
independently.  ltann writes the long term budget at the end of each year.  ltend
writes the long term budget at the end of each run.  The lt budgets are accumulated
since the start of a given case.  And all data is accumulated across a restart 
correctly (or should be).

The budget diagnostics are not bit-for-bit reproducible on different MPI task counts
due to the local and global accumulation phases, but this has not been a requirement.

Output for budget_ann=1 should look something like this (this data is taken from 
an arbitrary CESM case and should not be treated as ideal, spun up, validated, scientifically
correct, or otherwise.  It's just an example set of data).

(seq_diag_print_mct) NET AREA BUDGET (m2/m2): period =   annual: date =    260101     0
                       atm            lnd            ocn         ice nh         ice sh        *SUM*  
        area    -1.00000000     0.29174398     0.66380857     0.02269426     0.02175296    -0.00000023
  
(seq_diag_print_mct) NET HEAT BUDGET (W/m2): period =   annual: date =    260101     0
                       atm            lnd            rof            ocn         ice nh         ice sh            glc        *SUM*  
     hfreeze     0.00000000     0.00000000     0.00000000     0.06461262    -0.02851287    -0.03609976     0.00000000    -0.00000000
       hmelt     0.00000000     0.00000000     0.00000000    -0.90137816     0.37440068     0.52695615     0.00000000    -0.00002134
      hnetsw  -163.68069466    41.68913175     0.00000000   121.19128914     0.47355259     0.32910518     0.00000000     0.00238400
       hlwdn  -335.94243834    86.82047885     0.00000000   239.26125546     4.85790467     5.00257906     0.00000000    -0.00022029
       hlwup   393.65982821  -107.77979589     0.00000000  -274.64148108    -5.58690599    -5.65160920     0.00000000     0.00003605
     hlatvap    82.54401732    -9.75856097     0.00000000   -72.61188586    -0.04934321    -0.12437396     0.00000000    -0.00014668
     hlatfus     0.85408604    -0.27936490     0.00000000    -0.40724960    -0.04611690    -0.12135644     0.00000000    -0.00000180
      hiroff     0.00000000     0.05379623    -0.00000000    -0.05380324     0.00000000     0.00000000     0.00000000    -0.00000701
        hsen    22.36323927   -10.68462583     0.00000000   -11.74410411    -0.00646233     0.07167567     0.00000000    -0.00027733
       *SUM*    -0.20196215     0.06105924    -0.00000000     0.15725517    -0.01148337    -0.00312330     0.00000000     0.00174560
  
(seq_diag_print_mct) NET WATER BUDGET (kg/m2s*1e6): period =   annual: date =    260101     0
                       atm            lnd            rof            ocn         ice nh         ice sh            glc        *SUM*  
     wfreeze     0.00000000     0.00000000     0.00000000    -0.17130502     0.07559509     0.09570992     0.00000000    -0.00000000
       wmelt     0.00000000     0.00000000     0.00000000     0.81914718    -0.26813490    -0.55091384     0.00000000     0.00009843
       wrain   -30.43448867     6.31746794     0.00000000    23.97606210     0.07191497     0.06909592     0.00000000     0.00005226
       wsnow    -2.55944273     0.83717382     0.00000000     1.22040634     0.13819869     0.36366928     0.00000000     0.00000540
       wevap    32.99164098    -3.89730845     0.00000000   -29.03314109    -0.01750979    -0.04374031     0.00000000    -0.00005865
     wrunoff     0.00000000    -3.15257628    -0.00197029     3.15567234     0.00000000     0.00000000     0.00000000     0.00112576
     wfrzrof     0.00000000    -0.16121135     0.00000000     0.16123235     0.00000000     0.00000000     0.00000000     0.00002100
       *SUM*    -0.00229042    -0.05645433    -0.00197029     0.12807420     0.00006407    -0.06617902     0.00000000     0.00124421

What you are looking at are the area, heat, and water terms by component.

The first section sums the areas in the system.  These are coupler areas which are 
accumulated as the diagnostics are computed.  These areas include the time varying fractions
and represent an average area normalized by the surface area of the earth (m2/m2).  All the 
rest of the budget diagnostics are normalized by the surface area of the earth as well
before being written to generate W/m2 and kg/m2s*1e6 units.  If you think about the actual
computation in the code, the coupler diagnostics are accumulating fluxes*areas.  These
are averaged and then finally divided by the surface area of the earth to
produce the table above.  The division by area does not impact the results, it just puts 
the diagnostics into units that are more easily understood.
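
The normalization can be sketched as below.  This assumes areas in m2 and a spherical
earth; the radius value and the function names are assumptions made for this example,
and the coupler may carry its areas in different units internally.

```python
import math

# Assumed values for this sketch, not read from the coupler.
EARTH_RADIUS_M = 6.37122e6
EARTH_AREA_M2 = 4.0 * math.pi * EARTH_RADIUS_M ** 2


def normalize_heat(accumulated_watts, n_samples):
    """Average an accumulated flux*area sum (W) and convert to W/m2 of earth surface."""
    return accumulated_watts / n_samples / EARTH_AREA_M2


def normalize_water(accumulated_kg_per_s, n_samples):
    """Average an accumulated flux*area sum (kg/s) and convert to kg/m2s*1e6."""
    return accumulated_kg_per_s / n_samples / EARTH_AREA_M2 * 1.0e6


n_samples = 48
# 240 W/m2 over 30% of the earth, accumulated 48 times.
heat_sum = 240.0 * 0.3 * EARTH_AREA_M2 * n_samples
# 3e-5 kg/m2s over the whole earth, accumulated 48 times.
water_sum = 3.0e-5 * EARTH_AREA_M2 * n_samples

heat = normalize_heat(heat_sum, n_samples)      # about 72 W/m2
water = normalize_water(water_sum, n_samples)   # about 30 kg/m2s*1e6
print(heat, water)
```

A flux that covers only part of the earth shows up scaled by its area fraction, which
is why the land, ocean, and ice columns in the table are not directly comparable to
local flux values in those components.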

Across any row, the SUM should be close to zero.  It will not be exactly zero due to lags.  Each row
demonstrates conservation of fluxes within the coupler.  For instance, wrunoff shows water mostly
passed from the land model, through runoff (which has very little net accumulation), and then
into the ocean model, and the net sum of all terms is 0.0011 vs 3.15 for any individual term.
In the same way, wrain is 30.4 out of the atm, 6.3 into land, 23.97 into ocean, and about 0.07 each
into sea ice nh and sh, for a net sum of 0.00005 (roughly 6 digits smaller than the largest term).

Each column shows the total net accumulation of heat and water in any given component.  So
atm is losing net heat and water, but the net is several digits smaller than the largest individual terms.

The bottom right hand element (SUM x SUM) shows the overall budget of all terms and all
components.  Ideally, this would be zero (like any row should be) and isn't because of lags.
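
The row and column sums can be reproduced directly from the example table.  The sketch
below copies the water budget values from the output above and recomputes the sums.
The digits_of_cancellation helper is a hypothetical convenience for this example, not
something the coupler prints.

```python
import math

# Columns: atm, lnd, rof, ocn, ice nh, ice sh, glc (values copied from the table above).
water = {
    "wfreeze": [0.0, 0.0, 0.0, -0.17130502, 0.07559509, 0.09570992, 0.0],
    "wmelt":   [0.0, 0.0, 0.0, 0.81914718, -0.26813490, -0.55091384, 0.0],
    "wrain":   [-30.43448867, 6.31746794, 0.0, 23.97606210, 0.07191497, 0.06909592, 0.0],
    "wsnow":   [-2.55944273, 0.83717382, 0.0, 1.22040634, 0.13819869, 0.36366928, 0.0],
    "wevap":   [32.99164098, -3.89730845, 0.0, -29.03314109, -0.01750979, -0.04374031, 0.0],
    "wrunoff": [0.0, -3.15257628, -0.00197029, 3.15567234, 0.0, 0.0, 0.0],
    "wfrzrof": [0.0, -0.16121135, 0.0, 0.16123235, 0.0, 0.0, 0.0],
}


def digits_of_cancellation(terms):
    """log10(largest |term| / |sum|): how far the row sum sits below its terms."""
    residual = abs(sum(terms))
    largest = max(abs(t) for t in terms)
    if residual == 0.0:
        return float("inf")
    return math.log10(largest / residual)


row_sums = {name: sum(terms) for name, terms in water.items()}
column_sums = [sum(col) for col in zip(*water.values())]
total = sum(row_sums.values())   # the SUM x SUM element

for name, terms in water.items():
    print(name, round(row_sums[name], 8), round(digits_of_cancellation(terms), 1))
print("columns", [round(c, 8) for c in column_sums])
print("total", round(total, 8))
```

The recomputed sums agree with the printed *SUM* row and column to within the last
printed digit, and the per-row digit count gives a quick way to spot a row (such as
wrunoff here) that closes less tightly than the others.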

All of this assumes that all coupling fluxes are in the appropriate units, W/m2 and kg/m2s,
and that all coupling fluxes are part of the "flux" list, not the "state" list of coupling
fields.  This is critical as the model differentiates states and fluxes in the area
correction application and mapping methods.  Fluxes are always mapped conservatively.
States never have the area corrections applied.


In summary, there are many important details that need to be taken into account in the budget 
diagnostics and any small error will likely cause significant problems in the overall analysis.
There is significant value to both having robust diagnostic capabilities in all models and the
coupler that can easily be turned on to generate these diagnostics and in maintaining these
diagnostics as the models evolve.  Fixing these diagnostics after the fact tends to require
a fairly heroic effort.  Whenever a coupling field is changed or a new model or coupling
field is introduced, the impact on the budget should be outlined ahead of time, modifications 
to the budget should be part of the implementation, and a budget analysis should be part of the
validation process.  Simply focusing on connecting the fields between models through the
coupler is inadequate if there is a direct impact on the budget diagnostics.