  • provenance is implemented in CIME and fires throughout the run, so even if the job dies you will get something. It creates world read/writable output, but some things are not, so someone has to clean it up (this only works at ORNL, but they will clean it up on Lustre, so it does need to be archived). On NERSC and Mira it needs to be cleaned up manually in the users' case and run directories
  • Ben parses this data and puts it into a database. Mark: can we then delete everything else? Pat: this is not sufficient, there is more info there
  • Pat you want stuff I am not collecting and I am collecting stuff you may not need
  • send Pat an email on Confluence to update the documentation on provenance
  • there are timers: profile timers per component (talk to Noel Keen about it). There are log files per simulated day, captured in the coupler. Profile timers are summaries: one text log file in the case directory, which only exists if the job finishes. While the job is running you need to look at the coupler files in the run directory to monitor the run.
  • Monitor the log file: if it does not update for some time (depends on the run, could be a minute, but 15 min should always be enough), we could notify people after 15 min so they can kill the job, instead of waiting for a hung job to die by itself and waste the time allocation. There could be an external job that monitors the run, so it can kill the job even if the job is hung.
  • for the future: resubmitting a job as a capability in CIME, requesting more nodes and having many jobs so that you can launch another job after one dies without waiting in the queue again.
  • there are other log files to look at, 5 or 6 that log the running job; we should also look at stderr and stdout
  • provenance on how long it takes to compile and to sit in the queue is all collected in the same place; some is written at the beginning of the job, some at checkpoints, and some at the end of the job when it is done. All LCFs have priority jobs and may also keep track of the allocations.
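
The log-monitoring idea above (flag a job whose log has not been updated for 15 minutes) could be sketched roughly as below. This is only an illustration, not something that exists in CIME: the function names, the polling interval, and the use of print as the notification hook are all assumptions, and which log file to watch (for example a coupler log in the run directory) is left to the caller. The sketch only notifies; it does not kill the job.

```python
import os
import time


def seconds_since_update(log_path, now=None):
    """Return the number of seconds since log_path was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path)


def is_stalled(log_path, threshold_s=15 * 60, now=None):
    """True if the log has gone longer than threshold_s without an update."""
    return seconds_since_update(log_path, now) > threshold_s


def watch(log_path, threshold_s=15 * 60, poll_s=60, notify=print):
    """Poll the log until it stalls, then call notify once and return.

    notify is a placeholder hook (hypothetical); a real setup might send
    an email or a chat message instead of printing.
    """
    while not is_stalled(log_path, threshold_s):
        time.sleep(poll_s)
    minutes = threshold_s // 60
    notify(f"{log_path} not updated for over {minutes} min; job may be hung")
```

Running this from a separate process (a login node or a small companion job) matters, because a hung job cannot report on itself.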

 

Atmosphere Group Notes

/wiki/spaces/ATM/pages/71336556
