2021-05-24

Keep a log of your HPC work

If you work on an HPC system, chances are you're using some sort of job scheduler. More than likely, it's PBS or (increasingly) SLURM. While these schedulers keep some record of past jobs (how much depends partly on your HPC administrators), I've found it very useful to keep my own plain-text history of every job I run. Even if you're not using an HPC or a job scheduler at all, keep reading; I bet you can adapt these ideas to your own computational setup.

I see two primary benefits for keeping this information:

  1. Generate a record of exactly how many (batch) jobs you're running. This is useful if you want to know how many jobs you ran on a certain day, or between two dates, etc. (Think ahead: some [boss] could ask you for exactly that information at some point, and it would be both instructive and self-defensive to have it handy.)

  2. Keep a direct map between job IDs (probably just meaningless-to-you numbers) and job working/submit directories (which should be more readable, if you're good at naming things). If you or your HPC admins ever have trouble with a particular job or compute node, it's likely that one of you will speak the language of job IDs and the other the language of submit directories. Now you'll have a translation key between them.


Saving a bit of useful information from each scheduled job you run is trivial. Here I'm showing you how I do it using only bash commands, built-in SLURM (or PBS) environment variables, and a text file.

To start, create a file named HPC_JOBS.log at your "home" location (probably something like ~/ or /home/$USER/ or $HOME/). Then copy and paste these next two lines into that file, which will serve as a sort of table heading:

DATE EXEC_HOST JOB_ID JOB_SUBMIT_DIR

-----------------------------------------------------------------------------------------------------------------
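Equivalently, you can create the file and its header straight from a shell prompt. Here's a minimal sketch; it assumes you keep the log at $HOME/HPC_JOBS.log, which is the path the snippets below append to:

echo "DATE EXEC_HOST JOB_ID JOB_SUBMIT_DIR" > $HOME/HPC_JOBS.log #CREATE THE LOG WITH ITS HEADER LINE.

echo "-----------------------------------------------------------------------------------------------------------------" >> $HOME/HPC_JOBS.log #ADD THE SEPARATOR LINE.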

Now that that's set up, all you need to do is add three lines of code to your job scripts before they run. I've listed two sets of commands below depending on whether your HPC uses SLURM or PBS. (The commands are nearly identical, but each job scheduler has its own environment variables.)


SLURM:

exec_host=$SLURMD_NODENAME #NAME OF THE HEAD NODE WHERE THE JOB IS RUNNING (SLURM SETS THIS VARIABLE IN THE JOB ENVIRONMENT).

run_date=$(date +%D) #DATE JOB STARTED RUNNING, IN MM/DD/YY FORMAT.

echo "$run_date $exec_host $SLURM_JOB_ID $SLURM_SUBMIT_DIR" >> $HOME/HPC_JOBS.log

PBS:

exec_host=$(qstat -f $PBS_JOBID | grep exec_host | awk '{print $3}') #NAME OF THE HEAD NODE WHERE THE JOB IS RUNNING.

run_date=$(date +%D) #DATE JOB STARTED RUNNING, IN MM/DD/YY FORMAT.

echo "$run_date $exec_host $PBS_JOBID $PBS_O_WORKDIR" >> $HOME/HPC_JOBS.log
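Once a job or two has run, each one appends a single line to the log, something like this (the node name, job ID, and directory here are invented for illustration):

05/24/21 node0123 4567890 /home/user123/projects/run_01

With that format in place, the two benefits above become one-liners, for example:

grep -c '^05/24/21 ' $HOME/HPC_JOBS.log #HOW MANY JOBS STARTED ON MAY 24, 2021?

grep ' 4567890 ' $HOME/HPC_JOBS.log #WHICH DIRECTORY WAS JOB 4567890 SUBMITTED FROM?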

Final note:

Since HPC_JOBS.log is just a plain text file that every job appends to, take care to protect it; a single accidental overwrite or deletion would erase your whole history. I suggest backing it up (or moving it aside) every few months and starting a fresh log file.
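Here's one minimal way to do that rotation (again assuming the log lives at $HOME/HPC_JOBS.log); after the move, just recreate the two header lines as shown at the top:

mv $HOME/HPC_JOBS.log $HOME/HPC_JOBS_$(date +%Y%m%d).log #ARCHIVE THE CURRENT LOG, STAMPED WITH TODAY'S DATE.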