Checkpoint/Restart application on Eddie
1. Why do we need a Checkpoint/Restart system on Eddie?
Currently, user applications are not allowed to run for longer than 48 hours on Eddie; this is enforced by the runtime limit of the batch system queue. This policy allows the queuing system to respond to changing needs and allows individual nodes to be maintained when required. Users whose jobs need longer than this to complete can use the checkpoint/restart facility, which checkpoints an unfinished job and resubmits it to the queue if it is cut off before it finishes.
2. Do I need to change my application code? Do I need to rebuild my code to link with some libraries?
For dynamically linked executables, no. Nothing has to be changed in your code or in your build procedure. You do not even need to link existing executables with the BLCR library, as it is loaded into your application automatically at startup. Just carry on as before: your application can benefit from checkpoint/restart on Eddie with no change at all, apart from an extra option when submitting your job (see below).
The ECDF System Team has implemented this facility with the Berkeley Lab Checkpoint/Restart (BLCR) package, a kernel-level checkpoint/restart product that can checkpoint a wide range of applications without requiring changes to application code. It is integrated with Grid Engine, which performs all checkpoint/restart operations for users' applications. BLCR is completely transparent to regular users on Eddie.
Statically linked executables need to be linked against the library explicitly, as in the following sample:
module add blcr/64/0.5.0
gcc -L/usr/local/Cluster-Apps/blcr/0.5.0/lib64 -lcr -o hello ./hello.o
3. How do I switch on checkpoint/restart?
Checkpoint/restart is turned on by the qsub option -ckpt BLCR, which tells Grid Engine that your job needs to be checkpointed (using BLCR). You also need a checkpoint/restart job script that implements the BLCR job interface. A template script has been provided by the ECDF System Team (see the end of these instructions).
To submit a checkpoint job, just submit this template script followed by your application and parameters:
qsub -N <job_name> -l s_rt=HH:MM:SS -cwd -ckpt BLCR checkpoint.sh <app_path>/<application> [parameters]
where HH:MM:SS is the maximum runtime of each checkpointed run, which must be less than or equal to the maximum runtime available on the system (currently 48 hours).
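The constraint on s_rt can be checked before submitting. The following is a minimal sketch only: hms_to_seconds and s_rt_within_limit are hypothetical helper names, not part of Grid Engine or BLCR, and the 48:00:00 default simply mirrors the limit quoted above.

```shell
# Hypothetical helpers (not part of Grid Engine or BLCR).

# Convert an HH:MM:SS string to a number of seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed if the requested s_rt ($1) does not exceed the limit ($2).
# The default limit of 48:00:00 is an assumption taken from the text above.
s_rt_within_limit() {
    requested=$(hms_to_seconds "$1")
    limit=$(hms_to_seconds "${2:-48:00:00}")
    [ "$requested" -le "$limit" ]
}
```

For example, s_rt_within_limit 47:45:00 succeeds, while s_rt_within_limit 50:00:00 fails, so a wrapper script could refuse to call qsub in the second case.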
4. Can I checkpoint/restart an MPI job?
Some MPI libraries, such as LAM/MPI, are supported by BLCR, so MPI jobs should be able to benefit from this feature. This is not yet available on Eddie; we are working to implement it in the near future.
5. Are there any downsides to doing this?
Checkpointed jobs can only be restarted on the same node they were previously running on. This means your job may have to wait some time to be restarted, until the target node is free.
6. How frequently will my job be checkpointed?
The checkpoint interval has been set to 1 hour, which means a restarted job loses no more than the most recent hour of work.
Note: Make sure the Sun Grid Engine environment is set up in your shell.
You should add either
. /etc/profile
module add sge
or
export SGE_ROOT=/usr/local/Cluster-Apps/sge
export PATH=/usr/local/Cluster-Apps/sge/bin/lx26-amd64:/usr/local/Cluster-Apps/sge/sbin/lx26-amd64:$PATH
export LD_LIBRARY_PATH=/usr/local/Cluster-Apps/sge/lib/lx26-amd64:$LD_LIBRARY_PATH
in your $HOME/.bash_profile
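A quick way to confirm that the environment is in place is sketched below. The function name sge_env_ok is hypothetical; it only checks that SGE_ROOT is set and that qsub can be found on PATH, which is an assumption about what "set up" means here.

```shell
# Hypothetical check (not part of Grid Engine): succeed only if the
# Grid Engine environment appears to be set up in the current shell.
sge_env_ok() {
    # SGE_ROOT must be set and non-empty.
    [ -n "${SGE_ROOT:-}" ] || return 1
    # qsub must be found on PATH.
    command -v qsub >/dev/null 2>&1 || return 1
    return 0
}
```

It could be used as: sge_env_ok || echo "Grid Engine environment not set" 1>&2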
checkpoint.sh - template script
#!/bin/sh
trap '/usr/local/Cluster-Apps/sge/bin/lx26-amd64/qmod -sj $JOB_ID;sleep 10' usr1
export PATH=/usr/local/Cluster-Apps/blcr/0.5.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/Cluster-Apps/blcr/0.5.0/lib64:$LD_LIBRARY_PATH
export tmpdir=$SGE_CKPT_DIR/ckpt.$JOB_ID
if [ -e "$tmpdir/currcpr" -a -r "$tmpdir/currcpr" ]; then
    export currcpr=`cat $tmpdir/currcpr`
    export ckptfile=$tmpdir/context_$JOB_ID.$currcpr
fi
if [ "$RESTARTED" -eq "2" -a -e "$ckptfile" -a -r "$ckptfile" ] ; then
    echo "restart job $JOB_ID from checkpoint file: $ckptfile" 1>&2
    cr_restart $ckptfile
else
    echo "Init job $JOB_ID from the starting point" 1>&2
    cr_run $*
fi
A Tutorial of checkpoint/restart on Eddie
A sample code
You can test checkpoint/restart on Eddie by the following sample code:
count.c
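#include <stdio.h>
#include <time.h>
#include <unistd.h>   /* sleep() */
#define max_num 320   /* number of one-second steps */
int main(void)
{
    int i;
    time_t pt;
    for (i = 0; i < max_num; i++) {
        /* Print a timestamp and the counter every 5 steps. */
        if (i % 5 == 0) {
            time(&pt);
            printf("%si=%d\n", ctime(&pt), i);
        }
        sleep(1);
    }
    return 0;
}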
The code above counts up by one each second and prints the count every 5 seconds. The maximum count is defined as 320, so the job takes about 5 minutes to finish. A different length can be chosen by editing the definition of max_num. Build the code on Eddie using gcc:
gcc -o count count.c
Use the checkpoint/restart testing nodes
node126 and node246 have been reserved for testing checkpoint/restart on Eddie. The runtime limit on these nodes is 00:02:20, and a checkpoint is taken every minute. This configuration lets you see the effect of checkpoint/restart quickly, so you do not have to wait 48 hours to find out whether your jobs restart successfully. Another benefit is that your jobs can start on these nodes very soon, because any job running ahead of yours will stop within 00:02:20.
Submit job
To avoid disturbing normal jobs, the checkpoint/restart testing nodes only accept jobs submitted with a specific parallel environment (PE), ckpt_test. You need to request 1 slot in this PE for your job, as follows:
qsub -cwd -pe ckpt_test 1 -ckpt BLCR -l s_rt=00:02:20 checkpoint.sh /exports/work/is_iti_ug/ywan/checkpoint/count
Note: You need to copy the above script checkpoint.sh to your current directory before submitting jobs.
How to know if checkpoint/restart works
The simplest way is to wait for your job to finish. If it completes after running for longer than the runtime limit in total, checkpoint/restart definitely works. With a real job, however, this may take days. A quicker way is to watch the job state around a restart. The state should change from 'r' to 's' when the job hits the runtime limit, and appear as 'Rq' while the job is waiting in the queue. Once the job has restarted, its state becomes 'Rr'.
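The state codes above can be summarised in a small lookup. This is only an illustrative sketch: describe_job_state is a hypothetical helper, and it covers just the four states mentioned in this tutorial, not every state Grid Engine can report.

```shell
# Hypothetical helper: translate the job states discussed above
# into a short description. Other states are not covered here.
describe_job_state() {
    case "$1" in
        r)  echo "running (first run)" ;;
        s)  echo "suspended at the runtime limit" ;;
        Rq) echo "waiting in the queue to be restarted" ;;
        Rr) echo "restarted and running" ;;
        *)  echo "state not covered in this tutorial" ;;
    esac
}
```

The argument would be the value shown in the state column of the qstat output for your job, for example describe_job_state Rr.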
We suggest testing checkpoint/restart with the sample code above, which finishes after restarting twice; you can see the complete output once it completes successfully. You can also test with your own code. Although your code may take days to finish, you can trust checkpoint/restart if your job's state is Rr after it has been running on the testing nodes for half an hour, because by then it will have been restarted more than 10 times.
Once you have confirmed that your job restarts properly, you can submit it normally without the PE option:
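The restart counts quoted above follow from simple arithmetic on the runtime limit. The sketch below uses a hypothetical helper, estimate_restarts, and assumes that no work is lost between a checkpoint and the restart and that queue waiting time is ignored, so it is a rough estimate only.

```shell
# Hypothetical helper: estimate how many restarts a job needs.
#   $1 = total running time in seconds
#   $2 = runtime limit per run in seconds
# Assumes no work is lost at each restart (a simplification).
estimate_restarts() {
    total=$1
    limit=$2
    # Number of runs needed, rounded up.
    runs=$(( (total + limit - 1) / limit ))
    # Every run after the first one is a restart.
    echo $(( runs - 1 ))
}
```

With the 00:02:20 limit (140 seconds) on the testing nodes, estimate_restarts 320 140 gives 2 for the sample code, and estimate_restarts 1800 140 gives 12 for half an hour of running.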
qsub -cwd -ckpt BLCR -l s_rt=47:45:00 checkpoint.sh <full_path>/<prog>
Note: avoid using exactly 48 hours as s_rt. Reaching s_rt triggers the job restart, and a checkpoint is taken at the same time, so leaving no margin below the 48-hour limit risks a failure caused by conflicting operations.