Mar 15 2003
Failed NetBackup Jobs: Retry vs Requeue
Written by Paul Winkeler   
Saturday, 15 March 2003
Many NetBackup users find themselves wondering under what circumstances a job is retried vs requeued. This is really a trick question, as these two states are not mutually exclusive at all. Instead, they are tightly coupled, as follows.

When a job fails to run to completion, NetBackup will automatically retry it, up to a limit set by the global attribute "Schedule Backup Attempts". Between such retries the job is not returned to the queue, though, because all of its resources (drive, tape, etc.) remain assigned to the job. However, there are two exceptions:

  • If a job is active, the drive it is writing to goes down, and the storage unit to which the job was assigned has no other drives available, the job fails unless the "Wait in Queue" attribute has been set on the server, in which case the job is requeued.
  • Similarly, if a job is submitted but the required storage unit is not available, having the "Queue on Error" attribute set will cause that job to get requeued.
Note that throughout the requeueing process the job maintains the same job id (not to be confused with its process id).
Thus, retries and requeues are tightly coupled. Any failure will cause a job to get retried as long as it still has retries left. Once a job has been assigned a set of resources during its tenure in the queue, it is stuck with them. Should the job fail as the result of one of its resources becoming unavailable, it can get requeued to await the assignment of replacement resources. I suspect a job could in theory be requeued multiple times as long as the total number of retries is not exceeded, but I have not observed this behavior.
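
The retry/requeue decision described above can be sketched as a small Python model. This is only an illustration of the behavior as described in this article, not NetBackup code: the `Job` class, the `next_action` function, and its parameter names are all hypothetical. The check against the retry limit before requeueing reflects the unverified suspicion stated above, not observed behavior.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int          # stays the same across requeues
    tries_used: int = 0  # attempts made so far


def next_action(job, max_tries, resources_available, requeue_allowed):
    """Return 'retry', 'requeue', or 'fail' for a job that has just failed.

    max_tries stands in for the limit set by "Schedule Backup Attempts".
    requeue_allowed stands in for "Wait in Queue" / "Queue on Error".
    Both names are illustrative, not real NetBackup identifiers.
    """
    if job.tries_used >= max_tries:
        return "fail"     # no retries left
    if resources_available:
        return "retry"    # job keeps its drive and tape and retries in place
    if requeue_allowed:
        return "requeue"  # wait for replacement resources under the same job id
    return "fail"         # resources lost and requeueing not enabled
```

For example, a job that has used one of two tries and has lost its drive is requeued only when the requeue attribute is set; otherwise it fails.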

As long as a job's resources remain available, retries happen in rapid succession. Once the try count exceeds the "Tries per Period" setting, the job will be scheduled again after a delay controlled by the "Time Period" setting. Unfortunately, it is not possible to control the delay between successive retries within a "Time Period".
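
The timing rule can be illustrated with a few lines of Python. This is a rough sketch under stated assumptions, not NetBackup's scheduler: the function `delay_before_next_try` and its parameters are hypothetical, and the delay is modeled as one full "Time Period", whereas the article only says the delay is controlled by that setting.

```python
def delay_before_next_try(tries_in_period, tries_per_period, time_period_hours):
    """Return the modeled wait, in hours, before the next try may start.

    tries_in_period: tries already made in the current period.
    Assumption: once the allowance is used up, the wait equals the whole
    "Time Period"; the real delay may differ.
    """
    if tries_in_period < tries_per_period:
        return 0.0  # still within the allowance: retry right away
    return float(time_period_hours)  # allowance used up: wait out the period


# With 2 tries per 12-hour period, the second try follows the first at once,
# while a third try has to wait.
assert delay_before_next_try(1, 2, 12) == 0.0
assert delay_before_next_try(2, 2, 12) == 12.0
```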

Last Updated ( Thursday, 26 April 2007 )
 