The archive_supervisor reports the death of a daemon by mail within one hour. The first thing to do is to look at the tail of the dead daemon's logfile and try to understand what happened. Depending on the error, the daemon may have been able to detect it (and die gracefully) or not (see below).
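As an illustration of that first step, here is a minimal sketch of reading the tail of a logfile. The function name tail and the logfile name in the usage note are hypothetical; the real logfile location depends on the installation.

```python
from collections import deque


def tail(logfile, n=20):
    """Return the last n lines of logfile as a list of strings."""
    with open(logfile, "r", errors="replace") as fh:
        # A deque with maxlen keeps only the most recent n lines in memory,
        # so even a very large logfile is read in a single cheap pass.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]
```

For example, tail("aiqed.log", 30) would return the last 30 lines of a logfile named aiqed.log (an assumed name), which is usually enough to see both the daemon's own error message and any message from the Shell.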
Of course, it is normal for the archiver daemons to be dead when the media are full. In that case, the messages from the archive_supervisor should be preceded by another message sent by distd. However, distd waits for all the archivers to finish before sending this message, so the "daemon's dead" message may arrive before the "media are full" one. When that happens, it is easy to find out from the logfile that the media are full. Make sure you have received the "media are full" message before doing anything.
If the daemon detected the error, it prints an informative error message in the logfile, in addition to any message from the Shell on standard error, and dies gracefully. The error should then be clearly identifiable, and the operator should try to fix it. A typical case of this kind of error is the failure of one of the archiving devices (e.g. a bad write). The daemon can then simply be restarted with the start_daemon command, as explained in chapter 3.2.2 (see warning above).
Another example is a daemon that could not be started because a .status file already existed with the status RUNNING. This may happen when the daemon died ungracefully (see 6.1.2) or when the archive machine had to reboot (see 6.3).
A daemon may also die because the file it is processing is illegal. For example, if a frame doesn't contain some compulsory FITS cards, aiqed or stripheadd will fail on this file and die. Of course, illegal files should never be fed to the pipeline, but you never know... This type of error is easy to detect by analyzing the last file processed. The pipeline can then be restarted after either correcting the file or removing it manually from the pipeline.
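To check the last file processed for missing cards, something along the following lines may help. It is only a sketch: it relies on the standard FITS layout (80-byte header cards, keyword in the first 8 bytes, header closed by an END card) and reads the primary header only. The function names are hypothetical, and the list of cards that aiqed or stripheadd actually require is not given here, so the caller must supply it.

```python
CARD = 80  # a FITS header card is exactly 80 bytes long


def header_keywords(path):
    """Return the set of keywords found in the primary FITS header of path."""
    keywords = set()
    with open(path, "rb") as fh:
        while True:
            card = fh.read(CARD)
            if len(card) < CARD:
                # The data ended before an END card: the header is truncated.
                raise ValueError("no END card found in %s" % path)
            # The keyword occupies the first 8 bytes, padded with spaces.
            keyword = card[:8].decode("ascii", errors="replace").strip()
            if keyword == "END":
                return keywords
            if keyword:
                keywords.add(keyword)


def missing_cards(path, required):
    """Return the required keywords absent from the header, sorted."""
    return sorted(set(required) - header_keywords(path))
```

A call such as missing_cards(frame, ["SIMPLE", "BITPIX", "NAXIS"]) tests only the keywords that the FITS standard itself makes mandatory; the cards compulsory for this pipeline would have to be added to the list. An empty result means no listed card is missing.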
The daemon typically fails to detect the error when there is a programming error in a script. In that case only the Shell message is recorded in the logfile and the daemon dies ungracefully. Typical errors of this kind are syntax errors in a part of the script that is very seldom used (how about reprogramming the whole archive pipeline in Perl, so that compilation checks all the syntax beforehand...).
From the last messages in the logfile, it should be possible to locate approximately the part of the script where the error occurred. The operator should then analyze the source code to find the bug and fix it.
When ready, the daemon can be restarted. However, because it died ungracefully, the .status file in the working directory was not removed and the daemon will refuse to start. Just removing the .status file should be enough (see warning above).
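The clean-up of a stale .status file could be sketched as follows. The format of the .status file is not described here, so the sketch assumes it is a text file containing the word RUNNING. The function names are hypothetical. The helper deliberately removes nothing unless the operator states that the daemon is really dead, since the warning above still applies.

```python
from pathlib import Path


def status_is_running(workdir):
    """Return True if workdir/.status exists and mentions RUNNING."""
    status = Path(workdir) / ".status"
    if not status.is_file():
        return False
    # Assumption: the status is stored as plain text in the file.
    return "RUNNING" in status.read_text(errors="replace")


def remove_stale_status(workdir, daemon_confirmed_dead):
    """Remove workdir/.status only if the operator confirmed the daemon is dead.

    Returns True if a file was removed, False otherwise.
    """
    if not daemon_confirmed_dead or not status_is_running(workdir):
        return False
    (Path(workdir) / ".status").unlink()
    return True
```

Once remove_stale_status has returned True for the daemon's working directory, the daemon should no longer refuse to start because of the leftover file.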