CFHT archive manual - administration 

Archive Administration

Most archive administration is performed via a collection of scripts that monitor and control the archive daemons. For troubleshooting problems in the archive it is helpful to be familiar with the files, processes and other scripts that these utilities use, and in some cases it can be necessary to manually intervene to recover from unexpected failures.

  • Starting and Stopping Daemons
  • Monitoring the Archive
  • Tape Library
  • Adding Locations
  • Changing Media
  • Updating file headers


    Starting and Stopping Daemons

    Daemons are usually started and stopped individually with start_daemon and kill_daemon, and collectively with start_archive and kill_archive. Before the archive host is shut down, the archive must be halted manually, usually with kill_archive. Similarly, it must be started manually after a reboot or other maintenance. Anything that causes an archive daemon to halt abnormally can leave it in a state that requires manual intervention before it can be restarted. Otherwise it is always safe to request the halt of a daemon via kill_archive or kill_daemon. If the daemon is in a state that is unsafe for shutdown, the scripts will return a descriptive error and let the daemon continue running. For some daemons, the kill script issues a signal telling the daemon to halt at the next moment that it is safe to do so. Rerunning the kill script will confirm that the daemon is dead.

    The kill and start scripts assess each daemon by reading two files in the daemon's working directory: .status contains a report of the current status of the daemon, usually "IDLE" or "RUNNING", while .pid contains the process id (PID) issued by the system when the daemon starts up. The kill script first checks that the daemon is in a safe state for shutdown ("IDLE") and then checks that the daemon's PID is still in the list of current processes on the system. If both are true, the daemon's process is killed and the .status and .pid files are deleted.
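    The two checks described above can be sketched in Python. This is only an illustration, not the actual kill script: the function names are hypothetical, and it assumes that .status holds a single word and .pid a single integer, which the manual does not spell out.

```python
import os
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 tests for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def safe_to_kill(workdir, is_running=pid_is_running) -> bool:
    """True only when .status reads IDLE and the PID in .pid is still alive.

    Assumed file layout: one word in .status, one integer in .pid.
    """
    workdir = Path(workdir)
    try:
        status = (workdir / ".status").read_text().strip()
        pid = int((workdir / ".pid").read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing or unreadable state files: do not kill
    return status == "IDLE" and is_running(pid)
```

    A daemon whose .status reads "RUNNING", or whose PID has disappeared, is reported as not safe to kill, which matches the conservative behaviour described above.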

    It is possible for an archive daemon to die while leaving its .status file in the "RUNNING" state. The kill script won't officially kill a daemon in this state (even though it is already dead), and the start script won't try to start a daemon whose .status file is in this state. This is intentional: in this condition some files may not have been accessed and might never get processed, so the daemon needs to be checked manually to make sure that it is safe to restart. Usually a daemon will not remove files from its working directory until all steps are complete, so simply restarting a daemon is usually safe, but it is important to check its logfile and working directory first. Some daemons, like tarexa, do move files from the working directory to a subdirectory called .current before they are completely processed, but the daemon has a routine for recovering from an unexpected halt while files are in this area. Once you are confident that it is safe to restart the daemon, manually delete the .status and .pid files from its working directory.
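    As a rough aid for this manual check, the following Python sketch gathers what an administrator would look at and then clears the state files. The function names are hypothetical, the caller is assumed to have already determined whether the PID is still alive, and the logfile still has to be read by hand.

```python
from pathlib import Path


def review_stale_daemon(workdir, pid_alive: bool) -> dict:
    """Collect what to inspect before clearing a dead daemon's state files."""
    workdir = Path(workdir)
    status_file = workdir / ".status"
    status = status_file.read_text().strip() if status_file.exists() else None
    current = workdir / ".current"
    return {
        # RUNNING in .status with no live process is the stale case
        "stale": status == "RUNNING" and not pid_alive,
        # non-hidden entries still waiting in the working directory
        "pending": sorted(p.name for p in workdir.iterdir()
                          if not p.name.startswith(".")),
        # files some daemons (e.g. tarexa) moved aside mid-processing
        "in_current": sorted(p.name for p in current.iterdir())
                      if current.is_dir() else [],
    }


def clear_state_files(workdir) -> list:
    """Delete .status and .pid; call only after the manual checks are done."""
    removed = []
    for name in (".status", ".pid"):
        path = Path(workdir) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

    Keeping the review and the deletion in separate functions mirrors the procedure: nothing is removed until a person has looked at the report.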

    Monitoring the Archive

    The first method for checking the status of the archive is a script called check_arch which reports on the status of all the archive daemons and disk space on the archive host and the session hosts at the summit. Further information can be found about the operation of a particular daemon by reading its logfile in /h/archive/sw/log.

    Every daemon and some utilities are configured at startup by a parameter file in /h/archive/sw/par. To change some aspect of a daemon's operation, edit the "par" file for that daemon. Note that the daemon only reads these values at startup, so the daemon must be killed and restarted for the changes to take effect. This directory also contains a file for the archive pipeline in general, which is read by many scripts, including the kill and start scripts, to get the names of the daemons involved in the archive pipeline, their parfiles and their working directories.

    There is also a script called archive_supervisor that is run regularly by the system cron facility that checks the status of the archive and reports via email any potential problems. Usually in response the administrator should get more information by running check_arch and then specific information about the daemon having difficulty by reading the logfile.

    Usage of the disk pools can be summarised using the tool "check_pool" and files can be moved, deleted and restored from tapes using adh.

    Tape Library

    The tar daemons currently use a tape library, which considerably reduces the amount of human intervention required to operate the archive but introduces the possibility of errors during the manipulation of the tapes. The tar daemons keep track of the name of the tape they are currently writing to, the device they expect that tape to be in, and the amount of data already written to this tape, in files in their working directory (.label, .device, .current_tape). They will not write to a tape that they know to be full. In the event of errors during a tape change or drive cleaning (which occurs at scheduled intervals during tape changes), a good recovery technique is therefore to use libh to manually place the tapes the daemons expect in the drives they expect, and then start the archive. Note that libh accounts for the name of each tape ending in the instance of the daemon that allocated it, i.e. "3d" or "4d". The tar daemons should immediately attempt to change tapes again. Note that the above condition will also occur if the tape library runs out of free tapes while the archive is running.
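    Before moving tapes by hand it can help to see what a tar daemon currently expects. The Python sketch below simply dumps the three state files named above; it is not an archive tool, the function name is hypothetical, and it assumes the files are short plain text without interpreting what each one holds.

```python
from pathlib import Path

# State files kept by the tar daemons in their working directory
TAPE_STATE_FILES = (".label", ".device", ".current_tape")


def read_tape_state(workdir) -> dict:
    """Return the raw contents of each tape state file (None if missing)."""
    state = {}
    for name in TAPE_STATE_FILES:
        path = Path(workdir) / name
        state[name] = path.read_text().strip() if path.exists() else None
    return state
```

    Comparing this output for each tar daemon against the library report shows which tape needs to go back into which drive.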

    To troubleshoot problems with the tape library, you can manually run change_auto_media, the routine called by the distributor when both tar daemons report full tapes. First make sure that the tar daemons are dead, then invoke it with the -i (for interactive) flag and the name of the daemon whose tape is to be changed:

    /h/archive/sw/adm/change_auto_media -i tarexa3d

    With the -i option, errors are reported and you are given the option to continue if you deem that they should not be fatal. This script also starts the daemon after changing its tape so be prepared for it to run.

    Occasionally the two daemons can get out of sync with each other, so that the tapes they write are no longer identical. This usually happens when one halts for some reason (such as an error from the tape drive) and the other continues running. Usually the running daemon will continue writing to the end of its tape and then shut down, waiting for the other. The distributor will issue a message to the incomplete daemon to write its next set and exit for a tape change. If this daemon is simply started while it is more than one set behind the other, its tape will be left incomplete relative to the other. The distributor starts all tar daemons with the same files, so the next tape will be consistent but the last one will be short.

    If you intervene before the tape change (which should be waiting for the inactive daemon to respond to its request to save and exit), first diagnose and resolve the reason why the daemon shut down in the first place. Then remove the .broadcast file from the daemon's working directory and restart it (but not the other daemons). This lets the daemon run to completion of the tape, at which point it will again ask the distributor to change tapes, and the distributor, seeing that both daemons are complete, will grant the request.

    If you do not discover that one tape is truncated until after the fact, you will have to duplicate the tape from the one written by the daemon that completed its tape, using one of the utilities in /h/archive/sw/tools. You will also have to copy the listfile from the successful daemon to the name of the listfile for the unsuccessful daemon in /h/archive/sw/list/old.

    Adding Locations to Look for Files to be Archived

    A common administrative procedure is to add or remove a location to check for files to be archived, which must be done each time an instrument is added or removed. The process is fairly straightforward and is done by editing the parfile for the copy daemon. Each location is described by the local path on the remote host, the remote host name, the method for copying files, and the automount path on the archive host if it differs from the path on the remote host. These are all on separate lines, and all must be updated for each location added or removed. Kill and restart the copy daemon to make the changes active.

    Changing Media in the Tape Library

    When the tape library begins to get full, or when it becomes desirable to send tapes to the CADC, the tape library must be manually emptied and refilled with blank tapes. This process can be done while the archive pipeline is running. First print out a summary of the library state and a summary of the contents of all archive tapes currently loaded:

    libh --print_report | lpr -Pwps
    libh --tape_info | lpr -Pwps

    Then prepare a label for each of the tapes you plan to remove. The label should include the name of the tape (independent of the daemon that wrote it) and the date the tape was first allocated, which is the first of the two dates listed by --tape_info. Also prepare a sticker for the outside of the box with the tape name, the first date, the range of filenames on the tape, and the number of sets. Then open the tape library magazines one by one using the interface on the front of the library. A small map of the slot numbers is located on the inside of the drawer for each magazine. Using this guide, locate the tapes listed in the printout of --print_report and slide the label into their slot while they are still in the magazine. It is very important that each tape gets the correct label. When they are all labeled, remove the tapes from their carriers, slide the write protect tab to the "protected" position (for DLT tapes the tab turns orange when it is protected), and place each tape in its protective box. Fill the empty magazine slots with blank tapes and close the magazine door before moving to the next magazine.

    Once the library has been emptied, libh must be told, using the --init_slot flag, that the tapes are now blank and can be used. If you only changed the tape in slot 6:

    libh --init_slot 6

    If you changed the tapes in slots 3 through 4 and slots 32 through 37:

    libh --init_slot 3-4
    libh --init_slot 32-37

    Note that initializing a slot is the equivalent of erasing the tape in it since any process can now be assigned the tape for writing.
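    When many slots were changed, the command lines can be generated from the list of slot numbers. The Python sketch below is a hypothetical helper that only builds the strings, using the single-slot and range forms shown above. It does not run libh, so nothing is initialized until the commands are reviewed and entered by hand.

```python
def init_slot_commands(slots):
    """Group slot numbers into contiguous runs, one libh command per run."""
    commands = []
    ordered = sorted(set(slots))  # ignore duplicates and input order
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # extend the run while the next slot number is adjacent
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        spec = str(start) if start == end else f"{start}-{end}"
        commands.append(f"libh --init_slot {spec}")
        i += 1
    return commands


for line in init_slot_commands([3, 4, 32, 33, 34, 35, 36, 37]):
    print(line)
```

    For the slots in the example above this prints the same two commands, "libh --init_slot 3-4" and "libh --init_slot 32-37".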

    Updating Headers in the Archive

    Occasionally it is necessary to change a file's headers after it has been archived. The procedure is as follows:

    1: If the file is not still on disk, extract the entire tape which contains the file into the storage pool.
    2: Update the header of the file.
    3: Rewrite the archive tape with the updated file(s).
    4: Send a copy of the updated header to the CADC.

    It is not necessary to update the tapes at the CADC unless the pixel data changes. The CADC maintains the header data separately from the pixel data, and the headers are only updated by the FTP channel. The headers of the files on tape are not read at the CADC. When we request a replacement tape from the CADC we ask for their most recent header data + original pixel data rather than an exact duplicate of their tape.

    Usually the updated headers are sent to the CADC as compressed tarballs via FTP.

    A copy of each header update sent to the CADC is kept in /data/ar15/header_updates/

    There are several ways to update header values. The archive utility /h/archive/sw/tools/header_update_wrapper provides an easy interface for updating many headers at once. It can also be used to extract and update header files. The usage and configuration are described in the first lines of the wrapper. Generally, run it first in the mode that extracts and operates on header files. Open a sample of the resulting headers to confirm that they look right, then run it in the mode that updates the files on disk. Tar the extracted header files and put them on the FTP site. Send a message to the CFHT contact at the CADC to alert them to the presence of updated headers.
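    The tar step can be sketched with Python's standard tarfile module. This is an illustration rather than part of the archive tools: the function name is hypothetical, and gzip is assumed as the compression, since the manual only says the tarballs are compressed.

```python
import tarfile
from pathlib import Path


def pack_headers(header_dir, tarball_path):
    """Write every regular file in header_dir into a gzip-compressed tarball.

    Returns the names stored in the archive. Write the tarball outside
    header_dir so that it is not listed among its own inputs.
    """
    names = []
    with tarfile.open(str(tarball_path), "w:gz") as tar:
        for path in sorted(Path(header_dir).iterdir()):
            if path.is_file():
                # store bare file names, not the full directory path
                tar.add(str(path), arcname=path.name)
                names.append(path.name)
    return names
```

    The returned list of names can be kept with the copy of the update stored in /data/ar15/header_updates/ as a record of what was sent.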


    Kanoa

    Last modified: Thur Sept 27 15:31:21 HST 2001