Data handling scripts (to MSS and deletion)

From Hall A Wiki
Revision as of 15:20, 28 August 2021 by Rom (Talk | contribs)


On the ADAQ cluster, running from the adaq account, there are cron scripts which automatically copy the raw data to the MSS and, if necessary, delete files from the local data disks. This has been in place for 20+ years for the HRS DAQ, the Parity DAQ, and the Moller and Compton polarimeters. At times it was deployed for other DAQ systems too, like BDX and PEPPO. If you are a maintainer of a DAQ system, please contact me (Bob Michaels), because there are coordination rules to follow: e.g. filenames must be unique and of limited size, and they should contain agreed-upon keywords like the experiment name.
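To illustrate the kind of naming convention this implies, here is a hypothetical filename check. The length limit, character pattern, and function name are illustrative assumptions, not the actual rules:

```python
import re

def filename_ok(name, experiment, max_len=64):
    """Hypothetical raw-data filename check: the agreed-upon experiment
    keyword must appear in the name, the name must be of limited length,
    and only safe filename characters are allowed. All three rules are
    illustrative guesses at the conventions, not the real ones."""
    return (len(name) <= max_len
            and experiment in name
            and re.fullmatch(r"[\w.]+", name) is not None)
```

A name like "parity_run_1234.dat.0" would pass for experiment "parity"; a name with spaces, or one longer than the limit, would not.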

For most users, the main rule is this: DON'T INTERFERE! Please do not delete, move, or rename raw data files. Especially do not delete anything! The risk is that an "rm" command wipes out weeks of precious beam time. That is very unlikely to happen because we copy promptly, but the possibility exists. Please have the self-discipline to not "rm" from a "data" disk. Also, please do not put any files on a disk whose name contains the string "data". Those "data" disks are for raw data only. There are separate work disks and scratch disks for files that are not raw data, for example ROOT output files. Thank you for tolerating my paranoia.

Briefly, the way the scripts work is as follows:

1. MSS copying: All files on the data disks are checked to see if they are already in the appropriate area of the MSS, and if not we put them there using "jput". There is a probability of order 1% that a jput fails. These failures are logged, and another script comes along and retries any failed copies. Essentially we try an infinite number of times to "jput", though in practice I think I've never seen two or more failures for the same file. There are various things that can go wrong, e.g. we logged that the file was copied but it actually was not, or we copied it but there is no duplicate. There are about 15 such rare failure modes. For each of these, an email is sent to Bob Michaels, and I check the log files and fix the problem by hand.
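A minimal sketch of this pass-and-retry pattern, with the actual "jput" invocation abstracted into a `put` callable (the function name and the shape of the failure log are assumptions, not the real scripts):

```python
def jput_pass(files, put):
    """One cron pass: attempt to copy each file to the MSS, and return
    the list of files whose copy failed. A later pass is fed that list,
    so every file is effectively retried until it succeeds.

    `put` stands in for invoking JLab's jput command; assume it returns
    True on success and False on failure.
    """
    failed = []
    for f in files:
        if not put(f):
            failed.append(f)   # recorded for the next retry pass
    return failed
```

In the real setup the failure list lives in a log file between cron invocations; here it is simply returned so the retry loop is visible.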

2. File deletion: A cleanup script runs which first checks disk usage. If the disks are getting full, it starts to delete files, provided they are in the MSS, in duplicate, with the same name and byte count as the file on the local disk that is to be deleted. We would rather not delete files, because it's useful to the experiment to have them locally. Therefore, the aggressiveness of the cleanup script is tuned to how full the disk is: specifically, the amount of time a file can stay on disk is a function of the percentage of disk space available. Lately this typically works out to about 3 weeks. If the disks get close to full, an email is sent to Bob Michaels.
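The deletion policy above can be sketched as two small functions. The specific thresholds and retention times are made-up illustrations of "aggressiveness tuned to disk fullness" (only the roughly-three-weeks typical case comes from the text), and the helper names are assumptions:

```python
def retention_days(percent_free):
    """Illustrative mapping from free-disk percentage to how long a file
    may stay on the local data disk before becoming a deletion candidate.
    The thresholds are guesses; the text only says the typical result
    lately is about 3 weeks."""
    if percent_free > 50:
        return 60      # plenty of room: keep files for months
    if percent_free > 25:
        return 21      # typical case: about three weeks
    if percent_free > 10:
        return 7
    return 1           # nearly full: delete aggressively

def safe_to_delete(name, size, mss_copies, age_days, percent_free):
    """Illustrative deletion test: the file must be old enough for the
    current disk pressure, and must exist on the MSS in duplicate with
    the same name and byte count as the local file."""
    if age_days < retention_days(percent_free):
        return False
    matches = sum(1 for c in mss_copies
                  if c["name"] == name and c["size"] == size)
    return matches >= 2   # require a duplicate on tape
```

Note that a size mismatch between the local file and the tape copies blocks deletion entirely, which is the conservative behavior the text describes.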