Power outage: lessons learned

User name W. Deconinck

Log entry time 10:47:43 on October 28, 2009

Entry number 296883

keyword=Power outage: lessons learned

A "lessons learned" meeting about the power outage was held at MCC this
morning. It was scheduled after the first power outage, but became even
more relevant yesterday. Many issues with the procedures were identified
(but nothing major). For the halls, these are the relevant ones:
- the pager system did not seem to work properly for many of the OPS
out-pages; secondary contact information is important
- some people were paged in and then sat idle until a while after power
was restored, because of access restrictions and network restoration
- ARMs were in short supply, extra operators should be called in when
this happens
- there should have been a notification on the switch-back from the
generator to the 40MVA power
- a lot of time was wasted on writing atlist entries, or they were not
written at all; a baseline 'power recovery' atlist will probably be
developed, with other atlists used for tasks beyond its scope

In particular for Hall A, I think the following could have been useful:
- flow chart with actions to take when power is out (admittedly, that
would be an almost empty page and depend on experiments: page RC, page
target, page tech); after the first power outage the target-on-call was
not notified, which probably led to our target fan motor failure
- we should keep a local copy of the 'staff' database with pager numbers
in case the network to the main site goes down too (it didn't this time)
- the halog should stay functional while power is off; it was available,
but new entries were only posted after a while, although they were
visible in the preview section. The OPS elog was much better in this
respect.
- the procedure for access to the halls will be improved, ODH risks need
to be considered, e.g. if only the dome reads O2 < 18% does that mean the
hall has to stay closed completely? PSS might be put on the generator
instead of just locking down access. More ARMs will be made available
for critical hall work.
- instructions and checklists for when power comes back on, e.g. check
on necessary DAQ services, check on VME and HV crates if possible;
basically information the shift crew can collect to ease the task of
the experts and to make sure we don't overlook anything until we start
- reboot DAQ and computers a while after power is back, to ensure that
they are picking up services (nfs, nis) that might still have been down
right after power was restored
- while beam is being restored, attempt a complete start-up procedure:
ramping magnets, moving target, setting HV, taking pedestal runs
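
As an illustration of the power-recovery checklist and the "are nfs/nis
back yet?" check suggested above, here is a minimal sketch in Python. It
is not an existing Hall A tool. The function names, the host names
(nfs.example.org, nis.example.org) and the choice of checks are all
assumptions made for this example; only the port numbers (2049 for NFS,
111 for the portmapper that NIS relies on) are the standard ones. A real
checklist would also cover DAQ services and the VME and HV crates, which
need site-specific checks.

```python
import socket
from typing import Callable, Dict, List, Tuple


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def run_checklist(checks: Dict[str, Callable[[], bool]]) -> List[Tuple[str, bool]]:
    """Run every named check in order; a check that raises counts as failed."""
    results = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False
        results.append((name, ok))
    return results


def format_report(results: List[Tuple[str, bool]]) -> str:
    """One line per check, suitable for pasting into a log entry."""
    return "\n".join(f"[{'OK' if ok else 'FAIL'}] {name}" for name, ok in results)


def all_ok(results: List[Tuple[str, bool]]) -> bool:
    """True only if every check passed, i.e. a reboot can now pick up services."""
    return all(ok for _, ok in results)


def example_checks() -> Dict[str, Callable[[], bool]]:
    """Hypothetical checklist; the host names are placeholders, not real servers."""
    return {
        "NFS server reachable (port 2049)": lambda: port_open("nfs.example.org", 2049),
        "portmapper for NIS reachable (port 111)": lambda: port_open("nis.example.org", 111),
    }
```

The shift crew could call format_report(run_checklist(example_checks()))
and reboot the DAQ machines only once all_ok(...) returns True, then
paste the report into the log for the experts.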

A copy of this log entry has been emailed to: rom, reimer, meekins