[Hpc-notice] RCC Status update: All services online

Casey Mc Laughlin cmclaughlin at fsu.edu
Mon Oct 10 15:04:37 EDT 2022


Hi RCC Users,

We are pleased to report that all RCC services are back online. Thank you for your patience while the Systems Team restored them. The following services are available again:

  *   The HPC and Spear clusters are online, and the Slurm scheduler is accepting jobs.
  *   The "/hpc" VPN profile for students, guests, and any other non-staff members is up.
  *   Open OnDemand is up.
  *   The self-service web portal and webservices (RCCTool) are up.
  *   All RCC-managed customer VMs and other hosted systems are up.
  *   Globus is up.
  *   Our storage export servers are up.

If you had jobs running before this morning, you will need to resubmit them.

The power was out from approximately 9am to 9:30am. Because the outage was unplanned, it took about five hours to bring all RCC services back online.

What caused the outage: The Orr Protection company performed a periodic inspection of the fire suppression system. This is a standard procedure, but it was the first inspection since the Sliger renovation, which finished last August. This time, the test triggered the Emergency Power Off (EPO) on the UPS. The connection between the fire suppression system and the EPO was established during the renovation but had not been documented.

The root cause has been identified and remedied. If you have any questions or notice anything that isn't working, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team