From cmclaughlin at fsu.edu Mon Oct 10 09:21:24 2022
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 10 Oct 2022 13:21:24 +0000
Subject: [Hpc-notice] POWER INCIDENT AT SLIGER: All systems offline
Message-ID:

Hi RCC Users,

We had a major power outage at the Sliger Datacenter around 9:05am this morning. Our Systems Team is currently restoring power to our systems. We will provide updates.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Mon Oct 10 13:01:44 2022
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 10 Oct 2022 17:01:44 +0000
Subject: [Hpc-notice] RCC Status Update: HPC / Slurm online; other services down
Message-ID:

Hi RCC Users,

Thanks for being patient while we bring services back up. As of 12:45pm, the HPC, Spear nodes, and Slurm are online. We are working to bring other services back as soon as possible. See below:

ONLINE

* The HPC and Spear clusters are online, and the Slurm scheduler is accepting jobs.

OFFLINE

* The "/hpc" VPN profile for students, guests, and any other non-staff members is still down.
* Open OnDemand is down.
* The self-service web portal and webservices (RCCTool) are down.
* All RCC managed customer VMs are down.
* Globus is down.
* Our export servers are down.

We will be posting throughout the afternoon as systems come back online. If you have any questions or comments, please direct them to support at rcc.fsu.edu.

Best regards,
Casey

--
Casey McLaughlin
Support Coordinator | Research Computing Center
Information Technology Services | Florida State University
p 850.644.6270 | w rcc.its.fsu.edu

From cmclaughlin at fsu.edu Mon Oct 10 15:04:37 2022
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 10 Oct 2022 19:04:37 +0000
Subject: [Hpc-notice] RCC Status update: All services online
Message-ID:

Hi RCC Users,

We are pleased to report that all RCC services are back online. Thanks for being patient while the Systems Team worked to bring everything back. This includes the following services:

* The HPC and Spear clusters are online, and the Slurm scheduler is accepting jobs.
* The "/hpc" VPN profile for students, guests, and any other non-staff members is up.
* Open OnDemand is up.
* The self-service web portal and webservices (RCCTool) are up.
* All RCC managed customer VMs and other hosted systems are up.
* Globus is up.
* Our storage export servers are up.

If you had jobs running before this morning, you will need to resubmit them.

The power was out from approximately 9am to 9:30am. Because the outage was unplanned and unexpected, it took about five hours to bring all RCC services back online.

What caused the outage: The Orr Protection company performed a standard, periodic inspection of the fire suppression system. This is a standard procedure, but it was the first such inspection since the Sliger renovation finished last August. This time, the test triggered the Emergency Power Off (EPO) on the UPS. The connection between the fire suppression system and the EPO was established during the renovation but had not been documented. The root cause has been identified and remedied.

If you have any questions or notice anything that isn't working, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team

From pvandermark at fsu.edu Tue Oct 11 10:13:56 2022
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Tue, 11 Oct 2022 14:13:56 +0000
Subject: [Hpc-notice] Power outage in Sliger
Message-ID:

Dear Sliger colocation customers and RCC users,

Yesterday, we experienced an unplanned power outage in our Sliger data center. It was an unfortunate fluke of standard maintenance and an undocumented feature.

During our renovation last spring, the contractor installed a new fire-suppression system. In addition, a new safety feature was added, connecting that system with the emergency power-off switch on our UPS. However, the contractor did not add that feature to the wiring diagrams, so when Orr Protection performed a routine test on the new system, it turned off the UPS and thereby turned off power in the server room.

RCC staff returned power to most of the colocation customers within 15 minutes. Unfortunately, many customers still had to come to the data center to reset or turn on their equipment. Because of its complexity, the RCC HPC system took several hours to become fully operational.

On the afternoon of Monday, October 10th, FSU's Department of Environmental Health and Safety put a permanent fix in place for the issue. We are therefore confident that this was a unique occurrence.

We are genuinely sorry for any inconvenience this has caused.

Best Regards,
Paul

--
Paul van der Mark, PhD
Director, Research Computing Center
Information Technology Services
Florida State University
Phone: 850.644.0193
its.fsu.edu | rcc.fsu.edu
https://fsu.zoom.us/my/pvandermark

From cmclaughlin at fsu.edu Wed Oct 26 13:00:49 2022
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 26 Oct 2022 17:00:49 +0000
Subject: [Hpc-notice] HPC storage system resolved
Message-ID:

Hi HPC users,

The HPC experienced a brief outage due to an underlying filesystem issue with our parallel storage system (GPFS). You might want to check any jobs that were running during the last hour.

We resolved this, but if you see any further issues, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Thu Oct 27 15:58:14 2022
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Thu, 27 Oct 2022 19:58:14 +0000
Subject: [Hpc-notice] MATLAB license issues
Message-ID:

Hello MATLAB users,

We encountered a system-side issue when attempting to install a new MATLAB license. We are currently working on the issue and will let you know when it is resolved. In the meantime, MATLAB may not be available on RCC systems.

If you do not need any high performance computing capabilities, we recommend using the MATLAB instance on the FSU Virtual Lab (https://its.fsu.edu/service-catalog/desktop-and-mobile-computing/its-software/myfsuvlab) while we work on resolving the issue.

Thanks for your patience.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Thu Oct 27 17:17:06 2022
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Thu, 27 Oct 2022 21:17:06 +0000
Subject: [Hpc-notice] RESOLVED: MATLAB Issues
Message-ID:

Hi MATLAB users,

The MATLAB license issues are resolved. You should no longer see an error when you run the process or start an Open OnDemand MATLAB session, nor should you see any license expiration notices.

If you started a session during the outage, you may still encounter issues. Simply start a new MATLAB job, either in Open OnDemand or on the terminal.

If you continue to experience issues, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team