From cmclaughlin at fsu.edu Tue Jun 1 13:50:53 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Tue, 1 Jun 2021 17:50:53 +0000
Subject: [Hpc-notice] POWER OUTAGE UPDATE: RCC administrative and storage services coming online; HPC remains offline
Message-ID:

Hi RCC users,

The Sliger Building has been buzzing with activity for the past five days while a variety of vendors have been working to get the data floor ready for operations.

Power is fully restored. We are currently waiting for HVAC to be fully restored. We are hoping that happens within the next hour or two. After that, we will begin bringing RCC services online. We anticipate that we can start bringing things up within the next few hours. We will focus on our administrative and storage services first, and then slowly bring online the 700+ hardware nodes that comprise the HPC.

Since there are still a lot of unknowns and the power is actively being worked on while I am typing this message, we are still unsure of the exact timeline. However, we will send at least one more update out before 5pm, or when we get specific services online, whichever comes first.

We are not ready to pull CoLo customers off the generator yet. Mitch Gans will send a notice out about that when we are confident that the center is fully operational.

Thanks for your continued patience.

Best regards,
The RCC Team
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cmclaughlin at fsu.edu Tue Jun 1 17:18:36 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Tue, 1 Jun 2021 21:18:36 +0000
Subject: [Hpc-notice] POWER OUTAGE UPDATE: RCC administrative and storage services coming online; HPC remains offline
In-Reply-To:
References:
Message-ID:

Hi RCC Users,

The vendors have completed restoring (and testing) power and cooling in the Sliger Data Center. We have begun to bring up our systems, but the process is going to take at least several hours.
The Systems Team is working on our parallel storage (GPFS), archival storage, and administrative systems now. In an abundance of caution, we would like the cooling system to run overnight, so we will wait until tomorrow morning to begin powering on the High Performance Computing (HPC) cluster.

We appreciate your continued patience. We will be monitoring the support at rcc.fsu.edu email address, so if you have any concerns or issues, please let us know by sending an email to that address.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Tue Jun 1 22:20:29 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 2 Jun 2021 02:20:29 +0000
Subject: [Hpc-notice] POWER OUTAGE UPDATE: Work will continue on RCC services through Wed, June 2
In-Reply-To:
References: , ,
Message-ID:

Hi RCC Users,

We are continuing to bring up RCC services. Currently, our parallel storage system (GPFS) is online, but not yet available to users. We will resume bringing up the rest of the systems starting at 6:30am tomorrow morning.

You can expect to receive several updates throughout the day tomorrow until all our services are online. This includes:

* The HPC cluster
* The Spear interactive cluster
* Parallel storage
* Archival storage
* Customer VMs and special servers

Thanks very much for your continued patience. We will be monitoring the support at rcc.fsu.edu email address, so if you have any concerns or issues, please let us know by sending an email to that address.

Best regards,
The RCC Team
From cmclaughlin at fsu.edu Wed Jun 2 09:11:00 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 2 Jun 2021 13:11:00 +0000
Subject: [Hpc-notice] SLIGER DATACENTER MORNING REPORT: Cooling issues overnight; HPC power-up delayed by at-least a few hours
Message-ID:

Hi RCC Users,

RCC staff have been in the data center for the past few hours continuing to work to bring up systems. We are currently focusing our efforts on the following services:

1. GPFS (Parallel storage)
2. Archival Storage
3. Customer VMs and special servers

Last night, the new cooling system had an issue and did not adequately cool the data center server floor. The vendor is here and working on it this morning. We expect it to be resolved around mid-day. However, in an effort not to risk hardware issues, we will wait until the vendor corrects the cooling issue to begin bringing up the HPC.

We will send another update out at 2pm or when services are brought online, whichever comes first. In the meantime, please feel free to contact us at support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Wed Jun 2 15:15:26 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 2 Jun 2021 19:15:26 +0000
Subject: [Hpc-notice] Sliger Update: Most RCC services are online; HPC is OFFLINE until tomorrow
Message-ID:

Hi RCC Users,

All RCC services are online except for the High Performance Computing Cluster. Online services include:

* All storage systems (Archival, GPFS)
* Globus SFTP transfers
* Login nodes
* Spear interactive cluster nodes
* Open OnDemand (although you cannot submit jobs)
* The vpn.fsu.edu/hpc profile for student and guest access to the VPN
* The https://acct.rcc.fsu.edu self-service web interface

You can access storage resources via our login nodes (hpc-login.rcc.fsu.edu) or export nodes (export.rcc.fsu.edu).
However, you still cannot submit Slurm jobs to the HPC (sbatch and srun commands will time out).

We are still waiting on the vendor to get the cooling system working before we bring up the HPC, since it is by far the largest heat generator. We will begin powering on the HPC tomorrow morning, Thurs, June 3, at 8:30am. We absolutely need the cooling to be on and reliable before we power on the 600+ nodes in the HPC cluster. So, we appreciate your continued patience.

Again, if you have any questions or concerns, please direct them via email to support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Thu Jun 3 09:08:16 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Thu, 3 Jun 2021 13:08:16 +0000
Subject: [Hpc-notice] Sliger Update: Cooling failed again yesterday afternoon; HPC power-up delayed
Message-ID:

Hi RCC Users,

The cooling system in the Sliger Datacenter failed again after running for approximately four hours yesterday. RCC staff noticed the temperatures starting to rise early yesterday evening. We notified the vendors, and they are working on it. The good news is that they have identified the source of the problem: a faulty flow switch in the liquid cooling infrastructure.

We don't have a specific ETA yet for when we can bring the HPC back online today, especially since cooling failed twice before. Once we have confidence that the cooling issues have been completely resolved, we will post a message to this notice list.

In the meantime, RCC staff are taking the opportunity to shore up some infrastructure that we do have control over. In that regard, we have a faulty PDU in the rack that contains the Archival storage system. We may have to turn off the Archival system for about an hour to replace the faulty PDU. If we do that, we will send a notice to this list both when it goes down and when it comes back up.
Again, we appreciate your continued patience, and will let you know when we have news sometime today. If you have any questions or concerns, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Thu Jun 3 16:22:06 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Thu, 3 Jun 2021 20:22:06 +0000
Subject: [Hpc-notice] Sliger Data Center Update: Cooling partially restored; HPC coming up tomorrow at 8am
Message-ID:

Hi RCC Users,

Chilled water cooling at the Sliger Data Center has been restored to operation, and if it remains functional overnight, we are going to begin bringing up the HPC starting at 8am tomorrow.

RCC staff have been monitoring temperatures in Sliger, and have also requested that the contractor keep the portable, temporary cooling unit online over the weekend to give us time to react to any unexpected issues.

For the power-up now starting tomorrow morning, we will begin with the owner nodes and work towards the general access nodes. We will send another notice around 9am tomorrow with a report of the cooling situation in the morning.

In the meantime, I know we've repeatedly thanked you for your patience, but it's worth stating again. If you have any questions or concerns, please direct them to support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Fri Jun 4 10:15:50 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Fri, 4 Jun 2021 14:15:50 +0000
Subject: [Hpc-notice] User accounting issues: Please disregard any notices received since 9am this morning.
Message-ID:

Hi RCC Users,

We are aware that many users have received notifications that their account has expired and/or their sponsor has revoked privileges from their account. In addition, several users have sent notices that their accounts have been locked.
We have revoked these changes, so please disregard any emails you have received since 9am this morning. Our sincerest apologies for this issue.

Best regards,
The RCC Team

PS: Please stay tuned for further information regarding the HPC.

From cmclaughlin at fsu.edu Fri Jun 4 11:53:38 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Fri, 4 Jun 2021 15:53:38 +0000
Subject: [Hpc-notice] Sliger Outage Update: HPC is coming online (slowly and carefully)
Message-ID:

Hi RCC Users,

Here's the latest on the status of the HPC. The cooling system in Sliger has been stable for over 12 hours, and we feel somewhat confident that it is safe to power on some HPC compute nodes. However, FSU Facilities and the contractor are performing additional work and tests on the cooling system today, so we're being very cautious in case we need to power the compute nodes down again.

We are going to start with the owner-based HPC nodes (in our "M" racks), and then move on to our general access nodes. We will send another message out today if and when the following occur:

* the HPC is partially online and ready to accept job submissions, and/or
* we need to power off the HPC nodes in response to another cooling issue.

Either way, we will publish a notice no later than 6pm--probably sooner--with an update. In the meantime, all our other services (storage, VMs, etc) have been available for the past two days.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Fri Jun 4 17:10:50 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Fri, 4 Jun 2021 21:10:50 +0000
Subject: [Hpc-notice] Sliger Friday Afternoon Update: HPC partially online, FSU & RCC will monitor Data Center temps over the weekend.
Message-ID:

Hi RCC Users,

Here is the last update before the weekend (hopefully).
On the HPC, all owner nodes and GPU nodes are online, and they are processing jobs submitted via Slurm. However, part of the cluster will remain offline through the weekend, including most of the general access nodes. Jobs submitted through the backfill2 queue will run, but they may have to wait longer than usual before they start. We are asking for your patience over the weekend, and we will hopefully be able to bring more of the cluster up on Monday.

Chilled water cooling has been the hold-up for most of the week. The contractors have one of the two primary chillers online and fully tested. However, we do not believe they will have the other one online by the end of the day. We will continue to run temporary cooling along with the chiller over the weekend for some redundancy.

With only one of the two chillers running, there is no backup in case of a failure. The HPC nodes can heat up the room to near-dangerous levels very quickly, and we don't want to take any risks during this weekend. We will, therefore, not bring the whole cluster up until we have reassurance that both chillers are functional.

Refer to the image below to see some of the wild temperature swings in the datacenter these past few days:

[cid:79c2debc-24e7-4281-aef8-90d3e03ee9f9]

If the contractors can get the second cooling unit online by early next week, we will power up the full HPC as early as Monday. Either way, you can expect another message from the RCC Staff no later than 2pm on Monday.

Have a good weekend, and once again, thanks very much for your patience. As usual, comments and inquiries should be sent via email to support at rcc.fsu.edu.

Best regards,
The RCC Team
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Outlook-shiifppf.png
Type: image/png
Size: 152214 bytes
Desc: Outlook-shiifppf.png
URL:

From cmclaughlin at fsu.edu Mon Jun 7 14:02:23 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 7 Jun 2021 18:02:23 +0000
Subject: [Hpc-notice] Sliger Renovation Update: Cooling stable over the past weekend, more HPC nodes online
Message-ID:

Hi RCC users,

The cooling system remained stable over the weekend, and we are bringing up more HPC nodes, specifically some nodes that are shared between owner and general access queues. There is still a lot of renovation activity happening in the machine room, so we are being cautious about bringing compute nodes up and will continue to do so slowly over the next few days.

In order to not spam our users with too many messages, we will commit to sending emails once per day, or if something goes wrong. As such, you can expect to receive emails on this list no later than 4pm daily until the entire HPC is online.

In the case that something goes awry with the cooling system, we might have to power down some HPC nodes, but we will do everything we can to minimize the impact to your research.

If you have any comments or questions, please send via email to support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Tue Jun 8 15:40:37 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Tue, 8 Jun 2021 19:40:37 +0000
Subject: [Hpc-notice] FINAL Sliger Renovation Update: All RCC Systems Operational
Message-ID:

Hi RCC Users,

We have brought all RCC systems back online, including the HPC; all partitions/queues are available. The vendor has tested the redundant cooling, and the systems team feels comfortable that we are able to return to normal operating capacity.

As you can imagine, it has been an exhausting week for both the vendor and the RCC staff.
We are incredibly appreciative of your patience during this longer-than-expected outage while the vendors worked to correct the cooling issues.

This will be our final update unless something goes wrong. If you notice anything not working, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team