From cmclaughlin at fsu.edu Wed Oct 2 15:42:48 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 2 Oct 2019 19:42:48 +0000
Subject: [Hpc-notice] Partial HPC cluster failure
Message-ID:

Hi RCC Users,

We are experiencing an issue with a power distribution unit for several racks in the HPC. Running jobs are affected on the following racks:

1. M32
2. I29
3. I30
4. I31
5. I32
6. I35
7. I36

Jobs in the following partitions are affected:

* backfill
* backfill2
* changlani_q
* coaps18_q
* eoas19_q
* fraser_q
* genacc_q
* hongli_q
* ktaylor_q
* mecfd18_q
* medicine_q
* quicktest
* rcc_internal
* sec4m_q
* stagg_q
* stata_q
* stroupe_q
* yin19_q

In addition, the InfiniBand switch is down, so jobs in other partitions may be affected as well.

The Systems Team has been deployed and we hope to have this issue resolved soon. In the meantime, you can get status updates at: https://fla.st/2oysWFq and direct inquiries to support at rcc.fsu.edu.

Best Regards,
The RCC Team

From cmclaughlin at fsu.edu Wed Oct 2 16:04:28 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 2 Oct 2019 20:04:28 +0000
Subject: [Hpc-notice] Partial HPC cluster failure - Nodes back online.
In-Reply-To:
References:
Message-ID:

Hi RCC Users,

All of the affected nodes (see list below) are back online and operational.

Unfortunately, due to the nature of the problem, all jobs running on the affected nodes were killed. We apologize for the inconvenience, and if we can do anything, please let us know (support at rcc.fsu.edu).

List of affected racks:

1. M32
2. I29
3. I30
4. I31
5. I32
6. I35
7. I36

List of affected partitions:

* backfill
* backfill2
* changlani_q
* coaps18_q
* eoas19_q
* fraser_q
* genacc_q
* hongli_q
* ktaylor_q
* mecfd18_q
* medicine_q
* quicktest
* rcc_internal
* sec4m_q
* stagg_q
* stata_q
* stroupe_q
* yin19_q

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Thu Oct 17 13:09:06 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Thu, 17 Oct 2019 17:09:06 +0000
Subject: [Hpc-notice] Keeping an eye on Tropical Disturbance 16
Message-ID:

Hi RCC Campus Partners,

As you may have heard, a tropical disturbance is forming in the Gulf of Mexico and is threatening the greater Tallahassee area.

We do not anticipate shutdown of any services at this time, but want to make you aware of the possibility. As meteorologists become more confident in the storm path, RCC staff will make a decision on our action plan.

We will post an update no later than this weekend. If you have any questions or issues, let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Fri Oct 18 11:58:52 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Fri, 18 Oct 2019 15:58:52 +0000
Subject: [Hpc-notice] UPDATE on Tropical Disturbance 16 - All Systems to Remain online
Message-ID:

Hi RCC Campus Partners,

We are closely watching Potential Tropical Cyclone Sixteen, and at this time are planning to keep the Sliger server room in operation throughout the storm. This includes all RCC systems (HPC, Spear, VMs, and storage).

We will make a further announcement to you this weekend if expectations change.

Useful links to check out in the meantime:

* FSU Alerts
* National Hurricane Center

If you have any questions or concerns, let us know by emailing support at rcc.fsu.edu.

Best Regards,
The RCC Team
From cmclaughlin at fsu.edu Sun Oct 20 17:32:43 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Sun, 20 Oct 2019 21:32:43 +0000
Subject: [Hpc-notice] Unexpected downtime for HPC login nodes
Message-ID:

Hi RCC Campus Partners,

We are working on an internal systems issue affecting the HPC login nodes. We will post another update to this notice list as soon as the issue is resolved and the nodes are available again.

In the meantime, other services, including already-running jobs, are unaffected.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Mon Oct 21 16:07:36 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 21 Oct 2019 20:07:36 +0000
Subject: [Hpc-notice] Ongoing issues with the HPC login nodes
Message-ID:

Greetings RCC Campus Partners,

We are experiencing ongoing issues with our virtualization cluster, which are affecting the HPC login nodes. We will post another notice as soon as we have more information.

In the meantime, if you have trouble connecting to the HPC, feel free to connect to one of our login nodes directly:

    ssh [RCC_USERNAME]@hpc-login-vm1.rcc.fsu.edu
    ssh [RCC_USERNAME]@hpc-login-vm2.rcc.fsu.edu

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Tue Oct 22 09:25:16 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Tue, 22 Oct 2019 13:25:16 +0000
Subject: [Hpc-notice] RESOLVED: Ongoing issues with the HPC login nodes
Message-ID:

Hi RCC Campus Partners,

The issues with the virtualization cluster and the login nodes have been resolved. Thanks for your patience.

If you have any further issues, please let us know (support at rcc.fsu.edu).

Best regards,
The RCC Team
From cmclaughlin at fsu.edu Wed Oct 30 09:31:01 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 30 Oct 2019 13:31:01 +0000
Subject: [Hpc-notice] Globus issue
Message-ID:

Hi RCC Campus Partners,

We are currently experiencing an issue with our Research Archival Storage System.

In order to stabilize the system, we are going to un-mount the system from the export nodes and disable the endpoint in Globus.

We will post another notice to this list in a few hours or as soon as this issue is resolved.

Details: https://fla.st/2otiori

Best regards,
The RCC Team

From pvandermark at fsu.edu Wed Oct 30 09:45:42 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Wed, 30 Oct 2019 13:45:42 +0000
Subject: [Hpc-notice] Archival issue. Not Globus
In-Reply-To:
References:
Message-ID:

Dear RCC users,

We are currently experiencing some issues with the archival system. The Globus system is fully functioning. The archival system will temporarily not be available through Globus, but the Globus service itself is fine.

Best regards,
The RCC Team

On Wed, 2019-10-30 at 13:31 +0000, Casey Mc Laughlin via Hpc-notice wrote:
> Hi RCC Campus Partners,
>
> We are currently experiencing an issue with our Research Archival
> Storage System.
>
> In order to stabilize the system, we are going to un-mount the system
> from the export nodes and disable the endpoint in Globus.
>
> We will post another notice to this list in a few hours or as soon as
> this issue is resolved.
>
> Details: https://fla.st/2otiori
>
> Best regards,
> The RCC Team
> _______________________________________________
> You received this message, because you have an account with the FSU
> Research Computing Center
> More information: http://rcc.fsu.edu/connect
>
> ** More News: http://rcc.fsu.edu/news
> ** Facebook: http://facebook.com/fsurcc
> ** Twitter: http://twitter.com/fsurcc

From pvandermark at fsu.edu Wed Oct 30 12:10:08 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Wed, 30 Oct 2019 16:10:08 +0000
Subject: [Hpc-notice] Archival issue
In-Reply-To:
References:
Message-ID: <642241b163bdb2f45fec3df6bc31f05e4d670fbb.camel@fsu.edu>

Dear RCC users,

We have traced the issue with our archival system to some unusual I/O patterns, and we are trying to determine their cause. All ZFS volumes are currently unmounted, and we will bring them back online in the coming hours.

Best regards,
The RCC Team

From pvandermark at fsu.edu Wed Oct 30 14:45:58 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Wed, 30 Oct 2019 18:45:58 +0000
Subject: [Hpc-notice] Archival issue
In-Reply-To: <642241b163bdb2f45fec3df6bc31f05e4d670fbb.camel@fsu.edu>
References: <642241b163bdb2f45fec3df6bc31f05e4d670fbb.camel@fsu.edu>
Message-ID: <599ded354ffc9ff0af834149c1d9d3b0d1769e36.camel@fsu.edu>

Dear RCC users,

All archival volumes have been brought back online and the Globus fsurcc#archival endpoint has been reactivated.

The issue arose while a failed drive was being replaced, which is a pretty standard operation for a RAID configuration and usually does not impact operations. However, because of some very heavy I/O on the system, the reconstruction was repeatedly interrupted. We are looking at a way to prevent this type of perfect storm of events.

Please let us know if you still experience issues with the archival storage.

Best,
The RCC Team