From cmclaughlin at fsu.edu Thu Aug 5 13:44:18 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Thu, 5 Aug 2021 17:44:18 +0000
Subject: [Hpc-notice] Updates on the upcoming software maintenance Aug 16 - 20
Message-ID:

Hi HPC & Spear users,

This is an update with more details about the software upgrades planned for the week of August 16 - 20.

Slurm controller upgrade on Monday, August 9 at 6pm

This coming Monday, August 9th at 6pm, the Research Computing Center will upgrade the HPC cluster job scheduler to the latest stable release. We anticipate the upgrade taking no more than two hours.

During the upgrade, the cluster will keep running jobs that have already started; i.e., entered the "running" state. However, job status queries and new job submissions will be unavailable. Jobs that are waiting in the queue should resume once the job scheduler upgrade is complete.

The job scheduler upgrade is being performed prior to the general maintenance in order to minimize downtime and compute node outages.

We advise you to monitor your running compute jobs for irregularities and contact us if you need to: support at rcc.fsu.edu. We will post an announcement to this email list once the upgrade is complete.

More details on the schedule for compute node upgrades

Over the past few weeks, the Systems Team has tightened the upgrade schedule for the compute nodes that run HPC jobs.

The general plan is to begin by upgrading the login and Spear nodes. Then, we will upgrade the free, general access resources that constitute the majority of our HPC cluster. Finally, we will upgrade the nodes with purchased resources on them ("owner nodes"). The owner nodes will be upgraded half at a time, so at least 50% of the owner nodes remain online at any given time.

The general procedure per set of nodes will be:

1. Set the node to "drain" state. In this state, the node will not accept new jobs and will try to complete as many currently running jobs as possible.
2. Wait approximately 24 hours.
3. Reinstall the operating system and software on the node.

Also, for a few days during the week there will be a mismatch between the software versions installed on the login nodes and the versions on the compute nodes. This discrepancy will decrease throughout the week as we upgrade compute nodes to the new software stack. You may encounter errors related to this when submitting jobs. In some cases, code might need to be recompiled.

Given all of this, we recommend the following to users:

1. Do not submit any multi-day jobs during the week of August 16 - 20 if you can avoid it.
2. Be prepared for partial outages and cancelled or interrupted jobs as we reinstall the operating system on nodes.

Once again, feel free to reach out to us with any questions, concerns, or comments: support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Mon Aug 9 13:50:57 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 9 Aug 2021 17:50:57 +0000
Subject: [Hpc-notice] REMINDER: hpc-controller maintenance TONIGHT (Aug 9) at 6pm
Message-ID:

Hi HPC & Spear users,

Today at 6pm, the Research Computing Center will upgrade the HPC cluster job scheduler to the latest stable release. We anticipate the upgrade taking no more than two hours.

During the upgrade, the cluster will keep running jobs that have already started; i.e., entered the "running" state. However, job status queries and new job submissions will be unavailable. Jobs that are waiting in the queue should resume once the job scheduler upgrade is complete.

The job scheduler upgrade is being performed prior to the general maintenance (starting next Monday, August 16) in order to minimize downtime and compute node outages.

We advise you to monitor your running compute jobs for irregularities and contact us if you need to: support at rcc.fsu.edu.
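One possible way to do the monitoring suggested above once the scheduler is reachable again is sketched here. It is a minimal illustration, not an RCC tool: it assumes the standard Slurm `squeue` client is on your path, and the function names `count_job_states` and `my_job_states` are made up for this example.

```python
import getpass
import subprocess
from collections import Counter


def count_job_states(squeue_text: str) -> Counter:
    """Count jobs per state from `squeue -h -o "%i %T"` output.

    Each input line is expected to hold 'JOBID STATE'.
    """
    states = Counter()
    for line in squeue_text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            states[parts[1]] += 1
    return states


def my_job_states() -> Counter:
    """Ask Slurm for the current user's jobs (needs a working scheduler)."""
    result = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-o", "%i %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_job_states(result.stdout)
```

Calling `my_job_states()` before and after the maintenance window and comparing the two counts is one simple way to spot jobs that disappeared or changed state unexpectedly.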
We will post an announcement to this email list once the upgrade is complete.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Mon Aug 9 20:09:18 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Tue, 10 Aug 2021 00:09:18 +0000
Subject: [Hpc-notice] HPC cluster job scheduler upgrade completed
Message-ID:

Hi HPC & Spear users,

We are pleased to report that the HPC cluster job scheduler upgrade is now complete.

We are ready for the next phase of the software upgrade, which will begin next Monday, August 16. Full details are on our website.

Please let us know if you have issues with any of your jobs: support at rcc.fsu.edu.

Best regards,
Casey

--
Casey McLaughlin
Research Computing Center
Information Technology Services | Florida State University
p 850.644.6270 | w rcc.its.fsu.edu

From mgans at fsu.edu Fri Aug 13 09:36:21 2021
From: mgans at fsu.edu (Mitch Gans)
Date: Fri, 13 Aug 2021 13:36:21 +0000
Subject: [Hpc-notice] 8/13 Tropical Depression Fred Update
Message-ID:

Greetings,

We have been watching reports about Tropical Depression Fred. We have completed checks at the Sliger Data Center and are prepared.
We are expecting wind and heavy rainfall, but are uncertain at this time as to the effects of the storm's landfall, which is anticipated later this weekend.

We will provide you with another update if predictions or conditions significantly change.

If you have any questions or concerns, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team

RCC @ FSU now recycles can rings

Mitch Gans
Florida State University
2035 E. Paul Dirac Drive
Tallahassee, FL 32306-2760
mgans at fsu.edu
Sliger Data Center: (850) 644-8555
Cell: (850) 591-6193
Fax: (850) 644-8722

From pvandermark at fsu.edu Sun Aug 15 20:03:18 2021
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Mon, 16 Aug 2021 00:03:18 +0000
Subject: [Hpc-notice] 8/13 Tropical Depression Fred Update
In-Reply-To:
References:
Message-ID:

Dear RCC users,

It looks like Tropical Storm Fred is moving towards Tallahassee. We are continuing to monitor the progress of the storm, and at the moment we plan to stay operational as much as possible.

Because of the storm, FSU HR has restricted the campus to essential personnel only for Monday, August 16th. We therefore advise all of our colocation customers to postpone any non-essential visits to the Sliger server room until Tuesday.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Mon Aug 16 16:38:48 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 16 Aug 2021 20:38:48 +0000
Subject: [Hpc-notice] Software upgrade has begun
Message-ID:

Hello HPC and Spear users,

We have begun the process of upgrading the software on the login, compute, and Spear nodes in the HPC cluster.

As of this afternoon, we have taken one login node offline to upgrade it. Once we are sure that it is working well, we will upgrade the other two login nodes. We have also begun upgrading the compute nodes, starting with the oldest nodes in the cluster.

As of now, the owner nodes remain fully functional. We will post another update before we set those nodes to "drain" (not accept new jobs). As a reminder, at least half of the owner nodes will remain online at any given time.

Details are on our web site at: https://rcc.fsu.edu/news/2021-software-upgrade-august-16-20.

If you have any questions or comments, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Wed Aug 18 16:32:10 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 18 Aug 2021 20:32:10 +0000
Subject: [Hpc-notice] HPC/Spear software upgrade: Owner nodes upgrade starting tomorrow; login node upgrades in process
Message-ID:

Hi HPC & Spear users,

We are continuing to work on the software upgrade.
As of now, we've upgraded a significant number of the compute nodes in the free "general access" section of the HPC cluster.

We will begin upgrading compute nodes in our owner pool tomorrow, starting with our "hpc-m35-..." and "hpc-m36-..." nodes. As of this afternoon, we have already set these nodes to not accept new job submissions. So long as your jobs do not request particular nodes through node constraints, you should be able to continue to submit jobs as usual.

In the meantime, we continue to work on ensuring the most popular packages are working on the new CentOS 8 builds.

If you have any questions or concerns, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Thu Aug 19 18:16:38 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Thu, 19 Aug 2021 22:16:38 +0000
Subject: [Hpc-notice] Thursday, August 19 HPC software update status
Message-ID:

Hi HPC & Spear users,

We are continuing to make progress with the RCC software update.

The login nodes have been upgraded to CentOS 8 and the new software stack. A few users have reported issues with software that they run, and we are working to correct those issues. We encourage you to try your jobs and report any issues to our support system: support at rcc.fsu.edu.

We also upgraded half of the owner compute nodes, and we plan on completing the other half tomorrow. All owner queues/partitions remain available during this upgrade.

We mentioned in an earlier announcement that we were making some changes to the Linux Environment Modules. If you have any "module load ..." directives in your submit scripts, you will probably be affected by these changes. We've published detailed documentation about these changes on our website. The basic takeaway is that modules are now hierarchical, which means that "module load intel-openmpi" now becomes "module load intel openmpi" (note the space instead of the dash).
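For anyone with many submit scripts, the rename described above could be applied with a small helper along these lines. This is only a sketch under stated assumptions: the `RENAMES` table and the function name `update_module_loads` are hypothetical, and the only mapping taken from this announcement is "intel-openmpi" to "intel openmpi". Any other entries would need to be checked against the RCC module documentation before use.

```python
import re

# Hypothetical table of old flat module names and their hierarchical forms.
# Only the "intel-openmpi" entry comes from the announcement itself.
RENAMES = {"intel-openmpi": "intel openmpi"}

# Matches lines of the form "module load <names...>", keeping indentation.
MODULE_LOAD = re.compile(r"^(\s*module\s+load\s+)(.*)$")


def update_module_loads(script_text: str, renames=RENAMES) -> str:
    """Rewrite known old-style names on `module load` lines.

    Lines that are not `module load` directives are returned unchanged.
    """
    out = []
    for line in script_text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]  # preserve the original line ending
        match = MODULE_LOAD.match(body)
        if match:
            names = [renames.get(n, n) for n in match.group(2).split()]
            body = match.group(1) + " ".join(names)
        out.append(body + ending)
    return "".join(out)
```

Reading a submit script, passing its text through `update_module_loads`, and reviewing the result before writing it back keeps the change easy to inspect.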
The old module syntax will work for a time, but we encourage everyone to update their scripts to the new format.

Finally, Python3 is now the default version. We are working on an environment module to load Python2. If you need that sooner rather than later, please let us know (support at rcc.fsu.edu).

Thanks for being patient while we work on this upgrade, and let us know if you encounter any unexpected issues. We will post another status update tomorrow.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Fri Aug 20 16:25:00 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Fri, 20 Aug 2021 20:25:00 +0000
Subject: [Hpc-notice] HPC & Spear Friday update
Message-ID:

Hi HPC and Spear users,

We have made significant progress with the software upgrade, but we are not quite there yet. We will continue to work through the weekend.

Most of the HPC cluster has been upgraded, including the owner nodes, and there are many jobs running on the cluster now. We are now working through the remainder of the nodes and some software issues that have cropped up due to the upgrade.

If you notice an issue with a job, please report it: support at rcc.fsu.edu. Thanks for your patience while we work through support requests.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Mon Aug 23 16:11:16 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Mon, 23 Aug 2021 20:11:16 +0000
Subject: [Hpc-notice] Monday update: Software upgrade
Message-ID:

Hi HPC and Spear users,

The upgrade is mostly complete, with some exceptions. Notably, we encountered some unexpected issues when upgrading the GPU nodes, the Spear nodes, and some owner nodes in the M31 rack. The Systems Team is working through those now.

In the meantime, most of our compute nodes are online and the scheduler is accepting & running jobs.
Also, we are cleaning up configuration issues and fixing users' software issues that came up during the upgrade. We appreciate all the bug reports that have already come in, and we encourage you to keep them coming. If you find an issue, please send us an email at support at rcc.fsu.edu.

Best regards,
The RCC Team

From cmclaughlin at fsu.edu Fri Aug 27 13:42:48 2021
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Fri, 27 Aug 2021 17:42:48 +0000
Subject: [Hpc-notice] Friday update regarding software upgrade
Message-ID:

Hi HPC & Spear users,

One last update for the software upgrade project before the weekend.

Except for a handful of systems, all the Spear and HPC compute nodes, including the GPU nodes, have been upgraded to CentOS 8. The project won't be fully complete, however, until all the software issues are resolved. Our software team is meeting daily to triage and work on issues. We will continue to do so until every issue related to the upgrade is resolved.

We'd like to remind everyone who has custom code that it will likely need to be recompiled for the new software stack. As with any software upgrade, shared libraries have changed; some have been removed in favor of newer alternatives, and some have new versions. Please let us know if you need help recompiling your software.

For those of you who have outstanding software upgrade requests, we appreciate your patience. In our triage process, we've been prioritizing getting existing packages working before attending to new software requests. This will be the case for the next few days at least.

If you have any questions or comments, please let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team
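One quick way to tell whether a custom binary is affected by the shared library changes mentioned above is to look for unresolved libraries in the output of the standard Linux `ldd` tool. The sketch below is an illustration only, not an RCC utility; the function names `missing_libraries` and `check_binary` are made up for this example, and it assumes `ldd` marks unresolved libraries with "=> not found", as the glibc version does.

```python
import subprocess


def missing_libraries(ldd_text: str) -> list:
    """Return the library names that `ldd` reported as 'not found'."""
    missing = []
    for line in ldd_text.splitlines():
        if "=> not found" in line:
            # The library name is the part before the "=>" arrow.
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path: str) -> list:
    """Run `ldd` on a binary and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)
```

A non-empty list from `check_binary` for one of your executables suggests it was linked against a library that is no longer present, which is a good sign that it needs to be recompiled against the new software stack.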