[Hpc-notice] Updates on the upcoming software maintenance Aug 16 - 20
Casey Mc Laughlin
cmclaughlin at fsu.edu
Thu Aug 5 13:44:18 EDT 2021
Hi HPC & Spear users,
This is an update with more details about the software upgrades planned for the week of August 16 - 20<https://rcc.fsu.edu/news/2021-software-upgrade-august-16-20>.
Slurm controller upgrade on Monday, August 9 at 6pm
This coming Monday, August 9th at 6pm, the Research Computing Center will perform an upgrade of the HPC cluster job scheduler to the latest stable release. We anticipate the upgrade taking no more than two hours.
During the upgrade, the cluster will keep running jobs that have already started; i.e., entered the "running" state. However, job status queries and new job submissions will be unavailable. Jobs that are waiting in the queue should resume normal scheduling once the job scheduler upgrade is complete.
The job scheduler upgrade is being performed prior to the general maintenance in order to minimize downtime and compute node outages.
We advise you to monitor your running compute jobs for irregularities and contact us if you need to: support at rcc.fsu.edu. We will post an announcement to this email list once the upgrade is complete.
More details on the schedule for compute node upgrades
Over the past few weeks, the Systems Team has tightened the upgrade schedule for the compute nodes that run HPC jobs.
The general plan is to begin by upgrading the login and Spear nodes. Then, we will upgrade the free, general access resources that constitute the majority of our HPC cluster. Finally, we will upgrade the nodes with purchased resources on them ("owner nodes").
The plan for the owner nodes will be to upgrade half of them at a time, so at least 50% of the owner nodes remain online at any given time. The general procedure per set of nodes will be:
1. Set the node to "drain" state. In this state, the node will not accept new jobs and will try to complete as many currently running jobs as possible.
2. Wait approximately 24 hours.
3. Reinstall the operating system and software on the node.
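For readers who want to picture the rolling procedure above, here is a minimal Python sketch of it. It is an illustration only, not the Systems Team's actual tooling: the node names, the function names, and the "reinstall" placeholder step are hypothetical. The only real command it references is Slurm's `scontrol update NodeName=... State=DRAIN Reason=...`, and the sketch builds that command as text rather than running it.

```python
from typing import List, Tuple


def split_in_halves(nodes: List[str]) -> Tuple[List[str], List[str]]:
    """Split the node list into two batches so one batch stays online.

    With an odd node count the second batch is the larger one.
    """
    mid = len(nodes) // 2
    return nodes[:mid], nodes[mid:]


def drain_command(node: str, reason: str = "software upgrade") -> str:
    """Build (but do not run) the Slurm command that drains one node."""
    return f'scontrol update NodeName={node} State=DRAIN Reason="{reason}"'


def upgrade_plan(nodes: List[str], wait_hours: int = 24) -> List[str]:
    """List the steps for each batch: drain, wait, then reinstall."""
    steps: List[str] = []
    for batch in split_in_halves(nodes):
        if not batch:
            continue  # nothing to do for an empty batch
        steps.extend(drain_command(node) for node in batch)
        steps.append(f"wait approximately {wait_hours} hours")
        # Placeholder text; the real reinstall process is not described here.
        steps.extend(f"reinstall {node}" for node in batch)
    return steps


if __name__ == "__main__":
    # Hypothetical owner-node names, for illustration only.
    for step in upgrade_plan(["owner-01", "owner-02", "owner-03", "owner-04"]):
        print(step)
```

Because the second batch is not drained until the first has been reinstalled, roughly half of the nodes stay available at each stage.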
Also, there will be a mismatch for a few days during the week between the software versions that are installed on the login nodes and the versions that are on the compute nodes. This discrepancy will decrease throughout the week as we upgrade compute nodes to the new software stack. You may encounter errors related to this when submitting jobs. In some cases, code might need to be recompiled.
Given all of this, we recommend the following to users:
1. Do not submit any multi-day jobs during the week of August 16 - 20 if you can avoid it.
2. Be prepared for partial outages and cancelled or interrupted jobs as we reinstall the operating system on nodes.
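If it helps when planning submissions, the following Python sketch checks whether a job's expected run time would overlap the maintenance week. The exact window boundaries (midnight on August 16 through the end of August 20) are an assumption made for illustration; this notice only gives the dates. The function name is also hypothetical.

```python
from datetime import datetime, timedelta

# Assumed window: start of Aug 16 up to (not including) the start of Aug 21.
WINDOW_START = datetime(2021, 8, 16)
WINDOW_END = datetime(2021, 8, 21)


def overlaps_maintenance(start: datetime, walltime: timedelta) -> bool:
    """Return True if a job running from `start` for `walltime` touches the window."""
    end = start + walltime
    return start < WINDOW_END and end > WINDOW_START


if __name__ == "__main__":
    # A three-day job submitted on Saturday, Aug 14 would run into the window.
    print(overlaps_maintenance(datetime(2021, 8, 14), timedelta(days=3)))
```

A job that the check flags is a candidate for a shorter time limit or for submission after the maintenance week.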
Once again, feel free to reach out to us with any questions, concerns, or comments: support at rcc.fsu.edu.
Best regards,
The RCC Team