<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div>Hi HPC &amp; Spear users,</div>
<div><br>
</div>
<div>This is an update with more details about the <a href="https://rcc.fsu.edu/news/2021-software-upgrade-august-16-20" title="https://rcc.fsu.edu/news/2021-software-upgrade-august-16-20">
software upgrades planned for the week of August 16 - 20</a>.</div>
<div><br>
</div>
<div><span style="font-size: 14pt;"><b>Slurm controller upgrade on Monday, August 9 at 6pm</b></span></div>
<div><br>
</div>
<div>This coming Monday, August 9 at 6pm, the Research Computing Center will upgrade the HPC cluster job scheduler (Slurm) to the latest stable release. We expect the upgrade to take no more than two hours.</div>
<div><br>
</div>
<div>During the upgrade, the cluster will keep running jobs that have already started (i.e., entered the "running" state). However, job status queries and new job submissions will be unavailable. Jobs that are waiting in the queue should resume once the job scheduler upgrade is
complete.</div>
<div><br>
</div>
<div>The job scheduler upgrade is being performed prior to the general maintenance in order to minimize downtime and compute node outages.</div>
<div><br>
</div>
<div>We advise you to monitor your running compute jobs for irregularities and to contact us if you notice any problems:
<a href="mailto:support@rcc.fsu.edu" title="mailto:support@rcc.fsu.edu">support@rcc.fsu.edu</a>. We will post an announcement to this email list once the upgrade is complete.</div>
<div><br>
</div>
<div><span style="font-size: 14pt;"><b>More details on the schedule for compute node upgrades</b></span></div>
<div><br>
</div>
<div>Over the past few weeks, the Systems Team has tightened the upgrade schedule for the compute nodes that run HPC jobs.</div>
<div><br>
</div>
<div>The general plan is to begin by upgrading the login and Spear nodes. Then, we will upgrade the free, general access resources that constitute the majority of our HPC cluster. Finally, we will upgrade the nodes with purchased resources on them ("owner nodes").</div>
<div><br>
</div>
<div>The plan for the owner nodes will be to upgrade half of them at a time, so at least 50% of the owner nodes remain online at any given time. The general procedure per set of nodes will be:</div>
<div>
<ol>
<li>Set the node to the "drain" state. In this state, the node will not accept new jobs and will try to complete as many currently running jobs as possible.</li>
<li>Wait approximately 24 hours.</li>
<li>Reinstall the operating system and software on the node.</li></ol>
</div>
<div>Also, for a few days during the week, the software versions installed on the login nodes will not match the versions on the compute nodes. This discrepancy will decrease throughout the week as we upgrade compute nodes
to the new software stack. You may encounter errors related to this mismatch when submitting jobs. In some cases, code might need to be recompiled.</div>
<div><br>
</div>
<div>Given all of this, we recommend the following to users:<br>
<ol>
<li>Do not submit any multi-day jobs during the week of August 16 - 20 if you can avoid it.
<br>
</li><li>Be prepared for partial outages and cancelled or interrupted jobs as we reinstall the operating system on nodes.</li></ol>
</div>
<div>Once again, feel free to reach out to us with any questions, concerns, or comments:
<a href="mailto:support@rcc.fsu.edu" title="mailto:support@rcc.fsu.edu">support@rcc.fsu.edu</a>.</div>
<div><br>
</div>
<div>Best regards,</div>
The RCC Team<br>
</body>
</html>