[Hpc-notice] REMINDER: System maintenance to occur the week of August 3 - 7
Paul Van Der Mark
pvandermark at fsu.edu
Wed Jul 29 14:43:42 EDT 2020
Dear RCC partners,
There is a possibility of a storm heading towards Tallahassee. Although we plan to continue with the system maintenance next week, we heard from Lenovo that, in the worst case, the company will not allow technicians to travel to Tallahassee. In that scenario, we will have to postpone our maintenance by one week, to Monday, August 10 through Friday, August 14.
This Friday we will have a better picture of how potential tropical cyclone #9 is developing and of whether Lenovo will allow a technician to travel to Tallahassee.
Best regards,
The RCC Team
________________________________
From: hpc-staff <hpc-staff-bounces at lists.fsu.edu> on behalf of Casey Mc Laughlin via hpc-staff <hpc-staff at lists.fsu.edu>
Sent: Thursday, July 23, 2020 11:49 AM
To: JESfwd-hpc-notice <hpc-notice at lists.fsu.edu>
Cc: Casey Mc Laughlin <cmclaughlin at fsu.edu>; JESfwd-hpc-staff <hpc-staff at lists.fsu.edu>
Subject: [hpc-staff] REMINDER: System maintenance to occur the week of August 3 - 7
Hi RCC Campus Partners,
This is a reminder that, as of the time of this message, we are planning to perform system maintenance the week of Monday, August 3 through Friday, August 7.
See below for details.
Affected services
The affected services include:
* all HPC and Spear services, including login nodes, parallel storage, and compute nodes,
* all Research Archival volumes,
* all VMs, including those that are hosted for customers
Services not affected include:
* Most data center hosting customers will remain online; we've already reached out and have been working with customers affected by the maintenance.
Scope of work
During this maintenance, we will upgrade all major software on the HPC and Spear systems. In particular, we will:
1. upgrade the software that powers our parallel storage system (GPFS)
2. perform hardware maintenance on the Research Archival System
3. improve our power infrastructure
4. upgrade our scheduler software, Slurm, to the latest version (v20.02 as of the time of this message)
5. reorganize part of our network configuration and update firmware on our switches
6. update the software on our database server
7. optimize our HPC InfiniBand network
We originally reported that most services wouldn't be down for the entire week, but as we move closer to the scheduled maintenance date, we realize that this is unlikely in practice. We will, however, notify you if any services can resume sooner than expected.
Draft schedule
We plan to send daily notices throughout the week. This schedule is subject to change, and we will notify you if it does.
* Friday, July 31 at 9am
    * We will begin draining HPC compute nodes and disable new job submissions. This means that we will configure each node to shut off once all the jobs running on it complete.
* Monday, August 3 at 7am
    * We will disable access to the following systems and services:
        * HPC Login nodes
        * Spear nodes
        * Export nodes (GPFS and Archival storage) and Globus
    * Lenovo consultants will begin maintenance on the storage system software (GPFS and Archival) promptly at 7am. All users who wish to retrieve data from the system should do so by this time.
* Tuesday, August 4 at 9am
    * Conditioned Air and Power will arrive to perform work on Power Distribution Unit "D".
    * Affected colocation customers have already been notified, and we are working with individual campus units to minimize impact. Nevertheless, send us a message<mailto:support at rcc.fsu.edu> if you have any concerns or questions.
* Wednesday & Thursday, August 5 and 6
    * The above work will continue.
* Friday, August 7 at 5pm
    * We expect all systems to be back online by this time, but we will let you know if any residual issues remain.
Questions or issues?
If we are able to provide access to any service earlier than expected, we will do so and notify you.
If you have any questions, issues, or requests, please let us know: support at rcc.fsu.edu<https://rcc.fsu.edu/support>.
Best regards,
The RCC Team