From cmclaughlin at fsu.edu Wed May 1 15:01:09 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Wed, 1 May 2019 19:01:09 +0000
Subject: [Hpc-notice] Reminder - HPC and Spear maintenance to occur NEXT WEEK - May 6 - May 12
Message-ID: 

Hi RCC Users,

Starting on Monday at 7am, we will perform maintenance on our HPC and Spear
clusters. During this time, the HPC and Spear clusters will be unavailable.
We will also be performing brief maintenance on our storage system.

The maintenance window will begin on Monday, May 6 at 7am and last for one
week. All systems will be back online no later than Monday, May 13 at 9am.

Additionally, our storage systems (GPFS and Archival) will be offline from
Monday, May 6 at 7am until 12pm (approximately 4-5 hours). We will send a
notice out as soon as the storage systems are available. Some systems may be
available earlier.

We have timed the upgrade to occur between academic semesters in the hope of
minimizing potential impact on research activities.

What we are doing

The 2019 software upgrade will allow us to accomplish the following:

* Upgrade over 500 software packages to new versions (list and details)
* Upgrade the Slurm scheduler to version 18.08 (release notes)
* Run new benchmarks on the HPC and post results on our website
* Upgrade the network hardware configuration on portions of the HPC cluster
* Perform critical storage system maintenance activities

Services Affected

* GPFS and Archival storage will be unavailable briefly on Monday, May 6
  from 9am until no later than 12pm.
* HPC and Spear will remain offline all week until Monday, May 13 at 9am.

On Monday, we will perform brief maintenance on our Archival and GPFS
storage systems. We expect to have these services back online very quickly.
You will be able to read and write data via Globus and SFTP/RSYNC for the
remainder of the maintenance period.

The "SKY" VM cluster will not be affected and will remain online throughout
the maintenance period.

Tentative Schedule

* Friday, May 3 - 9am
  * We will begin draining HPC compute nodes.
* Sunday, May 5 - 5pm
  * We will disable HPC job submission in Slurm. The cluster will stop
    accepting new jobs at this time. Already-running jobs will continue to run.
* Monday, May 6 - 7am - MAINTENANCE BEGINS
  * We will disable access to the following systems:
    * HPC login nodes
    * Spear nodes
    * Export nodes (GPFS and Archival storage)
  * We will turn off and rebuild HPC login nodes and compute nodes. Any jobs
    running at this time will be cancelled.
* Monday, May 6 - 12pm (or earlier)
  * We will restore access to the Export nodes (GPFS and Archival storage).
* Saturday, May 11 - 9am
  * We will run benchmarks and tests on the HPC and Spear.
* Monday, May 13 - 9am
  * HPC and Spear will be back online.

If we are able to provide access to any service early, we will do so and
notify RCC users.

Summary

We will publish updates and schedule changes as we get closer to the
maintenance window. In the meantime, we appreciate your patience and
support. If you have any questions, issues, or requests, please let us know:
support at rcc.fsu.edu.

Best regards,
The RCC Team
Research Computing Center
Information Technology Services | Florida State University
w rcc.its.fsu.edu
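For users deciding whether to queue more work before the window, a quick way
to estimate which jobs will clear before Monday, May 6 at 7am is to compare
each job's projected end time against the maintenance start. The sketch below
is illustrative only and not an RCC-provided tool; it assumes squeue is on
your PATH and that end times print in Slurm's default ISO-like format (the
%i and %e format fields).

    #!/usr/bin/env python3
    """Estimate which of my Slurm jobs will finish before the maintenance window.

    A minimal sketch, not an RCC-provided tool. Assumes `squeue` is on PATH
    and prints end times in Slurm's default ISO-like format.
    """
    import getpass
    import subprocess
    from datetime import datetime

    MAINTENANCE_START = datetime(2019, 5, 6, 7, 0)  # Monday, May 6 at 7am

    # %i = job id, %e = projected end time; -h suppresses the header line
    out = subprocess.run(
        ["squeue", "-u", getpass.getuser(), "-h", "-o", "%i|%e"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in out.splitlines():
        job_id, end = line.split("|")
        if end in ("N/A", "NONE", ""):
            print(f"job {job_id}: no projected end time yet (likely still pending)")
            continue
        ends_at = datetime.fromisoformat(end)
        verdict = ("should finish in time" if ends_at <= MAINTENANCE_START
                   else "will be cancelled at 7am Monday")
        print(f"job {job_id}: projected end {ends_at:%a %b %d %H:%M} -> {verdict}")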
From cmclaughlin at fsu.edu Fri May 3 13:42:54 2019
From: cmclaughlin at fsu.edu (Casey Mc Laughlin)
Date: Fri, 3 May 2019 17:42:54 +0000
Subject: [Hpc-notice] Maintenance (Monday, May 6 - Sunday, May 12) - Where to stay updated
Message-ID: 

Hi RCC Users,

This is a final reminder that we are conducting systems maintenance on
Monday, May 6 through Sunday, May 12.

* We will be posting updates at least once per day on our website:
  https://fla.st/2VaDnyl
* The schedule and a detailed overview are available on our announcement
  page: https://fla.st/2FUvrbe

We have already begun "draining" nodes. This means that individual compute
nodes will shut down once they finish processing their current jobs. We will
disable new job submissions on Sunday at 5pm. Any jobs that are still running
at 7am on Monday will be killed when we start rebuilding nodes.

If you have any questions or concerns, let us know: support at rcc.fsu.edu.

Best regards,
The RCC Team
Research Computing Center
Information Technology Services | Florida State University
w rcc.its.fsu.edu

From pvandermark at fsu.edu Tue May 7 09:33:26 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Tue, 7 May 2019 13:33:26 +0000
Subject: [Hpc-notice] Maintenance (Monday, May 6 - Sunday, May 12) - Where to stay updated
In-Reply-To: 
References: 
Message-ID: <1557236005.30892.175.camel@fsu.edu>

Dear RCC users,

Our storage system was brought back online and has been available since
yesterday morning through our export nodes. You can access your data through
the Globus tool and the rsync program. Our login nodes will stay unavailable
until further notice.

We are currently reinstalling all of our compute nodes and we do not
anticipate any delays.

Best regards,
The RCC Team.
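While the login nodes were down, data remained reachable through the export
nodes via Globus or rsync, as noted above. Below is a minimal sketch of
pulling results down with rsync from a script; the hostname and paths are
placeholders rather than actual RCC hostnames, and it assumes rsync is
installed and key-based SSH access to an export node is already set up.

    #!/usr/bin/env python3
    """Pull results off the research filesystem through an export node.

    A minimal sketch. EXPORT_HOST and the paths below are placeholders, not
    actual RCC hostnames; assumes rsync and working SSH keys.
    """
    import subprocess

    EXPORT_HOST = "export.example.fsu.edu"           # placeholder hostname
    REMOTE_PATH = "/gpfs/research/mygroup/results/"  # placeholder source path
    LOCAL_DEST = "./results/"

    # -a preserve attributes, -v verbose, -z compress, --partial resume interrupted files
    subprocess.run(
        ["rsync", "-avz", "--partial", f"{EXPORT_HOST}:{REMOTE_PATH}", LOCAL_DEST],
        check=True,  # raise CalledProcessError if rsync reports an error
    )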
From pvandermark at fsu.edu Wed May 8 15:56:31 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Wed, 8 May 2019 19:56:31 +0000
Subject: [Hpc-notice] Maintenance (Monday, May 6 - Sunday, May 12) - Where to stay updated
In-Reply-To: <1557236005.30892.175.camel@fsu.edu>
References: <1557236005.30892.175.camel@fsu.edu>
Message-ID: <1557345390.7316.52.camel@fsu.edu>

Dear RCC users,

Most of our compute nodes have been reinstalled, and we are now running
quality-assurance tests on the new software stack.

We do not anticipate any delays and expect to bring everything back online
on time.

Best regards,
The RCC Team
From pvandermark at fsu.edu Fri May 10 17:22:06 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Fri, 10 May 2019 21:22:06 +0000
Subject: [Hpc-notice] Maintenance (Monday, May 6 - Sunday, May 12) - Where to stay updated
In-Reply-To: <1557345390.7316.52.camel@fsu.edu>
References: <1557236005.30892.175.camel@fsu.edu> <1557345390.7316.52.camel@fsu.edu>
Message-ID: <1557523325.12919.29.camel@fsu.edu>

Dear RCC users,

We are currently dotting the i's and crossing the t's. Although everything
looks fine, we will run some final tests this weekend. The cluster will
become available this Monday morning as planned.

Best regards,
The RCC Team
From bgentry at fsu.edu Mon May 13 11:37:06 2019
From: bgentry at fsu.edu (Brian Gentry)
Date: Mon, 13 May 2019 15:37:06 +0000
Subject: [Hpc-notice] 2019 RCC Maintenance Complete
Message-ID: 

Dear RCC Customers,

The 2019 RCC planned maintenance was a success. All systems targeted for
rebuilding and upgrading are online and available. Everything should work as
you expect it to.

More than 500 software packages were upgraded during this time. A list of
updated packages and versions is here:
https://rcc.fsu.edu/news/software-upgrade-coming-in-may

Please note that MPI has been updated, which will require that any program
using MPI be recompiled against the new version.

Should you encounter odd behavior or need assistance resolving issues, please
contact us at support at rcc.fsu.edu or use the web form at:
https://rcc.fsu.edu/support

Thanks and good luck with your projects and research!
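After recompiling an MPI application against the upgraded libraries, a short
smoke test helps confirm that the new stack launches and communicates across
ranks. The sketch below uses mpi4py purely as an illustration; it is an
assumption that mpi4py is available and has itself been rebuilt against the
new MPI (for example by reinstalling it from source after loading the new MPI
module). The same idea applies to a recompiled C or Fortran hello-world.

    """Post-upgrade MPI smoke test (illustrative sketch using mpi4py).

    Assumes mpi4py has been (re)built against the upgraded MPI libraries.
    Run it under your usual launcher, e.g.:  srun -n 4 python mpi_check.py
    """
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    node = MPI.Get_processor_name()

    # A small collective exercises the freshly linked libraries end to end.
    total = comm.allreduce(rank, op=MPI.SUM)

    print(f"rank {rank}/{size} on {node}")
    if rank == 0:
        expected = size * (size - 1) // 2
        print(f"allreduce sum = {total} (expected {expected})")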
From pvandermark at fsu.edu Fri May 17 09:40:22 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Fri, 17 May 2019 13:40:22 +0000
Subject: [Hpc-notice] GPFS "no space left on device" error
Message-ID: <1558100422.12919.160.camel@fsu.edu>

Dear RCC users,

We are running into an issue with our GPFS filesystem: occasionally, when you
create a new file on the research partition, you may see the error "No space
left on device", even though the file system has plenty of free space and you
are not over your quota. This is an issue internal to GPFS, and we are working
with IBM on it. We anticipate that it will be resolved today.

This does not impact our home directories.

Best regards,
The RCC Team

From pvandermark at fsu.edu Fri May 17 13:53:52 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Fri, 17 May 2019 17:53:52 +0000
Subject: [Hpc-notice] GPFS "no space left on device" error
In-Reply-To: <1558100422.12919.160.camel@fsu.edu>
References: <1558100422.12919.160.camel@fsu.edu>
Message-ID: <1558115632.12919.219.camel@fsu.edu>

Dear RCC users,

We have a solution in place that fixes most of the issues we have seen:

- Any newly created files will work fine.
- If you append data to an existing large file, you could still run into the
  "no space left on device" error.

We are re-balancing some of our file sets, and this last issue should
disappear over the weekend or at the beginning of next week.

Best regards,
The RCC Team
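While the re-balancing was in progress, the error was transient rather than a
genuinely full filesystem, so one client-side mitigation was to retry failed
writes after a short pause. The sketch below illustrates that pattern; it is
not an RCC-provided fix, the file path is a placeholder, and note that a
failed attempt can leave a partial append behind.

    """Retry writes that hit a transient GPFS "No space left on device" error.

    An illustrative client-side mitigation, not an RCC fix. The path used in
    the example call is a placeholder.
    """
    import errno
    import time


    def append_with_retry(path, data, attempts=5, delay=30.0):
        """Append bytes to `path`, retrying when the filesystem reports ENOSPC."""
        for attempt in range(1, attempts + 1):
            try:
                with open(path, "ab") as fh:
                    fh.write(data)
                return
            except OSError as exc:
                if exc.errno != errno.ENOSPC or attempt == attempts:
                    raise  # a different error, or out of retries
                time.sleep(delay)  # transient condition: wait and try again


    append_with_retry("/gpfs/research/mygroup/output.log", b"step complete\n")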
From pvandermark at fsu.edu Fri May 17 14:14:04 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Fri, 17 May 2019 18:14:04 +0000
Subject: [Hpc-notice] GPFS "no space left on device" error
In-Reply-To: <1558115632.12919.219.camel@fsu.edu>
References: <1558100422.12919.160.camel@fsu.edu> <1558115632.12919.219.camel@fsu.edu>
Message-ID: <1558116843.12919.231.camel@fsu.edu>

Dear RCC users,

The issue has returned, and new files will also generate the error again. We
are working on this.

Best,
The RCC Team.

From pvandermark at fsu.edu Wed May 22 12:56:53 2019
From: pvandermark at fsu.edu (Paul Van Der Mark)
Date: Wed, 22 May 2019 16:56:53 +0000
Subject: [Hpc-notice] GPFS "no space left on device" error
In-Reply-To: <1558116843.12919.231.camel@fsu.edu>
References: <1558100422.12919.160.camel@fsu.edu> <1558115632.12919.219.camel@fsu.edu> <1558116843.12919.231.camel@fsu.edu>
Message-ID: <1558544213.12919.304.camel@fsu.edu>

Dear RCC users,

Over the last weekend we re-balanced our systems, and we have not encountered
any new issues in the last few days. Everything should work normally now.

Best regards,
The RCC Team