HPCC Downtimes - Unscheduled Outage Log:

Outages that occur during scheduled HPCC maintenance periods, or that do not result in jobs being aborted (graceful shutdown using the batch system), are not logged.

 

Sunday April 4, 2004 6:00 A.M. - 7:00 A.M.
Scheduled network outage - A key piece of network hardware (hub250a), which supplies network connectivity to the OIT Data Center, is being repaired to prevent future unscheduled outages.

Thursday March 25, 2004 Approximately 5:15 P.M. - 5:20 P.M.
There was a brief (approximately 5 minute) outage of all networking on campus Thursday 3/25/04 at approximately 5:15 p.m. due to software problems on the main network switch for the OIT Data Center.

Sunday March 14, 2004 9:00 A.M. - 10:00 A.M.
Campus-wide network outage for networking hardware maintenance. All networking connectivity to AFS fileservice was interrupted during this time.

The HPCC interactive machine stats.hpcc.nd.edu was upgraded from a 4 x 250 MHz Sun Enterprise 4000 with 1 GB RAM to a 4 x 1.2 GHz Sun Fire V880 server with 8 GB RAM.

Friday March 12, 2004 4:10 P.M. - 4:55 P.M.
sun29.hpcc.nd.edu crashed due to a kernel stack overflow. The size of the stack was increased and the machine was returned to service.

Wednesday March 3, 2004 7:50 P.M. - 8:25 A.M.
Campus-wide network outage due to network switch failure. All networking connectivity to AFS fileservice was interrupted during this time.

Sunday February 22, 2004 2:00 A.M. - 6:00 A.M.
Campus-wide network outage to patch network hardware. All networking connectivity to AFS fileservice was interrupted during this time.

Sunday December 18th, 2003 6:30 A.M. - 7:00 A.M.
Campus-wide network outage to replace failing network hardware. All networking connectivity to AFS fileservice was interrupted during this time.

Sunday December 12th, 2003 8:00 A.M. - 10:00 A.M.
The SGI "front-end" machine perseus.hpcc.nd.edu was down in order to upgrade disk and RAM. The SGI Onyx2 (onyx2.hpcc.nd.edu) was permanently removed from service.

Sunday November 23rd, 2003 5:00 A.M. - 9:00 A.M.
Network configuration changes were made as part of the OIT Data Center renovation. Network connectivity was unavailable for the entire campus during this time, including access to email, the Internet, AFS, and NetFile.

Sunday November 16th, 2003 5:00 A.M. - 10:00 A.M..
AFS fileservice down for maintenance. This may have caused outages for a number of HPCC users. All users affected were sent email.

Tuesday November 4th, 2003 Approximately 9:45 - 12:30 P.M.
A power outage in the OIT ITC computer room caused 6 AFS servers to become unavailable, causing numerous problems.

Wednesday October 8th, 2003 1:00 P.M. - 5:00 P.M.
Outage of o3000.hpcc.nd.edu. The machine hung due to a software error and was patched with patch 5202 as the vendor recommended.

Wednesday May 28th, 2003 4:10 P.M. - 5:00 P.M.
Outage of o3000.hpcc.nd.edu due to cabling error when the power distribution unit was being reconfigured on the system.

Saturday April 26th 1:00 P.M. - Sunday April 27, 2003 1:15 P.M.
Outage of o3000.hpcc.nd.edu due to hardware error on machine.

Monday March 31 Approximately 8:00 A.M. - Tuesday April 1, 2003 1:40 P.M.
A problem with the Gigabit switch in the IBM 1300 resulted in loss of network connectivity on the private management subnet (and to mgmt1.hpcc.nd.edu). This caused cnode01-32 to hang, since mgmt1 is an NFS server for those nodes.

Saturday March 22 approximately 8:35 A.M. - Monday March 24th, 2003 9:00 A.M.
The batch system in the HPCC was down from approximately 8:35 A.M. on Saturday March 22nd due to an outage of the AFS fileserver reno.helios.nd.edu which holds the volume for the SGE batch system.

Tuesday March 18th, 2003 from 7:00 A.M. - 8:00 A.M.
linux1.hpcc.nd.edu down in order to move the interactive machine.

Sunday February 22nd, 2003 from 7:00 A.M. - 10:00 A.M.
The HPCC interactive machines sun1.hpcc.nd.edu, stats.hpcc.nd.edu and perseus.hpcc.nd.edu may be unavailable as software maintenance is being performed during this time.

Wed Jan 22 12:26:27 EST 2003:
The queues for the Linux nodes cnode02-cnode16 have been disabled so that when jobs are done running on them they can be upgraded from RedHat 7.2 to RedHat 7.3. After that is done they will immediately be returned to service.

The queues for sun3, sun6, sun7, sun8 & sun9 have been disabled so that they can have the Operating System patched to the latest version. After that is done these machines will immediately be returned to service.

sun4.hpcc.nd.edu is down due to some memory problems. This machine is expected to be operational at approximately noon on 1/23/03.

December 11, 2002:
perseus.hpcc.nd.edu, the SGI interactive front end machine, will be shut down on Wednesday December 11th from 7:00 A.M. - 7:30 A.M. in order to install some additional software.

Tue Nov 26 11:17:53 EST 2002

On Wednesday November 27th from approximately 4:30 A.M. - 7:30 A.M. the network feed to the HPCC and the campus Internet service will be unavailable due to the movement of networking equipment in Malloy Hall. It is expected that this outage may cause jobs running in the HPCC to die. The HPCC batch queue has been disabled to prevent new jobs from starting. After the outage, queues will be re-enabled.

November 5-6, 2002:

On Tuesday & Wednesday (November 5-6th) the air conditioning units in the HPCC computer room located in Malloy Hall are scheduled to be repaired. Because of this, we will temporarily turn off as many machines as possible to reduce the heating of the room. We will attempt to keep the HPCC machines running while the air conditioning is being repaired. However, please be advised that an emergency shutdown may be performed should the temperature rise significantly. In order to facilitate this, the batch queues have been disabled and will be re-enabled as soon as possible.

Thu Oct 17 07:54:06 EST 2002

Please note that there is a campus-wide AFS filesystem shutdown scheduled on Sunday October 20th from 1:00 A.M. - 4:00 A.M. (See details below.) This will cause us to shut down the HPCC during this time. We are also planning an upgrade of the Sun Grid Engine (SGE) batch system software, which will extend this outage until approximately 10:00 A.M. The entire batch system will be shut down during this time. At this time, please only run jobs which would be expected to complete prior to Sunday morning.

Wednesday, September 25, 2002

The Linux cluster has been taken off line and the batch queues disabled until further notice in order to work on some file system changes. It is expected to be available in a day or two.

Friday, September 6, 2002
The 8 processor Origin 2000 (poseidon.hpcc.nd.edu) will be down until Monday, September 9th, at approximately 5:00 p.m. for hardware reconfiguration.

The batch queues for medusa and poseidon are currently disabled.

Sunday, August 11, 2002 - 11:10 A.M. - 11:55 A.M. - Scheduled

Repair of bad memory & CPU on 32 processor SGI (medusa.hpcc.nd.edu)

Friday, August 9, 2002 - 11:20 P.M. - 11:45 P.M. - Unscheduled

Crash of 32 processor SGI (medusa.hpcc.nd.edu) due to bad memory & CPU

Friday, July 19, 2002 - 10:30 A.M. - 11:10 A.M. - Scheduled

Repair of bad memory & CPU on 32 processor SGI (medusa.hpcc.nd.edu)

Thursday, July 18, 2002 - 1:00 P.M. - 1:10 P.M. - Unscheduled

Crash of 32 processor SGI (medusa.hpcc.nd.edu) due to bad memory & CPU

Thursday, July 4, 2002 - 8:00 A.M. - 6:00 P.M. - Scheduled

Entire HPCC down for hardware, OS, Batch System, and IP address changes.

Monday, July 1, 2002 10:15 A.M. - 11:00 A.M. - Scheduled

Shutdown of 8 processor SGI (poseidon.hpcc.nd.edu) in order to replace memory.

Thursday, June 13, 2002 11:45 A.M. - 1:30 P.M. - Scheduled

The Origin 3000 (o3000.hpcc.nd.edu) is down in order to run diagnostics and perform service related to the crashes on June 5, 7, & 10th.

Monday, June 10, 2002 1:57 A.M. - 2:18 A.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) crashed Monday June 10th, 2002 and was down from 1:57 A.M. - 2:18 A.M. Core dumps are to be analyzed by SGI.

Friday, June 7th, 2002 7:50 P.M. - 8:18 P.M. - Unscheduled

o3000.hpcc.nd.edu crashed - core dumps to be analyzed by SGI.

Wednesday, June 5, 2002 1:48 P.M. - 2:13 P.M. - Unscheduled

o3000.hpcc.nd.edu crash - Apparently due to running out of memory on the system. Additional swap space has been configured.

Tuesday, May 7, 7:00 A.M.-8:00 A.M.

The HPCC SGI front end machine perseus.hpcc.nd.edu will have its operating system upgraded from 6.5.13m to 6.5.15m.

This will not cause the loss of any data being submitted via the batch system. Interactive access to this machine during this time period will be unavailable.

Wednesday, April 24th, 2002 - Thursday, April 25th 8:30 A.M. - Scheduled

In the process of upgrading software on poseidon.hpcc.nd.edu a bad disk was detected. After it's been replaced and tested the machine will be brought back online and the queue enabled; this is expected to be approximately 8:30 A.M. tomorrow April 25th.

Thursday, April 4, 2002 12:00 P.M. - 4:00 P.M. - Scheduled

The Origin 2000 (medusa.hpcc.nd.edu) was down to replace memory which caused the crash on 3/27. The delay on replacement was due to waiting for running jobs to complete. At this time the OS and System Applications were also updated.

Thursday, March 28, 2002 11:18 A.M. - 11:21 A.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) inadvertently rebooted. It is believed that this outage might have been caused by an SGI Field Service Engineer working on the machine at the time.

Wednesday, March 27, 2002 3:06 P.M. - 3:21 P.M. - Unscheduled

The 32 processor Origin 2000 (medusa.hpcc.nd.edu) crashed due to a memory problem.

Monday, March 11, 2002 10:00 A.M. - 10:25 A.M. - Scheduled

The Origin 3000 (o3000.hpcc.nd.edu) was shut down to replace a Processor Integrated Memory Module (PIMM) which was reporting errors.

Monday, February 25th, 2002 11:45 A.M. - 11:55 A.M. - Scheduled

The Origin 3000 (o3000.hpcc.nd.edu) was rebooted to attempt to fix a problem with the L2 controller.

Monday, January 28, 2002 6:00 A.M. - Tuesday January 29, 2002 5:00 P.M. - Scheduled

The HPCC equipment was moved from the Information Technology Center to Room B025 of the newly constructed computing facility in Malloy Hall.

Saturday January 5th, 2002 6:51 A.M. - 6:55 A.M. Unscheduled

A router (networking hardware) in Fitzpatrick Hall connecting the HPCC switch went down due to a power problem. This caused problems with jobs running on medusa, sun7, sun8 and o3000 which were using the network at that time.

Monday December 31st, 2001 2:00 P.M. - 3:30 P.M. - Scheduled

A bad CPU which was determined to have caused the crash of sun1 on 12/19 - 12/20 was replaced.

Friday December 21st, 2001 8:00 A.M. - 7:00 P.M. - Scheduled

The statistical interactive machine (stats.cc.nd.edu) was down in order to upgrade the operating system from Solaris 2.6 to Solaris 2.8 (Solaris 8).

Wednesday December 19th, 2001 7:00 P.M. - Thursday December 20th, 9:45 A.M. - Unscheduled

sun1.hpcc.nd.edu crashed due to a CPU and/or memory error.

When the migration of the batch master machine was made to the backup batch master an error was detected (no machine called noname.hpcc.nd.edu) in the batch configuration causing problems with the batch system restart.

Tuesday December 18th, 2001 10:30 P.M. - Wednesday December 19th, 11:00 A.M. - Scheduled

The bad memory which caused the crash of o3000 on 12/15/2001 was replaced. To more fully diagnose the problems with the Origin 3000 and minimize future problems, an SGI field service person ran diagnostics.

Saturday December 15th, 2001 2:30 A.M. - 2:45 A.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) crashed due to an uncorrectable memory error. SGI was called to analyze the crash dumps. Until a response is received from SGI the queue for the Origin 3000 has been disabled, although jobs are currently running.

Tuesday November 6, 2001 10:30 P.M. - Approximately Thursday November 8th, 12:00 noon - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) crashed at approximately 10:30 P.M. on Tuesday November 6th. SGI service has been called. It is expected that the machine will be available at approximately noon on Thursday 11/8/01.

Sunday October 21, 2001 7:00 P.M. - 11:30 P.M. - Scheduled

Campus wide Kerberos 5 installation / AFS Server Outage

Sunday October 14, 2001 10:00 A.M. - 12:15 P.M. - Unscheduled

The SGI front-end - perseus.hpcc.nd.edu was down to replace a bad disk. The operating system was also updated during this outage.

Friday October 12, 2001 3:10 P.M. - Saturday October 13th, 10:45 A.M. - Unscheduled

The SGI front-end - perseus.hpcc.nd.edu was down due to a problem caused by a bad disk.

Tuesday Oct 9 11:45 A.M. - Wednesday October 10, 2001 1:15 P.M. - Unscheduled

The 32 node Origin 2000 - medusa.hpcc.nd.edu crashed due to a power supply problem.

Tuesday September 4, 2001 - 4:30 A.M. - 5:30 P.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) crashed due to a noncorrectable memory error.

Wednesday August 15, 2001 - Scheduled

The Origin 3000 will be down, as jobs complete, to install a new disk. The machine will be back up as soon as the disk is installed.

Friday July 20, 2001 - 8:30 A.M. - 7:00 P.M. - Unscheduled

medusa.hpcc.nd.edu down due to a CPU problem and a full scratch disk.

Sunday July 15, 2001 - 8:00 A.M. - 12:00 noon - Scheduled

sun1.hpcc.nd.edu was down in order to load new Solaris 8 operating system.

Saturday July 14, 2001 - 12 A.M. - 1:00 P.M. - Unscheduled

The machine medusa.hpcc.nd.edu was rebooted after all batch jobs died and the batch system errored the machine. It is believed that this was due to the restart of AFS servers early this morning. medusa.hpcc.nd.edu is now functioning normally.

Saturday July 14, 2001 - 12 A.M. - 5:00 A.M. - Unscheduled

Some problems with the batch system were experienced due to the restart of AFS servers earlier this morning. The machine medusa was rebooted due to problems experienced. It is believed that all problems have been fixed. If users are experiencing further problems please send email to hpcc@nd.edu in order to get them resolved.

Thursday July 11, 2001 - 7 A.M. (approximately) - 10:45 A.M. - Unscheduled

The HPCC GRD batch system had been experiencing network service problems on the master spool machine.

Thursday June 14th, 2001 - 10:30 A.M. - 12:00 noon. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) down to replace memory which caused the machine to crash yesterday.

Wednesday June 13th, 2001 - 9:45 A.M. - 10:00 A.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) crashed due to a memory problem.

Tuesday May 8, 2001 - Approximately 10:00 A.M. - 2:00 P.M. - Scheduled

Origin 3000 (o3000.hpcc.nd.edu) down in order to implement Field Change Order.

Tuesday May 1, 2001 - Approximately 10:00 A.M.

At approximately 10:00 A.M. 5/1/01 all of the machines in the HPCC were disconnected from the network due to a loss of power to the Cisco 5000 switch which provides network connections for the HPCC. The loss of network connectivity lasted approximately 2 minutes. Unfortunately this outage caused problems with the GRD batch system: some users' jobs were aborted and the batch system needed to be restarted.

Tuesday April 17, 2001 - 9:45 A.M. - 10:00 A.M. - Unscheduled

The front-end machine perseus.hpcc.nd.edu was restarted due to problems with a user's home directory on the machine.

Sat April 14, 2001 - Approximately 4:00 A.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) crashed and restarted automatically due to a problem with memory errors.

Thursday April 12, 2001 - Approximately 10:00 P.M. - 11:30 P.M. - Unscheduled

The front-end machine perseus.hpcc.nd.edu was restarted due to a lack of swap space on the machine. A core dump was done in order to identify the user/process which used up the swap space.

Thu Mar 15, 2001 - 4:30 P.M. - 9:30 P.M. - Unscheduled

Due to a power outage in the CCMB Machine room at approximately 16:30 (4:30 P.M.) all machines in the HPCC restarted. It is now 21:30 (9:30 P.M.) and it is believed that all machines and the batch system in the HPCC are now up and running again.

Mon Feb 19, 2001

Sun7.hpcc.nd.edu is down due to hardware problems. A service call has been placed with Sun Microsystems and it is expected that the machine will be running on the morning of Wednesday Feb 21 at the latest.

Wednesday Feb 7, 2001 - 10:00 A.M.

Problems with medusa and perseus and batch queueing system

Medusa is currently unavailable in order to clean up some hardware/software problems that occurred during the Operating System upgrade. SGI fixed the hardware problems yesterday (Tuesday Feb 6). It is expected that medusa will be up by tomorrow A.M. (Thursday Feb 8), possibly sooner.

In addition, work is being done to try to determine why perseus is not responding. The Codine/GRD batch queuing system was running on perseus, so it is also not available at this time.

These problems will be fixed as soon as possible.

Thursday January 25, 2001 - 9:00 A.M.

Access to the SGI machine medusa has been disabled so that it can be rebooted to enable disk quotas on the /scratch filesystem as soon as all jobs are done on it.

Tuesday January 30, 2001 - 11:00 A.M. - 1:00 P.M. Hardware Maintenance Scheduled

The Origin 3000 (o3000.hpcc.nd.edu) will be shut down Tuesday January 30th, 2001 from approximately 11:00 A.M. - 1:00 P.M. in order to replace several pieces of memory. It is believed that this memory contributed to the crash on 1/12/2001 & possibly 1/26/2001. See http://www.nd.edu/~hpcc/downtime.html for more info on these outages.

Friday Jan 26 08:30 EST 2001 - 1:00 A.M. - 8:00 A.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) hung and was down from approximately 1:00 A.M. - 8:00 A.M. SGI will be investigating this problem.

Wednesday Jan 24 17:45 EST 2001 - 3:30 P.M. - 5:45 P.M. - Scheduled

The Origin 2000 (poseidon.hpcc.nd.edu) was rebooted in order to enable disk quotas on the /scratch filesystem and also to update the Operating system to Irix 6.5.10.

Wednesday Jan 17 12:00 EST 2001 - 10:35 A.M. - 11:20 A.M. - Scheduled

The Origin 3000 (o3000.hpcc.nd.edu) was rebooted in order to enable disk quotas on the /scratch filesystem.

Friday January 12, 2001 - 8:20 P.M. - 8:50 P.M. - Unscheduled

The 20 processor Origin 3000 (o3000.hpcc.nd.edu) crashed at approximately 8:20 P.M. and was down until approximately 8:50 P.M. - the cause of the crash is currently under investigation. The machine is up and running now.

Wednesday December 13, 2000 - 2:30 A.M. - 8:00 A.M. - Unscheduled

The 32 processor Origin 2000 (medusa.hpcc.nd.edu) crashed at approximately 2:30 A.M. this morning (12/13/2000) and was down until approximately 8:00 A.M. - the cause of the crash is currently under investigation. The machine is up and running now.

Monday December 11, 2000 - 10:30 A.M. - 4:00 P.M. - Unscheduled

The Origin 3000 (o3000.hpcc.nd.edu) crashed at approximately 10:30 A.M. SGI support recommended that the Operating System be upgraded to the latest version (IRIX 6.5.10). This was done and the machine is available for use.

Saturday November 25th, 2000 - 9:20 A.M. - 12:20 P.M. - Unscheduled

perseus was down due to a disk filling up.

Wednesday November 15th, 2000 - 11 A.M. - 1 P.M. - Scheduled

Origin 3000 (o3000) down to replace crashed processor.

Sunday November 5th, 2000 - 12:00 midnight - Thursday November 9th, 9:30 P.M. - Unscheduled

Origin 3000 down due to CPU error.

Monday October 30th, 2000 - 10:20 A.M. - 10:40 A.M. - Unscheduled

sun7.hpcc.nd.edu outage due to defective power supply.

Friday October 20th, 2000 - 11:00 A.M. - 12:00 noon - Scheduled

poseidon.hpcc.nd.edu outage to replace defective CPU node card

Wednesday October 18th, 2000 - 7 A.M. - 4:45 P.M. - Unscheduled

poseidon unscheduled outage due to defective CPU node card.

Wednesday October 18th 2000 - 7:00 A.M. - 7:00 P.M. - Scheduled

Planned shutdown of medusa & o3000 in order to install new disks on those systems.

Tuesday October 17th 2000 - 9:00 A.M. - 2:30 P.M. - Unscheduled

Emergency shutdown & reinstall of operating system on Origin 3000 (o3000.hpcc.nd.edu). Shutdown due to break-in and security compromise of the system.

Thursday October 5th, 2000 - 9:00 A.M.- 12:00 noon - Scheduled

The front end SGI Origin perseus (8 processor) is expected to be down to install /scratch disks.

Wednesday October 4th, 2000 - 9:00 A.M. - 5:00 P.M. - Scheduled

The SGI Origin 2000 medusa (32 processor) will be down in order to install a new fibre channel disk array.

Wednesday October 4th- Thursday 5th, 2000 9:00 A.M. - 12:00 noon - Scheduled

The SGI Origin 2000 poseidon (8 processor) is expected to be down in order to install additional /scratch disks.

Saturday September 14th-15th, 2000 - 2:30 P.M. - 10:15 A.M. - Unscheduled

The machine sun7.hpcc.nd.edu was shut down in order to repair a memory problem.

Sunday September 10th, 2000: 11:30 A.M. - 12:05 P.M. - Scheduled

The 8 node SGI Origin 2000 (poseidon.hpcc.nd.edu) will be shut down in order to upgrade the CPUs from 300 MHz to 400 MHz. Update: Poseidon is back up and running with 400 MHz processors. The outage only lasted from 11:30 A.M. - 12:05 P.M.

Sunday August 20th, 2000 - 11:15 A.M. - 4:15 P.M. - Unscheduled

Poseidon, which is running in a friendly user period, was shut down despite the notice that it would remain up. This was in order to physically move the machine during the scheduled maintenance described below.

Sunday August 20th, 2000 - 11:15 A.M. - 4:15 P.M. - Scheduled

The HPCC SGI front end (perseus) and 32 node compute server (medusa) were shut down for maintenance. perseus was upgraded from 4 nodes to 8 CPUs. The Operating System on perseus & medusa was upgraded to Irix 6.5.8m.

July 9th, 2000 11:15 A.M. - 1:15 P.M. (Scheduled)

Scheduled outage of Medusa (32 node) system for hardware repair & maintenance

June 30, 2000, 11:00 A.M. - 1:00 P.M. (Scheduled)

Scheduled outage of Medusa (32 node) system for hardware repair & maintenance

June 13, 2000, 10:30 P.M. - 11:15 P.M.

Entire HPCC - Medusa & Perseus down due to power outage

May 30, 2000, 1:00 P.M. - 4:40 P.M.

Medusa hung for unknown reason. mmscd warnings -

January 30, 2000

Medusa was restarted to correct a loss of network connectivity. Intermittent network problems began shortly after 7:30 A.M.; the restart was effected at 7:20 P.M. Cause is still undetermined.

January 5-14, 2000

Medusa and perseus were rebooted several times to fix spinlocked AFS daemons on medusa. The current fix involved modifying kernel source code to lock the ethernet interfaces at 100 Mb/s full-duplex and updating the AFS daemons. Things now appear stable again.

December 15-17, 1999

Medusa went down multiple times due to kernel panic. Bad CPU?

 

page modified 4/3/03