HPCC Downtimes - Unscheduled Outage Log:
HPCC outages during scheduled HPCC maintenance periods, or those which don't result in jobs being aborted (graceful shutdown using the batch system), are not logged:
Sunday April 4, 2004
6:00 A.M. - 7:00 A.M.
Scheduled network outage - A key piece of network hardware (hub250a), which supplies network
connectivity to the OIT Data Center, is being repaired to prevent future
unscheduled outages.
Thursday March 25, 2004
Approximately 5:15 P.M. - 5:20 P.M.
There was a brief (approximately 5 minute) outage of all networking on
campus Thursday 3/25/04 at approximately 5:15 p.m. due to software
problems on the main network switch for the OIT Data Center.
Sunday March 14, 2004
9:00 A.M. - 10:00 A.M.
Campus-wide network outage for networking hardware maintenance. All networking connectivity to AFS
fileservice was interrupted during this time.
The HPCC interactive machine stats.hpcc.nd.edu was upgraded from a 4 x 250 MHz Sun Enterprise 4000
with 1 GB RAM to a 4 x 1.2 GHz Sun Fire V880 server with 8 GB RAM.
Friday March 12, 2004
4:10 P.M. - 4:55 P.M.
sun29.hpcc.nd.edu crashed due to kernel stack overflow - the size of the stack was increased
and the machine returned to service.
Wednesday March 3, 2004
7:50 P.M. - 8:25 A.M.
Campus-wide network outage due to network switch failure. All networking connectivity to AFS
fileservice was interrupted during this time.
Sunday February 22, 2004
2:00 A.M. - 6:00 A.M.
Campus-wide network outage to patch network hardware. All networking connectivity to AFS
fileservice was interrupted during this time.
Sunday December 18th, 2003
6:30 A.M. - 7:00 A.M.
Campus-wide network outage to replace failing network hardware. All networking connectivity to AFS
fileservice was interrupted during this time.
Sunday December 12th, 2003
8:00 A.M. - 10:00 A.M.
The SGI "front-end" machine perseus.hpcc.nd.edu was down in order to
upgrade disk and RAM. The SGI Onyx2 - onyx2.hpcc.nd.edu - was permanently removed from service.
Sunday November 23rd, 2003
5:00 A.M. - 9:00 A.M.
Network configuration changes were
made as part of the OIT Data Center renovation.
Network connectivity was unavailable for the entire campus during this
time, including access to email, the Internet, AFS, and NetFile.
Sunday November 16th, 2003
5:00 A.M. - 10:00 A.M.
AFS fileservice down for maintenance. This may have caused
outages for a number of HPCC users. All users affected were sent email.
Tuesday November 4th, 2003
Approximately 9:45 - 12:30 P.M.
Power outage in OIT ITC computer room caused
6 AFS servers to become unavailable causing numerous problems.
Wednesday October 8th, 2003
1:00 P.M. - 5:00 P.M.
Outage of o3000.hpcc.nd.edu
Machine hung due to a software error - machine was patched with patch 5202 as vendor recommended.
Wednesday May 28th, 2003
4:10 P.M. - 5:00 P.M.
Outage of o3000.hpcc.nd.edu
due to cabling error when the power distribution unit was being reconfigured on the system.
Saturday April 26th
1:00 P.M. - Sunday April 27, 2003 1:15 P.M.
Outage of o3000.hpcc.nd.edu
due to hardware error on machine.
Monday March 31
Approximately 8:00 A.M. - Tuesday April 1, 2003 1:40 P.M.
Gigabit switch outage in the IBM 1300 resulted in
loss of network connectivity on the private management subnet (and to mgmt1.hpcc.nd.edu).
This caused cnode01-32 to hang since mgmt1 is an NFS server for those nodes.
Saturday March 22
Approximately 8:35 A.M. - Monday March 24th, 2003 9:00 A.M.
The batch system in the HPCC was down from approximately 8:35 A.M.
on Saturday March 22nd due to an outage of the AFS fileserver
reno.helios.nd.edu which holds the volume for the SGE batch system.
Tuesday March 18th, 2003 from 7:00 A.M. - 8:00 A.M.
linux1.hpcc.nd.edu was down in order to move the interactive machine.
Sunday February 22nd, 2003 from 7:00 A.M. - 10:00 A.M.
The HPCC interactive machines sun1.hpcc.nd.edu, stats.hpcc.nd.edu
and perseus.hpcc.nd.edu may be unavailable as software maintenance is being
performed during this time.
Wed Jan 22 12:26:27 EST 2003:
The queues for the Linux nodes cnode02-cnode16 have been disabled so that
when jobs are done running on them they can be upgraded from RedHat 7.2 to
RedHat 7.3. After that is done they will immediately be returned to service.
The queues for sun3, sun6, sun7, sun8 & sun9 have been disabled so that they can have the Operating System patched to the latest version. After that is done these machines will immediately be returned to service.
sun4.hpcc.nd.edu is down due to some memory problems. This machine is expected to be operational at approximately noon on 1/23/03.
December 11, 2002:
perseus.hpcc.nd.edu,
the SGI interactive front end machine, will be shut down on Wednesday December
11th from 7:00 A.M. - 7:30 A.M. in order to install some additional software.
Tue Nov 26 11:17:53 EST 2002
On Wednesday November 27th from approximately 4:30 A.M. -
7:30 A.M. the network feed to the HPCC and the campus
Internet service will be unavailable due to the movement
of networking equipment in Malloy Hall. It is expected that this
outage may cause jobs running in the HPCC to die.
The HPCC batch queue has been disabled to prevent new jobs
from starting. After the outage, queues will be re-enabled.
November 5-6, 2002:
On Tuesday & Wednesday (November 5-6th) the air conditioning units in the HPCC computer room located in Malloy Hall are scheduled to be repaired. Because of this, we will be trying to temporarily turn off as many machines as possible to reduce the heating of the room. We will attempt to keep the HPCC machines running while the air conditioning is being repaired. However, please be advised that an emergency shutdown may be performed should the temperature rise significantly. In order to facilitate this, the batch queues have been disabled and will be re-enabled as soon as possible.
Thu Oct 17 07:54:06 EST 2002
Please note that there is a campus-wide AFS filesystem shutdown scheduled
on Sunday October 20th from 1:00 A.M. - 4:00 A.M. (see details below). This
will cause us to shut down the HPCC during this time. We are also planning
an upgrade of the Sun Grid Engine (SGE) batch system software which will extend
this outage until approximately 10:00 A.M. The entire batch system will be
shut down during this time. Please only run jobs which would be expected to
complete prior to Sunday morning at this time.
Wednesday, September 25, 2002
The Linux cluster has been taken off line and the batch queues disabled until
further notice in order to work on some file system changes. It is expected
to be available in a day or two.
Friday, September 6, 2002
The 8 processor Origin 2000 (poseidon.hpcc.nd.edu) will be down until Monday,
September 9th, at approximately 5:00 p.m. for hardware reconfiguration.
The batch queues for medusa and poseidon are currently disabled.
Sunday, August 11, 2002 - 11:10 A.M. - 11:55 A.M. - Scheduled
Repair of bad memory & CPU on 32 processor SGI (medusa.hpcc.nd.edu)
Friday, August 9, 2002 - 11:20 P.M. - 11:45 P.M. - Unscheduled
Crash of 32 processor SGI (medusa.hpcc.nd.edu) due to bad memory & CPU
Friday, July 19, 2002 - 10:30 A.M. - 11:10 A.M. - Scheduled
Repair of bad memory & CPU on 32 processor SGI (medusa.hpcc.nd.edu)
Thursday, July 18, 2002 - 1:00 P.M. - 1:10 P.M. - Unscheduled
Crash of 32 processor SGI (medusa.hpcc.nd.edu) due to bad memory & CPU
Thursday, July 4, 2002 - 8:00 A.M. - 6:00 P.M. - Scheduled
Entire HPCC down for hardware, OS, batch system, and IP address changes.
Monday, July 1, 2002 10:15 A.M. - 11:00 A.M. - Scheduled
Shutdown of 8 processor SGI (poseidon.hpcc.nd.edu) in order to replace memory.
Thursday, June 13, 2002 11:45 A.M. - 1:30 P.M. - Scheduled
The Origin 3000 (o3000.hpcc.nd.edu) is down in order to run diagnostics and perform service related to the crashes on June 5, 7, & 10th.
Monday, June 10, 2002 1:57 A.M. - 2:18 A.M. - Unscheduled
The Origin 3000 (o3000.hpcc.nd.edu) crashed Monday June 10th, 2002 and was down from 1:57 A.M. - 2:18 A.M. Core dumps to be analyzed by SGI.
Friday, June 7th, 2002 7:50 P.M. - 8:18 P.M. - Unscheduled
o3000.hpcc.nd.edu crashed - core dumps to be analyzed by SGI.
Wednesday, June 5, 2002 1:48 P.M. - 2:13 P.M. - Unscheduled
o3000.hpcc.nd.edu crashed, apparently due to running out of memory on the system. Additional
swap space has been configured.
Tuesday, May 7, 7:00 A.M.-8:00 A.M.
The HPCC SGI front end machine perseus.hpcc.nd.edu will have its operating system upgraded from 6.5.13m to 6.5.15m.
This will not cause
the loss of any data being submitted via the batch system. Interactive access
to this machine during this time period will be unavailable.
Wednesday, April 24th, 2002 - Thursday, April 25th 8:30 A.M. - Scheduled
In the process of upgrading software on poseidon.hpcc.nd.edu a bad disk was detected. After it's been replaced and tested the machine will be brought back online and the queue enabled; this is expected to be approximately 8:30 A.M. tomorrow April 25th.
Thursday, April 4, 2002 12:00 P.M. - 4:00 P.M. - Scheduled
The Origin 2000 (medusa.hpcc.nd.edu) was down to replace memory which caused the crash on 3/27. The delay on replacement was due to waiting for running jobs to complete. At this time the OS and System Applications were also updated.
The Origin 3000 (o3000.hpcc.nd.edu) inadvertently rebooted. It is believed that this outage might have been caused by an SGI Field Service Engineer working on the machine at the time.
The 32 processor Origin 2000 (medusa.hpcc.nd.edu) crashed due to a memory problem.
The Origin 3000 (o3000.hpcc.nd.edu) was shut down to replace a Processor Integrated Memory Module (PIMM) which was reporting errors.
The Origin 3000 (o3000.hpcc.nd.edu) was rebooted to attempt to fix a problem with the L2 controller.
The HPCC equipment was moved from the Information Technology Center to Room B025 of the newly constructed computing facility in Malloy Hall.
Saturday January 5th, 2002 6:51 A.M. - 6:55 A.M. Unscheduled
A router (networking hardware) in Fitzpatrick Hall connecting the HPCC switch went down due to a power problem. This caused problems with jobs running on medusa, sun7, sun8 and o3000 which were using the network at that time.
Monday December 31st, 2001 2:00 P.M. - 3:30 P.M. - Scheduled
A bad CPU which was determined to have caused the crash of sun1 on 12/19 - 12/20 was replaced.
Friday December 21st, 2001 8:00 A.M. - 7:00 P.M. - Scheduled
The statistical interactive machine (stats.cc.nd.edu) was down in order to upgrade the operating system from Solaris 2.6 to Solaris 2.8 (Solaris 8).
Wednesday December 19th, 2001 7:00 P.M. - Thursday December 20th, 9:45 A.M. - Unscheduled
sun1.hpcc.nd.edu crashed due to a CPU and/or memory error.
Tuesday December 18th, 2001 10:30 P.M. - Wednesday December 19th, 11:00 A.M. - Scheduled
The bad memory which caused the crash of o3000 on 12/15/2001 was replaced. In order to more fully diagnose the problems with the Origin 3000, an SGI field service person ran diagnostics in order to minimize future problems.
Saturday December 15th, 2001 2:30 A.M. - 2:45 A.M. - Unscheduled
The Origin 3000 (o3000.hpcc.nd.edu) crashed due to an uncorrectable memory error. SGI was called to analyze the crash dumps. Until a response is received from SGI the queue for the Origin 3000 has been disabled, although currently jobs are running.
Tuesday November 6, 2001 10:30 P.M. - Approximately Thursday November 8th, 12:00 noon - Unscheduled
The Origin 3000 (o3000.hpcc.nd.edu) crashed at approximately 10:30 P.M. on Tuesday November 6th. SGI service has been called. It is expected that the machine will be available at approximately noon on Thursday 11/8/01.
Sunday October 21, 2001 7:00 P.M. - 11:30 P.M. - Scheduled
Campus wide Kerberos 5 installation / AFS Server Outage
Sunday October 14, 2001 10:00 A.M. - 12:15 P.M. - Unscheduled
The SGI front-end - perseus.hpcc.nd.edu was down to replace a bad disk. The operating system was also updated during this outage.
Friday October 12, 2001 3:10 P.M. - Saturday October 13th, 10:45 A.M. - Unscheduled
The SGI front-end - perseus.hpcc.nd.edu was down due to a problem caused by a bad disk.
Tuesday Oct 9 11:45 A.M. - Wednesday October 10, 2001 1:15 P.M. - Unscheduled
The 32 node Origin 2000 - medusa.hpcc.nd.edu crashed due to a power supply problem.
Tuesday September 4, 2001 - 4:30 A.M. - 5:30 P.M. - Unscheduled
The Origin 3000 (o3000.hpcc.nd.edu) crashed due to a noncorrectable memory error.
Wednesday August 15, 2001 - Scheduled
The Origin 3000 will be down, as jobs complete, to install a new disk. The machine will be back up as soon as the disk is installed.
medusa.hpcc.nd.edu was down due to a CPU problem and the scratch disk filling up.
Sunday July 15, 2001 - 8:00 A.M. - 12:00 noon - Scheduled
Saturday July 14, 2001 - 12 A.M. - 1:00 P.M. - Unscheduled
The machine medusa.hpcc.nd.edu was rebooted after all batch jobs died and the batch system errored the machine. It is believed that this was due to the restart of AFS servers early this morning. medusa.hpcc.nd.edu is now functioning normally.
Saturday July 14, 2001 - 12 A.M. - 5:00 A.M. - Unscheduled
Some problems with the batch system were experienced due to the restart of AFS servers earlier this morning. The machine medusa was rebooted due to problems experienced. It is believed that all problems have been fixed. If users are experiencing further problems please send email to hpcc@nd.edu in order to get them resolved.
Thursday July 11, 2001 - 7 A.M. (approximately) - 10:45 A.M. - Unscheduled
The HPCC GRD batch system had been experiencing network service problems on the master spool machine.
Thursday June 14th, 2001 - 10:30 A.M. - 12:00 noon. - Unscheduled
The Origin 3000 (o3000.hpcc.nd.edu) was down to replace memory which caused the machine to crash yesterday.
Wednesday June 13th, 2001 - 9:45 A.M. - 10:00 A.M. - Unscheduled
The Origin 3000 (o3000.hpcc.nd.edu) crashed due to a memory problem.
Tuesday May 8, 2001 - Approximately 10:00 A.M. - 2:00 P.M. - Scheduled
Origin 3000 (o3000.hpcc.nd.edu) down in order to implement Field Change Order.
Tuesday May 1, 2001 - Approximately 10:00 A.M.
At approximately 10:00 A.M. 5/1/01 all of the machines in the HPCC were disconnected from the network due to a loss of power to the Cisco 5000 switch which provides network connections for the HPCC. The loss of network connectivity lasted approximately 2 minutes. Unfortunately this outage caused problems with the GRD batch system and some jobs of users were aborted and the batch system needed to be restarted.
Tuesday April 17, 2001 - 9:45 A.M. - 10:00 A.M. - Unscheduled
The front-end machine perseus.hpcc.nd.edu was restarted due to problems with a user's home directory on the machine.
Sat April 14, 2001 - Approximately 4:00 A.M. - Unscheduled
The Origin 3000 (o3000.hpcc.nd.edu) crashed and restarted automatically due to a problem with memory errors.
Thursday April 12, 2001 - Approximately 10:00 P.M. - 11:30 P.M. - Unscheduled
The front-end machine perseus.hpcc.nd.edu was restarted due to a lack of swap space on the machine. A core dump was done in order to identify the user/process which used up the swap space.
Thu Mar 15, 2001 - 4:30 P.M. - 9:30 P.M. - Unscheduled
Due to a power outage in the CCMB Machine room at approximately 16:30 (4:30 P.M.) all machines in the HPCC restarted. It is now 21:30 (9:30 P.M.) and it is believed that all machines and the batch system in the HPCC are now up and running again.
Mon Feb 19, 2001
Sun7.hpcc.nd.edu is down due to hardware problems. A service call has been placed with Sun Microsystems and it is expected that the machine will be running on the morning of Wednesday Feb 21 at the latest.
Wednesday Feb 7, 2001 - 10:00 A.M.
Medusa is currently unavailable in order to clean up some hardware/software problems that occurred during the Operating System upgrade. SGI fixed the hardware problems yesterday (Tuesday Feb 6). It is expected that medusa will be up by tomorrow A.M. (Thursday Feb 8), possibly sooner.
In addition, work is being done to try to determine why perseus is not responding. The Codine/GRD batch queuing system was running on perseus, so that is not available at this time either.
These problems will be fixed as soon as possible.
Access to the SGI machine medusa has been disabled so that it can be rebooted to enable disk quotas on the /scratch filesystem as soon as all jobs are done on it.
Tuesday January 30, 2001 - 11:00 A.M. - 1:00 P.M. Hardware Maintenance Scheduled
The Origin 3000 (o3000.hpcc.nd.edu) will be shut down Tuesday January 30th, 2001 from approximately 11:00 A.M. - 1:00 P.M. in order to replace several pieces of memory. It is believed that this contributed to the crash on 1/12/2001 & possibly 1/26/2001. See http://www.nd.edu/~hpcc/downtime.html for more info on these outages.
The Origin 3000 (o3000.hpcc.nd.edu) hung and was down from approximately 1:00 A.M. - 8:00 A.M. SGI will be investigating this problem.
The Origin 2000 (poseidon.hpcc.nd.edu) was rebooted in order to enable disk quotas on the /scratch filesystem and also to update the Operating system to Irix 6.5.10.
The Origin 3000 (o3000.hpcc.nd.edu) was rebooted in order to enable disk quotas on the /scratch filesystem.
The 20 processor Origin 3000 (o3000.hpcc.nd.edu) crashed at approximately 8:20 P.M. and was down until approximately 8:50 P.M. - the cause of the crash is currently under investigation. The machine is up and running now.
The 32 processor Origin 2000 (medusa.hpcc.nd.edu) crashed at approximately 2:30 A.M. this morning (12/13/2000) and was down until approximately 8:00 A.M. - the cause of the crash is currently under investigation. The machine is up and running now.
perseus was down due to a disk filling up.
Origin 3000 (o3000) down to replace crashed processor.
Origin 3000 down due to CPU error.
sun7.hpcc.nd.edu outage due to defective power supply.
poseidon.hpcc.nd.edu outage to replace defective CPU node card.
poseidon unscheduled outage due to defective CPU node card.
Planned shutdown of medusa & o3000 in order to install new disks on those systems.
Emergency shutdown & reinstall of operating system on Origin 3000 (o3000.hpcc.nd.edu). Shutdown due to break-in and security compromise of the system.
The SGI Origin 2000 medusa (32 processor) will be down in order to install a new fibre channel disk array.
The SGI Origin 2000 poseidon (8 processor) is expected to be down in order to install additional /scratch disks.
Sunday September 10th, 2000: 11:30 A.M. - 12:05 P.M. - Scheduled
The 8 node SGI Origin 2000 (poseidon.hpcc.nd.edu) will be shut down in order to upgrade the CPUs from 300 MHz to 400 MHz. Poseidon is back up and running with 400 MHz processors. The outage only lasted from 11:30 A.M. - 12:05 P.M.
Poseidon, which is running in a friendly user period, was shut down despite the notice that it would remain up. This was in order to physically move the machine during the maintenance scheduled below.
The HPCC SGI front end (perseus) and 32 node compute server (medusa) were shut down for maintenance. Perseus was upgraded from 4 nodes to 8 CPUs. Perseus & medusa Operating System upgraded to Irix 6.5.8m.
Scheduled outage of Medusa (32 node) system for hardware repair & maintenance
Scheduled outage of Medusa (32 node) system for hardware repair & maintenance
Entire HPCC - Medusa & Perseus down due to power outage
Medusa hung for unknown reason. mmscd warnings -
Medusa was restarted to correct loss of network connectivity. Intermittent network problems began shortly after 7:30AM, restart was effected at 7:20PM. Cause is still undetermined.
Medusa and perseus were rebooted several times to fix spinlocked AFS daemons on medusa. The current fix involved modifying kernel source code to lock the ethernet interfaces at 100 Mb/s full-duplex and updating the AFS daemons. Things now appear stable again.
Medusa went down multiple times due to kernel panic. Bad CPU?
page modified 4/3/03