Reported Outages

Jupyter availability issues Dec 7 2023

Resolved Posted by Michael Sherman on December 07, 2023
Outage start Thursday, December 07, 2023 12:33 p.m.
Expected end Thursday, December 07, 2023 2:22 p.m.

2:22 PM: Issues affecting Jupyterhub are resolved.

The root cause was a network interruption that affected the availability of the storage services that Jupyterhub's infrastructure depends on. After the network interruption was resolved, dependent services, including Jupyterhub, returned to a healthy state.

We are still investigating other issues affecting KVM@TACC, unrelated to this outage. If your work in Jupyter depends on KVM, you may be able to continue your work by using a different Chameleon site.

KVM@TACC Outage December 6-13, 2023

Resolved Posted by Cody Hammock on December 06, 2023
Outage start Wednesday, December 06, 2023 3 p.m.
Expected end Monday, December 11, 2023 6 p.m.

Dec 13: The outage is now resolved.

You should observe that I/O performance on ephemeral disks is back to normal, as they have been migrated to a different storage backend.
Note: If you attach an additional Cinder volume of type "ceph-hdd", that volume may experience unexpectedly delayed I/O as the system state settles, but this will not impact the majority of instances.

Jupyterhub Instability

Resolved Posted by Michael Sherman on November 30, 2023
Outage start Thursday, November 30, 2023 11:30 a.m.
Expected end Thursday, November 30, 2023 6 p.m.

We're observing intermittent issues accessing Chameleon's Jupyterhub. One of the VMs underlying it has failed, and some requests are still being routed to the failed instance in error.

In some cases, users may be able to work around this error by refreshing their browser.

Site staff are working both to revive the VM and to correct the request routing.

CHI@UC outage

Resolved Posted by Michael Sherman on November 11, 2023
Outage start Friday, November 10, 2023 8:30 p.m.
Expected end Saturday, November 11, 2023 1:27 p.m.

The outage is now resolved.

Due to a combination of factors, a misconfigured node was able to cause a denial of service (DoS) against control plane services. The relevant traffic has been blocked, and the site is back online.

JupyterHub Outage

Resolved Posted by Mark Powers on November 03, 2023
Outage start Sunday, November 05, 2023 12 p.m.
Expected end Friday, November 10, 2023 2:07 p.m.

Chameleon's JupyterHub will be taken down Sunday, November 5 at 12:00 PM CT to perform an urgent infrastructure migration.

----

UPDATE: The upgrade is now complete. If you encounter any issues with Jupyter or Trovi, please let us know via the help desk. You may experience an unresponsive host while DNS updates propagate.

CHI@TACC NVIDIA GPU node unavailable

Resolved Posted by Cody Hammock on October 27, 2023
Outage start Thursday, October 26, 2023 10 p.m.
Expected end Monday, October 30, 2023 3:44 p.m.

Resolved: The switch has been restored, and running instances have been reconnected.

The network switch connecting the nodes equipped with NVIDIA M40, K80, P100, and V100 GPUs has failed. A replacement switch is expected to be installed on Oct 30, 2023.

Existing leases for the affected nodes will be extended to prevent instances from being shut down.

Upcoming Maintenance window for Chameleon Auth server

Resolved Posted by Michael Sherman on October 18, 2023
Outage start Monday, October 23, 2023 8 a.m.
Expected end Monday, October 23, 2023 8:44 a.m.

This maintenance is now completed.


On the morning of Monday, October 23rd, there will be a brief outage affecting login to all Chameleon sites and services.

We expect a 5-10 minute outage while we apply updates to the service that handles federated login.
This won't affect any running workloads or nodes, but you may need to refresh your browser once the outage ends.

CHI@TACC Haswell+InfiniBand nodes extended unavailability

Resolved Posted by Cody Hammock on October 04, 2023
Outage start Wednesday, September 20, 2023 8 a.m.
Expected end Monday, October 09, 2023 8:32 a.m.

Resolved: the work is complete.

Since 9/20/2023, the Haswell+InfiniBand nodes have been unavailable while they are reconfigured so that a portion of them can be decommissioned. This work is taking longer than anticipated to complete, but is expected to conclude by 10/06/2023.

We thank you for your patience.

CHI@NU offline due to water leak

Resolved Posted by Michael Sherman on September 18, 2023
Outage start Sunday, September 17, 2023 6 p.m.
Expected end Monday, September 18, 2023 8 p.m.

 

Due to a water leak in the StarLight datacenter, power has been cut to the CHI@NU racks.

We're monitoring the situation, and will update when we have more information on an estimated time to resolution. 

CHI@NU down

Resolved Posted by Michael Sherman on September 13, 2023
Outage start Wednesday, September 13, 2023 8:31 a.m.
Expected end Wednesday, September 13, 2023 6 p.m.

CHI@NU is currently inaccessible due to a certificate issue. Site staff are investigating.