Reported Outages

KVM@TACC System maintenance

Resolved Posted by Cody Hammock on March 06, 2024
Outage start Wednesday, March 27, 2024 8 a.m.
Expected end Wednesday, March 27, 2024 10:33 a.m.

Resolved: Work is complete and KVM@TACC is available.

On Wednesday, March 27, 2024, KVM@TACC will be unavailable between 8:00am CDT and 4:00pm CDT to perform necessary system maintenance. Access to the KVM@TACC web interface and API will be unavailable, and network access to instances will be interrupted. Instances will continue to run, even though they will not be reachable.

 

CHI@UC: Network Uplink Maintenance

Resolved Posted by Michael Sherman on March 04, 2024
Outage start Monday, March 11, 2024 9 a.m.
Expected end Tuesday, March 12, 2024 1 p.m.

5:00 PM: We have a workaround in place that should restore connectivity to all floating IPs; however, we are still investigating the root cause. There may be additional minor interruptions while we troubleshoot, but currently all floating IPs tested respond to ping and ssh.


1:30 PM: The maintenance is done, but we are receiving reports of failures to SSH to public IPs for some users and some instances, and are investigating. Access to the API and dashboard has been fully restored.

 

Chameleon Portal, CHI@Edge and JupyterHub

Resolved Posted by Mark Powers on January 30, 2024
Outage start Tuesday, January 30, 2024 8:30 a.m.
Expected end Wednesday, January 31, 2024 5:59 p.m.

Update 5:45 PM Jan 31: JupyterHub and CHI@Edge are back online.


This morning the VM cluster that hosts Portal, CHI@Edge, and JupyterHub went down. This affected the ability to create new leases at any Chameleon site, and access to the help desk.

CHI@UC Site Maintenance Feb 6th 2024

Resolved Posted by Michael Sherman on January 24, 2024
Outage start Tuesday, February 06, 2024 6 a.m.
Expected end Thursday, February 08, 2024 3:21 p.m.

Update: 3PM Feb 8th: The outage is resolved. All skylake and rtx_6000 nodes now have two 10G network interfaces (up from one each).


Update: 7PM Feb 6th: The site is online, and all P3 nodes (everything except compute_skylake and gpu_rtx_6000) are accessible.


On Feb 6th, 2024, we'll be taking the CHI@UC site down for maintenance, in order to replace some failing network hardware.

CHI@NCAR maintenance

Resolved Posted by Michael Sherman on January 22, 2024
Outage start Friday, January 26, 2024 6 a.m.
Expected end Monday, January 29, 2024 6 p.m.

CHI@NCAR will be down for maintenance from January 26th to January 29th inclusive, in order to facilitate work on the datacenter racks currently hosting it.

Jupyter Planned Upgrade Outage

Resolved Posted by Mark Powers on January 03, 2024
Outage start Monday, January 08, 2024 9:15 a.m.
Expected end Monday, January 08, 2024 9:20 a.m.

Update: The upgrade is complete, and Jupyter service should be back to normal.

---

We will be upgrading JupyterHub versions, which will take down JupyterHub and running JupyterLab instances for a short duration.

Jupyter availability issues Dec 7 2023

Resolved Posted by Michael Sherman on December 07, 2023
Outage start Thursday, December 07, 2023 12:33 p.m.
Expected end Thursday, December 07, 2023 2:22 p.m.

2:22 PM: Issues affecting JupyterHub are resolved.

The root cause was a network interruption affecting availability of the storage services which JupyterHub's infrastructure depends on. After the network interruption was resolved, dependent services, including JupyterHub, returned to a healthy state.

We are still investigating other issues affecting KVM@TACC, unrelated to this outage. If your work in Jupyter depends on KVM, you may be able to continue your work by using a different Chameleon site.

KVM@TACC Outage December 6-13, 2023

Resolved Posted by Cody Hammock on December 06, 2023
Outage start Wednesday, December 06, 2023 3 p.m.
Expected end Monday, December 11, 2023 6 p.m.

Dec 13: The outage is now resolved.

You should observe that I/O performance on ephemeral disks is back to normal, as they have been migrated to a different storage backend.
Note: If you attach an additional cinder volume of type "ceph-hdd", that volume may experience unexpectedly delayed I/O as the system state settles, but this will not impact the majority of instances.

JupyterHub Instability

Resolved Posted by Michael Sherman on November 30, 2023
Outage start Thursday, November 30, 2023 11:30 a.m.
Expected end Thursday, November 30, 2023 6 p.m.

We're observing intermittent issues accessing Chameleon's JupyterHub. One of the VMs underlying it failed, and some requests are still being routed to the failed instance in error.

In some cases, users may be able to work around this error by refreshing their browser.

Site staff are working both to revive the VM and to correct the request routing.

CHI@UC outage

Resolved Posted by Michael Sherman on November 11, 2023
Outage start Friday, November 10, 2023 8:30 p.m.
Expected end Saturday, November 11, 2023 1:27 p.m.

The outage is now resolved.

Due to a combination of factors, a misconfigured node was able to cause a DoS of control plane services. The relevant traffic has been blocked, and the site is back online.