Reported Outages

Chameleon Portal, CHI@TACC, KVM@TACC, and CHI@EDGE networking outage

Resolved Posted by Mark Powers on September 02, 2022
Outage start Friday, September 02, 2022 10:46 a.m.
Expected end Friday, September 02, 2022 3:04 p.m.

UPDATE: All services should now be working as expected


Due to a networking issue at TACC, Chameleon’s portal, CHI@TACC, KVM@TACC, and CHI@EDGE are currently unavailable. This affects site access as well as already running resources. Site networking staff are investigating, but there is no ETA for resolution at this time.

Object Store at UC currently unavailable

Resolved Posted by Michael Sherman on August 24, 2022
Outage start Wednesday, August 24, 2022 5:22 p.m.
Expected end Thursday, August 25, 2022 10:52 a.m.

The outage is resolved, and the object store is available again.

A rebalancing operation in the underlying storage cluster caused higher-than-normal I/O and pushed some of the storage nodes into an unstable state. The node configuration has been tuned to prevent this instability from recurring.

Scheduled downtime for authentication systems

Resolved Posted by Adam Cooper on August 23, 2022
Outage start Tuesday, September 06, 2022 11 a.m.
Expected end Tuesday, September 06, 2022 12 p.m.

On Tuesday, September 6th, our authentication services will be down from 11 AM to 12 PM US Central Time.

HTTP requests being dropped for instances at CHI@UC

Resolved Posted by Michael Sherman on August 18, 2022
Outage start Wednesday, August 17, 2022 12 p.m.
Expected end Thursday, August 18, 2022 3:29 p.m.

3:30PM: The issue has been resolved.

We're observing that HTTP requests from nodes at CHI@UC to non-HTTPS endpoints on the internet are being dropped, and we are working with our upstream network provider to investigate.

As a workaround, you can make requests to an HTTPS endpoint, if one is available.
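For scripts that build URLs programmatically, the workaround can be sketched as a small scheme rewrite. This is a minimal illustration, not an official Chameleon tool: the helper name `to_https` and the `example.org` URL are hypothetical, and it assumes the remote server offers the same path over HTTPS, which is not guaranteed.

```python
from urllib.parse import urlsplit, urlunsplit


def to_https(url: str) -> str:
    """Return the URL with an http scheme replaced by https.

    URLs that use any other scheme are returned unchanged.
    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url

    # An explicit ":80" is the plain-HTTP default port, so drop it and
    # let the client fall back to the HTTPS default (443).
    netloc = parts.netloc
    if netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]

    return urlunsplit(parts._replace(scheme="https", netloc=netloc))


if __name__ == "__main__":
    # Placeholder URL; substitute the endpoint your instance needs.
    print(to_https("http://example.org/data.csv?x=1"))
```

If the rewritten URL fails with a certificate or connection error, the endpoint likely does not support HTTPS, and this workaround does not apply to it.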

JupyterHub storage slowness

Resolved Posted by Michael Sherman on August 05, 2022
Outage start Friday, August 05, 2022 12 p.m.
Expected end Friday, August 05, 2022 2 p.m.

Update 4PM: JupyterHub is back online, and the storage system has stabilized.

Resolving this required more changes than expected, so there may be some remaining instability over the weekend.
Please reach out to the helpdesk if you encounter issues.

Datacenter Cooling failure for CHI@UC

Resolved Posted by Michael Sherman on August 01, 2022
Outage start Monday, August 01, 2022 4:34 p.m.
Expected end Monday, August 01, 2022 9 p.m.

Update 9pm: The chillers have been repaired, and we've received the all clear from ANL staff. CHI@UC is now back to normal operation.

The datacenter hosting CHI@UC has experienced a failure in its cooling system. To reduce load on the remaining cooling, we are blocking new node reservations at UC until the failure is resolved.

Maintenance window for CHI@NU

Resolved Posted by Michael Sherman on July 26, 2022
Outage start Tuesday, August 02, 2022 10 a.m.
Expected end Tuesday, August 02, 2022 2 p.m.

On August 2nd, CHI@NU will be down for a major version upgrade.

Running instances will be inaccessible for the duration, but not otherwise impacted.

Other sites are not affected.

Elevated error rates at CHI@UC

Resolved Posted by Michael Sherman on July 13, 2022
Outage start Wednesday, July 13, 2022 2:41 p.m.
Expected end Monday, July 18, 2022 2:41 p.m.

We're currently seeing elevated error rates when provisioning nodes at CHI@UC.

If an instance you deploy at CHI@UC moves to the "error" state with a message about "provisioning timed out", please reach out to the helpdesk and include the UUIDs of your instance and node(s).

We're currently working to narrow down which nodes are affected and reproduce the issue.

KVM@TACC API and WebUI down

Resolved Posted by Michael Sherman on July 11, 2022
Outage start Monday, July 11, 2022 12:40 p.m.
Expected end Monday, July 11, 2022 2:14 p.m.

2:14PM: The issue is now resolved, and KVM is accessible again.