Reported Outages

CHI@UC: some rtx_6000 nodes not provisioning

Resolved Posted by Michael Sherman on August 05, 2024
Outage start Thursday, August 01, 2024 6 p.m.
Expected end Tuesday, August 06, 2024 6 p.m.

We're observing intermittent provisioning failures for a small set of nodes at CHI@UC, all of which are "phase 2" nodes, mostly rtx_6000s.

The nodes currently known to be affected are listed below, and have been placed into a non-reservable maintenance mode until we can resolve the issue.

CHI@UC Lease enforcement

Resolved Posted by Michael Sherman on July 25, 2024
Outage start Thursday, July 25, 2024 12 p.m.
Expected end Thursday, July 25, 2024 6 p.m.

At CHI@UC, users submitting a request for a new lease were intermittently receiving an "enforcement failed" error message. After we restarted a relevant service, lease requests are succeeding again.

We suspect this issue is related to a service token expiring and not being correctly renewed, and are investigating a proper fix.

Upcoming Maintenance window for Chameleon Auth server

Resolved Posted by Mark Powers on July 16, 2024
Outage start Tuesday, July 23, 2024 9:30 a.m.
Expected end Tuesday, July 23, 2024 9:35 a.m.

On the morning of Tuesday, July 23rd, there will be a brief outage affecting login to all Chameleon sites and services at 9:30 AM central time.

We expect a 5-10 minute outage while we apply updates to the service that handles federated login.
This won't affect any running workloads or nodes, but you may need to refresh your browser once the outage ends.

CHI@UC Maintenance window for OpenStack version upgrade

Resolved Posted by Michael Sherman on July 02, 2024
Outage start Monday, July 22, 2024 9 a.m.
Expected end Tuesday, July 23, 2024 1 p.m.

As of 1pm, July 23rd, reservation enforcement is fixed, and we're declaring the CHI@UC outage over.
You may notice some minor changes to the Horizon dashboard as we fix some remaining UI issues in the instance launch dialog, but these should not affect any functionality.

Upcoming Maintenance window for Chameleon Auth server

Resolved Posted by Mark Powers on June 11, 2024
Outage start Tuesday, June 18, 2024 10 a.m.
Expected end Tuesday, June 18, 2024 10:06 a.m.

UPDATE: All services are updated and should be working as expected


On the morning of Tuesday, June 18th, there will be a brief outage affecting login to all Chameleon sites and services at 9:30 AM central time.

We expect a 5-10 minute outage while we apply updates to the service that handles federated login.
This won't affect any running workloads or nodes, but you may need to refresh your browser once the outage ends.

CHI@UC Uplink networking

Resolved Posted by Michael Sherman on May 29, 2024
Outage start Wednesday, May 29, 2024 10:28 a.m.
Expected end Wednesday, May 29, 2024 11:28 a.m.

CHI@UC had a brief interruption in connectivity, preventing access to chi.uc.chameleoncloud.org. The UC control-plane servers were unable to contact the Chameleon authentication server; the failure was triggered by work elsewhere in the network that was expected to be unrelated.

Once the issue was discovered, the other network changes were backed out, and service has been restored.

CHI@TACC Outage May 25-27, 2024

Resolved Posted by Cody Hammock on May 28, 2024
Outage start Saturday, May 25, 2024 8:30 a.m.
Expected end Tuesday, May 28, 2024 10 a.m.

CHI@TACC was unavailable from May 25 through the morning of May 28, 2024. This affected the API services and web interface, but did not impact running instances.

We have corrected the problem, and service is restored.

Upcoming Maintenance window for Chameleon Auth server

Resolved Posted by Mark Powers on May 28, 2024
Outage start Tuesday, June 04, 2024 9:30 a.m.
Expected end Tuesday, June 04, 2024 10:10 a.m.

UPDATE: All services are updated and should be working as expected


On the morning of Tuesday, June 4th, there will be a brief outage affecting login to all Chameleon sites and services at 9:30 AM central time.

We expect a 5-10 minute outage while we apply updates to the service that handles federated login.
This won't affect any running workloads or nodes, but you may need to refresh your browser once the outage ends.

CHI@TACC Certificate Expiry

Resolved Posted by Cody Hammock on May 21, 2024
Outage start Monday, May 20, 2024 7:45 p.m.
Expected end Tuesday, May 21, 2024 9:40 a.m.

CHI@TACC was unavailable between 7:15PM CDT Monday May 20, 2024 and 9:40AM CDT Tuesday May 21, 2024 due to an automation failure in issuing an updated SSL certificate. The issue has been corrected and we do not expect further disruption.