Reported Outages

Network switch failure for P2 nodes at UC

Resolved Posted by Michael Sherman on May 04, 2022
Outage start Wednesday, May 04, 2022 10 a.m.
Expected end Tuesday, May 10, 2022 6:31 p.m.

Update 6pm 05/10/22: The outage is now resolved. Both switches are functional, and P2 nodes nc01 through nc64 are back online. New instances have no issues, but existing instances may still have connectivity problems. If yours does, please try removing and re-attaching the network port to your instance.

Provisioning network failure at CHI@UC

Resolved Posted by Michael Sherman on May 02, 2022
Outage start Friday, April 29, 2022 11:20 a.m.
Expected end Monday, May 02, 2022 1:17 p.m.

Update 05/03/22: This issue is now resolved. It was caused by a combination of two factors: a misconfiguration of the DHCP behavior for out-of-band interfaces, and a failure that caused an out-of-band switch to power off.

All affected nodes should be reservable again. If you have an instance that has become inaccessible, please get in touch with us via the helpdesk.

KVM@TACC Unavailable April 22, 2022

Resolved Posted by Cody Hammock on April 22, 2022
Outage start Thursday, April 21, 2022 8 p.m.
Expected end Friday, April 22, 2022 4:08 p.m.

KVM@TACC was unavailable starting in the evening of April 21, 2022. It has been resolved.

CHI@UC down

Resolved Posted by Michael Sherman on March 24, 2022
Outage start Thursday, March 24, 2022 10:25 a.m.
Expected end Thursday, March 24, 2022 12 p.m.

Update: This has been resolved as of 11:42 AM, and the site is back up. Running nodes should not have been affected, aside from the temporary loss of network connectivity.


CHI@UC is currently down due to a failure of the controller node's load-balancer. We will update here with more information.

Network Switch failure at UC

Resolved Posted by Michael Sherman on March 01, 2022
Outage start Tuesday, March 01, 2022 4:04 p.m.
Expected end Sunday, May 01, 2022 4:04 p.m.

Update: Connectivity has been restored. The root cause was a software bug that prevented the creation of a PVST instance on the switch because of the large number of configured VLANs. Using a single instance for all VLANs restored functionality.


The 1G switch serving out-of-band access for nodes in rack BG-41 has encountered a (so far) unrecoverable software error, preventing traffic to the out-of-band interface on nodes P3-CPU-020 to P3-CPU-038.

Networking outage at UC

Resolved Posted by Michael Sherman on February 22, 2022
Outage start Monday, February 21, 2022 3 p.m.
Expected end Tuesday, February 22, 2022 11:17 a.m.

Update 11:16 CST: This should now be resolved. A forwarding loop in the underlying network topology caused some ports to be shut down. Instance provisioning and floating IPs should now be working again. Please reach out if you're still seeing issues on the UC site.


We're currently observing networking issues at UC. New instances are failing to provision, and existing ones are unreachable. We're still investigating the root cause, but will update here when resolved. Other sites are unaffected.

TACC Network maintenance 6 March 2022

Resolved Posted by Cody Hammock on February 21, 2022
Outage start Sunday, March 06, 2022 10 a.m.
Expected end Sunday, March 06, 2022 4 p.m.

Update: The work completed at 12:00 PM (CST).

Network maintenance will be carried out at the TACC site between 10:00 AM and 4:00 PM (CST) on Sunday, March 6th. Access to all systems hosted at TACC will be unavailable during this time, including CHI@TACC, KVM@TACC, CHI@Edge, and the Chameleon Portal. Instances will continue to run, but users will have no access to TACC services and systems until the upgrade is complete.

Please submit any questions you may have via the Chameleon Helpdesk: https://chameleoncloud.org/user/help/

CHI@NU currently down

Resolved Posted by Michael Sherman on February 17, 2022
Outage start Thursday, February 17, 2022 4:50 p.m.
Expected end Friday, February 18, 2022 5:50 p.m.

This outage has been resolved, and CHI@NU is fully operational.


The CHI@NU site is currently inaccessible due to unexpected issues during a service upgrade. Any running nodes should be unaffected, but are currently inaccessible, along with the Horizon WebUI and API. Other sites are unaffected.

If this outage interrupts your work, feel free to use resources at another site, and please let us know via the helpdesk if you have a use case that requires the CHI@NU site.

CHI@TACC, KVM@TACC, and CHI@EDGE networking outage

Resolved Posted by Francois Halbach on February 11, 2022
Outage start Thursday, February 10, 2022 4:14 p.m.
Expected end Thursday, February 10, 2022 5:38 p.m.

Outage Start: 2022-02-10 16:14

Outage End: 2022-02-10 17:38

Update: this outage has been resolved.


Due to a networking issue at TACC, CHI@TACC, KVM@TACC, and CHI@EDGE are currently unavailable. 

This affects site access as well as already running resources.

Site networking staff are investigating, but there is no ETA for resolution at this time.

CHI@UC Networking outage

Resolved Posted by Michael Sherman on February 03, 2022
Outage start Thursday, February 03, 2022 11:27 a.m.
Expected end Friday, February 04, 2022 3:10 p.m.

Update - 3:09 PM CST - The outage should now be resolved.


Due to an upstream hardware failure, L2 stitching connectivity and the creation of new instances are failing at CHI@UC.

Existing instances not using stitching or the SharedWAN network are not affected.

Site networking staff are investigating, but there is no ETA for resolution at this time.