Network switch failure for P2 nodes at UC

Resolved Posted by Michael Sherman on May 04, 2022
Outage start Wednesday, May 04, 2022 10 a.m.
Expected end Tuesday, May 10, 2022 6:31 p.m.

Update: 6pm 05/10/22: The outage is now resolved. Both switches are functional, and P2 nodes nc01-nc64 are back online. New instances are unaffected, but existing instances may still have connectivity issues. If yours does, please try removing and re-attaching the network port on your instance.


Update: 6pm 05/09/22: Configuration of the single remaining Corsa switch is now complete, and all user networks have been restored on that switch. If your instance (on nodes nc37-nc64) still lacks connectivity, you can try removing the network port from your instance and attaching a new one. You may need to use the console and manually run `dhclient` to get a new DHCP address after doing so.
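For users who prefer the command line, the port removal and re-attachment described above could look roughly like the sketch below. It assumes the standard OpenStack command-line client is installed and authenticated against the UC site. The `reattach_port` and `run` helpers and all four arguments (instance, old port, network, new port name) are hypothetical placeholders, not names from this announcement. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: the helper names and all arguments are placeholders.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: reattach_port SERVER OLD_PORT NETWORK NEW_PORT
reattach_port() {
    server="$1"
    old_port="$2"
    network="$3"
    new_port="$4"

    # Detach the port that lost its networking state.
    run openstack server remove port "$server" "$old_port"
    # Create a fresh port on the same network, then attach it.
    run openstack port create --network "$network" "$new_port"
    run openstack server add port "$server" "$new_port"
}
```

For example, `reattach_port my-instance old-port-id my-network my-new-port` prints the three commands for review; setting `DRY_RUN=0` runs them. If the instance still has no address afterwards, running `sudo dhclient` from the instance console, as noted above, requests a new DHCP lease.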

In addition, replacement cables have been obtained for the new, non-SDN switch, which now connects nodes nc01-nc36. We expect to restore functionality for those nodes tomorrow, 05/10/22.

Update: 6pm 05/06/22: Our secondary Corsa switch is now configured as primary, and nodes nc37-nc64 should have connectivity again. Newly provisioned nodes will have networking set up correctly, but existing nodes have lost their networking "state".

Therefore, if (and only if) you are NOT concerned with data loss on your instance, the fastest way to get up and running is to delete and re-create the instance. Otherwise, if your instance is active but does not have connectivity, please file a help-desk ticket. We can attempt to resolve this manually, but it will take some time due to staff availability.

We are continuing to work on restoring connectivity to the other 32 nodes, and this work may cause some further network instability over the weekend.

Finally, to communicate more specifically which nodes are and are not available, nodes in "maintenance" are now shown greyed out on the host reservation calendar at https://chi.uc.chameleoncloud.org/project/leases/calendar/host/ . (Due to a rendering issue, this page is best viewed in Firefox rather than Chrome.)

Thank you for your patience as we bring these racks back online.

Mike Sherman
Infrastructure Lead – Chameleon


Update: 1PM 05/05/22: We are reconfiguring the new switch now, and expect to have half of the nodes back online by end of day. Further hardware issues (faulty cables, among others) may prevent us from bringing the other half online until replacements can be sourced.


Update: 6PM 05/04/22:
One of the Corsa SDN switches has a hardware failure, and we're working to reconfigure a "spare" switch to take its place.
All P2 nodes will likely be down until mid-Thursday at the earliest.


The Corsa dataplane switches at UC are currently offline, due to a suspected power failure. Datacenter staff are investigating.

All instances at UC that use phase2 nodes (names starting with ncXX, or node types compute_skylake / gpu_rtx6000) currently have no connectivity, but their power states and data are unaffected.

We will update when connectivity is restored.