IGNITE-22319
IGNITE-22319, titled "Node crashes if a snapshot restore cancelled due to network issues," describes a critical bug in Apache Ignite version 2.16 where a node performing a snapshot restore will crash if the connection to the source node is lost. This failure, triggered by an unexpected cluster topology change, makes the restore operation unreliable and can compromise cluster stability. The bug is particularly problematic as it represents a failure in the system's core resiliency, turning a routine network issue into a fatal event for a node.
The root cause is a race condition in the IgniteSnapshotManager: the thread receiving snapshot files and the thread monitoring cluster topology changes are not adequately synchronized. Depending on thread timing, the bug manifests in one of two ways: as a fatal AssertionError that immediately halts the JVM to prevent data corruption, or as a DEADLOCKED THREAD condition in which multiple system threads block one another and the node hangs indefinitely. Either outcome removes the node from the cluster.
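The hazard described above can be modeled in a few lines of plain Java. This is a minimal sketch, not Ignite code: the class RestoreTaskSketch, its State enum, and both callback names are hypothetical. It assumes that the receiver thread and the topology listener both try to move the same restore task out of its in-progress state, and it shows one common way to make that transition safe, a single atomic compare-and-set so that exactly one of the two threads wins.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of an in-progress snapshot restore; not part of Ignite. */
final class RestoreTaskSketch {
    enum State { RECEIVING, COMPLETED, CANCELLED }

    // Single source of truth for the task state, shared by both threads.
    private final AtomicReference<State> state =
        new AtomicReference<>(State.RECEIVING);

    /** Called by the (hypothetical) file-receiver thread after the last file arrives. */
    boolean onLastFileReceived() {
        // Succeeds only if no cancellation has already won the race.
        return state.compareAndSet(State.RECEIVING, State.COMPLETED);
    }

    /** Called by the (hypothetical) topology listener when the source node leaves. */
    boolean onSourceNodeLeft() {
        // Succeeds only if the restore has not already completed.
        return state.compareAndSet(State.RECEIVING, State.CANCELLED);
    }

    State state() {
        return state.get();
    }
}
```

Whichever callback runs second sees that the state has already changed and returns false, so it can back off instead of acting on a task that the other thread has already finalized.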
The fix, released in Ignite version 2.17, strengthens the synchronization logic within the SequentialRemoteSnapshotManager so that a node departure is handled as an ordinary error rather than a fatal failure. Instead of crashing, the node now safely cancels the in-progress restore operation and throws a manageable ClusterTopologyCheckedException, preserving node stability and allowing for proper error handling.
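The behavior after the fix, cancelling the restore with an exception the caller can handle, can be illustrated with a small stand-alone sketch. It does not reproduce the actual patch. GracefulCancelSketch and its methods are hypothetical, and TopologyChangedException is a local stand-in for Ignite's ClusterTopologyCheckedException so that the example compiles without Ignite on the classpath.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Hypothetical sketch of cancelling a restore with an exception; not Ignite code. */
final class GracefulCancelSketch {
    /** Local stand-in for Ignite's ClusterTopologyCheckedException. */
    static final class TopologyChangedException extends Exception {
        TopologyChangedException(String msg) {
            super(msg);
        }
    }

    // Completed normally when the restore finishes, exceptionally when it is cancelled.
    private final CompletableFuture<Void> restoreFut = new CompletableFuture<>();

    /** Topology listener: fail the restore instead of letting the node crash. */
    void onSourceNodeLeft(String nodeId) {
        // No effect if the future is already completed.
        restoreFut.completeExceptionally(
            new TopologyChangedException("Source node left: " + nodeId));
    }

    /** Receiver thread: mark the restore as finished. */
    void onAllFilesReceived() {
        // No effect if the restore was already cancelled.
        restoreFut.complete(null);
    }

    /** Caller side: wait for the outcome and handle the cancellation as an error. */
    String awaitOutcome() {
        try {
            restoreFut.join();
            return "restored";
        } catch (CompletionException e) {
            if (e.getCause() instanceof TopologyChangedException)
                return "cancelled: " + e.getCause().getMessage();
            throw e;
        }
    }
}
```

Because the future can be completed only once, a late call from the receiver thread cannot override a cancellation, and the caller sees a recoverable error that it can log or retry rather than a halted JVM.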