[Bug] v3.7.0-beta.0: Control-Plane Deadlock and Cluster Consensus Availability Loss During Single-Node Downgrade #21767

@Champbreed

What happened?

An operational deadlock occurs in v3.7.0-beta.0 during downgrade handling. When a single-node etcd instance is asked to roll back one minor version (downgrade enable 3.6.0), it neither performs the schema transition nor safely rejects the request; it enters a deadlock state. If this state coincides with high transaction volume or cluster membership changes, the storage backend and Raft consensus degrade, and all KV and Maintenance API endpoints return DeadlineExceeded errors permanently.

What did you expect to happen?

  • Single-Node Optimization: The cluster should detect it is a single-node deployment (100% of the quorum) and execute the local schema downgrade without requiring multi-node consensus validation.

  • Graceful Handling: Contradictory pipelines (e.g., downgrade enable vs downgrade cancel) should be serialized to avoid storage-layer context timeouts.

  • Build Consistency: The module compilation graph should track v3.7.x upstream library sources uniformly, avoiding the mixed-state logs currently observed referencing v3.6.0-beta.0 internal module paths.
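
One way to follow up on the build-consistency point is to list which versioned module paths actually appear in the server log. The sketch below assumes the JSON log format shown in this report and uses a hypothetical helper name, module_versions_in_log; it only extracts `name@version` tokens from the `caller` field. Running `go version -m` against the etcd binary would give the authoritative module list, which would also settle whether such a path points to a separately versioned dependency or to a mixed build.

```shell
# module_versions_in_log: hypothetical helper (not part of etcd).
# Reads etcd JSON log lines on stdin and prints each distinct
# "name@version" token found at the start of a "caller" field.
module_versions_in_log() {
  grep -oE '"caller":"[^"]*@v[^/"]*' \
    | sed 's/^"caller":"//' \
    | sort -u
}
```

Usage, assuming the server log was saved to a file named etcd.log (a placeholder path): `module_versions_in_log < etcd.log`.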

How can we reproduce it (as minimally and precisely as possible)?

  1. Initialize a single-node instance:
     /tmp/etcd-download-test/etcd --data-dir=/tmp/etcd-migration-test
  2. Submit the downgrade request:
     /tmp/etcd-download-test/etcdctl downgrade enable 3.6.0
  3. Simulate topology adjustment during the loop:
     /tmp/etcd-download-test/etcdctl member add s2 --peer-urls=http://localhost:2381
  4. Stress the system to trigger the deadlock:
     for i in {1..50}; do
       /tmp/etcd-download-test/etcdctl put /stress/key$i "value$i" &
       /tmp/etcd-download-test/etcdctl downgrade cancel &
     done
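
The client-side steps above can be collected into one script. This is a minimal sketch, assuming the server from the first step is already running. The `ETCD_BIN_DIR` and `DRY_RUN` variables and the `run` and `repro_steps` helpers are illustrative names, not part of etcd. With `DRY_RUN=1` (the default here) the commands are only printed, so the sequence can be reviewed before it is run against a real instance.

```shell
# Illustrative settings; the default path matches the one used in this report.
ETCD_BIN_DIR="${ETCD_BIN_DIR:-/tmp/etcd-download-test}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = execute them

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# repro_steps [N]: issue the downgrade request, add the second member,
# then fire N concurrent put / downgrade-cancel pairs (default 50).
repro_steps() {
  n="${1:-50}"
  run "$ETCD_BIN_DIR/etcdctl" downgrade enable 3.6.0
  run "$ETCD_BIN_DIR/etcdctl" member add s2 --peer-urls=http://localhost:2381
  i=1
  while [ "$i" -le "$n" ]; do
    run "$ETCD_BIN_DIR/etcdctl" put "/stress/key$i" "value$i" &
    run "$ETCD_BIN_DIR/etcdctl" downgrade cancel &
    i=$((i + 1))
  done
  wait   # block until every background client has returned
}
```

Calling `repro_steps 3` prints eight commands in dry-run mode. Setting `DRY_RUN=0` before calling `repro_steps` executes them. The trailing `wait` is an addition to the original loop so the script does not exit while clients are still in flight.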

Relevant log output

1. Version Monitor Loop

{"level":"warn","ts":"2026-05-20T02:12:32.772798+0100","caller":"version/monitor.go:212","msg":"remotes server has mismatching etcd version","remote-member-id":"8e9e05c52164694d","current-server-version":"3.7.0","target-version":"3.6.0"}
{"level":"warn","ts":"2026-05-20T02:15:35.180190+0100","caller":"etcdserver/v3_server.go:1260","msg":"reject downgrade request","error":"etcdserver: invalid downgrade target version"}

2. Raft Consensus Deadlock

{"level":"info","ts":"2026-05-20T02:31:25.635324+0100","logger":"raft","caller":"v3@v3.6.0-beta.0.0.20260116184858-6d944ca211ee/raft.go:930","msg":"8e9e05c52164694d became pre-candidate at term 2"}
{"level":"warn","ts":"2026-05-20T02:31:25.800941+0100","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"91fba30a589a5910","rtt":"0s","error":"dial tcp 127.0.0.1:2381: connect: connection refused"}

3. Client API Timeout

{"level":"warn","ts":"2026-05-20T03:01:42.635561+0100","logger":"etcd-client","caller":"v3/retry_interceptor.go:68","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x3d1241a2000/localhost:2379","peer":"Peer{Addr: '127.0.0.1:2379', LocalAddr: '127.0.0.1:32916', AuthInfo: 'insecure'}","method":"/etcdserverpb.KV/Put","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = stream terminated by RST_STREAM with error code: CANCEL"}

Anything else we need to know?

This bug reproduces reliably in an isolated local sandbox. Combining a pending version-rollback state transition with rapid membership changes breaks Raft voting and leaves the single node stuck in a candidate election loop from which it does not recover.

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.7.0-beta.0
Git SHA: 7ee95a6
Go Version: go1.26.3
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.7.0-beta.0
API version: 3.7

Etcd configuration (command line flags or environment variables)

Launched as a single-node instance via standard CLI invocation:

/tmp/etcd-download-test/etcd --data-dir=/tmp/etcd-migration-test

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

The standard diagnostic commands cannot complete because of the deadlock. Client requests are dropped when the gRPC stream is terminated:

$ etcdctl member list -w table
Error: context deadline exceeded

$ etcdctl endpoint status -w table
Error: context deadline exceeded
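
When collecting these diagnostics from a wedged instance, it can help to record which commands failed and with what exit code. The sketch below is one way to do that. `probe` and `collect_diagnostics` are hypothetical helper names, the `ETCDCTL` variable is an assumption, and the 5s value passed to etcdctl's `--command-timeout` flag is an arbitrary choice.

```shell
# Assumed location of etcdctl; matches the path used in this report.
ETCDCTL="${ETCDCTL:-/tmp/etcd-download-test/etcdctl}"

# probe CMD...: run a command quietly and print "ok" or "fail(<exit code>)"
# followed by the command line that was run.
probe() {
  if "$@" >/dev/null 2>&1; then
    echo "ok: $*"
  else
    echo "fail($?): $*"
  fi
}

# collect_diagnostics: try both diagnostic commands with a bounded timeout.
collect_diagnostics() {
  probe "$ETCDCTL" --command-timeout=5s member list -w table
  probe "$ETCDCTL" --command-timeout=5s endpoint status -w table
}
```

On a healthy instance both lines should start with `ok:`. In the state described here, both would be expected to start with `fail(`.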
