Synchronous Promote - Part 5 - Unify all election modes in their Raft state machine usage#12696
Synchronous Promote - Part 5 - Unify all election modes in their Raft state machine usage#12696Gerold103 wants to merge 6 commits into
Conversation
Gerold103
commented
May 15, 2026
sergepetrenko
left a comment
There was a problem hiding this comment.
Thanks for the patch!
693eb3e to
27f8ab0
Compare
|
coverage: 87.674% (-0.02%) from 87.69% — gerold103/gh-8095-synchro-promote-part5 into master |
27f8ab0 to
05689e1
Compare
sergepetrenko
left a comment
There was a problem hiding this comment.
Thanks for your answers and sorry for the delay in review!
Just a couple more questions.
| assert(raft->state == RAFT_STATE_FOLLOWER); | ||
| diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'", | ||
| "manual elections"); | ||
| if (box_raft_try_promote() != 0) |
There was a problem hiding this comment.
Why not turn it into a no-op when the node is already a raft leader? Like it was done before.
There was a problem hiding this comment.
It is still nop. box_raft_try_promote() does the same state check.
There was a problem hiding this comment.
I don't see that, sorry.
AFAIU box_raft_try_promote() unconditionally calls raft_sm_schedule_new_term() from raft_promote(), making this node step off in case it was a leader.
There was a problem hiding this comment.
Oh, you are right indeed, I mislooked it somewhere else. Trying to fix it I found an interesting problem - the test replication-luatest/gh_8996_synchro_filter_enable_test.lua fails when this is fixed. Because we might get a split-brain like this:
- Node1 and node2 are raft-leaders and limbo-owners in the same term (due to too low quorum or simply due to
election_mode='off'being used). They are separated. - Replication is restored.
- The nodes receive PROMOTE entries of each other, but split-brain detection is disabled, because their have too low schema versions. The promotions get applied.
- Now these nodes are both raft-leaders, but are limbo-replicas. Raft and limbo terms are matching.
The only ways to get out of this (considering we even allow this situation) are:
- Demote and promote again.
- Allow promote by looking only at the limbo state being not a leader, which my patch did.
This case is only possible on an old schema version and with the old behaviour of election_mode='off' and previously it was simply doing the term bump again on promote.
So you are right that we could have skipped it in a normal situation, but weirdly my patch is also right because it preserves the old behaviour here.
I am not sure what to do with this, so I will leave it up to you to decide.
There was a problem hiding this comment.
I see, thanks!
Looks like it'd be better to skip promotion when we consider ourselves a raft leader, then.
Even for election_mode off.
The test can still be fixed by changing box.ctl.promote() to demote + promote, right?
05689e1 to
28146bf
Compare
sergepetrenko
left a comment
There was a problem hiding this comment.
Thanks for the patch!
LGTM.
| * Fixed a crash that could occur when a replication source was removed from | ||
| `box.cfg.replication` at the same time that the node reached its maximum WAL | ||
| queue size (configured by `box.cfg.wal_queue_max_size`) (gh-12719). |
There was a problem hiding this comment.
- Fixed a crash that could occur when a replication source was removed from
box.cfg.replicationat the same time as the node reached its maximum WAL
queue size (configured bybox.cfg.wal_queue_max_size) (applier: crash on cancellation while waiting for journal submit #12719).
| * `election_mode='off'` has been reworked internally. Among visible behavior | ||
| changes is that the promoted node in off-mode will be visible as `'leader'` in | ||
| `box.info.election.state`. Also such nodes are able to vote for other nodes | ||
| which might be having other election modes. Some other minor changes are also | ||
| done which shouldn't affect anything (gh-8095). |
There was a problem hiding this comment.
election_mode='off'has been reworked internally. A promoted off-mode
node will now be reported as'leader'inbox.info.election.state.
Additionally, such nodes are now able to vote for other nodes that may
have different election modes. Other minor changes have also been made
that should not affect functionality (There's no guarantee a PROMOTE entry from an old leader is present on the new one. #8095).
The applier fiber can get blocked while waiting for journal submit, if the journal queue is full. While waiting, the applier can get cancelled (for example, when its peer is removed from box.cfg.replication). This produced the error code JOURNAL_ENTRY_ERR_CANCELLED for the transaction's signature, which wasn't handled on its transformation into an error object in diag_set_journal_res_detailed(), causing a panic-crash. There was only one place, where this error code wasn't immediately transformed into an error object and had to be carried via the signature code: applier_rollback_by_wal_io(). The design of this function having to create an error object before txn_commit_submit() returns is probably questionable, but not big enough of a problem to redesign it right now. Closes #12719 NO_DOC=bugfix
There were 3 places waiting on a hardcoded term number to appear in the logs, exactly like if just the term alone of a specific number was persisted. Somehow it was working, but it is very fragile. This became obvious in the next commits where off-election promotion/demotion was reworked, and these error injections and hardcoded term numbers stopped working. Lets rework these test cases to use a more generic approach without having to wait for specific number of WAL writes and hardcoded term numbers. This probably wasn't possible previously because the convenient error injection ERRINJ_TXN_LIMBO_BEGIN_DELAY_COUNTDOWN is relatively fresh. In scope of #8095 NO_DOC=test NO_CHANGELOG=test
It was unstable in 3 places for the same scenario: - A promote/demote is done on one node. - Another node expects to receive term bump as the first row, and promote/demote as the second row. - The node sets a WAL error injection to fail promote/demote WAL write. What actually happened sometimes is that the term bump came after promote/demote. Which made the test cases fail the Raft term WAL write, making the instance panic and exit. The reordering itself is not a problem, Raft and limbo code are handling it already. Lets stabilize the tests so they bump the term, replicate it, and then write promote/demote. NO_DOC=test NO_CHANGELOG=test
It was flaky in 2 places for the same reason: relay thread was supposed to get something from the WAL thread, but the WAL thread was already blocked by an error injection. In one case the injection ERRINJ_RELAY_FASTER_THAN_TX was sometimes set while the WAL thread was serving older updates to the relay thread. As a result, the new sync txns were saved for later and not delivered to the relay. And later they already couldn't be sent, because the WAL thread got blocked. In the other case the injection ERRINJ_WAL_DELAY was sometimes set while the relay thread was reading an old update like ENTRY-1, the ENTRY-2 was pending. Then 2 new transactions were sent to the journal as ENTRY-3, lets say, and blocked the WAL thread. Then the relay finished processing ENTRY-1, but already couldn't get ENTRY-2, because the WAL thread was blocked by now. As a result, the replica couldn't catch up with the master. NO_DOC=test NO_CHANGELOG=test
There were 2 triggers: replicaset_on_quorum_gain, replicaset_on_quorum_loss. No point in having 2 of them. The only usecase is Raft, which anyway handles both these triggers using one function. And any other place can do the same. Would even be encouraged to do the same, because it forces not to store any sort of "health state", and requires to recalculate it on each update. Regarding performance it will make zero difference since the event is very rare. Noticed in scope of #8095, but isn't required for anything. NO_DOC=refactoring NO_CHANGELOG=refactoring NO_TEST=refactoring
Instance with election_mode='off' still allows box.ctl.promote(). This was a mistake to enable this in the public API, but it is already done, and deprecation will take a long while. From user's PoV such usage of synchronous replication looks deceitfully convenient, but is very dangerous as it will sporadically result into split-brain errors even when the leader election works completely smooth. This will stay. And the reason for such behavior is that elections with election_mode='off' effectively use election quorum 1. No matter what replication_synchro_quorum is configured to. From dev's PoV the biggest pain was that with off-mode the limbo had to ignore the Raft state machine. It manifested in many places in the code as having to check raft_is_enabled() being true or false. In the future it would be very hard to make this hack co-exist with the synchronous promotions. It would bloat the branching on raft_is_enabled() and its corner cases even more. The current patch does a simple thing - makes limbo and Raft work literally like the high-level observable behavior of promotions with election_mode='off'. Which means - use election quorum 1 and txn confirmation quorum as configured. Then Raft being always enabled (after recovery is complete) is actually ok, and makes the code way simpler, allowing to eliminate tons of corner cases and hacky checks, while preserving most of the (buggy by design) behavior. It will also be able to co-exist with the synchro promotions, as these new promotion entries in the future will simply use the election_quorum (= 1 in case of 'off' mode), and hence will work like before. Nonetheless some behavior changes might still be observed. Could be considered bug fixes. Specifically, now even with election_mode='off': - on WAL error the leader will resign. - box.info.ro_reason "election" is deleted, only "synchro" remains. - box.info.election.state will show 'leader' when the instance is actually the leader. - switching 'off' -> 'manual'/'candidate' will preserve the leadership state without need for new term elections (if the instance was actually the leader). - some error messages about concurrent or conflicting elections might be different. - box.ctl.demote() requires the node to own the limbo in the latest known Raft term. If Raft term is newer than the limbo, then it is too unsafe to allow persistent demotion. The node must firstly promote itself and then can do the demotion. - the node will be able to vote, when it receives a vote request and the candidate looks suitable. Previously the inability to vote could result into candidate-nodes being unable to win any elections, because the off-nodes are still counted towards quorum 50% + 1 (they are in `_cluster` space), but if they never vote and there is more than half of them, then no candidate-node would ever win any elections. Part of #8095 @TarantoolBot document Title: `election_mode` update Here: https://www.tarantool.io/en/doc/latest/reference/configuration/#confval-election_mode. The doc says that the `election_mode='off'` makes it not participate in the election process, and it won't run the Raft state machine. But it is gonna change. Need to update the text between "Possible values" and "Since version 2.8.2, .....". Like this: ``` Participation of a replica set node in the automated leader election can be configured with this option. The default value is `'off'`. All nodes run the Raft state machine internally, talking to other nodes according to the Raft leader election protocol. All nodes can vote for other nodes in elections and ack synchronous transactions except for anonymous replicas (they are basically ignored by elections and synchronous transactions ack counting). When the mode is `'off'`, the node can be writable even when it is not the leader. Normally in this state the Raft state machine is just idle on all the nodes, and the leadership management is done externally (if done at all). If in `'off'` mode the node sees the synchronous transaction queue claimed by any node (`box.info.synchro.queue.owner` is not 0), then it will be read-only unless it is the owner of the queue and is the elected leader. To disable synchronous replication entirely and to be able to make nodes writable regardless of Raft and the synchronous transaction queue you can switch into `'off'` mode, and then do `box.ctl.promote()` + `box.ctl.demote()`. The mode `'off'` is mostly just a way to run asynchronous replication, and also a way to get off the synchronous replication. It must not be used for actual production workload as it can result into split-brain errors even in a seemingly healthy replicaset. The usage ways are normally either mode `'off'` on all nodes and no `box.ctl.promote()` at all, or election modes `'voter'`, `'candidate'` and `'manual'` on all nodes and full reliance on the synchronous replication and Raft elections. When a node is a `'voter'`, it will be able to vote, but will not be a leader. It will also be read-only guaranteed. When a node is a `'candidate'`, it will watch the state of the current leader, and will actively try to get itself elected when the leader is not healthy. ```
28146bf to
d025e64
Compare