Synchronous Promote - Part 5 - Unify all election modes in their Raft state machine usage by Gerold103 · Pull Request #12696 · tarantool/tarantool

Gerold103 · 2026-05-15T10:55:21Z

limbo: sync election_mode=off with Raft

Instance with election_mode='off' still allows box.ctl.promote().

This was a mistake to enable this in the public API, but it is
already done, and deprecation will take a long while.

From user's PoV such usage of synchronous replication looks
deceitfully convenient, but is very dangerous as it will
sporadically result into split-brain errors even when the leader
election works completely smooth. This will stay. And the reason
for such behavior is that elections with election_mode='off'
effectively use election quorum 1. No matter what
replication_synchro_quorum is configured to.

From dev's PoV the biggest pain was that with off-mode the limbo
had to ignore the Raft state machine. It manifested in many places
in the code as having to check raft_is_enabled() being true or
false.

In the future it would be very hard to make this hack co-exist
with the synchronous promotions. It would bloat the branching on
raft_is_enabled() and its corner cases even more.

The current patch does a simple thing - makes limbo and Raft
work literally like the high-level observable behavior of
promotions with election_mode='off'. Which means - use election
quorum 1 and txn confirmation quorum as configured.

Then Raft being always enabled (after recovery is complete) is
actually ok, and makes the code way simpler, allowing to eliminate
tons of corner cases and hacky checks, while preserving most of
the (buggy by design) behavior.

It will also be able to co-exist with the synchro promotions, as
these new promotion entries in the future will simply use the
election_quorum (= 1 in case of 'off' mode), and hence will work
like before.

Nonetheless some behavior changes might still be observed. Could
be considered bug fixes. Specifically, now even with
election_mode='off':

- on WAL error the leader will resign.

- box.info.ro_reason "election" is deleted, only "synchro"
  remains.

- box.info.election.state will show 'leader' when the instance is
  actually the leader.

- switching 'off' -> 'manual'/'candidate' will preserve the
  leadership state without need for new term elections (if the
  instance was actually the leader).

- some error messages about concurrent or conflicting elections
  might be different.

- box.ctl.demote() requires the node to own the limbo in the
  latest known Raft term. If Raft term is newer than the limbo,
  then it is too unsafe to allow persistent demotion. The node
  must firstly promote itself and then can do the demotion.

- the node will be able to vote, when it receives a vote request
  and the candidate looks suitable. Previously the inability to
  vote could result into candidate-nodes being unable to win any
  elections, because the off-nodes are still counted towards
  quorum 50% + 1 (they are in `_cluster` space), but if they never
  vote and there is more than half of them, then no candidate-node
  would ever win any elections.

Part of #8095

sergepetrenko

Thanks for the patch!

coveralls · 2026-05-22T11:48:31Z

coverage: 87.674% (-0.02%) from 87.69% — gerold103/gh-8095-synchro-promote-part5 into master

sergepetrenko

Thanks for your answers and sorry for the delay in review!
Just a couple more questions.

sergepetrenko · 2026-06-01T15:08:44Z

-		assert(raft->state == RAFT_STATE_FOLLOWER);
-		diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
-			 "manual elections");
+	if (box_raft_try_promote() != 0)


Why not turn it into a no-op when the node is already a raft leader? Like it was done before.

It is still nop. box_raft_try_promote() does the same state check.

I don't see that, sorry.
AFAIU box_raft_try_promote() unconditionally calls raft_sm_schedule_new_term() from raft_promote(), making this node step off in case it was a leader.

Oh, you are right indeed, I mislooked it somewhere else. Trying to fix it I found an interesting problem - the test replication-luatest/gh_8996_synchro_filter_enable_test.lua fails when this is fixed. Because we might get a split-brain like this:

Node1 and node2 are raft-leaders and limbo-owners in the same term (due to too low quorum or simply due to election_mode='off' being used). They are separated.

Replication is restored.

The nodes receive PROMOTE entries of each other, but split-brain detection is disabled, because their have too low schema versions. The promotions get applied.

Now these nodes are both raft-leaders, but are limbo-replicas. Raft and limbo terms are matching.

The only ways to get out of this (considering we even allow this situation) are:

Demote and promote again.

Allow promote by looking only at the limbo state being not a leader, which my patch did.

This case is only possible on an old schema version and with the old behaviour of election_mode='off' and previously it was simply doing the term bump again on promote.

So you are right that we could have skipped it in a normal situation, but weirdly my patch is also right because it preserves the old behaviour here.

I am not sure what to do with this, so I will leave it up to you to decide.

I see, thanks!

Looks like it'd be better to skip promotion when we consider ourselves a raft leader, then.
Even for election_mode off.

The test can still be fixed by changing box.ctl.promote() to demote + promote, right?

sergepetrenko

Thanks for the patch!
LGTM.

AArdeev · 2026-06-02T12:29:06Z

+* Fixed a crash that could occur when a replication source was removed from
+  `box.cfg.replication` at the same time that the node reached its maximum WAL
+  queue size (configured by `box.cfg.wal_queue_max_size`) (gh-12719).


Fixed a crash that could occur when a replication source was removed from
box.cfg.replication at the same time as the node reached its maximum WAL
queue size (configured by box.cfg.wal_queue_max_size) (applier: crash on cancellation while waiting for journal submit #12719).

AArdeev · 2026-06-02T12:33:16Z

+* `election_mode='off'` has been reworked internally. Among visible behavior
+  changes is that the promoted node in off-mode will be visible as `'leader'` in
+  `box.info.election.state`. Also such nodes are able to vote for other nodes
+  which might be having other election modes. Some other minor changes are also
+  done which shouldn't affect anything (gh-8095).


election_mode='off' has been reworked internally. A promoted off-mode
node will now be reported as 'leader' in box.info.election.state.
Additionally, such nodes are now able to vote for other nodes that may
have different election modes. Other minor changes have also been made
that should not affect functionality (There's no guarantee a PROMOTE entry from an old leader is present on the new one. #8095).

The applier fiber can get blocked while waiting for journal submit, if the journal queue is full. While waiting, the applier can get cancelled (for example, when its peer is removed from box.cfg.replication). This produced the error code JOURNAL_ENTRY_ERR_CANCELLED for the transaction's signature, which wasn't handled on its transformation into an error object in diag_set_journal_res_detailed(), causing a panic-crash. There was only one place, where this error code wasn't immediately transformed into an error object and had to be carried via the signature code: applier_rollback_by_wal_io(). The design of this function having to create an error object before txn_commit_submit() returns is probably questionable, but not big enough of a problem to redesign it right now. Closes #12719 NO_DOC=bugfix

There were 3 places waiting on a hardcoded term number to appear in the logs, exactly like if just the term alone of a specific number was persisted. Somehow it was working, but it is very fragile. This became obvious in the next commits where off-election promotion/demotion was reworked, and these error injections and hardcoded term numbers stopped working. Lets rework these test cases to use a more generic approach without having to wait for specific number of WAL writes and hardcoded term numbers. This probably wasn't possible previously because the convenient error injection ERRINJ_TXN_LIMBO_BEGIN_DELAY_COUNTDOWN is relatively fresh. In scope of #8095 NO_DOC=test NO_CHANGELOG=test

It was unstable in 3 places for the same scenario: - A promote/demote is done on one node. - Another node expects to receive term bump as the first row, and promote/demote as the second row. - The node sets a WAL error injection to fail promote/demote WAL write. What actually happened sometimes is that the term bump came after promote/demote. Which made the test cases fail the Raft term WAL write, making the instance panic and exit. The reordering itself is not a problem, Raft and limbo code are handling it already. Lets stabilize the tests so they bump the term, replicate it, and then write promote/demote. NO_DOC=test NO_CHANGELOG=test

It was flaky in 2 places for the same reason: relay thread was supposed to get something from the WAL thread, but the WAL thread was already blocked by an error injection. In one case the injection ERRINJ_RELAY_FASTER_THAN_TX was sometimes set while the WAL thread was serving older updates to the relay thread. As a result, the new sync txns were saved for later and not delivered to the relay. And later they already couldn't be sent, because the WAL thread got blocked. In the other case the injection ERRINJ_WAL_DELAY was sometimes set while the relay thread was reading an old update like ENTRY-1, the ENTRY-2 was pending. Then 2 new transactions were sent to the journal as ENTRY-3, lets say, and blocked the WAL thread. Then the relay finished processing ENTRY-1, but already couldn't get ENTRY-2, because the WAL thread was blocked by now. As a result, the replica couldn't catch up with the master. NO_DOC=test NO_CHANGELOG=test

There were 2 triggers: replicaset_on_quorum_gain, replicaset_on_quorum_loss. No point in having 2 of them. The only usecase is Raft, which anyway handles both these triggers using one function. And any other place can do the same. Would even be encouraged to do the same, because it forces not to store any sort of "health state", and requires to recalculate it on each update. Regarding performance it will make zero difference since the event is very rare. Noticed in scope of #8095, but isn't required for anything. NO_DOC=refactoring NO_CHANGELOG=refactoring NO_TEST=refactoring

@TarantoolBot

Instance with election_mode='off' still allows box.ctl.promote(). This was a mistake to enable this in the public API, but it is already done, and deprecation will take a long while. From user's PoV such usage of synchronous replication looks deceitfully convenient, but is very dangerous as it will sporadically result into split-brain errors even when the leader election works completely smooth. This will stay. And the reason for such behavior is that elections with election_mode='off' effectively use election quorum 1. No matter what replication_synchro_quorum is configured to. From dev's PoV the biggest pain was that with off-mode the limbo had to ignore the Raft state machine. It manifested in many places in the code as having to check raft_is_enabled() being true or false. In the future it would be very hard to make this hack co-exist with the synchronous promotions. It would bloat the branching on raft_is_enabled() and its corner cases even more. The current patch does a simple thing - makes limbo and Raft work literally like the high-level observable behavior of promotions with election_mode='off'. Which means - use election quorum 1 and txn confirmation quorum as configured. Then Raft being always enabled (after recovery is complete) is actually ok, and makes the code way simpler, allowing to eliminate tons of corner cases and hacky checks, while preserving most of the (buggy by design) behavior. It will also be able to co-exist with the synchro promotions, as these new promotion entries in the future will simply use the election_quorum (= 1 in case of 'off' mode), and hence will work like before. Nonetheless some behavior changes might still be observed. Could be considered bug fixes. Specifically, now even with election_mode='off': - on WAL error the leader will resign. - box.info.ro_reason "election" is deleted, only "synchro" remains. - box.info.election.state will show 'leader' when the instance is actually the leader. - switching 'off' -> 'manual'/'candidate' will preserve the leadership state without need for new term elections (if the instance was actually the leader). - some error messages about concurrent or conflicting elections might be different. - box.ctl.demote() requires the node to own the limbo in the latest known Raft term. If Raft term is newer than the limbo, then it is too unsafe to allow persistent demotion. The node must firstly promote itself and then can do the demotion. - the node will be able to vote, when it receives a vote request and the candidate looks suitable. Previously the inability to vote could result into candidate-nodes being unable to win any elections, because the off-nodes are still counted towards quorum 50% + 1 (they are in `_cluster` space), but if they never vote and there is more than half of them, then no candidate-node would ever win any elections. Part of #8095 @TarantoolBot document Title: `election_mode` update Here: https://www.tarantool.io/en/doc/latest/reference/configuration/#confval-election_mode. The doc says that the `election_mode='off'` makes it not participate in the election process, and it won't run the Raft state machine. But it is gonna change. Need to update the text between "Possible values" and "Since version 2.8.2, .....". Like this: ``` Participation of a replica set node in the automated leader election can be configured with this option. The default value is `'off'`. All nodes run the Raft state machine internally, talking to other nodes according to the Raft leader election protocol. All nodes can vote for other nodes in elections and ack synchronous transactions except for anonymous replicas (they are basically ignored by elections and synchronous transactions ack counting). When the mode is `'off'`, the node can be writable even when it is not the leader. Normally in this state the Raft state machine is just idle on all the nodes, and the leadership management is done externally (if done at all). If in `'off'` mode the node sees the synchronous transaction queue claimed by any node (`box.info.synchro.queue.owner` is not 0), then it will be read-only unless it is the owner of the queue and is the elected leader. To disable synchronous replication entirely and to be able to make nodes writable regardless of Raft and the synchronous transaction queue you can switch into `'off'` mode, and then do `box.ctl.promote()` + `box.ctl.demote()`. The mode `'off'` is mostly just a way to run asynchronous replication, and also a way to get off the synchronous replication. It must not be used for actual production workload as it can result into split-brain errors even in a seemingly healthy replicaset. The usage ways are normally either mode `'off'` on all nodes and no `box.ctl.promote()` at all, or election modes `'voter'`, `'candidate'` and `'manual'` on all nodes and full reliance on the synchronous replication and Raft elections. When a node is a `'voter'`, it will be able to vote, but will not be a leader. It will also be read-only guaranteed. When a node is a `'candidate'`, it will watch the state of the current leader, and will actively try to get itself elected when the leader is not healthy. ```

Gerold103 requested a review from a team as a code owner May 15, 2026 10:55

Gerold103 self-assigned this May 15, 2026

Gerold103 requested a review from sergepetrenko May 15, 2026 10:59

Gerold103 assigned sergepetrenko May 15, 2026

sergepetrenko reviewed May 19, 2026

View reviewed changes

Comment thread src/box/raft.c

Comment thread src/box/raft.c

sergepetrenko removed their assignment May 19, 2026

Gerold103 requested a review from sergepetrenko May 20, 2026 11:35

Gerold103 assigned sergepetrenko and unassigned sergepetrenko May 20, 2026

Gerold103 removed the request for review from sergepetrenko May 20, 2026 13:27

Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 693eb3e to 27f8ab0 Compare May 22, 2026 11:29

Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 27f8ab0 to 05689e1 Compare May 25, 2026 12:18

Gerold103 requested a review from sergepetrenko May 25, 2026 12:56

Gerold103 assigned sergepetrenko May 25, 2026

sergepetrenko reviewed Jun 1, 2026

View reviewed changes

sergepetrenko removed their assignment Jun 1, 2026

Gerold103 requested a review from sergepetrenko June 1, 2026 18:28

Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 05689e1 to 28146bf Compare June 1, 2026 18:28

sergepetrenko assigned sergepetrenko and unassigned Gerold103 Jun 2, 2026

sergepetrenko approved these changes Jun 2, 2026

View reviewed changes

sergepetrenko assigned Gerold103 and unassigned sergepetrenko Jun 2, 2026

AArdeev approved these changes Jun 2, 2026

View reviewed changes

Gerold103 added 5 commits June 3, 2026 00:06

Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 28146bf to d025e64 Compare June 2, 2026 22:07

Gerold103 assigned Serpentian Jun 2, 2026

Gerold103 requested a review from Serpentian June 2, 2026 22:08

Conversation

Gerold103 commented May 15, 2026

Uh oh!

sergepetrenko left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coveralls commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sergepetrenko left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sergepetrenko left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

coveralls commented May 22, 2026 •

edited

Loading