Skip to content

Synchronous Promote - Part 5 - Unify all election modes in their Raft state machine usage#12696

Open
Gerold103 wants to merge 6 commits into
masterfrom
gerold103/gh-8095-synchro-promote-part5
Open

Synchronous Promote - Part 5 - Unify all election modes in their Raft state machine usage#12696
Gerold103 wants to merge 6 commits into
masterfrom
gerold103/gh-8095-synchro-promote-part5

Conversation

@Gerold103
Copy link
Copy Markdown
Collaborator

limbo: sync election_mode=off with Raft

Instance with election_mode='off' still allows box.ctl.promote().

This was a mistake to enable this in the public API, but it is
already done, and deprecation will take a long while.

From user's PoV such usage of synchronous replication looks
deceitfully convenient, but is very dangerous as it will
sporadically result into split-brain errors even when the leader
election works completely smooth. This will stay. And the reason
for such behavior is that elections with election_mode='off'
effectively use election quorum 1. No matter what
replication_synchro_quorum is configured to.

From dev's PoV the biggest pain was that with off-mode the limbo
had to ignore the Raft state machine. It manifested in many places
in the code as having to check raft_is_enabled() being true or
false.

In the future it would be very hard to make this hack co-exist
with the synchronous promotions. It would bloat the branching on
raft_is_enabled() and its corner cases even more.

The current patch does a simple thing - makes limbo and Raft
work literally like the high-level observable behavior of
promotions with election_mode='off'. Which means - use election
quorum 1 and txn confirmation quorum as configured.

Then Raft being always enabled (after recovery is complete) is
actually ok, and makes the code way simpler, allowing to eliminate
tons of corner cases and hacky checks, while preserving most of
the (buggy by design) behavior.

It will also be able to co-exist with the synchro promotions, as
these new promotion entries in the future will simply use the
election_quorum (= 1 in case of 'off' mode), and hence will work
like before.

Nonetheless some behavior changes might still be observed. Could
be considered bug fixes. Specifically, now even with
election_mode='off':

- on WAL error the leader will resign.

- box.info.ro_reason "election" is deleted, only "synchro"
  remains.

- box.info.election.state will show 'leader' when the instance is
  actually the leader.

- switching 'off' -> 'manual'/'candidate' will preserve the
  leadership state without need for new term elections (if the
  instance was actually the leader).

- some error messages about concurrent or conflicting elections
  might be different.

- box.ctl.demote() requires the node to own the limbo in the
  latest known Raft term. If Raft term is newer than the limbo,
  then it is too unsafe to allow persistent demotion. The node
  must firstly promote itself and then can do the demotion.

- the node will be able to vote, when it receives a vote request
  and the candidate looks suitable. Previously the inability to
  vote could result into candidate-nodes being unable to win any
  elections, because the off-nodes are still counted towards
  quorum 50% + 1 (they are in `_cluster` space), but if they never
  vote and there is more than half of them, then no candidate-node
  would ever win any elections.

Part of #8095

@Gerold103 Gerold103 requested a review from a team as a code owner May 15, 2026 10:55
@Gerold103 Gerold103 self-assigned this May 15, 2026
@Gerold103 Gerold103 requested a review from sergepetrenko May 15, 2026 10:59
Copy link
Copy Markdown
Collaborator

@sergepetrenko sergepetrenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch!

Comment thread src/box/raft.c
Comment thread src/box/raft.c
@sergepetrenko sergepetrenko removed their assignment May 19, 2026
@Gerold103 Gerold103 requested a review from sergepetrenko May 20, 2026 11:35
@Gerold103 Gerold103 removed the request for review from sergepetrenko May 20, 2026 13:27
@Gerold103 Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 693eb3e to 27f8ab0 Compare May 22, 2026 11:29
@coveralls
Copy link
Copy Markdown

coveralls commented May 22, 2026

Coverage Status

coverage: 87.674% (-0.02%) from 87.69% — gerold103/gh-8095-synchro-promote-part5 into master

@Gerold103 Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 27f8ab0 to 05689e1 Compare May 25, 2026 12:18
@Gerold103 Gerold103 requested a review from sergepetrenko May 25, 2026 12:56
Copy link
Copy Markdown
Collaborator

@sergepetrenko sergepetrenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your answers and sorry for the delay in review!
Just a couple more questions.

Comment thread test/replication-luatest/gh_9723_sync_system_spaces_test.lua
Comment thread src/box/box.cc
assert(raft->state == RAFT_STATE_FOLLOWER);
diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
"manual elections");
if (box_raft_try_promote() != 0)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not turn it into a no-op when the node is already a raft leader? Like it was done before.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still nop. box_raft_try_promote() does the same state check.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see that, sorry.
AFAIU box_raft_try_promote() unconditionally calls raft_sm_schedule_new_term() from raft_promote(), making this node step off in case it was a leader.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, you are right indeed, I mislooked it somewhere else. Trying to fix it I found an interesting problem - the test replication-luatest/gh_8996_synchro_filter_enable_test.lua fails when this is fixed. Because we might get a split-brain like this:

  • Node1 and node2 are raft-leaders and limbo-owners in the same term (due to too low quorum or simply due to election_mode='off' being used). They are separated.
  • Replication is restored.
  • The nodes receive PROMOTE entries of each other, but split-brain detection is disabled, because their have too low schema versions. The promotions get applied.
  • Now these nodes are both raft-leaders, but are limbo-replicas. Raft and limbo terms are matching.

The only ways to get out of this (considering we even allow this situation) are:

  • Demote and promote again.
  • Allow promote by looking only at the limbo state being not a leader, which my patch did.

This case is only possible on an old schema version and with the old behaviour of election_mode='off' and previously it was simply doing the term bump again on promote.

So you are right that we could have skipped it in a normal situation, but weirdly my patch is also right because it preserves the old behaviour here.

I am not sure what to do with this, so I will leave it up to you to decide.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks!

Looks like it'd be better to skip promotion when we consider ourselves a raft leader, then.
Even for election_mode off.

The test can still be fixed by changing box.ctl.promote() to demote + promote, right?

Comment thread src/box/raft.c
Comment thread src/box/raft.c
@sergepetrenko sergepetrenko removed their assignment Jun 1, 2026
@Gerold103 Gerold103 requested a review from sergepetrenko June 1, 2026 18:28
@Gerold103 Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 05689e1 to 28146bf Compare June 1, 2026 18:28
Copy link
Copy Markdown
Collaborator

@sergepetrenko sergepetrenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch!
LGTM.

Comment on lines +3 to +5
* Fixed a crash that could occur when a replication source was removed from
`box.cfg.replication` at the same time that the node reached its maximum WAL
queue size (configured by `box.cfg.wal_queue_max_size`) (gh-12719).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment on lines +3 to +7
* `election_mode='off'` has been reworked internally. Among visible behavior
changes is that the promoted node in off-mode will be visible as `'leader'` in
`box.info.election.state`. Also such nodes are able to vote for other nodes
which might be having other election modes. Some other minor changes are also
done which shouldn't affect anything (gh-8095).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Gerold103 added 5 commits June 3, 2026 00:06
The applier fiber can get blocked while waiting for journal
submit, if the journal queue is full.

While waiting, the applier can get cancelled (for example, when
its peer is removed from box.cfg.replication).

This produced the error code JOURNAL_ENTRY_ERR_CANCELLED for the
transaction's signature, which wasn't handled on its
transformation into an error object in
diag_set_journal_res_detailed(), causing a panic-crash.

There was only one place, where this error code wasn't immediately
transformed into an error object and had to be carried via the
signature code: applier_rollback_by_wal_io().

The design of this function having to create an error object
before txn_commit_submit() returns is probably questionable, but
not big enough of a problem to redesign it right now.

Closes #12719

NO_DOC=bugfix
There were 3 places waiting on a hardcoded term number to appear
in the logs, exactly like if just the term alone of a specific
number was persisted.

Somehow it was working, but it is very fragile. This became
obvious in the next commits where off-election promotion/demotion
was reworked, and these error injections and hardcoded term
numbers stopped working.

Lets rework these test cases to use a more generic approach
without having to wait for specific number of WAL writes and
hardcoded term numbers.

This probably wasn't possible previously because the convenient
error injection ERRINJ_TXN_LIMBO_BEGIN_DELAY_COUNTDOWN is
relatively fresh.

In scope of #8095

NO_DOC=test
NO_CHANGELOG=test
It was unstable in 3 places for the same scenario:
- A promote/demote is done on one node.
- Another node expects to receive term bump as the first row, and
  promote/demote as the second row.
- The node sets a WAL error injection to fail promote/demote WAL
  write.

What actually happened sometimes is that the term bump came after
promote/demote. Which made the test cases fail the Raft term WAL
write, making the instance panic and exit.

The reordering itself is not a problem, Raft and limbo code are
handling it already.

Lets stabilize the tests so they bump the term, replicate it, and
then write promote/demote.

NO_DOC=test
NO_CHANGELOG=test
It was flaky in 2 places for the same reason: relay thread was
supposed to get something from the WAL thread, but the WAL thread
was already blocked by an error injection.

In one case the injection ERRINJ_RELAY_FASTER_THAN_TX was
sometimes set while the WAL thread was serving older updates to
the relay thread. As a result, the new sync txns were saved for
later and not delivered to the relay. And later they already
couldn't be sent, because the WAL thread got blocked.

In the other case the injection ERRINJ_WAL_DELAY was sometimes set
while the relay thread was reading an old update like ENTRY-1,
the ENTRY-2 was pending. Then 2 new transactions were sent to the
journal as ENTRY-3, lets say, and blocked the WAL thread.

Then the relay finished processing ENTRY-1, but already couldn't
get ENTRY-2, because the WAL thread was blocked by now.

As a result, the replica couldn't catch up with the master.

NO_DOC=test
NO_CHANGELOG=test
There were 2 triggers: replicaset_on_quorum_gain,
replicaset_on_quorum_loss.

No point in having 2 of them. The only usecase is Raft, which
anyway handles both these triggers using one function.

And any other place can do the same. Would even be encouraged to
do the same, because it forces not to store any sort of "health
state", and requires to recalculate it on each update.

Regarding performance it will make zero difference since the event
is very rare.

Noticed in scope of #8095, but isn't required for anything.

NO_DOC=refactoring
NO_CHANGELOG=refactoring
NO_TEST=refactoring
Instance with election_mode='off' still allows box.ctl.promote().

This was a mistake to enable this in the public API, but it is
already done, and deprecation will take a long while.

From user's PoV such usage of synchronous replication looks
deceitfully convenient, but is very dangerous as it will
sporadically result into split-brain errors even when the leader
election works completely smooth. This will stay. And the reason
for such behavior is that elections with election_mode='off'
effectively use election quorum 1. No matter what
replication_synchro_quorum is configured to.

From dev's PoV the biggest pain was that with off-mode the limbo
had to ignore the Raft state machine. It manifested in many places
in the code as having to check raft_is_enabled() being true or
false.

In the future it would be very hard to make this hack co-exist
with the synchronous promotions. It would bloat the branching on
raft_is_enabled() and its corner cases even more.

The current patch does a simple thing - makes limbo and Raft
work literally like the high-level observable behavior of
promotions with election_mode='off'. Which means - use election
quorum 1 and txn confirmation quorum as configured.

Then Raft being always enabled (after recovery is complete) is
actually ok, and makes the code way simpler, allowing to eliminate
tons of corner cases and hacky checks, while preserving most of
the (buggy by design) behavior.

It will also be able to co-exist with the synchro promotions, as
these new promotion entries in the future will simply use the
election_quorum (= 1 in case of 'off' mode), and hence will work
like before.

Nonetheless some behavior changes might still be observed. Could
be considered bug fixes. Specifically, now even with
election_mode='off':

- on WAL error the leader will resign.

- box.info.ro_reason "election" is deleted, only "synchro"
  remains.

- box.info.election.state will show 'leader' when the instance is
  actually the leader.

- switching 'off' -> 'manual'/'candidate' will preserve the
  leadership state without need for new term elections (if the
  instance was actually the leader).

- some error messages about concurrent or conflicting elections
  might be different.

- box.ctl.demote() requires the node to own the limbo in the
  latest known Raft term. If Raft term is newer than the limbo,
  then it is too unsafe to allow persistent demotion. The node
  must firstly promote itself and then can do the demotion.

- the node will be able to vote, when it receives a vote request
  and the candidate looks suitable. Previously the inability to
  vote could result into candidate-nodes being unable to win any
  elections, because the off-nodes are still counted towards
  quorum 50% + 1 (they are in `_cluster` space), but if they never
  vote and there is more than half of them, then no candidate-node
  would ever win any elections.

Part of #8095

@TarantoolBot document
Title: `election_mode` update

Here: https://www.tarantool.io/en/doc/latest/reference/configuration/#confval-election_mode.

The doc says that the `election_mode='off'` makes it not
participate in the election process, and it won't run the Raft
state machine.

But it is gonna change. Need to update the text between
"Possible values" and "Since version 2.8.2, .....".

Like this:
```
Participation of a replica set node in the automated leader
election can be configured with this option.

The default value is `'off'`.

All nodes run the Raft state machine internally, talking to other
nodes according to the Raft leader election protocol. All nodes
can vote for other nodes in elections and ack synchronous
transactions except for anonymous replicas (they are basically
ignored by elections and synchronous transactions ack counting).

When the mode is `'off'`, the node can be writable even when it is
not the leader. Normally in this state the Raft state machine is
just idle on all the nodes, and the leadership management is done
externally (if done at all).

If in `'off'` mode the node sees the synchronous transaction queue
claimed by any node (`box.info.synchro.queue.owner` is not 0),
then it will be read-only unless it is the owner of the queue and
is the elected leader.

To disable synchronous replication entirely and to be able to make
nodes writable regardless of Raft and the synchronous transaction
queue you can switch into `'off'` mode, and then do
`box.ctl.promote()` + `box.ctl.demote()`.

The mode `'off'` is mostly just a way to run asynchronous
replication, and also a way to get off the synchronous
replication. It must not be used for actual production workload as
it can result into split-brain errors even in a seemingly healthy
replicaset.

The usage ways are normally either mode `'off'` on all nodes and
no `box.ctl.promote()` at all, or election modes `'voter'`,
`'candidate'` and `'manual'` on all nodes and full reliance on the
synchronous replication and Raft elections.

When a node is a `'voter'`, it will be able to vote, but will not
be a leader. It will also be read-only guaranteed.

When a node is a `'candidate'`, it will watch the state of the
current leader, and will actively try to get itself elected when
the leader is not healthy.
```
@Gerold103 Gerold103 force-pushed the gerold103/gh-8095-synchro-promote-part5 branch from 28146bf to d025e64 Compare June 2, 2026 22:07
@Gerold103 Gerold103 requested a review from Serpentian June 2, 2026 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants