Skip to content

core: Don't racily stop Type=dbus services when the bus broker restarts#42455

Open
cdown wants to merge 3 commits into
systemd:mainfrom
cdown:cdown/2026-06-04/dbus_bumbaclart
Open

core: Don't racily stop Type=dbus services when the bus broker restarts#42455
cdown wants to merge 3 commits into
systemd:mainfrom
cdown:cdown/2026-06-04/dbus_bumbaclart

Conversation

@cdown

@cdown cdown commented Jun 3, 2026

Copy link
Copy Markdown
Member

If dbus dies, it is then immediately reactivated by dbus.socket as soon as a client reconnects. In that case pid1's own connection to the bus drops and is reestablished by manager_recheck_dbus() (which comes from the unit_notify() when dbus.services goes back to being running).

As part of bringing the bus back up, bus_setup_api() issues GetNameOwner() for every name we watch on behalf of a BusName= unit.

For a Type=dbus service this is a problem. Consider this scenario:

  1. spearmint.service is Type=dbus, owns com.example.Spearmint, and is running.
  2. The broker crashes and is re-activated through dbus.socket.
  3. spearmint's client connection dropped too, so it is now reconnecting to the new broker to grab its name again.
  4. pid1 racily reconnects first given it is triggered the moment dbus.service reaches SERVICE_RUNNING, whereas spearmint is an ordinary client that must first notice its own disconnection and thus when we do GetNameOwner() com.example.Spearmint is reported as having no owner.
  5. service_bus_name_owner_change() sets bus_name_good = false, and since service_good() returns false for a Type=dbus service without its name, service_enter_running() stops the (perfectly healthy, other than dbus kicking the bucket temporarily) service.

In other words, a broker restart can tear down running Type=dbus services that did nothing wrong, really.

So, to fix this, we can use the fact that a NameOwnerChanged signal is definitive. That is, the name's owner changed while we were connected, so a service that crashes or drops its name can still be stopped immediately. By contrast a GetNameOwner() reply is less authoritative.

There is, importantly, no change to how losing the name on a live connection works, we only give grace on reconnect.

@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

Claude review of PR #42455 (1c185e1)

This revision addresses essentially all prior feedback (dedicated SERVICE_RUNNING_REVALIDATING state with RuntimeMaxSec= paused, coldplug re-arming, RemainAfterExit honored on signal-driven loss, the service_enter_exited_or_stop() helper, dedicated test bus name + marker synchronization). A few subtle edge cases remain in the new resume/grace logic.

Suggestions

  • RuntimeMaxSec under-credited across multiple revalidationssrc/core/service.c:2697 — resume recomputes the budget from the fixed original anchor, dropping earlier pauses' credit so repeated broker flaps stop the service early.
  • Reload during grace window resets the grace deadlinesrc/core/service.c:2716 — an explicit systemctl reload while revalidating re-enters service_enter_running() from a non-revalidating state and arms a fresh full grace window, bypassing the flapping guard's "can't push the deadline out" intent.

Nits

  • SERVICE_BUS_NAME_GRACE_USEC define placementsrc/core/service.c:143 — macro is wedged between the static int service_dispatch_*() forward declarations; group it with the other file-level _USEC constants.

Workflow run

@github-actions github-actions Bot added build-system util-lib tests run meson please-review PR is ready for (re-)review by a maintainer labels Jun 3, 2026
Comment thread src/core/service.c
Comment thread src/core/unit.c
Comment thread src/core/service.c Outdated
Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated
Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated
Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated
Comment thread src/test/test-dbus-name-holder.c
@cdown cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch 2 times, most recently from 62b3262 to 884a3aa Compare June 4, 2026 04:39
Comment thread src/core/service.c Outdated
Comment thread src/core/service.c Outdated
Comment thread src/core/service.c Outdated
@cdown cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch 2 times, most recently from c3ed496 to 46652c1 Compare June 4, 2026 07:33
Comment thread src/core/service.c Outdated
Comment thread src/core/service.h
Comment thread src/test/test-dbus-name-holder.c Outdated
Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated
@cdown cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch from 46652c1 to 893c903 Compare June 4, 2026 08:40
Comment thread src/core/service.c Outdated
Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated
Comment thread src/core/service.c
Comment thread src/core/service.c
Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated
@poettering poettering added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed please-review PR is ready for (re-)review by a maintainer labels Jun 4, 2026
@cdown cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch from 893c903 to 1f2f64f Compare June 8, 2026 07:07
@github-actions github-actions Bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Jun 8, 2026
Comment thread src/core/service.c
Comment thread src/core/service.c Outdated
@cdown cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch from 1f2f64f to b0e74d6 Compare June 9, 2026 08:12
For a Type=dbus service we sometimes want to keep treating it as running
while its bus name has briefly gone away (for example, due to a broker
restart).

SERVICE_RUNNING_REVALIDATING is active and keeps its main process like
SERVICE_RUNNING, but it's deliberately not part of the runtime timer
set, so RuntimeMaxSec= is paused while we sit in it.
@cdown cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch from b0e74d6 to 5a44b47 Compare June 9, 2026 10:52
cdown added 2 commits June 9, 2026 19:58
If dbus dies, it is then immediately reactivated by dbus.socket as soon
as a client reconnects. In that case pid1's own connection to the bus
drops and is reestablished by manager_recheck_dbus() (which comes from
the unit_notify() when dbus.services goes back to being running).

As part of bringing the bus back up, bus_setup_api() issues
GetNameOwner() for every name we watch on behalf of a BusName= unit.

For a Type=dbus service this is a problem. Consider this scenario:

1. spearmint.service is Type=dbus, owns com.example.Spearmint, and is
   running.
2. The broker crashes and is re-activated through dbus.socket.
3. spearmint's client connection dropped too, so it is now reconnecting
   to the new broker to grab its name again.
4. pid1 racily reconnects first given it is triggered the moment
   dbus.service reaches SERVICE_RUNNING, whereas spearmint is an
   ordinary client that must first notice its own disconnection and thus
   when we do GetNameOwner() com.example.Spearmint is reported as having
   no owner.
5. service_bus_name_owner_change() sets bus_name_good = false, and since
   service_good() returns false for a Type=dbus service without its
   name, service_enter_running() stops the (perfectly healthy, other
   than dbus kicking the bucket temporarily) service.

In other words, a broker restart can tear down running Type=dbus
services that did nothing wrong, really.

So, to fix this, we can use the fact that a NameOwnerChanged signal is
definitive. That is, the name's owner changed while we were connected,
so a service that crashes or drops its name can still be stopped
immediately. By contrast a GetNameOwner() reply is less authoritative.

There is, importantly, no change to how losing the name on a live
connection works, we only give grace on reconnect.
@cdown cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch from 5a44b47 to 1c185e1 Compare June 9, 2026 10:58
Comment thread src/core/service.c
Comment thread src/core/service.c
Comment on lines +2713 to +2716
/* If we're already revalidating, just keep waiting on the timer we armed on the way in, so a
* flapping broker that reconnects repeatedly can't keep pushing the deadline out. */
if (s->state == SERVICE_RUNNING_REVALIDATING)
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: suggestion: The early return enforces the documented invariant that a flapping broker can't keep pushing the grace deadline out, but only for re-entries that arrive while s->state == SERVICE_RUNNING_REVALIDATING. An explicit reload of a Type=dbus service is now permitted from this state (the service_reload() assert was widened in the first commit), and it transitions through SERVICE_REFRESH_EXTENSIONS/SERVICE_RELOAD_POST. service_set_state() disables/unrefs bus_name_grace_event_source on those transitions, and when the reload completes service_enter_running() is re-entered with s->state no longer SERVICE_RUNNING_REVALIDATING. If the name is still absent, this branch is taken again and arms a fresh full SERVICE_BUS_NAME_GRACE_USEC window, effectively resetting the grace clock (the runtime budget is fine since revalidate_runtime_begin_usec is only set when s->state == SERVICE_RUNNING). It errs toward keeping the service up, so it isn't a regression, but it defeats the deadline cap in that window. Consider keying the "already revalidating" check off whether the grace timer is still armed rather than off the current state, so an interleaved reload doesn't restart the grace clock.

Comment thread src/core/service.c
static int service_dispatch_exec_io(sd_event_source *source, int fd, uint32_t events, void *userdata);
static int service_dispatch_bus_name_grace(sd_event_source *source, usec_t usec, void *userdata);

#define SERVICE_BUS_NAME_GRACE_USEC (2 * USEC_PER_SEC)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude: nit: The #define SERVICE_BUS_NAME_GRACE_USEC is placed in the middle of the block of static int service_dispatch_*() forward declarations (between the new service_dispatch_bus_name_grace declaration and service_enter_signal). Elsewhere in src/core such timing #define ..._USEC constants are grouped together near the top of the file after the includes (e.g. JOBS_IN_PROGRESS_WAIT_USEC in manager.c, CGROUP_CPU_QUOTA_DEFAULT_PERIOD_USEC in cgroup.c). Move the macro up with the other file-level constants rather than interleaving it with function prototypes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

2 participants