core: Don't racily stop Type=dbus services when the bus broker restarts by cdown · Pull Request #42455 · systemd/systemd

cdown · 2026-06-03T17:15:18Z

If dbus dies, it is then immediately reactivated by dbus.socket as soon as a client reconnects. In that case pid1's own connection to the bus drops and is reestablished by manager_recheck_dbus() (which comes from the unit_notify() when dbus.services goes back to being running).

As part of bringing the bus back up, bus_setup_api() issues GetNameOwner() for every name we watch on behalf of a BusName= unit.

For a Type=dbus service this is a problem. Consider this scenario:

spearmint.service is Type=dbus, owns com.example.Spearmint, and is running.
The broker crashes and is re-activated through dbus.socket.
spearmint's client connection dropped too, so it is now reconnecting to the new broker to grab its name again.
pid1 racily reconnects first given it is triggered the moment dbus.service reaches SERVICE_RUNNING, whereas spearmint is an ordinary client that must first notice its own disconnection and thus when we do GetNameOwner() com.example.Spearmint is reported as having no owner.
service_bus_name_owner_change() sets bus_name_good = false, and since service_good() returns false for a Type=dbus service without its name, service_enter_running() stops the (perfectly healthy, other than dbus kicking the bucket temporarily) service.

In other words, a broker restart can tear down running Type=dbus services that did nothing wrong, really.

So, to fix this, we can use the fact that a NameOwnerChanged signal is definitive. That is, the name's owner changed while we were connected, so a service that crashes or drops its name can still be stopped immediately. By contrast a GetNameOwner() reply is less authoritative.

There is, importantly, no change to how losing the name on a live connection works, we only give grace on reconnect.

github-actions · 2026-06-03T17:15:35Z

Claude review of PR #42455 (`1c185e1`)

This revision addresses essentially all prior feedback (dedicated SERVICE_RUNNING_REVALIDATING state with RuntimeMaxSec= paused, coldplug re-arming, RemainAfterExit honored on signal-driven loss, the service_enter_exited_or_stop() helper, dedicated test bus name + marker synchronization). A few subtle edge cases remain in the new resume/grace logic.

Suggestions

RuntimeMaxSec under-credited across multiple revalidations — src/core/service.c:2697 — resume recomputes the budget from the fixed original anchor, dropping earlier pauses' credit so repeated broker flaps stop the service early.
Reload during grace window resets the grace deadline — src/core/service.c:2716 — an explicit systemctl reload while revalidating re-enters service_enter_running() from a non-revalidating state and arms a fresh full grace window, bypassing the flapping guard's "can't push the deadline out" intent.

Nits

SERVICE_BUS_NAME_GRACE_USEC define placement — src/core/service.c:143 — macro is wedged between the static int service_dispatch_*() forward declarations; group it with the other file-level _USEC constants.

Workflow run

For a Type=dbus service we sometimes want to keep treating it as running while its bus name has briefly gone away (for example, due to a broker restart). SERVICE_RUNNING_REVALIDATING is active and keeps its main process like SERVICE_RUNNING, but it's deliberately not part of the runtime timer set, so RuntimeMaxSec= is paused while we sit in it.

If dbus dies, it is then immediately reactivated by dbus.socket as soon as a client reconnects. In that case pid1's own connection to the bus drops and is reestablished by manager_recheck_dbus() (which comes from the unit_notify() when dbus.services goes back to being running). As part of bringing the bus back up, bus_setup_api() issues GetNameOwner() for every name we watch on behalf of a BusName= unit. For a Type=dbus service this is a problem. Consider this scenario: 1. spearmint.service is Type=dbus, owns com.example.Spearmint, and is running. 2. The broker crashes and is re-activated through dbus.socket. 3. spearmint's client connection dropped too, so it is now reconnecting to the new broker to grab its name again. 4. pid1 racily reconnects first given it is triggered the moment dbus.service reaches SERVICE_RUNNING, whereas spearmint is an ordinary client that must first notice its own disconnection and thus when we do GetNameOwner() com.example.Spearmint is reported as having no owner. 5. service_bus_name_owner_change() sets bus_name_good = false, and since service_good() returns false for a Type=dbus service without its name, service_enter_running() stops the (perfectly healthy, other than dbus kicking the bucket temporarily) service. In other words, a broker restart can tear down running Type=dbus services that did nothing wrong, really. So, to fix this, we can use the fact that a NameOwnerChanged signal is definitive. That is, the name's owner changed while we were connected, so a service that crashes or drops its name can still be stopped immediately. By contrast a GetNameOwner() reply is less authoritative. There is, importantly, no change to how losing the name on a live connection works, we only give grace on reconnect.

github-actions · 2026-06-09T12:11:02Z

+                /* If we're already revalidating, just keep waiting on the timer we armed on the way in, so a
+                 * flapping broker that reconnects repeatedly can't keep pushing the deadline out. */
+                if (s->state == SERVICE_RUNNING_REVALIDATING)
+                        return;


Claude: suggestion: The early return enforces the documented invariant that a flapping broker can't keep pushing the grace deadline out, but only for re-entries that arrive while s->state == SERVICE_RUNNING_REVALIDATING. An explicit reload of a Type=dbus service is now permitted from this state (the service_reload() assert was widened in the first commit), and it transitions through SERVICE_REFRESH_EXTENSIONS/SERVICE_RELOAD_POST. service_set_state() disables/unrefs bus_name_grace_event_source on those transitions, and when the reload completes service_enter_running() is re-entered with s->state no longer SERVICE_RUNNING_REVALIDATING. If the name is still absent, this branch is taken again and arms a fresh full SERVICE_BUS_NAME_GRACE_USEC window, effectively resetting the grace clock (the runtime budget is fine since revalidate_runtime_begin_usec is only set when s->state == SERVICE_RUNNING). It errs toward keeping the service up, so it isn't a regression, but it defeats the deadline cap in that window. Consider keying the "already revalidating" check off whether the grace timer is still armed rather than off the current state, so an interleaved reload doesn't restart the grace clock.

github-actions · 2026-06-09T12:11:03Z

 static int service_dispatch_exec_io(sd_event_source *source, int fd, uint32_t events, void *userdata);
+static int service_dispatch_bus_name_grace(sd_event_source *source, usec_t usec, void *userdata);
+
+#define SERVICE_BUS_NAME_GRACE_USEC (2 * USEC_PER_SEC)


Claude: nit: The #define SERVICE_BUS_NAME_GRACE_USEC is placed in the middle of the block of static int service_dispatch_*() forward declarations (between the new service_dispatch_bus_name_grace declaration and service_enter_signal). Elsewhere in src/core such timing #define ..._USEC constants are grouped together near the top of the file after the includes (e.g. JOBS_IN_PROGRESS_WAIT_USEC in manager.c, CGROUP_CPU_QUOTA_DEFAULT_PERIOD_USEC in cgroup.c). Move the macro up with the other file-level constants rather than interleaving it with function prototypes.

github-actions Bot added the claude-review label Jun 3, 2026

github-actions Bot added build-system util-lib tests run meson please-review PR is ready for (re-)review by a maintainer labels Jun 3, 2026

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/core/service.c

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/core/unit.c

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/core/service.c Outdated

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread test/units/TEST-23-UNIT-FILE.dbus-reconnect.sh Outdated

github-actions Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread src/test/test-dbus-name-holder.c

cdown force-pushed the cdown/2026-06-04/dbus_bumbaclart branch 2 times, most recently from 62b3262 to 884a3aa Compare June 4, 2026 04:39