WireGuard connect fails on Ethernet NICs whose driver resets the PHY during NM device_reapply (e.g. igb / Intel I211) #16

@jmckitt

Summary

When connecting via the WireGuard protocol, the kill-switch code path calls NetworkManager's device_reapply_async() on the physical interface to inject a host route for the VPN server. On NICs whose driver performs a PHY reset in response to device_reapply (Intel I211 with the igb driver, in this report), the interface loses carrier (NO-CARRIER, state DOWN) for several seconds. The kernel marks every route via that interface — including the freshly-added VPN-server host route — as linkdown and unusable. The subsequent TCP reachability check in networkmanager.py:start() cannot find a usable route to the server IP and fails immediately with "VPN server NOT reachable". Connection is aborted with "Error: Connection failed. Try connecting to a different server or check your network settings." Every WireGuard connect attempt fails identically.

OpenVPN-TCP works fine on the same system because its kill-switch path (NMKillSwitch) injects exclusion routes into a dummy kill-switch connection profile instead of calling device_reapply on the physical interface.

Environment

  • OS: Ubuntu 24.04 LTS
  • Kernel: 6.8.0-generic (x86_64)
  • NetworkManager: 1.46.0
  • libnetplan1: 1.1.2
  • NIC: Intel I211 Gigabit (PCI ID 8086:1539), driver igb
  • Physical interface: managed by NM, DHCP (referred to below as <eth>)

ProtonVPN packages (all installed from repo.protonvpn.com/debian stable, latest versions available):

Package                          Version
proton-vpn-cli                   1.0.1
proton-vpn-daemon                0.13.7
python3-proton-vpn-api-core      5.2.4
python3-proton-vpn-local-agent   1.6.3
python3-proton-core              0.7.4
python3-proton-keyring-linux     0.2.1

Reproduction

  1. Use a host with an Intel I211 NIC managed by the igb driver (or any NIC whose driver triggers a PHY reset on device_reapply).
  2. Default protocol = wireguard, default kill switch setting (off).
  3. protonvpn signin, then protonvpn connect (or protonvpn connect --country US).

Result: "Error: Connection failed. Try connecting to a different server or check your network settings." Reproducible on every attempt.

Diagnostic data

Below, <eth> is the physical interface name, <gw> is the LAN gateway, and <server_ip> is the chosen VPN-server IPv4.

CLI verbose log (relevant lines)

T+0.000s | proton.vpn.core.vpnconnector | INFO | CONN.CONNECT:START | Protocol: wireguard
T+0.275s | proton.vpn.core.vpnconnector | INFO | CONN:STATE_CHANGED | Connecting
T+3.480s | proton.vpn.backend.networkmanager.core.networkmanager:80 | INFO | VPN server NOT reachable.
T+3.481s | proton.vpn.connection.states:401 | WARNING | Reached connection error state: Timeout (None)
T+3.482s | proton.vpn.core.vpnconnector | INFO | CONN:STATE_CHANGED | Disconnected

Physical interface link state and routing table, sampled every 200 ms during connect

T+0.000s | LINK: <eth> UP, <BROADCAST,MULTICAST,UP,LOWER_UP>
  ROUTES: default via <gw> dev <eth> ... metric 100

T+5.530s | LINK: <eth> UP, <BROADCAST,MULTICAST,UP,LOWER_UP>   ← kill switch added, interface still up
  ROUTES: default via 100.85.0.1 dev pvpnksintrf0 metric 98
          default via <gw> dev <eth>   metric 100
          100.85.0.0/24 dev pvpnksintrf0 ...

T+5.744s | LINK: <eth> DOWN, <NO-CARRIER,BROADCAST,MULTICAST,UP>   ← device_reapply fired; PHY reset; carrier lost
  ROUTES: default via 100.85.0.1 dev pvpnksintrf0 metric 98
          default via <gw> dev <eth> metric 100 linkdown
          <server_ip> via <gw> dev <eth> metric 100 linkdown   ← host route added but already linkdown
          <LAN>/24 dev <eth> ... linkdown

T+8.711s | LINK: <eth> DOWN, <NO-CARRIER,...>   ← TCP check failed, ProtonVPN tore down pvpnksintrf0
  ROUTES: default via <gw> dev <eth> metric 100 linkdown
          <server_ip> via <gw> dev <eth> metric 100 linkdown
          <LAN>/24 dev <eth> ... linkdown

T+11.686s | LINK: <eth> DOWN, <NO-CARRIER,...>   ← still down, ~6 seconds after device_reapply
  ROUTES: (none)

Approximate carrier-recovery time on this driver: 10–15 seconds.
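For anyone who wants to measure carrier recovery on their own NIC, here is a minimal sketch. It assumes the standard Linux sysfs file /sys/class/net/<iface>/carrier; the helper names (read_sysfs_carrier, wait_for_carrier) and the interface name in the usage note are hypothetical and not part of any ProtonVPN package.

```python
import time
from pathlib import Path
from typing import Callable, Optional


def read_sysfs_carrier(iface: str) -> bool:
    """Return True if the kernel reports carrier on `iface`."""
    try:
        return Path(f"/sys/class/net/{iface}/carrier").read_text().strip() == "1"
    except OSError:
        # The file cannot be read if the interface does not exist or is
        # administratively down; treat both cases as "no carrier".
        return False


def wait_for_carrier(
    has_carrier: Callable[[], bool],
    timeout: float = 20.0,
    interval: float = 0.2,
) -> Optional[float]:
    """Poll until carrier is present; return seconds waited, or None on timeout."""
    start = time.monotonic()
    while True:
        if has_carrier():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Starting `wait_for_carrier(lambda: read_sysfs_carrier("enp3s0"))` right after the reapply should return roughly the recovery time, with "enp3s0" replaced by the real interface name.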

TCP reachability sanity check (no VPN, server known reachable)

Port 443:  reachable
Port 7770: reachable
Port 8443: reachable
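The sanity check above can be reproduced with a sketch along these lines. This is not the tcpcheck.py implementation, just a plain connect_ex() probe; probe_tcp and describe are hypothetical names. Returning the errno makes it easy to tell an immediate EHOSTUNREACH / ENETUNREACH apart from a timeout.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> int:
    """Attempt a TCP connect; return 0 on success, otherwise the errno."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex() reports connect errors as a return code instead of raising.
        return sock.connect_ex((host, port))


def describe(code: int) -> str:
    """Render a probe result in the style of the output above."""
    if code == 0:
        return "reachable"
    return f"unreachable ({errno.errorcode.get(code, code)})"
```

Usage would be something like `for port in (443, 7770, 8443): print(port, describe(probe_tcp(server_ip, port)))`, with server_ip set to the chosen VPN-server IPv4.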

Root cause

The call sequence in python3-proton-vpn-api-core 5.2.4:

  1. proton/vpn/connection/states.py::Connecting.run_tasks() (line 250)
    self.context.kill_switch.enable(server, permanent=...)

  2. proton/vpn/backend/networkmanager/killswitch/wireguard/wgkillswitch.py::WGKillSwitch.enable() (line 59)
    self._ks_handler.add_kill_switch_connection(permanent) (adds pvpnksintrf0, metric 98 default route — fine)
    self._ks_handler.add_vpn_server_route(server_ip=...)

  3. proton/vpn/backend/networkmanager/killswitch/wireguard/killswitch_connection_handler.py::add_vpn_server_route() (line 144)
    → for each physical device: self.nm_client.add_route_to_device(device, ...)
    await self._wait_for_vpn_server_route(server_ip, device.get_iface(), found=True)

  4. proton/vpn/backend/networkmanager/killswitch/wireguard/nmclient.py::add_route_to_device() (line 354)
    → modifies the in-memory connection profile (adds host route to NM's IPv4 settings)
    cls._apply_connection_async(active_connection, ...)
    device.reapply_async(connection, version_id=0, flags=0, ...) (line 340)

  5. On the igb driver, reapply_async triggers a PHY reset. The interface goes NO-CARRIER, state DOWN. Kernel marks all routes via the physical interface — including the just-added host route — as linkdown.

  6. killswitch_connection_handler.py::_wait_for_vpn_server_route() (line 207)
    → polls ip route with delays [0.5, 0.5, 1, 1, 2]s, returns as soon as the route string matches.
    The regex f"{server_ip} via .* dev {interface_name} .*" does not check the linkdown flag. So the function returns "success" within 500 ms while the interface is still NO-CARRIER.

  7. proton/vpn/connection/states.py::Connecting.run_tasks() then calls self.context.connection.start().

  8. proton/vpn/backend/networkmanager/core/networkmanager.py::start() (line 65)
    await tcpcheck.is_any_port_reachable(self._vpnserver.server_ip, self._vpnserver.openvpn_ports.tcp)

  9. proton/vpn/backend/networkmanager/core/tcpcheck.py::is_port_reachable() opens a plain TCP socket with no SO_BINDTODEVICE. With the host route linkdown and the only viable default route being the dummy pvpnksintrf0, connect_ex() returns EHOSTUNREACH (or ENETUNREACH) immediately — no 5-second timeout. All three concurrent ports fail.

  10. is_any_port_reachable() returns False. networkmanager.py logs "VPN server NOT reachable.", fires events.Timeout, state machine goes to Error, CLI prints Error: Connection failed.
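The false positive in step 6 is easy to demonstrate in isolation. The snippet below applies the pattern quoted there to `ip route` output shaped like the T+5.744s sample. The addresses and interface name are placeholders (the real ones are redacted in this report), and find_server_route / is_usable are hypothetical helpers, not functions from the package.

```python
import re

# Placeholder values standing in for <server_ip>, <gw> and <eth>.
SERVER_IP = "203.0.113.10"
IFACE = "eth0"

SAMPLE_IP_ROUTE = (
    "default via 100.85.0.1 dev pvpnksintrf0 metric 98\n"
    "default via 192.0.2.1 dev eth0 metric 100 linkdown\n"
    "203.0.113.10 via 192.0.2.1 dev eth0 metric 100 linkdown\n"
)


def find_server_route(ip_route_stdout, server_ip, interface_name):
    """Search with the pattern quoted in step 6 (presence only)."""
    server_route = f"{server_ip} via .* dev {interface_name} .*"
    return re.search(server_route, ip_route_stdout)


def is_usable(match):
    """Stricter check: the matched route line must not carry the linkdown flag."""
    return match is not None and "linkdown" not in match.group(0).split()


match = find_server_route(SAMPLE_IP_ROUTE, SERVER_IP, IFACE)
print(match is not None)  # True: the route is present, so the wait returns
print(is_usable(match))   # False: but it is flagged linkdown
```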

Why it works on most systems but fails here

Most users don't hit this because:

  • WiFi adapters do not reset the radio on device_reapply; carrier remains.
  • Many Ethernet drivers (e1000e, r8169, etc.) update L3 config without touching the PHY.
  • igb on Intel I211 is unusual in triggering a PHY reset for the kind of in-memory route reapply that ProtonVPN performs.

The race is latent in the code for all users; the igb driver just exposes it deterministically.

Why OpenVPN works on the same system

NMKillSwitch (used for openvpn-tcp / openvpn-udp) takes a different approach: it adds exclusion routes (covering 0.0.0.0/0 minus the server IP) into the dummy kill-switch connection profile. It never calls device_reapply on the physical interface, so the link stays up and traffic to the server IP falls through to it via the main routing table. Confirmed working on the same hardware with Protocol: openvpn-tcp.
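To illustrate what "0.0.0.0/0 minus the server IP" expands to, here is one way to compute such a route set with the standard ipaddress module. I have not checked that NMKillSwitch computes it this way; exclusion_routes is a hypothetical name and the server address is a placeholder.

```python
import ipaddress


def exclusion_routes(server_ip: str):
    """Return the networks that cover all of IPv4 except `server_ip`."""
    everything = ipaddress.ip_network("0.0.0.0/0")
    server = ipaddress.ip_network(f"{server_ip}/32")
    return sorted(everything.address_exclude(server))


routes = exclusion_routes("203.0.113.10")
print(len(routes))  # 32
```

Because none of the 32 prefixes contains the server address, traffic to it is not captured by the dummy interface and falls through to the physical interface's own routes.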

Suggested fixes (any of these would resolve it)

The TCP-reachability check exists specifically to guard against the kill switch breaking server access ("after introducing the dummy kill switch network interface, the VPN connection backend tries to use it…" — comment in networkmanager.py:67). The fix needs to make that guard actually work when the kill-switch path itself disrupts the physical interface.

  1. Make _wait_for_vpn_server_route() wait for the route to become usable, not just present. Reject matches whose line contains linkdown:

    match = re.search(server_route, result.stdout)
    route_exists = bool(match)
    # A route flagged linkdown is present but not usable yet, so keep polling.
    if found and route_exists and "linkdown" not in match.group(0):
        return

    And/or extend the polling schedule beyond 5 s, since PHY reset recovery on some NICs takes 10–15 s.

  2. Bind the TCP check socket to the physical interface (SO_BINDTODEVICE) so that the probe's route lookup is limited to routes via that interface and cannot be satisfied by the dummy pvpnksintrf0 default route. The kernel still requires the link to be LOWER_UP for packets to leave, so this only helps once the link recovers, but it makes the probe correct in cases where the host route is marked linkdown spuriously, and it is the canonical approach for "probe via this specific interface".

  3. Don't use device_reapply to inject the host route. Add it directly via ip route add <server>/32 via <gw> dev <iface> (subprocess, like _run_ip_route_command already does), or via netlink. This avoids triggering driver-specific reapply behaviors entirely. Removing the route on disconnect would mirror this.

  4. Run the TCP reachability check before the kill switch is enabled, since at that point the routing table is untouched and the check accurately reflects whether the upstream network can reach the server.
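For fix (3), a rough sketch of building the ip route invocation follows. host_route_cmd is a hypothetical helper (I have not looked at the signature of _run_ip_route_command), and the addresses in the usage note are placeholders. Validating the inputs with ipaddress keeps malformed values out of the command line.

```python
import ipaddress
from typing import List


def host_route_cmd(action: str, server_ip: str, gateway: str, iface: str) -> List[str]:
    """Build argv for adding or deleting a /32 host route to the VPN server."""
    if action not in ("add", "del"):
        raise ValueError(f"unsupported action: {action!r}")
    # Raises ValueError (AddressValueError) on anything that is not an IPv4 address.
    server = ipaddress.IPv4Address(server_ip)
    gw = ipaddress.IPv4Address(gateway)
    return ["ip", "route", action, f"{server}/32", "via", str(gw), "dev", iface]
```

Running it would be `subprocess.run(host_route_cmd("add", "203.0.113.10", "192.0.2.1", "eth0"), check=True)` on connect and the same call with "del" on disconnect; both need CAP_NET_ADMIN.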

Any of these would unblock WireGuard on Intel I211 / igb hosts. (1) is the smallest patch and likely the safest.
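A slightly fuller sketch of (1), combining the linkdown check with a longer schedule. This is my assumption of how it could look and not the existing _wait_for_vpn_server_route() code: wait_for_usable_route, the get_routes callable (expected to return `ip route` output) and the extended delay list are all hypothetical. The delays sum to 15 s to cover the recovery time observed here.

```python
import asyncio
import re
from typing import Awaitable, Callable, Sequence

# Original schedule [0.5, 0.5, 1, 1, 2] plus two extra waits (assumption): 15 s total.
DELAYS = (0.5, 0.5, 1, 1, 2, 5, 5)


async def wait_for_usable_route(
    get_routes: Callable[[], Awaitable[str]],
    server_ip: str,
    interface_name: str,
    delays: Sequence[float] = DELAYS,
) -> bool:
    """Return True once the host route exists and is not flagged linkdown."""
    pattern = re.compile(
        rf"^{re.escape(server_ip)} via \S+ dev {re.escape(interface_name)}\b.*$",
        re.MULTILINE,
    )
    for delay in delays:
        await asyncio.sleep(delay)
        match = pattern.search(await get_routes())
        # Present-but-linkdown counts as "not ready yet".
        if match and "linkdown" not in match.group(0).split():
            return True
    return False
```

Returning False instead of silently succeeding lets the caller fail with a more specific error than "VPN server NOT reachable" if the link never comes back.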

Related existing issues

None of those identify device_reapply-induced PHY reset as the root cause, or the missing linkdown check in _wait_for_vpn_server_route as the amplifier.
