# WireGuard connect fails on Ethernet NICs whose driver resets the PHY during NM `device_reapply` (e.g. `igb` / Intel I211)

## Summary

When connecting via the WireGuard protocol, the kill-switch code path calls NetworkManager's `device_reapply_async()` on the physical interface to inject a host route for the VPN server. On NICs whose driver performs a PHY reset in response to `device_reapply` (Intel I211 with the `igb` driver, in this report), the interface loses carrier (`NO-CARRIER`, `state DOWN`) for several seconds. The kernel marks every route via that interface, including the freshly added VPN-server host route, as `linkdown` and unusable. The subsequent TCP reachability check in `networkmanager.py:start()` cannot find a usable route to the server IP and fails immediately with "VPN server NOT reachable". The connection is then aborted with "Error: Connection failed. Try connecting to a different server or check your network settings." Every WireGuard connect attempt fails identically.

OpenVPN-TCP works fine on the same system because its kill-switch path (`NMKillSwitch`) injects exclusion routes into a dummy kill-switch connection profile instead of calling `device_reapply` on the physical interface.

## Environment

- OS: Ubuntu 24.04 LTS
- Kernel: 6.8.0-generic (x86_64)
- NetworkManager: 1.46.0
- libnetplan1: 1.1.2
- NIC: Intel I211 Gigabit (PCI ID `8086:1539`), driver `igb`
- Physical interface: managed by NM, DHCP (referred to below as `<eth>`)

ProtonVPN packages (all installed from `repo.protonvpn.com/debian stable`, latest versions available):

| Package | Version |
|---|---|
| proton-vpn-cli | 1.0.1 |
| proton-vpn-daemon | 0.13.7 |
| python3-proton-vpn-api-core | 5.2.4 |
| python3-proton-vpn-local-agent | 1.6.3 |
| python3-proton-core | 0.7.4 |
| python3-proton-keyring-linux | 0.2.1 |

## Reproduction

1. Use a host with an Intel I211 NIC managed by the `igb` driver (or any NIC whose driver triggers a PHY reset on `device_reapply`).
2. Leave the protocol at its default (`wireguard`) and the kill switch at its default setting (`off`).
3. Run `protonvpn signin`, then `protonvpn connect` (or `protonvpn connect --country US`).

**Result:** `Error: Connection failed. Try connecting to a different server or check your network settings.` 100% reproducible.

## Diagnostic data

Below, `<eth>` is the physical interface name, `<gw>` is the LAN gateway, and `<server_ip>` is the chosen VPN-server IPv4.

### CLI verbose log (relevant lines)

```
T+0.000s | proton.vpn.core.vpnconnector | INFO | CONN.CONNECT:START | Protocol: wireguard
T+0.275s | proton.vpn.core.vpnconnector | INFO | CONN:STATE_CHANGED | Connecting
T+3.480s | proton.vpn.backend.networkmanager.core.networkmanager:80 | INFO | VPN server NOT reachable.
T+3.481s | proton.vpn.connection.states:401 | WARNING | Reached connection error state: Timeout (None)
T+3.482s | proton.vpn.core.vpnconnector | INFO | CONN:STATE_CHANGED | Disconnected
```

### Physical interface link state and routing table, sampled every 200 ms during connect

```
T+0.000s | LINK: <eth> UP, <BROADCAST,MULTICAST,UP,LOWER_UP>
  ROUTES: default via <gw> dev <eth> ... metric 100

T+5.530s | LINK: <eth> UP, <BROADCAST,MULTICAST,UP,LOWER_UP>   <- kill switch added, interface still up
  ROUTES: default via 100.85.0.1 dev pvpnksintrf0 metric 98
          default via <gw> dev <eth>   metric 100
          100.85.0.0/24 dev pvpnksintrf0 ...

T+5.744s | LINK: <eth> DOWN, <NO-CARRIER,BROADCAST,MULTICAST,UP>   <- device_reapply fired; PHY reset; carrier lost
  ROUTES: default via 100.85.0.1 dev pvpnksintrf0 metric 98
          default via <gw> dev <eth> metric 100 linkdown
          <server_ip> via <gw> dev <eth> metric 100 linkdown   <- host route added but already linkdown
          <LAN>/24 dev <eth> ... linkdown

T+8.711s | LINK: <eth> DOWN, <NO-CARRIER,...>   <- TCP check failed, ProtonVPN tore down pvpnksintrf0
  ROUTES: default via <gw> dev <eth> metric 100 linkdown
          <server_ip> via <gw> dev <eth> metric 100 linkdown
          <LAN>/24 dev <eth> ... linkdown

T+11.686s | LINK: <eth> DOWN, <NO-CARRIER,...>   <- still down, ~6 seconds after device_reapply
  ROUTES: (none)
```

Approximate carrier-recovery time on this driver: 10&ndash;15 seconds.

### TCP reachability sanity check (no VPN, server known reachable)

```
Port 443:  reachable
Port 7770: reachable
Port 8443: reachable
```
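A sanity check like the one above can be reproduced with a few lines of Python. This is a minimal sketch, not the code in `tcpcheck.py`: the helper name `probe_tcp_port` is made up for illustration, and like the ProtonVPN check it uses a plain socket with no interface binding. It returns the errno name on failure, which makes it easy to tell an immediate routing failure (`EHOSTUNREACH`, `ENETUNREACH`) from a real timeout.

```python
import errno
import socket


def probe_tcp_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Return "reachable", "timeout", or the errno name of the failure.

    Hypothetical helper for manual diagnosis; it is not part of ProtonVPN.
    """
    try:
        # Plain TCP connect: the kernel picks the route, no SO_BINDTODEVICE.
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # e.g. "EHOSTUNREACH", "ENETUNREACH", "ECONNREFUSED"
        return errno.errorcode.get(exc.errno, "error")
```

Calling `probe_tcp_port("<server_ip>", port)` for ports 443, 7770 and 8443 while no VPN is active corresponds to the check shown above.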

## Root cause

The call sequence in `python3-proton-vpn-api-core 5.2.4`:

1. `proton/vpn/connection/states.py::Connecting.run_tasks()` (line 250)
   &rarr; `self.context.kill_switch.enable(server, permanent=...)`

2. `proton/vpn/backend/networkmanager/killswitch/wireguard/wgkillswitch.py::WGKillSwitch.enable()` (line 59)
   &rarr; `self._ks_handler.add_kill_switch_connection(permanent)` (adds pvpnksintrf0, metric 98 default route &mdash; fine)
   &rarr; `self._ks_handler.add_vpn_server_route(server_ip=...)`

3. `proton/vpn/backend/networkmanager/killswitch/wireguard/killswitch_connection_handler.py::add_vpn_server_route()` (line 144)
   &rarr; for each physical device: `self.nm_client.add_route_to_device(device, ...)`
   &rarr; `await self._wait_for_vpn_server_route(server_ip, device.get_iface(), found=True)`

4. `proton/vpn/backend/networkmanager/killswitch/wireguard/nmclient.py::add_route_to_device()` (line 354)
   &rarr; modifies the in-memory connection profile (adds host route to NM's IPv4 settings)
   &rarr; `cls._apply_connection_async(active_connection, ...)`
   &rarr; **`device.reapply_async(connection, version_id=0, flags=0, ...)`** (line 340)

5. On the `igb` driver, `reapply_async` triggers a PHY reset. The interface goes `NO-CARRIER`, `state DOWN`. Kernel marks all routes via the physical interface &mdash; including the just-added host route &mdash; as `linkdown`.

6. `killswitch_connection_handler.py::_wait_for_vpn_server_route()` (line 207)
   &rarr; polls `ip route` with delays `[0.5, 0.5, 1, 1, 2]`s, returns as soon as the route string matches.
   &rarr; **The regex `f"{server_ip} via .* dev {interface_name} .*"` does not check the `linkdown` flag.** So the function returns "success" within 500 ms while the interface is still NO-CARRIER.

7. `proton/vpn/connection/states.py::Connecting.run_tasks()` then calls `self.context.connection.start()`.

8. `proton/vpn/backend/networkmanager/core/networkmanager.py::start()` (line 65)
   &rarr; `await tcpcheck.is_any_port_reachable(self._vpnserver.server_ip, self._vpnserver.openvpn_ports.tcp)`

9. `proton/vpn/backend/networkmanager/core/tcpcheck.py::is_port_reachable()` opens a plain TCP socket with no `SO_BINDTODEVICE`. With the host route `linkdown` and the only viable default route being the dummy pvpnksintrf0, `connect_ex()` returns `EHOSTUNREACH` (or `ENETUNREACH`) **immediately** &mdash; no 5-second timeout. All three concurrent ports fail.

10. `is_any_port_reachable()` returns `False`. `networkmanager.py` logs `"VPN server NOT reachable."`, fires `events.Timeout`, state machine goes to Error, CLI prints `Error: Connection failed.`
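
The false positive in step 6 can be reproduced offline. The sketch below applies the regex quoted in step 6 to a hand-written `ip route` sample modeled on the T+5.744s capture. The addresses (documentation ranges) and the interface name `eth0` are placeholders, not values from the affected host.

```python
import re

# Hand-written sample modeled on the T+5.744s capture; all values are placeholders.
SAMPLE_IP_ROUTE = """\
default via 100.85.0.1 dev pvpnksintrf0 metric 98
default via 192.0.2.1 dev eth0 metric 100 linkdown
203.0.113.10 via 192.0.2.1 dev eth0 metric 100 linkdown
192.0.2.0/24 dev eth0 proto kernel scope link metric 100 linkdown
"""

server_ip = "203.0.113.10"
interface_name = "eth0"

# Same pattern as quoted in step 6: it only tests for presence of the route.
server_route = f"{server_ip} via .* dev {interface_name} .*"
match = re.search(server_route, SAMPLE_IP_ROUTE)

route_exists = bool(match)
# Extra check proposed in this report: the matched line must not carry "linkdown".
route_usable = route_exists and "linkdown" not in match.group(0).split()

print(route_exists)  # True
print(route_usable)  # False
```

The route is reported as present even though the kernel has flagged it `linkdown`, which is the condition under which the wait function returns early.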

## Why it works on most systems but fails here

Most users don't hit this because:

- WiFi adapters do not reset the radio on `device_reapply`; carrier remains.
- Many Ethernet drivers (`e1000e`, `r8169`, etc.) update L3 config without touching the PHY.
- `igb` on Intel I211 is unusual in triggering a PHY reset for the kind of in-memory route reapply that ProtonVPN performs.

The race is latent in the code for all users; the `igb` driver just exposes it deterministically.

## Why OpenVPN works on the same system

`NMKillSwitch` (used for `openvpn-tcp` / `openvpn-udp`) takes a different approach: it adds exclusion routes (covering 0.0.0.0/0 minus the server IP) into the dummy kill-switch connection profile. It never calls `device_reapply` on the physical interface, so the link stays up and traffic to the server IP falls through to it via the main routing table. Confirmed working on the same hardware with `Protocol: openvpn-tcp`.
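
To illustrate what "0.0.0.0/0 minus the server IP" means in terms of routes, the sketch below computes such a prefix set with the standard `ipaddress` module. It is only an illustration of the idea; the function name `exclusion_routes` is hypothetical and this is not necessarily how `NMKillSwitch` builds its routes.

```python
import ipaddress


def exclusion_routes(server_ip: str) -> list:
    """Return IPv4 prefixes that together cover every address except server_ip."""
    everything = ipaddress.ip_network("0.0.0.0/0")
    server = ipaddress.ip_network(f"{server_ip}/32")
    # address_exclude() yields the networks left after removing `server`.
    return sorted(everything.address_exclude(server))


routes = exclusion_routes("203.0.113.10")  # placeholder server address
print(len(routes))  # 32
```

Because no prefix in the result contains the server address, traffic to the server is not captured by the dummy interface and follows the physical interface's existing default route.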

## Suggested fixes (any of these would resolve it)

The TCP-reachability check exists specifically to guard against the kill switch breaking server access ("after introducing the dummy kill switch network interface, the VPN connection backend tries to use it&hellip;" &mdash; comment in `networkmanager.py:67`). The fix needs to make that guard actually work when the kill-switch path itself disrupts the physical interface.

1. **Make `_wait_for_vpn_server_route()` wait for the route to become usable, not just present.** Reject matches whose line contains `linkdown`:

   ```python
   match = re.search(server_route, result.stdout)
   route_exists = bool(match)
   if found and route_exists and "linkdown" not in match.group(0):
       return
   ```

   And/or extend the polling schedule beyond 5 s, since PHY reset recovery on some NICs takes 10&ndash;15 s.

2. **Bind the TCP check socket to the physical interface** (`SO_BINDTODEVICE`) so that the probe's route lookup is restricted to that interface rather than resolved against the full routing table. The link must still be `LOWER_UP` for packets to leave, so this only helps once the link recovers, but it keeps the probe correct in cases where the host route is marked `linkdown` spuriously, and it is the canonical approach for "probe via this specific interface".

3. **Don't use `device_reapply` to inject the host route.** Add it directly via `ip route add <server>/32 via <gw> dev <iface>` (subprocess, like `_run_ip_route_command` already does), or via netlink. This avoids triggering driver-specific reapply behaviors entirely. Removing the route on disconnect would mirror this.

4. **Run the TCP reachability check before the kill switch is enabled**, since at that point the routing table is untouched and the check accurately reflects whether the upstream network can reach the server.
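
For fix 2, binding a probe socket to one interface could look roughly like the following. This is a sketch under assumptions: `probe_via_interface` is a hypothetical helper, not an existing function in `tcpcheck.py`, and `SO_BINDTODEVICE` is Linux-only and may require `CAP_NET_RAW` on older kernels.

```python
import socket


def probe_via_interface(host: str, port: int, interface_name: str,
                        timeout: float = 5.0) -> bool:
    """Try a TCP connect with the socket bound to a single network interface.

    Hypothetical helper; returns False on any socket error.
    """
    bind_opt = getattr(socket, "SO_BINDTODEVICE", None)  # absent off Linux
    if bind_opt is None:
        return False
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            # Raises OSError if the interface does not exist or the
            # process is not allowed to bind to it.
            sock.setsockopt(socket.SOL_SOCKET, bind_opt, interface_name.encode())
            sock.connect((host, port))
            return True
        except OSError:
            return False
```

Treating every `OSError` as "not reachable" keeps the sketch short; a real patch would probably want to distinguish a permission error from an unreachable server.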

Any of these would unblock WireGuard on Intel I211 / `igb` hosts. (1) is the smallest patch and likely the safest.
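
A standalone sketch of what (1) could look like, with both the `linkdown` check and a longer schedule, is shown below. Several things here are assumptions for illustration: the function name `wait_for_usable_route`, the injected `read_routes` callable (the real code is async and runs `ip route` itself), and the 15 s total schedule, which is chosen only to cover the 10 to 15 s recovery time observed on this host.

```python
import re
import time
from typing import Callable, Iterable


def wait_for_usable_route(
    read_routes: Callable[[], str],
    server_ip: str,
    interface_name: str,
    delays: Iterable[float] = (0.5, 0.5, 1, 1, 2, 2, 4, 4),  # 15 s in total
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the host route exists AND is not flagged linkdown.

    read_routes is a hypothetical callable returning `ip route` output.
    """
    pattern = re.compile(
        rf"^{re.escape(server_ip)} via \S+ dev {re.escape(interface_name)}\b.*$",
        re.MULTILINE,
    )
    for delay in delays:
        sleep(delay)
        match = pattern.search(read_routes())
        # Presence alone is not enough: skip routes the kernel marked linkdown.
        if match and "linkdown" not in match.group(0).split():
            return True
    return False
```

Injecting `read_routes` and `sleep` keeps the sketch testable without touching the real routing table; returning `False` after the schedule is exhausted leaves it to the caller to decide whether to raise a timeout.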

## Related existing issues

- ProtonVPN/proton-vpn-gtk-app#118 &mdash; "Error waiting for server route to be added" &mdash; same `_wait_for_vpn_server_route` function, observed as a `TimeoutError` rather than spurious success.
- ProtonVPN/proton-vpn-gtk-app#110 &mdash; `TimeoutError` in `killswitch_connection_handler.py` during connect, WireGuard + NM.
- ProtonVPN/proton-vpn-cli#14 &mdash; "Failure to connect, but killswitch created. Disabling killswitch errors." &mdash; describes the same user-visible failure and orphaned kill-switch teardown.

None of those identify `device_reapply`-induced PHY reset as the root cause, or the missing `linkdown` check in `_wait_for_vpn_server_route` as the amplifier.

