VLANs in static-networking-service-type hangs shepherd

  • Open
Details
3 participants
  • Alexey Abramov
  • Ludovic Courtès
  • Lars Rustand
Owner
unassigned
Submitted by
Lars Rustand
Severity
important
Lars Rustand wrote on 19 Jan 2024 20:12
(address . bug-guix@gnu.org)
87v87pc7ul.fsf@yoga.mail-host-address-is-not-set
Like the title says, if you add any VLAN in a
static-networking-service-type it seems like the whole shepherd daemon
freezes up and anything that depends on it stops responding.
Additionally the networking does not get fully configured either.

After configuring a VLAN, `herd status`, `herd restart networking`, and
any other herd command hang forever with no output. Even reboot does not
work. The only remedy is to restart the system using the power
button, but even after the restart the networking service still fails to
start.

The VLAN interfaces appear to be created, but no addresses are assigned to them.

Steps to reproduce:

1. Add a static network with a VLAN to your system config (see below for
minimal example)
2. Reconfigure your system
3. Restart the networking service with `sudo herd restart networking`
4. Observe that herd does not finish
5. Try to run `herd status`, `guix system reconfigure`, or `sudo reboot`.
6. Observe that none of the commands seem to have any effect, and that
they hang indefinitely with no output

(service static-networking-service-type
         (list (static-networking
                (links
                 (list (network-link
                        (name "myvlan")
                        (type 'vlan)
                        (arguments '((id . 3)
                                     (link . "eth0"))))))
                (addresses
                 (list (network-address
                        (device "myvlan@eth0")
                        (value "192.168.0.2/24")))))))

Alternatively here are the reproduction steps using VM:

1. Build a qcow2 image, make sure there is enough space to reconfigure
the system. Use --save-provenance so you have the config inside the
vm so you can reconfigure later.
`guix system image --image-type=qcow2 --image-size=30G --save-provenance minimal.scm`
2. Copy the qcow image to a writable directory.
3. Start up the vm.
```
sudo qemu-system-x86_64 \
-nic user,model=virtio-net-pci \
-enable-kvm -m 2048 \
-device virtio-blk,drive=myhd \
-drive if=none,file=1a7wi5mgcy3wrsx6pcnag6qjbb87djwl-image.qcow2,id=myhd
```
4. Edit /run/current-system/configuration.scm and uncomment the static
networking.
5. Reconfigure the system.
6. Try to restart the networking service. `herd restart networking`
7. The command will hang indefinitely. Cancel it.
8. Check the network interfaces. The VLAN interface will have been
created, but it will not have any address.
9. The aforementioned commands will all be unresponsive now.
10. If you reboot your VM you will see that the networking service has
failed at startup, and if you try to restart the service you will get
an error: #<&netlink-response-error errno: 17>

(use-modules
 (gnu)
 (gnu services)
 (gnu services base)
 (gnu services networking)
 (gnu bootloader)
 (gnu bootloader grub)
 (gnu system)
 (gnu system file-systems)
 (gnu system accounts))

(operating-system
  (host-name "minimal")

  (users
   (cons*
    (user-account
     (name "lars")
     (group "users"))
    %base-user-accounts))

  (services
   (cons*
    (service dhcp-client-service-type)
    ;; Commented out so you can uncomment it after booting the VM
    ;;(service static-networking-service-type
    ;;         (list (static-networking
    ;;                (links
    ;;                 (list (network-link
    ;;                        (name "myvlan")
    ;;                        (type 'vlan)
    ;;                        (arguments '((id . 3)
    ;;                                     (link . "eth0"))))))
    ;;                (addresses
    ;;                 (list (network-address
    ;;                        (device "myvlan@eth0")
    ;;                        (value "192.168.0.2/24")))))))
    %base-services))

  (bootloader
   (bootloader-configuration
    (bootloader grub-bootloader)
    (targets '("/dev/vda"))))

  (file-systems
   (cons*
    %base-file-systems)))
Lars Rustand wrote on 20 Jan 2024 00:32
(address . bug-guix@gnu.org)
87r0icdh1l.fsf@yoga.mail-host-address-is-not-set
For fun I tried to use the exact configuration that is mentioned in the
manual and was amazed that it worked, and the networking service is able
to start successfully. Here is the working configuration:

(static-networking
 (links (list (network-link
               (name "bond0")
               (type 'bond)
               (arguments '((mode . "802.3ad")
                            (miimon . 100)
                            (lacp-active . "on")
                            (lacp-rate . "fast"))))

              (network-link
               (mac-address "98:11:22:33:44:55")
               (arguments '((master . "bond0"))))

              (network-link
               (mac-address "98:11:22:33:44:56")
               (arguments '((master . "bond0"))))

              (network-link
               (name "bond0.1055")
               (type 'vlan)
               (arguments '((id . 1055)
                            (link . "bond0"))))))
 (addresses (list (network-address
                   (value "192.168.1.4/24")
                   (device "bond0.1055")))))


However, if I simply substitute the bond interface with a real interface
I get back the error described in my previous message. This
configuration fails:

(static-networking
 (links (list (network-link
               (name "bond0.1055")
               (type 'vlan)
               (arguments '((id . 1055)
                            (link . "ens3"))))))
 (addresses (list (network-address
                   (value "192.168.1.4/24")
                   (device "bond0.1055")))))


So it seems that VLANs do work for bonds, but not for physical network
interfaces. I've done a lot of digging on the internet and cannot find a
single example of anyone using VLANs at all in Guix, so maybe that is
why this problem hasn't been discovered yet.
Ludovic Courtès wrote on 12 Feb 2024 10:55
Re: bug#68595: VLANs in static-networking-service-type hangs shepherd
(name . Lars Rustand)(address . rustand.lars@gmail.com)
875xyugg6j.fsf@gnu.org
Hi,

Lars Rustand <rustand.lars@gmail.com> skribis:

> Like the title says, if you add any VLAN in a
> static-networking-service-type it seems like the whole shepherd daemon
> freezes up and anything that depends on it stops responding.
> Additionally the networking does not get fully configured either.
>
> After configuring a VLAN `herd status`, `herd restart networking` and
> any other herd command hangs forever with no output. Even reboot is not
> working. The only remedy is to restart the system using the power
> button, but even after the restart the networking service still fails to
> start.

Ouch. Could you check what /var/log/messages reports?

Once you’ve reproduced the hang, could you attach GDB to shepherd and
get a backtrace?

gdb -p 1
bt

(I recommend doing that in a VM rather than on your main machine!)

> 1. Add a static network with a VLAN to your system config (see below for
> minimal example)
> 2. Reconfigure your system
> 3. Restart the networking service with `sudo herd restart networking`
> 4. Observe that herd does not finish
> 5. Try to run `herd status`, `guix system reconfigure`, or `sudo reboot`.
> 6. Observe that none of the commands seem to have any effect, and that
> they hang indefinitely with no output
>
> (service static-networking-service-type
> (list (static-networking
> (links
> (list (network-link
> (name "myvlan")
> (type 'vlan)
> (arguments '((id . 3)
> (link . "eth0"))))))
> (addresses
> (list (network-address
> (device "myvlan@eth0")
> (value "192.168.0.2/24")))))))

You mentioned in your other message that the example from the manual
works fine. Could you try and reduce your config until you find which
bit makes it fail?

Cc’ing Alexey and Julien who may know more.

Thanks,
Ludo’.
Ludovic Courtès wrote on 12 Feb 2024 10:55
control message for bug #68595
(address . control@debbugs.gnu.org)
874jeegg60.fsf@gnu.org
severity 68595 important
quit
Alexey Abramov wrote on 12 Feb 2024 12:59
Re: bug#68595: VLANs in static-networking-service-type hangs shepherd
(name . Lars Rustand)(address . rustand.lars@gmail.com)(address . 68595@debbugs.gnu.org)
87r0hh28rl.fsf@delta.lan
Hi Lars,

Lars Rustand <rustand.lars@gmail.com> writes:

> Like the title says, if you add any VLAN in a
> static-networking-service-type it seems like the whole shepherd daemon
> freezes up and anything that depends on it stops responding.
> Additionally the networking does not get fully configured either.
>
> After configuring a VLAN `herd status`, `herd restart networking` and
> any other herd command hangs forever with no output. Even reboot is not
> working. The only remedy is to restart the system using the power
> button, but even after the restart the networking service still fails to
> start.
>
> VLANs are seemingly created, but no addresses are created.
>
> Steps to reproduce:
>
> 1. Add a static network with a VLAN to your system config (see below for
> minimal example)
> 2. Reconfigure your system
> 3. Restart the networking service with `sudo herd restart networking`
> 4. Observe that herd does not finish
> 5. Try to run `herd status`, `guix system reconfigure`, or `sudo reboot`.
> 6. Observe that none of the commands seem to have any effect, and that
> they hang indefinitely with no output
>
> --8<---------------cut here---------------start------------->8---
> (service static-networking-service-type
> (list (static-networking
> (links
> (list (network-link
> (name "myvlan")
> (type 'vlan)
> (arguments '((id . 3)
> (link . "eth0"))))))
> (addresses
> (list (network-address
> (device "myvlan@eth0")
> (value "192.168.0.2/24")))))))
> --8<---------------cut here---------------end--------------->8---

I see. Could you please change the device name in the network-address
from "myvlan@eth0" to "myvlan"?

Even though ip link (iproute2) displays 'myvlan@eth0', that is not the
actual name of the interface.
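
A sketch of the reporter's snippet with only that change applied, so that the `device` field names the link exactly as it is declared in `network-link`. This only illustrates the suggestion; it is not a confirmed fix, and it assumes the underlying interface really is called "eth0" on the target machine:

```scheme
;; Same service as in the report; only the `device' value differs.
(service static-networking-service-type
         (list (static-networking
                (links
                 (list (network-link
                        (name "myvlan")        ; the interface name as created
                        (type 'vlan)
                        (arguments '((id . 3)
                                     (link . "eth0"))))))
                (addresses
                 (list (network-address
                        ;; "myvlan", not the "myvlan@eth0" form that
                        ;; `ip link' displays.
                        (device "myvlan")
                        (value "192.168.0.2/24")))))))
```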

> Alternatively here are the reproduction steps using VM:
>
> 1. Build a qcow2 image, make sure there is enough space to reconfigure
> the system. Use --save-provenance so you have the config inside the
> vm so you can reconfigure later.
> `guix system image --image-type=qcow2 --image-size=30G --save-provenance minimal.scm`
> 2. Copy the qcow image to a writable directory.
> 3. Start up the vm.
> ```
> sudo qemu-system-x86_64 \
> -nic user,model=virtio-net-pci \
> -enable-kvm -m 2048 \
> -device virtio-blk,drive=myhd \
> -drive
> if=none,file=1a7wi5mgcy3wrsx6pcnag6qjbb87djwl-image.qcow2,id=myhd
> ```
> 4. Edit /run/current-system/configuration.scm and uncomment the static
> networking.
> 5. Reconfigure the system.
> 6. Try to restart the networking service. `herd restart networking`
> 7. The command will hang infinitely. Cancel it.
> 8. Check the network interfaces. The VLAN interface will have been
> created, but it will not have any address.
> 9. The aforementioned commands will all be unresponsive now.
> 10. If you reboot your VM you will see that the networking service is
> failed at startup, and if you try to restart the service you will get
> an error: #<&netlink-response-error errno: 17>
>

We need to improve our error messages. Errno 17 is EEXIST, which means
that the interface already exists.
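
For reference, a minimal Guile sketch for decoding the number in `#<&netlink-response-error errno: 17>`. It assumes a Linux system, where `EEXIST` is 17; `EEXIST` and `strerror` are standard Guile bindings, and nothing here is specific to Guix or to the netlink code:

```scheme
;; Sketch: map the numeric errno from the netlink error to its meaning.
;; Assumes Linux, where EEXIST has the value 17.
(display (= EEXIST 17))   ; does errno 17 correspond to EEXIST here?
(newline)
(display (strerror 17))   ; human-readable description of errno 17
(newline)
```

On a Linux system with glibc this should print `#t` followed by `File exists`, which matches the reading that the VLAN link is already present when the service tries to create it again.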

--
Alexey
Lars Rustand wrote on 15 Feb 2024 10:07
(name . Ludovic Courtès)(address . ludo@gnu.org)
87il2q3wv2.fsf@yoga.mail-host-address-is-not-set
Ludovic Courtès <ludo@gnu.org> writes:

> Ouch. Could you check what /var/log/messages reports?
>
> Once you’ve reproduced the hang, could you attach GDB to shepherd and
> get a backtrace?
>
> gdb -p 1
> bt
>
> (I recommend doing that in a VM rather than on your main machine!)
>

Unfortunately I have been unable to reproduce the full shepherd hang,
even though I followed the exact same procedure as before. The command
`herd restart networking` still hangs indefinitely the first time after
adding a VLAN, but this no longer causes the whole shepherd to hang
afterwards.

The basic error 17 still occurs any time I try to start the networking
service while a VLAN is configured.


>
> You mentioned in your other message that the example from the manual
> works fine. Could you try and reduce your config until you find which
> bit makes it fail?

The configuration I have already attached is as minimal as possible.
It only includes the mandatory OS fields and a minimal
static-networking configuration.

I have already found which bit makes it fail. It is the use of VLAN for
any normal network link. VLANs seem to only work for bond devices as in
the example.

The reproduction steps are maybe a little over-complicated however, and
are only necessary in order to reproduce the full "shepherd hangs" bug,
which I now am unable to reproduce anyway. But what I believe is the
root of the problem is the error 17 on starting the networking
service. This can be reproduced much more simply and reliably by just
starting a VM the normal way with the static-networking snippet already
enabled when building it.

So here are the new simplified steps for reproducing only the error 17
and the non-functional VLAN networking:

Use the OS config from my first post, but uncomment the static
networking block. Build and run the VM with `$(guix system vm minimal.scm)`.

That's it.

> Cc’ing Alexey and Julien who may know more.
>
> Thanks,
> Ludo’.



Alexey Abramov <levenson@mmer.org> writes:

> Hi Lars,
>
> I see, Could you please, replace the device name to "myvlan" and not
> "myvlan@eth0" in the network-address.
>
> Even though ip link (iproute2) shows you 'myvlan@eth0' this is not an
> actual name of the interfaces.
>

I have tried with your suggestion, but everything behaves exactly the same.