Excessive negative cache time for DNS lookups

Open

Details

3 participants

Thompson, David
Ludovic Courtès
Mark H Weaver

Owner: unassigned

Submitted by: Mark H Weaver

Severity: normal

Debbugs page

Mark H Weaver wrote on 15 Jun 2015 04:43

Recipients:(address . bug-guix@gnu.org)

Message-ID:87si9titcm.fsf@netris.org

On my GuixSD system, DNS lookup failures (due to lack of network) are

cached for at least 4 minutes, and possibly much longer (I've not had

the patience to wait until it expires).

Even after restarting nscd, the negative cache entry persists for about

another 20 seconds.

I run into this problem quite frequently after waking my system up from

suspend-to-ram. It typically takes several seconds for wicd to

establish a network connection, and if I make the mistake of triggering

a name lookup of my server (world.peace.net), then I must do the

following:

(1) deco restart nscd

(2) wait about 20 seconds

Looking at %nscd-default-caches in (gnu services base), I see that the

intent is to cache host lookup failures for 20 seconds, and service

lookup failures for an hour. The fact that the negative cache entry

persists for 20 seconds after restarting nscd would seem to indicate

that nscd considers this to be a host lookup. So perhaps this is a bug

in nscd.

Is anyone else able to reproduce this problem?

Mark

Thompson, David wrote on 15 Jun 2015 15:15

Recipients:(name . Mark H Weaver)(address . mhw@netris.org)(address . 20816@debbugs.gnu.org)

Message-ID:CAJ=RwfYU6HK8XDxgxnnjipoC7SGJvic1A5zOShWb68T3+AbPOw@mail.gmail.com

On Sun, Jun 14, 2015 at 10:43 PM, Mark H Weaver <mhw@netris.org> wrote:

Toggle quote (25 lines)> On my GuixSD system, DNS lookup failures (due to lack of network) are
> cached for at least 4 minutes, and possibly much longer (I've not had
> the patience to wait until it expires).
>
> Even after restarting nscd, the negative cache entry persists for about
> another 20 seconds.
>
> I run into this problem quite frequently after waking my system up from
> suspend-to-ram.  It typically takes several seconds for wicd to
> establish a network connection, and if I make the mistake of triggering
> a name lookup of my server (world.peace.net), then I must do the
> following:
>
> (1) deco restart nscd
> (2) wait about 20 seconds
>
> Looking at %nscd-default-caches in (gnu services base), I see that the
> intent is to cache host lookup failures for 20 seconds, and service
> lookup failures for an hour.  The fact that the negative cache entry
> persists for 20 seconds after restarting nscd would seem to indicate
> that nscd considers this to be a host lookup.  So perhaps this is a bug
> in nscd.
>
> Is anyone else able to reproduce this problem?

Yes, I had this exact same issue on Friday after having wireless

network connectivity problems.

- Dave

Ludovic Courtès wrote on 15 Jun 2015 22:02

Recipients:(name . Mark H Weaver)(address . mhw@netris.org)(address . 20816@debbugs.gnu.org)

Message-ID:874mm891ud.fsf@inria.fr

Mark H Weaver <mhw@netris.org> skribis:

Toggle quote (7 lines)

> On my GuixSD system, DNS lookup failures (due to lack of network) are

> cached for at least 4 minutes, and possibly much longer (I've not had

> the patience to wait until it expires).

> Even after restarting nscd, the negative cache entry persists for about

> another 20 seconds.

To reproduce it, I ran ‘guix system vm’ on this:

Toggle snippet (24 lines)(use-modules (gnu) (guix))
(use-package-modules base)

(operating-system
  (host-name "komputilo")
  (timezone "Europe/Berlin")
  (locale "en_US.UTF-8")
  (bootloader (grub-configuration (device "/dev/sdX")))
  (file-systems (cons (file-system
                        (device "root")
                        (title 'label)
                        (mount-point "/")
                        (type "ext4"))
                      %base-file-systems))
  (packages (cons* (canonical-package glibc) %base-packages))
  (services (map (lambda (mservice)
                   (mlet %store-monad ((service mservice))
                     (if (memq 'nscd (service-provision service))
                         (nscd-service (nscd-configuration
                                        (debug-level 10)))
                         mservice)))
                 %base-services)))

Then I had one console running “tail -f /var/log/nscd.log”, and another

one where I would run “ping www.gnu.org.”

Initially looking up www.gnu.org fails (because the network is down),

and nscd.log indeed shows:

Toggle snippet (6 lines)

Mon Jun 15 21:04:20 2015 - 276: handle_request: request received (Version = 2) from PID 296

Mon Jun 15 21:04:20 2015 - 276: GETAI (www.gnu.org)

Mon Jun 15 21:04:20 2015 - 276: Haven't found "www.gnu.org" in hosts cache!

Mon Jun 15 21:04:20 2015 - 276: add new entry "www.gnu.org" of type GETAI for hosts to cache (first)

For subsequent requests, it shows fewer details, meaning that the cached

failure is used:

Toggle snippet (5 lines)

Mon Jun 15 21:04:25 2015 - 276: handle_request: request received (Version = 2) from PID 297

Mon Jun 15 21:04:25 2015 - 276: GETFDHST

Mon Jun 15 21:04:25 2015 - 276: provide access to FD 6, for hosts

Twenty seconds after the original failure, we see:

Toggle snippet (4 lines)

Mon Jun 15 21:04:40 2015 - 276: pruning hosts cache; time 1434395080

Mon Jun 15 21:04:40 2015 - 276: considering GETAI entry "www.gnu.org", timeout 1434395080

So cache expiry seems to be working as expected.

Now, if we enable networking with “dhclient -v eth0” and see what

happens, we get again a series of lookup failures, even after the

www.gnu.org entry has been removed from cache; apparently new failures

get cached. Eventually, after a bunch of cache prunes (so a few times

20 seconds), lookup succeeds.

The situation is different if you use ‘getent hosts’ instead of ‘ping’–

note that the latter uses getaddrinfo whereas the former uses

gethostbyname{,v6}, which is cached differently. Here we get a

successful lookup within 20 seconds after networking has been put up.

So my guess is that the GETAI cache is racy or something.

I’ve run nscd with Helgrind, which reports a few things (see attached

file), but I suspect most of it are false positives.

Ludo’.

Attachment: vg.log

Your comment

Commenting via the web interface is currently disabled.

To comment on this conversation send an email to 20816@debbugs.gnu.org

To respond to this issue using the mumi CLI, first switch to it

mumi current 20816

Then, you may apply the latest patchset in this issue (with sign off)

mumi am -- -s

Or, compose a reply to this issue

mumi compose

Or, send patches to this issue

mumi send-email *.patch

is:open	open issues
is:done	closed issues
submitter:<who>	search issue submitter
author:<who>	search by message author
date:yesterday..now	search by issue date
mdate:3m..2d	search by message date