Tuning EDNS0 retries?

Sun May 21 18:30:14 UTC 2017

On a busy unbound 1.6.2 server I observed the following sequence of events.
The initial query socket is closed quickly (for a retry with a smaller
EDNS0 buffer size), so the first answer is dropped with an ICMP port
unreachable when it arrives.  The retry answer is finally accepted at the
retry socket about 47ms after the first answer.

-----

1.  Initial query with EDNS0 UDPsize = 8192

13:27:49.502228 IP (tos 0x0, ttl 64, id 61879, offset 0, flags [none], proto UDP (17), length 75)
   108.21.89.116.30230 > 199.254.50.1.53: 65168% [1au] DS? hairbylorelei.info. ar: . OPT UDPsize=8192 OK (47)

2.  ~90ms later, a retry with UDPsize = 1472

13:27:49.591319 IP (tos 0x0, ttl 64, id 51021, offset 0, flags [none], proto UDP (17), length 75)
   108.21.89.116.41507 > 199.254.50.1.53: 64543% [1au] DS? hairbylorelei.info. ar: . OPT UDPsize=1472 OK (47)

3.  ~120ms after the initial query, the response to that query arrives

13:27:49.621226 IP (tos 0x0, ttl 58, id 38806, offset 0, flags [none], proto UDP (17), length 786)
   199.254.50.1.53 > 108.21.89.116.30230: 65168*- q: DS? hairbylorelei.info. 0/6/1 ns: info. SOA a0.info.afilias-nst.info. noc.afilias-nst.info. 2011722024 3600 1800 604800 3600, info. RRSIG, adnsd9nk7nk82he8h21rj0jjhj11o5gb.info. Type50, adnsd9nk7nk82he8h21rj0jjhj11o5gb.info. RRSIG, 5p19pe3bk0hiejutcthqm2f2n674rv1g.info. Type50, 5p19pe3bk0hiejutcthqm2f2n674rv1g.info. RRSIG ar: . OPT UDPsize=4096 OK (758)

4.  An immediate ICMP port unreachable, because the original query's UDP socket is already closed:

13:27:49.621228 IP (tos 0x0, ttl 64, id 11225, offset 0, flags [none], proto ICMP (1), length 56)
   108.21.89.116 > 199.254.50.1: ICMP 108.21.89.116 udp port 30230 unreachable, length 36
       IP (tos 0x0, ttl 58, id 38806, offset 0, flags [none], proto UDP (17), length 786)

5.  Finally a reply to the second query:

13:27:49.668619 IP (tos 0x0, ttl 58, id 56774, offset 0, flags [none], proto UDP (17), length 786)
   199.254.50.1.53 > 108.21.89.116.41507: [udp sum ok] 64543*- q: DS? hairbylorelei.info. 0/6/1 ns: adnsd9nk7nk82he8h21rj0jjhj11o5gb.info. Type50, adnsd9nk7nk82he8h21rj0jjhj11o5gb.info. RRSIG, info. SOA a0.info.afilias-nst.info. noc.afilias-nst.info. 2011722024 3600 1800 604800 3600, info. RRSIG, 5p19pe3bk0hiejutcthqm2f2n674rv1g.info. Type50, 5p19pe3bk0hiejutcthqm2f2n674rv1g.info. RRSIG ar: . OPT UDPsize=4096 OK (758)

-----
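
For reference, the intervals between the packets above can be recomputed
from the tcpdump timestamps.  This is only a small sketch: the event
labels are my own, and it assumes the first socket was closed at about
the moment the retry was sent, which the trace does not show directly.

```python
from datetime import datetime

# Timestamps copied from the tcpdump output above (all within one second,
# so parsing the time of day alone is sufficient).
events = {
    "initial query": "13:27:49.502228",
    "retry query": "13:27:49.591319",
    "first answer": "13:27:49.621226",
    "icmp unreachable": "13:27:49.621228",
    "retry answer": "13:27:49.668619",
}


def parse(ts):
    """Parse an HH:MM:SS.ffffff tcpdump timestamp."""
    return datetime.strptime(ts, "%H:%M:%S.%f")


def delta_ms(start, end):
    """Milliseconds elapsed between two named events."""
    return (parse(events[end]) - parse(events[start])).total_seconds() * 1000.0


retry_after_ms = delta_ms("initial query", "retry query")
first_rtt_ms = delta_ms("initial query", "first answer")
# Assumption: the first socket was closed when the retry went out.
late_by_ms = delta_ms("retry query", "first answer")
retry_rtt_ms = delta_ms("retry query", "retry answer")
wasted_ms = delta_ms("first answer", "retry answer")

for label, value in [
    ("retry sent after initial query", retry_after_ms),
    ("first answer RTT", first_rtt_ms),
    ("first answer late by (after retry)", late_by_ms),
    ("retry answer RTT", retry_rtt_ms),
    ("retry answer after first answer", wasted_ms),
]:
    print(f"{label:<36} {value:8.3f} ms")
```

Under that assumption the first answer missed its socket by only about
30ms, and the retry answer arrived about 47ms after the dropped one.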

It seems I need a non-zero value of "delay-close".  What is a sensible
value for this?  I've seen 1500 (msec) mentioned; is that about right?

How should such a server be tuned?  It is used for broad DNSSEC/DANE
adoption surveys, so it is not uncommon to run around 1200 queries per
second for O(12) hours.  The machine has 4 hyper-threaded cores and
64GB of RAM.  Unbound is linked with libevent.  Relevant configuration:

       num-threads: 8
       infra-cache-slabs: 8 
       key-cache-slabs: 8   
       msg-cache-slabs: 8
       rrset-cache-slabs: 8
       key-cache-size: 256m
       rrset-cache-size: 256m
       msg-cache-size: 128m
       neg-cache-size: 16m
       jostle-timeout: 2000

       interface: 127.0.0.1
       so-reuseport: yes
       access-control: 127.0.0.0/8 allow
       edns-buffer-size: 8192
       max-udp-size: 8192

       outgoing-range: 6144
       num-queries-per-thread: 3077
       outgoing-port-permit: 1024-65535
       outgoing-port-avoid: 1-1023
       outgoing-num-tcp: 256
       incoming-num-tcp: 256
       so-rcvbuf: 12m
       so-sndbuf: 12m
       infra-cache-numhosts: 100000
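
To get a feel for what a delay-close of 1500 might cost in sockets, here
is a rough back-of-the-envelope sketch using Little's law (sockets held
= arrival rate x holding time).  The query rate, thread count and
outgoing-range come from this message.  The number of upstream queries
per client query and the fraction that time out are hypothetical
placeholders, not measurements, and I am assuming (unverified) that
sockets held open by delay-close count against outgoing-range.

```python
# Figures taken from this message.
CLIENT_QPS = 1200        # survey query rate (queries per second)
NUM_THREADS = 8          # num-threads
OUTGOING_RANGE = 6144    # outgoing-range (per-thread socket limit)
DELAY_CLOSE_MS = 1500    # candidate delay-close value (msec)


def lingering_per_thread(client_qps, upstream_per_query, timeout_fraction,
                         delay_close_ms, num_threads):
    """Estimate sockets held open per thread by delay-close (Little's law).

    upstream_per_query and timeout_fraction are hypothetical inputs;
    load is assumed to be spread evenly across threads.
    """
    timed_out_per_sec = client_qps * upstream_per_query * timeout_fraction
    return timed_out_per_sec * (delay_close_ms / 1000.0) / num_threads


# Hypothetical: 3 upstream queries per client query.
UPSTREAM_PER_QUERY = 3

# Pessimistic case: every upstream query times out before it is answered.
worst_case = lingering_per_thread(
    CLIENT_QPS, UPSTREAM_PER_QUERY, 1.0, DELAY_CLOSE_MS, NUM_THREADS)

# Milder guess: one upstream query in ten times out.
mild_case = lingering_per_thread(
    CLIENT_QPS, UPSTREAM_PER_QUERY, 0.1, DELAY_CLOSE_MS, NUM_THREADS)

worst_share = worst_case / OUTGOING_RANGE

print(f"worst case: {worst_case:.1f} sockets/thread "
      f"({worst_share:.1%} of outgoing-range)")
print(f"mild case:  {mild_case:.1f} sockets/thread")
```

With these placeholder inputs even the pessimistic case holds roughly
11% of outgoing-range per thread, but the real ratio depends on the
actual timeout rate and on how unbound accounts for delayed closes.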

-- 
	Viktor.