[Unbound-users] Python API extension patch proposal

Thu Sep 7 13:36:49 UTC 2017

Hello,

Reviving this thread after a long silence.
We have tested the module in-house over the past two years and added
some tweaks to handle IPv6 clients. By default, clients are gathered
under the same /64. If you have very large IPv6 customer spaces and are
giving out /48s, you can tune IPV6_PREFIXES_MAP to define exceptions, so
that addresses matching, say, 2001:xxxx/32 are gathered by /48.
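
To illustrate the grouping, here is a minimal sketch. It assumes
IPV6_PREFIXES_MAP maps a covering network to the prefix length used for
gathering; the actual layout in the module may differ. The 2001:db8::/32
documentation prefix is only a placeholder, and tracking IPv4 clients per
address is also an assumption.

```python
import ipaddress

# Assumed layout: covering network -> prefix length used for gathering.
# The real IPV6_PREFIXES_MAP in the module may be structured differently.
IPV6_PREFIXES_MAP = {
    ipaddress.ip_network("2001:db8::/32"): 48,  # placeholder prefix
}
DEFAULT_IPV6_PREFIX = 64


def client_key(address):
    """Return the key under which a client's counters are gathered."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        return str(ip)  # assumption: IPv4 clients are tracked per address
    prefix_len = DEFAULT_IPV6_PREFIX
    for network, exception_len in IPV6_PREFIXES_MAP.items():
        if ip in network:
            prefix_len = exception_len
            break
    # strict=False masks the host bits instead of raising ValueError
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))
```

With this map, every address inside one customer /48 of the exception
space shares a single key, while all other IPv6 clients share a key per
/64.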

Here is the github repository for the module :
https://github.com/asahinet-isp/unbound-geigeki/

Here is how you install it in Unbound (though you do need to tweak it a 
bit by hand as of now) :
<--
server:
     [...]
     module-config: "python iterator"

python:
     python-script: "/etc/unbound/geigeki.py"
-->

My company has finally given clearance for publishing it, which we will 
be doing on the above github account.
If judged appropriate, it may be entered in contrib/ in its current 
iteration.

Cheers,

On 2015-01-05 20:40, Stephane LAPIE wrote:
> Hi Wouter,
> 
> On 01/05/2015 07:21 PM, W.C.A. Wijngaards wrote:
>> > This patch, along with an actual module that will SERVFAIL (as
>> > above) cache-missing queries going over threshold (therefore
>> > reducing upstream traffic to a tenth of what it would be if
>> > honoring DDoS-related queries, AND keeping it within our AS), has
>> > been running in our production environment at ASAHI Net for several
>> > months now, and has been approved for upstream contribution on our
>> > side.
>> 
>> Thank you for the patch, I have put it in the source.  Can you tell
>> the allow_query() details that work for you (the threshold and what
>> you do with the AS specifically)?
> 
> Many thanks for your understanding.
> I'll try to provide as many notes as I am allowed to.
> 
> I don't have total clearance yet to publish the full code, or the exact
> thresholds, but basically here's what I am doing and where the reasoning
> comes from.
> 
> 
> 
> * Storing information :
> I am storing query details for :
> - the client's address
> - the delegpt name (simplified form of qdsjhfoishfsdofjdqs.domain.com,
> "domain.com", derived via infra cache), which I will refer to hereafter
> as "domain"
> - the "client-domain" pair
> 
> Then, for each of these, I create python dictionaries (it's tricky to
> handle locking properly though ;)) that have a series of counters for :
> - Normal query count (anything short of ANY)
> -> If used in conjunction with the AAAA filter, ignore these queries as
> they will never return meaningful information
> - ANY query count
> -> This is usually blocked at firewall level, but I wanted to try a few
> things
> - NXDOMAIN count (I get the return code from another event)
> -> A server capable of answering NXDOMAINs will crank up this counter
> extremely fast in case of a violent attack
> - RRSET count (if the query was a success, and had meaningful data, I
> check the answer's RRSET count)
> -> It could probably be used to detect and block weird forms of AMP
> attacks, using the TXT records for instance.
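
The bookkeeping described above could be sketched as follows. This is
not the module's code: the class, the counter names and record_query()
are hypothetical, and a single lock per table is just the simplest way
to keep updates consistent across threads.

```python
import threading
from collections import defaultdict

# Hypothetical names for the four counters listed above.
COUNTER_NAMES = ("normal", "any", "nxdomain", "rrset")


class CounterTable:
    """Thread-safe table of counters, keyed by client, domain or pair."""

    def __init__(self):
        self._lock = threading.Lock()
        self._table = defaultdict(lambda: dict.fromkeys(COUNTER_NAMES, 0))

    def bump(self, key, counter, amount=1):
        """Increment one counter for a key and return its new value."""
        with self._lock:
            self._table[key][counter] += amount
            return self._table[key][counter]

    def snapshot(self, key):
        """Return a copy of the counters for a key (zeros if unknown)."""
        with self._lock:
            return dict(self._table.get(key, dict.fromkeys(COUNTER_NAMES, 0)))


clients = CounterTable()
domains = CounterTable()
pairs = CounterTable()


def record_query(client, domain, qtype):
    """Count one incoming query in all three tables."""
    counter = "any" if qtype == "ANY" else "normal"
    clients.bump(client, counter)
    domains.bump(domain, counter)
    pairs.bump((client, domain), counter)
```

The NXDOMAIN and RRSET counters would be bumped the same way from
whichever event delivers the return code and the answer.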
> 
> Then, I have set thresholds for each of these counters, applicable to
> client, domains, and client-domain pairs.
> These thresholds are basically hand-set, applying roughly an 80% ratio
> to the kind of full-scale DDoS we take :
> - A given DDoS participant gives out 4 attack queries / second to that
> domain, which is around 1200 cache-missing queries in 5 minutes.
> - A given domain in a DDoS therefore takes around a hundred of these at
> the very least.
> - One can therefore start flagging a client-domain relation around 5
> cache-missing queries in 5 minutes (not outright blocking! The purpose
> is to contain an attack without harming legitimate use).
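
The arithmetic behind those figures, as a small sketch. Only the numbers
quoted above are used; the function and variable names are made up for
illustration.

```python
WINDOW_SECONDS = 5 * 60  # the five-minute window used above


def queries_per_window(rate_per_second, window_seconds=WINDOW_SECONDS):
    """Number of queries a steady sender produces in one window."""
    return rate_per_second * window_seconds


# One DDoS participant at 4 attack queries per second.
attacker_volume = queries_per_window(4)

# Flagging starts around 5 cache-missing queries per window,
# which averages out to one query per minute.
flag_threshold = 5
flag_rate_per_minute = flag_threshold / (WINDOW_SECONDS / 60)

# How far an attacker overshoots the flagging level.
overshoot = attacker_volume / flag_threshold
```

An attacker therefore sits at 1200 queries per window, 240 times the
flagging level, which leaves a wide margin between flagging and what a
legitimate client normally sends to one domain.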
> 
> 
> 
> * Decision process (summary) :
> - In either of these two cases :
> -- A client breaches the "single client" threshold (this guy is sending
> way too much stuff to be legit, no matter what), or the "client-domain"
> maximum threshold
> -- A domain is above the DDoS threshold (given how many queries it has
> received in five minutes, it's being hammered), and the "client-domain"
> is within suspicion range
> my implementation of allow_query() will return false, and the query
> will be blocked.
> - For mitigation of false positives :
> -- The above mainly uses the normal, ANY and NXDOMAIN counters and
> positive thresholds
> -- In order to avoid false positives, I check if the RRSET counter is
> greater than zero, and if so, this means this domain, or client, or
> client-domain pair has actually dealt with meaningful info and is
> probably not involved in a bona fide DDoS, since it provides stuff that
> will fill the cache with meaningful info.
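
A sketch of that decision process. The threshold values are invented
(the real ones were not published), the counters are plain dictionaries,
and applying the RRSET exemption separately to each of client, domain
and pair is only one possible reading of the description.

```python
# Hypothetical thresholds per five-minute window; not the production values.
CLIENT_MAX = 3000       # "single client" threshold
PAIR_MAX = 1000         # "client-domain" maximum threshold
PAIR_SUSPICION = 5      # "client-domain" suspicion threshold
DOMAIN_DDOS = 100000    # domain-wide DDoS threshold


def peak(counters):
    """Highest of the normal, ANY and NXDOMAIN counters.

    Assumption: an entity whose RRSET counter is above zero has produced
    meaningful answers, so it is treated as not suspicious at all.
    """
    if counters["rrset"] > 0:
        return 0
    return max(counters["normal"], counters["any"], counters["nxdomain"])


def allow_query(client, domain, pair):
    """Return False when a cache-missing query should get SERVFAIL."""
    # Case 1: the client alone, or the client-domain pair, is over its maximum.
    if peak(client) > CLIENT_MAX or peak(pair) > PAIR_MAX:
        return False
    # Case 2: the domain is being hammered and this pair looks suspicious.
    if peak(domain) > DOMAIN_DDOS and peak(pair) > PAIR_SUSPICION:
        return False
    return True
```

A pair that has seen even one successful answer passes, which is the
false-positive mitigation described above.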
> 
> 
> 
> * Data purging :
> - Also, every minute, I decrement each counter by a fifth of the total
> threshold. This ensures I "forget" about an attacker, but not too fast.
> - The last decrement step goes down to the suspicion threshold (if set),
> and if the client really has been behaving properly and not hammering,
> then it's fully decremented back to zero; otherwise it keeps getting
> SERVFAIL.
> - Every five minutes, I weed out dictionary entries that are already
> at zero, to ensure I don't leak memory like crazy.
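
One way to read the purging rules, as a sketch. The function names, the
single-value-per-key table and the exact handling of the suspicion floor
are assumptions; the description above leaves room for other
interpretations.

```python
def decay_once(table, threshold, suspicion=0):
    """Run once a minute: subtract a fifth of the threshold from each entry.

    Entries are clamped at the suspicion floor on the way down. An entry
    that is still at or below the floor on a later pass has not been
    hammering in the meantime, so it is reset to zero.
    """
    step = max(1, threshold // 5)
    for key, count in table.items():
        if count <= suspicion:
            table[key] = 0
        else:
            table[key] = max(suspicion, count - step)


def purge(table):
    """Run every five minutes: drop entries that have decayed to zero."""
    for key in [k for k, v in table.items() if v == 0]:
        del table[key]
```

With a threshold of 1000 and a suspicion floor of 5, an entry at 1000
needs five quiet minutes to reach the floor and a sixth to be forgotten.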
> 
> I could actually also implement dictionaries based on authority servers,
> but this performs very poorly, and it's not possible to update all the
> dictionaries in the time of a query answer. I think the above algorithm
> already considerably slows down the whole thing, but it still works just
> fine, thank gods.
> 
> 
> 
> * Notes :
> - I have actually noticed our attackers have caught on to this blocking
> method, and they started attacking domains sharing the same authority
> servers, thus splitting the load across several domains.
> -> Actually, though, this can probably be blocked with outgoing rate
> limit on firewall level, but I was thinking it might make sense for
> unbound's iterator module to store information on how much a target
> server is solicited to mitigate attacks.
> - I am also sending out UDP broadcasts to a monitoring server when
> blocking a query, to keep tabs on blocked domains. Eventually, I am
> thinking of having a separate thread to get these broadcasts directly
> from unbound so as to share the information that a given client is up
> to no good, or that a domain is hammered, or that a given client is
> forbidden from accessing a specific domain.
> -> It might be nice to have python module environment variables
> accessible via a stats dump or something, food for thought.
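
The UDP notification could look roughly like this. The JSON payload,
the field names, the port and the broadcast address are all assumptions
made for the example; the actual wire format is not described in this
thread.

```python
import json
import socket

# Hypothetical destination for the monitoring broadcasts.
MONITOR_ADDR = ("255.255.255.255", 5300)


def build_notice(client, domain, reason):
    """Encode one "query blocked" event as a small JSON datagram."""
    event = {"client": client, "domain": domain, "reason": reason}
    return json.dumps(event, sort_keys=True).encode("utf-8")


def send_notice(payload, addr=MONITOR_ADDR):
    """Fire-and-forget UDP broadcast; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # SO_BROADCAST is required before sending to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, addr)
```

Calling send_notice(build_notice("192.0.2.1", "domain.com", "pair-max"))
from the blocking path keeps the monitoring side informed without making
the resolver wait for anything.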
> 
> 
> 
> * Measured effect :
> - A client's queries are blocked only if they are really hammering the
> server, or aggravating an already overwhelmed domain.
> -> Legitimate traffic or queries for cached entries are not affected,
> even if they are participating in a DDoS. This avoids harming "owned"
> customers, and earns the extra time needed to actually help them out.
> - Queries that should go upstream to the victim authority server are
> stopped, and a SERVFAIL is answered to the client.
> -> This not only ensures frivolous recursive queries are not performed,
> it also ensures no negative caching is done on Unbound's side, and keeps
> memory and network usage tight.
> 
> Concretely, this means that the query flow no longer looks like this :
> 1) Client -> ISP network -> Cache server : Sending query for
> dsjfiodhsfiodjfiosdfdsq.domain.com (confirming 3MB/s incoming traffic)
> 2) Cache server -> Internet peers -> Authority server : Sending query
> for dsjfiodhsfiodjfiosdfdsq.domain.com (confirming 30MB/s outgoing
> traffic if unfiltered)
> 3) Cache server <- Internet peers <- Authority server : Receiving
> NXDOMAIN or timing out, which triggers further retries
> 4) Client <- ISP network <- Cache server : Reply timeout for client, or
> SERVFAIL if all authority servers for the delegation point are confirmed
> dead (which can take a lot of time or never happen), clogging the
> recursive client list
> 
> Eventually, the query flow becomes this, once a domain has been
> confirmed as a DDoS target and a client as an attacker :
> 1) Client -> ISP network -> Cache server : Sending query for
> dsjfiodhsfiodjfiosdfdsq.domain.com (confirming 3MB/s incoming traffic)
> 2) Client <- ISP network <- Cache server : Replying SERVFAIL (confirming
> 3MB/s outgoing traffic, which remains inside of our network and does not
> impact any of our network peers), and keeping the recursive client list
> clean
> 
> This is what I meant by "traffic staying within our AS": this avoids
> polluting our network peers' pipes with crud and incurring unnecessary
> transit costs.
> There is no fancy coordination with our BGP routing tables (yet ;)).
> 
> 
> * Conclusion :
> Basically, being able to look up the delegpt name is literally the
> cornerstone of the whole above algorithm, as it allows shoving hundreds
> of millions of queries into one counter and finally quantifying the
> damage in a way that allows for fair blocking.
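
To show why the delegation point matters, here is a toy sketch that
collapses random-label queries into a single counter. The set of known
delegation points is a hypothetical stand-in for the lookup the patch
provides through Unbound's caches; no Unbound API is used here.

```python
from collections import Counter

# Hypothetical stand-in for the delegation points known to the resolver.
KNOWN_DELEGATION_POINTS = {"com", "domain.com", "org"}


def delegation_point(qname):
    """Return the closest enclosing known delegation point for a name."""
    labels = qname.lower().rstrip(".").split(".")
    # Try the longest suffix first, then strip one label at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_DELEGATION_POINTS:
            return candidate
    return "."  # fall back to the root


per_domain = Counter()
for name in (
    "qdsjhfoishfsdofjdqs.domain.com",
    "dsjfiodhsfiodjfiosdfdsq.domain.com",
    "www.domain.com.",
):
    per_domain[delegation_point(name)] += 1
```

However many distinct random labels an attack generates, they all land
in the "domain.com" counter, which is what makes a per-domain threshold
meaningful.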
> 
> Cheers,

-- 
Stephane LAPIE, EPITA SRS, Promo 2005
"Even when they have digital readouts, I can't understand them."
--MegaTokyo