Monitoring Unbound

Thu Nov 24 19:32:17 UTC 2016

My 2 cents:
* We try to structure our stats in an Errors->Saturation->Utilization way, which is consistent with the USE methodology[1] or Google's Four Golden Signals[2]. In the case of unbound that means:
  * Errors: Servfails, availability measured by blackbox tests
  * Saturation: Queue sizes, queue drops, blackbox test latency
  * Utilization: Number of queries w/ breakdowns by querytype/answercode, cache hit rate
* We also graph some basic internal stats, like memory usage, CPU usage, restart rate, etc.
* Breakdowns and drill downs are very useful to reduce MTTR.
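
As a rough illustration of the grouping above, here is a minimal Python sketch that parses text in the key=value format printed by `unbound-control stats` and sorts a few counters into errors, saturation and utilization. The sample numbers are invented, the choice of which counters stand for "queue drops" is my assumption, and the SERVFAIL counter is only present when extended statistics are enabled.

```python
# Sample text in the key=value format printed by `unbound-control stats`.
# The numbers are invented for illustration.
SAMPLE_STATS = """\
total.num.queries=1000
total.num.cachehits=900
total.num.cachemiss=100
total.requestlist.avg=2.5
total.requestlist.overwritten=3
total.requestlist.exceeded=1
num.answer.rcode.SERVFAIL=20
"""


def parse_stats(text):
    """Turn key=value lines into a dict of floats, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            continue
    return stats


def use_summary(stats):
    """Group a few counters into errors, saturation and utilization."""
    queries = stats.get("total.num.queries", 0.0)
    hits = stats.get("total.num.cachehits", 0.0)
    servfail = stats.get("num.answer.rcode.SERVFAIL", 0.0)
    # Assumption: overwritten + exceeded request-list entries stand in
    # for "queue drops" in the list above.
    dropped = (stats.get("total.requestlist.overwritten", 0.0)
               + stats.get("total.requestlist.exceeded", 0.0))
    return {
        "errors": {
            "servfail_ratio": servfail / queries if queries else 0.0,
        },
        "saturation": {
            "requestlist_avg": stats.get("total.requestlist.avg", 0.0),
            "requestlist_dropped": dropped,
        },
        "utilization": {
            "queries": queries,
            "cache_hit_ratio": hits / queries if queries else 0.0,
        },
    }


summary = use_summary(parse_stats(SAMPLE_STATS))
print(summary)
```

In practice these are cumulative counters, so you would graph their rate of change rather than the raw ratios shown here.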

[1] http://www.brendangregg.com/usemethod.html
[2] https://landing.google.com/sre/book.html
> On Nov 23, 2016, at 11:55 AM, John Todd via Unbound-users <unbound-users at unbound.net> wrote:
> 
> On 23 Nov 2016, at 0:49, Jaap Akkerhuis via Unbound-users wrote:
> 
>> Alexander via Unbound-users writes:
>> 
>>>     Hi everyone, can you help me monitor unbound dns with cacti?
>>> I tried to set up unbound and cacti, but the graphs are empty. I
>>> installed Dmitriy Demidov's package.
>> 
>> I once set up cacti to do this, but I wasn't really happy with it.
>> 
>>>     Can you tell me about other tools for monitoring dns queues? Any tips
>>> for monitoring DNS?
>> 
>> I really prefer using munin. See the user contributed directory.
> 
> [snip]
> 
> I know it’s not a direct answer to the first part of the original question, but perhaps it does answer the second part about monitoring queues.  We’ve recently created an exporter that feeds Unbound resolver statistics into Prometheus, which seems to work quite well. We then use Grafana to extract and visualize information from Prometheus. Building charts is quite easy once you get the hang of the query language, and it allows on-the-fly regeneration of data visualizations and complex comparisons/aggregations if you have multiple servers, locations, or services. Here is an example chart that took about 30 seconds to build.
> 
> There are also monitoring components for Prometheus and/or Grafana which can generate alerts based on metrics, going beyond visualization, but that is perhaps outside the scope of this thread. There are a number of tools for importing other system-level data into Prometheus, and it may be a good idea to investigate those components to complement or replace your existing monitoring systems if they do what you need. It is not trivial to learn: the query language is mostly unlike SQL, and there are quite a few ways to fail silently with what seem to be legitimate queries. But if you know the ground truth of one system, you can iterate on graphs until you figure out the right way to do it.
> 
> If there is interest, we can try to get the exporter we wrote into a condition where it could be provided in the contrib directory. It uses the “push gateway” method, which is not ideal but does work well enough. (Note: “Prometheus Unbound” is also a lyrical drama by Percy Bysshe Shelley, which makes keyword searching for prior work on this a bit difficult, so apologies if someone has already done this project.  :-)
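
The exporter itself is not shown in this thread, so the following is only a guess at its simplest piece: a Python sketch that renders Unbound stat keys as Prometheus text exposition lines. The mapping of dots to underscores and the "unbound_" prefix are assumptions inferred from the metric name unbound_num_query_type_A used in the queries; the counter values are invented.

```python
import re


def to_metric_name(stat_key, prefix="unbound_"):
    """Map an Unbound stat key to a Prometheus-style metric name.

    Assumption: every character that is not a letter, digit or underscore
    (notably the dots) becomes an underscore.
    """
    return prefix + re.sub(r"[^a-zA-Z0-9_]", "_", stat_key)


def to_exposition(stats):
    """Render a dict of stats as Prometheus text exposition lines."""
    lines = [f"{to_metric_name(key)} {value}"
             for key, value in sorted(stats.items())]
    return "\n".join(lines) + "\n"


# Invented counter values for two query types.
body = to_exposition({"num.query.type.A": 1200, "num.query.type.AAAA": 300})
print(body, end="")
```

A body like this is what a Pushgateway accepts over HTTP at a path of the form /metrics/job/<job>; the host and job name would be site-specific.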
> 
> 
> Prometheus overview:
> 
> To give an example of how a graph is built, this is the simplest query that I performed to get the component of the chart that generates the “A” QTYPE line. I just cut/pasted this into a number of other queries in the same graph to create the other lines, replacing “A” with “AAAA”, “MX”, etc.  This aggregates all of the Unbound servers I am running (I have many) with the “sum” operator, after applying the “irate” function, which computes the per-second rate of change from the two most recent samples within a 1-minute window.
> 
> sum(irate(unbound_num_query_type_A[1m]))
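
To make the arithmetic behind that query concrete, here is a small Python sketch of roughly what sum(irate(...[1m])) computes: for each server, the per-second rate between the last two samples in the window, then the sum across servers. It is an approximation for illustration, not Prometheus code; the server names and sample values are invented.

```python
def irate(samples, window_end, window=60.0):
    """Approximate PromQL irate(): per-second rate from the last two samples.

    samples is a list of (timestamp_seconds, counter_value) sorted by time.
    Returns None when fewer than two usable samples fall inside the window.
    """
    in_window = [s for s in samples
                 if window_end - window <= s[0] <= window_end]
    if len(in_window) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = in_window[-2], in_window[-1]
    if t_last == t_prev:
        return None
    # A counter that went down is treated as a reset (e.g. a restart),
    # so the increase is counted from zero.
    increase = v_last if v_last < v_prev else v_last - v_prev
    return increase / (t_last - t_prev)


def sum_irate(series_by_server, window_end, window=60.0):
    """Sum the per-server rates, skipping servers without enough samples."""
    rates = (irate(s, window_end, window) for s in series_by_server.values())
    return sum(r for r in rates if r is not None)


# Invented counters for two hypothetical servers, scraped every 15 seconds.
series = {
    "resolver-1": [(0, 100), (15, 160), (30, 250), (45, 400)],
    "resolver-2": [(0, 50), (15, 80), (30, 95), (45, 125)],
}
total = sum_irate(series, window_end=45)
print(total)
```

Because only the last two samples are used, the result reacts quickly to changes but is noisier than an average over the whole window.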
> 
> I then specified that this is a stacked chart, percentage-measured, with 60% as the lower bound.  I could command-click any of the labels shown and they would disappear from the graph, which would be re-drawn instantly without that statistic. Alternatively, I could click on just one of the labels and only that graph line would be shown, again re-drawn instantly.
> 
> A more complex query, limiting to systems that are tagged with “prod” (vs. “dev”) and limiting to specific POPs is shown below.
> The “env” and “loc” tags are made up by us, and their contents are set on the remote server before the metrics are collected.  This allows arbitrary tagging of each metric so that it is possible to filter (think of it as a modified “SELECT WHERE” statement).  The $POP variable (another arbitrary name chosen by us) is consumed by Grafana using a concept called “templates”, which puts a pull-down list of all of our POPs at the top of the graph page.  I can then select one OR MORE POPs and the system will automatically aggregate the data across all of the matching metrics and display it. I could put other filters in here that would be parsed at the moment the graph is drawn.
> 
> sum(irate(unbound_num_query_type_A{env="prod",loc=~"$POP"}[1m]))
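
The label filter in that query can be pictured with a short Python sketch: an exact match on "env" and a regular-expression match on "loc". The POP names and rates are invented, and the expansion of a multi-value selection into a (a|b) pattern is my assumption about how the template variable is substituted.

```python
import re


def expand_multi_value(selected):
    """Assumption: a multi-value template variable expands to (a|b|c)."""
    return "(" + "|".join(re.escape(value) for value in selected) + ")"


def matches(labels, env, loc_pattern):
    """Mimic {env="...", loc=~"..."}: exact match on env, regex on loc.

    PromQL regex matchers are fully anchored, hence re.fullmatch.
    """
    loc = labels.get("loc", "")
    return labels.get("env") == env and re.fullmatch(loc_pattern, loc) is not None


# Invented per-series rates with hypothetical POP names.
series = [
    ({"env": "prod", "loc": "ams"}, 10.0),
    ({"env": "prod", "loc": "sfo"}, 4.0),
    ({"env": "dev", "loc": "ams"}, 7.0),
    ({"env": "prod", "loc": "ams2"}, 1.0),
]

pattern = expand_multi_value(["ams", "sfo"])
total = sum(rate for labels, rate in series if matches(labels, "prod", pattern))
print(pattern, total)
```

Note that "ams2" is excluded because the match is anchored at both ends, which is one of the ways a plausible-looking filter can silently return less data than expected.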
> 
> In summary: Once you start putting your monitoring data into a TSDB or TSDB-ish system like Prometheus (or InfluxDB, or OpenTSDB) and creating visualizations with Grafana, you will wonder how you possibly survived without it.  Even just using the most basic features is a huge win over older systems, in my opinion, and moving up into the automation methods and alerting methods as you get more experience is another win. If you’re looking for a short intro to Prometheus, see the following presentation from Monitorama 2015 by Jamie Wilkinson.
> 
> Video: https://vimeo.com/131581353
> Slides: https://docs.google.com/presentation/d/1X1rKozAUuF2MVc1YXElFWq9wkcWv3Axdldl8LOH9Vik/edit#slide=id.ga150a40c0_0_193
> 
> If you’re looking for an introduction to Grafana, there are many - Google will be a better guide than I.
> 
> JT
> 
> <Screen Shot 2016-11-23 at 10.28.43 AM.png>
