Statistics question on histogram vs. total_num_queries

Fri Jan 20 09:01:56 UTC 2017

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi John,

The numbers add up to the num.cachemiss.  For your statistics you need
to add the num.cachehits queries to the 0 bucket, or create another
bucket with a zero response time for those cachehits queries.  Then
your percentile band can be calculated.

It is also documented, over here:
http://unbound.net/documentation/unbound-control.html
(it says that the cachemiss queries are the queries that use recursive
processing, and the histogram keeps track of queries that needed
recursive processing).

You have a low cache hit ratio, it looks like.  Perhaps increase cache
memory size in unbound.conf.  Or it could be an artifact of artificial
query inputs.

If the percentile band you are looking for is the median, this is
already calculated for you in recursion.time.median (that number is
only for the queries that needed recursion, it does not include the
cachehits that were answered immediately).

Best regards, Wouter

On 20/01/17 00:13, John Todd via Unbound-users wrote:
> 
> We import all (“stats_noreset”) our Unbound 1.5.10 data into
> Prometheus (a TSDB) every 15 seconds.  We then generate many
> different reports out of the collected data. The current test
> system has very low QPS - 100-300.  I admit that the problem I
> describe below might be an issue with Prometheus or low QPS, but
> I’d like to ensure my assumptions about counters are correct before
> I start digging into a system which seems to work quite well for
> all my other data thus far.
> 
> I noticed something a bit off with one of the graphs, and I’m
> trying to understand what I’m doing wrong that is causing the
> problem. I have added up all of the incremental increasing results
> for the histograms for a particular interval (i.e.: subtracting
> last sample from current sample) and then compared that against
> what I thought should be the same (or extremely close) number which
> would be contained in the “total.num.queries” counter using the
> same method for the same time intervals.  They are very, very
> different numbers - an unexpected result.  (1)
> 
> Sample query below for histogram. Hopefully these should be 
> mostly-evident even to a non-Prometheus user - “irate” means “give
> the per-second rate over time” with “[1m]” giving a one minute
> maximum time backwards from a sample to get the last value prior.
> Timespans are not shown here for brevity, but the queries are done
> with the same time bounds (typically 1 minute buckets. I limit the
> queries to “prod”uction systems, and only for particular POPs as a
> templated variable.)
> 
> sum( irate(unbound_histogram_000000_000000_to_000000_000001 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000001_to_000000_000002 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000002_to_000000_000004 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000004_to_000000_000008 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000008_to_000000_000016 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000016_to_000000_000032 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000032_to_000000_000064 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000064_to_000000_000128 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000128_to_000000_000256 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000256_to_000000_000512 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_000512_to_000000_001024 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_001024_to_000000_002048 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_002048_to_000000_004096 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_004096_to_000000_008192 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_008192_to_000000_016384 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_016384_to_000000_032768 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_032768_to_000000_065536 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_065536_to_000000_131072 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_131072_to_000000_262144 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_262144_to_000000_524288 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000000_524288_to_000001_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000001_000000_to_000002_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000002_000000_to_000004_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000004_000000_to_000008_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000008_000000_to_000016_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000016_000000_to_000032_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000032_000000_to_000064_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000064_000000_to_000128_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000128_000000_to_000256_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000256_000000_to_000512_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_000512_000000_to_001024_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_001024_000000_to_002048_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_002048_000000_to_004096_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_004096_000000_to_008192_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_008192_000000_to_016384_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_016384_000000_to_032768_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_032768_000000_to_065536_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_065536_000000_to_131072_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_131072_000000_to_262144_000000 
> {env="prod",loc=~"$POP"}[1m]) + ignoring(label)
> irate(unbound_histogram_262144_000000_to_524288_000000 
> {env="prod",loc=~"$POP"}[1m]) )
> 
> This above query for an example 1 minute period provides a value of
> 159 total queries per second. I then compare that result to the
> following query using the exact same time bounds:
> 
> sum(irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m]))
> 
> The above query results in 241 queries per second for that same 1
> minute interval. I would expect the values to be within 5% of each
> other, but they are quite dissimilar. The mismatched ratio I see
> stays fairly level across QPS variation - around .7:1 though it
> does spike and dip between .5:1 and .8:1 if I graph the ratio over
> time.
> 
> Am I misunderstanding how the histogram.* and total.num.queries
> are counting queries and replies?  My initial assumption was that
> they would be almost identical for the same timeframe.
> 
> Do the histogram counters not include NXDOMAIN, nodata, SERVFAIL 
> results? That seems inherently wrong to not measure the reply time
> for those rcode results (those intervals are still visible to the
> end user!) though it was worth looking to see if that led to more
> sensible matching. Even subtracting those rcodes out of the larger
> number (unbound_total_num_queries) the aggregated histogram values
> still are 10-20% lower than that value though they are closer; I
> can’t tell if this is simply by chance, and is possibly the wrong
> path to a solution.
> 
> I could “look at the code” but that means farming it out to an
> actual developer, because I’m merely a pointy-haired
> boss/graphmonger who wants to build KPI-type statistics and alerts.
> I’m hoping there is a document or quick answer someone can point me
> towards which will clear up the discrepancy I am finding.
> 
> JT
> 
> 
> 
> (1) Background story: I’m trying to build a graph that is
> “percentile bands” of response time, and I needed a denominator for
> my calculations. I could have added up all of the histogram results
> to get the denominator for my percentage graph, but that made my
> query too big, so I just used “total.num.queries” but that led to a
> gap in the stacked graph, meaning down the rabbit hole I went and
> ended up here.
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJYgdH9AAoJEJ9vHC1+BF+NIYcP/jOJEogRI50U8t6PeIOHeDmv
MNul4udyNJdhg+VeMIhqw018V01IMvPjthSNB+aP71HzvN6DRnx6WopT/CV9pHZh
Dr+FVxUjKQn03vI28hlGMQzAIzSd7oQtyFXFK4C356qDsgmufdLEf7V6IorViLys
ECvig2JAWoMjOcV9cDm6O4AhMRsPIRTwQkE8n2j5vrVlTSa46vaEHdplqJdu8M5y
rJqct9cPv5AgJGRdHNPF+LAHLfKher2/nnnQ39omHKUXNxJxEDEj15hqmFGKj7UL
bOwOmGy1FevaPBAtlgxqjxXbrwLQ5G5hQhltEVpdjEa6rT2MFy+4LNdPNM8shvjn
XLMA71YXyYIvcsyP903KaAysRjvmNE2sTjkntejMGVDVQ7xHH1Q0FG0O8Jqiou7L
TTw2SDZ2t4t/O8GwjbdHJM5bAcU4zsh3uBM236LLgSx5iF4iDprMew/hMZpP0GM0
uq4o1860vd+j0TM+X7+qLqmXpkRmYsNtLsIXgypQ4llDrAGbJknlcDEH22gnH8hB
vcrUxuYPXGWN/BUBx6IqajEUkN6YQ5JoQeis9Gb1u7RJ+NcQmX/U3tLsex4Gb3Uv
O31HVGN3PsATOXxq0DwqrDQfKBNSeR8n+VkGKOdvejrJ0EjGYEmqDfUY5Kw713YR
qv9+6+qgOVuPhL5o58dq
=+4ze
-----END PGP SIGNATURE-----