Maintained by: NLnet Labs

Statistics question on histogram vs. total_num_queries

John Todd
Fri Jan 20 00:13:43 CET 2017


We import all (“stats_noreset”) our Unbound 1.5.10 data into 
Prometheus (a TSDB) every 15 seconds.  We then generate many different 
reports out of the collected data. The current test system has very low 
QPS - 100-300.  I admit that the problem I describe below might be an 
issue with Prometheus or low QPS, but I’d like to ensure my 
assumptions about counters are correct before I start digging into a 
system which seems to work quite well for all my other data thus far.

I noticed something a bit off with one of the graphs, and I’m trying 
to understand what I’m doing wrong that is causing the problem. I have 
added up all of the incremental increasing results for the histograms 
for a particular interval (i.e.: subtracting last sample from current 
sample) and then compared that against what I thought should be the same 
(or extremely close) number which would be contained in the 
“total.num.queries” counter using the same method for the same time 
intervals.  They are very, very different numbers - an unexpected 
result.  (1)

Sample query below for histogram. Hopefully these should be 
mostly-evident even to a non-Prometheus user - “irate” means “give 
the per-second rate over time” with “[1m]” giving a one minute 
maximum time backwards from a sample to get the last value prior.  
Timespans are not shown here for brevity, but the queries are done with 
the same time bounds (typically 1 minute buckets. I limit the queries to 
“prod”uction systems, and only for particular POPs as a templated 
variable.)

   sum(
                   irate(unbound_histogram_000000_000000_to_000000_000001 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000001_to_000000_000002 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000002_to_000000_000004 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000004_to_000000_000008 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000008_to_000000_000016 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000016_to_000000_000032 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000032_to_000000_000064 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000064_to_000000_000128 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000128_to_000000_000256 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000256_to_000000_000512 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_000512_to_000000_001024 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_001024_to_000000_002048 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_002048_to_000000_004096 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_004096_to_000000_008192 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_008192_to_000000_016384 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_016384_to_000000_032768 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_032768_to_000000_065536 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_065536_to_000000_131072 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_131072_to_000000_262144 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_262144_to_000000_524288 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000000_524288_to_000001_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000001_000000_to_000002_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000002_000000_to_000004_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000004_000000_to_000008_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000008_000000_to_000016_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000016_000000_to_000032_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000032_000000_to_000064_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000064_000000_to_000128_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000128_000000_to_000256_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000256_000000_to_000512_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_000512_000000_to_001024_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_001024_000000_to_002048_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_002048_000000_to_004096_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_004096_000000_to_008192_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_008192_000000_to_016384_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_016384_000000_to_032768_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_032768_000000_to_065536_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_065536_000000_to_131072_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_131072_000000_to_262144_000000 
{env="prod",loc=~"$POP"}[1m]) +
   ignoring(label) 
irate(unbound_histogram_262144_000000_to_524288_000000 
{env="prod",loc=~"$POP"}[1m])
   )

This above query for an example 1 minute period provides a value of 159 
total queries per second. I then compare that result to the following 
query using the exact same time bounds:

  sum(irate(unbound_total_num_queries {env="prod",loc=~"$POP"}[1m]))

The above query results in 241 queries per second for that same 1 minute 
interval. I would expect the values to be within 5% of each other, but 
they are quite dissimilar. The mismatched ratio I see stays fairly level 
across QPS variation - around .7:1 though it does spike and dip between 
.5:1 and .8:1 if I graph the ratio over time.

Am I misunderstanding how the histogram.* and total.num.queries are 
counting queries and replies?  My initial assumption was that they would 
be almost identical for the same timeframe.

Do the histogram counters not include NXDOMAIN, nodata, SERVFAIL 
results? That seems inherently wrong to not measure the reply time for 
those rcode results (those intervals are still visible to the end user!) 
though it was worth looking to see if that led to more sensible 
matching. Even subtracting those rcodes out of the larger number 
(unbound_total_num_queries) the aggregated histogram values still are  
10-20% lower than that value though they are closer; I can’t tell if 
this is simply by chance, and is possibly the wrong path to a solution.

I could “look at the code” but that means farming it out to an 
actual developer, because I’m merely a pointy-haired boss/graphmonger 
who wants to build KPI-type statistics and alerts. I’m hoping there is 
a document or quick answer someone can point me towards which will clear 
up the discrepancy I am finding.

JT



(1) Background story: I’m trying to build a graph that is 
“percentile bands” of response time, and I needed a denominator for 
my calculations. I could have added up all of the histogram results to 
get the denominator for my percentage graph, but that made my query too 
big, so I just used “total.num.queries” but that led to a gap in the 
stacked graph, meaning down the rabbit hole I went and ended up here.