
Blueflood uses Coda Hale Metrics for metrics collection and reporting.

For many of these metrics only the mean is explicitly referenced, but for timers we typically watch both the mean and the 95th percentile. The most basic items, which are the easiest to monitor and set thresholds for, are called out under Monitoring below.
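
Blueflood's meters, timers, and histograms come from the Coda Hale (now Dropwizard) Metrics library, so the attributes used in the graphs below (count, mean, p95, m1_rate, m5_rate) are the ones that library reports to Graphite: rates in events per second, durations in milliseconds. The following is a minimal sketch of that pattern, assuming Metrics 3.x with the Graphite reporter; the metric names, host, and prefix are illustrative rather than Blueflood's actual registrations.

    import com.codahale.metrics.Meter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    public class MetricsSketch {
        public static void main(String[] args) throws InterruptedException {
            MetricRegistry registry = new MetricRegistry();

            // A Timer reports mean, p95, m1_rate, m5_rate, etc. for each timed operation.
            Timer rollupTimer = registry.timer(
                    MetricRegistry.name("RollupService", "Rollup-Execution-Timer"));
            // A Meter reports count plus m1_rate/m5_rate/m15_rate in events per second.
            Meter written = registry.meter(
                    MetricRegistry.name("Instrumentation", "Full-Resolution-Metrics-Written"));

            // Ship everything to Graphite once a minute under a host-specific prefix.
            Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003)); // assumed host
            GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                    .prefixedWith("myhost.com.rackspacecloud.blueflood")    // illustrative prefix
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .build(graphite);
            reporter.start(1, TimeUnit.MINUTES);

            // Time one unit of work and mark the meter, as Blueflood does around rollups and writes.
            final Timer.Context ctx = rollupTimer.time();
            try {
                Thread.sleep(15);   // stand-in for the actual rollup work
            } finally {
                ctx.stop();
            }
            written.mark();
        }
    }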

Operational Dashboard

At Rackspace, our operational dashboard contains graphs for the following metrics:

  1. operations/min

    hitcount(*.com.rackspacecloud.blueflood.io.InstrumentedConnectionPoolMonitor.Connection-Borrowed.m5_rate, "1minute")

    Rough graph of operations per minute (see the arithmetic sketch after this list for how hitcount and m5_rate combine). Expect it to be fairly spiky, with short periods (<1 hour) of non-spikiness. Visualized as a stacked graph. Keep an eye on overall trend changes.

  2. rollups/min

    hitcount(*.com.rackspacecloud.blueflood.service.RollupService.Rollup-Execution-Timer.m5_rate, "1minute")

    Rollups per minute. Very spiky. The less spiky this graph is, the worse your rollup-calculation latency: if it remains relatively stable, you are likely near capacity on rollup calculation throughput.

  3. metrics ingested/min

    hitcount(*.com.rackspacecloud.blueflood.io.Instrumentation.Full-Resolution-Metrics-Written.m1_rate, "1minute")

    Full-resolution metrics ingested per minute. Depending on your ingestion patterns, this should stay relatively flat.

  4. average rollup execution time

    *.com.rackspacecloud.blueflood.service.RollupService.Rollup-Execution-Timer.mean

    Normally ~15 ms; short spikes to ~40 ms are normal and okay.

  5. average time taken to grab locators for a shard and schedule rollups for a given slot and granularity combination

    averageSeries(*.com.rackspacecloud.blueflood.service.RollupService.Locate-and-Schedule-Rollups-for-Slot.mean)

    How long it takes to roll up all locators for a shard+slot combination (i.e., 1/128th of the rollups at a given granularity for a given timestamp).

  6. average query time

    *-maas-prod-dcass*.com.rackspacecloud.blueflood.outputs.handlers.RollupHandler.Get-metrics-from-db.mean

    Average Cassandra query time for user queries; 20–40 ms is considered normal.

  7. queries/min

    sumSeries(hitcount(*.com.rackspacecloud.blueflood.outputs.handlers.RollupHandler.Get-metrics-from-db.m5_rate, "1minute"))

    User queries per minute. This is for a 5-minute window, but calculated every minute.

  8. db failures/min

    sumSeries(hitcount(*.com.rackspacecloud.blueflood.io.InstrumentedConnectionPoolMonitor.Operation-Result-Failure.m5_rate, "1minute"))

    Failed Cassandra operations. Should not go above 0. This is for a 5-minute window, but calculated every minute.

  9. total buffered metrics

    *.com.rackspacecloud.blueflood.inputs.handlers.*.Buffered-Metrics.count

    Number of metrics buffered for writes. How to interpret this depends heavily on your ingestion path.

  10. largest queue

    *.com.rackspacecloud.blueflood.concurrent.InstrumentedThreadPoolExecutor.*-work-queue-size

    Helps you find bottlenecks in the metric-processing pipeline. Whichever queue consistently has the largest work queue size is likely the culprit.

  11. Number of queued rollups

    *.com.rackspacecloud.blueflood.service.RollupService.Queued-Rollup-Count

    Values depend on cluster load, but this should drop to 0 during the majority of 5-minute periods. Otherwise, you may be near capacity for rollup throughput.

  12. rate of re-rollup/min

    *.com.rackspacecloud.blueflood.service.RollupService.Re-rolling-up-a-slot-because-of-new-data.m1_rate

    Indicates redoing a whole chunk of rollup work (1/128th of all rollups for a given granularity and timestamp) because of one or more pieces of late data. In other words, how many times we have re-rolled-up some piece of data in the last minute.

  13. approximate maximum delay for rollup calculation

    scale(*.com.rackspacecloud.blueflood.service.RollupService.Rollup-Wait-Histogram.p95,0.001)

    Rollup delay in seconds (95th percentile): how long after a rollup window ends before the rollup is calculated and written back to Cassandra. For example, a rollup of data from 00:00:00 through 00:04:59 that is written to the database at 00:05:59 would show as 60 on the Y axis of this graph.

  14. number of exhausted pools

    *.com.rackspacecloud.blueflood.io.Instrumentation.All-Pools-Exhausted.count

    Insufficient connections. Should never go above 0; if it does, you need more connections to Cassandra.
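
Most of the rate graphs above (items 1, 2, 3, 7, and 8) wrap a per-second rate in Graphite's hitcount() to approximate a per-minute count, and item 13 uses scale() to convert milliseconds to seconds. A short worked sketch of that arithmetic, with hypothetical values:

    public class GraphiteArithmetic {
        public static void main(String[] args) {
            // m5_rate is an exponentially weighted moving average in events per second.
            double m5RatePerSecond = 2.5;                         // hypothetical reported value
            // hitcount(series, "1minute") integrates the rate over each minute.
            double opsPerMinute = m5RatePerSecond * 60;           // ~= 150 operations/min

            // Timer/histogram durations are reported in milliseconds.
            double rollupWaitP95Millis = 60_000;                  // hypothetical Rollup-Wait-Histogram p95
            // scale(series, 0.001) converts milliseconds to seconds.
            double rollupWaitP95Seconds = rollupWaitP95Millis * 0.001;   // 60 s of rollup delay

            System.out.printf("ops/min ~= %.0f, rollup delay ~= %.0f s%n",
                    opsPerMinute, rollupWaitP95Seconds);
        }
    }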

Monitoring

What to monitor depends entirely on your usage patterns, the SLAs you are looking to meet, and the roles each node performs. However, the easiest to monitor and arguably strongest indicators of problems are listed below.

  • Insufficient Connections should never go above 0.
  • Failed Cassandra operations should not go above 0.
  • Rollup delay in seconds (95th percentile) should be monitored with thresholds based on the SLA you want to guarantee to users regarding rollup latencies.
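
If you prefer automated checks to eyeballing dashboards, the sketch below polls Graphite's render API (format=raw) for the three indicators above and prints an alert when a threshold is crossed. The Graphite host, the exact wrapped targets, and the 300-second delay threshold are assumptions; substitute your own endpoint and SLA.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BluefloodChecks {
        static final String GRAPHITE = "http://graphite.example.com";   // assumed host

        // Fetch a single Graphite series in raw format ("target,start,end,step|v1,v2,...")
        // and return the newest non-null datapoint.
        static double latest(String target) throws Exception {
            String url = GRAPHITE + "/render?format=raw&from=-5min&target=" + target;
            HttpResponse<String> resp = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
            String[] values = resp.body().trim().split("\\|")[1].split(",");
            for (int i = values.length - 1; i >= 0; i--) {
                if (!values[i].equals("None")) return Double.parseDouble(values[i]);
            }
            return 0.0;
        }

        public static void main(String[] args) throws Exception {
            double exhausted = latest(
                    "sumSeries(*.com.rackspacecloud.blueflood.io.Instrumentation.All-Pools-Exhausted.count)");
            double dbFailures = latest(
                    "sumSeries(*.com.rackspacecloud.blueflood.io.InstrumentedConnectionPoolMonitor.Operation-Result-Failure.m5_rate)");
            double rollupDelaySec = latest(
                    "scale(maxSeries(*.com.rackspacecloud.blueflood.service.RollupService.Rollup-Wait-Histogram.p95),0.001)");

            if (exhausted > 0)        System.out.println("ALERT: connection pools exhausted");
            if (dbFailures > 0)       System.out.println("ALERT: failed Cassandra operations");
            if (rollupDelaySec > 300) System.out.println("ALERT: rollup delay above assumed 300 s SLA");
        }
    }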