Monitoring
Blueflood uses Coda Hale Metrics for metrics collection and reporting.
For many of these, only the mean is explicitly referenced, but for timers we typically watch both the mean and the 95th percentile. The most basic items, and the easiest to monitor and set thresholds for, are bolded.
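For context, here is a minimal sketch of where these metric names come from in Coda Hale (now Dropwizard) Metrics. The timer name, Graphite host, prefix, and one-minute reporting interval below are illustrative assumptions, not Blueflood's actual wiring; the point is that a registered `Timer` is what publishes the `m1_rate`/`m5_rate`, `mean`, and `p95` attributes referenced in the queries that follow.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import com.codahale.metrics.graphite.Graphite;
import com.codahale.metrics.graphite.GraphiteReporter;

public class MetricsSketch {
    public static void main(String[] args) throws Exception {
        MetricRegistry registry = new MetricRegistry();

        // A timer publishes rate attributes (m1_rate, m5_rate, ...) and latency
        // statistics (mean, p95, ...) under its registered name.
        Timer rollupTimer = registry.timer(
                MetricRegistry.name("RollupService", "Rollup-Execution-Timer"));

        // Ship everything to Graphite once a minute; host and prefix are assumptions.
        Graphite graphite = new Graphite(new InetSocketAddress("graphite.example.com", 2003));
        GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                .prefixedWith("blueflood-node-01")
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build(graphite);
        reporter.start(1, TimeUnit.MINUTES);

        // Time a unit of work; the mean and 95th percentile of these samples are
        // what the dashboard queries below reference.
        try (Timer.Context ctx = rollupTimer.time()) {
            Thread.sleep(15);   // stand-in for a rollup execution
        }
    }
}
```

Counters and histograms similarly publish `count` and percentile attributes under their registered names.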
At Rackspace, our operational dashboard contains graphs for the following metrics:
- operations/min
  `hitcount(*.com.rackspacecloud.blueflood.io.InstrumentedConnectionPoolMonitor.Connection-Borrowed.m5_rate, "1minute")`
  Rough graph of operations per minute. Expect it to be fairly spiky, with short periods (<1 hour) of non-spikiness. Visualized as a stacked graph. Keep an eye on overall trend changes.
- rollups/min
  `hitcount(*.com.rackspacecloud.blueflood.service.RollupService.Rollup-Execution-Timer.m5_rate, "1minute")`
  Rollups per minute. Very spiky. The less spiky this graph is, the worse your latency is on rollup calculation. If it remains relatively stable, that indicates you are near capacity on rollup calculation throughput.
- metrics ingested/min
  `hitcount(*.com.rackspacecloud.blueflood.io.Instrumentation.Full-Resolution-Metrics-Written.m1_rate, "1minute")`
  Full-resolution metrics ingested per minute. Depending on your ingestion patterns, this should stay relatively flat.
- average rollup execution time
  `*.com.rackspacecloud.blueflood.service.RollupService.Rollup-Execution-Timer.mean`
  Normally ~15 ms; short spikes to ~40 ms are normal and okay.
- average time taken to grab locators for a shard and schedule rollups for a given slot and granularity combination
  `averageSeries(*.com.rackspacecloud.blueflood.service.RollupService.Locate-and-Schedule-Rollups-for-Slot.mean)`
  How long it takes to roll up all locators for a shard+slot combination (i.e., 1/128th of the rollups at a given granularity for a given timestamp).
- average query time
  `*-maas-prod-dcass*.com.rackspacecloud.blueflood.outputs.handlers.RollupHandler.Get-metrics-from-db.mean`
  Average Cassandra query time for user queries. 20-40 ms is considered normal.
- queries/min
  `sumSeries(hitcount(*.com.rackspacecloud.blueflood.outputs.handlers.RollupHandler.Get-metrics-from-db.m5_rate, "1minute"))`
  User queries per minute. This is for a 5-minute window, but calculated every minute.
- **db failures/min**
  `sumSeries(hitcount(*.com.rackspacecloud.blueflood.io.InstrumentedConnectionPoolMonitor.Operation-Result-Failure.m5_rate, "1minute"))`
  Failed Cassandra operations. Should not go above 0. This is for a 5-minute window, but calculated every minute.
- total buffered metrics
  `*.com.rackspacecloud.blueflood.inputs.handlers.*.Buffered-Metrics.count`
  Number of metrics buffered for writes. How to interpret this depends heavily on your ingestion path.
- largest queue
  `*.com.rackspacecloud.blueflood.concurrent.InstrumentedThreadPoolExecutor.*-work-queue-size`
  Lets you find bottlenecks in the metric processing pipeline. Whichever queue consistently has the biggest work queue size is likely the culprit (see the sketch after this list).
- number of queued rollups
  `*.com.rackspacecloud.blueflood.service.RollupService.Queued-Rollup-Count`
  Values depend on cluster load. However, it should drop to 0 during the majority of 5-minute periods; otherwise, you may be near capacity for rollup throughput.
- rate of re-rollup/min
  `*.com.rackspacecloud.blueflood.service.RollupService.Re-rolling-up-a-slot-because-of-new-data.m1_rate`
  Indicates redoing a whole chunk of rollup work (1/128th of all rollups for a given granularity and timestamp) because of one or more pieces of late data. In other words, how many times some piece of data has been re-rolled-up in the last minute.
- **approximate maximum delay for rollup calculation**
  `scale(*.com.rackspacecloud.blueflood.service.RollupService.Rollup-Wait-Histogram.p95,0.001)`
  Rollup delay in seconds (95th percentile): how long after a rollup window ends before the rollup is calculated and written back to Cassandra. For example, a rollup for data from 00:00:00 through 00:04:59 that is written to the database at 00:05:59 would show as 60 on the Y axis of this graph.
- **number of exhausted pools**
  `*.com.rackspacecloud.blueflood.io.Instrumentation.All-Pools-Exhausted.count`
  Insufficient Connections. Should never go above 0; if it does, you need more connections to Cassandra.
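As a sketch of how you might chase down the "largest queue" item above outside a dashboard, the following pulls the work-queue-size series from graphite-web's render API (`format=csv`) and reports the executor with the highest recent value. The Graphite host and the 10-minute window are assumptions for illustration; only the `target` expression comes from the list above.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class LargestQueueCheck {
    public static void main(String[] args) throws Exception {
        // The target mirrors the "largest queue" query above; host and window are assumptions.
        String target = "*.com.rackspacecloud.blueflood.concurrent."
                + "InstrumentedThreadPoolExecutor.*-work-queue-size";
        String url = "http://graphite.example.com/render?format=csv&from=-10min&target="
                + URLEncoder.encode(target, StandardCharsets.UTF_8);

        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // CSV rows look like: <series name>,<timestamp>,<value>, with an empty value
        // for missing datapoints. Track the latest value seen per series.
        Map<String, Double> latest = new HashMap<>();
        for (String line : resp.body().split("\n")) {
            String[] parts = line.split(",");
            if (parts.length == 3 && !parts[2].isBlank()) {
                try {
                    latest.put(parts[0], Double.parseDouble(parts[2].trim()));
                } catch (NumberFormatException ignored) {
                    // skip anything that is not a number
                }
            }
        }

        // The series with the biggest recent work queue is the likely bottleneck.
        latest.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .ifPresent(e -> System.out.println(
                        "largest work queue: " + e.getKey() + " = " + e.getValue()));
    }
}
```

The same pattern works for any of the targets above; only the `target` expression changes.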
Which of these you should alert on depends entirely on your usage patterns, the SLAs you want to conform to, and the roles each node is responsible for performing. However, the easiest to monitor, and arguably the strongest indicators of problems, are listed below.
- Insufficient Connections should never go above 0.
- Failed Cassandra operations should not go above 0.
- Rollup delay in seconds (95th percentile) should be monitored with thresholds based on the SLA you want to guarantee to users regarding rollup latencies.
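As a sketch of how these three indicators could feed an alert check, the following queries each of them via the render API and prints an alert when the most recent value exceeds a threshold. The Graphite endpoint, the 10-minute rollup-delay SLA, and the use of `sumSeries`/`maxSeries` to collapse per-node series are illustrative assumptions; wire the same thresholds into whatever alerting system you already run.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class BluefloodAlertCheck {

    // Hypothetical Graphite endpoint and rollup-delay SLA (10 minutes); adjust for your environment.
    private static final String RENDER_URL = "http://graphite.example.com/render";
    private static final double MAX_ROLLUP_DELAY_SECONDS = 600.0;

    public static void main(String[] args) throws Exception {
        alertIfAbove("sumSeries(*.com.rackspacecloud.blueflood.io.Instrumentation.All-Pools-Exhausted.count)",
                0.0, "exhausted connection pools");
        alertIfAbove("sumSeries(*.com.rackspacecloud.blueflood.io.InstrumentedConnectionPoolMonitor.Operation-Result-Failure.m5_rate)",
                0.0, "failed Cassandra operations (5m rate)");
        alertIfAbove("scale(maxSeries(*.com.rackspacecloud.blueflood.service.RollupService.Rollup-Wait-Histogram.p95),0.001)",
                MAX_ROLLUP_DELAY_SECONDS, "rollup delay p95 in seconds");
    }

    // Fetch the target for the last few minutes and alert if the latest value exceeds the threshold.
    private static void alertIfAbove(String target, double threshold, String label) throws Exception {
        String url = RENDER_URL + "?format=csv&from=-5min&target="
                + URLEncoder.encode(target, StandardCharsets.UTF_8);
        HttpResponse<String> resp = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        double last = Double.NaN;
        for (String line : resp.body().split("\n")) {
            String[] parts = line.split(",");           // <series>,<timestamp>,<value>
            if (parts.length == 3 && !parts[2].isBlank()) {
                try {
                    last = Double.parseDouble(parts[2].trim());
                } catch (NumberFormatException ignored) {
                    // empty/non-numeric cells mean a missing datapoint
                }
            }
        }
        if (last > threshold) {
            System.out.printf("ALERT: %s = %.2f (threshold %.2f)%n", label, last, threshold);
        }
    }
}
```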