Ceilometer statistics and alarms: a walk-through

In order to understand Ceilometer alarm behaviour, it is important to understand 
the behaviour of the underlying statistics.

Hence this short walk-through.

We start by looking at the samples associated with a meter: instance, which 
counts the number of instances in existence.

Execute:

ceilometer sample-list --meter instance

Output:

Resource ID Name Type Volume Unit Timestamp
0ab52b7c instance gauge 1.0 instance 2014-07-11T03:33:58
17ea7a47 instance gauge 1.0 instance 2014-07-11T03:33:58
0ab52b7c instance gauge 1.0 instance 2014-07-11T03:23:57
17ea7a47 instance gauge 1.0 instance 2014-07-11T03:23:57
0ab52b7c instance gauge 1.0 instance 2014-07-11T03:13:59
17ea7a47 instance gauge 1.0 instance 2014-07-11T03:13:59
0ab52b7c instance gauge 1.0 instance 2014-07-11T03:03:58
17ea7a47 instance gauge 1.0 instance 2014-07-11T03:03:58
0ab52b7c instance gauge 1.0 instance 2014-07-11T02:14:05
17ea7a47 instance gauge 1.0 instance 2014-07-11T02:14:05
0ab52b7c instance gauge 1.0 instance 2014-07-11T02:04:05
17ea7a47 instance gauge 1.0 instance 2014-07-11T02:04:05
0ab52b7c instance gauge 1.0 instance 2014-07-11T01:54:05
17ea7a47 instance gauge 1.0 instance 2014-07-11T01:54:05
17ea7a47 instance gauge 1.0 instance 2014-07-11T01:44:05
17ea7a47 instance gauge 1.0 instance 2014-07-11T01:34:05

NB: If you are trying to follow along on the NeCTAR cloud, the above call will fail 
with the error “Error communicating with https://ceilometer.rc.nectar.org.au:8777/ 
[Errno 54] Connection reset by peer”, as the number of samples gathered is too 
large.

We can see that these samples have been gathered approximately every 10 minutes, 
that the samples are ordered latest to earliest, and that two resources are being 
counted at the end of the run, but that originally only one instance was found 
(the resource id column tips us off). Some twenty minutes after the first instance 
was counted, another instance was found. A closer look at the timestamps of the 
samples shows that initially the samples were taken at exact 10 minute intervals, 
but that slight variations in the sampling interval occurred later in the run.
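The sampling intervals can be checked directly. A short Python sketch (an 
illustration using the distinct timestamps from the output above, not a Ceilometer 
tool) shows the roughly 600 second cadence, the slight jitter later in the run, and 
one gap of nearly 50 minutes where no samples were recorded:

```python
from datetime import datetime

# Distinct timestamps from the sample-list output above, latest to
# earliest (the duplicate rows per resource are collapsed).
timestamps = [
    "2014-07-11T03:33:58",
    "2014-07-11T03:23:57",
    "2014-07-11T03:13:59",
    "2014-07-11T03:03:58",
    "2014-07-11T02:14:05",
    "2014-07-11T02:04:05",
    "2014-07-11T01:54:05",
    "2014-07-11T01:44:05",
    "2014-07-11T01:34:05",
]

times = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%S") for t in timestamps]
# Gap between consecutive polling cycles, in seconds.
gaps = [(a - b).total_seconds() for a, b in zip(times, times[1:])]
print(gaps)
# -> [601.0, 598.0, 601.0, 2993.0, 600.0, 600.0, 600.0, 600.0]
```

The 2993 second gap between 02:14:05 and 03:03:58 will show up again below, as 
missing rows in the periodic statistics.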


To get the ceilometer statistics across all of the samples ever recorded for the 
meter, we run the statistics command with no period specified:

ceilometer statistics --meter instance

Output:

Period Period Start Period End Max Min Avg Sum Count Duration Duration Start Duration End
0 2014-07-11T01:34:05 2014-07-11T03:33:58 1.0 1.0 1.0 16.0 16 7193.0 2014-07-11T01:34:05 2014-07-11T03:33:58

The column headers have the following meanings:

  • Period: The difference, in seconds, between the period start and end

  • Period Start: UTC date and time of the period start

  • Period End: UTC date and time of the period end

  • Duration: The difference, in seconds, between the oldest and newest timestamp

  • Duration Start: UTC date and time of the earliest timestamp, or the query start time

  • Duration End: UTC date and time of the latest timestamp, or the query end time
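Using the 16 samples listed at the start, the single statistics row above can be 
reproduced with a short Python sketch (an illustrative model of how the aggregates 
are derived, not Ceilometer's actual code):

```python
from datetime import datetime

# (volume, timestamp) pairs for the 16 samples listed at the start.
samples = [
    (1.0, "2014-07-11T03:33:58"), (1.0, "2014-07-11T03:33:58"),
    (1.0, "2014-07-11T03:23:57"), (1.0, "2014-07-11T03:23:57"),
    (1.0, "2014-07-11T03:13:59"), (1.0, "2014-07-11T03:13:59"),
    (1.0, "2014-07-11T03:03:58"), (1.0, "2014-07-11T03:03:58"),
    (1.0, "2014-07-11T02:14:05"), (1.0, "2014-07-11T02:14:05"),
    (1.0, "2014-07-11T02:04:05"), (1.0, "2014-07-11T02:04:05"),
    (1.0, "2014-07-11T01:54:05"), (1.0, "2014-07-11T01:54:05"),
    (1.0, "2014-07-11T01:44:05"),
    (1.0, "2014-07-11T01:34:05"),
]

volumes = [v for v, _ in samples]
times = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%S") for _, t in samples]

stats = {
    "max": max(volumes),
    "min": min(volumes),
    "avg": sum(volumes) / len(volumes),
    "sum": sum(volumes),
    "count": len(volumes),
    # Duration: seconds between the oldest and newest timestamp.
    "duration": (max(times) - min(times)).total_seconds(),
}
print(stats)
```

The duration of 7193.0 seconds matches the statistics output: the span between 
01:34:05 and 03:33:58.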

We recalculate the statistics with the period set to 5 seconds:

ceilometer statistics --meter instance --period 5

Output:

Period Period Start Period End Max Min Avg Sum Count Duration Duration Start Duration End
5 2014-07-11T01:34:05 2014-07-11T01:34:10 1.0 1.0 1.0 1.0 1 0.0 2014-07-11T01:34:05 2014-07-11T01:34:05
5 2014-07-11T01:44:05 2014-07-11T01:44:10 1.0 1.0 1.0 1.0 1 0.0 2014-07-11T01:44:05 2014-07-11T01:44:05
5 2014-07-11T01:54:05 2014-07-11T01:54:10 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T01:54:05 2014-07-11T01:54:05
5 2014-07-11T02:04:05 2014-07-11T02:04:10 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T02:04:05 2014-07-11T02:04:05
5 2014-07-11T02:14:05 2014-07-11T02:14:10 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T02:14:05 2014-07-11T02:14:05
5 2014-07-11T03:03:55 2014-07-11T03:04:00 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:03:58 2014-07-11T03:03:58
5 2014-07-11T03:13:55 2014-07-11T03:14:00 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:13:59 2014-07-11T03:13:59
5 2014-07-11T03:23:55 2014-07-11T03:24:00 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:23:57 2014-07-11T03:23:57
5 2014-07-11T03:33:55 2014-07-11T03:34:00 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:33:58 2014-07-11T03:33:58

We can see that the period start is measured from the time of the first sample 
within the period, and that the duration covers the time span between the first and 
the last sample within the period. Disconcertingly, we also see that the rows are 
time ordered in the opposite direction to the original list of samples. It is 
important to note that the resultant periods reported are not contiguous: they are 
discrete. NB: if you define an alarm that results in a discrete period, such as in 
the above example, the chances are very high that the alarm will remain in the 
“insufficient data” state.
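The bucketing behaviour can be sketched in a few lines of Python. This is only an 
illustrative model consistent with the output above, not Ceilometer's 
implementation: 5 second windows are counted from the first sample, and only 
windows that actually contain a sample are reported. Almost every window is empty, 
which is why an alarm over such a period mostly sees no data:

```python
from datetime import datetime

# Distinct sample timestamps from the walk-through.
sample_times = [
    "2014-07-11T01:34:05", "2014-07-11T01:44:05", "2014-07-11T01:54:05",
    "2014-07-11T02:04:05", "2014-07-11T02:14:05", "2014-07-11T03:03:58",
    "2014-07-11T03:13:59", "2014-07-11T03:23:57", "2014-07-11T03:33:58",
]
times = sorted(datetime.strptime(t, "%Y-%m-%dT%H:%M:%S") for t in sample_times)

period = 5  # seconds
origin = times[0]
# Index of the 5-second window each sample falls into, counted from
# the first sample; the set keeps only the non-empty windows.
buckets = sorted({int((t - origin).total_seconds()) // period for t in times})
total_windows = buckets[-1] + 1
print(len(buckets), "non-empty of", total_windows, "possible 5-second windows")
# -> 9 non-empty of 1439 possible 5-second windows
```

Note that origin + 1078 windows of 5 seconds lands on 03:03:55, which is exactly 
the period start reported for the 03:03:58 sample above.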

Next we recalculate the statistics with the period set to 700 seconds: 100 seconds 
more than the sampling interval.

Execute:

ceilometer statistics --meter instance --period 700

Output:

Period Period Start Period End Max Min Avg Sum Count Duration Duration Start Duration End
700 2014-07-11T01:34:05 2014-07-11T01:45:45 1.0 1.0 1.0 2.0 2 600.0 2014-07-11T01:34:05 2014-07-11T01:44:05
700 2014-07-11T01:45:45 2014-07-11T01:57:25 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T01:54:05 2014-07-11T01:54:05
700 2014-07-11T01:57:25 2014-07-11T02:09:05 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T02:04:05 2014-07-11T02:04:05
700 2014-07-11T02:09:05 2014-07-11T02:20:45 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T02:14:05 2014-07-11T02:14:05
700 2014-07-11T02:55:45 2014-07-11T03:07:25 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:03:58 2014-07-11T03:03:58
700 2014-07-11T03:07:25 2014-07-11T03:19:05 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:13:59 2014-07-11T03:13:59
700 2014-07-11T03:19:05 2014-07-11T03:30:45 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:23:57 2014-07-11T03:23:57
700 2014-07-11T03:30:45 2014-07-11T03:42:25 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T03:33:58 2014-07-11T03:33:58

This result shows us that the period starts from the time of the first sample 
received, and marches forward in 700 second increments. The reported periods are no 
longer discrete. The first period contains two sampling cycles, but from then on 
only one cycle falls into each period. Again we see that the duration covers the 
time span between the first and last sample within each period.
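The period boundaries in the table above can be reproduced by marching from the 
first sample's timestamp in 700 second steps (again, a sketch of the observed 
behaviour rather than Ceilometer's implementation):

```python
from datetime import datetime, timedelta

# The first sample's timestamp anchors the period boundaries.
origin = datetime.strptime("2014-07-11T01:34:05", "%Y-%m-%dT%H:%M:%S")
period = timedelta(seconds=700)

# The first few period start times, marching in 700-second increments.
starts = [origin + k * period for k in range(4)]
print([s.strftime("%H:%M:%S") for s in starts])
# -> ['01:34:05', '01:45:45', '01:57:25', '02:09:05']
```

These match the Period Start column of the first four rows exactly.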

Then if we set the period to 1200 seconds (double the sampling interval), we get 
the following:

ceilometer statistics --meter instance --period 1200

Output:

Period Period Start Period End Max Min Avg Sum Count Duration Duration Start Duration End
1200 2014-07-11T01:34:05 2014-07-11T01:54:05 1.0 1.0 1.0 2.0 2 600.0 2014-07-11T01:34:05 2014-07-11T01:44:05
1200 2014-07-11T01:54:05 2014-07-11T02:14:05 1.0 1.0 1.0 4.0 4 600.0 2014-07-11T01:54:05 2014-07-11T02:04:05
1200 2014-07-11T02:14:05 2014-07-11T02:34:05 1.0 1.0 1.0 2.0 2 0.0 2014-07-11T02:14:05 2014-07-11T02:14:05
1200 2014-07-11T02:54:05 2014-07-11T03:14:05 1.0 1.0 1.0 4.0 4 601.0 2014-07-11T03:03:58 2014-07-11T03:13:59
1200 2014-07-11T03:14:05 2014-07-11T03:34:05 1.0 1.0 1.0 4.0 4 601.0 2014-07-11T03:23:57 2014-07-11T03:33:58

We can see that the variability in the sample rate affects the calculated statistics.

Knowing the approximate sampling rate, and knowing how statistics are calculated, 
we can create an alarm that will fire if we accidentally start up more than two instances.

ceilometer -k alarm-threshold-create \
    --name warn_on_high_instance_count \
    --description 'squeal if too many instances' \
    --meter-name instance \
    --threshold 3 \
    --comparison-operator ge \
    --statistic count \
    --period 600 \
    --evaluation-periods 1 \
    --alarm-action 'http://130.56.250.199:8080/alarm/instances_TOO_MANY' \
    --insufficient-data-action 'http://130.56.250.199:8080/alarm/instances_nada_a' \
    --ok-action 'http://130.56.250.199:8080/alarm/instances_ok'
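The decision rule this alarm encodes can be sketched as follows. This is an 
illustrative model only (the evaluate helper is hypothetical, not Ceilometer's 
evaluator): with statistic count, comparison ge and threshold 3, the alarm fires 
when three or more instance samples land in the latest 600 second period.

```python
def evaluate(period_counts, threshold=3, evaluation_periods=1):
    """Return 'alarm', 'ok' or 'insufficient data' given the per-period
    sample counts, oldest first. Simplified model of a threshold alarm
    with comparison-operator ge and statistic count."""
    recent = period_counts[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "insufficient data"
    return "alarm" if all(c >= threshold for c in recent) else "ok"

# Two instances per 600-second period: the alarm stays ok.
print(evaluate([2, 2, 2]))   # -> ok
# A third instance appears in the latest period: the alarm fires.
print(evaluate([2, 2, 3]))   # -> alarm
# No periods with data at all: insufficient data.
print(evaluate([]))          # -> insufficient data
```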

Ceilometer Alarms: a worked example

In order to help understand the nature of Ceilometer alarms we have created two 
applications: Stressed! and the Alarm Counter.

The basic idea is that the Alarm Counter application acts as the webhook receiver 
for Ceilometer alarms: that is, Ceilometer alarms can be constructed that report to 
the Alarm Counter application by posting to a url when their state changes. The 
Stressed! application can be run on an instance that is then monitored by the 
Ceilometer alarms, allowing an alarm to fire whenever the instance is under stress. 
The whole serves (hopefully) as a laboratory for learning about Ceilometer.

NB: To run through this example you have to have the Ceilometer command line tools 
installed. Instructions on how to do this can be found here.

Both Stressed! and the Alarm Counter application have heat templates that will 
install and configure them on NeCTAR instances.

Stressed!

To fire up the Stressed! application:

  • Download the heat template named “launchStressed.yaml” from the Stressed! 
    github repository to your desktop.

  • Go to the Project -> Stacks (Orchestration) tab of the dashboard.

  • Select the “Launch Stack” button.

  • In the resultant Select Template dialogue, select “File” as the Template Source.

  • Browse to your desktop and select “launchStressed.yaml”.

  • Select the Next button to be taken to the “Launch Stack” form.

  • Provide the requested details and hit the “Launch” button.

All going well, Heat will launch the instance. Once this is done, you can go to 
either the Overview or the Topology tab, and you should see a link named “URL for 
stressed server”. You will have to wait a few minutes before this url works: the 
application is a Maven based Java one, and there is a fair bit of downloading and 
configuration going on in setting up the server.

Once the url becomes responsive it leads to a simple form:


The application is simply a front end to the “stress” utility. To stress the server, 
select the number of CPUs on it, then the duration for which you want to stress the 
server, and finally hit the “Stress this server!” button.

The default duration is 600 seconds, that is, 10 minutes: the default NeCTAR 
evaluation period at the time of writing.

The header of the page shows two figures:

  • Average CPU: The system load average for the last minute.

  • Current CPU: The recent CPU usage. 0.0 means that all CPUs were idle; 1.0 means 
    that all CPUs were running at 100%.

When you stress the server, the Current CPU figure should go up to around 1 for 
the duration that you have selected. The Average CPU figure should soon start to 
climb as well!

Once launched, make a note of the resource id for this server. A quick way to find 
it is to go to the “Resources” tab of the Stack: the resource id is the rather long 
and complex number at the intersection of the “Resource” column and the 
“StressMachine” row.


Alarm Counter

Once the Stressed application is installed, a similar path is followed for the 
Alarm Counter application.

  • Download the heat template named “launchAlarmCounter.yaml” from the Alarm 
    Counter github repository to your desktop.

  • Go to the Project -> Stacks (Orchestration) tab of the dashboard.

  • Select the “Launch Stack” button.

  • In the resultant Select Template dialogue, select “File” as the Template Source.

  • Browse to your desktop and select “launchAlarmCounter.yaml”.

  • Select the Next button to be taken to the “Launch Stack” form.

  • Provide the requested details and hit the “Launch” button.

All going well, Heat will launch the instance. Once this is done, you can go to 
either the Overview or the Topology tab, and you should see a link named “URL for 
Alarm server”. You will have to wait a few minutes before this url works: this 
application is also a Maven based Java one, and again there is a fair bit of 
configuration going on in setting up the server.

Once the url becomes responsive it leads to an application that you can use as a 
webhook for Ceilometer alarm calls.


The Ceilometer alarm

The front page of the Alarm Counter application has a sample Ceilometer command 
that can be used as a template for the Ceilometer alarm call that you are going 
to set up.


To use it, simply copy it from the browser, change the IP addresses to those of 
the server on which the Alarm Counter is running (if they are not already correct), 
and then alter the resource id to be that of the server running Stressed! (the 
resource id that you made a note of earlier). Then paste it into your command line. 
All being well, you should be met with a view of the alarm that you have just created.

This alarm monitors the CPU utilisation of the server running the Stressed! 
application. If the CPU utilisation goes above the 30% threshold for one evaluation 
period of 10 minutes (the default NeCTAR evaluation period at the time of writing), 
the alarm should trigger and post a message to the alarm-action url. Similarly, if 
there is insufficient data, the alarm will transition to the insufficient data 
state and a message will be posted to the insufficient-data-action url. When the 
alarm transitions from either the alarm state or the insufficient data state to 
the OK state, a message will be posted to the ok-action url.

Ceilometer will blindly perform a POST request to whatever urls are presented in 
the command, so the urls presented in this sample call are specific to the Alarm 
Counter application. Here the ‘/alarm’ portion of the path routes the post to the 
appropriate handler in the Alarm Counter application, and the following term in 
the path gives the name of the alarm that we are calling.
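That path convention can be sketched in Python (the route helper is hypothetical, 
illustrating the convention rather than the Alarm Counter's actual code):

```python
from urllib.parse import urlparse

def route(url):
    """Split an alarm-action url into the handler prefix and the alarm
    name, following the '/alarm/<name>' path convention described above."""
    parts = urlparse(url).path.strip("/").split("/")
    handler, alarm_name = parts[0], parts[1]
    return handler, alarm_name

print(route("http://130.56.250.199:8080/alarm/instances_TOO_MANY"))
# -> ('alarm', 'instances_TOO_MANY')
```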

Triggering the alarms

Once you have created the alarm, a visit to the “see the alarm count totals” 
link (‘/totals’) should soon show the “instance_NO_DATA” call registered.


Within about 10 minutes or so it should also show that a call to the ‘instance_OK’ 
url has been registered. The page is set to auto-refresh every 3 minutes (yuck), 
so if you want instant gratification you will have to manually refresh the page 
to see these changes. No ajaxy bling on this budget!

Once everything is up and running return to the front page of the Stressed! application, 
and hit the “Stress this server!” button. After about another 10 minutes or so, 
the alarm totals page should now show that a call to “instance_TOO_HOT” has been 
registered.

When the Stressed! server has finished its run of the stress application, a call 
to the ‘instance_OK’ url will be registered.

If you leave the applications running for a few hours you will see that an 
occasional call to the ‘instance_NO_DATA’ url is registered. This is because the 
alarm has its evaluation period set to match NeCTAR’s sampling rate, and 
occasionally the two may drift slightly out of sync, causing insufficient data 
alerts.

The alert payload

In the totals page of the Alarm Counter application the alarm names are links: if 
selected, they will take you to a page showing the time ordered history of that 
alarm, and the contents of the payload passed by Ceilometer when it calls the 
associated url.

Ceilometer encapsulates this payload as a JSON body in the webhook call. So the 
body would typically look something like the following:

{
    "current": "ok",
    "alarm_id": "4a4579b6-3c24-42f8-b0bc-45b12148b176",
    "reason": "Transition to ok due to 1 samples inside threshold, most recent: 15.5083333333",
    "reason_data": {
        "count": 1,
        "most_recent": 15.508333333333333,
        "type": "threshold",
        "disposition": "inside"
    },
    "previous": "alarm"
}

The Alarm Counter program simply unpacks this data from the JSON and stores it 
for display on the page showing the time ordered history of the calls to the alarms. 
Note that the date and time displayed does not form part of the JSON package: it 
is added by the Alarm Counter program.
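Unpacking such a payload is straightforward with Python's json module. A minimal 
sketch using the body shown above (illustrative only, not the Alarm Counter's code):

```python
import json

# The webhook body shown above, as delivered in the POST request.
body = '''{
    "current": "ok",
    "alarm_id": "4a4579b6-3c24-42f8-b0bc-45b12148b176",
    "reason": "Transition to ok due to 1 samples inside threshold, most recent: 15.5083333333",
    "reason_data": {
        "count": 1,
        "most_recent": 15.508333333333333,
        "type": "threshold",
        "disposition": "inside"
    },
    "previous": "alarm"
}'''

payload = json.loads(body)
# The state transition and the sample value that caused it.
print(payload["previous"], "->", payload["current"])
print(payload["reason_data"]["most_recent"])
```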

What to do if the alarms don’t fire

First, check to see if the resource that is being monitored has meters associated 
with it:

ceilometer meter-list -q 'resource_id=544f9431-2ec2-4b90-9d9f-9786855b0049'

where the resource_id is the resource id of the instance that you are trying to 
monitor (in this case, the server running the Stressed! application). If there 
are no meters, then the chances are good that the resource id is wrong.

A good way of checking the resource id is to issue the:

nova list

command line call, and check that the resource id appears in the resultant ID column.

If there are meters, confirm that they are gathering samples:

ceilometer sample-list --meter cpu_util -q 'resource_id=544f9431-2ec2-4b90-9d9f-9786855b0049'

Again, the resource id is that of the instance that you are trying to monitor, 
and the meter matches the one that the alarm is using.

The samples should appear at the default NeCTAR sample rate (every 600 seconds/10 
minutes at the time of writing), and should be up to date. If they aren’t up to 
date, or don’t appear, then again, the chances are good that the resource id is 
wrong.

If there are samples, confirm that the statistics are being gathered correctly:

ceilometer -k statistics --meter cpu_util -q 'resource_id=544f9431-2ec2-4b90-9d9f-9786855b0049' --period 600

Again, the resource id is the instance you are trying to monitor, the meter is 
the one that matches the one the alarm is using, and the period matches that of 
the alarm. Do the statistics show that the alarm should have fired? Does the 
period shown actually match the default NeCTAR sample rate?

Conclusion

Although Ceilometer alarms are tightly bound up with Heat’s autoscaling and 
failover capabilities, you can specify your own alerts that call out to webhooks 
that you provide. These webhooks could in turn translate the alerts into another 
form, such as email or SMS. Nor are the alarms bound to CPU utilisation: any 
Ceilometer meter can be alarmed on! Most importantly, these alarms come from 
within the OpenStack infrastructure, adding an extra layer to the monitoring 
options for your infrastructure running on the NeCTAR cloud.