A Proposal for Cloud State Notifications

By George Reese
April 2, 2011 | Comments: 6

About a year ago, I published a blog entry explaining why cloud providers should create event-driven APIs to reduce the need for monitoring, management, and automation tools to poll clouds for state information. To date, a grand total of 0 clouds have done anything in this direction, in spite of the fact that such a system would greatly reduce the demands placed on their infrastructure by third-party tools.

I have been talking to a number of cloud ecosystem players, and there seems to be great interest in supporting such a system. I decided to throw a proposal out there to see if it sticks.

Purpose

The purpose of this system is to enable interested parties like monitoring, management, and automation tools to receive notifications of any changes to the states of virtual resources provided in the cloud. By pushing a change out to interested parties (subscribers) in a near-instantaneous manner as those changes happen, the cloud provider (the publisher) eliminates the need for polling.

Subscription

A subscriber must notify the publisher of its desire to receive state change notifications. The API of the underlying cloud provider should govern this process. Ideally, the process is automated and leverages the normal authentication process of the cloud provider's existing API. This proposal requires only that the subscription enable the subscriber to specify an endpoint to which any state changes are pushed.

A subscriber should be able to establish any number of endpoints. Ideally, the subscriber should be able either to limit its interest to changes in specific resources or to subscribe to changes in all resources.
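To make the subscription model concrete, here is a minimal sketch of how a publisher might track endpoints and optional resource filters. The proposal leaves the actual subscription API to each cloud provider, so every name here (Subscription, SubscriptionRegistry, subscribe, endpoints_for) is hypothetical, as is the choice to key subscriptions by account.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    """One subscriber endpoint plus an optional resource filter (hypothetical model)."""
    endpoint: str
    # An empty set means "all resources"; otherwise only these resource IDs.
    resource_ids: frozenset = frozenset()

    def wants(self, resource_id: str) -> bool:
        return not self.resource_ids or resource_id in self.resource_ids


class SubscriptionRegistry:
    """In-memory registry; a real provider would persist this behind its API."""

    def __init__(self):
        self._subs = {}  # account -> list of Subscription

    def subscribe(self, account, endpoint, resource_ids=()):
        # A subscriber may call this any number of times to add endpoints.
        sub = Subscription(endpoint, frozenset(resource_ids))
        self._subs.setdefault(account, []).append(sub)
        return sub

    def endpoints_for(self, account, resource_id):
        # Endpoints that should hear about a change to this resource.
        return [s.endpoint for s in self._subs.get(account, [])
                if s.wants(resource_id)]
```

With one unfiltered subscription and one limited to resource "1234", a change to "1234" goes to both endpoints, while a change to any other resource goes only to the unfiltered one.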

Publishing

When an important event changes the state of a resource, the publisher publishes the event to all subscribers as an HTTP POST to each subscriber endpoint with a JSON payload. The URL of the POST takes the form:

http://endpoint/provider_id

Update based on Shlomo's comment

http://endpoint/provider_endpoint

A provider endpoint is the region/API endpoint for the cloud or region. For example:

http://cloud.enstratus.com/com.gogrid

Update based on Shlomo's comment

http://cloud.enstratus.com/ec2.us-east-1.amazonaws.com
http://cloud.enstratus.com/api.gogrid.com

The payload must minimally contain the type of the resource and its ID:

{ "type" : "vm", "id" : "1234" }

Update based on Shlomo's comment

{ "type" : "vm", "account" : "abc", "id" : "1234" }

Two optional attributes are "agent" (a cloud-specific string that identifies the user, if any, who caused the state change) and "change" (a cloud-specific JSON object containing any extra details about the change).
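The publishing side can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names build_notification and post_notification are hypothetical, the URL uses the updated endpoint/provider_endpoint form, and the Content-Type header and five-second timeout are my choices rather than anything the proposal specifies.

```python
import json
import urllib.request


def build_notification(endpoint, provider_endpoint, resource_type,
                       account, resource_id, agent=None, change=None):
    """Return (url, json_body) for one state change notification."""
    # URL form from the proposal: http://endpoint/provider_endpoint
    url = endpoint.rstrip("/") + "/" + provider_endpoint
    # Required attributes, per the updated payload.
    payload = {"type": resource_type, "account": account, "id": resource_id}
    # Optional, cloud-specific attributes are included only when present.
    if agent is not None:
        payload["agent"] = agent
    if change is not None:
        payload["change"] = change
    return url, json.dumps(payload)


def post_notification(url, body, timeout=5):
    """POST the JSON body to the subscriber endpoint (assumed header and timeout)."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

For example, build_notification("http://cloud.enstratus.com", "api.gogrid.com", "vm", "abc", "1234") produces the URL http://cloud.enstratus.com/api.gogrid.com and a body carrying only the type, account, and id.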

For tools to be able to rely on this system, the publisher should be pushing changes in a near-instantaneous manner.

Authentication and Trust

There is no authentication, encryption, or trust under this system. While the target has been validated through the subscription process, the publication process is vulnerable to interception via man-in-the-middle attacks or simple network sniffing. Consequently, the publisher should never include sensitive information in the payload. The resource type and resource ID are the recommended data.

Similarly, the subscriber should never assume that the information it receives actually comes from the publisher or that it is accurate. It is therefore expected that a subscriber will use normal API channels to pull the true current state.
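On the subscriber side, this amounts to treating each notification as a hint about what to look up, never as the state itself. A minimal sketch follows, assuming a hypothetical fetch_state callable that wraps the cloud provider's normal authenticated API; the function name handle_notification is also mine.

```python
import json


def handle_notification(body, fetch_state):
    """Use an untrusted notification only to decide which resource to re-query.

    fetch_state(resource_type, account, resource_id) is a hypothetical
    callable that asks the provider's real API for the current state.
    Returns whatever fetch_state returns, or None if the body is malformed.
    """
    try:
        msg = json.loads(body)
    except (ValueError, TypeError):
        return None  # not JSON at all; ignore it
    if not isinstance(msg, dict):
        return None
    rtype, rid = msg.get("type"), msg.get("id")
    if not isinstance(rtype, str) or not isinstance(rid, str):
        return None  # missing the minimally required attributes
    # Deliberately ignore "agent" and "change": the claimed details are
    # unauthenticated, so only the authoritative API answer is returned.
    return fetch_state(rtype, msg.get("account"), rid)
```

Even if a forged notification claims a resource has been terminated, the handler returns only what the provider's API reports, so the worst a spoofed message can do is trigger an extra lookup.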



6 Comments

Using the reverse domain name of the provider as the provider ID is not specific enough. A single cloud provider may allow duplicate resource IDs within different scopes: for example, Security Groups in AWS region us-east-1 and region eu-west-1 can be identically named but configured completely differently.

The URL endpoint of the cloud provider would be unambiguous.

The most pressing need for this in our environment is to notify monitoring tools that instances have gone away permanently. In that case, a poll following the notification will fail to find the resource because it has gone away, but the tool needs to be sure that this is the case and that it isn't a transient failure of some kind. So I think there needs to be a higher level of trust in the notification message. Within AWS I would base this on an SNS topic. I'm not sure what equivalent mechanisms exist in other clouds.

There are three sources for notifications in our current (Netflix internal) tools. The first is a simple removal of an instance by our tools. The second is an Amazon Autoscaler shrinking its group of instances (which we have no visibility into). The third is those nice emails from AWS telling us that certain instances are dead or about to die. To fix case 2, we would need an Autoscaler option to send an SNS message. I'm not sure if that is currently possible, and if not, I will ask the AWS folks about it.

When SNS came out last year, I was hopeful that it was the first step in an AWS attempt to solve this problem. Unfortunately, they haven't taken this step.

And to be honest, I don't actually expect AWS to follow any externally developed proposal, regardless of the merits. I expect some day that they will create something proprietary based on SNS (by the way, I love SNS).

As far as the deletion case goes, what if the change object was required and minimally looked like this:

"change" : { "state" : "terminated" }

You would then expect to get a non-response when verifying with AWS.

Thoughts?

George,
What's wrong with a cloud provider using an SNMP solution? The subscriber can use polling to determine current state, or rates of change in resources. Also, the cloud service can send trap notifications based on changes in configuration state.

And the great part is, the subscriber already has a large toolchest of management agents that he can just point at his cloud provider, if they offered such a service.

+1 on huge importance of finding out about state changes in the cloud without doing a diff between 2 dumps of current state.

The problem consists of 2 parts:

1. provider to be able to extract deltas (state changes)
2. provider and customer to agree how such deltas are to be delivered

My point is that your post focuses on #2, which is undoubtedly important, but it's a moot point without providers doing #1. Something as simple as an API call that returns a list of the last N state changes would be at least something, even though, yes, it would still be polling, with all the accompanying wasted resources.

Without providers stepping up their game on #1, we the ecosystem can't solve it all by ourselves.

How about XMPP? Pubsubhubbub?
