Adrian Cole of jclouds began a discussion on cloud abstraction layers with a number of individuals responding to his blog post. Included in those responses were a couple of posts from RightScale's Thorsten von Eicken. While I disagree with him significantly on the idea of a cloud abstraction layer, he brought up another topic at the end of his second post that I think deserves more discussion: the need for event-driven cloud APIs.
Thorsten points out that he has a number of servers dedicated to doing nothing other than polling EC2 for changes. My company, enStratus, has the same exact issue. Polling is a horrible solution to the problem, but necessary given the design of every cloud API in active use. In short, if we want to know when a server is terminated outside our systems, we must continually query the state of that server from the cloud provider.
While I can't speak to how RightScale handles things, I assume it's probably similar to what enStratus does. We ask the cloud provider if anything has changed with any resource in which we are interested. enStratus caches changes locally so that users are never talking directly to the cloud provider. The problem, however, is that we need to query very regularly (how regularly depends on the kind of resource and expected service level), but actual changes happen very rarely. The polling approach thus results in an incredibly inefficient use of CPU power both at enStratus and the cloud provider as well as wasted bandwidth on both ends.
We certainly do a number of optimizations to make sure we are polling as infrequently as we can get away with (we poll more often when there is a pending request for a change, less often for servers that are not production servers, etc.). The bottom line remains, however, that most of our calls are wasteful.
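As a rough illustration of that kind of optimization (not enStratus's actual logic), a poller might pick its interval per server along these lines. The function name, the 60-second base interval, and both adjustment factors are assumptions made up for this sketch.

```python
def poll_interval_seconds(production, change_pending, base=60):
    """Pick how long to wait before polling a server again.

    All numbers here are illustrative assumptions, not measured values.
    """
    interval = base
    if not production:
        # Non-production servers can tolerate staler state.
        interval *= 5
    if change_pending:
        # We asked for a change ourselves, so poll sooner to confirm it.
        interval = min(interval, base // 4)
    return interval
```

Even with tuning like this, every poll that finds no change is still a wasted call.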
The real answer is an event-driven API through which the cloud provider notifies us of changes in resources we care about. With an event-driven API, we would stop polling the cloud provider. When a resource like a server has a state change, the cloud provider makes an API call to our service and tells us about the state change. Unfortunately, an event-driven API faces a number of challenges:
- An event-driven API demands a level of standardization that just doesn't exist in the cloud world today. You can't have every consumer designing its own callback API, and even supporting a cross-cloud system with every cloud provider defining its own callback protocol is problematic.
- You can't provide data via the callbacks because providing data requires reverse authentication and complicates the entire process.
- In the end, the consumer can't fully trust the callback API. It still needs to make some calls on its own to verify the cloud provider is actually working properly.
Here's an outline of how it might work:
- The consumer (for example, enStratus) would notify the cloud provider that we are interested in any state changes in a given account. This notification should include standard cloud provider authentication and specify a callback URL.
- For a finite period of time, the cloud provider notifies the consumer whenever there is a change in state in a specific resource or a new resource is added to a class of resources. This call is not authenticated and not trusted.
- The consumer then calls the cloud provider back via the normal authenticated API to verify the state with the cloud provider.
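The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not any provider's real API: `FakeProvider`, `Consumer`, `subscribe`, `describe_server`, `on_callback`, and the `"secret"` credential are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeProvider:
    """Hypothetical stand-in for a cloud provider's authenticated API."""
    servers: dict = field(default_factory=dict)        # server id -> state
    subscriptions: list = field(default_factory=list)  # (account, callback URL)

    def _check(self, credentials):
        if credentials != "secret":
            raise PermissionError("bad credentials")

    def subscribe(self, account, credentials, callback_url):
        # Step 1: authenticated request registering interest in changes.
        self._check(credentials)
        self.subscriptions.append((account, callback_url))

    def describe_server(self, credentials, server_id):
        # Normal authenticated call; returns None if the server is gone.
        self._check(credentials)
        return self.servers.get(server_id)


@dataclass
class Consumer:
    provider: FakeProvider
    credentials: str
    cache: dict = field(default_factory=dict)  # locally cached server state

    def on_callback(self, asset_class, asset_id):
        # Step 2: the callback is unauthenticated and carries no data,
        # so it is only a hint that something may have changed.
        if asset_class != "server":
            return
        # Step 3: verify the real state through the authenticated API.
        state = self.provider.describe_server(self.credentials, asset_id)
        if state is None:
            self.cache.pop(asset_id, None)
        else:
            self.cache[asset_id] = state
```

Because the consumer always re-reads state through the authenticated API, a forged callback costs it one extra API call but cannot inject false data into the cache.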
Let's take the scenario in which you need to know about a lost server within a minute of it disappearing. Under the request/response API model, you have to poll the cloud provider at least once a minute. A good cloud provider will offer a single API call that returns full details for every server. If you are dealing with a solid API call, that means you have one API call per minute with (n*b) + o bytes being transferred (where n is the number of servers, b is the amount of data per server, and o is the data overhead of an API call).
One of the pet peeves Thorsten lists in his blog is an API that doesn't describe servers in full detail: for example, an API that provides basic information in the "listServers" call but then requires you to call "getServerDetail" or some other service to get the full server details. In examining the issues with the request/response model, it becomes painfully clear why this is a pet peeve for Thorsten and me. Validating server state goes from 1 API call per minute with (n*b) + o bytes transferred to (1 + (n*a)) API calls with (n*b) + ((1 + (n*a)) * o) bytes transferred (where a is the number of additional API calls per server required to retrieve full server state).
If, for example, you have an API that requires you to call listServers and then call getServerDetail on each server for 100 servers, you must make 101 API calls each minute. If you also have to make a getServerIps call for each server, that number jumps to 201 calls each minute. There's a huge difference in scalability between an API that provides data in a single call and one that requires multiple calls.
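The arithmetic can be checked with a small helper. This is only a sketch of the formulas above; the function name and the sample values for b (2,000 bytes per server) and o (500 bytes per call) are assumptions chosen for illustration.

```python
def polling_cost(n, b, o, a=0):
    """Return (api_calls, bytes_transferred) for one polling pass.

    n: number of servers
    b: bytes of data per server
    o: overhead bytes per API call
    a: extra per-server calls needed to get full server state
       (0 when the list call already returns full details)
    """
    calls = 1 + n * a
    bytes_transferred = n * b + calls * o
    return calls, bytes_transferred


# 100 servers, assumed b = 2000 and o = 500
print(polling_cost(100, 2000, 500))       # (1, 200500)
print(polling_cost(100, 2000, 500, a=1))  # (101, 250500)
print(polling_cost(100, 2000, 500, a=2))  # (201, 300500)
```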
Let's contrast this model with an event-driven model. If a typical server undergoes 1 change per day and you are running 100 servers, you should see 100 callbacks per day. That means 100 calls to the listServers API per day versus 1,440 under once-a-minute polling, and your client application would see the changes right away instead of experiencing a delay of up to a minute. How often your servers change depends on how you are using the cloud, but long-running and moderately long-running servers don't experience even 1 change per day.
The math uncovers a slight weakness in the event-driven model compared with the request/response model. If your infrastructure averages more than one change per minute (or whatever period you care about), you are better off polling and capturing the results of multiple changes in a single poll call. Based on my experience, that scenario is extremely rare (to non-existent) with the asset class that changes the most, the server. It's completely non-existent for other cloud asset types.
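A quick sketch of that comparison, assuming one verification call per callback and a fixed polling interval; the function and parameter names are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def daily_list_calls(servers, changes_per_server_per_day, poll_interval_minutes=1):
    """Return (polling_calls, event_driven_calls) per day.

    Assumes each callback triggers exactly one verification call.
    """
    polling = MINUTES_PER_DAY // poll_interval_minutes
    event_driven = servers * changes_per_server_per_day
    return polling, event_driven


print(daily_list_calls(100, 1))   # (1440, 100): callbacks win easily
print(daily_list_calls(100, 15))  # (1440, 1500): polling is now cheaper
```

The break-even point is 1,440 changes per day across the account, which is one change per minute on average.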
I'd like to see two things evolve in the short-term:
- The introduction of a standardized callback format (should be very simple, something like [consumer-base]/[asset class]/[id])
- Integration of event callbacks into the existing cloud APIs.
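To make the suggested [consumer-base]/[asset class]/[id] format concrete, here is one way a provider might build such a URL and a consumer might parse it. No such standard exists; the function names and the `consumer.example` host are hypothetical, and percent-encoding the segments is my own assumption about how identifiers containing slashes would be handled.

```python
from urllib.parse import quote, unquote, urlsplit


def build_callback_url(consumer_base, asset_class, asset_id):
    """Provider side: build [consumer-base]/[asset class]/[id]."""
    return "/".join([
        consumer_base.rstrip("/"),
        quote(asset_class, safe=""),
        quote(asset_id, safe=""),
    ])


def parse_callback_url(url, consumer_base):
    """Consumer side: recover (asset_class, asset_id) from a callback URL."""
    base_path = urlsplit(consumer_base).path.rstrip("/")
    path = urlsplit(url).path
    if not path.startswith(base_path + "/"):
        raise ValueError("callback URL is outside the consumer base")
    parts = path[len(base_path) + 1:].split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected exactly [asset class]/[id]")
    return unquote(parts[0]), unquote(parts[1])


base = "https://consumer.example/callbacks"
url = build_callback_url(base, "server", "i-123")
print(url)                            # https://consumer.example/callbacks/server/i-123
print(parse_callback_url(url, base))  # ('server', 'i-123')
```

The URL names only the asset class and identifier, which fits the idea that callbacks carry no data and the consumer fetches the real state itself.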
I'm sure there's complexity I am not thinking of at this moment, but it's a start.