One of the major sticking points of clouds, besides security, has been reliability. Along with reliability comes availability. And with availability comes service levels. You can obviously see where this is leading…
For some time there has been a heated debate on these topics, primarily centered on the viability of hosting mission-critical applications in the cloud. Just like virtualization, cloud computing and cloud services adoption will move slowly through the IT organization: Test/Dev first, then some non-critical services, then a few “near-mission-critical” applications. Eventually we will get there, but it will take a little time.
Coming from an outsourcing world, I am very accustomed to the concept of Service Level Agreements (SLAs). I’ve spent many an hour wading through contracts, fine-tuning the language in order to maximize value for the client and protection for our company. And sometimes those words paid off for both parties. But most of the time they simply sat in a file cabinet somewhere, waiting to be called into service.
Historically, SLAs focused on a finite set of infrastructure which was well-defined and could be reasonably understood by all parties involved. You could count the servers, firewalls, load balancers, spindles, and you could measure traffic. If something “broke” it was pretty easy to detect. Server A belonged to Client B and it broke, so Client B got a refund. But even in that well-defined, “all the i’s dotted and t’s crossed” world, there was still lots of wiggle room. Short of a power outage or hardware failure, it was hard to determine who caused the problem. And as we know, hardware has historically been more reliable than software, so the bulk of the SLA reviews centered on whose software was at fault in the outage. Throw in the fact that most clients wanted (demanded) root access to the OS, and you can see that the lines of demarcation quickly blurred.
Fast forward a few years and here we are in the era of cloud computing. The relationship between Server A and Client B is no longer so clear. And Server A is more than likely not a physical server, but a virtual machine instead. And it may be one of several virtual machines residing on a single server. And it may be running on a different physical server today as opposed to yesterday. On top of that, it may move to another physical server today because of capacity or performance policies. So as you can see, the lines of demarcation become even more blurred.
Granted, we will still have Infrastructure as a Service (IaaS) models where I simply want the cloud provider to provide physical infrastructure and I will architecturally carve it up to my liking. But as we move beyond IaaS, the question arises – do SLAs really make sense? Let’s stop and think about it for a minute. What are the characteristics and value propositions touted by cloud vendors?
- The cloud is “transparent” – the customer doesn’t need to worry about the details
- The cloud is “scalable” – it can grow on demand to respond to the customer’s needs
- The cloud is “elastic” – it can shrink on demand to respond to the customer’s needs
- The cloud is “homogeneous” – it allows workloads to move freely throughout the cloud
- The cloud is “ubiquitous” – the customer can place the application closer to the user if desired
There are a few other descriptors I could throw into the mix (e.g., polymorphic) that could apply to this discussion, but I think you get the picture. The cloud is by default supposed to respond to internal and external factors/stimuli and self-adapt. Unfortunately, we aren’t there quite yet – but we are getting close(r). The cloud management space is exploding, and the methods and tools needed to support the characteristics and value propositions listed above are hitting the market at a very rapid clip. When we cross over to a self-aware, self-managing cloud model, the whole question of SLAs may become a moot point. If the cloud can respond to the demands of the customer, detect and self-heal hardware and software failures, detect performance anomalies, and respond to changes in traffic and transaction patterns and volumes, then measuring what’s “not working” becomes an unnecessary and burdensome task. What the customer will expect is that the “end user experience”, whatever that may be, is always reliable, consistent and measurable. The customer will see the manifestation of the end user experience in the variation of their monthly bill – much like our utility bills go up and down based on our personal end user experience.
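To make the idea of a self-adapting cloud with a utility-style bill a little more concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than any provider's actual mechanism: the `Policy` fields, the per-instance capacity, the price, and the function names are all hypothetical. It shows only the shape of the logic: size the pool to the current load, replace or retire instances to match, and let the bill follow usage.

```python
import math
from dataclasses import dataclass


@dataclass
class Policy:
    # All values are hypothetical, chosen only for illustration.
    capacity_per_instance: float = 100.0  # requests/sec one instance can serve
    min_instances: int = 1                # never shrink below this
    max_instances: int = 20               # never grow beyond this
    hourly_rate: float = 0.10             # assumed price per instance-hour


def desired_instances(load_rps: float, policy: Policy) -> int:
    """Instances needed for the given load, clamped to the policy limits."""
    needed = math.ceil(load_rps / policy.capacity_per_instance)
    return max(policy.min_instances, min(policy.max_instances, needed))


def reconcile(healthy: int, load_rps: float, policy: Policy) -> int:
    """How many instances to start (positive) or stop (negative).

    A failed instance simply lowers `healthy`, so the same calculation
    covers scaling up, scaling down, and replacing failures.
    """
    return desired_instances(load_rps, policy) - healthy


def bill(hourly_loads: list[float], policy: Policy) -> float:
    """Usage-based charge: one entry in `hourly_loads` per hour of service."""
    instance_hours = sum(desired_instances(load, policy) for load in hourly_loads)
    return instance_hours * policy.hourly_rate
```

In this sketch the customer never sees which instance failed or where a workload moved; the only visible signal is the bill, which rises and falls with the load the service actually absorbed.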
In order to meet that goal, cloud providers will have to assume the risk of hardware and software reliability and will have to bear the burden of infrastructure scalability/elasticity. And that’s where the rub begins (and continues from prior experience in this space). Unless current hardware/software acquisition and licensing models change, this will be a hard pill to swallow for many cloud providers. Years and years of capacity analysis, management and forecasting have gone into the purchasing models and strategies of most companies. We are now going to have to throw many of those models away and start with new, and in some cases uncomfortable, approaches. The “cloud in a box” does not satisfy the requirements of a true cloud environment unless the box makers (Oracle, HP, IBM, Dell, et al.) are willing to make their price/payment model truly elastic as well.
SLAs are not going away anytime soon, and the race is on to see who can be most creative. Some are pretty sophisticated (GoGrid), some are pretty simple (Joyent) and others are non-existent. What is interesting about these SLAs is that the burden of proof lies heavily with the customer (in the GoGrid case the customer must open a trouble ticket while the problem is occurring…). As the cloud becomes more transparent (or maybe opaque is a better word) it’s going to become harder and harder to pinpoint where a problem is occurring. So why don’t we just abandon the old model and work on tools to make the cloud what it’s touted to be… If only it were that easy!