While we may not be able to agree on how “the cloud” will ultimately manifest itself (Public, Private, Hybrid, ???), hopefully we can agree that it’s here to stay. Despite the countless white papers, seminars, roundtables and panel discussions, we still can’t bring the pendulum to equilibrium and agree on whether it’s an old technology with a new veneer, or something brand new and innovative that will forever change the business and IT landscape. In the end, it doesn’t really matter. Whether the technology is old or new is immaterial – what’s important about the cloud is that it has changed our mindset and behavior with respect to how IT services are defined, acquired and delivered. Services being the operative word in that last sentence.

I’ve never been a proponent of the “cloud in a box” model. At least not using traditional architectural models and technologies. Clouds need to be elastic in nature, and even though virtualization allows us to make big physical things logically very granular, I just don’t feel the economics work for big boxes with big stacks. I believe upstarts like Nebula, Nicira and Nutanix will change that model and provide cloud building blocks (based on OpenStack, OpenFlow and similar architectures/models) that deliver the scale-out, virtual granularity and elasticity we need to build truly flexible clouds. But the jury is still out on that front.

But enough about infrastructure and stacks. Let’s get back to the point of this post – services. Much like virtualization has abstracted the hardware layer and made it somewhat irrelevant, I believe cloud-based services will do the same for the application layer. While core, proprietary business logic will always be critically important, I think the future model of constructing applications will be centered primarily on the assembly of cloud-based services. This is fundamentally no different than composite application models – we’ve been building widgets and applets and reusable modules for quite some time now. And messaging systems, enterprise buses, object models, brokers and APIs are commonplace. But as we move toward an architectural and operational model where these things now live in “clouds”, I believe there’s another level of complexity that gets added to the process of building services-based applications.

I’m actually moving away from this picture of a stack. I think this picture of a stack where you have infrastructure at the bottom, platforms on top of that and all the services on top of it is actually just a model. It is really outdated. In essence, most of the whole Amazon cloud is much more than the Amazon services. It is a whole collection of all services from all ecosystem providers as well. So there’s many different players. Many in the ISV world, many coming out of the system integrator world and management world that are adding so many different services into the Amazon ecosystem that the cloud is much more, and can no longer be defined by just having strict layers. Many applications will be pulling different services from different places and connecting them together.

Werner Vogels, Amazon CTO – GigaOM Structure 2011

So if I’m building applications in the future services-based cloud model, what are the things I need to consider? While there are lots of moving parts, which makes the answer pretty tricky, I think there are four basic areas that need to be carefully considered.

Services brokering

Publish/subscribe, object request brokers, web services descriptors, service oriented architectures – the things that make up the nuts and bolts of distributed applications have been around for some time now and have enabled pretty sophisticated services-based applications. But for cloud-based services which can be consumed “at will” we need to move beyond the technical aspects of the application model and focus on how services are enabled. There needs to be some sort of clearinghouse to manage the process of locating and acquiring reliable services. Why is this important? First and foremost is sustainability. If I’m committing my enterprise to run its business on loosely-bound application services then I need to ensure that those services aren’t going to disappear without (sufficient) warning. Second, and equally important, is service level assurance. Now that I’m combining multiple services to build a composite application, how do I ensure that any single underperforming service doesn’t degrade overall system stability, reliability or performance? This is really nothing new in the sense that any aspect of compute, network or storage can affect performance, but now I’m abstracting those elements one more level and putting my trust in a service-component provider to ensure that things perform as expected.
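To make the clearinghouse idea a bit more concrete, here is a minimal sketch of what a broker lookup driven by sustainability and service-level criteria might look like. The registry, service names and SLA fields are all hypothetical; the point is simply that selection is based on contractual and performance attributes, not just on an API being available.

```python
# Minimal sketch of a clearinghouse lookup. The catalog, service names and
# SLA fields here are hypothetical -- selection is driven by sustainability
# and service-level criteria, not just by the API itself.
from dataclasses import dataclass

@dataclass
class ServiceOffer:
    provider: str
    endpoint: str
    uptime_sla: float        # committed availability, e.g. 0.999
    p95_latency_ms: int      # committed 95th-percentile response time
    notice_period_days: int  # contractual warning before the service can be withdrawn

CATALOG = {
    "payments": [
        ServiceOffer("provider-a", "https://a.example.com/pay", 0.999, 250, 90),
        ServiceOffer("provider-b", "https://b.example.com/pay", 0.995, 120, 30),
    ],
}

def broker(service, min_uptime, max_latency_ms, min_notice_days):
    """Return offers that satisfy both sustainability and performance criteria."""
    candidates = [
        o for o in CATALOG.get(service, [])
        if o.uptime_sla >= min_uptime
        and o.p95_latency_ms <= max_latency_ms
        and o.notice_period_days >= min_notice_days
    ]
    # Prefer the fastest offer among those that clear the bar.
    return sorted(candidates, key=lambda o: o.p95_latency_ms)

if __name__ == "__main__":
    for offer in broker("payments", min_uptime=0.999, max_latency_ms=300, min_notice_days=60):
        print(offer.provider, offer.endpoint)
```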

There’s been quite a bit of discussion about a “spot market” for cloud services where enterprises can bid for unused capacity. It’s a very interesting concept that is getting more attention from research firms such as Forrester, which recently published a report on the topic. Their model is quite comprehensive and covers a broad spectrum of cloud acquisition and management services – many of which are leaps of faith in today’s cloud market. For the development of services-based applications the process will be less dynamic and can be implemented using existing models. Take for example Equinix’s International Business Exchange (IBX) and Ecosystem model. Equinix provides a mechanism by which customers can leverage the services of other customers in the same cloud ecosystem. In other words, Equinix provides a brokering service between customers. While the Equinix model has historically been focused on infrastructure (compute, connectivity, etc.), there’s nothing to preclude it from being a cloud-services brokerage. Amazon, judging by Werner Vogels’ comments at this year’s GigaOM Structure, seems to be clearly headed toward this space to act not only as the provider of basic cloud-enabling services, but also as a facilitator of the services that make up business applications. Could there possibly be a “closer” relationship between Amazon and Equinix on the horizon?

Integration and interoperability

Once you’ve located and vetted the building blocks for your services-based application, getting them to work together is the next and possibly bigger challenge. Given the richness and maturity of this segment of the application development landscape, one might question that position. But the multiple options from which we have to choose, while providing flexibility, can also be the bane of application development. Unless you are very lucky, the collection of services you select to build your application ecosystem is unlikely to use a common integration approach. So this leaves you with the problem of learning, developing and supporting multiple interfaces. Again, this is not a new problem, but where it gets sticky is that cloud services are still evolving and common practices like documentation and release management are lacking or non-existent in some situations. On top of that, depending on the nature of your application, you may be dependent on a service that ultimately receives no direct revenue from external application integration but reaps downstream or indirect benefits. Social networking/media services such as Facebook, Twitter, Google and LinkedIn are examples of this model. The risk with direct integration with these types of services is that since they receive no direct revenue from third-party connections, they have little incentive to ensure the integrity of their interfaces. Anybody who has ever expended effort to use the Facebook API can relate to this problem.

To mitigate this risk application developers can pursue multiple strategies. One is to be “all in”, selecting a fairly self-contained ecosystem to provide a number of services through a single interface. A nice approach, but it’s doubtful that many applications will hitch their wagon to a one-horse team. At the other end of the spectrum is the “alphabet soup” approach: selecting one or more generally available/accepted integration protocols such as SOAP, REST, XBRL, XPDL, etc., and coding to the specifics of the service with which you are interfacing. This approach requires more development effort and maintenance, but ensures a higher degree of stability in that these protocols are unlikely to suffer from variability over time. Between these ends of the spectrum lie solutions like traditional enterprise buses (new wave: Queplix, Boomi, Vordel, IBM/Cast Iron, MuleSoft, SnapLogic, Jitterbit; old school: WebSphere MQ, BizTalk Server, Sterling Commerce B2B, Tibco ESB, JBoss ESB, CORDYS/iWay), API services (Apigee, Mashery, Layer 7 Technologies, Apache/Delta Cloud, ProgrammableWeb), or direct application APIs (Netflix, Facebook, PayPal, Google, Kayak, Amazon, etc.). Regardless of the approach you select, ensuring that you have a clear understanding of the service levels you can expect from the third-party service with which you are integrating is absolutely critical.
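As a concrete illustration of the “alphabet soup” approach, here is a minimal sketch of isolating each provider’s protocol quirks behind a single internal contract, so the rest of the application never codes directly against a third-party interface that may change underneath it. The service names, endpoints and payload shapes below are hypothetical.

```python
# A thin adapter layer: each third-party service keeps its own protocol quirks
# behind a common internal contract, so an interface change is absorbed in one
# adapter rather than rippling across the application. Endpoints and payload
# shapes are hypothetical.
import json
import urllib.request
from abc import ABC, abstractmethod

class ProfileService(ABC):
    """Internal contract the rest of the application codes against."""
    @abstractmethod
    def get_profile(self, user_id: str) -> dict: ...

class RestJsonProfileAdapter(ProfileService):
    """Adapter for a provider exposing plain REST/JSON."""
    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def get_profile(self, user_id: str) -> dict:
        with urllib.request.urlopen(f"{self.base_url}/users/{user_id}", timeout=5) as resp:
            raw = json.load(resp)
        # Normalize the provider's field names into our internal shape.
        return {"id": raw.get("id"), "name": raw.get("display_name")}

class GraphStyleProfileAdapter(ProfileService):
    """Adapter for a social-graph style API with a different URL and payload layout."""
    def __init__(self, base_url: str, access_token: str):
        self.base_url = base_url.rstrip("/")
        self.access_token = access_token

    def get_profile(self, user_id: str) -> dict:
        url = f"{self.base_url}/{user_id}?fields=id,name&access_token={self.access_token}"
        with urllib.request.urlopen(url, timeout=5) as resp:
            raw = json.load(resp)
        return {"id": raw.get("id"), "name": raw.get("name")}
```

The application only ever calls get_profile(), so when one provider changes its interface the fix is confined to a single adapter.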

Performance

Maintaining performance of distributed applications has always been a challenge, especially in dealing with latency and time-sensitive transactions. There’s fundamentally nothing different between legacy distributed application models and cloud-based services models. It’s simply a request/response system in which the two parties (applications) agree (handshake) on how to communicate. Very little of this handshake has anything to do with agreement on what levels of performance are expected. The difference with legacy distributed systems is that the enterprise usually had a fairly well-defined relationship with the third party (service provider) and had made an investment in dedicated networking capability to ensure an expected level of performance. With the advent of Werner Vogels’ services-rich ecosystem the likelihood of building those dedicated connections is diminished. Obviously you can take additional steps to ensure performance by utilizing VPNs, or even go the extra mile and install direct connections. But this somewhat defeats the purpose of an elastic cloud model where you want the ability to move/expand/contract workloads or add/change/delete services as needed. As we move forward through the cloud life cycle we will continue to see closer relationships between infrastructure and cloud providers to mitigate these performance issues (e.g., Amazon and Equinix’s recently announced Direct Connect service).
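Because the handshake itself says nothing about performance, the consumer has to impose its own expectations at the call boundary. Below is a minimal sketch of wrapping a service call with a hard timeout and recording observed latency so drift against a provider’s stated service level becomes visible; the endpoint and latency budget are hypothetical.

```python
# Impose the consumer's own performance expectations at the call boundary:
# enforce a hard timeout and record observed latency so drift against the
# provider's stated SLA is visible. The endpoint is hypothetical.
import time
import urllib.request

def timed_call(url: str, timeout_s: float = 2.0):
    """Return (body, elapsed_seconds); raise if the provider misses the budget."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        body = resp.read()
    return body, time.monotonic() - start

latencies = []
try:
    _, elapsed = timed_call("https://service.example.com/quote", timeout_s=2.0)
    latencies.append(elapsed)
except Exception as exc:   # a timeout or transport failure counts against the provider
    print("call failed:", exc)

if latencies:
    p95 = sorted(latencies)[int(0.95 * (len(latencies) - 1))]
    print(f"observed p95 latency: {p95 * 1000:.0f} ms")
```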

The other key aspect of performance that has received lots of press lately is service assurance and recovery. Both Amazon and Microsoft have been at the center of this discussion, and through the process the industry has learned that recovery mechanisms for cloud services are still in their infancy. While we like to think that we have well-documented “run books”, mechanisms such as Chef to create templates, metadata and scripts to automate recovery processes, and tools such as Chaos Monkey to put our designs through all sorts of failure stress tests, recovering a highly distributed system made up of lots of independent services – which in turn have their own recovery processes and procedures – is not easy. During the latest Amazon outage some customers were offline for almost a week.

The problem we face in today’s cloud computing model is that we design for high availability as opposed to designing for failure. Recovering workloads in the traditional legacy model focused on creating and reinstating workloads on duplicate infrastructure instances. To minimize exposure and risk to cloud service interruptions, enterprises must identify and plan for the initiation of alternative services in the event of a major outage. These services may be within the same ecosystem provider, but as we have seen in the two latest Amazon outages, this can be risky. The ideal solution is to have a completely alternative approach to accessing mission critical services that are outside the control of the enterprise. While this may be an expensive proposition, it’s a critical aspect of ensuring service availability – as we have learned from Netflix and others.
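At the application layer, designing for failure rather than just high availability can be as simple as declaring an ordered list of equivalent providers for a critical function and failing over when the primary is unreachable. The sketch below illustrates the idea; the providers and endpoints are hypothetical, and in practice the alternate would deliberately live in a different ecosystem.

```python
# Designing for failure: declare an ordered list of equivalent providers for a
# mission-critical function and fail over when the primary is unreachable.
# Providers and endpoints are hypothetical.
import urllib.request

PROVIDERS = [
    "https://primary.example.com/api/orders",
    "https://alternate.example.org/api/orders",   # different ecosystem on purpose
]

def place_order(payload: bytes) -> bytes:
    last_error = None
    for endpoint in PROVIDERS:
        try:
            req = urllib.request.Request(endpoint, data=payload,
                                         headers={"Content-Type": "application/json"})
            with urllib.request.urlopen(req, timeout=3) as resp:
                return resp.read()
        except OSError as exc:     # covers URLError and socket timeouts
            last_error = exc       # record the failure and try the next provider
    raise RuntimeError(f"all providers failed, last error: {last_error}")
```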

Management and monitoring

Up to this point the focus of monitoring and management tools in the cloud computing space has been on virtual machines, application performance and chargeback. Most of the tools that address application performance management (APM) such as New Relic, AppDynamics, BlueStripe and Coradiant (now part of BMC) either focus on internal application performance, providing detailed analysis of Java or Ruby code performance, or they provide an end-user experience view of the application from a usability perspective. Little progress has been made in the area of inter-application (including third-party services) management, examining the relationship and performance of the various services that contribute to the application ecosystem. For the purposes of this discussion, traditional I/O and database services, which have fairly robust management and monitoring tools, are not included as third-party services.

The next generation of APM tools will need to provide a broader context, including the discovery and relationship mapping of all the elements that comprise the application as a whole. Only then can the application “owner” have a complete view of the entire application environment. Tools such as Adaptivity Blueprint-4IT and Nimsoft provide an upfront lifecycle view of the application environment and automated discovery of infrastructure elements, but don’t provide the application-specific correlation and mapping of service elements. For this purpose, enterprises will need to consider discovery-based tools that specialize in automatically identifying and cataloging the service elements that make up an infrastructure environment and then provide a mechanism to designate (or validate) the relationships between service elements (e.g., LAYERZngn/CPlane). This will then enable the next step of assigning performance characteristics to these relationships (via metadata models), which can then be monitored and measured. In the meantime enterprises are still faced with the “swivel chair” approach to APM.
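To illustrate what relationship mapping with attached performance characteristics might look like, here is a small sketch that models the services making up an application as a graph, attaches expected metrics to each dependency edge, and flags edges whose observed behavior drifts from the declared expectation. All service names and numbers are hypothetical.

```python
# Sketch of application-level relationship mapping: a dependency graph whose
# edges carry expected performance characteristics, checked against observed
# telemetry. Service names and figures are hypothetical.
EDGES = {
    ("web-frontend", "catalog-service"):   {"max_latency_ms": 150, "min_availability": 0.999},
    ("web-frontend", "payment-gateway"):   {"max_latency_ms": 400, "min_availability": 0.9995},
    ("catalog-service", "search-service"): {"max_latency_ms": 100, "min_availability": 0.999},
}

def check(observed):
    """Compare observed edge metrics against the declared expectations."""
    violations = []
    for edge, expected in EDGES.items():
        seen = observed.get(edge)
        if seen is None:
            violations.append((edge, "no telemetry"))
            continue
        if seen["latency_ms"] > expected["max_latency_ms"]:
            violations.append((edge, f"latency {seen['latency_ms']} ms over budget"))
        if seen["availability"] < expected["min_availability"]:
            violations.append((edge, f"availability {seen['availability']} below target"))
    return violations

if __name__ == "__main__":
    observed = {
        ("web-frontend", "catalog-service"): {"latency_ms": 180, "availability": 0.9992},
        ("web-frontend", "payment-gateway"): {"latency_ms": 320, "availability": 0.9990},
    }
    for edge, reason in check(observed):
        print(" -> ".join(edge), ":", reason)
```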

In summary…

The cloud computing model holds great potential for delivering significant economic benefit to enterprises of all sizes. However, it is not without risk and is by no means “automatic”. Designing for the cloud – whether it’s private, hybrid or public – requires a new focus on the concept of services. It also requires a new mindset that embraces the notion that some or all of the mission critical services that make up applications may be outside of the direct control of the enterprise. While these services offer flexibility, agility and the promise of lower cost, they need to be viewed for what they are – services provided by others who may or may not hold the same view of performance and service assurance. Therefore it is critical that enterprises apply a few extra factors to the application design process to ensure that mission critical business functions don’t fall victim to those differences in views.


Disclosure: I am a business and technical advisor to LAYERZngn. I have no other affiliations with any of the companies mentioned in this post. Inclusion of companies and products in this post is for the purpose of providing examples of various services and in no way indicates an endorsement of one product over another.