"Project Zeus" -- How to Build a Cloud

Technical excellence, code ninjitsu, and pizza-fueled all-nighters. This is the very picture of what it must mean to build something awesome in software. Our collective fevered imaginations run to 4am Halo battles mixed with marathon code sessions, cheap beer with amazing algorithms, bean-bag-chair-naps with the very latest in technologies and platforms. When deadlines are close, and when the technology challenges are large, it means another marathon game of Wizards and Martyrs.

Clearly, I'm wrong. Maybe some products succeed when built using that model, but the reality is usually very different. The most important features of a successful development effort are simplicity, pragmatism, flexibility, constraints, the willingness to work with the landscape instead of against it, and, most importantly, trust. This is the story of "Project Zeus", and how Contegix and Relevance worked together to build something cool.

The Problem

Contegix is moving into the cloud. They are already the best in the biz for managed hosting with their attention to customer service and technical competence. Now they are applying that excellence to cloud services. With 100% network and power uptime SLAs coupled with their legendary service, the platform should be a real home run.

Contegix chose to build on top of VMware's ESX, vSphere and vCenter technologies. They needed a development partner to build the customer layer, where individual users could control all this kick-ass technology and bend it to their will. The system needs to be fast, stable, usable, and done. Perhaps most of all, it has to live peacefully with the ecosystem of apps Contegix already uses to help their customers.

The Solution

Relevance places extremely high value on the twin properties of simplicity and elegance. By simplicity, we mean the code should focus on the core values of the proposed system and leave out extraneous, minimally valuable fluff. The smaller the codebase, the better. By elegance, we mean that the code itself is easy to understand, readable and extensible by other developers. It should tell its own story, without needing one of us to stand over the monitor drawing air diagrams of complex interactions.

When the team began working on the project, it quickly became obvious that, at its heart, "Project Zeus" is a messaging system. There is a web-based user experience, but the vast majority of the work being done is asynchronous communication between the user and the VMware infrastructure assembled by Contegix. That meant our solution needed to focus on:

  • a lightweight web application for direct communication with the user
  • a messaging infrastructure for passing commands and results between the user and vCenter
  • a robust engine for manipulating vCenter to perform the user-requested tasks, which needed to be:
    • fault tolerant (we are dealing with cloud resources here)
    • easily monitored for responsiveness, error rates, etc.
    • easily extensible as new features are designed and developed
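
To make the shape of that list concrete, here is a minimal sketch of the flow in Python. It is illustrative only: the in-memory queues, the handler table, and every name in it are hypothetical stand-ins, not the project's code. The web side enqueues a command and returns; the engine side takes a command, runs it, and always publishes a result, even when the task fails.

```python
import queue

# Hypothetical in-memory stand-ins for the command and result queues.
commands = queue.Queue()
results = queue.Queue()

def submit(request_id, action, target):
    """Web-application side: enqueue a command and return immediately."""
    commands.put({"id": request_id, "action": action, "target": target})

# Hypothetical handlers standing in for calls into vCenter.
HANDLERS = {
    "power_on": lambda target: f"{target} powered on",
    "power_off": lambda target: f"{target} powered off",
}

def process_one():
    """Engine side: take one command, run it, and always publish a result."""
    command = commands.get()
    try:
        handler = HANDLERS[command["action"]]
        outcome = {"id": command["id"], "status": "ok",
                   "detail": handler(command["target"])}
    except Exception as error:
        # A failed task becomes an error message instead of a dead engine.
        outcome = {"id": command["id"], "status": "error", "detail": repr(error)}
    results.put(outcome)

submit(1, "power_on", "vm-42")
submit(2, "teleport", "vm-42")  # unknown action, exercises the error path
process_one()
process_one()
first = results.get()
second = results.get()
print(first["status"], second["status"])
```

Because every outcome travels back as a message, error rates can be counted by watching the result queue, and a new feature amounts to one more entry in the handler table.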

Contegix has an existing investment in Java-based infrastructure and has built a reputation for hosting and managing Java solutions over a long history. They also have a keen interest in any technology that gets the job done. And it turns out that simplicity and elegance are at the top of their priority list, too.

The Messaging Infrastructure

This was easy. We chose ActiveMQ: an open source, proven, Java-based message queue. We love and trust open source, and Java-based solutions are robust and well tested, and they fit in well with the existing Contegix management infrastructure. It's also compatible with the popular Stomp protocol, so by adapting our system to Stomp we have the flexibility to swap out different queues if the need arises.
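
Part of Stomp's appeal is how little there is to it. A frame is a command line, zero or more "key:value" header lines, a blank line, a body, and a terminating NUL byte. The sketch below encodes and decodes such a frame in Python; the queue name and the JSON body are hypothetical, and a real client would also handle connection setup, acknowledgements, and header escaping.

```python
def encode_frame(command, headers, body=""):
    """Serialize a frame: command, header lines, blank line, body, NUL."""
    lines = [command] + [f"{key}:{value}" for key, value in headers.items()]
    return ("\n".join(lines) + "\n\n" + body + "\x00").encode("utf-8")

def decode_frame(data):
    """Parse a frame produced by encode_frame back into its three parts."""
    text = data.decode("utf-8").rstrip("\x00")
    head, _, body = text.partition("\n\n")
    command, *header_lines = head.split("\n")
    headers = dict(line.split(":", 1) for line in header_lines)
    return command, headers, body

# Hypothetical destination and payload, for illustration only.
frame = encode_frame(
    "SEND",
    {"destination": "/queue/zeus.commands"},
    '{"action": "power_on"}',
)
command, headers, body = decode_frame(frame)
print(command, headers["destination"], body)
```

Any broker that speaks this wire format can sit behind the same client code, which is where the freedom to swap queues comes from.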

The Back End

We decided on JRuby for the back-end implementation. vCenter exposes all of its functionality via a SOAP interface. There are two good wrappers for it already: a command-line Perl wrapper and a Java proxy, VIJava. For purposes of monitoring and process management, we decided that the Java proxy was the best way to go. Having made that decision, we knew our solution would have to run on the JVM. With JRuby, we could easily interact with the VIJava proxy for vCenter and still take advantage of the productivity and low ceremony that Ruby provides on the back end. Couple that with standard Java monitoring tools and we have an ideal solution for our asynchronous processing engine.

The Web Application

This also turned out to be an easy choice. There are no direct connections between the web interface and the back end; everything travels through the message queue. For scalability purposes, the web application and back-end application will exist on separate servers. We had no limitations on what we could choose as our technology platform. Therefore, we went with the most productive, simplest platform we could think of: Rails 2.3.2 via Passenger.

The only special need is the inclusion of a Ruby Stomp client for interacting with the Java-based message queue. For this piece of the stack, we decided to use RosettaQueue as our messaging gateway library. It wraps the Stomp client, provides a clean interface for message consumers and publishers, and also provides clients for AMQP and beanstalkd if the need arises. We rolled RosettaQueue into the back end as well, to keep things simple and consistent between the front end and back end wherever our code has to interact with the messaging system.
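
The value of a gateway like this is easiest to see in a small sketch. The Python below is not RosettaQueue's API; the class and method names are invented for illustration. It shows the general pattern: application code publishes and consumes by logical name, while an adapter object hides which broker protocol is actually in use.

```python
class InMemoryAdapter:
    """Hypothetical adapter; a real one would speak Stomp, AMQP, or beanstalkd."""

    def __init__(self):
        self.queues = {}

    def send(self, destination, message):
        self.queues.setdefault(destination, []).append(message)

    def receive(self, destination):
        pending = self.queues.get(destination, [])
        return pending.pop(0) if pending else None


class Gateway:
    """Maps logical names to broker destinations and delegates to an adapter."""

    def __init__(self, adapter, destinations):
        self.adapter = adapter
        self.destinations = destinations

    def publish(self, name, message):
        self.adapter.send(self.destinations[name], message)

    def consume(self, name, handler):
        """Hand one pending message to handler; report whether one was found."""
        message = self.adapter.receive(self.destinations[name])
        if message is not None:
            handler(message)
        return message is not None


# Hypothetical destination name, for illustration only.
gateway = Gateway(InMemoryAdapter(), {"commands": "/queue/zeus.commands"})
gateway.publish("commands", "power_on vm-42")
received = []
handled = gateway.consume("commands", received.append)
print(handled, received)
```

Switching brokers then means writing one new adapter with the same two methods, and neither the web application nor the engine has to change.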

Everything else is just a normal Rails application, hosted via Passenger and MRI. Since Contegix has a long history hosting Rails applications for their customers, they were perfectly happy with this solution for the web application.

The Result

So far, the results have been remarkably pleasing. We went from initial architectural decision to working demonstration in an extremely short period of time. Contegix is happy with both the velocity of the feature development and the overall simplicity of the system. They can easily see how to manage the system once it is fully deployed, and how we can come back and extend it as user feedback is collected. Contegix is well on their way to a late summer roll-out with a feature set and experience equal to or better than those of any competitor.