Can Clouds be Used as True HPC Resources?

The use of cloud computing has been growing steadily because purchasing computing as a service (like electricity) rather than as a product (like a generator) has many cost advantages. Made possible by operating system (OS) virtualization and the Internet, cloud computing allows almost any server environment to be replicated (and scaled) instantly. Many web service companies find cloud computing more economical than purchasing (or co-locating) hardware because they can pay for computing services only when needed.

The definition of a "computing cloud" can vary depending on the customer and vendor. The definition from Wikipedia is as follows: Cloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a metered service over a network (typically the Internet).

The ability to rapidly construct and meter needed computing services is what makes the cloud model successful for both providers and customers. Grid computing made the same promise years ago, but had trouble delivering services rapidly. Most grid systems offered end users low-level library compatibility rather than the machine-level compatibility of a cloud. Full OS virtualization gives users complete compatibility and eliminates the library mismatch issues that often occurred in grid systems.

HPC in a Cloud

The advantages of cloud computing are certainly attractive to HPC users. Indeed, in many cases, users cannot find enough cycles on existing systems, and cloud HPC would be a viable economic alternative to purchasing additional hardware. At first glance clouds would seem to be a welcome addition to the HPC toolbox, but on closer inspection traditional clouds do not (or cannot) offer many important aspects of HPC. To illustrate the lack of overlap, consider the following diagram. The only shared features are scalability and reliability. The other aspects are orthogonal in nature and represent a serious mismatch between the two approaches.

HPC Cloud Diagram
Figure One: The overlap between traditional cloud services and an HPC system.

A Deeper Look

A "traditional" cloud offers features that are attractive to web service organizations. Most of these services are single or loosely coupled instances (an instance of an OS running in a virtual environment). There are service level agreements (SLAs) that provide the end user with guaranteed levels of service. The features that are attractive to end users, as shown in the figure above, are as follows:

  • Instant Availability - A cloud offers almost instant availability of resources. The amount of "computing" can be quickly increased or decreased.
  • Large Capacity - Users can instantly scale the number of applications within the cloud. There is often no waiting for resources.
  • Software Choice - Users can design their "instances" to suit their needs. There are few software restrictions in the virtual environment.
  • Virtualized - Instances can be easily moved to and from similar clouds.
  • Service Level Performance - Users are guaranteed a certain minimum level of performance.

Contrast these features with those that are attractive to most HPC users:

  • Close to the "Metal" - Many man-years have been invested in optimizing HPC libraries and applications to work closely with the hardware, thus requiring specific OS drivers and hardware support.
  • Userspace Communication - In HPC, user applications need to bypass the OS kernel and communicate directly with remote user processes, but this feature is typically not supported in a traditional cloud environment.
  • Tuned Hardware - HPC hardware is often selected based on communication, memory, and processor speed for a given application set.
  • Tuned Storage - HPC storage is often designed for a specific application set and user base.
  • Batch Scheduling - Nearly all HPC systems use a batch scheduler to share limited resources. User jobs must wait until resources become available.
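To make the batch scheduling point concrete, here is a minimal sketch of a strict first-come, first-served queue. It is an illustration only: the function name `simulate_fifo`, the job tuples, and the time units are assumptions made for this example, and production schedulers add priorities, backfill, and fair-share policies that are not modeled here.

```python
import heapq


def simulate_fifo(total_cores, jobs):
    """Simulate a strict first-come, first-served batch queue.

    jobs is a list of (name, cores_needed, runtime) tuples, all submitted
    at time 0 and served in list order. Returns {name: start_time}.
    All names and units are hypothetical, chosen for illustration.
    """
    free = total_cores
    now = 0
    running = []  # min-heap of (finish_time, cores_held)
    starts = {}
    for name, cores, runtime in jobs:
        if cores > total_cores:
            raise ValueError(f"{name} needs more cores than the cluster has")
        # The job waits: advance the clock until enough cores are released.
        while free < cores:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        starts[name] = now
        free -= cores
        heapq.heappush(running, (now + runtime, cores))
    return starts


if __name__ == "__main__":
    queue = [("a", 8, 10), ("b", 4, 5), ("c", 4, 5), ("d", 8, 1)]
    for job, start in simulate_fifo(8, queue).items():
        print(f"job {job} starts at t={start}")
```

On an 8-core cluster, job "a" takes every core, so "b" and "c" cannot start until it finishes, and "d" waits in turn for both of them. That waiting is exactly what the "instant availability" of a cloud is meant to avoid.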

One shared aspect between the two feature sets is resource scalability, that is, the ability of the user to quickly increase compute resources. Since most cloud applications are sequential single-process jobs, scalability is easily accomplished by adding virtual machines. In the case of HPC, scalability usually refers to an application property that determines how many cores (processors) can be applied to the problem before performance levels off. It can also represent the number of users’ jobs that can run on an HPC cluster regardless of program scalability.
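One common way to reason about where application performance levels off is Amdahl's law, which is brought in here as an illustrative model and is not something the discussion above depends on. The sketch below assumes a fixed problem size and a hypothetical application whose runtime is 95% parallelizable; the function name `amdahl_speedup` is likewise chosen for this example.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup on `cores` cores under Amdahl's law.

    parallel_fraction is the share of single-core runtime that can be
    spread across cores; the remainder is assumed to stay serial.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical application: 95% of the work parallelizes.
    for cores in (1, 8, 64, 512):
        print(f"{cores:4d} cores -> speedup {amdahl_speedup(0.95, cores):.1f}")
```

With a 5% serial portion the speedup can never exceed 20 no matter how many cores are added, so in this idealized model most of the benefit is gone well before the core count reaches the hundreds. Real applications also pay communication costs that this model ignores.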

In essence, in both clouds and HPC clusters users can scale the amount of computing that they require. There is a big difference, however, in how scalability is managed. In a cloud, additional resources are created by adding virtual machines (OS instances). In a cluster, the resource scheduler provides additional physical resources for a user’s application. Due to their shared nature, clouds often have more compute capacity than many clusters do.

Another shared aspect is redundancy through hardware independence. That is, the user, for the most part, does not care about (or control) the exact hardware on which their applications run. Thus, both clouds and clusters can schedule around broken or bad hardware.

Regardless of the similarities, the differences between the two feature sets are important to remember. Perhaps the biggest mismatch between the two is in performance. HPC applications strive to maximize performance on particular hardware. Clouds only guarantee "minimal" levels of performance in terms of compute and I/O capability. Thus, if your maximum requirements are near the cloud minimum, then cloud computing may be a solution. Otherwise, the performance you were expecting may not be achievable, or may not be delivered on a consistent basis, within a cloud.
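The rule of thumb above can be written as a simple check: a cloud is a candidate only when every requirement is covered by the guaranteed floor, not by the best case. The following is a rough sketch; the metric names and figures are hypothetical and do not describe any real provider's service level agreement.

```python
def cloud_is_candidate(required, guaranteed_minimum):
    """Return True only if every required metric is met by the guaranteed floor.

    Both arguments map a metric name to a number where higher is better.
    A metric the provider does not guarantee at all is treated as 0.
    """
    return all(
        guaranteed_minimum.get(metric, 0.0) >= need
        for metric, need in required.items()
    )


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    sla_floor = {"gflops_per_core": 8.0, "io_mb_per_s": 100.0}
    modest_job = {"gflops_per_core": 5.0, "io_mb_per_s": 80.0}
    io_heavy_job = {"gflops_per_core": 5.0, "io_mb_per_s": 400.0}
    print("modest job fits:", cloud_is_candidate(modest_job, sla_floor))
    print("I/O-heavy job fits:", cloud_is_candidate(io_heavy_job, sla_floor))
```

Comparing against the guaranteed minimum rather than observed peak performance is deliberate, since the minimum is the only level the provider commits to delivering consistently.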

In our next installment we will explore how some desirable cloud features can be delivered in an optimized HPC environment. In the meantime, you may want to check out an HPC cloud at