Software Defined Supercomputer
In a previous guest blog for techUK I discussed aspects of interoperability across the continuum of infrastructures that are, or will be, available to many organisations as they grapple with the complexities of data locality and interoperability across cloud-to-edge. Here, I discuss how modern DevOps methods apply to the building, maintenance and use of modern infrastructure as a flexible and agile resource, from the supercomputer down to what scientists have called “the bigger laptop”.
At StackHPC we help our customers get the most out of modern hardware infrastructure, both for their current computing needs and for the rapid transition to cloud access models and the convergence of HPC and AI. This means not only encompassing the bare-metal infrastructure case (the siloed supercomputer), where performance is paramount, but also extending the performance envelope to virtualised, secure resources.
We have helped customers use OpenStack to manage all their infrastructure, bringing unified operations across a diverse set of applications and infrastructure without any loss in performance. These aspects introduce the concept of the Software Defined Supercomputer, incorporating operational and application workflows; something we, or at least I, call HPCStack2.0.
To contextualise, we discuss the Cambridge CSD3 service, a UK National Research Cloud and one of the UK’s most powerful academic supercomputers. It’s hosted at the University of Cambridge and funded by UKRI via STFC DiRAC, STFC IRIS, EPSRC, MRC and UKAEA.
Diverse Workloads
In scientific disciplines, large-scale computing was historically the preserve of the physical sciences. With the advent of disciplines like bioinformatics and data science, it has in recent years become a central theme of the life and social sciences too, although it has not always been readily accessible to these cohorts.
Traditionally, organisations have a central HPC system with a job scheduler that fairly queues up requests for an allocation of computing infrastructure. While increasing hardware diversity can be managed via schedulers in the traditional HPC simulation case, not all workloads fit well within this scheme, particularly when interactive or persistent services are required and time to science is critical.
In addition, there are increasing requests from scientists working on more sensitive data, who need higher levels of separation from other users on the system. This brings in the concept of multi-tenancy. In the past, a requirement to support such capability and sovereignty, as in the case of Cambridge supporting MRC Trusted Research Environment needs, would have required the lengthy procurement of new air-gapped hardware. A software-defined approach significantly decreases the time to research outcomes.
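To make the multi-tenancy idea concrete, here is a minimal, illustrative sketch using the openstacksdk Python client: an isolated project (tenant) is created through the identity API and a single researcher is granted access to it, with no new hardware involved. The cloud, project, user and role names are hypothetical, not CSD3’s actual configuration.

```python
import openstack

# Illustrative only: create an isolated project (tenant) for a sensitive
# workload through the identity API, rather than procuring separate
# air-gapped hardware. Assumes a cloud named "csd3" in clouds.yaml and
# an identity-admin role; all names are hypothetical.
conn = openstack.connect(cloud="csd3")

project = conn.identity.create_project(
    name="mrc-tre-study",
    description="Trusted Research Environment project (illustrative)",
    is_enabled=True,
)

# Grant a named researcher the member role on that project, and nothing else.
user = conn.identity.find_user("alice")
role = conn.identity.find_role("member")
conn.identity.assign_project_role_to_user(project, user, role)
```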
OpenStack. Really?
OpenStack is the most widely used cloud operating environment in Research Computing, and it is also used by many commercial telecoms and private cloud providers. Typically it is used to support virtualisation, although it also offers a bare-metal service, which is growing in popularity.
The first building block here was StackHPC’s work on the AlaSKA prototype for the Square Kilometre Array, demonstrating the use of OpenStack services and APIs to build HPC resources supporting the platforms required by the astronomy community for container orchestration and batch scheduling.
We took this into production, working with CSD3. The Cambridge support team wanted to harmonise their system software stack, bringing the then separate infrastructures for on-premises cloud and HPC cluster under a common middleware layer. This required moving from the well-trodden path of xCAT, cluster provisioning software dating from the 2000s, to using OpenStack Ironic to provision their new servers, enabling them to support different workloads: some bare-metal and some virtual. Cambridge now uses OpenStack to manage all their new hardware. There are still some things you can't do with bare metal (snapshots, for example), but anything that works on bare metal can generally also be run in VMs.
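As a deliberately simplified illustration of that last point, the openstacksdk sketch below requests a node through the Nova compute API; whether it lands on bare metal (via Ironic) or in a VM is determined entirely by the chosen flavor. The cloud, flavor, image, network and key names are assumptions for the example, not CSD3’s real configuration.

```python
import openstack

# A minimal sketch: with Ironic behind Nova, requesting a bare-metal node
# uses the same create_server call as requesting a VM; only the flavor
# differs. All names below are illustrative.
conn = openstack.connect(cloud="csd3")

flavor = conn.compute.find_flavor("baremetal-compute")   # or e.g. "vm.large"
image = conn.image.find_image("rocky-9-hpc")
network = conn.network.find_network("cluster-net")

server = conn.compute.create_server(
    name="compute-001",
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{"uuid": network.id}],
    key_name="ops-key",
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```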
The ability to codify infrastructure also means that the full lifecycle of resources is managed, from initial power-on to decommissioning, and that resources can move from one function to another with traceability; an important concern for many organisations, like Cambridge, that provide a level of data governance to classes of research.
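As a hedged sketch of what that traceability can look like at the bare-metal layer, the snippet below uses openstacksdk to inspect the provision state that Ironic records for every node, and notes in comments how retirement is likewise an API action. Again, the cloud and node names are illustrative.

```python
import openstack

# Lifecycle visibility: every node managed by Ironic carries an explicit
# provision state (enroll, manageable, available, active, ...) that can be
# inspected and driven through the API, giving an auditable trail from
# initial power-on to decommissioning. Cloud name is illustrative.
conn = openstack.connect(cloud="csd3")

for node in conn.baremetal.nodes(details=True):
    print(node.name, node.provision_state, node.power_state, node.resource_class)

# Retiring or repurposing a node is likewise an API action rather than a
# manual process, for example undeploying an instance so the node returns
# to the "available" pool after cleaning:
# node = conn.baremetal.find_node("compute-001")
# conn.baremetal.set_node_provision_state(node, "deleted", wait=True)
```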
Self-Service Platforms
What we have described so far deals predominantly with infrastructure, and demonstrates that moving to a modern, DevOps-driven methodology built on standard APIs not only makes it possible to software-define a supercomputer, but also accommodates more agile resources equally well. Moving up the HPC stack from IaaS, the benefit to users comes from the ability to abstract away the hardware details, exploit the underlying APIs and standardise the platforms that researchers will use.
We worked extensively with Cambridge and their customers to explore the platforms that are of interest to them and package these via a cluster-as-a-service portal. This abstracts complexity while still making use of specific hardware technologies where required, such as GPUs or specialised networking. The result is a series of “Science Platforms” that can be deployed on demand, covering the gamut of execution environments required: from the simplest (a Linux workstation), through batch schedulers for rapid testing of parallel workloads, to Kubernetes environments supporting JupyterHub front-ends.
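By way of illustration, the sketch below shows one way a portal might deploy the Kubernetes flavour of such a platform, here assuming the OpenStack Magnum service via openstacksdk; the actual portal may be backed by different tooling, and every name in the example is hypothetical.

```python
import openstack

# A sketch of "Kubernetes as a platform" behind a portal, assuming the
# OpenStack Magnum service; the portal may be backed by other tooling,
# and every name here is hypothetical.
conn = openstack.connect(cloud="csd3")

# A pre-registered template encodes the hardware specifics (GPU flavors,
# networking, Kubernetes version) so the researcher never has to.
template = next(
    t for t in conn.container_infrastructure_management.cluster_templates()
    if t.name == "k8s-gpu-template"
)

cluster = conn.container_infrastructure_management.create_cluster(
    name="jupyterhub-demo",
    cluster_template_id=template.id,
    master_count=1,
    node_count=4,
    keypair="ops-key",
)
print("requested cluster", cluster.id)
```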
Hybrid and Multi-Cloud
Once you are used to building with cloud, options exist for pooling resources between multiple clouds. In many cases researchers may want to move from the local infrastructure to external resources such as public or community clouds. This may be for many reasons: data movement may be costly; the local resource may be full, making bursting more efficient; or the researcher may want to test a novel piece of hardware or develop a new community service.
Such hybrid or multi-cloud environments can be very complex and require a degree of interoperability and federation of access. The portal helps here: as we have seen, it can abstract away the details of the underlying infrastructure, and by persisting with standard APIs, interoperability across clouds becomes easier to achieve. We are continuing to work with our partners to understand the opportunities here and hope to report on these soon.
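A small sketch of why standard APIs ease this: with both sites defined as entries in clouds.yaml, exactly the same openstacksdk code runs against each, so deciding where to burst becomes a policy question rather than a porting exercise. The site and flavor names below are invented for illustration.

```python
import openstack

# Same code, multiple clouds: each site is just an entry in clouds.yaml,
# and identical SDK calls work against both. Site and flavor names are
# illustrative.
SITES = ["cambridge-onprem", "community-cloud"]
WANTED_FLAVOR = "gpu.a100"

for site in SITES:
    conn = openstack.connect(cloud=site)
    flavor = conn.compute.find_flavor(WANTED_FLAVOR, ignore_missing=True)
    if flavor is not None:
        print(f"{site}: {WANTED_FLAVOR} available - a candidate for bursting")
    else:
        print(f"{site}: {WANTED_FLAVOR} not offered")
```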