HPC: The importance of being open
A cursory glance at the agenda of any major computing industry conference immediately confirms the central role played by large-scale computing in supporting research and innovation. Almost every field of research and almost every industry now relies on various combinations of simulations, data analytics, AI and machine learning. Over the coming decades, delivering strategic priorities such as Net Zero will only be possible through the use of cutting-edge high performance computing systems at exascale and beyond.
The DiRAC facility (www.dirac.ac.uk) provides HPC services for the theoretical communities in the Science & Technology Facilities Council (STFC) remit, covering astrophysics, particle physics, nuclear physics, cosmology and planetary science. DiRAC has hardware deployments at the Universities of Cambridge, Durham, Edinburgh and Leicester and a Project Office at University College London.
Not all computer hardware is equal, and the better one understands the workflows which need to be supported, the more a system can be optimised to support those particular needs. For the DiRAC research communities, HPC systems are scientific instruments: as important for the delivery of their research as telescopes are for astronomers, and demanding a systematic approach to their definition and design. Starting from a peer-reviewed science case, we identify a set of representative workflows which can be used to define the high-level specifications of our future services.
An extended co-design activity with industry partners follows, ensuring that the deployed systems are optimally tailored to delivering our science programme. Over the past decade, this workflow-based co-design approach has delivered major UK HPC successes, most recently the deployments of the DiRAC-3 Phase 1 services in 2021.
But simply buying the right hardware is not sufficient for a productive HPC service. Skilled support teams are needed to manage the systems, and the demand for such people greatly exceeds supply, not just in the UK but also internationally.
We also need to have the right software to exploit the full power of the deployed hardware. Even in a technically savvy community, most users do not have the highly specialised skills needed to optimise codes for the latest HPC architectures. This drives the need for teams of research software engineers to support the development of cutting-edge software.
Finally, users need to be able to access the services easily. This last point is relatively straightforward to address for a research community such as DiRAC in which the users are happy to access our services via command line interfaces, since any user with access to a terminal can access DiRAC resources. However, insisting on a command-line-based access model is an immediate barrier for some research communities and industries.
The above high-level description of what is involved in the deployment of HPC services is sufficiently daunting to ensure that many potential users, both from industry and other areas of academia, are put off from even considering the adoption of large-scale computing, even where they can see a clear benefit. As a result, opportunities for innovation and increased productivity and competitiveness are being missed.
Commercial cloud providers now offer ready access to large-scale compute and have started the process of “democratising” large-scale computing. We now need to be more proactive in opening up access to national HPC resources if we are to realise the full potential of future UK large-scale services.
First, we can remove one barrier to access by presenting national services, such as DiRAC, as private clouds. In so doing, we can help communities who just want “a bigger laptop” to access computing resources at scale and drive their research forward. DiRAC has been working with the University of Cambridge, the Square Kilometre Array, the STFC IRIS consortium and StackHPC Ltd. to develop the tools necessary to deliver this vision. The benefits of cloud computing are well known and recent work has demonstrated that it is also a route to providing flexible, shared services which are secure, easier to access and have sovereign data hosting where this is a requirement.
But it’s not enough to just provide a cloud front-end – we need to ensure that the stakeholder communities have the skills needed to make effective and efficient use of large-scale computing. Investment in people is the “missing link” in many discussions of computing infrastructure, and this needs to be addressed as soon as possible.
DiRAC runs an extensive training programme for users ranging from newcomers with little or no previous experience through to advanced users. Working with partners including the STFC SciML group, the Software Sustainability Institute and many of our HPC industry partners, we train more than 100 early career researchers per year, many of whom later take their HPC skills into the wider UK economy. Clearly, there needs to be more investment in programmes such as this to train future generations of HPC users.
Finally, it is important to note that very few companies or research communities can support the full cost of an HPC service themselves. However, where common requirements exist, the sharing of some or all of these resources can be a cost-effective way to deliver the required outcomes. We need to adopt a co-design process for future UK systems which covers all stakeholders, including potential industry users. By doing this, we can ensure that the UK continues to be a major player on the international stage.
The DCMS-led “Future of Compute” review is currently exploring many of these issues and we look forward to the publication of its findings, which will form a solid basis for future, long-term planning for large-scale computing in the UK.
Why should we go through the process of opening up our HPC services to other communities? For me, a recent example illustrates the surprising opportunities that can result from such engagement. Following two DiRAC-supported innovation placements with the Guy’s and St Thomas’ NHS Trust, a machine-learning-based study of data quality in the NHS diagnostic database was recently accepted for publication in the British Medical Journal. Led by two astronomers from the DiRAC community and with a total of four DiRAC community members in the author list (Hardy, Heyl et al. 2022; doi: 10.1136/bmjhci-2022-100633), this important work will have an influence on NHS policy, with the potential to lead to improved clinical outcomes for patients.
Opening access to large-scale computing has never been more important…or achievable.