Challenges and innovation in the age of Exascale
High-performance computing (HPC) simulations provide unparalleled insights into new scientific discoveries and are essential tools for industrial product design. HPC technology development over the last decades has been fuelled by the scientific and engineering communities’ unquenchable thirst for ever more computing power. Exascale has been on everyone’s mind ever since the first Petaflop system was deployed in 2008. The target was clear: 1 exaflops within a 20-megawatt (MW) power envelope by 2020. As of today, a first Exascale system has been installed in the USA, with more coming around the globe, while post-Exascale supercomputers are already being planned. The focus is clearly on meaningful application performance (Exascale = exaflops delivered to HPC applications), and this still entails multiple challenges [1] well beyond raw hardware performance (exaflops).
HPC applications have evolved to deliver more performance through unprecedented levels of parallelism, but also through new techniques. Notably, (Big) Data analysis was introduced to refine computer models through the mining of real-life physical observations. More recently, Artificial Intelligence (AI) frameworks have made possible the use of surrogate models, drastically accelerating a significant range of HPC applications and considerably improving the quality of the simulations.
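To make the surrogate-model idea concrete, the minimal Python sketch below trains a small regressor on a set of runs of an expensive solver and then evaluates it in place of that solver. The function expensive_simulation, the sample counts and the network size are purely illustrative assumptions, not taken from any particular application.

```python
# Minimal sketch of an AI surrogate model: a regressor is trained on a small
# set of expensive simulation runs and then evaluated in place of the solver.
# "expensive_simulation" is a hypothetical stand-in for a real HPC code.
import numpy as np
from sklearn.neural_network import MLPRegressor

def expensive_simulation(x):
    # Placeholder physics: in practice this would be a full solver run.
    return np.sin(3.0 * x[:, 0]) * np.exp(-x[:, 1] ** 2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(500, 2))   # sampled input parameters
y_train = expensive_simulation(X_train)           # costly ground-truth runs

surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
surrogate.fit(X_train, y_train)

# Fast evaluation of the surrogate over many new parameter sets.
X_new = rng.uniform(-1.0, 1.0, size=(10_000, 2))
y_pred = surrogate.predict(X_new)
print("surrogate predictions:", y_pred[:3])
```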
The diversity challenge
In parallel with the evolution of HPC application software, HPC hardware architecture has also changed significantly. Ten years ago, the HPC ecosystem looked quite uniform, with most supercomputers based on x86 CPUs. By contrast, today’s supercomputer architectures are quite diverse. HPC systems are now commonly composed of several partitions, each featuring different types of computing/processing nodes. On the CPU side, different instruction set architectures (ISAs) are used besides traditional x86, particularly ARM and, possibly in the future, RISC-V. GPUs have so far been the HPC accelerators of choice, now with multiple providers. In addition to GPUs, other accelerators are proposed, such as FPGAs or specific AI processing units (IPUs, TPUs, …). This unprecedented wave of innovation in processor technology [2] presents developers with the opportunity to boost HPC application performance, whilst at the same time confronting them with the challenge of programming such heterogeneous environments.
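One common, if simplified, way application code copes with such heterogeneity is to target a portable array API and pick the backend at run time. The Python sketch below falls back from the GPU-accelerated CuPy package to NumPy when no GPU stack is present; the kernel itself is only an illustrative placeholder.

```python
# Minimal portability sketch: pick a GPU array module (CuPy) when a GPU stack
# is available, otherwise fall back to NumPy on the CPU. The same kernel code
# then runs on either type of node.
try:
    import cupy as xp           # GPU-accelerated, NumPy-compatible arrays
    BACKEND = "GPU (CuPy)"
except ImportError:
    import numpy as xp          # CPU fallback
    BACKEND = "CPU (NumPy)"

def axpy(a, x, y):
    # y <- a*x + y, expressed once with the NumPy API, executed on CPU or GPU.
    return a * x + y

n = 1_000_000
x = xp.ones(n, dtype=xp.float64)
y = xp.arange(n, dtype=xp.float64)
z = axpy(2.5, x, y)

print(f"backend: {BACKEND}, checksum: {float(z.sum()):.3e}")
```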
Energy efficiency at Exascale
Even though each new generation of computing elements delivers more performance per Watt thanks to new architectures and advances in electronic manufacturing, the overall consumption of Exascale systems is nevertheless reaching costly levels. Exascale HPC datacenters are now commonly configured to provide 20 MW of electrical power, or more. With rising electricity costs, each MW now exceeds $1 million per year. Hence, over a period of 5 years, the electricity bill for a 20 MW system will sum up to $100 million. With these considerations in mind, the supercomputer and datacenter utilities, and most importantly the cooling system, must be carefully optimized. Additionally, the power consumption of GPUs and CPUs has been growing steadily, soon exceeding 500 W and even approaching 1,000 W per device. With such power density, heat dissipation requirements far exceed the capacity of classical air-cooled servers, and Direct Liquid Cooling (DLC) has proven to be the practical solution for such requirements at Exascale.
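The back-of-the-envelope arithmetic behind these figures is easy to reproduce. In the short sketch below, the assumed electricity price of roughly $115 per MWh is an illustrative value chosen so that one MW of continuous load costs about $1 million per year.

```python
# Back-of-the-envelope electricity cost for a 20 MW Exascale installation,
# reproducing the figures quoted above. The energy price is an assumption
# chosen so that one MW of continuous load costs roughly $1 million per year.
HOURS_PER_YEAR = 24 * 365                 # 8760 h
price_per_mwh_usd = 115                   # assumed average electricity price

cost_per_mw_year = HOURS_PER_YEAR * price_per_mwh_usd
print(f"cost per MW-year: ${cost_per_mw_year / 1e6:.2f} M")                # ~ $1 M

system_power_mw = 20
years = 5
total_cost = system_power_mw * years * cost_per_mw_year
print(f"5-year bill for {system_power_mw} MW: ${total_cost / 1e6:.0f} M")  # ~ $100 M
```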
Improving on previous generations, the newly introduced BullSequana XH3000 platform by Atos greatly expands the power supply and cooling capacities of each rack. As a result, a higher inlet temperature is admissible, and the datacenter free-cooling range is further extended. The chilled water that classical air-cooled servers strongly depend on is not necessary with DLC, allowing a Power Usage Effectiveness (PUE) as low as 1.05 for most datacenters, all year round. On average, DLC reduces the overall electricity bill of an HPC datacenter by 40%.
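As a rough illustration of how PUE drives the bill, the sketch below compares the quoted DLC figure of 1.05 with an assumed air-cooled PUE of 1.5. Both the air-cooled value and the electricity price are illustrative assumptions, not measured datacenter figures.

```python
# Sketch of the PUE arithmetic: facility power = IT power * PUE, so a lower
# PUE directly shrinks the datacenter overhead. The air-cooled PUE and the
# energy price below are illustrative assumptions, not measured values.
it_power_mw = 20.0            # IT load of the supercomputer
pue_dlc = 1.05                # DLC figure quoted above
pue_air = 1.5                 # assumed typical air-cooled datacenter
price_per_mwh_usd = 115       # assumed average electricity price
hours_per_year = 8760

def yearly_cost_musd(pue):
    facility_power_mw = it_power_mw * pue
    return facility_power_mw * hours_per_year * price_per_mwh_usd / 1e6

for label, pue in [("air-cooled", pue_air), ("DLC", pue_dlc)]:
    print(f"{label:10s} PUE={pue:.2f} -> ~${yearly_cost_musd(pue):.1f} M / year")
```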
The future of Exascale is hybrid: the role of quantum and AI
Atos sees the coupling of HPC and quantum computing as an important future direction. Within the framework of the EuroHPC HPCQS project, a first prototype will allow researchers to explore these possibilities. The Atos QLM (Quantum Learning Machine) software environment will ensure a smooth integration of quantum computing with the HPC platform.
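For readers who want to experiment with this programming model, myQLM, the freely available counterpart of the Atos QLM environment, offers a starting point. The sketch below follows the pattern of its public getting-started material to build and simulate a small entangling circuit; the API names are assumed from that material and should be checked against the installed version, and this is only an illustration rather than the HPCQS integration itself.

```python
# Minimal circuit sketch using myQLM, the open-source counterpart of the
# Atos QLM programming environment (API names assumed from its tutorials).
# It builds a 2-qubit Bell state and runs it on the default local simulator.
from qat.lang.AQASM import Program, H, CNOT
from qat.qpus import get_default_qpu

prog = Program()
qbits = prog.qalloc(2)
prog.apply(H, qbits[0])
prog.apply(CNOT, qbits[0], qbits[1])
circuit = prog.to_circ()

result = get_default_qpu().submit(circuit.to_job())
for sample in result:
    print(sample.state, sample.probability)
```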
AI, which is now used to accelerate HPC applications, also plays an important role in energy optimization, resource scheduling, data management, performance optimization and preventive system maintenance. These management software tools complement the hardware technology to improve the global energy efficiency of the system. Overall, AI-augmented solutions are the backbone of making advanced computing faster and more efficient, from hardware to software.
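As one purely hypothetical illustration of such a management tool, the sketch below flags compute nodes whose latest power readings drift far from the fleet’s recent behaviour using a simple statistical threshold. The node count, readings and threshold are invented for the example and do not describe any Atos product.

```python
# Fully hypothetical sketch of AI-assisted preventive maintenance: flag nodes
# whose power readings drift far from the fleet's recent behaviour using a
# simple z-score. Real tools are far richer; this only shows the idea.
import numpy as np

rng = np.random.default_rng(1)
readings = rng.normal(500.0, 15.0, size=(64, 100))   # 64 nodes x 100 samples (W)
readings[7, -20:] += 90.0                             # inject a drifting node

baseline_mean = readings[:, :-1].mean()               # fleet-wide recent behaviour
baseline_std = readings[:, :-1].std()
latest = readings[:, -1]
z_scores = (latest - baseline_mean) / baseline_std

suspect_nodes = np.where(np.abs(z_scores) > 4.0)[0]
print("nodes flagged for inspection:", suspect_nodes.tolist())
```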
We believe that the future of supercomputing is in hybrid platforms, combining cutting-edge processing units to deliver unparalleled computing power at any scale. These solutions will prove crucial in tackling the emerging engineering and scientific challenges of the years and decades to come.