Synthetic data’s pivotal role in the future of compute
The vast problem-solving potential of today’s high-performance computing (HPC) means that science and industry can develop technologies that were previously impossible. HPC systems can perform quadrillions of calculations per second, and in concert with developments like quantum computing, this performance will grow many times over. Data sits at the heart of these calculations. But there are many instances where it’s impossible, impractical or illegal to use real data, stopping innovation in its tracks.
Synthetic data is a solution. Rather than being the data equivalent of ‘lorem ipsum’ placeholder text, synthetic data accurately reflects real data, enabling significant and valuable conclusions to be drawn. Using AI and modelling, you can synthesise whole swathes of data in a repeatable and realistic fashion. By synthesising input data from scratch or by creating new data from old, you can build, test and demonstrate a proof of concept at maximum velocity, with total control and with your risks mitigated.
Synthetic data keeps the build moving
The future of compute looks exciting, and you want to make decisions fast. Maybe quantum optimisation is ideal for your goals, or your next killer app will depend on a graph neural net. But you need data to go deep in the build process – and different compute means different data from what you might already have. Synthetic data can bridge the gap.
Perhaps you’re leaning on the latest GPUs, using HPC to power ultra-high-resolution forecasts. Or perhaps you’re considering quantum computing, with its potential to solve optimisation problems that are intractable today. What if your input data isn’t high enough resolution, or you simply don’t have enough? How do you test your nascent solution? Or demonstrate it to the board?
Starting from scratch
If you only have old data – or none at all! – that shouldn’t stop you building new analytics solutions. If anything, it’s an opportunity. For many data sets, you can build a mathematical model that represents the fundamental behaviours you anticipate, like surges in power use at teatime or drivers diverting from a closed road. How rich a data set you create depends on the complexity of your use case: virtual environments like the metaverse are being used to generate synthetic data for training autonomous vehicles, for example. But even if your synthetic data is only a rough approximation of reality, it can still be hugely valuable for testing high-impact scenarios.
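As a flavour of what synthesising from scratch can look like, here is a minimal sketch in Python that generates a week of half-hourly power demand with a surge around teatime. The shape of the curve, the units and the noise level are all illustrative assumptions, not a calibrated demand model.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def synthesise_daily_demand(n_days: int = 7, samples_per_day: int = 48) -> np.ndarray:
    """Generate synthetic half-hourly power demand (arbitrary units) with an evening surge."""
    t = np.linspace(0, 24, samples_per_day, endpoint=False)      # hours of the day
    base = 1.0 + 0.3 * np.sin((t - 6) * np.pi / 12)              # broad daytime rise (assumed shape)
    teatime = 0.8 * np.exp(-((t - 18.0) ** 2) / (2 * 1.5**2))    # Gaussian surge around 18:00
    days = []
    for _ in range(n_days):
        noise = rng.normal(0, 0.05, samples_per_day)             # measurement-style noise
        days.append(base + teatime + noise)
    return np.concatenate(days)

demand = synthesise_daily_demand()
print(demand.shape)  # (336,) -- one week of half-hourly readings
```

Even a toy model like this is enough to exercise a pipeline end to end, and the behavioural terms can be swapped out or enriched as the use case demands.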
You can also use data synthesis to fill in the blanks. Sometimes you can’t measure everything, or can’t record things often, like in the core of a nuclear reactor. Informed by your best understanding of the whole, a model of the system underneath can help you extrapolate from what you know to what you don’t, using techniques like Bayesian inference powered by the latest HPC.
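To make the filling-in-the-blanks idea concrete, here is a toy Bayesian inference in Python: a handful of noisy readings taken far from an inaccessible core are used to infer a decay rate, which then lets us predict the unmeasurable value near the core. The exponential forward model, the noise level and the flat prior are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumed forward model: a reading decays exponentially with distance from the core.
def forward(k, x):
    return 100.0 * np.exp(-np.outer(np.atleast_1d(k), x))

# Sparse, noisy measurements taken well away from the core.
x_obs = np.array([0.8, 1.0, 1.2])
y_obs = forward(1.5, x_obs).ravel() + rng.normal(0, 2.0, x_obs.size)  # true k = 1.5

# Grid-based Bayesian inference over the decay rate k (flat prior, Gaussian noise).
k_grid = np.linspace(0.1, 3.0, 500)
log_like = -0.5 * np.sum((forward(k_grid, x_obs) - y_obs) ** 2, axis=1) / 2.0**2
posterior = np.exp(log_like - log_like.max())
dk = k_grid[1] - k_grid[0]
posterior /= posterior.sum() * dk  # normalise to a density

# Extrapolate to a point we could never measure directly.
k_mean = (k_grid * posterior).sum() * dk
print(f"inferred k = {k_mean:.2f}; predicted reading at x = 0.1: {100.0 * np.exp(-k_mean * 0.1):.1f}")
```

Real applications replace the one-parameter grid with high-dimensional models, which is exactly where HPC earns its keep.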
Mimicking data to mitigate risk
On the other hand, when you’re data-rich, synthetic data can be used to mitigate risk. What if you’re uneasy about using real data in a development sandbox, or when demonstrating to potential partners? To avoid using real data until it’s necessary, you could train a generative adversarial network to make new data from old.
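Here is a minimal sketch of that idea in PyTorch, training a tiny generative adversarial network on a stand-in two-column data set and then drawing fresh records from the generator. The data, network sizes and training length are all illustrative assumptions; a production synthesiser would need far more care.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "real" data: 1,000 two-column numeric records (illustrative assumption).
real_data = torch.randn(1000, 2) * torch.tensor([1.0, 0.3]) + torch.tensor([5.0, 2.0])

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> fake record
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))   # record -> real/fake logit

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    batch = real_data[torch.randint(0, 1000, (64,))]
    fake = G(torch.randn(64, 8))

    # Discriminator: learn to tell real records from generated ones.
    d_loss = loss_fn(D(batch), torch.ones(64, 1)) + loss_fn(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: learn to fool the discriminator.
    g_loss = loss_fn(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

synthetic = G(torch.randn(500, 8)).detach()   # 500 new records, shaped like the old
print(synthetic.mean(dim=0))                  # should approach the real means (about 5.0 and 2.0)
```

Because the generator only ever sees noise as input, the synthetic records are new draws shaped like the originals rather than copies of them – though, as the next section notes, that is not a privacy guarantee on its own.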
It’s vital to consider privacy and regulation, even with synthetic data. If you use real data to train a synthesiser, signatures of that data could still be present in what you generate, especially if overfitting is a risk – so GDPR may apply, for example. Beware too of issues like copyright, which are coming to the fore with today’s diffusion-based image generators as regulation catches up with innovation. Your least risky approach in regulated domains could be to synthesise realistic data from scratch, with the added benefit that it’s then totally in your control.
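One simple, illustrative check for the memorisation risk described above is a nearest-neighbour comparison: if synthetic records land (near-)exactly on training records, the synthesiser may be reproducing real data. The arrays and the distance threshold below are placeholder assumptions, not a compliance test.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Illustrative stand-ins for training records and a synthesiser's output.
train = rng.normal(size=(1000, 4))
synthetic = rng.normal(size=(200, 4))

# Distance from each synthetic record to its nearest training record.
# Synthetic points sitting (near-)exactly on training points suggest memorisation.
dists = np.linalg.norm(synthetic[:, None, :] - train[None, :, :], axis=2).min(axis=1)

threshold = 1e-3   # tolerance below which a record counts as a near-copy (assumption)
n_copies = int((dists < threshold).sum())
print(f"median nearest-neighbour distance: {np.median(dists):.3f}; near-copies: {n_copies}")
```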
Where next?
As compute moves forward, we’re seeing a shift. Tomorrow isn’t about just making faster versions of the hardware we already use; it’s about harnessing different technologies like quantum computers, digital annealers and FPGAs. But building new high-performance solutions on new hardware needs new data. Synthetic data can help you better incorporate AI and other types of modelling into your business today while also allowing you to experiment with that technology of tomorrow. The synthetic world is your oyster.
Written by Dr. Francis Woodhouse, Technical Director at the Smith Institute