In the world of High Performance Computing (HPC), supercomputers represent the peak of capability, with performance measured in petaFLOPS (10^15 floating-point operations per second). They play a key role in climate research, drug discovery, oil and gas exploration, cryptanalysis, and nuclear weapons development. But after decades of steady improvement, changes are coming as established technologies start to run into fundamental problems.
When you’re talking about supercomputers, a good place to start is the TOP500 list. Published twice a year, it ranks the world’s fastest machines based on their performance on the Linpack benchmark, which solves a dense system of linear equations using double precision (64 bit) arithmetic.
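The idea behind Linpack can be sketched in a few lines: time a dense double-precision solve and divide by the known operation count of LU factorization, roughly (2/3)n^3 + 2n^2 flops. This is an illustrative sketch using NumPy, not the official HPL benchmark code; the function name and parameters are my own.

```python
# Illustrative Linpack-style measurement (not the official HPL code):
# solve a dense double-precision system Ax = b and derive a FLOPS
# rate from the standard LU operation count, (2/3)n^3 + 2n^2.
import time
import numpy as np

def linpack_style_gflops(n=2000, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))   # dense 64-bit matrix
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)         # LU factorization + triangular solves
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    # The real benchmark also verifies the answer via a scaled residual.
    residual = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
    return flops / elapsed / 1e9, residual   # (GFLOPS, residual)
```

On a laptop this reports a few tens of GFLOPS; TOP500 machines run the same computation, distributed across millions of cores, on matrices large enough to fill their memory.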
Looking down the list, you soon run into numbers that boggle the mind. Tianhe-2 (Milky Way-2), a system deployed at the National Supercomputer Center in Guangzhou, China, is the number one system as of November 2015, a position it has held since 2013. Running Linpack, it clocks in at 33.86 × 10^15 floating-point operations per second (33.86 PFLOPS).
Supercomputers achieve their performance through the parallel use of off-the-shelf components. Tianhe-2 currently has 16,000 nodes, each with two Intel Xeon Ivy Bridge processors and three Xeon Phi coprocessors, for a combined total of 3.12 million computing cores. Not reliant solely on Intel, it also features a number of Chinese-developed components, including the TH Express-2 interconnect network, the front-end processors, the operating system (Kylin Linux), and software tools.
Most machines on the TOP500 list make extensive use of graphics processing units (GPUs) or coprocessors to handle computationally intensive tasks. Nvidia chips appear in 66 of the TOP500 systems, and another 27 use Xeon Phi.
In the short term, everything looks good. In Q1 2016, Intel is scheduled to introduce Knights Landing, its next-generation Xeon Phi with up to 72 cores and three teraflops of performance, and Nvidia’s Pascal GPU will cram 17 billion transistors into a single device using a 16 nm FinFET process from TSMC.
Over the last few years, though, progress toward the next big benchmark, exascale computing (10^18 floating-point operations per second), has gradually slowed.
The reasons are varied, and the solutions will demand a rethinking of supercomputer architectures. In 2008, Peter Kogge of the University of Notre Dame, together with a team of computer scientists and engineers, produced an influential study for DARPA that identified several fundamental problems.
Power consumption is one: Tianhe-2 draws 17.8 MW running Linpack, plus another 6 MW for its cooling system. The compute power alone is enough to supply over 14,000 average US homes. Scaling up to exascale with an expanded version of the same architecture would require around 540 MW, roughly the entire output of a high-efficiency gas-fired power plant, plus extra power for cooling. The power-management problems inherent in scaling up performance even have their own term – the so-called “power wall.”
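The power-wall arithmetic is easy to check with a back-of-the-envelope calculation: scale Tianhe-2's 17.8 MW draw linearly with the performance gap between 33.86 PFLOPS and one exaFLOPS. (The linear-scaling assumption is mine; the article's ~540 MW figure presumably folds in additional overheads.)

```python
# Back-of-the-envelope check of the power-wall numbers above.
# Assumes power scales linearly with Linpack performance.
TIANHE2_PFLOPS = 33.86
TIANHE2_MW = 17.8                 # compute only; cooling adds ~6 MW more
EXAFLOPS_IN_PFLOPS = 1000.0

scale = EXAFLOPS_IN_PFLOPS / TIANHE2_PFLOPS        # ~29.5x more performance needed
exascale_mw = TIANHE2_MW * scale                   # ~526 MW, near the ~540 MW estimate
efficiency_gflops_per_watt = (TIANHE2_PFLOPS * 1e6) / (TIANHE2_MW * 1e6)  # ~1.9 GFLOPS/W
```

At roughly 1.9 GFLOPS/W today, reaching exascale within a 20 MW budget requires an efficiency improvement of more than 25x, which is why voltage reduction and new device physics get so much attention.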
One option for reducing power consumption is to cut the operating voltage of transistors, which has hovered around one volt for the last ten years, down to a few millivolts. Researchers are looking at several options, including FETs that make use of quantum tunneling; switches based on MEMS technology; and nanophotonics, which substitutes light for electrons.
Three-dimensional memory, which allows storage to be placed closer to CPUs, will likely be part of the solution, too. In July 2015, Intel and Micron Technology announced 3D XPoint, a non-volatile memory with a packing density claimed to be up to ten times greater than DRAM.
The Von Neumann Bottleneck
Other issues stem from the von Neumann architecture itself, which has been the dominant model for stored-program computers since John von Neumann first described it in 1945. The basic problem is an imbalance between the handful of cycles needed to perform a mathematical operation and the much larger number of cycles needed to fetch that operation's data from memory. Following Moore's Law, shrinking transistor sizes have driven ever-increasing processor speeds, while memory access times have improved at a much slower pace. The ratio can now be as high as 1:100, so a supercomputer is forced to spend the vast majority of its time moving data to and from memory.
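A toy model shows how badly that 1:100 ratio hurts. Using hypothetical but representative cycle counts (1 cycle per floating-point operation, 100 cycles per uncached memory access), a streaming kernel that does little arithmetic per operand spends nearly all its time waiting on memory:

```python
# Toy model of the von Neumann bottleneck. Cycle counts are
# illustrative assumptions, not measurements of any real chip.
def time_fraction_in_memory(flops_per_element,
                            cycles_per_flop=1,
                            cycles_per_access=100,
                            accesses_per_element=3):
    """Fraction of cycles spent on memory traffic, per element processed."""
    compute = flops_per_element * cycles_per_flop
    memory = accesses_per_element * cycles_per_access
    return memory / (memory + compute)

# A daxpy-like kernel (y = a*x + y) does 2 flops per element but needs
# 3 memory accesses (read x, read y, write y): with these assumptions,
# over 99% of cycles go to data movement.
```

Only kernels that reuse each operand many times, such as the dense matrix factorization in Linpack, can hide this cost, which is exactly why Linpack flatters machines relative to data-intensive workloads.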
This is spurring the adoption of data-tiering hardware and software that try to maximize CPU effectiveness by storing frequently accessed data in high-speed tiers such as solid-state drives, while storing other data in slower disk-based storage. Cray’s DataWarp and DDN’s Infinite Memory Engine are examples of data-tiering products.
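The core policy behind such products can be sketched in a few lines. This is a minimal illustration of promote-on-access tiering, not the design of DataWarp or Infinite Memory Engine; the class and its parameters are hypothetical.

```python
# Minimal sketch of promote-on-access data tiering (hypothetical,
# not any vendor's API): keys read often enough migrate to a small
# fast tier; everything else stays in the large slow tier.
class TieredStore:
    def __init__(self, fast_capacity=2, promote_after=2):
        self.fast, self.slow = {}, {}      # fast tier: SSD; slow tier: disk
        self.hits = {}                     # access counts for slow-tier keys
        self.fast_capacity = fast_capacity
        self.promote_after = promote_after

    def put(self, key, value):
        self.slow[key] = value             # new data lands in the slow tier

    def get(self, key):
        if key in self.fast:
            return self.fast[key]          # fast-tier hit
        value = self.slow[key]
        self.hits[key] = self.hits.get(key, 0) + 1
        if self.hits[key] >= self.promote_after:
            if len(self.fast) >= self.fast_capacity:
                evicted, v = self.fast.popitem()   # demote to make room
                self.slow[evicted] = v
            self.fast[key] = self.slow.pop(key)    # promote hot key
        return value
```

Real tiering software layers far more sophistication on top (prefetching, job-aware staging, parallel file-system integration), but the promote/demote cycle is the common core.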
There are several other lines of research aimed at eliminating the von Neumann bottleneck, including: Processing-in-Memory (PIM) technology; quantum computing; and even neuromorphic computing inspired by human brain structure.
The Changing Nature of Applications
The original applications for supercomputers involved solving systems of linear equations for fields such as nuclear weapons development, and performance on the Linpack benchmark has provided a close correlation to performance on real-world applications.
Over time, however, applications have diversified, and data-intensive supercomputer applications have become increasingly important. These applications have a low ratio of computation to data access, so Linpack isn't as useful a metric.
Graph algorithms are a key part of many analytics workloads, leading to the development of the Graph500 benchmark and the corresponding Graph500 list, which provides an interesting contrast to the TOP500.
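Graph500's kernel is breadth-first search over a large synthetic graph, and its headline metric is traversed edges per second (TEPS) rather than FLOPS: almost all the work is irregular memory access, with essentially no arithmetic. A simplified single-node sketch (the function and its graph representation are my own, not the reference implementation):

```python
# Simplified sketch of the Graph500 idea: time a breadth-first
# search and report traversed edges per second (TEPS). The real
# benchmark runs BFS on a huge generated Kronecker graph across
# many nodes; this toy version is single-threaded.
import time
from collections import deque

def bfs_teps(adj, source):
    """adj: dict mapping each vertex to a list of its neighbors."""
    t0 = time.perf_counter()
    visited = {source}
    frontier = deque([source])
    edges_traversed = 0
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:               # every edge inspection counts
            edges_traversed += 1
            if w not in visited:
                visited.add(w)
                frontier.append(w)
    elapsed = time.perf_counter() - t0
    return edges_traversed / max(elapsed, 1e-9)
```

Note what the inner loop does: a pointer chase through memory per edge, and no floating-point math at all, which is why machines tuned for Linpack can fare poorly here.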
There are other ways to rank supercomputers, too. In an age of concern about global warming, the Green500 list ranks the world's most powerful machines by floating-point operations per watt; not surprisingly, it looks very different from the TOP500. Piz Daint, the #7 machine on the TOP500, barely breaks the Green500 top 20 at #20, while the other leading TOP500 machines sit much further down the list: Tianhe-2 (70), Titan (53), Sequoia (55), K computer (214), and Mira (56). Conversely, the #1 Green500 machine achieves over 7 GFLOPS/W, but its Linpack performance is only good enough for #136 on the TOP500 list.
The Road Ahead
Although it might be possible to produce an exascale machine today through brute force, a practical machine is unlikely to arrive before 2022, and perhaps later. The US Department of Energy's Exascale Initiative envisions a machine that consumes just 20 MW, using one million 1-teraflop processors and 64 petabytes of memory.
Among the tasks identified are the development of new programming paradigms to deal with massive data sets, and the detection of, and recovery from, soft errors. The resulting machine is likely to look quite different from the machines of the last twenty years.