To exascale and beyond: challenges and opportunities for the most powerful computers ever created

Jeff Hammond
PhD in Chemistry (2009)
Principal Architect at NVIDIA
Abstract

Some time between now and 2023, the world's first exascale supercomputer will be deployed, with a mission to deliver scientific breakthroughs in everything ranging from biochemistry to cosmology, as well as applied use cases in mechanical and nuclear engineering. While this is just another mark on the exponential growth in computing power over the past 50 years, exascale is different in that we are simultaneously reaching the limits of nanoscale engineering of semiconductors and the cost ceiling for power consumption of such systems. I will talk about the scientific breakthroughs enabled by really big computers and what programming methods are used to build the software behind these breakthroughs.
Outline

Supercomputers: what are they good for?

Battle of the exponentials

Programming models for next-generation HPC

Acknowledgements:

Peter Boyle (QCD)

David Hardy, John Stone, Julio Maia, Peng Wang (NAMD)

Sotiris Xantheas, Edo Apra (NWChem)

Content:

Salman Habib and the HACC team (HACC)

Jed Brown
Supercomputers: what are they good for?
From the smallest things...

Lattice QCD is the class of models used to simulate subatomic particles using Markov Chain Monte Carlo methods and Feynman path integrals.

New Calculation Refines Comparison of Matter with Antimatter

Theorists publish improved prediction for the tiny difference in kaon decays observed by experiments

September 17, 2020
...to the biggest things

Argonne scientists (Salman Habib and coworkers) are simulating the cosmos using all the biggest HPC systems by computing the interactions between trillions of particles.

https://cacm.acm.org/magazines/2017/1/211098-hacc/fulltext
NAMD Simulating SARS-CoV-2 on Frontera and Summit

Collaboration with Amaro Lab at UCSD, images rendered by VMD
Winner of Gordon Bell Special Prize at SC20, project involved overall 1.13 Zettaflops of NAMD simulation

(A) Virion, (B) Spike, (C) Glycan shield conformations

Scaling performance:
- ~305M atom virion
- ~8.5M atom spike

https://doi.org/10.1101/2020.11.19.390187

https://gtc21.event.nvidia.com/media/Molecular%20Dynamics%20Simulations%20on%20GPU-Dense%20Architectures%20with%20NAMD%20%5B31529%5D/1_znsuv1wc
Scaling of the SPEC CCSD(T) Library on the Full Partition of the KNL Nodes of the Cori Supercomputer at NERSC

Scientific Achievement
Calculation of the binding energy of the coronene dimer, an archetypal system for graphene

Significance and Impact
Ability to obtain accurate interaction energies of large systems; largest CCSD(T) calculation to date (9.14 PFLOPs) used 538,650 Knights Landing (KNL) cores (9,450 nodes; 57 of 68 cores per node)

Research Details
- 216 electrons / 1,776 basis functions (cc-pVTZ basis set)
- OpenMP for multi-threading in CCSD and CCSD(T)
- Checkpoint restart capability in CCSD
- Improved inter-node parallelization of the (T) correction on the KNL nodes using the Global Arrays (xGA) tool

<table>
<thead>
<tr>
<th># of KNL nodes/cores</th>
<th>(T) kernel wall time (sec)</th>
</tr>
</thead>
<tbody>
<tr>
<td>7,624 / 434,568</td>
<td>10,553</td>
</tr>
<tr>
<td>8,644 / 492,708</td>
<td>9,357</td>
</tr>
<tr>
<td>9,450 / 538,650 (97.5% of full partition)</td>
<td>8,344</td>
</tr>
</tbody>
</table>
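
The table can be read as a strong-scaling measurement. The following is a minimal sketch of that arithmetic, assuming the smallest run is taken as the baseline; the baseline choice and the helper name `relative_efficiency` are illustrative and not from the original slide.

```python
# (nodes, wall time in seconds) for the (T) kernel, copied from the table above.
runs = [(7624, 10553), (8644, 9357), (9450, 8344)]
CORES_PER_NODE = 57  # cores used per KNL node, as stated on the slide (57 of 68)


def relative_efficiency(base, run):
    """Speedup relative to the baseline divided by the growth in node count."""
    base_nodes, base_time = base
    nodes, time = run
    speedup = base_time / time
    return speedup / (nodes / base_nodes)


for nodes, seconds in runs:
    eff = relative_efficiency(runs[0], (nodes, seconds))
    print(f"{nodes:>5} nodes, {nodes * CORES_PER_NODE:>7} cores: "
          f"{seconds:>6} s, efficiency vs. baseline {eff:.3f}")
```

An efficiency close to 1 means the wall time shrinks roughly in proportion to the number of nodes added.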

Team: Aprà, Hammond (Intel), Daily, Palmer, Xantheas
Support from Intel’s Parallel Computing Centers (IPCC)

Work was performed at Pacific Northwest National Laboratory under a NERSC Initiative for Scientific Exploration (NISE) and BES allocation awards
What are the features of a supercomputer?

1. Lots and lots of components *working together*
   a. Production computing jobs often use 20-80% of the system for a single simulation
   b. Thousands of nodes with many cores (50+) and/or multiple (4+) GPUs per node
   c. Many terabytes or even petabytes of memory and storage
   d. Virtual all-to-all connectivity of processors, memory and storage.

2. Specialized components
   a. High-bandwidth memory: HBM is faster than DRAM, better than GDDR
   b. High-bandwidth interconnects (between nodes): 1 us latency and 10+ GB/s per link
   c. High-bandwidth interconnects (within node): >2x of PCIe BW with much lower latency and better support for HPC software
   d. More reliable components - individual component failures multiplied by system scale
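
One common way to reason about the interconnect figures above is the latency-bandwidth ("alpha-beta") cost model, in which sending a message of n bytes takes roughly latency + n / bandwidth. The sketch below plugs in the 1 us and 10 GB/s figures from the list; the model itself and the function name `transfer_time` are assumptions added for illustration, not something the slide states.

```python
LATENCY_S = 1e-6       # 1 microsecond per message (figure from the list above)
BANDWIDTH_BPS = 10e9   # 10 GB/s per link (lower bound from the list above)


def transfer_time(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Estimated seconds to send one message of nbytes (alpha-beta model)."""
    return latency + nbytes / bandwidth


# Message size at which the latency term equals the bandwidth term.
crossover_bytes = LATENCY_S * BANDWIDTH_BPS

for nbytes in (8, 1_000, 10_000, 1_000_000):
    t = transfer_time(nbytes)
    latency_share = LATENCY_S / t
    print(f"{nbytes:>9} B: {t * 1e6:8.2f} us, latency share {latency_share:.0%}")
```

Under these assumptions, messages much smaller than about 10 KB are dominated by latency, which is why low latency matters as much as high bandwidth for fine-grained communication.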
DATA-COMPUTE DEMAND GROWING FASTER THAN SYSTEM BANDWIDTH

GPU Starved by CPU Memory and PCIE Bandwidth

<table>
<thead>
<tr>
<th>Device</th>
<th>Bandwidth (GB/sec)</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU</td>
<td>8,000</td>
</tr>
<tr>
<td>CPU</td>
<td>200</td>
</tr>
<tr>
<td>PCIE Gen 4</td>
<td>16</td>
</tr>
<tr>
<td>Mem-to-GPU</td>
<td>64</td>
</tr>
</tbody>
</table>

https://youtu.be/eAn_oIzwUXA
A NEW COMPUTING ARCHITECTURE FOR AI AND DATA SCIENCE

30X Increase System Memory to GPU

<table>
<thead>
<tr>
<th>Device</th>
<th>Bandwidth (GB/sec)</th>
<th>Increase</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPU</td>
<td>8,000</td>
<td></td>
</tr>
<tr>
<td>CPU</td>
<td>500</td>
<td></td>
</tr>
<tr>
<td>NVLINK</td>
<td>500</td>
<td></td>
</tr>
<tr>
<td>Mem-to-GPU</td>
<td>2,000</td>
<td>30X</td>
</tr>
</tbody>
</table>
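
To make the bandwidth figures concrete, here is a small sketch of the arithmetic behind the system-memory-to-GPU comparison. The 1 TB working set is a hypothetical size chosen for illustration; the bandwidth values are copied from the two tables above.

```python
# Memory-to-GPU bandwidth in GB/s, copied from the two tables above.
MEM_TO_GPU_PCIE = 64
MEM_TO_GPU_NVLINK = 2000

DATASET_GB = 1000  # hypothetical 1 TB working set held in system memory


def seconds_to_move(gigabytes, bandwidth_gb_per_s):
    """Time to stream a dataset once at the given sustained bandwidth."""
    return gigabytes / bandwidth_gb_per_s


speedup = MEM_TO_GPU_NVLINK / MEM_TO_GPU_PCIE
print(f"bandwidth ratio: {speedup:.2f}x")
print(f"PCIe path:   {seconds_to_move(DATASET_GB, MEM_TO_GPU_PCIE):.3f} s")
print(f"NVLINK path: {seconds_to_move(DATASET_GB, MEM_TO_GPU_NVLINK):.3f} s")
```

The ratio 2,000 / 64 is 31.25, which the slide rounds to 30X.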

https://youtu.be/eAn_oIZwUXA
How do we program supercomputers?

1. Find lots of computation that can be done concurrently (at the same time)
2. Figure out the input and output dependencies of those computations
3. Map the compute and data-dependency graphs to well-known patterns
Domain decomposition pattern

Lots of physics and engineering problems can be parallelized using domain decomposition, where a grid of points/cells is divided up like a checkerboard.

The groups of points need to exchange data at their boundaries (halos).
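
A minimal single-process sketch of the idea, assuming a 1D grid and a simple neighbour-averaging stencil (both chosen only for illustration). In a real HPC code each subdomain would live on a different node and the halo values would travel over the network; here the "exchange" is just a copy.

```python
def stencil_step(u):
    """One neighbour-averaging step on a 1D grid; the two end values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def decomposed_step(u, nparts):
    """Same update, computed on nparts subdomains that each carry halo cells."""
    n = len(u)
    bounds = [(p * n // nparts, (p + 1) * n // nparts) for p in range(nparts)]
    # "Halo exchange": each subdomain receives one neighbour value per side.
    pieces = []
    for lo, hi in bounds:
        left = [u[lo - 1]] if lo > 0 else []
        right = [u[hi]] if hi < n else []
        pieces.append((left + u[lo:hi] + right, len(left)))
    # Each subdomain now updates independently; this loop could run in parallel.
    result = []
    for (lo, hi), (local, offset) in zip(bounds, pieces):
        updated = stencil_step(local)
        result.extend(updated[offset:offset + (hi - lo)])
    return result


grid = [float((i * i) % 7) for i in range(16)]
print(decomposed_step(grid, 4) == stencil_step(grid))
```

The halo cells are read but never written by the subdomain that receives them, which is what lets the subdomains proceed without further communication within a step.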
Task parallelism

Task parallelism involves finding a number of tasks that work on their own data and assigning them to different processing units.

Tasks often produce data that is consumed by other tasks, or combined to produce a final result, which creates dependencies, and thus partial orderings, between tasks.
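
The structure can be sketched with Python's standard `concurrent.futures` module: independent tasks run on their own data, and a final reduction depends on all of them. The sum-of-squares workload and the function names are illustrative assumptions. In standard CPython, threads show the dependency structure rather than deliver a speedup for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Independent task: works only on its own slice of the data."""
    return sum(x * x for x in chunk)


def sum_of_squares(data, ntasks=4):
    """Split the data into ntasks chunks, process them as tasks, then combine."""
    chunks = [data[i::ntasks] for i in range(ntasks)]
    with ThreadPoolExecutor(max_workers=ntasks) as pool:
        # The partial sums have no dependencies on each other.
        futures = [pool.submit(partial_sum, c) for c in chunks]
        # The final reduction depends on every partial sum, so it waits for them.
        return sum(f.result() for f in futures)


print(sum_of_squares(list(range(1000))))
```

The only ordering constraint is that the reduction runs after the partial sums; the partial sums themselves may complete in any order.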
Mixed parallelism

The most successful parallel codes combine all available forms of parallelism, and use the best known strategies for each.

The bookkeeping associated with many forms of parallelism is challenging for programmers, hence the use of specialized systems like Charm++.
It’s parallelism all the way down...

Loop 5  for $j_c = 0 : n-1$ steps of $n_c$
\[
\mathcal{J}_c = j_c : j_c + n_c - 1
\]
Loop 4  for $p_c = 0 : k-1$ steps of $k_c$
\[
\mathcal{P}_c = p_c : p_c + k_c - 1
\]
\[
B(\mathcal{P}_c, \mathcal{J}_c) \rightarrow \tilde{B}_p
\]
Loop 3  for $i_c = 0 : m-1$ steps of $m_c$
\[
\mathcal{I}_c = i_c : i_c + m_c - 1
\]
\[
A(\mathcal{I}_c, \mathcal{P}_c) \rightarrow \tilde{A}_i
\]
// macro-kernel
Loop 2  for $j_r = 0 : n_c-1$ steps of $n_r$
\[
\mathcal{J}_r = j_r : j_r + n_r - 1
\]
Loop 1  for $i_r = 0 : m_c-1$ steps of $m_r$
\[
\mathcal{I}_r = i_r : i_r + m_r - 1
\]
// micro-kernel
Loop 0  for $p_r = 0 : k_c-1$ steps of 1
\[
C_c(\mathcal{I}_r, \mathcal{J}_r) \mathrel{+}= \alpha \tilde{A}_i(\mathcal{I}_r, p_r) \tilde{B}_p(p_r, \mathcal{J}_r)
\]
endfor
endfor
endfor
endfor
endfor
endfor
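
The loop nest above can be written out as a small pure-Python sketch. The block sizes below are tiny hypothetical values chosen so the example runs quickly; production libraries choose them to match cache and register sizes and implement the micro-kernel in hand-tuned code. The packed buffers `Bp` and `Ai` stand in for the packed panels in the loop nest.

```python
def gemm_blocked(A, B, C, alpha=1.0, nc=8, kc=4, mc=4, nr=2, mr=2):
    """C += alpha * A @ B using the blocked loop structure shown above."""
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                            # Loop 5
        jc_end = min(jc + nc, n)
        for pc in range(0, k, kc):                        # Loop 4
            pc_end = min(pc + kc, k)
            # Pack the panel B(Pc, Jc) into its own buffer.
            Bp = [row[jc:jc_end] for row in B[pc:pc_end]]
            for ic in range(0, m, mc):                    # Loop 3
                ic_end = min(ic + mc, m)
                # Pack the block A(Ic, Pc) into its own buffer.
                Ai = [row[pc:pc_end] for row in A[ic:ic_end]]
                # Macro-kernel
                for jr in range(0, jc_end - jc, nr):      # Loop 2
                    for ir in range(0, ic_end - ic, mr):  # Loop 1
                        # Micro-kernel: a sequence of rank-1 updates
                        for pr in range(pc_end - pc):     # Loop 0
                            for i in range(ir, min(ir + mr, ic_end - ic)):
                                a = alpha * Ai[i][pr]
                                for j in range(jr, min(jr + nr, jc_end - jc)):
                                    C[ic + i][jc + j] += a * Bp[pr][j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(gemm_blocked(A, B, [[0.0, 0.0], [0.0, 0.0]]))
```

Each loop level is a candidate for a different kind of parallelism (nodes, cores, vector lanes), which is the sense in which it is parallelism all the way down.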
Battle of the exponentials
Top500 #1 performance (Rmax) is exponential...
...but is fighting against another exponential: power
Computing will be limited by power

Widely cited forecasts suggest that the total electricity demand of information and communications technology (ICT) will accelerate in the 2020s, and that data centres will take a larger slice.

- Networks (wireless and wired)
- Production of ICT
- Consumer devices (televisions, computers, mobile phones)
- Data centres

Amazon Approaches 1 Gigawatt of Cloud Capacity in Virginia

By Rich Miller, January 18, 2017
China suffers worst power blackouts in a decade, on post-coronavirus export boom, coal supply shortage

- Provinces across China are struggling with blackouts, as authorities use restrictions to curb energy use and manage supply
- Analysts blame the resurgence of manufacturing, a coal shortage and China’s central economic planning for the problem

If power supply is fixed, computers compete with homes and factories.

Increasing power supply has environmental consequences.

*Power-intensive computing means the value of computing results will be judged more critically than in the past.*

Moore’s Law: The number of transistors on microchips doubles every two years

Moore’s law describes the empirical regularity that the number of transistors on integrated circuits doubles approximately every two years. This advancement is important for other aspects of technological progress in computing – such as processing speed or the price of computers.

Data source: Wikipedia (wikipedia.org/wiki/Transistor_count)

OurWorldInData.org – Research and data to make progress against the world’s largest problems.

Licensed under CC-BY by the authors Hannah Ritchie and Max Roser.
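
As a rough worked example of what "doubles every two years" implies, the sketch below computes the cumulative growth factor over a span of years. The function name is illustrative, and an exactly constant doubling period is an idealization of the empirical trend.

```python
DOUBLING_PERIOD_YEARS = 2.0  # from the statement of Moore's law above


def growth_factor(years, doubling_period=DOUBLING_PERIOD_YEARS):
    """Multiplicative growth after the given number of years."""
    return 2.0 ** (years / doubling_period)


for years in (2, 10, 20, 50):
    print(f"{years:>2} years: x{growth_factor(years):,.0f}")
```

Over the 50 years mentioned in the abstract, this idealized rate gives a factor of 2^25, or about 33.5 million.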
Moore’s Law (exponential scaling) isn’t a sure thing...

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>MANUFACTURING</td>
<td>45nm</td>
<td>32nm</td>
<td>22nm</td>
<td>14nm</td>
<td>10nm</td>
<td>7nm</td>
<td>2017</td>
</tr>
<tr>
<td>DEVELOPMENT</td>
<td></td>
<td></td>
<td></td>
<td>2014</td>
<td>2015</td>
<td>2017</td>
<td>2020</td>
</tr>
<tr>
<td>RESEARCH</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Not to scale

- Carbon Nanotube: ~1nm diameter
- QW III-V Device
- Graphene: 1 atom thick
- Nanowire: 10 atoms across
Moore’s Law has already faltered...

Intel’s 7nm is Broken, Company Announces Delay Until 2022, 2023

By Paul Alcorn  July 23, 2020

From bad to worse

Manufacturing costs are also growing exponentially (Rock’s Law or Moore’s Second Law)

<table>
<thead>
<tr>
<th>Company</th>
<th>Envisioned outlay</th>
<th>When/what</th>
</tr>
</thead>
<tbody>
<tr>
<td>TSMC</td>
<td>$100 billion</td>
<td>Over three years to expand capacity</td>
</tr>
<tr>
<td>Intel</td>
<td>$20 billion</td>
<td>To build two new fabs in Arizona</td>
</tr>
<tr>
<td>Samsung</td>
<td>$116 billion</td>
<td>Over a decade to expand foundry business</td>
</tr>
</tbody>
</table>

Summary

- No more frequency scaling: ~all performance comes from parallelism
- Power efficiency growth is not keeping up with compute demand (Dennard)
- The manufacturing exponential means that more transistors cost more money

Compute demand outpaces power efficiency growth and transistor manufacturing, so computing usage will be prioritized by who can afford the power bill and the transistors.

When we talk about “democratizing computing”, we don’t mean the Citizens United form of democracy...
More Moore is not necessarily better

“Bandwidth is money, latency is physics”

HPC applications are hitting the latency wall, particularly in critical scientific domains like weather and climate modeling.

Programming models for next-generation HPC
HPC vs the Internet

Orders-of-magnitude differences in latency sensitivity differentiate the software:

- Internet computation consumed by humans on the scale of **milliseconds**.
- Internet computation consumed by computers on the scale of seconds?
- HPC computation consumed by computers on the scale of **microseconds**.
- HPC computation consumed by humans on the scale of minutes.

To achieve microsecond latencies, HPC hard-wires the network routing, which changes the reliability model significantly. Internet workloads are highly resilient, whereas most HPC codes crash as soon as the hardware exposes a single fault (server hardware hides correctable hardware errors).
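
A back-of-the-envelope sketch shows why a single fault matters so much at scale. It assumes independent node failures with a constant failure rate, and the node MTBF and node count used below are hypothetical round numbers, not measurements from any system in this talk.

```python
import math


def system_mtbf_hours(node_mtbf_hours, nodes):
    """Mean time between failures of the whole job, if any node failure kills it."""
    return node_mtbf_hours / nodes


def survival_probability(job_hours, node_mtbf_hours, nodes):
    """Probability that no node fails during the job (exponential failure model)."""
    return math.exp(-job_hours * nodes / node_mtbf_hours)


NODE_MTBF_HOURS = 5 * 365 * 24   # hypothetical: one failure per node every 5 years
NODES = 10_000                   # hypothetical job size

print(f"system MTBF: {system_mtbf_hours(NODE_MTBF_HOURS, NODES):.2f} hours")
for hours in (1, 4, 12):
    p = survival_probability(hours, NODE_MTBF_HOURS, NODES)
    print(f"{hours:>2} h job survives with probability {p:.2f}")
```

Under these assumptions a component that fails once in five years still produces a failure somewhere in the job every few hours, which is why HPC systems rely on more reliable components and checkpoint-restart.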
Distributed computing

Cloud

https://en.wikipedia.org/wiki/Internet_protocol_suite

Standard since the 1970s, new features added at the top of the stack.

HPC

MPI: Standard since the 1990s, still changing.

New features added at the bottom of the stack (expose more HW/perf).
Computing within the node

CPU programming has been relatively consistent since the 1970s:

Fortran 77, 90, 95, 2003, 2008, 2018, ... 2100

C++ 98, 03, 11, 14, 17, 20, 23, 26, 29, ...

New languages percolate in from outside of HPC, e.g. Python

Few successes born within the HPC community: Matlab and Julia
Computing within the node

GPU computing for HPC is relatively new:

2002 contorting physics onto graphics programming
2003 Brook (C with streams)
2006 CUDA introduced by NVIDIA
2011 OpenACC 1.0 (directives for GPUs)
2013 OpenMP 4.0 (more directives for GPUs)
2020 ISO language standard parallelism for GPUs

Normalizing GPU computing

GPU computing has become progressively easier since 2002, but easier GPU computing is not sufficient. We need parallel computing to be GPU computing.

- ISO Fortran 2008 standard parallelism runs on GPUs now
- ISO C++17 standard parallelism runs on GPUs now
- Python supports GPUs with standard tools like Numba

These developments require GPU hardware features like independent forward progress of threads and unified memory. NVIDIA Volta is the first GPU that can run the same software as CPUs (assuming it uses ISO standard features).

https://youtu.be/KhZvrF_w1ak
https://youtu.be/75LcDvlEjYw
NWChem TCE CCSD(T) Kernel
Computational Chemistry with Fortran Standard Parallelism

- NWChem provides a massively parallel implementation of the "gold standard" CCSD(T) method that scales to hundreds of thousands of CPU cores.
- The compute bottleneck is a set of 27 loop-driven tensor contractions, which are part of the >100k LOC TCE module.
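
To illustrate what a "loop-driven tensor contraction" looks like, here is a generic sketch in pure Python. It is not one of the 27 NWChem kernels: the index pattern, array shapes, and the name `contract` are assumptions chosen to keep the example small. The real kernels are written in Fortran and act on higher-dimensional tensors.

```python
def contract(t2, v2):
    """out[i][j][k] = sum over l of t2[l][i] * v2[l][j][k]."""
    nl, ni = len(t2), len(t2[0])
    nj, nk = len(v2[0]), len(v2[0][0])
    out = [[[0.0] * nk for _ in range(nj)] for _ in range(ni)]
    # The i, j, k iterations are independent of each other, so they can be
    # distributed across threads or GPU lanes; only l is a reduction.
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                acc = 0.0
                for l in range(nl):  # contracted (summed) index
                    acc += t2[l][i] * v2[l][j][k]
                out[i][j][k] = acc
    return out


t2 = [[1, 2], [3, 4]]      # indexed [l][i]
v2 = [[[1]], [[10]]]       # indexed [l][j][k]
print(contract(t2, v2))
```

Loops with independent iterations like the outer three here are the kind of structure that language-level parallel constructs can hand to a GPU.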

https://github.com/jeffhammond/nwchem-tce-triples-kernels
The End