
Here are some key facts and technical details about the supercomputers that power AI — including the type used to train models like ChatGPT, GPT-4, and other large-scale AI systems:

1. They Don’t Run on Traditional CPUs Alone — They’re Built Around GPUs

  • GPUs (Graphics Processing Units) are optimized for massively parallel computation, which is exactly what training neural networks requires (see the sketch after this list).

  • The most popular GPUs for AI workloads are:

    • NVIDIA A100 (reportedly used to train GPT-4; GPT-3 was trained on the earlier V100)

    • NVIDIA H100 (next-gen, faster and more efficient)

    • Google’s TPU v4 (custom-built for AI, used internally at Google)

  • A single NVIDIA A100 can cost over $10,000 and draws around 400W of power.
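
To make the parallelism point above concrete, here is a minimal sketch (assuming PyTorch and a CUDA-capable GPU, neither of which this post specifies) that runs the same matrix multiplication once on the CPU and once on the GPU. On typical hardware the GPU version finishes many times faster because the work is spread across thousands of cores.

```python
# Minimal sketch: the same matrix multiply on CPU, then on GPU (assumes PyTorch + CUDA).
import time

import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

start = time.perf_counter()
_ = a @ b                                # matrix multiply on the CPU
cpu_time = time.perf_counter() - start

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()    # copy the matrices to GPU memory
    torch.cuda.synchronize()             # make sure the copies have finished
    start = time.perf_counter()
    _ = a_gpu @ b_gpu                    # the same multiply, spread across thousands of GPU cores
    torch.cuda.synchronize()             # wait for the GPU kernel to complete
    gpu_time = time.perf_counter() - start
    print(f"CPU: {cpu_time:.3f} s   GPU: {gpu_time:.3f} s")
```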


2. They Are Housed in Massive Data Centers

  • These are not "one machine" but networks of thousands of GPUs in racks.

  • Example: Microsoft’s AI supercomputer (built for OpenAI) includes:

    • 10,000+ NVIDIA GPUs (newer Azure systems use A100s and H100s)

    • 285,000 CPU cores

    • 400 Gbps of InfiniBand network bandwidth per GPU server

  • These are connected with ultra-fast fiber-optic networks for low-latency communication.
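
To show how those thousands of GPUs are actually used together, here is a hedged sketch of data-parallel training (the post names no framework; this assumes PyTorch with the NCCL backend, which rides on NVLink and InfiniBand/RDMA when they are available). A launcher such as torchrun is assumed to set the rank environment variables.

```python
# Illustrative sketch of data-parallel training across many GPUs and nodes.
# Assumes `torchrun` (or a similar launcher) sets RANK, WORLD_SIZE, LOCAL_RANK.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")      # NCCL uses NVLink/InfiniBand when present
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])  # gradients are all-reduced across all GPUs
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):                          # toy training loop
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                          # the all-reduce happens here, over the network fabric
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```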


3. They Consume Huge Amounts of Power

  • A large AI training run (such as GPT-4’s) can consume millions of kilowatt-hours of electricity (a rough estimate follows this list).

  • Data centers are often co-located with renewable energy sources or hydroelectric plants.
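
To put "millions of kilowatt-hours" in perspective, here is the rough estimate referenced above. Every number in it is an assumption for illustration, not a reported figure.

```python
# Back-of-the-envelope energy estimate; all inputs are assumptions.
num_gpus = 25_000          # assumed GPU count for a large training run
gpu_power_kw = 0.4         # roughly 400 W per A100-class GPU
pue = 1.2                  # assumed power usage effectiveness (cooling, networking overhead)
days = 90                  # assumed run length

energy_kwh = num_gpus * gpu_power_kw * pue * 24 * days
print(f"~{energy_kwh:,.0f} kWh")   # about 25,920,000 kWh, i.e. millions of kilowatt-hours
```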


4. The Infrastructure Includes:

  • High-bandwidth memory (HBM) — faster than standard DRAM, allowing GPUs to access massive datasets quickly.

  • NVLink/NVSwitch — lets GPUs exchange data and share memory far more efficiently than PCIe (see the query sketch after this list).

  • InfiniBand Networking — supports ultra-low latency and high throughput between GPUs across clusters.

  • Liquid cooling systems — required due to heat generated by densely packed GPUs.
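
Much of this hardware is visible from software. The query sketch mentioned above (assuming PyTorch on a multi-GPU machine) lists each GPU's on-board HBM and checks whether pairs of GPUs can reach each other directly, which is the peer-to-peer path that NVLink/NVSwitch accelerates; the command-line tool nvidia-smi topo -m prints the same topology.

```python
# Inspect GPU memory and peer-to-peer connectivity (assumes PyTorch and CUDA GPUs).
import torch

count = torch.cuda.device_count()
for i in range(count):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.0f} GB on-board memory (HBM)")

# Peer-to-peer access lets one GPU read another GPU's memory directly;
# on NVLink/NVSwitch systems this bypasses the slower PCIe path.
for i in range(count):
    for j in range(count):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```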


5. Training at Scale = Petaflops to Exaflops

  • 1 petaflop/s = 1 quadrillion floating-point operations per second

  • 1 exaflop/s = 1 quintillion floating-point operations per second

  • GPT-3’s training is estimated to have required about 3,640 petaflop/s-days of compute, the equivalent of running at a sustained 1 petaflop/s for roughly ten years (see the quick calculation after this list).

  • Frontier (Oak Ridge National Lab) is one of the first exascale supercomputers (over 1 exaflop), and it is used for both science and AI research.
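
The quick calculation referenced above: a petaflop/s-day means one petaflop per second sustained for one full day, so 3,640 of them amount to roughly 3 × 10^23 floating-point operations in total.

```python
# Convert the GPT-3 compute estimate into total floating-point operations.
petaflop_per_s = 1e15              # 1 quadrillion operations per second
seconds_per_day = 86_400

pfs_days = 3_640                   # estimate cited in the text above
total_flops = pfs_days * petaflop_per_s * seconds_per_day
print(f"{total_flops:.2e} FLOPs")                                  # about 3.14e+23 operations
print(f"{pfs_days / 365:.1f} years at a sustained 1 petaflop/s")   # about 10 years
```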


6. Cloud-Based Supercomputers

  • Amazon (AWS), Google Cloud, Microsoft Azure, and Oracle all offer GPU-based AI supercomputing instances.

  • These are scalable, allowing companies to "rent" thousands of GPUs for training large models.
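
As a hedged example of what "renting" GPUs looks like in practice, the snippet below requests one 8-GPU A100 instance (AWS's p4d.24xlarge) with boto3. The instance type is real, but the AMI ID and key pair name are placeholders, and Google Cloud, Azure, and Oracle each have their own equivalent APIs.

```python
# Illustrative only: request an 8x NVIDIA A100 instance on AWS.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder; use a current Deep Learning AMI in your region
    InstanceType="p4d.24xlarge",       # 8x NVIDIA A100 GPUs per instance
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",             # placeholder key pair name
)
print(response["Instances"][0]["InstanceId"])
```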


7. Some Key AI Supercomputers in the World

Name | Location | Notable Use
Microsoft Azure AI supercomputer | U.S. (built for OpenAI) | GPT-3 and GPT-4 training
Meta’s Research SuperCluster (RSC) | U.S. | AI model training & metaverse research
Frontier (ORNL) | Tennessee, USA | First to exceed 1 exaflop; among the world’s fastest
Cerebras Wafer-Scale Engine 2 | U.S. | World’s largest single chip (850,000 cores)
Google TPU v4 Pods | U.S. | PaLM and Gemini model training

Fun Fact

  • Training a model like GPT-4 required an estimated 25,000 to 40,000 GPUs running for weeks, with costs possibly exceeding $100 million just for the training run.
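
A back-of-the-envelope check of that figure, using assumed inputs rather than anything OpenAI has confirmed:

```python
# Rough cost sanity check; every input here is an assumption, not a reported number.
num_gpus = 25_000              # low end of the GPU-count estimate above
days = 90                      # "weeks" of training stretched to roughly three months
price_per_gpu_hour = 2.00      # assumed blended $/GPU-hour for A100-class capacity

gpu_hours = num_gpus * days * 24
cost = gpu_hours * price_per_gpu_hour
print(f"{gpu_hours:,} GPU-hours  ->  ~${cost:,.0f}")   # 54,000,000 GPU-hours -> ~$108,000,000
```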


Which of these supercomputer facts surprised you the most—and what do you think the future of AI computing should look like? Share your thoughts in the comments!
