Massive Energy for Massive GPUs Empowering AI
Massive GPUs for AI model training and deployment require significant energy. As AI scales, optimizing energy efficiency will be crucial.
OpenAI founder Sam Altman has made ambitious calculations suggesting a potential investment on the scale of $7 trillion in GPUs for an AI future. This number, rejected by industry leaders like Nvidia’s founder Jensen Huang, implies a monumental acquisition of GPUs and an enormous, almost galactic, demand for energy. To put this in perspective, Nvidia’s current market worth is around $3 trillion, less than half of Altman’s proposed investment. Even when compared to the GDPs of the United States (approximately $26.8 trillion) and China (around $17.8 trillion), a $7 trillion investment is staggering.
Despite this, the AI era is still in its infancy, and achieving such a scale might necessitate even more advanced computational structures. This brings us to a critical underlying question: how much energy will be needed to power computational units and data centers?
Let’s take a look at some simple and direct numbers from three perspectives:
1. Energy consumption per computational unit
2. Energy costs of training/operating modern models
3. Energy supply and demand
Energy consumption per computational unit
From a user perspective, some video game enthusiasts have built their own PCs equipped with high-performance GPUs like the NVIDIA GeForce RTX 4090. Interestingly, this GPU is also capable of handling small-scale deep-learning tasks. The RTX 4090 is rated to draw up to 450 W, and the recommended power supply for the whole system is 850 W (in most cases you will not run under full load and won’t need that much). Taking 850 W as the draw, a task that runs continuously for a week consumes 0.85 kW × 24 hours × 7 days = 142.8 kWh. In California, PG&E charges as much as 50 cents per kWh for residential customers, meaning you would spend around $70 per week on electricity. The 850 W figure is meant to cover the CPU and other components that work alongside the GPU, so treat it as a rough upper bound for a typical build rather than a measured draw.
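The weekly figures above can be reproduced with a few lines of Python. This is only a sketch: the function names and the `load_factor` parameter are illustrative and not from the article, while the 450 W, 850 W, and 50 cents per kWh inputs are the ones quoted in the text.

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours


def weekly_energy_kwh(power_kw: float, load_factor: float = 1.0) -> float:
    """Energy in kWh for one week at a constant draw of power_kw * load_factor.

    load_factor is a hypothetical knob for "not running at full load".
    """
    return power_kw * load_factor * HOURS_PER_WEEK


def weekly_cost_usd(power_kw: float, rate_usd_per_kwh: float,
                    load_factor: float = 1.0) -> float:
    """Electricity cost in USD for one week at a flat rate."""
    return weekly_energy_kwh(power_kw, load_factor) * rate_usd_per_kwh


RATE = 0.50  # USD per kWh, the high-end PG&E residential rate quoted above

# GPU alone (450 W) versus the recommended system power supply (850 W).
for label, kw in [("GPU only, 450 W", 0.45), ("whole system, 850 W", 0.85)]:
    energy = weekly_energy_kwh(kw)
    cost = weekly_cost_usd(kw, RATE)
    print(f"{label}: {energy:.1f} kWh/week, ${cost:.2f}/week")
```

Passing a `load_factor` below 1.0 gives a feel for how much the bill drops when the machine is not pinned at full load all week.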
Now suppose your AI business is about to accelerate. According to the manufacturer, an H100 Tensor Core GPU has a maximum thermal design power (TDP) of around 700 W, depending on the specific version. TDP is the maximum heat the cooling system must be able to dissipate under a full working load, which makes it a close proxy for the GPU’s peak power draw. A reliable power supply unit for this high-performance deep-learning tool is typically around 1600 W. If you use the NVIDIA DGX platform for your deep-learning tasks, a single DGX H100 system, equipped with 8 H100 GPUs, consumes approximately 10.2 kW. For even greater performance, an NVIDIA DGX SuperPOD can include anywhere from 24 to 128 NVIDIA DGX nodes. With 64 nodes, the system could conservatively consume about 652.8 kW. While your startup might aspire to purchase this multimillion-dollar equipment, the costs for both the cluster and the necessary facilities would be substantial. In most cases, it makes more sense to rent GPU clusters from cloud computing providers.
Focusing on energy costs, commercial and industrial users typically benefit from lower electricity rates. If your average cost is around 20 cents per kWh, operating 64 DGX nodes at 652.8 kW for 24 hours a day, 7 days a week would consume 109.7 MWh per week. This could cost you approximately $21,934 per week.
According to rough estimates, a typical family in California uses around 150 kWh of electricity per week. Interestingly, this is roughly the same amount of energy you’d use if you ran a model training task at home for a week on a high-performance GPU like the RTX 4090.
From this table, we may observe that operating a SuperPOD with 64 nodes could consume as much energy in a week as a small community: 109.7 MWh is roughly what 730 households would use at 150 kWh each.
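To make the "small community" comparison concrete, the sketch below scales one DGX H100 node up to a 64-node SuperPOD and divides the weekly energy by the household figure. The `WeeklyFootprint` class and `superpod_week` function are hypothetical names introduced for illustration; the inputs (10.2 kW per node, 20 cents per kWh, 150 kWh per household per week) are the article's own rough estimates.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 24 * 7


@dataclass
class WeeklyFootprint:
    power_kw: float               # constant draw of the whole cluster
    energy_mwh: float             # energy used in one week
    cost_usd: float               # electricity bill for that week
    household_equivalents: float  # households using the same energy


def superpod_week(nodes: int = 64,
                  kw_per_node: float = 10.2,
                  rate_usd_per_kwh: float = 0.20,
                  household_kwh_per_week: float = 150.0) -> WeeklyFootprint:
    """Weekly footprint of a cluster running flat out (an assumption)."""
    power_kw = nodes * kw_per_node
    energy_kwh = power_kw * HOURS_PER_WEEK
    return WeeklyFootprint(
        power_kw=power_kw,
        energy_mwh=energy_kwh / 1000,
        cost_usd=energy_kwh * rate_usd_per_kwh,
        household_equivalents=energy_kwh / household_kwh_per_week,
    )


fp = superpod_week()
print(f"{fp.power_kw:.1f} kW, {fp.energy_mwh:.1f} MWh/week, "
      f"${fp.cost_usd:,.0f}/week, about {fp.household_equivalents:.0f} households")
```

Changing `nodes` to 24 or 128 covers the two ends of the SuperPOD range mentioned above.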
Energy costs of training/operating AI models
Training AI models
Now, let’s dive into some numbers related to modern AI models. OpenAI has never disclosed the exact number of GPUs used to train ChatGPT, but a rough estimate suggests it could involve thousands of GPUs running continuously for several weeks to months, depending on the model version. The power draw for such a task would easily be on the megawatt scale, and the total energy consumed would run to thousands of MWh.
Recently, Meta released LLaMA 3.1, described as their “most capable model to date.” According to Meta, this is their largest model yet, trained on over 16,000 H100 GPUs — the first LLaMA model trained at this scale.
Let’s break down the numbers: LLaMA 2 was released in July 2023, so as a rough upper bound we can assume that training LLaMA 3.1 took about a year. While it’s unlikely that all GPUs were running 24/7, we can estimate energy consumption with a 50% utilization rate, using the 1.6 kW per-GPU power supply figure from earlier:
1.6 kW × 16,000 GPUs × 24 hours/day × 365 days/year × 50% ≈ 112,128 MWh
At an estimated cost of $0.20 per kWh, this translates to around $22.4 million in energy costs. This figure only accounts for the GPUs, excluding additional energy consumption related to data storage, networking, and other infrastructure.
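Because this estimate rests on several guesses, it helps to see how the result moves when one of them changes. The sketch below reproduces the figure above and then reruns it with the 0.7 kW TDP instead of the 1.6 kW power supply rating. The function name and its parameters are illustrative; the one-year duration and 50% utilization are the article's assumptions, not measured values.

```python
HOURS_PER_DAY = 24


def training_energy_mwh(gpus: int, kw_per_gpu: float,
                        days: float, utilization: float) -> float:
    """GPU-only training energy in MWh (no storage, networking, or cooling)."""
    kwh = gpus * kw_per_gpu * HOURS_PER_DAY * days * utilization
    return kwh / 1000


def energy_cost_usd(energy_mwh: float, rate_usd_per_kwh: float = 0.20) -> float:
    """Cost of that energy at a flat rate per kWh."""
    return energy_mwh * 1000 * rate_usd_per_kwh


# Article's scenario: 16,000 GPUs, 1.6 kW each, one year, 50% utilization.
base = training_energy_mwh(16_000, 1.6, 365, 0.5)
# Same scenario, but using the 0.7 kW TDP as the per-GPU draw.
tdp_only = training_energy_mwh(16_000, 0.7, 365, 0.5)

for label, mwh in [("1.6 kW per GPU", base), ("0.7 kW per GPU", tdp_only)]:
    print(f"{label}: {mwh:,.0f} MWh, ${energy_cost_usd(mwh) / 1e6:.1f} million")
```

The per-GPU power assumption alone moves the answer by more than a factor of two, which is a reasonable indication of how rough the estimate is.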
Training modern large language models (LLMs) draws power on a megawatt scale and represents a multimillion-dollar investment in electricity alone. This is why modern AI development often excludes smaller players.
Operating AI models
Running AI models also incurs significant energy costs, as each query and response requires computational power. Although the energy cost per interaction is small compared to training the model, the cumulative impact can be substantial, especially if your AI business achieves large-scale success with billions of users interacting with your advanced LLM daily. Many insightful articles discuss this issue, including comparisons of energy costs among companies operating chatbots. Their conclusion is that, since each query could consume from 0.002 to 0.004 kWh, popular companies currently use hundreds to thousands of MWh per year. And this number is still increasing.
Imagine for a moment that one billion people use a chatbot frequently, averaging around 100 queries per day. The energy cost for this usage can be estimated as follows:
0.002 kWh × 100 queries/day × 1e9 people × 365 days/year ≈ 7.3e7 MWh/year
This corresponds to a continuous power supply of roughly 8,300 MW (7.3e7 MWh spread over the 8,760 hours in a year) and could result in an energy cost of approximately $14.6 billion annually, assuming an electricity rate of $0.20 per kWh.
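The same estimate can be run for both ends of the 0.002 to 0.004 kWh per-query range quoted earlier. This is a back-of-the-envelope sketch: the function names are made up for illustration, and the one billion users and 100 queries per day are the hypothetical scenario above, not observed usage.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_inference_kwh(kwh_per_query: float,
                         queries_per_user_per_day: float,
                         users: float) -> float:
    """Total energy in kWh for one year of queries."""
    return kwh_per_query * queries_per_user_per_day * users * 365


def average_power_mw(annual_kwh: float) -> float:
    """Continuous power in MW that delivers annual_kwh over a year."""
    return annual_kwh / HOURS_PER_YEAR / 1000


RATE = 0.20  # USD per kWh, the rate assumed in the text

for kwh_per_query in (0.002, 0.004):
    kwh = annual_inference_kwh(kwh_per_query, 100, 1e9)
    print(f"{kwh_per_query} kWh/query: {kwh / 1000:.2e} MWh/year, "
          f"{average_power_mw(kwh):,.0f} MW, ${kwh * RATE / 1e9:.1f} billion/year")
```

At the upper end of the per-query range, every figure simply doubles.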
Energy supply and demand
The largest power plant in the U.S. is the Grand Coulee Dam in Washington State, with a capacity of 6,809 MW. The largest solar farm in the U.S. is Solar Star in California, which has a capacity of 579 MW. In this context, no single power plant is capable of supplying all the electricity required for a large-scale AI service. This becomes evident when considering the annual electricity generation statistics provided by the U.S. Energy Information Administration (EIA).
The 73 billion kWh calculated above would account for approximately 1.8% of the total electricity generated annually in the US. However, it’s reasonable to believe that this figure could be much higher. According to some media reports, when considering all energy consumption related to AI and data processing, the impact could be around 4% of the total U.S. electricity generation.
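As a rough cross-check, the sketch below compares the continuous power implied by 73 billion kWh per year with the two plant capacities quoted above, and backs out the total U.S. generation implied by the 1.8% share. Two caveats, both assumptions of this sketch: the plant comparison uses nameplate capacity and ignores capacity factor, so the real number of plants needed would be higher, and the total-generation figure is derived from the article's own percentage rather than taken from EIA data.

```python
HOURS_PER_YEAR = 24 * 365

ANNUAL_AI_KWH = 73e9     # chatbot scenario from the text
GRAND_COULEE_MW = 6809   # capacity quoted in the text
SOLAR_STAR_MW = 579      # capacity quoted in the text
SHARE_OF_US = 0.018      # 1.8% share stated in the text


def required_mw(annual_kwh: float) -> float:
    """Continuous power in MW needed to supply annual_kwh over one year."""
    return annual_kwh / HOURS_PER_YEAR / 1000


def plants_needed(demand_mw: float, plant_capacity_mw: float) -> float:
    """Plant count at full nameplate output (a lower bound in practice)."""
    return demand_mw / plant_capacity_mw


demand = required_mw(ANNUAL_AI_KWH)
implied_total_kwh = ANNUAL_AI_KWH / SHARE_OF_US

print(f"Continuous demand: {demand:,.0f} MW")
print(f"Grand Coulee equivalents: {plants_needed(demand, GRAND_COULEE_MW):.2f}")
print(f"Solar Star equivalents: {plants_needed(demand, SOLAR_STAR_MW):.1f}")
print(f"Implied U.S. generation: {implied_total_kwh / 1e9:,.0f} billion kWh/year")
```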
These figures, though, reflect only today’s usage.
Today, chatbots primarily generate text-based responses, but they are increasingly capable of producing two-dimensional images, “three-dimensional” videos, and other forms of media. The next generation of AI will extend far beyond simple chatbots and may provide high-resolution images for spherical screens (e.g. for the Las Vegas Sphere), 3D modeling, and interactive robots capable of performing complex tasks and executing deep logistical. As a result, the energy demands for both model training and deployment are expected to increase dramatically, far exceeding current levels. Whether our existing power infrastructure can support such advancements remains an open question.
On the sustainability front, the carbon emissions from industries with high energy demands are significant. One approach to mitigating this impact involves using renewable energy sources to power energy-intensive facilities, such as data centers and computational hubs. A notable example is the collaboration between Fervo Energy and Google, where geothermal power is being used to supply energy to a data center. However, the scale of these initiatives remains relatively small compared to the overall energy needs anticipated in the upcoming AI era. There is still much work to be done to address the challenges of sustainability in this context.
Please correct any numbers if you find them unreasonable.
Massive Energy for Massive GPU Empowering AI was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.