Timing Trick Cuts Energy Used in LLM Training by Up to 14 Percent
Researchers at the University of Twente have shown that by adjusting GPU clock frequencies at the per-kernel level, they can save up to 14% of the energy used in LLM training with minimal impact on speed.
Dina Genkina
10 Jun 2026
3 min read
OpenAI’s fourth large language model (LLM), GPT-4, took an estimated 50 gigawatt-hours to train, or the equivalent of 5,000 American homes’ yearly power consumption. That was in 2023. Since then, the computational resources used to train frontier LLMs have only increased, though direct power usage numbers are hard to come by.
Now, a research group at the University of Twente in the Netherlands has shown that you can save up to 14 percent of the energy used in LLM training with almost no loss of speed by cleverly adjusting the clock frequency of the GPU during computation. Jeffrey Spaan, a Ph.D. candidate at the University of Twente and lead author of the paper, presented the results at the Computing Frontiers conference in Catania, Sicily, last month.
“My research is about finding computing waste,” Spaan says. “It’s similar to underutilization of the hardware, but instead of optimizing the software for the hardware, we try to optimize the hardware for the software.”
Making the GPU tick
Spaan and his collaborators accomplished this by using a technique known as dynamic voltage and frequency scaling (DVFS). Every chip—including the GPUs commonly used for training frontier models—uses at least one clock to orchestrate computations. Each operation in the chip is triggered by a clock pulse. The frequency with which that clock ticks controls how fast the chip operates and how much power it draws.
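As a rough illustration of why clock frequency and power draw are linked, the sketch below uses the textbook approximation for dynamic power in CMOS logic, P ≈ C × V² × f. The capacitance, voltage, and frequency values are made-up numbers chosen for illustration; they are not measurements from the Twente study.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """Textbook CMOS dynamic-power approximation: P = C * V^2 * f (watts)."""
    return capacitance_f * voltage_v ** 2 * frequency_hz

# Hypothetical operating points (illustrative values only).
high = dynamic_power(1e-9, 1.0, 1.8e9)  # 1.8 GHz at 1.0 V
low = dynamic_power(1e-9, 0.8, 1.2e9)   # 1.2 GHz at 0.8 V

saving = 1 - low / high
print(f"high: {high:.3f} W, low: {low:.3f} W, power saving: {saving:.0%}")
```

Lower power only becomes lower energy if the run time does not grow by a comparable amount, which is why it matters which clock is slowed and when.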
Modern GPUs have two clocks, one for the computational core and one for the memory. When the core is hard at work crunching numbers, the clock frequency is kept high to ensure speedy calculation. However, with DVFS, the memory clock can slow down during that time, reducing power draw. In principle, it’s possible to just turn off the memory part of the chip, but GPU designs don’t enable software control for that off switch, and it would take too long to turn back on mid-calculation anyway. Similarly, when the core is waiting for data to be loaded from memory, the core clock frequency can be slowed to a crawl while the memory clock frequency ramps up.
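The choice between slowing the core clock and slowing the memory clock can be sketched with a simple roofline-style heuristic. This is an illustrative assumption, not the researchers' method: the `Kernel` fields, the `pick_clocks` helper, and the peak hardware figures are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    flops: float        # arithmetic operations performed
    bytes_moved: float  # bytes read from or written to GPU memory

def pick_clocks(kernel, peak_flops, peak_bandwidth):
    """Return (core_clock, memory_clock), each 'high' or 'low'.

    If the estimated compute time dominates, the kernel is treated as
    compute-bound and the memory clock is lowered; otherwise the core
    clock is lowered.
    """
    compute_time = kernel.flops / peak_flops
    memory_time = kernel.bytes_moved / peak_bandwidth
    if compute_time >= memory_time:
        return ("high", "low")
    return ("low", "high")

# Hypothetical hardware peaks: 30 TFLOP/s and 900 GB/s.
PEAK_FLOPS = 3e13
PEAK_BANDWIDTH = 9e11

matmul = Kernel("matmul", flops=2e12, bytes_moved=1e9)
add = Kernel("elementwise_add", flops=1e8, bytes_moved=1.2e9)

print(matmul.name, pick_clocks(matmul, PEAK_FLOPS, PEAK_BANDWIDTH))
print(add.name, pick_clocks(add, PEAK_FLOPS, PEAK_BANDWIDTH))
```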
DVFS is a well-known technique that dates back to at least the 1990s. But Spaan says other researchers haven’t been able to usefully apply it to LLM training because their methods either slowed down calculations too much or were not fine-grained enough to improve energy usage.
Previous DVFS attempts adjusted the frequency at each iteration of the training process. In LLM training, each iteration consists of two parts: the forward pass, in which data is run forward through the layers of the model with the weights as they are; and backpropagation, in which the weights are adjusted layer by layer based on the results of the forward pass. So prior work kept one value of the frequency for the forward pass and adjusted to another for backpropagation.
Spaan and coworkers tuned the clock frequencies on a shorter timescale. GPU workloads are broken down into tiny computational nuggets known as kernels. For example, a single vector-vector multiplication can make up a single kernel. The kernels are fed to the GPU to be processed many times in parallel. In Spaan’s implementation, the computation of a single layer of a deep neural network is broken up into approximately 40 kernels. By adjusting the clock frequencies on a per-kernel level, the team was able to find much greater energy savings.
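To see why finer granularity can help, consider a toy comparison between one clock setting for a whole pass and a separate setting for each kernel. The profile table, the 1 percent slowdown tolerance, and the selection rule below are hypothetical and only sketch the general idea, not the team's actual search procedure.

```python
# Hypothetical profile: kernel -> {core clock in MHz: (time in ms, energy in J)}.
PROFILE = {
    "matmul":    {1800: (10.0, 3.0), 1200: (15.0, 2.8)},
    "layernorm": {1800: (4.0, 1.0),  1200: (4.01, 0.7)},
}
TOLERANCE = 0.01  # allow at most 1 percent slowdown

def per_kernel_plan(profile, tol):
    """For each kernel, pick the lowest-energy clock within the slowdown limit."""
    plan = {}
    for name, options in profile.items():
        fastest = min(t for t, _ in options.values())
        allowed = {f: te for f, te in options.items() if te[0] <= fastest * (1 + tol)}
        plan[name] = min(allowed, key=lambda f: allowed[f][1])
    return plan

def single_clock_plan(profile, tol):
    """Pick one clock for all kernels, limiting the total slowdown."""
    freqs = list(next(iter(profile.values())).keys())
    total_time = {f: sum(opts[f][0] for opts in profile.values()) for f in freqs}
    total_energy = {f: sum(opts[f][1] for opts in profile.values()) for f in freqs}
    limit = min(total_time.values()) * (1 + tol)
    allowed = [f for f in freqs if total_time[f] <= limit]
    best = min(allowed, key=lambda f: total_energy[f])
    return {name: best for name in profile}

def totals(profile, plan):
    """Return (total time in ms, total energy in J) for a plan."""
    time_ms = sum(profile[k][f][0] for k, f in plan.items())
    energy_j = sum(profile[k][f][1] for k, f in plan.items())
    return time_ms, energy_j

fine = per_kernel_plan(PROFILE, TOLERANCE)
coarse = single_clock_plan(PROFILE, TOLERANCE)
print("per-kernel:", fine, totals(PROFILE, fine))
print("single clock:", coarse, totals(PROFILE, coarse))
```

In this made-up profile the single-clock plan has to stay at the high frequency because slowing everything would make the compute-heavy kernel too slow, while the per-kernel plan can still slow the memory-bound kernel.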
The GPU also does DVFS automatically when the chip’s internal systems detect higher or lower demand, Spaan notes. “Some people might therefore think: We’ll just let the GPU handle it,” he says. “However, because the GPU doesn’t have the foresight we have of what kernels will run, it has to work with an on-the-fly best-effort guess and can therefore never attain the same savings.” That’s where the manual adjustments come in.
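The foresight argument can be illustrated with a deliberately simplified model. The "reactive" governor below is a toy that applies whatever setting suited the previous kernel; real GPU power management is more sophisticated, and none of these function names come from the study.

```python
def ideal_setting(kind):
    """Clock pair (core, memory) that suits a kernel of the given kind."""
    return ("high", "low") if kind == "compute" else ("low", "high")

def planned_schedule(kinds):
    """With the kernel sequence known in advance, match every kernel."""
    return [ideal_setting(k) for k in kinds]

def reactive_schedule(kinds, initial="compute"):
    """Toy governor: each kernel runs with the setting suited to the previous one."""
    history = [initial] + list(kinds[:-1])
    return [ideal_setting(k) for k in history]

def mismatches(schedule, kinds):
    """Count kernels that run under a setting that does not suit them."""
    return sum(1 for s, k in zip(schedule, kinds) if s != ideal_setting(k))

kinds = ["compute", "memory", "compute", "memory"]
print("planned mismatches:", mismatches(planned_schedule(kinds), kinds))
print("reactive mismatches:", mismatches(reactive_schedule(kinds), kinds))
```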
Less energy, same time
The team performed their experiment by training GPT-3-XL, a 1.3 billion parameter model, on an Nvidia RTX 3080 Ti GPU. To save time, they focused on training a single layer of the model. In this setting, they found a set of frequency adjustments that gave them 14 percent energy savings while slowing the training time by only 0.6 percent. Performance of the model depends on both computing speed and energy usage.
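Because energy is average power multiplied by time, the two reported figures also imply how much the average power draw fell. The short calculation below uses only the 14 percent and 0.6 percent numbers quoted above; the variable names are ours.

```python
energy_ratio = 1 - 0.14   # tuned energy relative to baseline
time_ratio = 1 + 0.006    # tuned training time relative to baseline

# energy = average power * time, so power ratio = energy ratio / time ratio
power_ratio = energy_ratio / time_ratio
print(f"average power falls by about {1 - power_ratio:.1%}")
```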
There is one challenge: Ramping down the clock frequency is much faster than turning a core off and on, but it’s still not instantaneous. In their experiment, the researchers evaluated one kernel at a time, not taking into account the frequency switching speed. So 14 percent energy savings is a best-case scenario. How much of an issue this would be in practice, Spaan says, depends heavily on the GPU being used. Newer hardware, like the Blackwell GPUs, has much faster switching speeds than older versions and should be able to harness the full energy savings.
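A back-of-the-envelope model shows how switching delay could eat into the best-case figure. Every number here is hypothetical: the baseline energy, the switching latency, and the power drawn while switching are assumptions for illustration, and the 40-kernel sequence simply echoes the per-layer kernel count mentioned earlier.

```python
def count_switches(clocks):
    """Number of times the clock setting changes between consecutive kernels."""
    return sum(1 for prev, cur in zip(clocks, clocks[1:]) if prev != cur)

def net_saving(baseline_j, tuned_j, clocks, switch_ms, switch_power_w):
    """Fractional energy saving after charging energy for each clock switch."""
    overhead_j = count_switches(clocks) * (switch_ms / 1000) * switch_power_w
    return 1 - (tuned_j + overhead_j) / baseline_j

# Hypothetical layer: 40 kernels alternating between two core clocks (MHz).
clocks = [1800, 1200] * 20

ideal = net_saving(100.0, 86.0, clocks, switch_ms=0.0, switch_power_w=100.0)
slow = net_saving(100.0, 86.0, clocks, switch_ms=1.0, switch_power_w=100.0)
print(f"instant switching: {ideal:.1%}, 1 ms per switch: {slow:.1%}")
```

A practical scheduler would presumably skip a switch whenever the overhead outweighs the saving for the next kernel, which this sketch does not attempt.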
Now, the team is developing a tool that would be able to implement optimal frequency scaling automatically for a particular workload. Spaan hopes their method will be attractive enough to industry leaders to merit adoption. “We optimize for saving energy without losing performance,” Spaan says. “In the real world, performance is the holy grail.”