
Embracing empiricism – from the lottery hypothesis to creating real-world impact: an interview with Jonathan Frankle

Jonathan Frankle discusses the lottery ticket hypothesis, for which he won the 2023 AAAI/ACM Doctoral Dissertation Award. He covers empiricism vs. theoretical proofs, the shift in computer science methodology, the pressure on young researchers to deliver impact, and his current focus on evaluating how well AI systems work in practice.

Source: AIhub. Author: Ella Scallan

In this crosspost from AI Matters – a publication of the ACM SIGAI – Ella Scallan sat down with Jonathan Frankle to discuss the lottery ticket hypothesis, for which he was awarded the 2023 AAAI/ACM Doctoral Dissertation Award. In this wide-ranging conversation, Jonathan delves into empiricism vs theoretical proofs, how the approach to computer science has changed (even if the fundamental problems haven’t), how younger researchers are rapidly adapting to a world that values impact above all else, and what it means to be a researcher. Read on for an insightful and thought-provoking discussion.

You were awarded the 2023 AAAI/ACM Doctoral Dissertation Award. What was the topic of your dissertation research, and why was this an interesting area of study to you?

It was on a topic that I called the lottery ticket hypothesis, which I first developed back in 2018. The goal was to understand how deep neural networks learn, why they learn, and what they learn. Despite all the attention around neural networks in the past five to ten years, it’s still a very mysterious process and I don’t think we’ve gotten any closer to a clear answer. In fact, any answers we’ve achieved have been out of date by the time we’ve achieved them, because things keep changing – the systems keep becoming bigger and more complex.

One question I asked was: how big do these systems have to be to learn? Generally, we make the models bigger and they learn better – but is this necessary? There’s this strange phenomenon that’s been observed in the literature for decades – and was especially popular a few years ago – where after you’ve trained a deep neural network, you can actually delete a lot of the parameters and the models perform just as well.

This is weird because it makes you ask the question – did the model have to be that big in the first place? But if you were to take that smaller neural network with all those connections deleted and try to train it from the beginning, it doesn’t learn very well. So, why does it seem like whatever the neural network learns by the end can be relatively small, but the learning process requires it to be big? Intuitively, that could make sense. For instance, it’s harder to wrap your head around a complex topic than it is to have a nice synthesized understanding of that topic.

I found this strange answer that applied to what were meaningfully sized neural networks then. Specifically, when you create a neural network and you create all these connections, you set each connection to a random value at the beginning, just by sampling from some distribution. These random values have some properties in the aggregate, but the specific values seemed unimportant. However, it turns out that after you delete the connections in the final neural network, the specific values of the remaining connections actually matter a lot. If you train a smaller network using only those connections with those specific values, it can learn. If you train a smaller network using new values sampled from the same random distribution as the original network, it doesn’t learn. This is very strange. So you could actually train those smaller neural networks from the beginning or near the beginning of training, so long as you set those parameters properly.

To some extent, my thesis was trying to tackle this question of how big a neural network needs to be in order to learn, when what you learn seems smaller than what is needed. Then I found this weird property that maybe you can learn with something smaller, and that there is something special about the random values. This is where the lottery metaphor comes from. It’s a small insight into how neural networks might learn, which continues to elude us. But any insight is important.

What was the lasting impact of your work?

The main finding was the observation that maybe there isn’t a difference between the amount of capacity you need to learn and the amount of capacity you need to know something in a neural network. And that perhaps there is some method to the madness in how these things learn. From a technical perspective, it was the insight that the sub-network you get after training and pruning could have been trained from the beginning, if only you knew the right parameters.

The theme from my work that has stood the test of time is the focus on empiricism. This sounds silly in 2026 – like of course we do empirical work to understand AI. What else would you do? In 2018, it was actually pretty strange and controversial to not have some nice elegant theorem to tell us what to do. People were hoping for some grand unified theory, because I think we do have a bias towards that in computer science, as well as towards properties that guarantee certain behaviours.

It was quite uncomfortable when I talked about this work at that stage, as I had a hypothesis about how learning happened in deep neural networks, but I couldn’t show mathematically that this had to be true. That was actually a big challenge. Some people really didn’t like the work as a result, including one of the reviewers of the paper. When I was on the faculty job market, someone asked me this question which is burned in my mind: “You’ve done this empirical work. Do you plan to do anything principled?” I knew what they meant: principled in computer science is code for formal mathematics. They wanted to know if I was going to prove my work, or if it was ‘just’ empirical. But I resented the subtext that empiricism is unprincipled.

I think principled empiricism is a very valid way of getting knowledge about the world. In any science other than computer science, that is our entire way of getting knowledge about the world. That was a somewhat controversial point of view in computer science and AI at the time. It feels very silly looking back, because now everything is empirical and we have no other approach – the sheer scale and complexity of the systems is such that it defies any mathematical framework we have. And real data defies mathematical description – if you’re training a model on all of human language, how do you mathematically describe the internet and all the data on the internet? In 2017 and 2018, there was a large community of us who did believe in the empirical approach, but it was a weird minority view that I don’t think was very welcome in traditional computer science circles.

My dissertation was unapologetically empirical, among a lot of other work that was unapologetically empirical at that time. In some sense it became a bit of a symbol of that – people saw this well-known paper that had shown a really interesting piece of scientific insight, purely empirically. I think that was part of a trend of folks who helped to change the way we view doing science in AI. To me, that’s the lasting impact of the work, much more than any specific insight on the size of neural networks.

It sounds like there’s been a real mindset shift in AI. So, in 2017/2018 you were finding the parameter values via brute force. Has anyone come up with an alternative methodology since then?

I haven’t seen anything. People still propose methods with reasonable frequency, although when I see papers on this topic, I generally tell people to find something more relevant to work on. It was an interesting paper for 2018, but it’s 2026 now. The nature of what science is worth doing has changed a lot, and I don’t think this is a problem worth working on today.

In some sense, the original method in the paper has stood the test of time because it is so simple. It works like this: take a big neural network, save the values of every weight that was assigned at the very beginning, train that network to the end, delete whatever parts end up being unimportant at the end of training, and then just imagine you had known that at the beginning of time and go back. You can’t find the smaller network unless you’ve trained the bigger one, which is why this method is fairly impractical for any real applications. And the best way to do this is actually incrementally, by getting rid of a few parameters at a time and iterating this process a dozen or a couple of dozen times. You can see how this becomes expensive fast and your advisor wonders what in the world you’re doing. So that is the gold standard method. There are lots of other methods in the literature, some of which are more efficient, but with lower quality.
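The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code from the paper: the model is a toy linear regressor rather than a deep network, "unimportant" is taken to mean smallest magnitude, and the names `train`, `prune_smallest` and `iterative_magnitude_pruning` are hypothetical.

```python
import random

def train(init, mask, data, lr=0.3, steps=200):
    """Train a masked linear model y = sum(w_i * x_i) with plain gradient descent."""
    w = [wi * mi for wi, mi in zip(init, mask)]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)
        # Multiplying by the mask keeps deleted weights at zero.
        w = [(wi - lr * gi) * mi for wi, gi, mi in zip(w, grad, mask)]
    return w

def prune_smallest(trained, mask, fraction):
    """Delete `fraction` of the surviving weights, smallest magnitude first."""
    alive = [i for i, m in enumerate(mask) if m == 1]
    n_prune = int(len(alive) * fraction)
    new_mask = list(mask)
    for i in sorted(alive, key=lambda i: abs(trained[i]))[:n_prune]:
        new_mask[i] = 0
    return new_mask

def iterative_magnitude_pruning(init, data, rounds=2, fraction=0.5):
    """Train, prune a little, go back to the saved initial values, repeat."""
    mask = [1] * len(init)
    for _ in range(rounds):
        trained = train(init, mask, data)  # always restarts from `init`
        mask = prune_smallest(trained, mask, fraction)
    return mask

# Toy data (assumed for illustration): only the first two inputs matter.
rng = random.Random(0)
true_w = [3.0, -2.0] + [0.0] * 6
data = []
for _ in range(64):
    x = [rng.uniform(-1, 1) for _ in range(len(true_w))]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

init = [rng.gauss(0, 0.1) for _ in range(len(true_w))]  # saved initial weights
mask = iterative_magnitude_pruning(init, data)
ticket = [w * m for w, m in zip(init, mask)]  # surviving weights at their original values
```

Each round pays for a full training run, which is the cost described above. Because this toy problem is convex, it only illustrates the procedure; it does not reproduce the sensitivity to the specific initial values that was observed in deep networks.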

Thanks for clarifying. I can see how that negates the potential benefit of having a faster network.

At the time, the paper raised the hope that you might be able to find this smaller network more efficiently, now that we know it exists. But it has been hard to find. Also, the kind of smaller network that you get is relatively hard to take advantage of on contemporary hardware, because our hardware is designed to do these big blocks of computation where you have a matrix that is completely full of numbers. If you’re missing a number here and a number there, it’s hard to take advantage of that.
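A rough way to picture the hardware point, as a sketch with hypothetical function names and a made-up 3×4 weight matrix: a dense kernel visits every entry of a pruned matrix, zero or not, while a kernel that skips the zeros has to store column indices and look up the input vector at scattered positions.

```python
def dense_matvec(matrix, vec):
    """Dense kernel: multiplies every entry, including the pruned zeros."""
    ops = 0
    out = []
    for row in matrix:
        acc = 0.0
        for a, b in zip(row, vec):
            acc += a * b
            ops += 1
        out.append(acc)
    return out, ops

def to_sparse_rows(matrix):
    """Keep only the nonzeros of each row as (column_index, value) pairs."""
    return [[(j, a) for j, a in enumerate(row) if a != 0.0] for row in matrix]

def sparse_matvec(sparse_rows, vec):
    """Sparse kernel: fewer multiplies, but each one needs an indexed lookup."""
    ops = 0
    out = []
    for row in sparse_rows:
        acc = 0.0
        for j, a in row:
            acc += a * vec[j]  # scattered access into vec
            ops += 1
        out.append(acc)
    return out, ops

# A pruned weight matrix: most connections deleted, at irregular positions.
pruned = [
    [0.5, 0.0, 0.0, -1.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 1.5, 0.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0]

dense_out, dense_ops = dense_matvec(pruned, x)
sparse_out, sparse_ops = sparse_matvec(to_sparse_rows(pruned), x)
```

Here the dense version performs 12 multiplies and the sparse version 4, with identical results. The operation count alone overstates the benefit: hardware built for full blocks runs the dense pattern very efficiently, and the irregular lookups of the sparse pattern are much harder to run at the same speed.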

As you said – that was 2018, this is 2026. What are you focusing on now? What does your research look like these days?

For me, the most fundamental problem in AI right now is not how smart it is or how efficient it is, but how we know if it’s working. I think we lack good science on that topic at the moment. The nice thing about this problem is it’s one that doesn’t require GPUs, a supercomputer, or tens of millions of dollars. So for my team at Databricks, there’s no excuse that we don’t have the money that OpenAI has or Google has. The only thing you need is humans and your own ingenuity and creativity. It’s a really lovely problem from that perspective.

I almost feel like I’m an HCI researcher these days because I have one unfair advantage that I think a lot of my colleagues in academia don’t have. I have real users, I have customers, and I can just go talk to them and ask why they’re using AI and why they’re not using AI and where they’re stuck. In the academic world we rely on these benchmarks. There are things like SWE-bench, Humanity’s Last Exam, or Math Olympiad problems – things that have even made it into the popular press as measures of AI progress. Those are all supposed to be proxies for how valuable this technology is in the real world. My measure is to go into the real world and ask the users directly.

Using these insights, I do a lot of fundamental research on what’s necessary to make this work better in those real settings. So I’m doing quite a bit of reinforcement learning these days, which turns out to work reasonably well for solving these problems. I spend a lot of time trying to figure out how to build better benchmarks.

It sounds like a key problem to solve as AI systems become more embedded into people’s lives.

Yes, and it’s very hard for computer scientists to grasp, because now we’re not just going from science in the theoretical and formal world to science in the empirical world – we’re getting to the question of what makes systems good for humans. And so that is another conceptual leap for computer scientists that may take some time to sink in. We’re moving closer to the boundary where we leave mathematics behind and get closer to the real world. Each of those leaps tends to take time for computer science as a scientific community to get its head around.

It’s great to hear that you’re working on this aspect of AI, because I think it’s been neglected. So much of the focus is on building bigger, on the Silicon Valley ethos of ‘move fast and break things’, without really accounting for what makes human life better. It’s an important question, when the level of technology we have already far outstrips how much we can use it.

I would frame it even a little bit differently. If you are a capitalist and all you want is to make the most money, I don’t think the bottleneck to that is how smart the models are. I think AI is smart

[truncated for AI cost control]