Data Is Unreasonably Effective, and There Is Plenty of It to Explore
BLOG: Heidelberg Laureate Forum
The first lecture at the 11th Heidelberg Laureate Forum, which brings together some of the top scientists in mathematics and computer science each year, kicked off in style with Alexei Efros, the recipient of the 2016 ACM Prize in Computing. Efros is a leading figure in the fields of computer vision and artificial intelligence, known for his groundbreaking work that bridges the gap between machine learning and visual data. At the Heidelberg Laureate Forum, Efros talked about what he calls the “unsung hero” of the AI revolution: data.
The AI “Stone Soup”
Imagine a pot of stone soup – a folk tale as old as time. In the tale, a few hungry travelers promise to make a delicious soup with just stones and water, convincing skeptical villagers to contribute an onion here, a potato there, until everyone enjoys a rich, hearty meal – while still thinking they are eating “stone soup”. What started with stones became something far more through collaboration (and a little bit of trickery).
In a sense, today’s AI development is like that stone soup. We’re promised astonishing results from clever algorithms – transformers, deep learning, neural networks – but the true flavor of AI emerges only when you add vast amounts of data to the pot. This analogy, shared by Efros during the lecture, illustrates a fundamental truth about modern artificial intelligence: The algorithms, while important, cannot do much without the data that powers them.
For much of Efros’ career, the importance of data was downplayed in favor of algorithms.
“In the old days, data didn’t really get much respect,” the laureate explained. “It was all about the algorithms. You wanted to publish papers, you needed to come up with new algorithms. That kind of mentality has not served us well in this AI-related field.”
A Turning Point in Computer Vision
In the 1990s and early 2000s, computer vision research was dominated by the pursuit of elegant, innovative algorithms, but Efros realized that raw data – images, videos, and other inputs – deserved far more attention. Nowadays, the power of data is no longer contested. “We’re all on the same page now,” Efros remarked, though he noted that recent discussions about the future of AI continue to underplay the central role of data.
The turning point came in the late 1990s, when computers began to achieve breakthroughs in facial recognition. This had been a long-standing challenge and, remarkably, it was solved with relatively low processing power. At the time, three different algorithms – each with roughly equal performance – rose to prominence. Yet, although they all won awards and are all widely cited, only one (Viola and Jones, 2001) is still widely remembered in computer science textbooks today. The paper has close to 30,000 citations.
What made the difference? It was not the cleverness of the algorithm, Efros says. It was the realization that, in addition to images containing faces, the algorithms also needed negative examples (images without faces) to perform well. In other words, the real breakthrough came from improving the dataset, not the algorithm. This is an extremely important lesson because, “all things being equal, we prefer to credit our own cleverness,” quipped the laureate. Yet in this case, it was better data, not a better algorithm.
The Unreasonable Effectiveness of Data
Efros cited a seminal paper by Google researchers, titled “The Unreasonable Effectiveness of Data”, which argued that for certain problems, especially those involving messy, complex systems like psychology, genetics, or AI, the sheer volume of data can outweigh the influence of any specific algorithm.
This approach has proven pivotal for the field of machine learning, but it was not always evident. The main idea is that some parts of our world can be explained by elegant mathematics – but for others, like psychology, genetics, or economics, it is much more difficult. These fields are more chaotic, influenced by evolution, randomness, and vast variability, and algorithms often struggle to make sense of them. Yet, with enough data, even relatively simple algorithms can achieve impressive results.
Consider the task of simulating smoke for a computer graphics project. While you can write complex equations to simulate the physics of smoke, it is often far easier to simply gather thousands of examples of real smoke in different scenarios and let the computer learn from those. The more data you have, the more your AI can capture the apparent unpredictability of the real world.
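To make the "simple algorithm plus more data" idea concrete, here is a minimal sketch in Python. It is not taken from the lecture or from the Google paper: the wavy decision boundary, the function names, and the sample sizes are all assumptions chosen for illustration. It trains a plain nearest-neighbor classifier on ever larger synthetic samples and reports how accuracy changes.

```python
import math
import random


def true_label(x, y):
    # Hypothetical "messy" ground truth: a wavy boundary invented for this sketch.
    return 1 if y > 0.5 + 0.3 * math.sin(6.0 * x) else 0


def sample(n, rng):
    # Draw n labeled points uniformly from the unit square.
    points = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        points.append((x, y, true_label(x, y)))
    return points


def nearest_neighbor_label(train, x, y):
    # Copy the label of the closest training point (squared Euclidean distance).
    best = min(train, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
    return best[2]


def accuracy(train, test):
    # Fraction of test points whose predicted label matches the true label.
    hits = sum(nearest_neighbor_label(train, x, y) == label for x, y, label in test)
    return hits / len(test)


rng = random.Random(0)          # fixed seed so the run is repeatable
test_set = sample(300, rng)     # held-out points for evaluation
pool = sample(2000, rng)        # training pool; we use growing prefixes of it

results = {n: accuracy(pool[:n], test_set) for n in (10, 100, 2000)}
for n, acc in results.items():
    print(f"{n:>5} training points -> accuracy {acc:.2f}")
```

The algorithm never changes between runs; only the amount of data does. On a toy problem like this, that alone is typically enough to move the classifier from rough guessing to near-perfect labels.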
Good Data Works with Simpler Algorithms as Well
Efros shared examples from his own research to show how data can often outperform sophisticated algorithms. In one project from 2007, he and his team downloaded 2 million images from Flickr and used a straightforward approach to perform tasks like filling in missing parts of an image or guessing where a photograph was taken. The method worked surprisingly well, not because it involved complex algorithms, but because it relied on vast amounts of data.
He also worked on the geo-localization of photos and again showed that, with vast amounts of data, even relatively simple algorithms do well. This is where it got interesting.
A year later, in 2008, researchers at Google attempted to improve upon Efros’ image localization work using neural networks. While their neural networks did achieve better results, it was not purely because of the algorithmic advances. They had also increased the size of the training dataset.
Efros was (conveniently) a reviewer for this paper, and he asked the authors whether a simpler algorithm would produce comparable results with a similarly large dataset – and it did. When they compared the two approaches using the same amount of data, the deep neural network performed similarly to a simple nearest-neighbor method.
“If you keep the data size constant, the fancy shmancy neural network was doing no better than simple nearest neighbor,” Efros said. “I’m not saying ‘forget neural networks’. Of course, neural networks have many other properties and they can do many more things. But in this particular setting, it was really data that was doing all the main lifting.”
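For readers unfamiliar with the term, a nearest-neighbor method simply looks up the most similar example it has already seen and copies its answer. The sketch below illustrates that idea for photo localization in Python. It is an assumption-laden toy, not the method from Efros' work or the Google paper: the three-number "feature vectors", the `guess_location` helper, and the coordinates are all made up for illustration, whereas a real system would compute descriptors from millions of images.

```python
def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length feature vectors.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def guess_location(query_features, database):
    # database: list of (features, (latitude, longitude)) pairs.
    # Return the coordinates attached to the most similar stored photo.
    _, location = min(
        database,
        key=lambda entry: squared_distance(entry[0], query_features),
    )
    return location


# Hypothetical database: invented feature vectors with approximate coordinates.
database = [
    ((0.9, 0.1, 0.2), (48.86, 2.35)),     # a photo taken in Paris
    ((0.1, 0.8, 0.3), (40.71, -74.01)),   # a photo taken in New York
    ((0.2, 0.2, 0.9), (35.68, 139.69)),   # a photo taken in Tokyo
]

query = (0.85, 0.15, 0.25)  # features of a new, unlabeled photo
print(guess_location(query, database))
```

There is nothing to train here: the quality of the guess depends entirely on whether the database already contains a photo that resembles the query, which is why adding data helps such a method so directly.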
Can We Get Enough Data?
AI systems have become remarkably proficient with text, but we are starting to see the end of quality data. Large Language Models have already incorporated a big part of the quality text data humankind has produced. Now, much of the text used for this training is derived from social platforms like Reddit or Twitter/X, which, as you may expect, is not exactly quality data.
“In text, I think we have already run out [of data],” Efros said in a subsequent press conference. “In text, we had 2,000 years of human culture and we’ve used it all up.”
However, the laureate sees potential in areas beyond text-based AI. The truly exciting frontier lies in visual data, video, and robotics – fields where there is still an abundance of untapped data.
“I’m much more optimistic about low-level sensory data, images, videos, tactile for robots; those I think we have in infinite supply.”
In the press conference, we also asked Efros whether he believes AI will produce any innovation or true novelty in literature or the arts. In the games of chess and Go, algorithms have produced novel, unprecedented strategies – but that is unlikely to happen in the arts, the laureate says.
“I think the reason why there are some novel moves in chess and other games is because it is fundamentally a very closed system, a simple system. There are a lot of possibilities, but the rules are very well-defined and very simple. I would be surprised if we will have something like this for literature or art or music, because the search space is, in fact, infinite. So I think we are not going to see a computer artist anytime soon.”
After all, even data can only get you so far. There is still plenty of room for algorithmic innovation and for adding new data to existing algorithms, but at the end of the day, some things remain inherently human. For now, at least.
Andrei Mihai wrote (26. Sep 2024):
> The first lecture at the 11th Heidelberg Laureate Forum […] Alexei Efros, the recipient of the 2016 ACM Prize in Computing.
> […] in the late 1990s, when computers began to achieve breakthroughs in facial recognition. […] Efros says. It was the realization that, in addition to images containing faces, the algorithms also needed negative examples (images without faces) to perform well. In other words, the real breakthrough came from improving the dataset, not the algorithm.
Rather than plainly including images without faces as well as images containing faces in the (“training”-)data set, without any further annotation or pre-processing,
the decisive improvement was presumably due to each picture being also correctly evaluated and accordingly marked as “containing a face”, or “not containing a face” (or perhaps even as “indeterminate regarding containing a face” as well).
It would be comforting to know that Alexei Efros himself, at least, appreciates this point; even if he failed to communicate it to some of his audience, in this lecture.
p.s.
> “The Unreasonable Effectiveness of Data” [ https://static.googleusercontent.com/media/research.google.com/en/pubs/archive/35179.pdf ]
This particular link results in:
404. That's an error.
An article by this title, and by this exact file name, can easily be retrieved elsewhere.
Dear @Frank, thank you for pointing that out. We have updated the link.
What specific qualities do you believe make human creativity unique and irreplaceable in fields like literature and art, compared to the structured strategies seen in games like chess and Go?