Spatial Intelligence is the Next Frontier in AI
Warning: This text was generated using AI and has not yet been reviewed by humans
My entire career is going after problems that are just so hard, bordering on delusional. To me, AGI will not be complete without spatial intelligence. And I want to solve that problem. I just love being an entrepreneur. Forget about what you have done in the past. Forget about what others think of you. Just hunker down and build. That is my comfort zone.
So, I'm super excited here to have Dr. Fei-Fei Li. She has such a long career in AI. I'm sure a lot of you know her, right? Raise your hand. I know you too. She's been named the godmother of AI. One of the first projects that you created was ImageNet in 2009, 16 years ago. Oh my god. Don't remind me of that. That has over 80,000 citations and it really kicked off one of the legs of the stool for AI, which is the data problem. Tell us about how that project came about. It was pretty pioneering work back then.
Yeah. Well, first of all, Diana and Gary and everybody, thanks for inviting me here. I'm so excited to be here because I feel like I'm just one of you. I'm also an entrepreneur right now. I just started a small company. So, very excited to be here. ImageNet was, yeah, you're right. We actually conceived that almost 18 years ago. Time really flies. I was a first-year assistant professor at Princeton. Oh, wow. Hi. Hi, Tigers. Yeah. And the world of AI and machine learning was so different at that time. There was very little data. Algorithms, at least in computer vision, did not work. There was no industry. You know, as far as the public was concerned, the word AI doesn't exist. But there is still a group of us starting from the founding fathers of AI, right? John McCarthy and then we go through people like Geoff Hinton. I think we just had an AI dream. We really, really want to make machines think and work. And with that dream, my own personal dream was to make machines see because seeing is such a cornerstone of intelligence. Visual intelligence is not just perceiving—it's really understanding the world and doing things in the world.
So I was obsessed with the problem of making machines see, and as I was obsessively developing machine learning algorithms at that time, we did try neural networks, but it didn't work. We pivoted to Bayes nets, to support vector machines, whatever it was. But one problem always haunted me, and it was the problem of generalization. If you're working in machine learning, you have to respect that generalization is the core mathematical foundation or goal of machine learning. And in order to generalize, these algorithms need data. Yet no one had data at that time in computer vision, and I was the first generation of grad student who was starting to dabble into data because I was the first generation of graduate student who saw the internet—the big internet of things.
So fast forward, around 2007-ish, my student and I decided that we have to take a bold bet. We have to bet that there needs to be a paradigm shift in machine learning, and that paradigm shift has to be led by data-driven methods, and there was no data. So we're like, "Okay, let's go to the internet, download a billion images"—that's the highest number we could get on the internet—"and then just create the world's entire visual taxonomy and use that to train and benchmark machine learning algorithms." And that was why ImageNet was conceived and came to life. And it took a while until there were algorithms that were promising. It wasn't until 2012 when AlexNet came out, and that was the second part of the equation with getting to AI—was getting the compute and throwing enough at it and algorithms.
Tell us about what was that moment where you started to see, "Oh, you seeded it with data, and now people started—the community started to figure more things out for AI." Right. So between 2009, we published this tiny little CVPR poster in 2009, to 2012, the AlexNet, there were three years that we really believed that data will drive AI, but we had very little signal in terms of if that was working. So we did a couple of things. One is we open-sourced. We believed from the get-go we have to open-source this to the entire research community for everybody to work on this. The other thing we did is we created a challenge because we want the whole world's smartest students and researchers to work on this problem. So that was what we call the ImageNet challenge. So every year we release a testing dataset. Well, the whole ImageNet is there for training, but we release testing, and then we invite everybody openly to participate, and then the first couple of years was really setting the baseline. You know, the performance was in the 30% error rate. It wasn't zero, or I mean it wasn't completely random, but it wasn't that great.
But the third year, 2012, and I wrote this in a book that I published, I still remember it was around the end of summer that we were taking all the results of the ImageNet challenge and running them on our servers. And I remember it was late one night I got a ping from my graduate student. I was home, and it said, "We got a result that really, really stands out, and you should take a look." And we looked into it. It was a convolutional neural network, something. It wasn't called AlexNet at that time. That team, Geoff Hinton's team, was called "SuperVision." It was a very clever play on the word "super" as well as supervised learning. So "SuperVision," and we looked at what SuperVision did. It was an old algorithm. The convolutional neural network was published in the 1980s. There were a couple of tweaks in terms of the algorithm, but it was pretty surprising at the beginning for us to see that there was such a step change. And of course, you know, the rest of the history you all know: we presented this at the ImageNet challenge workshop at that year's ECCV in Florence, Italy, and Alex Krizhevsky came and many people came. I remember Yann LeCun also came, and now the world knows this moment as the ImageNet challenge AlexNet moment.
I do want to say that it's not just convolutional neural network—it was also the first time that two GPUs were put together by Alex and his team and were used for the computing of deep learning. So it was really the first moment of data, GPUs, and neural networks coming together.
Now, following this trend of the arc of intelligence for computer vision, ImageNet was really the seed to solve the concept of object recognition. Then, right after that, it started to also—AI got to the point that could solve the scenes, right, because you had a lot of the work with your students like Andrej Karpathy being able to describe scenes. Tell us about that transition from objects to scenes.
Yeah. So ImageNet was solving the problem of you present—you're presented with an image, and then you call out objects. There's a cat, there's a chair, and all that. That's a fundamental problem in visual recognition. But ever since I was a graduate student entering the field of AI, I had a dream. I thought it was a hundred-year dream, which is storytelling of the world—that when humans open their eyes, you know, imagine you just open your eye in this room. You don't just see person, person, chair, chair, chair. You actually see a conference room, you know, with screen, with stage, with people, with the crowd, the cameras. You actually can describe the entire scene. And that's a human ability that is at the foundation of visual intelligence, and it's so critical for us to use in terms of our everyday life.
So I really thought that problem will take my entire life. I literally, when I graduated as a graduate student, I told myself, "On my deathbed, if I can create an algorithm that can tell the story of a scene, I've succeeded." That was how I thought my career would be. But AlexNet moment came, deep learning took off, and then when Andrej and then later Justin Johnson entered my lab, we start to see signals of natural language and vision start to collide, and then Andrej and I proposed this problem of captioning images or storytelling. And long story short, around 2015, Andrej and I published a series of papers that was among the first, with a couple of concurrent papers, of making literally a computer that captioned an image. It was—I almost felt like, "What am I going to do with my life?" That was my lifelong goal, you know? It was such an incredible moment for both of us.
And you know, last year I gave a TED talk, and I actually used something that Andrej tweeted a couple of years ago around the time he finished image captioning work. That was pretty much his dissertation. I actually joked with him. I said, "Hey, Andrej, why don't we do the reverse? Take a sentence and generate an image." And of course he knew I was joking, and he said, "Haha, I'm out of here." The world was just not ready. But now, fast forward, now we all know generative AI, you know, now we can take a sentence and generate beautiful pictures. So this moral of the story is AI has seen an incredible growth, and personally I feel I'm the luckiest person in the world because my entire career started at the very beginning of the end of AI winter, the beginning of AI starting to take off, and so much part of my own work, my own career, is part of this change or helped with this change. So I feel so fortunate and lucky and, in a way, proud.
And I think the wildest thing—even to achieve your lifelong dream of describing scenes and even generating them with diffusion models—you're actually dreaming bigger because the whole arc of computer vision went from objects to scenes, and now this concept of world, and you actually decided to move from academia, being a professor, to now being the founder and CEO of World Labs. Tell us about what world is. It's even harder than scenes and objects.
Yeah, it is. It is kind of wild. So, of course, you all know the past. It's really hard to summarize the past five or six years. For me, we're living in such a civilizational moment of this technology's progress, right? As a computer vision scientist, I'm seeing this incredible growth, you know, from ImageNet to image captioning to image generation using some of the diffusion techniques. While this is happening in a very exciting way, we also have another extremely exciting thread, which is language, which is LLMs. In November 2022, ChatGPT blasted open the door of truly working generative models that can basically pass the Turing test and all that. So this becomes very inspirational even for someone as old as me to really think audaciously about what's next.
And I have a habit as a computer vision scientist. A lot of my inspiration actually comes from evolution as well as brain science. I find myself in many moments of my career where I'm looking for the next north star problem to solve. I ask myself, "What is it that evolution has done or that brain development has done?" And there's something that's really important to notice or to appreciate. The development of human language in evolution took about, if you're super generous, let's just say it took about 300 to 500 thousand years, less than a million years. That's the length of evolution that it took to develop human language. And pretty much humans are the only animals that have sophisticated language. We could argue about animal language, but really language in its totality, in terms of being a tool of communication, reasoning, abstraction, it's really humans. So that took less than even half a million years.
But think about vision. Think about the capability of understanding 3D world, figuring out what to do in this 3D world, navigate the 3D world, interact with the 3D world, comprehend the 3D world, communicate the 3D world. That journey took evolution 540 million years. The first trilobite developed a sense of vision underwater 540 million years ago. And since then, really, vision was the reason that set off this evolutionary arms race. Before vision, animals were simple for—you know, the half billion years before vision, there's just simple animals. But the next half billion years, 540 million years, because of the capability of seeing the world, understanding the world, evolutionary arms race began, and animal intelligence just started to race each other.
So for me, solving the problem of spatial intelligence—to understand the 3D world, to generate the 3D world, to reason about the 3D world, to do things in the 3D world—is a fundamental problem of AI. To me, AGI will not be complete without spatial intelligence, and I want to solve that problem. And that involves creating world models. World models that go beyond flat pixels. World models that go beyond language. World models that truly capture the 3D structure and the spatial intelligence of the world.
And the luckiest thing in my life is no matter how old I am, I always get to work with the best young people. So I founded a company with three incredible young but world-class technologists: Justin Johnson, Ben Mildenhall, and Christoph Lassner, and we are just going to try to solve, in my opinion, the hardest problem in AI right now.
Which is incredible talent. I mean, Chris, he was the creator of Pulsar, which was the initial seed before Gaussian splats that do a lot of differentiable rendering. There's Justin Johnson, your former student, who really has this super system engineering mind that got real-time neural style transfer. Then you got Ben, who was the author of the NeRF paper. So this is a super crack team, and you need such a crack team because, as we were chatting a bit about, vision is actually harder than LLMs to some extent. Maybe this is a controversial thing to say. Because LLMs are basically 1D, right? But you're talking about understanding a lot of the 3D structures. Why is this so hard, and why is it behind language research?
Yeah. No, I really appreciate that you empathize how hard our problem is. Yeah. So, language is fundamentally 1D, right? Syllables come in sequence. I mean, this is why sequence-to-sequence, sequence modeling is so classic. There's something else that is language that people don't appreciate. Language is purely generative. There's no language in nature. You don't touch language. You don't see language. Language literally comes out of everybody's head, and that's a purely generative signal. Of course, you put it on a piece of paper, it's there. But the generation, the construction, the utility of language is very, very generative.
The world is far more complex than that. First of all, the real world is 3D. And if you add time, it's 4D. But just let's confine ourselves within space. It's fundamentally 3D. So that by itself is a much more combinatorially harder problem. Second, the sensing—the reception of the visual world is a projection. Whether it's your eye, your retina, or a camera, it's always collapsing 3D to 2D. And you have to appreciate how hard it is. It's mathematically ill-posed. So you have to—this is why humans and animals have multi-sensors. And then you have to solve that problem. And third, the world is not purely generative. Yes, we could generate virtual 3D world. It still has to obey physics and all that. But there is also a real world out there. You are now subtly dialing between generation and reconstruction in a very fluid way. And the user behavior, the utility, the use cases are very different. If you dial all the way to generation, we can talk about gaming and metaverse and all that. If you dial all the way to real world, you're—we're talking about robotics and all that. But all this is on a continuum of world modeling and spatial intelligence.
So it's—and of course, the elephant in the room is there's a lot of data on the internet for language, and where is the data for spatial intelligence? You know, it's all in our head, of course, but it's not as easily accessible as language. So these are the reasons it's so hard, but frankly it excites me because if it's easy, somebody else has solved it, and my entire career is going after problems that are just so hard, bordering on delusional, and I think this is the delusional problem. Thank you for supporting that.
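The point above that projection collapses 3D to 2D and is mathematically ill-posed can be illustrated with a minimal sketch. It assumes an idealized pinhole camera with focal length 1, which the talk does not specify; the function names `project` and `depth_from_disparity`, the second camera, and all numbers are hypothetical and chosen only for illustration.

```python
import math

def project(point, focal_length=1.0):
    """Pinhole projection: map a 3D point (X, Y, Z) in the camera frame to 2D (x, y)."""
    X, Y, Z = point
    return (focal_length * X / Z, focal_length * Y / Z)

def depth_from_disparity(point, baseline, focal_length=1.0):
    """Simulate a second (hypothetical) camera shifted by `baseline` along X,
    then recover depth using only the two 2D observations."""
    x_left = project(point, focal_length)[0]
    shifted = (point[0] - baseline, point[1], point[2])  # same point, second camera's frame
    x_right = project(shifted, focal_length)[0]
    disparity = x_left - x_right  # equals focal_length * baseline / Z
    return focal_length * baseline / disparity

near = (1.0, 2.0, 4.0)
far = tuple(2.5 * c for c in near)  # same viewing ray, 2.5 times farther away

# One view: both points land on the same image coordinates, so depth is lost.
same_image_point = all(
    math.isclose(a, b) for a, b in zip(project(near), project(far))
)
print(same_image_point)

# Two views: the disparity differs, so the two depths can be told apart.
print(depth_from_disparity(near, 0.5), depth_from_disparity(far, 0.5))
```

With a single view the two points are indistinguishable; with a second viewpoint the recovered depths come out at roughly 4 and 10, which is one way to read the remark that humans and animals rely on more than one sensor.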
And even thinking about this from first principles, the human brain has a lot more in the visual cortex and amount of neurons that process visual data as opposed to language. How does that translate into the model architectures? Are they very different from LLMs from what you're kind of finding out, right?
Yeah, that's actually a really good question. And I mean, there are still different schools of thought out there, right? A lot of what we see in LLMs is really riding the scaling law all the way to a happy ending, and you can almost just brute force self-supervision all the way. Constructing a world model might be a little more nuanced. The world is more structured. There might be signals that we need to use to guide it. You can call it the shape of a prior, you can call it supervision in your data, whatever it is. I think that these are some of the open questions that we have to solve.
But you're right, and also if you think about human—first of all, we don't have all the answers even to human perception, right? How does 3D work in human vision is not a solved problem. We know mechanically the two eyes had to triangulate information, but even after that, where is the mathematical model? And we're not that great—humans are not that great as 3D animals. So there is a lot that is to be answered. So we are definitely at World Labs. I'm just counting on—really counting on one thing. I'm counting on we have the smartest people in the pixel world to solve this.
Is it fair to say that what you're building at World Labs is these whole new foundation models where the output are 3D worlds? And what are some of the applications that you're envisioning? Because I think you listed everything from perception to generation. So there's this always this tension between generative models and discriminative models. So where—what do these 3D worlds do?
Yeah. So I'm not going to be able to talk too much about the details of World Labs per se, but in terms of spatial intelligence, that's what also excites me. Just like language, the use case is so huge—from creation, which you can think about designers, architects, industrial designers, as well as just artists, 3D artists, game developers—from creation all the way to robotics, robotic learning. The utility of spatial intelligence model or world models is really, really big. So, and then there are many related industries from marketing to entertainment to even metaverse. I'm actually really, really excited by metaverse. I know so many people are kind of still like—it's still not working. I know it's still not working. That's why I'm excited because I think the convergence of hardware and software will be coming. So that's also another great use case down the road.
I'm personally very excited that you're solving metaverse. I gave it a try in my previous company. So I'm so excited that you're doing that now.
Yeah. Well, I think there's more signal. I mean, I do think hardware is part of the hurdle, but you know, you need content creation, and metaverse content creation needs world models.
Let's switch gears a little bit. So maybe to some of the audience, they might find your transition from going from academia to now being a founder CEO to be sudden, but you actually have a remarkable journey through your whole life. This is not your first time you've gone zero to one. You were telling me about how you immigrated to the US and you didn't speak any English in your teens, and you even ran a laundromat for a good number of years. Tell us about how all those skills shaped who you are now.
Right. I'm sure you guys are here trying to listen to how to start a laundromat. Yeah, that was when you were 19, right? Yeah, I was 19, and that was out of desperation. So I had no means of supporting my family, my parents, and I need to go to college to be a physics major at Princeton. So I started a dry cleaning shop, and in Silicon Valley language, I fundraised. I was the founder CEO. I was also the cashier and all the other things, and I exited. So after seven years—all right, you guys are very kind. I've never got claps for my laundromat, but thank you. So, but anyway, I think Diana's point, especially to all of you, I look at you. I'm so excited for you because you're like literally half my age or even, you know, maybe 30% of my age, and you're so talented. Just do it. Don't be afraid.
You know, all my entire career—of course, I did laundromat, but even as a professor, I chose a couple of times I chose to go to departments where I was the first computer vision professor, and that was against a lot of advice. You know, as a young professor, you should go to a place where there's a community and senior mentors. Of course, I would love to have senior mentors, but if they're not there, I still have to blaze my trail, blaze my way, right? So I wasn't afraid of that. And then I did go to Google to learn a lot about business in Google Cloud and B2B and all those. And then I started a startup within Stanford because around 2018, AI was not only taking over the industry—AI became a human problem. Humanity will always advance our technology, but we cannot lose our humanity. And I really care about creating a beacon of light in the progress of AI and try to imagine how AI can be human-centered, how we can create AI to help humanity. So I went back to Stanford and created Human-Centered AI Institute and ran that as a startup for five years. Probably some people were not too happy I ran it as a startup for five years at a university, but I was very proud of that.
So in a way, I think I just love being an entrepreneur. I love the feeling of ground zero, like standing on ground zero. Forget about what you have done in the past. Forget about what others think of you. Just hunker down and build. That is my comfort zone, and I just love that.
The other really cool thing about you, on top of all the awesome things you've done, is that you advised a lot of legendary researchers like Andrej Karpathy, Jim Fan who's at Nvidia, Jia Deng who's your co-author for ImageNet. They all went on to have these incredible careers. What really stood out about them when they were students? Any advice for the audience on how you could tell, "Ah, this person is gonna change the field of AI"?
So first of all, I'm the lucky one. I don't—I think I owe more to my students than the other way around. They really make me a better person, better teacher, better researcher, and having worked with so many—like you said—legendary students is really the honor of my life. So, they're very, very different. Some of them are just pure scientists trying to hunker down and solve a scientific problem. Some of them are industrial leaders. Some of them are, you know, the greatest disseminator of AI knowledge. But I think there is one thing that unifies them, and I would encourage every single one of them—of you—to think about this. I also—for those founders who are hiring, this is also my hiring criteria—is I look for intellectual fearlessness. I think it doesn't matter where you come from. It doesn't matter what problem we're trying to solve. That courage, that fearlessness of embracing something hard and go about it and be all in and trying to solve that in however way you want is really a core characteristic of people who succeed. I learned this from them, and I really look for young people who have that, and then that as a CEO at World Labs, in my hiring, I look for that quality.
So you're hiring a lot for World Labs too. So you're looking for that same trait, right? Yes. I get permission from Diana to say that we're hiring. So yes. So we are hiring a lot. We are hiring engineering talents. We're hiring product talents. We're hiring 3D talents. We're hiring generative model talents. So, if you feel you're fearless and you're passionate about solving spatial intelligence, talk to me or come to our website.
Cool. We're going to open it up for questions for the next 10 minutes.
Question 1: Hi Fei, thank you for your talk. I'm a big, big, big fan, and yeah, so my question is more than two decades ago you worked on visual recognition. I am—I want to start my PhD. What should I work on so I become a legend like you are? I want to give you a thoughtful answer because I can always say, "Do whatever excites you." So first of all, I think AI research has changed because academia—if you're starting a PhD, you are in academia. Academia no longer has most of the AI resources. It's very different from my time, right? The chip, the compute, and the data are really low in terms of resourcing academia, and then there are problems that industry can run a lot faster. So as a PhD student, I would recommend you to look for those north stars that are not on a collision course of problems that industry can solve better using better compute, better data, and team science. But there are some really fundamental problems that we can still identify in academia that it doesn't matter how many chips you have, you can make a lot of progress, you know.
First of all, interdisciplinary AI to me is a really, really exciting area in academia, especially for scientific discovery. There's just so many disciplines that can cross AI. I think that's a big area that one could go to. On the theoretical side, I find it fascinating that the AI capability has 100% outrun theory. We don't know how—you know, we don't have explainability. We don't know how to figure out the causality. There's just so much in the models we don't understand that one could push forward. And you know, the list can go on. In computer vision, there's still representational problems we haven't solved, and also, you know, small data—that's another really interesting domain. And so yeah, these are the possibilities. Thank you so much, Fei.
Question 2: Thank you, Professor Li, and congratulations again on your honorary doctorate from Yale. I was honored to be there to witness that moment one month ago, and my question is, in your perspective, will AGI emerge more likely as a single unified model or as a multi-agent system? The way you ask this question already has two kinds of definition. One definition is more theoretical, which is to define AGI as if there is an IQ test that one passes that defines AGI. The other part of your question is much more utilitarian. Is it functional? If it's agent-based, what tasks can it do? I struggle with this definition of AGI to be honest. Here's why. The founding fathers of AI who came together in 1956 at Dartmouth, you know, John McCarthy and Marvin Minsky among them, they wanted to solve the problem of machines that can think, and that's a problem that Alan Turing also put forward a few years earlier, 10 years or whatever earlier than them. And that statement is not a narrow AI. It's a statement of intelligence. So I don't really know how to differentiate that founding question of AI versus this new word AGI. To me, they're the same thing. But I get it that the industry today likes to call AGI as if that's beyond AI. And I struggle with that because I don't know what exactly makes AGI different from AI. If we say today's AGI-ish system performs better than the narrower AI systems in the '70s, '80s, '90s, or whatever, I think that's right. That's just the progression of the field. But fundamentally, I think the science of AI is the science of intelligence, which is to create machines that can think and do things as intelligently as or even more intelligently than humans. So I don't know how to define AGI. So without defining it, I don't know if it's monolithic. If you look at the brain, it's one thing, you know, you can call it monolithic, but it does have different functionalities. There's Broca's area for language. There's visual cortex, there's motor cortex. So I don't really know how to answer that question.
Question 3: Hi, my name is Yashna, and I just want to say thank you. I think it's really inspiring to see a woman playing a leading role in this field, and as a researcher, educator, and entrepreneur, I wanted to ask what type of person do you think should pursue graduate school in this rapid rise of AI? That's a great question, and that's a question even parents ask me. I really think graduate school is the four or five years where you have burning curiosity. You're led by curiosity, and that curiosity is so strong that there's no better other place to do it. It's different from a startup because startup is not just—you have to be a little careful. Startup cannot be just led by curiosity. Your investors will be mad at you. A startup has a more focused commercial goal, and some part of it is curiosity, but it's not just curiosity. Whereas for grad school, that curiosity to solve problem or to ask the right questions is so important that I think those going in with that intense curiosity would really enjoy the four or five years even if the outside world is passing by at the speed of light. You'll still be happy because you're there following that curiosity.
Question 4: I first wanted to say thank you for your time, thank you for coming out to speak to us. You mentioned that open-sourcing was a big part of the growth from ImageNet, and now with the recent release and growth of large language models, we've seen organizations taking different approaches with open source, with some organizations staying fully closed source, some organizations fully releasing their entire research stack, some being somewhere in the middle, open-sourcing weights or having restrictive licenses and things of that nature. So I wanted to ask what do you think of these different approaches to open source, and what do you believe the right way to go about open source as an AI company is? I think the ecosystem is healthy when there are different approaches. I'm not religious in terms of you must open-source or you must close-source. It depends on the company's business strategy. And for example, it's clear why Facebook, Meta, wants to open-source, right? Right now their business model is not selling the model yet. They're using it to grow the ecosystem so that people come to their platform. So open source makes a lot of sense. Whereas for another company that is really monetizing, even when monetizing, you can think about an open-source tier and a closed-source tier. So I'm pretty open to that. At a meta level, I think open source should be protected. I think efforts in open source, both in the public sector like academia and in the private sector, are so important. It's so important for the entrepreneurial ecosystem. It's so important for the public sector that I think that should be protected. It shouldn't be penalized.
Question 5: Hi, my name is Carl. I flew in from Estonia. I have a question about data. So you called very well the shift in machine learning towards data-driven methods with ImageNet. Now that you're working on world models, and you mentioned that we don't have this spatial data on the internet, it exists only in our heads. How are you solving this problem? What are you betting on? Are you collecting this data from the real world? Are you doing synthetic data? Do you believe in that, or do you believe in good old priors? Thanks. You should join World Labs, and I'll tell you. Oh, it's a good one. Look, as a company, I'm not going to be able to share a lot, but I think it's important to acknowledge that we're taking a hybrid approach. It is really important to have a lot of data, but also have a lot of quality data. Data at the end of the day, there is still garbage in, garbage out if you're not careful with the quality of data.
We'll do one last question.
Question 6: Hi, Dr. Li. My name is Annie, and thank you very much for speaking with us. So in your book, "The Worlds I See," you talk about the challenges you faced as an immigrant girl and woman in STEM. I'm curious to know if there's a time that you felt the moment of being a minority in the workplace, and if so, how did you manage to overcome this or persuade others? Thank you for that question. I want to be very, very careful or thoughtful in answering you because we all come from different backgrounds, and how each of us feels is very unique. You know, it almost doesn't even matter what the big categories are. All of us have moments when we feel we're the minority or the only person in the room. So of course I felt that way. Sometimes it's based on who I am. Sometimes it's based on my idea. Sometimes it's just based on, I don't know, the color of my shirt, whatever that is. But this is where I do want to encourage everybody. Maybe it is because since I was young, coming to this country, I kind of have experienced it. It is what it is. I am an immigrant woman. I almost developed a capability to not over-index on that. I'm here just like every one of you. I'm here to learn or to do things or to create things. Thank you. That was a great answer. And really, all of you, you're about to embark on something or in the middle of embarking on something, and you're going to have moments of weakness or strangeness. I feel this every day, especially in startup life. Sometimes I'm like, "Oh my god, I don't know what I'm doing." Just focus on doing it. Gradient descend yourself to the optimized solution.
Yeah. All right. That's a great way to end. Thank you, Dr. Li.