Question 1 (Chelsea): Thank you for all your work on robot learning; it's all really impressive. I have two questions, especially regarding the post-training part you mentioned. First, you mentioned that in post-training the most important part is having high-quality action data, and I'm wondering what the components of that would be. Second, what role do you think RL will play in post-training?
Answer: I think a lot of it comes down to consistency of the data and the strategy being followed, and whether the data completes the task efficiently and with a reliable strategy. On the second question, I think that reinforcement learning can play a very large role in post-training. Online data from the robots, which reinforcement learning allows you to use, can give robots a much higher success rate and also make them faster than if they're trained with imitation learning alone.
Question 2: Your work is really fascinating, and there is no doubt that it will have a lot of impact in the future. But can I ask, at this stage, how you find funding? Because honestly, I can't imagine how hard it must be to convince people to invest in a robot that folds clothes and deals with the dishes.
Answer: It's a good question. First, I'll mention that we aren't just focused on applications in the home. We really want to solve the broader problem of physical intelligence, and we've been starting with those applications because they're relatively easy to make progress on. But we've also been doing tasks like inserting an Ethernet cable, which I showed in the talk, as well as constructing a cardboard box.
Generally, I think this sort of problem has a ton of potential for making an impact in all sorts of realms, not just domestic tasks. And even in domestic tasks, I think there's a huge market for this kind of technology.
We ourselves haven't had much trouble with fundraising, and a lot of robotics companies recently have also done a great job. There's actually a lot of excitement around this sort of technology because things are really starting to work. I started working on this technology more than 10 years ago at this point, and things really weren't working then. So I think the field is starting to mature and actually be ready for the real world. There's a lot more work to do, but generally there seem to be a lot of people excited about this technology and eager to actually put funds behind it.
Question 3 (Charu Thomas): I have two questions, one broader and one more technical. The technical one: VLAs, at least to my understanding, are a framework that is somewhat separate from world modeling, and I wonder how the two will interplay and whether you have plans to use them together. As I see it right now, VLAs are policies that could benefit a lot from world modeling. From a broader perspective, I wonder which infrastructure layers would be most useful to work on, such as explainability, traceability, or safety in general, to deploy such models in the real world?
Answer: On the first point, there are actually fairly natural ways to incorporate world-model objectives into vision-language-action models. We've done some work where, instead of only predicting the next action, you predict some intermediate subgoal image, that is, what should happen in the future in order to accomplish the task, and then predict an action from there. We've seen some signs of life suggesting that's quite promising. So I think there are ways to merge the two paradigms.
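The two-stage prediction described above can be sketched roughly as follows. This is only an illustrative stand-in: the dimensions are made up, and random linear maps substitute for the actual learned subgoal and action heads; the point is just the data flow of conditioning the action on a predicted subgoal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
OBS_DIM, GOAL_DIM, ACT_DIM = 32, 16, 7

# Stand-ins for two learned heads of a VLA: random linear maps here.
W_goal = rng.normal(size=(GOAL_DIM, OBS_DIM))           # observation -> subgoal embedding
W_act = rng.normal(size=(ACT_DIM, OBS_DIM + GOAL_DIM))  # (obs, subgoal) -> action

def predict_action(obs: np.ndarray) -> np.ndarray:
    """Two-stage inference: first predict an intermediate subgoal
    (e.g. an embedding of a future image), then condition the
    action head on both the observation and that subgoal."""
    subgoal = np.tanh(W_goal @ obs)                  # stage 1: world-model-style subgoal
    action = W_act @ np.concatenate([obs, subgoal])  # stage 2: action prediction
    return action

obs = rng.normal(size=OBS_DIM)
action = predict_action(obs)
print(action.shape)  # (7,)
```

In a real model, the subgoal head would be trained with a world-model objective (predicting future observations) and the action head with an imitation or RL objective, sharing a common backbone.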
At the same time, I think there are a lot of challenges that come up with world modeling, basically because the data you put into it isn't necessarily reflective of the ways you're going to use it. You might train it on demonstrations of successfully completing the task and then try to use it to evaluate actions that don't optimally complete the task. The world model will hallucinate a video of completing the task successfully even if the actions you provide as input weren't actually going to lead to a good outcome. So there are challenges to overcome, but there are also ways to integrate world modeling into the VLA paradigm.
For your second question about infrastructure layers: you need a real-time system that actually hits a certain frequency to execute actions successfully. If you have lag in that system, it introduces all sorts of challenges. So thinking about fast inference, and infrastructure that's actually going to run on the robot, is a big part of what our software team does. And then there's also large-scale machine learning infrastructure: training large models and ingesting large amounts of data. The data we have is different from a lot of typical datasets because it's very multimodal in nature: videos, actions, language segments, and various other components as well. So there are some interesting infrastructure problems, I think, both on the robot side and on the model-training side.
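The real-time requirement mentioned above, executing actions at a steady control frequency and treating inference lag as a missed deadline, can be sketched as a fixed-rate control loop. The 50 Hz target and the function names here are hypothetical, not taken from any actual robot stack:

```python
import time

CONTROL_HZ = 50          # hypothetical target control frequency
PERIOD = 1.0 / CONTROL_HZ

def control_loop(n_steps: int, policy, obs_fn, act_fn) -> int:
    """Fixed-rate control loop: compute and send an action each tick,
    then sleep the remainder of the period so actions go out at a
    steady rate. Returns how many ticks missed their deadline."""
    overruns = 0
    for _ in range(n_steps):
        start = time.perf_counter()
        act_fn(policy(obs_fn()))          # inference + actuation
        elapsed = time.perf_counter() - start
        if elapsed > PERIOD:
            overruns += 1                 # inference lag: missed the deadline
        else:
            time.sleep(PERIOD - elapsed)  # hold the loop to CONTROL_HZ
    return overruns

# Trivial stand-in policy and I/O, just to exercise the loop.
missed = control_loop(5, lambda obs: obs, lambda: 0.0, lambda a: None)
print(missed)
```

In practice the hard part is keeping the policy call itself under the period, which is why fast on-robot inference matters so much.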
Question 4 (Frederick): I have a question about model sizes in general. I think what we're seeing right now is that larger models generally lead to better accuracy, for example in your experiments, and it's also what OpenAI, Anthropic, and others are doing with their LLMs. However, there's also the approach of using a quite small model and outsourcing the world knowledge into a database of some sort that the model can interact with. What is your take on that? Do you think that's a valid approach, or does encapsulating all the world knowledge inside the model work better?
Answer: It's an interesting question. My experience working on retrieval-based systems is that it's actually a bit tricky to, first, figure out what should be offloaded versus actually done by the model, and second, sometimes the model will ignore the retrieved content and try to generate something itself. It seems to be quite tricky technically to get that to work exactly the way you want.
I think whether that approach makes sense will probably depend on the application and the use case, but in my experience it ends up being quite tricky to figure out what the division of labor is. Even the model part will need some degree of intelligence in order to actually make use of the retrieved information. So I think it's a really fascinating research problem, but it also needs a lot of research to work successfully.
Question 5 (Charu Thomas): First off, I really appreciated the talk. It was really fascinating, and I've been a big fan of your work since meta-learning. When you think about how software and hardware are going to continue to evolve, what are the biggest opportunities for builders today, given your vision of physical intelligence?
Answer: I think there are lots of different opportunities to make things work a lot better, and a lot of open questions. Like I was mentioning before, one is thinking about better infrastructure on the robot side. There's some open-source code for that sort of thing, but there are a lot of opportunities to make robot infrastructure better, and I think not a lot of people are working on that aspect of the problem.
There are also lots of opportunities on the research side. One of the things I love about AI and computer science as a whole is that there's a really big open-source community, and I think there's a ton of opportunity to do open-source work and contribute to a broader community that's trying to collect data, open-source models, fix bugs in those models, fine-tune them, and figure out new recipes for fine-tuning them. So there are all sorts of questions on the research side as well, especially in the open-source realm.
Question 6: I also, just like everyone else, am a big fan of all your work, so thank you for putting that all out. I've been reading through a lot of your group's work recently and particularly enjoyed reading Siraj's PhD thesis; it taught me a lot about scaling real-world robotics with data. A question I have is: how do you think synthetic data will scale for robotics in the future? With LLMs, we've moved away, not from pre-training, but from human-collected data, toward creating synthetic data with a lot of filtering and self-grading. So how do you think using generative synthetic data, for creating environments or reward models, will impact robotics?
Answer: I have many thoughts on this topic. I think that at the end of the day there's going to be no replacement for real data, and so large amounts of real robot data is going to be a necessary component of any system that's going to work in a generalizable way. So we're going to need that.
At the same time, I do think simulation and synthetic data have a role to play, especially on the evaluation side. For example, as you generalize to many environments, it's very tricky to evaluate how well the model generalizes, not just in one new environment but in ten new environments, because then you actually need to bring the robot to those ten environments or construct them. In simulation, that gets a lot easier. So I'm really excited about simulation and synthetic data for that use case.
I should also mention that I think the analog of synthetic data in language models is actually not necessarily simulation in robotics, but something closer to reinforcement learning. A lot of synthetic data is generated by the model that's actually trying to do the task and reasoning through different ways of doing it. The analogy is a robot that attempts the task, learns from its own attempts, and gets better from them. That sort of online data from the model will also play a really critical role in post-training, and it's something we're working on quite a bit. So yeah, I think that's really important and really helpful.
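One simple form of the "learn from your own attempts" loop described above is filtered imitation: collect rollouts with the current policy, keep only the successful attempts, and re-fit the policy to those. The toy task below, with two discrete actions and made-up success probabilities, is purely an illustration of that loop, not the method used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: two discrete actions; action 1 succeeds far more often.
# The numbers are hypothetical, chosen only for illustration.
SUCCESS_PROB = np.array([0.2, 0.8])

policy = np.array([0.5, 0.5])  # start from a uniform imitation-learned policy

for _ in range(10):
    # Collect rollouts with the current policy (the "online data").
    actions = rng.choice(2, size=200, p=policy)
    successes = rng.random(200) < SUCCESS_PROB[actions]
    kept = actions[successes]  # keep only the successful attempts
    if len(kept) == 0:
        continue
    # Re-fit the policy to its own successful attempts
    # (filtered imitation, a simple form of RL-style self-improvement).
    counts = np.bincount(kept, minlength=2) + 1.0  # +1 smoothing
    policy = counts / counts.sum()

print(policy)  # probability mass shifts toward the action that succeeds
```

After a few iterations the policy concentrates on the more reliable action, which is the basic mechanism by which online data pushes success rates above what imitation alone achieves.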
Question 7: It's super cool to see you, as an MIT EECS alum, now working at a really cool robotics company and talking to us about robotics and entrepreneurship. But I've been wondering how robotics research that involves hardware components plays out differently in academia versus industry. Are there typically more resources, fewer constraints, or broader applications in one setting over the other? And what kinds of people or goals do you think are better suited for each path?
Answer: It's an interesting question. I still love startup, academic, and industry environments; I think they all have various pros and cons. Certainly, academic environments generally aren't quite as well resourced as startups and industry labs in terms of data-collection throughput, eval throughput, and compute. But at the same time, there are a lot of problems we need to figure out on the algorithm side that you can solve without large amounts of resources. So I think there's a lot of really interesting work to be done there.
In industry and in startups, actually doing research on these big models, scaling up data, and seeing what happens at large scale is really great. So I think there's a place for both. I also think the gap isn't as large as people often make it seem. People in industry environments often wish they had more compute; you kind of always wish you had more resources. And sometimes when you have a lot of resources, you don't think as carefully and as critically about what runs you're going to do, and you end up being more wasteful of compute than if you were compute-constrained. So in my experience, there are actually downsides to having more resources too.
Question 8: I'm really sorry, can I just ask one quick question on architecture? I know that scaling laws have worked well for transformer-based architectures, and I was wondering: do you currently see limits in VLM-based architectures, which are made for text tokens, given that they don't have modules for physical awareness? And how do you deal with that?
Answer: We tokenize the actions, so I'd encourage you to take a look at the FAST tokenizer paper that we put out as a way to accomplish that. And we should wrap up there. Thanks, everyone, and I hope you enjoy the event.
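The core idea behind tokenizing actions, mapping continuous robot commands to discrete tokens that a language-model backbone can predict, can be illustrated with a simple uniform-binning tokenizer. Note this is only the naive version of the idea; the actual FAST scheme is a more sophisticated compression-based tokenizer, and the bin count and action range below are assumptions:

```python
import numpy as np

# Simple uniform-binning action tokenizer, illustrating how continuous
# actions become discrete tokens. (FAST itself uses a more sophisticated
# compression-based scheme; this sketch only shows the basic idea.)
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(actions: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, N_BINS)."""
    clipped = np.clip(actions, LOW, HIGH)
    bins = (clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)
    return np.round(bins).astype(int)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Map tokens back to (approximate) continuous actions."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

a = np.array([0.0, 0.5, -1.0, 0.99])
t = tokenize(a)
recon = detokenize(t)
print(np.max(np.abs(recon - a)))  # small quantization error
```

The trade-off is resolution versus sequence length: finer bins or longer action chunks mean more tokens per control step, which is exactly the pressure that motivates compressing action sequences before tokenizing them.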