Rhoda AI CEO on Video Training: I Don't Think the World is Going Back

By Brian Heater, Managing Editor, A3
04/02/2026

Rhoda AI arms

AI is remarkably good at surprising the humans who build it. Developers will gleefully recount stories of unexpected connections and breakthroughs they’d never envisioned until they unfolded in front of them. Such subversions of expectations are a big part of why generative AI remains such a compelling area of study. However smart and forward-thinking you might be, the system is bound to catch you off-guard sooner or later.

Jagdeep Singh describes one such moment, which came as Rhoda worked to train robot arms to perform real-world tasks in an industrial setting. “At first glance, you might think that, okay, if you have more videos of actual workers doing the job, you might perform better,” the CEO explains. It’s a simple premise: If you want a robot to learn to perform certain tasks, focus on feeding it content of those tasks being performed.

“It turns out we found the opposite,” Singh adds. “Models that worked best were the ones that had the less curated data, where there's a bunch of other random stuff in there that again might not even involve human beings.” 

Rhoda’s findings led to a reframing of how data is processed. Focusing on a specific task necessarily narrows the data input. Less curated video, on the other hand, has the potential to teach the system broader, more universal lessons.  

“In hindsight, it makes sense because what you want the model to learn is just a general prior of how things move. Basically, you learn the laws of physics, intuitive physics, as you might say. And once you've learned that well, then you can pick up a new task with very little data.”

Getting there, however, will require a ton of data. Rhoda’s path is lined with video, and massive troves of it. In much the same way large language models were trained on an internet’s worth of data, the startup is pre-training its models with hundreds of millions of online videos.

“The data has been around for a long time,” says Singh. “By some estimates, we've heard 80% of all the data on the internet is video. Much of this video is openly accessible. That isn't the issue. But until recently, you just didn't have the techniques to be able to ingest all this video, to make sense of it, and then to be able to learn a prior from which you could put it in a policy prediction.” 

Video also plays a key role in Rhoda’s post-training. Here the startup employs what it refers to as a direct video-to-action, or DVA, model (in contrast to the more established vision-language-action, or VLA, model). Rather than planning out the robot’s actions in, say, five-second bursts, Rhoda’s closed-loop system keeps predictions to a few hundred milliseconds at a time, meaning it can respond more quickly to the inevitable variations in real-world settings. The system continually generates short video clips that serve as the robot’s policy and that, when stitched together, (hopefully) form the intended action.



“You see that scene and you see the language instructions say, pick up this cup of coffee,” Singh explains. “You might generate a video of the robot arm moving a few centimeters or a few inches closer to the cup. And after you do that video, that video is converted to actions, robot motor torques, and joint angles, and you can execute those actions. The robot moves forward a few inches. Then you generate a new video based on that new scene that you observed, and so on. And that's done continuously, so your robot moves smoothly through space.”
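
The loop Singh describes (observe, generate a short video, convert it to actions, execute, observe again) can be illustrated with a toy sketch. This is not Rhoda's implementation: every name below (`Scene`, `generate_short_video`, `video_to_actions`, `execute`, `run_closed_loop`) is hypothetical, the "video" is reduced to a list of predicted gripper positions along one axis, and the "actions" are simple displacements rather than motor torques or joint angles.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Scene:
    """Toy observation: gripper and cup positions along one axis, in centimeters."""
    gripper_cm: float
    cup_cm: float


def generate_short_video(scene: Scene, instruction: str,
                         max_step_cm: float = 5.0, frames: int = 5) -> List[float]:
    """Hypothetical stand-in for the video model.

    Returns predicted gripper positions for the next few frames, covering at
    most max_step_cm of motion. This toy version ignores the instruction text.
    """
    remaining = scene.cup_cm - scene.gripper_cm
    step = max(-max_step_cm, min(max_step_cm, remaining))
    return [scene.gripper_cm + step * (i + 1) / frames for i in range(frames)]


def video_to_actions(scene: Scene, video: List[float]) -> List[float]:
    """Hypothetical stand-in for converting predicted frames into commands.

    Here an "action" is just the displacement between consecutive frames.
    """
    positions = [scene.gripper_cm] + video
    return [after - before for before, after in zip(positions, positions[1:])]


def execute(scene: Scene, actions: List[float], drift_cm: float = 0.0) -> Scene:
    """Apply the actions, then shift the cup by drift_cm to mimic real-world variation."""
    return Scene(scene.gripper_cm + sum(actions), scene.cup_cm + drift_cm)


def run_closed_loop(scene: Scene, instruction: str,
                    drifts: Sequence[float] = (),
                    tolerance_cm: float = 0.1,
                    max_iterations: int = 50) -> List[Scene]:
    """Repeat observe -> predict -> act until the gripper reaches the cup."""
    history = [scene]
    for i in range(max_iterations):
        if abs(scene.cup_cm - scene.gripper_cm) <= tolerance_cm:
            break
        # Each short prediction starts from the latest observed scene.
        video = generate_short_video(scene, instruction)
        actions = video_to_actions(scene, video)
        drift = drifts[i] if i < len(drifts) else 0.0
        scene = execute(scene, actions, drift)
        history.append(scene)
    return history
```

Calling `run_closed_loop(Scene(0.0, 22.0), "pick up this cup of coffee", drifts=(3.0, 3.0))` moves the cup twice while the arm is in motion. The gripper still ends up at the cup, because each short prediction is regenerated from the newest scene rather than from a long plan made at the start.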

If the task is successful, the robot is allowed to carry on. If it’s unsuccessful, a human intervenes to correct the action.  

Like many others in the physical AI space, Rhoda is working toward a generalist robot model. Kickstarting the data flywheel entails deploying its model to perform real-world tasks. Singh cites a recent job in which the Rhoda system guided a pair of arms folding containers with collapsible sidewalls. After a couple of weeks of training, the startup was able to run an in-house pilot autonomously. As is always the case, however, things got complicated after entering the real world.

“In the factory, these are the same containers, but they’re really beat up,” he explains. “The latches are in some cases broken, the ball transfer tables on which we're handling these things are rusted, the friction's different. Even though it's the same nominal task, when you switch it over to that environment, it's a whole different kind of set of variables. We were able to train in the factory for about a total of three hours, given the constraints they had. And with those three hours of training, the model ended up working autonomously in that factory.” 

Among the more spirited debates currently at the center of physical AI is the role that simulation, real-world data collection, and video will each play in the pre- and post-training of “general purpose” robot models. Most experts agree that some combination of the three will contribute to the creation of robust models, as each has its relative strengths and weaknesses. For Rhoda, video is set to play an outsized role at each key step in the process.

“I don’t think the world is going to go back to non-video-based pre-training,” says Singh. “It makes no sense. There’s just so much you can learn from actual video. Why would you not take advantage of that? In the same way you look at language models — pre-training was on tens of trillions of tokens. The original ChatGPT was trained on probably between 10 and 30 trillion tokens of textual information. The post-training was on the order of tens of thousands of Q&A pairs that were curated by OpenAI. It was orders of magnitude less. That sort of paradigm has worked for image generation and video generation. There’s no reason to believe the robotics domain is different and can generalize with dramatically less data.” 
