DeepMotor: Beijing Startup Betting on First‑Person Human Data for Embodied AGI
#Robotics


DeepMotor, founded by Chen Kai in early 2025, claims that training robots on large‑scale first‑person video captured from humans is the missing ingredient for embodied AGI. The company has raised hundreds of millions of RMB, citing recent moves by Tesla, NVIDIA and Figure AI toward similar data pipelines. This article examines what is new about DeepMotor’s approach, how it differs from prior work, and what practical limits remain.

DeepMotor’s Claim

DeepMotor argues that the bottleneck for embodied artificial general intelligence (AGI) is the lack of authentic first‑person visual streams that capture how humans actually move, manipulate objects, and interact with environments. By collecting hundreds of thousands of hours of head‑mounted video from workers, technicians and everyday people, the startup says it can pre‑train large‑scale vision‑language‑action models that transfer directly to robot control.

The company’s recent funding round—reported to be in the several‑hundred‑million‑RMB range—was justified on the basis of three milestones:

  1. Data scale: a proprietary pipeline that aggregates > 250 k hours of first‑person video, annotated with timestamps and coarse action labels.
  2. Model architecture: a transformer‑based “Ego‑Transformer” that ingests synchronized video, inertial measurement unit (IMU) data and sparse language cues.
  3. Hardware integration: a partnership with a Chinese robot manufacturer to deploy the pretrained model on a 7‑DoF manipulator for pick‑and‑place trials.

What’s Actually New?

1. Scale of Human‑Centric Data

The idea of using first‑person video for robot learning is not new. Projects such as Ego4D (Meta, 2022) and EPIC‑KITCHENS (2020) released thousands of hours of egocentric footage for action recognition. What DeepMotor adds is a closed‑loop pipeline that couples video capture to a downstream robot‑training loop. Its dataset is reportedly nearly two orders of magnitude larger than Ego4D (~ 3 k hours of video), though still somewhat smaller than the 270 k hours used by GeneralistAI in its June 2025 demo.

2. Ego‑Transformer Architecture

DeepMotor’s technical note describes a multi‑modal transformer that processes video frames (30 fps), IMU streams (200 Hz) and optional textual prompts. The model is trained with a contrastive loss that aligns video‑action pairs with robot‑state trajectories generated in simulation. This is reminiscent of the RT‑1 approach from Google (2022) and the EgoScale model released by NVIDIA in February 2026, which also uses a large‑scale egocentric pre‑training phase before fine‑tuning on robot data.
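DeepMotor has not released code, but the objective as described resembles a CLIP‑style contrastive alignment between egocentric clips and robot trajectories. The sketch below is a minimal, hypothetical illustration of that idea (the encoder outputs and batch pairing are assumptions, not DeepMotor's pipeline):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, traj_emb, temperature=0.07):
    """InfoNCE-style loss: pull matching (video clip, robot trajectory)
    pairs together, push mismatched pairs apart.

    video_emb: (B, D) embeddings of egocentric video clips
    traj_emb:  (B, D) embeddings of simulated robot-state trajectories
    """
    video_emb = F.normalize(video_emb, dim=-1)
    traj_emb = F.normalize(traj_emb, dim=-1)
    logits = video_emb @ traj_emb.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Symmetric cross-entropy: clip -> trajectory and trajectory -> clip
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In practice, the video and trajectory encoders would be the multi‑modal transformer branches described in the technical note; the loss above only captures the alignment step.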

3. Integration with Real‑World Robots

The most concrete deliverable from DeepMotor so far is a demonstration where a 7‑DoF arm, equipped with a wrist‑mounted camera, executes a sequence of kitchen tasks after being initialized with the pretrained Ego‑Transformer. The success rate reported (78 % on a 10‑task benchmark) is comparable to the 80 % achieved by NVIDIA’s EgoScale on a similar setup.

Limitations and Open Questions

Data Diversity vs. Domain Gap

Even with 250 k hours of footage, the dataset is heavily weighted toward industrial and office settings in Beijing. Actions such as delicate food preparation or outdoor tool use are under‑represented. This raises the classic domain gap problem: a model pretrained on indoor office footage may still struggle when transferred to a warehouse or a home kitchen.

Annotation Overhead

DeepMotor relies on “coarse action labels” generated by a semi‑automatic pipeline. Human verification is still required for ambiguous segments, which limits how quickly the dataset can be expanded. In contrast, GeneralistAI’s approach uses self‑supervised video‑prediction objectives that avoid explicit labeling.
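For context, a self‑supervised video‑prediction objective sidesteps labels by asking the model to predict what comes next in the raw footage. The toy sketch below illustrates the general idea of predicting the next frame's latent from the current frame; it is a hypothetical example, not GeneralistAI's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NextLatentPredictor(nn.Module):
    """Toy self-supervised objective: predict the latent of the next frame
    from the current frame, so raw video supervises itself with no labels."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim))
        self.predictor = nn.Linear(latent_dim, latent_dim)

    def forward(self, frame_t, frame_t1):
        z_t = self.encoder(frame_t)            # latent of current frame
        with torch.no_grad():                  # target latent is not backpropagated
            z_t1 = self.encoder(frame_t1)
        return F.mse_loss(self.predictor(z_t), z_t1)
```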

Compute Cost

Training the Ego‑Transformer on 250 k hours of video reportedly required 1.2 k GPU‑years on NVIDIA H100s. At current cloud pricing, that translates to tens of millions of USD in compute spend. The economics of scaling beyond the current dataset are unclear, especially for a privately funded startup.
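A back‑of‑envelope check (assuming, hypothetically, roughly USD 2–4 per H100 GPU‑hour of on‑demand cloud capacity) shows how that figure adds up:

```python
# Rough sanity check of the "tens of millions of USD" estimate.
# The $2-4 per H100 GPU-hour range is an assumption, not a quoted price.
gpu_years = 1200                        # ~1.2 k GPU-years reported
gpu_hours = gpu_years * 365 * 24        # ~10.5 million GPU-hours
for usd_per_gpu_hour in (2.0, 4.0):
    cost_musd = gpu_hours * usd_per_gpu_hour / 1e6
    print(f"${usd_per_gpu_hour:.0f}/GPU-hr -> ~${cost_musd:.0f}M total")
# Prints roughly $21M at $2/hr and $42M at $4/hr.
```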

Comparison with Industry Moves

Tesla’s shift to human video data for Optimus (May 2025) and NVIDIA’s EgoScale (Feb 2026) demonstrate that the idea has traction beyond DeepMotor. However, both companies still treat egocentric data as a pre‑training stage, followed by extensive robot‑specific fine‑tuning. DeepMotor’s claim of “embodied AGI” appears to conflate pre‑training scale with the ability to generalize across all embodied tasks—a leap that is not yet supported by empirical results.

Outlook

DeepMotor is early among Chinese robotics startups in betting on first‑person data, and its funding round reflects growing investor confidence that egocentric video will become a staple of robot‑learning pipelines. The company’s progress will be judged on two fronts:

  1. Generalization – can the pretrained Ego‑Transformer handle tasks that differ significantly from the training distribution?
  2. Economic viability – will the marginal gains from additional egocentric data justify the massive compute and annotation costs?

If DeepMotor can demonstrate robust transfer to a variety of robot platforms with minimal fine‑tuning, it will provide a useful data‑centric complement to existing simulation‑heavy methods. Until then, the hype surrounding “embodied AGI” should be tempered by the practical constraints of data diversity, labeling effort, and compute budget.
