MIT's Sonar-MASt3R Fuses Sonar and Vision to Map Murky Underwater Scenes in Real Time

Underwater robots often have to stop and wait when stirred-up sediment blinds their cameras. A new MIT and Woods Hole method called Sonar-MASt3R borrows scale from sonar to fix a vision algorithm's depth estimates, letting vehicles map and navigate cloudy water as they go.

Remotely operated underwater vehicles have a recurring problem that has little to do with their hardware and everything to do with the water around them. When a vehicle touches down on the seafloor or digs into a sandbed, it lifts a cloud of sediment that blinds its optical cameras. The usual response is to wait, sometimes for many minutes, until the cloud settles and visibility returns. For a robot working on a tight power budget or a time-sensitive recovery task, that delay is expensive.

A team from MIT and the Woods Hole Oceanographic Institution (WHOI) has built a mapping technique that sidesteps the wait. Called Sonar-MASt3R, the method fuses optical camera images with acoustic sonar data to produce 3D maps of an underwater scene in real time, even when the water is too turbid for a camera to see through. Amy Phung, a graduate student in MIT's Department of Aeronautics and Astronautics and a member of the MIT-WHOI Joint Program, led the work and presented it at the IEEE International Conference on Robotics and Automation (ICRA). Her co-author and advisor is Richard Camilli, a senior scientist at WHOI.

Two sensing modes, two different weaknesses

Underwater perception has historically forced a choice between two imperfect options. Optical cameras give you rich visual detail: texture, color, the difference between a coffee mug and a rock. But they only work in clear, well-lit water. Push them into a sediment plume and they return a wall of green-brown noise.

Sonar has the opposite profile. By emitting acoustic pulses and timing their return, a sonar sensor measures the distance, depth, and rough shape of objects regardless of how cloudy the water is. Acoustic waves pass through suspended sediment that scatters light. The catch is that sonar produces geometry without appearance. You get the contour of an object but not enough detail to identify it or inspect it closely.
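The timing principle described above can be illustrated with a short sketch. It assumes a nominal sound speed of 1,500 meters per second, a common rule-of-thumb figure for seawater; the real value shifts with temperature, salinity, and depth. The function name and the example timing are hypothetical and are not taken from the team's system.

```python
# Minimal sketch: convert a sonar echo's round-trip time into a range.
# ASSUMPTION: 1500 m/s is a nominal sound speed for seawater, not a
# value reported by the researchers.
NOMINAL_SOUND_SPEED_M_S = 1500.0


def range_from_echo(two_way_travel_time_s: float,
                    sound_speed_m_s: float = NOMINAL_SOUND_SPEED_M_S) -> float:
    """Return the one-way distance to a reflector, in meters."""
    if two_way_travel_time_s < 0:
        raise ValueError("travel time must be non-negative")
    # The pulse travels out and back, so halve the round-trip path.
    return sound_speed_m_s * two_way_travel_time_s / 2.0


# A hypothetical echo that returns after 4 milliseconds: roughly 3 m away.
echo_range_m = range_from_echo(0.004)
print(f"range: {echo_range_m:.2f} m")
```

Nothing in the calculation depends on how clear the water is, which is why the range stays usable when the camera image does not.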

The research community has been working on combining the two, an approach known as opti-acoustic fusion. Prior efforts mostly targeted object recognition or reconstructing fixed workspaces, and many of them required offline processing to align and merge the two data streams, which ruled out real-time use. Only a handful produced true 3D maps, and none had been applied to high-resolution mapping in genuinely turbid conditions. That gap is what Phung and Camilli set out to close.

Borrowing scale to fix a vision model

The clever part of Sonar-MASt3R is what it builds on. The base is MASt3R, an image-matching algorithm developed by researchers in France. MASt3R takes 2D images of the same scene and estimates the relative depth of every pixel, fast enough to build a 3D reconstruction on the fly from ordinary camera frames. It is a strong piece of work, but it has a structural blind spot.

"The downside is that there is no sense of scale," Phung says. "It will say 'this pixel is five units closer than this pixel,' but it can't say whether that's 5 meters or 5 feet." This is the well-known scale ambiguity of camera-only 3D reconstruction: the geometry can be internally consistent and still be off by an unknown constant multiplier. That is fatal if a robot needs to know exactly how far to advance before it bumps into something.

Sonar fills precisely that hole. Because the timing of an acoustic reflection translates directly into an absolute distance, sonar supplies the one thing MASt3R lacks: metric scale. The team uses sonar returns to correct MASt3R's scaling, anchoring its relative depth map to real-world units. The result is a 3D map that is both detailed, where the camera can see, and metrically accurate, courtesy of the acoustics. In murky water, the sonar-corrected map tells a vehicle where objects actually are, and therefore how far it can safely advance to bring its cameras within visual range for a closer look.
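One simple way to picture the scale correction is as a single multiplier fitted between unitless vision depths and sonar ranges at matching points. The sketch below is an illustration under that assumption, not the method from the paper: the article does not describe how the team computes the correction, and the function names, the use of a median, and the sample numbers are all hypothetical.

```python
# Illustrative sketch only: fit one global scale factor that maps unitless
# vision depths onto metric sonar ranges. ASSUMPTION: the two lists hold
# measurements of the same scene points, already matched one-to-one.
from statistics import median


def estimate_scale(relative_depths, sonar_ranges_m):
    """Return the median ratio of sonar range to relative depth."""
    if len(relative_depths) != len(sonar_ranges_m):
        raise ValueError("inputs must be matched one-to-one")
    ratios = [r / d
              for d, r in zip(relative_depths, sonar_ranges_m)
              if d > 0 and r > 0]
    if not ratios:
        raise ValueError("no valid depth/range pairs")
    # The median tolerates a few bad sonar returns, such as echoes off a wall.
    return median(ratios)


def to_metric(relative_depths, scale):
    """Rescale unitless depths into meters."""
    return [scale * d for d in relative_depths]


# Hypothetical data: the last sonar range is a deliberate outlier.
relative = [2.0, 4.0, 5.0, 8.0]
sonar_m = [0.5, 1.0, 1.25, 9.0]

scale = estimate_scale(relative, sonar_m)
metric_depths = to_metric(relative, scale)
print(scale, metric_depths)
```

The median is used here instead of a mean so that one spurious return does not drag the scale, a choice made for this example only.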

The behavioral analogy the team uses is a marine one: pairing a dolphin's echolocation for long-range awareness with a sea turtle's close-range vision for detail. The vehicle uses sonar to understand the general shape of its surroundings, moves toward a structure of interest, and switches to its cameras only when it is close enough for them to resolve something useful.

Three images showing sea sediment at less than 0.5 NTU, 5.41 NTU, and more than 12 NTU.

How the mapping loop runs

The pipeline has a sensible division of labor. On the first pass, the system runs a sweep trajectory: the sensor platform moves slowly across the scene, capturing both sonar and visual data. From that sweep, Sonar-MASt3R builds a coarse sonar-based map of the shapes and contours present. This coarse map is the navigation backbone, the thing that exists even when the cameras are useless.

The map is then refined with imagery using a keyframe strategy borrowed from the wider visual SLAM toolkit. Each new camera frame is compared against the most recent keyframe. If the frame contains new information, it becomes a keyframe and contributes its visual detail to the map. If it is redundant, it is discarded immediately. This keeps the computation bounded and the map current, which is what makes real-time operation feasible rather than a batch process run after the dive.
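The keyframe logic can be sketched in a few lines. This toy version stands in for each camera frame with a set of feature IDs and scores novelty as the fraction of a frame's features that the latest keyframe lacks. The class name, the novelty measure, and the 0.3 threshold are assumptions made for illustration; the article does not say how Sonar-MASt3R decides that a frame carries new information.

```python
# Toy sketch of keyframe selection. ASSUMPTIONS: frames are represented as
# sets of feature IDs, and "new information" means the share of features
# not seen in the most recent keyframe. Both are illustrative stand-ins.
from dataclasses import dataclass, field


@dataclass
class KeyframeSelector:
    novelty_threshold: float = 0.3  # hypothetical tuning value
    keyframes: list = field(default_factory=list)

    def novelty(self, frame: frozenset) -> float:
        """Fraction of this frame's features absent from the last keyframe."""
        if not frame:
            return 0.0  # an empty frame adds nothing
        if not self.keyframes:
            return 1.0  # first usable frame is entirely new
        shared = len(frame & self.keyframes[-1])
        return 1.0 - shared / len(frame)

    def offer(self, frame) -> bool:
        """Keep the frame as a keyframe if it is novel enough, else drop it."""
        frame = frozenset(frame)
        if self.novelty(frame) >= self.novelty_threshold:
            self.keyframes.append(frame)
            return True
        return False


selector = KeyframeSelector()
frames = [{1, 2, 3, 4}, {1, 2, 3, 5}, {3, 5, 6, 7}, {3, 5, 6, 7}]
decisions = [selector.offer(f) for f in frames]
print(decisions)  # [True, False, True, False]
```

Because each frame is compared only against the latest keyframe, the work per frame stays constant no matter how long the dive runs.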

The trade-off here is worth naming. A keyframe scheme keeps things fast by throwing away most frames, which means the visual detail in the final map is only as dense as the vehicle's path and viewpoint allowed. The method is not trying to reconstruct every surface in photographic detail; it is trying to give a robot enough geometry to act safely and enough imagery to identify what matters.

Three pairs of images showing sea sediment and their visualizations at three levels: less than 0.5 NTU, 5.41 NTU, and more than 12 NTU.

Testing across eight grades of cloudiness

The team validated Sonar-MASt3R in a tank at WHOI, which let them control visibility directly by stirring up sediment. They filled the tank with water, sediment, and an assortment of test objects, including a small boulder, a coffee mug, and a packing crate, then mounted an underwater camera and a sonar sensor on a robotic arm that could sweep across the scene.

Amy Phung working on a tank experiment

They ran the system across eight levels of turbidity, measured in nephelometric turbidity units (NTU), ranging from nearly clear water below 0.5 NTU up past 12 NTU. Against competing opti-acoustic fusion approaches, Sonar-MASt3R produced more accurate 3D maps and resolved finer, centimeter-scale details, and it held that advantage as the water got cloudier. In the worst condition, where the cameras saw nothing, the sonar still generated a rough map of the hidden objects. That initial map was enough to guide the arm safely through the murk and bring its camera close enough to image specific objects in detail.

Camilli frames the capability in everyday terms: "An analogy would be if you were to go into a china shop in the dark, and try to pick your way around to find a specific coffee mug without knocking things over. This would allow you to do that."

Where it goes from the tank

One counterintuitive finding is that the tank may be the hard case. Enclosed water reflects acoustic energy off its walls, producing reverberations, ghost images, and distortions that complicate the sonar processing. "It's like trying to do this in a funhouse mirror setting," Camilli says. The team expects open-water performance to be cleaner, since the ocean does not bounce sound back the way a steel tank does. Testing in natural, open-water conditions is the next step.

The motivating applications are concrete. One is the recovery of unexploded underwater ordnance, often located in surf zones where churning water destroys visibility and where keeping crewed vessels nearby is unsafe. Robotics is the right tool for that job, but only if the robot can perceive its surroundings well enough to operate, which is exactly the constraint Sonar-MASt3R targets. Beyond ordnance disposal, the team points to scientific exploration, underwater construction and maintenance, and deep-sea recovery as settings where low visibility currently keeps vehicles out.

"The real value in this effort is so we can use this technology in mission scenarios that are untractable right now," Phung says. "And there are plenty of untractable missions because we don't have the observational or perception capabilities." The framing is honest about what the work is and is not. It does not give underwater robots new eyes; it gives them a way to combine two sensors they already carry so that the strengths of one cover for the blindness of the other. The research was supported in part by NASA and the National Science Foundation, and is detailed in the paper "Sonar-MASt3R: Real-Time Opti-Acoustic Fusion in Turbid, Unstructured Environments."