A single camera is not enough to teach a robot how to manipulate objects in the real world. The robot needs to see (RGB-D), feel (force-torque), and understand its own motion (IMU and joint encoders). Each modality captures a different dimension of physical interaction that the others cannot. RGB captures appearance and spatial layout. Depth captures 3D geometry. Force-torque captures contact dynamics. IMU captures ego-motion and vibration. Joint encoders capture the robot's proprioceptive state. Combined, they give learning algorithms — behavior cloning, diffusion policies, action chunking transformers — a rich representation that no single sensor can provide.
But multimodal data collection is not simply adding more sensors. Each additional modality introduces synchronization requirements, calibration complexity, storage overhead, and training pipeline challenges. A poorly synchronized multi-sensor dataset is worse than a well-synchronized single-camera dataset, because the model learns temporal associations that do not exist in reality. This guide covers what to capture, how to synchronize it, how to calibrate across modalities, and how current training methods consume multimodal robotics data — with specific sensor models, data rates, and format recommendations based on what works in production robotics data collection pipelines.
Why Single-Modality Data Falls Short
Each sensor modality captures a specific slice of physical reality. Using any single modality alone leaves critical gaps that directly affect model performance.
RGB alone provides rich visual appearance — color, texture, lighting, spatial relationships — but no metric depth information. A model trained only on RGB must infer 3D geometry from 2D cues, which works for coarse tasks but fails for precision manipulation. RGB cannot capture contact forces: when a gripper squeezes an object, the visual change is minimal, but the force-torque data shows a rich signal — initial contact, force increase, stabilization, slip events. RGB also suffers from appearance variation: changing the lighting or background can confuse a model that learned surface-level visual patterns rather than task-relevant features.
Depth alone provides 3D geometry but strips away appearance information. Two objects with identical shape but different surface properties — a metal bolt and a plastic bolt, a ripe tomato and an unripe one — are indistinguishable in a depth map. Depth sensors also have specific failure modes: transparent objects produce no depth return, specular surfaces reflect the IR pattern away from the sensor, and edges create depth discontinuity artifacts. Depth-only models cannot leverage color-based object recognition or surface texture cues that indicate material properties.
Force-torque alone captures contact dynamics with high fidelity but provides no spatial context. The sensor reports a 6-axis wrench — three force components and three torque components — at the robot's wrist or fingertip, but without visual data there is no information about what is causing the force, where the contact is occurring, or what the overall task state looks like. Force-torque data is reactive rather than predictive: it tells you what is happening at the contact point right now, but planning the approach to the next grasp requires vision.
Consider a concrete task: picking up a ripe tomato from a bin. Visual detection identifies which object is the tomato and estimates its position. Depth data provides the 3D geometry needed to plan an approach path that avoids other objects. Force-torque feedback during the grasp detects when contact is made and controls grip force to avoid crushing — the threshold between a secure grasp and a crushed tomato might be 2N. No single modality handles all three requirements. Train with only RGB, and the model cannot control grasp force. Train with only force-torque, and the model cannot find the tomato. Train with all three modalities, and the model learns a complete representation of the task.
The Core Modalities for Robotics Data
Six sensor modalities form the standard toolkit for multimodal robotics data collection. Each has specific hardware options, data rates, and integration considerations.
RGB-D cameras are the primary visual sensor. The Intel RealSense D405 is the standard for wrist-mounted applications: 640x480 resolution at 30fps for both RGB and depth, with active stereo depth technology providing ±2mm accuracy at 50cm range. Its compact form factor (42mm x 42mm) allows mounting on robotic wrists without excessive payload impact. For external viewpoints, the RealSense D435 provides wider field of view (86° x 57°) at the same resolution and frame rate, suitable for overview cameras mounted 0.5-1.5m from the workspace. Depth accuracy degrades with distance: at 1m the D435 achieves ±4mm, at 2m approximately ±14mm. For tasks requiring higher depth accuracy — sub-millimeter assembly, surface inspection — structured light sensors like the Zivid Two provide ±0.1mm accuracy but at much lower frame rates (2-5fps), making them unsuitable for dynamic manipulation capture.
Force-torque sensors capture the contact dynamics that visual sensors cannot. The ATI Mini45 is the industry standard for wrist-mounted applications: 6-axis measurement (Fx, Fy, Fz, Tx, Ty, Tz), force range up to 145N on each axis, torque range up to 5Nm, with resolution of 0.025N and 0.5mNm. The critical specification for data collection is sample rate: the Mini45 operates at 1000Hz through a dedicated data acquisition system, providing the temporal resolution needed to capture contact transients — the initial contact spike during a grasp lasts 10-50ms and is completely invisible at 30Hz. For fingertip-level force sensing, ATI Nano17 sensors provide similar performance in a smaller package, suitable for integration into custom grippers.
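To make the rate argument concrete, here is a minimal sketch that detects contact onset from the force derivative of a 1000Hz stream. The signal shape, threshold, and noise level are illustrative assumptions (not Mini45-specific values), and numpy is the only dependency:

```python
import numpy as np

# Simulate 0.5 s of Fz at 1000 Hz: free motion, a 30 ms contact
# transient peaking at 8 N, then stabilization around 5 N (synthetic).
fs = 1000
t = np.arange(500) / fs
fz = np.zeros_like(t)
onset = 0.2                                    # contact begins at t = 200 ms
spike = (t >= onset) & (t < onset + 0.03)
fz[spike] = 8.0 * np.sin(np.pi * (t[spike] - onset) / 0.03)   # transient
fz[t >= onset + 0.03] = 5.0                                   # steady grasp
fz += np.random.default_rng(0).normal(0, 0.05, t.shape)       # sensor noise

def detect_contact(force, fs, threshold_n_per_s=200.0):
    """Return the index of contact onset: the first sample where the
    force derivative exceeds a rate threshold (in N/s)."""
    dfdt = np.gradient(force) * fs
    above = np.flatnonzero(np.abs(dfdt) > threshold_n_per_s)
    return int(above[0]) if above.size else -1

idx = detect_contact(fz, fs)
print(f"contact detected at t = {idx / fs * 1000:.0f} ms")   # ~200 ms
```

At 30Hz the entire transient spans a single sample, so the derivative-based onset detection above has nothing to work with — which is the practical reason the 1000Hz sample rate matters.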
IMU sensors provide ego-motion estimation and vibration detection. A standard 6-axis IMU (accelerometer + gyroscope) samples at 200-400Hz and captures the robot's body dynamics — acceleration during transport, vibration during contact, orientation changes during manipulation. 9-axis IMUs add a magnetometer for absolute heading, useful for mobile manipulation platforms. In practice, IMU data is most valuable for mobile manipulation tasks and for detecting environmental vibration that affects sensor calibration — a forklift passing near a manufacturing data collection station produces vibration that shifts camera mounts.
Joint encoders record the robot's proprioceptive state: position, velocity, and optionally torque at each joint. Most industrial and research robots provide joint data at 100-1000Hz through their control interfaces. Joint data is essential for behavior cloning and action chunking — the model's output space is joint positions or velocities, and the input must include the current proprioceptive state for closed-loop control. Joint torque measurements, when available, provide an additional signal for contact detection and load estimation.
Hand pose tracking captures human hand kinematics during demonstration. For egocentric data collection, 21-joint hand skeleton tracking at 30fps or higher provides a dense representation of human grasp strategies. Approaches include vision-based tracking (MediaPipe Hands, which works with standard RGB cameras but has accuracy limitations), marker-based tracking (OptiTrack, sub-millimeter accuracy but requires infrastructure), and instrumented gloves (StretchSense, direct joint angle measurement at 60Hz). The choice depends on the accuracy requirements and the degree to which tracking equipment can interfere with natural demonstration behavior.
Tactile sensors provide fingertip-level contact geometry that force-torque sensors cannot resolve. GelSight sensors capture the deformation of a gel surface upon contact, producing a high-resolution tactile image (typically 320x240 at 30fps) that reveals the shape of the contact surface, the force distribution across the contact patch, and incipient slip events. DIGIT sensors provide similar capability in a form factor small enough for integration into robotic fingers. Tactile data is most valuable for tasks involving deformable objects, fragile items, or in-hand manipulation where fingertip forces determine success.
Synchronization: The Hard Part Nobody Talks About
Synchronization is the most technically challenging aspect of multimodal data collection, and the most commonly underestimated. Different sensors operate at different rates — cameras at 30fps, force-torque at 1000Hz, IMU at 200Hz, joint encoders at 100-1000Hz — and each sensor has its own clock. Without rigorous synchronization, the temporal relationships between modalities are corrupted, and models learn false associations between events that did not actually co-occur.
The fundamental question is: when the force-torque sensor reports a contact event at timestamp T, which camera frame corresponds to that moment? If the camera and force-torque sensor use independent software timestamps derived from the host computer's system clock, the answer depends on USB polling latency, driver buffering, operating system scheduling, and CPU load — all of which vary unpredictably. Software timestamps from two different USB devices on the same computer can differ by 10-50ms under normal conditions, and much more under high CPU load. For a grasp event that lasts 200ms, a 50ms timing error means the model associates the force signal with the wrong visual frame.
NTP (Network Time Protocol) synchronizes clocks across networked computers to within 1-10ms under good conditions. This is adequate for loosely coupled multi-machine setups but insufficient for tight sensor synchronization. PTP (Precision Time Protocol, IEEE 1588) achieves sub-microsecond synchronization between devices on the same network, more than sufficient for even the tightest multimodal capture tolerances. Industrial robots and high-end data acquisition systems often support PTP natively.
The gold standard for synchronization is hardware trigger lines. A single master clock generates trigger pulses that initiate capture on all sensors simultaneously. The cameras capture a frame on each trigger pulse; the force-torque DAQ records a sample on each pulse (or on a divided clock). This eliminates software timing uncertainty entirely. The implementation requires physical wiring — trigger cables from a pulse generator to each sensor's trigger input — which adds setup complexity but provides deterministic synchronization.
For setups where hardware triggering is impractical, the next best approach is hardware timestamps with post-hoc alignment. Some sensors (RealSense cameras, many industrial robots) provide hardware-generated timestamps from their internal clocks. These timestamps are locally accurate — the interval between consecutive frames is precise — even if the absolute time is offset from other sensors. Post-collection, the streams are aligned by matching known events: a sharp impact visible in both the camera frames and the force-torque data, a robot motion start visible in both the camera and the joint encoder stream. This approach works but requires careful implementation and validation.
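The event-matching step can be sketched as a cross-correlation over a shared sharp event, assuming both streams have already been resampled to a common rate. The signals and the 23ms offset below are synthetic:

```python
import numpy as np

def estimate_offset(sig_a, sig_b, fs):
    """Estimate the time offset of sig_b relative to sig_a via
    cross-correlation. Both signals must share the sample rate fs.
    A positive result means sig_b's events occur earlier in its own
    timeline (its clock runs ahead)."""
    a = (sig_a - sig_a.mean()) / (sig_a.std() + 1e-9)
    b = (sig_b - sig_b.mean()) / (sig_b.std() + 1e-9)
    corr = np.correlate(a, b, mode="full")
    lag = np.argmax(corr) - (len(b) - 1)
    return lag / fs

# Synthetic example: the same impact appears in the camera stream
# (frame-difference energy, upsampled to 1000 Hz) and in the force
# magnitude, but the force stream's clock runs 23 ms ahead.
fs = 1000
rng = np.random.default_rng(1)
impact = np.exp(-0.5 * ((np.arange(2000) - 800) / 10.0) ** 2)
camera_energy = impact + rng.normal(0, 0.02, 2000)
force_mag = np.roll(impact, -23) + rng.normal(0, 0.02, 2000)

offset_s = estimate_offset(camera_energy, force_mag, fs)
print(f"estimated offset: {offset_s * 1000:.0f} ms")
```

The recovered offset is then added to the lagging stream's timestamps before any downstream alignment; validating the result against a second, independent event is a cheap sanity check.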
MCAP format handles multi-rate streams natively, storing each channel with its own timestamps and allowing efficient retrieval of time-aligned data across channels. Each message in an MCAP file carries a log-time and an optional publish-time, and the format supports nanosecond-resolution timestamps. When building a collection pipeline, recording all streams into a single MCAP file (or a set of synchronized MCAP files with shared time references) is the recommended approach for preserving temporal relationships through the entire data lifecycle.
Acceptable synchronization tolerances depend on the application. For quasi-static manipulation (slow, deliberate motions), 10ms inter-stream synchronization is adequate. For dynamic manipulation (catching objects, rapid insertion), 1ms or better is required. For teleoperation data collection where the operator controls the motion speed, target 5ms synchronization as a general standard. Validate synchronization empirically: command a known robot motion and verify that the timestamps in the camera stream and joint encoder stream agree to within the required tolerance.
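Empirical validation can be framed as a clock-model fit, assuming both streams timestamp the same hardware trigger pulses. The offsets, drift, and noise levels below are synthetic stand-ins:

```python
import numpy as np

# Hypothetical validation: the camera and the joint-encoder stream both
# timestamp 100 shared trigger pulses. Fit offset + drift between the
# two clocks, then check the residual against the sync tolerance.
rng = np.random.default_rng(2)
true_pulses = np.arange(100) * 0.1                    # a pulse every 100 ms
cam_stamps = true_pulses + 0.0125 + rng.normal(0, 2e-4, 100)        # 12.5 ms offset
joint_stamps = true_pulses * (1 + 3e-6) + rng.normal(0, 1e-4, 100)  # 3 ppm drift

drift, offset = np.polyfit(joint_stamps, cam_stamps, 1)
residual = cam_stamps - (drift * joint_stamps + offset)
print(f"offset: {offset * 1000:.1f} ms, drift: {(drift - 1) * 1e6:.1f} ppm, "
      f"max residual: {np.abs(residual).max() * 1e3:.2f} ms")
```

If the max residual exceeds the tolerance for the task class (10ms quasi-static, 5ms teleoperation, 1ms dynamic), the rig fails validation and collection should not proceed.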
Calibration Across Modalities
Multi-sensor calibration is the process of establishing the spatial and temporal relationships between all sensors in the collection rig. Without accurate cross-modal calibration, data from different sensors cannot be meaningfully combined — a point detected at pixel (320, 240) in the camera cannot be expressed in the robot's coordinate frame, and force-torque readings cannot be associated with visual contact locations.
Camera intrinsic calibration establishes each camera's internal parameters: focal length, principal point, and distortion coefficients. Use a ChArUco board (which provides both checkerboard corners and ArUco marker identification) with 30-50 images spanning the full field of view. Target reprojection error below 0.3 pixels for manipulation tasks, below 0.15 pixels for precision assembly. Intrinsic calibration is stable over time unless the camera is physically damaged, but should be verified monthly.
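The reprojection-error target can be checked with a bare pinhole projection; distortion terms are omitted for brevity, and the intrinsics and corner points below are hypothetical rather than actual D405 values:

```python
import numpy as np

def reprojection_rms(K, points_3d, detected_px):
    """RMS reprojection error in pixels for a pinhole model: project
    camera-frame 3D points through intrinsics K and compare against
    the detected corner pixel locations."""
    proj = (K @ points_3d.T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.sqrt(np.mean(np.sum((proj - detected_px) ** 2, axis=1)))

# Hypothetical intrinsics and 40 board corners 0.3-0.6 m from the
# camera, with detections perturbed by ~0.15 px of noise.
K = np.array([[390.0,   0.0, 320.0],
              [  0.0, 390.0, 240.0],
              [  0.0,   0.0,   1.0]])
rng = np.random.default_rng(6)
pts = rng.uniform([-0.1, -0.1, 0.3], [0.1, 0.1, 0.6], size=(40, 3))
ideal = (K @ pts.T).T
ideal = ideal[:, :2] / ideal[:, 2:3]
detected = ideal + rng.normal(0, 0.15, ideal.shape)

rms = reprojection_rms(K, pts, detected)
print(f"reprojection RMS: {rms:.2f} px "
      f"({'OK' if rms < 0.3 else 'recalibrate'} at 0.3 px target)")
```

In practice the projection would come from a full calibration solver (e.g. OpenCV's ChArUco pipeline) that also estimates distortion; the check above only illustrates how the 0.3-pixel threshold is evaluated.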
Camera extrinsic calibration (hand-eye) determines the rigid transform between each camera and the robot base frame. For a wrist-mounted camera (eye-in-hand), this is the transform from camera optical frame to robot flange frame. For an external camera (eye-to-hand), this is the transform from camera frame to robot base frame. Both require the robot to move to 15-20 poses while observing a calibration target. The solver (Tsai-Lenz, Park-Martin, or Daniilidis dual quaternion method) estimates the unknown transform. Verify by commanding the robot to touch a known point and checking that the camera-predicted point matches. Translational error should be below 2mm; rotational error below 0.5 degrees.
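The touch-point verification reduces to a single transform check. The transform and measurements below are hypothetical values chosen only to illustrate the pass/fail logic:

```python
import numpy as np

def transform_point(T, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    return (T @ np.append(p, 1.0))[:3]

# Hypothetical eye-to-hand verification: T_base_cam is the calibrated
# camera-to-base transform; the camera detects a point that the robot
# then physically touches, reporting its position in the base frame.
T_base_cam = np.array([
    [0.0, -1.0, 0.0,  0.40],
    [1.0,  0.0, 0.0, -0.10],
    [0.0,  0.0, 1.0,  0.55],
    [0.0,  0.0, 0.0,  1.00],
])
p_cam = np.array([0.05, 0.12, 0.48])             # point in camera frame (m)
p_base_pred = transform_point(T_base_cam, p_cam)
p_base_touch = np.array([0.280, -0.051, 1.031])  # robot-reported touch (m)

err_mm = np.linalg.norm(p_base_pred - p_base_touch) * 1000
print(f"translational error: {err_mm:.2f} mm "
      f"({'PASS' if err_mm < 2.0 else 'FAIL'} at 2 mm threshold)")
```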
Force-torque orientation calibration determines how the sensor's measurement frame aligns with the robot's tool frame. A misaligned force-torque frame causes force readings to be reported in the wrong directions — what the sensor reports as Fz might have a component of the actual Fx. Calibrate by applying known forces (gravity on a known mass attached to the sensor) in multiple orientations and solving for the rotation matrix between sensor frame and tool frame.
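One way to solve for that rotation is the Kabsch (SVD) method over paired gravity vectors. The sketch below uses synthetic data; the 0.5kg mass, 4-degree mounting error, and noise level are illustrative assumptions:

```python
import numpy as np

def solve_sensor_to_tool_rotation(f_sensor, f_tool):
    """Kabsch/SVD solution for the rotation R with f_tool ~= R @ f_sensor.
    f_sensor, f_tool: (N, 3) paired force vectors."""
    H = f_sensor.T @ f_tool
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    return Vt.T @ D @ U.T

# Synthetic check: a 0.5 kg mass on the sensor, 8 robot orientations.
# f_tool is gravity expressed in the tool frame (known from kinematics);
# f_sensor is the reading, rotated by the unknown mounting error + noise.
rng = np.random.default_rng(3)
angle = np.deg2rad(4.0)                           # 4 degree mounting error
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
f_tool = rng.normal(size=(8, 3))
f_tool *= 0.5 * 9.81 / np.linalg.norm(f_tool, axis=1, keepdims=True)
f_sensor = f_tool @ R_true + rng.normal(0, 0.01, (8, 3))  # f_tool = R_true @ f_sensor

R_est = solve_sensor_to_tool_rotation(f_sensor, f_tool)
cos_err = np.clip((np.trace(R_est @ R_true.T) - 1) / 2, -1.0, 1.0)
err_deg = np.rad2deg(np.arccos(cos_err))
print(f"rotation recovery error: {err_deg:.2f} deg")
```

With clean data the recovered rotation matches the true mounting error to a small fraction of a degree; a large residual after this fit usually indicates an unmodeled payload offset or a loose sensor mount.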
IMU alignment registers the IMU's measurement frame with the robot's body frame. For a wrist-mounted IMU, this is typically a fixed rotation determined during installation. Verify by commanding known robot motions and comparing the IMU's measured acceleration and angular velocity against the robot's reported joint-derived values.
Cross-modal calibration verification is the final step: confirming that all sensor transforms are consistent. Place a calibration object at a known position in the robot's workspace. The RGB-D camera should report its position to within depth noise. The robot should be able to command a motion to that position using the camera-derived coordinates and arrive within the expected accuracy. The force-torque sensor, when the robot makes contact with the object, should report force in the direction consistent with the camera-derived surface normal. Any discrepancy indicates a calibration error in one or more transforms.
Recalibration frequency depends on the environment. In a temperature-controlled lab with stable mounts, monthly recalibration with weekly verification is sufficient. On a production floor with thermal cycling and vibration, recalibrate extrinsics at the start of every session and verify every 30-60 minutes. Automated verification — checking ArUco marker detection against expected poses — should run continuously during collection.
Using Multimodal Data in Training
Collecting multimodal data is only valuable if the training pipeline can consume it effectively. Current approaches to training on multi-sensor data fall into three broad categories — early fusion, late fusion, and tokenization-based architectures — each with significant practical differences in implementation and performance.
Early fusion combines all modalities into a single representation before the policy network processes them. The simplest approach concatenates feature vectors: ResNet features from the RGB image, a downsampled depth map, the current force-torque reading, and the joint position vector are concatenated into a single input vector. This works for small observation spaces but scales poorly — the network must learn to disentangle modalities that have very different statistical properties. A more sophisticated approach uses modality-specific encoders (CNN for images, MLP for force-torque, MLP for joint states) that produce fixed-size embeddings, which are then concatenated and fed to a shared trunk. This is the architecture used by most behavior cloning and diffusion policy implementations that support multimodal input.
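The modality-specific-encoder pattern can be sketched schematically. Random numpy weights stand in for trained parameters, and all dimensions are illustrative rather than taken from any particular implementation:

```python
import numpy as np

rng = np.random.default_rng(4)

def mlp(x, w1, w2):
    """Two-layer encoder: linear -> ReLU -> linear."""
    return np.maximum(x @ w1, 0.0) @ w2

# One observation: pooled image features, current wrench, joint state.
rgb_feat = rng.normal(size=512)     # e.g. pooled ResNet features
ft_reading = rng.normal(size=6)     # 6-axis wrench
joint_state = rng.normal(size=14)   # 7 positions + 7 velocities

# Modality-specific encoders map heterogeneous inputs to 64-d embeddings.
enc_rgb = mlp(rgb_feat, rng.normal(size=(512, 256)) * 0.05,
              rng.normal(size=(256, 64)) * 0.05)
enc_ft = mlp(ft_reading, rng.normal(size=(6, 32)) * 0.3,
             rng.normal(size=(32, 64)) * 0.1)
enc_joint = mlp(joint_state, rng.normal(size=(14, 32)) * 0.2,
                rng.normal(size=(32, 64)) * 0.1)

# Early fusion: concatenate embeddings, feed a shared trunk that
# predicts a 7-DoF action.
fused = np.concatenate([enc_rgb, enc_ft, enc_joint])
action = mlp(fused, rng.normal(size=(192, 128)) * 0.05,
             rng.normal(size=(128, 7)) * 0.05)
print(fused.shape, action.shape)   # (192,) (7,)
```

The per-modality encoders let each input keep its own normalization and capacity, which is the practical fix for the statistical mismatch that plagues naive concatenation.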
Late fusion processes each modality through independent branches that produce separate action predictions or value estimates, which are then combined through learned attention or averaging. Late fusion allows each branch to specialize — the vision branch learns visual features, the force branch learns contact patterns — and is more robust to individual modality failures (a depth sensor dropout degrades the depth branch but does not corrupt the force branch). The downside is that cross-modal reasoning — understanding that a specific visual pattern predicts a specific force pattern — is harder to learn because the modalities only interact at the final fusion layer.
Tokenization strategies inspired by large language models are gaining traction. RT-2 and similar vision-language-action models tokenize visual observations, proprioceptive states, and actions into a shared token space, enabling transformer architectures to learn cross-modal attention. For multimodal robotics data, this means each modality is encoded into a sequence of tokens — image patch tokens for RGB, depth patch tokens for depth, scalar tokens for force-torque and joint states — and the transformer learns which tokens to attend to for each decision. This approach handles heterogeneous modalities elegantly but requires substantial compute for training.
Different tasks benefit from different modalities. Vision-dominated tasks (object recognition, coarse picking from a bin, navigation) rely primarily on RGB and depth — force-torque adds little because the critical decisions happen before contact. Force-dominated tasks (connector insertion, torque-controlled screw driving, deformable object manipulation) depend heavily on force-torque feedback for the contact-rich phases, with vision providing only the initial approach guidance. Multi-modal tasks (assembly with visual alignment followed by force-controlled insertion, fruit harvesting with visual detection and force-controlled picking) require genuine integration of both vision and force throughout the task sequence.
A practical challenge is multi-rate data loading. Force-torque data at 1000Hz and camera data at 30fps require different handling during batched training. The common approach is to resample all streams to a common rate — either downsampling high-rate streams to match the camera rate or providing a fixed-size window of high-rate samples (e.g., the last 33 force-torque samples, corresponding to one camera frame interval) at each timestep. HDF5 and LeRobot format support multi-rate storage with per-stream metadata, but the dataloader must handle the alignment and windowing logic. RLDS stores episodes as sequences of steps, where each step can contain observations at different rates, making it well-suited for multi-rate training data.
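The fixed-window approach can be sketched as follows, using synthetic streams; the 33-sample window matches one 30fps frame interval at 1000Hz:

```python
import numpy as np

def window_high_rate(cam_t, ft_t, ft_data, window=33):
    """For each camera frame time, gather the `window` most recent
    force-torque samples, zero-padded at the start of the episode."""
    out = np.zeros((len(cam_t), window, ft_data.shape[1]))
    for i, t in enumerate(cam_t):
        end = np.searchsorted(ft_t, t, side="right")   # samples at or before t
        chunk = ft_data[max(0, end - window):end]
        out[i, window - len(chunk):] = chunk
    return out

# Synthetic 2 s episode: 30 fps camera, 1000 Hz force-torque.
cam_t = np.arange(60) / 30
ft_t = np.arange(2000) / 1000
ft_data = np.random.default_rng(5).normal(size=(len(ft_t), 6))

windows = window_high_rate(cam_t, ft_t, ft_data)
print(windows.shape)   # (60, 33, 6): one wrench window per camera frame
```

In a real dataloader this logic runs per episode at batch-assembly time (or is precomputed), and the `searchsorted` lookup is where the timestamp alignment established during collection pays off.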
Teams can inspect how these modalities align in practice using a data explorer that provides synchronized playback of egocentric video, hand pose overlays, object detection, and action segmentation within a single interface — making cross-modal alignment issues visible before they reach the training loop.
Multimodal data collection requires calibrated multi-sensor rigs, hardware-synchronized recording, and careful cross-modal calibration — infrastructure that takes months to build and continuous effort to maintain. Humaid captures RGB-D, force-torque, IMU, hand pose, and joint data — all timestamped to sub-millisecond precision, spatially calibrated across modalities, and delivered in standard formats including MCAP, HDF5, RLDS, and LeRobot. If your team is building policies that need to see, feel, and act, the sensor infrastructure matters as much as the model architecture. Learn about our data collection capabilities.