Teleoperation is the most direct way to generate robot training data. A skilled human operates the robot while sensors record every aspect of the interaction — joint positions, forces, camera views — producing action-labeled trajectories ready for policy learning. No retargeting, no coordinate transforms, no embodiment gap. The operator's commands are the robot's commands, recorded at the joint level.
But the gap between "let someone drive the robot" and "produce consistently high-quality training data at scale" is wider than most teams expect. Operator skill varies enormously. Fatigue degrades demonstrations over time. Interface latency distorts natural task execution. Inconsistent task protocols produce datasets where the same task is performed differently across operators, sessions, and days. And without systematic quality assurance, bad demonstrations contaminate good ones — a behavior cloning model trained on a dataset where 5% of episodes contain erratic behavior will learn to occasionally produce erratic behavior.
This article explains how teleoperation robotics data collection works in practice — from operator setup through recording, quality control, and delivery into training pipelines. The focus is on what makes teleoperation data reliable enough to train policies that deploy in the real world, not just perform well in the lab.
How Teleoperation Data Collection Works
A teleoperation data collection session follows a defined sequence. Each step has specific requirements and common failure points.
Step 1: Operator setup. The operator arrives at the collection station and verifies the hardware state. The robot is powered on and homed. All sensors are active — wrist-mounted RGB-D camera (RealSense D405), external RGB-D cameras (RealSense D435), force-torque sensor (ATI Mini45), joint encoders. The control interface is connected and calibrated. For leader-follower setups, the leader arm's joint offsets are verified against the follower. For VR controllers, the tracking volume is confirmed and controller mapping is tested with a brief free-movement check.
Step 2: Calibration verification. Before recording begins, the operator runs a calibration check episode. This involves moving the robot through a predefined set of poses while a calibration target (ArUco board or checkerboard) is visible to all cameras. The capture system computes reprojection errors and compares them against the last calibration session. If errors exceed the threshold (typically 2mm), recalibration is triggered before proceeding.
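The recalibration trigger reduces to a simple threshold check on mean reprojection error. The sketch below is illustrative: it assumes a known pixel-to-millimeter scale (`mm_per_px`) at the target's depth and uses synthetic corner points, whereas a real pipeline would project target corners through the calibrated camera model and detect them with an ArUco or checkerboard detector.

```python
import numpy as np

def reprojection_error_mm(projected_px, detected_px, mm_per_px):
    """Mean Euclidean reprojection error, converted from pixels to millimeters."""
    err_px = np.linalg.norm(projected_px - detected_px, axis=1)
    return float(err_px.mean() * mm_per_px)

def needs_recalibration(projected_px, detected_px, mm_per_px, threshold_mm=2.0):
    """True if the mean reprojection error exceeds the recalibration threshold."""
    return reprojection_error_mm(projected_px, detected_px, mm_per_px) > threshold_mm

# Synthetic check: detections offset by 1 px at 0.5 mm/px -> 0.5 mm mean error.
proj = np.array([[100.0, 100.0], [200.0, 150.0], [300.0, 120.0]])
det = proj + np.array([1.0, 0.0])
ok = not needs_recalibration(proj, det, mm_per_px=0.5)
```

In practice the scale factor varies with depth, so production systems compare errors in pixels against a per-camera pixel threshold derived from the target's working distance.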
Step 3: Task briefing. The operator reviews the task protocol for the current session. The protocol specifies: what objects are in the workspace, their initial arrangement (randomized or fixed), what the task requires (pick object A, place at location B), what constitutes success (object within 5mm of target, correct orientation), speed constraints (natural pace, no rushing), and what to do on failure (reset and retry, or flag and move on). The briefing includes any updates to the protocol since the operator's last session.
Step 4: Data recording. The operator begins task execution. The capture system records all sensor streams simultaneously — RGB-D at 30fps, joint states at 100Hz+, force-torque at 500Hz, gripper state at 100Hz — all hardware-timestamped. Each task execution is one episode. Between episodes, the workspace is reset according to the protocol (objects repositioned, potentially randomized). The operator signals episode boundaries either manually (button press) or automatically (system detects home position departure and return).
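Because the streams above run at different rates (30fps camera, 100Hz joints, 500Hz force-torque), a common downstream step is aligning the high-rate streams onto the camera frame timestamps. This is a minimal nearest-neighbor alignment sketch, assuming hardware timestamps in seconds; real capture systems may interpolate between samples instead of picking the nearest one.

```python
import numpy as np

def nearest_sample(stream_ts, stream_vals, query_ts):
    """For each query timestamp, pick the nearest-in-time sample from a stream.

    stream_ts must be sorted ascending (hardware timestamps usually are).
    """
    idx = np.searchsorted(stream_ts, query_ts)
    idx = np.clip(idx, 1, len(stream_ts) - 1)
    left, right = stream_ts[idx - 1], stream_ts[idx]
    # Step back one index where the left neighbor is strictly closer.
    idx -= (query_ts - left) < (right - query_ts)
    return stream_vals[idx]

# 500 Hz force stream aligned to 30 fps camera frames (timestamps in seconds).
ft_ts = np.arange(500) * 0.002      # 500 Hz over 1 s
ft_vals = np.sin(ft_ts)             # stand-in for a single F/T channel
cam_ts = np.arange(30) / 30         # 30 fps over 1 s
ft_at_frames = nearest_sample(ft_ts, ft_vals, cam_ts)
```

Nearest-neighbor lookup bounds the alignment error by half the source stream's sample period (1ms for a 500Hz stream), which is usually negligible relative to camera exposure time.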
Step 5: Episode tagging. After each episode, the operator tags it with a preliminary assessment: success, failure (with failure type), or uncertain. This tag is metadata — it does not replace QA review — but it provides useful signal for downstream processing. If the operator experienced interface issues (lag spike, tracking dropout), they note this in the episode metadata.
Step 6: QA review. Completed episodes enter the quality assurance pipeline. Automated checks verify sensor synchronization, calibration consistency, and metadata completeness. Manual review checks protocol adherence, demonstration quality, and annotation accuracy. Episodes that fail QA are rejected with coded reasons that feed back into operator training.
Interface Types and When to Use Each
The teleoperation interface determines the bandwidth and naturalness of the operator's control, which directly affects the quality and variety of the resulting demonstrations. Each interface type has distinct strengths and is suited to specific task categories.
Leader-follower arms (ALOHA-style). In a leader-follower configuration, the operator physically moves a "leader" arm while a "follower" arm mirrors the motion in real time. The leader arm is typically a kinematically identical replica of the follower (same joint structure, same link lengths), providing an intuitive 1:1 mapping. The operator feels the leader arm's dynamics — its weight, inertia, and joint limits — which provides implicit proprioceptive feedback about the follower's capabilities. This interface excels at bimanual manipulation tasks: folding laundry, assembling components with two hands, opening a container with one hand while extracting contents with the other. The bandwidth is high — operators can produce demonstrations with natural speed and fluidity because the control mapping is direct. The ALOHA platform demonstrated that this interface type produces demonstrations suitable for training ACT (Action Chunking with Transformers) policies that achieve real-world task completion. The limitation is physical co-location — the operator must be at the robot station, and the leader arm hardware adds cost and workspace requirements.
VR controllers (e.g., Meta Quest, HTC Vive). VR controllers map 6-DoF hand pose to end-effector commands. The operator wears a VR headset showing the robot's camera feeds (optionally with depth visualization) and moves the controllers to command the end-effector. This interface supports remote operation — the operator can be in a different room or a different building. It is well-suited to single-arm tasks where full bimanual coordination is not required: pick-and-place, tool use, object rearrangement. The control bandwidth is lower than leader-follower because the mapping from controller pose to joint commands involves IK solving, which can introduce latency and singularity issues. Operators also lack direct force feedback, making contact-rich tasks (insertion, assembly) more difficult. However, VR interfaces enable large-scale remote operation — you can hire operators in different locations and collect data from a centralized robot station.
SpaceMouse or joystick. A 6-DoF SpaceMouse or standard gamepad maps axis deflections to end-effector velocities. This is the lowest-bandwidth interface — operators control velocity, not position, which makes precise positioning slow and multi-step tasks tedious. However, SpaceMouse interfaces are inexpensive, simple to set up, and sufficient for tasks with limited dexterity requirements: coarse pick-and-place, pushing, sliding. They are useful for initial prototyping and for collecting demonstrations of simple subtasks within a larger data collection effort.
Haptic devices (e.g., Touch X, Sigma.7). Haptic interfaces provide bidirectional force feedback — the operator feels resistance when the robot contacts an object, compliance when manipulating deformable materials, and constraint forces during insertion tasks. This is the preferred interface for contact-rich manipulation: peg-in-hole insertion, cable routing, snap-fit assembly, and any task where force modulation is the primary skill. The Sigma.7 provides 7-DoF force feedback with gravity compensation, enabling operators to produce demonstrations that include appropriate force profiles — not just position trajectories. The cost and complexity are highest among all interface types, but for tasks where force is the critical variable, no other interface captures the necessary information in the demonstration.
What Gets Recorded: The Teleoperation Data Stack
During a teleoperation session, the capture system records a comprehensive multi-modal data stack. Understanding what is in this stack — and why each stream matters — is essential for both data engineers building the pipeline and ML engineers consuming the output.
Joint positions and velocities are recorded at 100Hz or higher from the robot's joint encoders. These are the primary action labels for behavior cloning: at each timestep, the model learns to predict the joint command given the current observation. Joint positions are typically recorded in radians (revolute joints) or meters (prismatic joints). Joint velocities are either measured directly from encoder derivatives or computed from position differences. For a 7-DoF arm with a 1-DoF gripper, each timestep produces an 8-dimensional action vector.
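Assembling that per-timestep action vector is straightforward; the sketch below assumes a 7-DoF arm with joint targets in radians and a continuous gripper width in meters, which is one common convention rather than a universal one.

```python
import numpy as np

def make_action(joint_positions_rad, gripper_width_m):
    """Concatenate 7 arm joint targets with 1 gripper command into an 8-D action."""
    arm = np.asarray(joint_positions_rad, dtype=np.float32)
    assert arm.shape == (7,), "expecting a 7-DoF arm"
    return np.concatenate([arm, np.array([gripper_width_m], dtype=np.float32)])

action = make_action([0.0, -0.5, 0.3, -1.2, 0.1, 0.8, 0.0], gripper_width_m=0.04)
```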
End-effector 6-DoF pose (position + orientation) is computed via forward kinematics from joint positions but recorded separately for convenience. Policies that operate in task space (Cartesian control) use end-effector pose as both observation and action representation. Quaternions or rotation matrices are preferred over Euler angles for orientation to avoid gimbal lock and discontinuities in the action space.
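The quaternion preference can be made concrete with the standard ZYX Euler-to-quaternion conversion, sketched below in (w, x, y, z) order. Storing the quaternion directly in the dataset sidesteps the wrap-around discontinuities that Euler angles exhibit near ±π; the specific angle values are illustrative only.

```python
import numpy as np

def euler_zyx_to_quat(yaw, pitch, roll):
    """Convert ZYX (yaw-pitch-roll) Euler angles in radians to a unit quaternion."""
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    cr, sr = np.cos(roll / 2), np.sin(roll / 2)
    return np.array([
        cr * cp * cy + sr * sp * sy,   # w
        sr * cp * cy - cr * sp * sy,   # x
        cr * sp * cy + sr * cp * sy,   # y
        cr * cp * sy - sr * sp * cy,   # z
    ])

q = euler_zyx_to_quat(0.3, 0.1, -0.2)
```

A unit-norm check on every stored quaternion is a cheap validation step during format conversion.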
Gripper state includes, at minimum, a binary open/close state, and ideally continuous gripper width (0–100% open) and gripper force (from integrated force sensors or motor current). Gripper state transitions — the moment of grasp closure — are key events in the episode timeline.
Force-torque at the wrist from the ATI Mini45 or equivalent provides 6-axis contact information: Fx, Fy, Fz (forces in Newtons) and Tx, Ty, Tz (torques in Newton-meters). Force-torque data is recorded at 500Hz and downsampled during format conversion. The force profile during grasp, lift, and place phases contains critical information about contact dynamics that camera observations alone cannot capture. For manufacturing robotics data involving assembly tasks, force-torque is often the most informative modality.
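One simple downsampling scheme for the 500Hz force-torque stream is block averaging, sketched below for a 5:1 reduction to 100Hz. This is illustrative: averaging non-overlapping blocks provides crude anti-aliasing, but a production pipeline might apply a proper low-pass filter before decimation.

```python
import numpy as np

def block_downsample(samples, factor):
    """Downsample a (T, channels) stream by averaging non-overlapping blocks."""
    n = (len(samples) // factor) * factor          # drop any trailing partial block
    return samples[:n].reshape(-1, factor, samples.shape[-1]).mean(axis=1)

ft_500hz = np.random.default_rng(0).normal(size=(1000, 6))  # Fx..Tz at 500 Hz
ft_100hz = block_downsample(ft_500hz, factor=5)             # -> 100 Hz
```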
Wrist-camera RGB-D from the RealSense D405 provides a close-up, egocentric view of the hand-object interaction. RGB resolution is typically 640x480 at 30fps (higher resolutions increase storage and processing cost without proportional benefit for most policy architectures). Depth is aligned to the RGB frame and stored as 16-bit millimeter values. The wrist camera moves with the end-effector, providing a consistent viewpoint of the manipulation region regardless of the robot's configuration.
External RGB-D from one or more RealSense D435 cameras provides scene-level context: the full workspace, object positions, obstacles, and the robot arm itself. External cameras are typically mounted on fixed structures (overhead gantry, side tripod) with known extrinsics relative to the robot base frame. Multiple external cameras provide different viewpoints for multi-view training or point cloud fusion.
Operator control inputs — the raw commands from the teleoperation interface (leader arm joint positions, VR controller pose, SpaceMouse axis values) — are recorded alongside the robot state. These provide a record of the operator's intent, which may differ from the robot's actual motion due to dynamic response, safety limits, or singularity avoidance. The delta between operator command and robot execution is useful for analyzing interface-induced artifacts in the demonstrations.
Metadata recorded per episode includes: robot URDF (kinematic model), camera calibration files (intrinsics + extrinsics), force-torque sensor calibration, operator ID, session timestamp, task variant, environment configuration, and protocol version. This metadata is not auxiliary — it is part of the dataset.
Quality Challenges in Teleoperation Data
Teleoperation data quality is not a binary property. It exists on a spectrum, and the factors that degrade quality are systematic, measurable, and addressable — but only if you know what to measure.
Operator learning curves. Every operator requires ramp-up time with a new interface, a new task, and a new robot. An operator's first 20–50 episodes on a novel task are typically lower quality than their subsequent work: slower execution, more hesitation, less consistent approach angles, and higher failure rates. These early episodes are not useless — they may contain useful recovery behaviors — but they should be tagged as ramp-up and treated separately during dataset construction. The practical solution is to allocate explicit warm-up episodes at the start of each new task assignment and discard (or separately tag) the first N episodes per operator-task pair. N depends on task complexity: 10 for simple pick-and-place, 30–50 for precision assembly.
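The warm-up policy above amounts to tagging the first N episodes per operator-task pair. A minimal sketch, assuming episodes arrive in chronological order as dicts with hypothetical `operator` and `task` keys:

```python
from collections import defaultdict

def tag_rampup(episodes, n_warmup):
    """Mark the first n_warmup episodes of each (operator, task) pair as ramp-up."""
    seen = defaultdict(int)
    for ep in episodes:
        key = (ep["operator"], ep["task"])
        ep["rampup"] = seen[key] < n_warmup
        seen[key] += 1
    return episodes

eps = [{"operator": "op1", "task": "pick"} for _ in range(15)]
eps = tag_rampup(eps, n_warmup=10)
n_rampup = sum(e["rampup"] for e in eps)
```

Tagging rather than deleting preserves the option to mine ramp-up episodes later for recovery behaviors.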
Fatigue-induced degradation. Teleoperation is physically demanding. Leader-follower interfaces require the operator to move physical hardware against gravity and inertia for the duration of each session. Even VR controllers cause arm fatigue during extended sessions. The effect on data quality is measurable: task completion time increases, error rates rise, and movement smoothness decreases. Force-torque profiles during grasping become more variable as the operator's fine motor control degrades. Session length limits are mandatory — typically 45–60 minutes with 15-minute breaks. Per-session quality metrics should be monitored to detect within-session degradation: if the last 20% of episodes in a session show statistically different completion times or success rates, the session is too long.
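The within-session check described above can be sketched as a comparison of the last 20% of episodes against the rest. The z-statistic here is illustrative, not a substitute for a properly chosen statistical test, and the completion-time values are synthetic.

```python
import numpy as np

def session_degraded(completion_times, tail_frac=0.2, z_thresh=2.0):
    """Flag a session if the last tail_frac of episodes is markedly slower."""
    t = np.asarray(completion_times, dtype=float)
    k = max(1, int(len(t) * tail_frac))
    head, tail = t[:-k], t[-k:]
    # Standard error of the difference in means, then a simple z-score.
    se = np.sqrt(head.var(ddof=1) / len(head) + tail.var(ddof=1) / len(tail))
    return bool((tail.mean() - head.mean()) / se > z_thresh)

steady = np.array([12.0, 12.1, 11.9] * 20)   # stable completion times (s)
fatigued = steady.copy()
fatigued[-12:] += 3.0                        # last 20% of episodes slow down
```

The same comparison applies to success rate and movement smoothness; any metric that shifts significantly in the session tail is evidence the session ran too long.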
Interface latency. Latency between the operator's command and the robot's response affects demonstration quality in non-obvious ways. At latencies below 20ms, most operators do not perceive delay and their demonstrations reflect natural task execution. Between 20 and 100ms, operators unconsciously slow down and become more cautious, producing demonstrations that are slower and more deliberate than what a well-tuned autonomous policy would execute. Above 100ms, operators adopt explicit move-and-wait strategies — they issue a command, wait for the robot to respond, then issue the next command — producing jerky, non-smooth trajectories that are poor supervision signal for continuous control policies. The latency of the teleoperation system should be measured and documented as metadata. If latency is unavoidable (remote operation over network), the training pipeline should account for it — either through post-processing to smooth trajectories or through policy architectures that are robust to temporal discretization.
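Since operator commands and robot state are both recorded (see the data stack above), end-to-end latency can be estimated offline by cross-correlating the two signals. This sketch uses synthetic 100Hz signals with a known 50ms delay; on real data you would correlate a commanded joint or end-effector channel against its measured counterpart.

```python
import numpy as np

def estimate_latency(cmd, resp, dt):
    """Estimate command->response delay via cross-correlation of two 1-D signals."""
    cmd = cmd - cmd.mean()
    resp = resp - resp.mean()
    xcorr = np.correlate(resp, cmd, mode="full")
    # In 'full' mode, index (len(cmd) - 1) corresponds to zero lag.
    lag = int(np.argmax(xcorr)) - (len(cmd) - 1)
    return lag * dt

t = np.arange(0, 2, 0.01)                  # 2 s at 100 Hz
cmd = np.sin(2 * np.pi * 1.0 * t)          # stand-in for a commanded channel
resp = np.roll(cmd, 5)                     # response delayed by 5 samples (50 ms)
latency_s = estimate_latency(cmd, resp, dt=0.01)
```

Logging this estimate per session makes latency-induced artifacts traceable during dataset analysis.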
Inter-operator consistency. Different operators develop different strategies for the same task. One operator might approach objects from the top; another from the side. One might use a power grasp; another a precision pinch. This variability can be beneficial — it exposes the model to multiple valid strategies — or harmful — it confuses the model with conflicting demonstrations of the same task. The solution is not to eliminate variability but to measure and control it. Inter-operator consistency metrics compare trajectory distributions across operators for the same task variant. If operators are producing qualitatively different strategies, the protocol should specify which strategy to use, or the dataset should be stratified by strategy type so models can be trained on consistent subsets. Human-in-the-loop data collection systems should include feedback mechanisms that align operators toward consistent execution while preserving natural variation within the desired strategy.
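One simple consistency metric is the distance between two operators' mean end-effector trajectories for the same task variant, after time-normalizing each episode to a fixed number of steps. The sketch below uses synthetic top-down versus side-on approach trajectories; real metrics might instead compare full trajectory distributions (e.g., with dynamic time warping).

```python
import numpy as np

def strategy_divergence(trajs_a, trajs_b):
    """Mean distance between two operators' average end-effector paths.

    trajs_*: arrays of shape (episodes, timesteps, 3), time-normalized.
    """
    mean_a = np.mean(trajs_a, axis=0)
    mean_b = np.mean(trajs_b, axis=0)
    return float(np.linalg.norm(mean_a - mean_b, axis=1).mean())

rng = np.random.default_rng(2)
top_down = rng.normal([0.0, 0.0, 0.3], 0.01, size=(20, 50, 3))  # approach from above
side_on = rng.normal([0.0, 0.3, 0.0], 0.01, size=(20, 50, 3))   # approach from the side
d = strategy_divergence(top_down, side_on)
```

A divergence well above the within-operator trajectory spread signals two distinct strategies that should either be unified by protocol or stratified in the dataset.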
From Teleop Data to Robot Policies
Teleoperation data's ultimate purpose is to train robot policies — learned functions that map observations to actions. The choice of policy architecture determines how the data is consumed and what format it must be delivered in.
Behavior cloning (BC) is the most straightforward approach. A neural network is trained with supervised learning to predict the action (joint command) at each timestep given the current observation (camera image, proprioceptive state). Teleoperation data is ideal for BC because the action labels are exact — they are the literal joint commands that were executed, not inferred or retargeted values. The primary challenge with BC is compounding errors: small prediction errors at each timestep accumulate during rollout, causing the policy to visit states not represented in the training data. Larger datasets with more diverse demonstrations mitigate this by covering more of the state space. BC datasets are typically delivered in HDF5 format with observation and action arrays aligned per timestep.
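The BC objective is plain supervised regression from observations to actions. As a toy illustration of the objective (real BC uses deep networks over images and proprioception, not a linear map), the sketch below fits a linear policy by gradient descent on MSE; all dimensions and data are synthetic.

```python
import numpy as np

def bc_train_linear(obs, actions, lr=0.1, steps=500):
    """Fit a linear policy action = W @ obs by MSE gradient descent (toy BC)."""
    W = np.zeros((actions.shape[1], obs.shape[1]))
    for _ in range(steps):
        pred = obs @ W.T
        grad = 2 * (pred - actions).T @ obs / len(obs)   # d(mean MSE)/dW
        W -= lr * grad
    return W

rng = np.random.default_rng(3)
obs = rng.normal(size=(256, 4))      # stand-in for proprioceptive features
W_true = rng.normal(size=(8, 4))     # 8-D action (7 joints + gripper)
actions = obs @ W_true.T             # exact action labels, as in teleop data
W_hat = bc_train_linear(obs, actions)
mse = float(np.mean((obs @ W_hat.T - actions) ** 2))
```

The key property teleoperation preserves is visible in the setup: `actions` are exact labels, so the only error sources are model capacity and state-space coverage, not label noise.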
Diffusion Policy treats action prediction as a denoising diffusion process (DDPM). Instead of predicting a single action, the model learns to denoise a sequence of future actions conditioned on the current observation. This approach handles multimodal action distributions — situations where multiple valid actions exist for the same observation — which is common in teleoperation data collected from multiple operators. Diffusion Policy operates on action chunks (sequences of 8–32 future actions), which provides temporal consistency and reduces compounding errors. The training data format requires observation-action pairs where actions are represented as contiguous future sequences, not single timesteps. LeRobot format is well-suited for Diffusion Policy training, as the LeRobot dataloader natively supports action chunking and observation history stacking.
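Preparing chunked training targets from per-timestep action arrays is a simple windowing operation, sketched below; the chunk length of 16 is an illustrative value within the 8–32 range mentioned above.

```python
import numpy as np

def make_chunks(actions, chunk_len):
    """Slice a (T, action_dim) episode into overlapping future-action chunks.

    Returns shape (T - chunk_len + 1, chunk_len, action_dim); chunk[t] holds
    the chunk_len actions starting at timestep t.
    """
    T = len(actions)
    idx = np.arange(T - chunk_len + 1)[:, None] + np.arange(chunk_len)[None, :]
    return actions[idx]

episode = np.arange(100, dtype=np.float32).reshape(100, 1)  # toy 1-D actions
chunks = make_chunks(episode, chunk_len=16)
```

Frameworks like LeRobot perform this windowing in the dataloader, along with padding strategies for chunks that would run past the episode end.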
ACT (Action Chunking with Transformers) uses a transformer architecture with a conditional VAE to model the action distribution. Like Diffusion Policy, ACT predicts action chunks rather than single actions. The CVAE component enables the model to capture multimodal behavior — the encoder compresses a demonstration's action chunk into a latent code, and the decoder generates actions conditioned on both the observation and a sampled latent. ACT was developed specifically for the ALOHA leader-follower teleoperation setup and has demonstrated strong performance on bimanual manipulation tasks. Training data for ACT is delivered in HDF5 format with specific observation and action key conventions matching the ACT codebase expectations.
Action-space consistency between data collection and deployment is a critical requirement that is easily overlooked. If teleoperation data is collected in joint space (recording raw joint positions from encoders) but the deployment policy outputs Cartesian velocities, there is an action space mismatch that requires conversion. This conversion introduces errors — IK solutions may differ from the original motion, singularities create discontinuities, and joint limits may be violated. The cleanest approach is to collect and deploy in the same action representation. If that is not possible, the conversion must be validated to ensure that converted actions reproduce the original trajectories to within an acceptable error tolerance.
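The validation step can be sketched as: run both the original and the converted joint trajectories through forward kinematics and check the end-effector paths agree within tolerance. A toy planar 2-link arm stands in for the real kinematic model here; link lengths and the injected conversion error are illustrative.

```python
import numpy as np

def fk_planar_2link(q, l1=0.3, l2=0.25):
    """End-effector (x, y) for a planar 2-link arm (toy kinematic model)."""
    x = l1 * np.cos(q[..., 0]) + l2 * np.cos(q[..., 0] + q[..., 1])
    y = l1 * np.sin(q[..., 0]) + l2 * np.sin(q[..., 0] + q[..., 1])
    return np.stack([x, y], axis=-1)

def validate_conversion(q_original, q_converted, tol_m=0.001):
    """Max end-effector deviation between trajectories, and a pass/fail flag."""
    err = np.linalg.norm(
        fk_planar_2link(q_original) - fk_planar_2link(q_converted), axis=-1)
    return float(err.max()), bool(err.max() <= tol_m)

q = np.stack([np.linspace(0, 1, 50), np.linspace(0.5, 1.2, 50)], axis=-1)
q_noisy = q + 1e-4                      # conversion introducing ~0.1 mrad error
max_err, ok = validate_conversion(q, q_noisy)
```

With the real robot, the same check runs through the URDF's forward kinematics, and the tolerance is set by the task's precision requirement.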
Format requirements for different training frameworks: RLDS for TensorFlow-based training (RT-X, Octo), LeRobot format for the LeRobot training library (Diffusion Policy, ACT), and HDF5 for custom PyTorch training loops. The teleoperation data collection pipeline must support delivery in all target formats, as research teams frequently experiment with multiple policy architectures on the same dataset. Format conversion should be deterministic and reproducible — the same MCAP source episodes should produce identical HDF5/RLDS/LeRobot outputs every time the conversion is run.
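Determinism of the conversion is easy to verify with a content fingerprint over the output arrays, sketched below: two runs of the same conversion must produce identical hashes regardless of the order in which arrays were written. The array names are hypothetical placeholders for real dataset keys.

```python
import hashlib
import numpy as np

def dataset_fingerprint(arrays):
    """SHA-256 over canonical byte encodings of named output arrays."""
    h = hashlib.sha256()
    for name in sorted(arrays):              # canonical key order
        a = np.ascontiguousarray(arrays[name])
        h.update(name.encode())
        h.update(str(a.dtype).encode())
        h.update(str(a.shape).encode())
        h.update(a.tobytes())
    return h.hexdigest()

out1 = {"actions": np.arange(10.0), "obs": np.ones((10, 3))}
out2 = {"obs": np.ones((10, 3)), "actions": np.arange(10.0)}  # same data, new run
same = dataset_fingerprint(out1) == dataset_fingerprint(out2)
```

Storing the fingerprint alongside the delivered HDF5/RLDS/LeRobot artifacts lets consumers confirm they received exactly the output a given MCAP source and converter version should produce.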
Before format conversion, teams can preview the raw teleoperation recordings in Humaid's data explorer — a web-based tool for browsing MCAP episodes with synchronized video, hand pose, action segmentation, and sensor overlays — to verify that the collected data meets quality expectations before committing it to the training pipeline.
Teleoperation data collection at scale requires trained operators, calibrated hardware, and systematic QA — not just a robot with a control interface. The difference between a demo dataset and a production dataset is the infrastructure between the operator and the training pipeline.
Humaid operates dedicated teleop collection stations with leader-follower arms, calibrated multi-sensor rigs, and experienced operators — ready to produce the trajectories your models need in HDF5, RLDS, or LeRobot format.