You have built a model architecture. You have tested it in simulation. Your behavior cloning pipeline produces policies that reliably pick up cubes in MuJoCo. Now you need real-world data — and the gap between "we need data" and "we have usable datasets" is where most robotics programs stall. It is not a gap of knowledge; it is a gap of infrastructure. Collecting real-world robot training data requires coordinating hardware, operators, environments, protocols, annotation, quality control, and delivery into a repeatable pipeline that produces consistent, high-quality episodes.
This guide covers every step of that pipeline, from the initial requirements definition through delivery of annotated datasets in formats your training code can consume. Whether you are building your first real-world dataset or scaling an existing collection effort, these are the engineering decisions that determine whether your data accelerates your models or becomes an expensive bottleneck. For teams evaluating their options, this is also a practical framework for deciding when to build collection infrastructure in-house versus using a robotics data collection platform.
Step 1: Define Your Data Requirements
Every successful data collection campaign starts with a precise requirements specification. Under-specifying requirements is the single most common cause of unusable robotics datasets. A request for "manipulation data" is not actionable. A request for "1500 teleoperation episodes of single-object pick-and-place from a cluttered bin to a target zone, using a Franka Panda with a Robotiq 2F-85 gripper, recorded with wrist-mounted RealSense D405 and two external RealSense D435 cameras, at 30fps RGB-D with 100Hz joint states, in RLDS format" is actionable.
The requirements specification must cover:
- Task definition: What manipulation or navigation task. Be specific about objects, start conditions, success criteria, and acceptable failure modes.
- Environment: The physical setting — factory floor, kitchen, warehouse shelf, lab bench. Specify lighting conditions, spatial constraints, and any fixtures or background elements.
- Sensor modalities: Which cameras (model, resolution, frame rate, mounting position), proprioceptive sensors (joint encoders, gripper state), force-torque sensors, IMUs, and any additional modalities like hand pose tracking or motion capture.
- Data format: HDF5, RLDS, LeRobot, or custom schema. Specify the exact observation and action spaces, including units and coordinate frames.
- Scale: How many episodes, with what distribution across task variants, object types, and environmental conditions.
A manipulation task needs fundamentally different data than a navigation task. Bin picking requires dense coverage of object poses, grasp approach angles, and handling of entangled or partially occluded objects. Mobile navigation requires diverse trajectories through varying obstacle configurations, lighting conditions, and floor surfaces. Defining requirements precisely is what prevents collecting a thousand episodes that turn out to be unusable because the camera resolution was too low or the joint state recording rate was insufficient for the policy's action frequency.
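One way to keep a requirements specification actionable is to make it machine-checkable. The sketch below encodes the example spec from above as Python dataclasses with a small validation pass; the class and field names are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class CameraSpec:
    model: str
    resolution: tuple  # (width, height) in pixels
    fps: int
    mount: str         # e.g. "wrist", "external_left"

@dataclass
class CollectionSpec:
    task: str
    episodes: int
    robot: str
    gripper: str
    cameras: list
    joint_state_hz: int
    data_format: str   # "hdf5" | "rlds" | "lerobot"

    def validate(self):
        """Return a list of problems; empty means the spec is collectible."""
        problems = []
        if self.episodes <= 0:
            problems.append("episode count must be positive")
        if not self.cameras:
            problems.append("at least one camera is required")
        elif self.joint_state_hz < max(c.fps for c in self.cameras):
            # Assumption: joint states should be recorded at least as fast
            # as the fastest camera, or downstream alignment degrades.
            problems.append("joint state rate below camera frame rate")
        return problems

spec = CollectionSpec(
    task="single-object pick-and-place, cluttered bin to target zone",
    episodes=1500,
    robot="Franka Panda",
    gripper="Robotiq 2F-85",
    cameras=[CameraSpec("RealSense D405", (640, 480), 30, "wrist"),
             CameraSpec("RealSense D435", (848, 480), 30, "external_left"),
             CameraSpec("RealSense D435", (848, 480), 30, "external_right")],
    joint_state_hz=100,
    data_format="rlds",
)
assert spec.validate() == []
```

A spec object like this can be committed alongside the protocol document and checked automatically before a session starts.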
Step 2: Design the Collection Protocol
The collection protocol translates requirements into operational instructions. It is the document that every operator follows, every session manager references, and every QA reviewer checks against. Protocol rigor directly determines data quality — loosely specified protocols produce high-variance data that confuses models, while overly rigid protocols produce data that lacks the diversity needed for generalization.
A well-designed protocol includes task decomposition: breaking the overall task into phases that correspond to the temporal annotation schema. For a bin-picking task, the phases might be: scan (visual inspection of bin contents), plan (select target object and grasp approach), approach (move end-effector to pre-grasp pose), grasp (close gripper on object), extract (lift object from bin, handling entanglement), transport (move to target zone), place (deposit object), and retract (return to home position). Each phase has defined entry and exit criteria.
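The phase decomposition above can be expressed as a transition table, which doubles as a check on annotated sequences. This is a sketch under one assumption not stated in the protocol: a failed grasp transitions back to approach for a retry.

```python
# Bin-picking phases in protocol order.
PHASES = ["scan", "plan", "approach", "grasp", "extract",
          "transport", "place", "retract"]

# Default linear flow, plus a grasp -> approach retry loop (assumption).
ALLOWED = {p: {n} for p, n in zip(PHASES, PHASES[1:])}
ALLOWED["grasp"].add("approach")
ALLOWED["retract"] = set()  # terminal phase

def valid_sequence(seq):
    """Check a labeled phase sequence against the transition table."""
    if not seq or seq[0] != "scan":
        return False
    for cur, nxt in zip(seq, seq[1:]):
        if nxt not in ALLOWED.get(cur, set()):
            return False
    return seq[-1] == "retract"

assert valid_sequence(PHASES)  # nominal episode
assert valid_sequence(["scan", "plan", "approach", "grasp", "approach",
                       "grasp", "extract", "transport", "place", "retract"])
```

Encoding entry/exit criteria this way lets QA reject out-of-order annotations mechanically rather than by eye.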
Operator instructions specify how to control the robot for each phase, what speed range is acceptable, how to handle common edge cases (object falls during transport, grasp fails on first attempt, two objects are stuck together), and when to abort and restart an episode versus continuing. The instructions must balance completeness with usability — a twenty-page document that operators do not read is worse than a concise document that they follow consistently.
Environment setup specifications define how to configure the workspace before each session and between episodes: object set, randomization procedure for object placement, fixture positions, lighting configuration, and any environmental controls (temperature for temperature-sensitive materials, humidity for food handling). Calibration requirements specify when to run calibration (at minimum, at session start and after any hardware adjustment) and what calibration procedure to follow (checkerboard for camera intrinsics, hand-eye calibration for extrinsics, fiducial verification for ongoing drift monitoring).
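For the fiducial-based drift monitoring mentioned above, a minimal check is to compare the fiducial's measured position in the camera frame against the position recorded at the last full calibration. The 2 mm threshold below is an illustrative assumption; tune it per rig and task tolerance.

```python
import numpy as np

DRIFT_THRESHOLD_M = 0.002  # 2 mm of apparent fiducial motion (assumed threshold)

def calibration_drifted(reference_xyz, measured_xyz, threshold=DRIFT_THRESHOLD_M):
    """Return True if the camera extrinsics have likely drifted and a
    recalibration should be triggered before collection continues."""
    delta = np.asarray(measured_xyz, dtype=float) - np.asarray(reference_xyz, dtype=float)
    return float(np.linalg.norm(delta)) > threshold

# 0.5 mm of apparent motion: within tolerance.
assert not calibration_drifted([0.40, 0.10, 0.55], [0.4005, 0.10, 0.55])
# 4 mm of apparent motion: flag for recalibration.
assert calibration_drifted([0.40, 0.10, 0.55], [0.404, 0.10, 0.55])
```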
Step 3: Set Up the Sensor Stack
The sensor stack must capture every modality specified in the requirements with sufficient quality, frame rate, and synchronization. Camera selection depends on the task: Intel RealSense D405 is a common choice for wrist-mounted cameras due to its small form factor and close-range depth performance (accurate from 7cm). RealSense D435 or D455 work well as external cameras with a wider field of view and longer depth range. For tasks requiring higher resolution or global shutter (fast-moving objects, strobe lighting), industrial cameras from FLIR or Basler may be necessary.
Camera mounting must be rigid and repeatable. A wrist-mounted camera should be attached with a machined bracket, not tape or zip ties — any shift in the camera-to-end-effector transform invalidates the extrinsic calibration and corrupts all spatial labels. External cameras should be mounted on fixed structures (not tripods that can be bumped) with known positions relative to the robot base.
Force-torque sensors (ATI Mini45, Robotiq FT 300) provide contact dynamics that camera-only setups cannot capture. For tasks involving insertion, assembly, or any contact-rich manipulation, force-torque data is not optional — it is the signal that allows the policy to learn appropriate force modulation. Mount the sensor between the robot flange and the gripper to capture the full interaction force.
Synchronization is achieved through hardware triggering (a shared trigger signal that initiates capture on all devices simultaneously) or PTP (Precision Time Protocol) network synchronization. Software timestamps from ROS message headers are insufficient for precise synchronization — they introduce variable latency of 1-10ms depending on CPU load, which corrupts the temporal alignment between high-rate streams like force-torque (500Hz) and joint states (100Hz). Record all data into MCAP files with per-channel hardware timestamps and the clock synchronization method documented in session metadata.
Step 4: Train Your Operators
Operator quality directly determines data quality — this is the most underappreciated factor in robotics data collection. An untrained operator teleoperating a bin-picking task will produce demonstrations with inconsistent approach angles, excessive hesitation, unnecessary retries, and task execution strategies that vary randomly from episode to episode. The resulting dataset has a high-variance distribution that behavior cloning policies struggle to learn from, and the waste rate (episodes rejected during QA) can exceed 40%.
Operator training follows a structured progression. First, interface familiarization: the operator learns the teleoperation control interface — whether leader-follower, VR, or SpaceMouse — until they can control the robot smoothly without conscious effort. This typically requires 2-4 hours depending on the interface and the operator's prior experience. Second, task-specific training: the operator practices the specific task following the protocol, receiving feedback on their execution until they achieve the target success rate (typically 85%+ for the task to be collectible) with consistent strategy.
Calibration episodes measure operator readiness and ongoing consistency. Before an operator is cleared for production collection, they complete a set of standardized episodes that are scored against the protocol specification. If their performance metrics — success rate, average episode duration, strategy consistency — fall below threshold, they continue training. During production, periodic calibration episodes (every 50-100 production episodes) monitor for drift.
The difference between a crowd worker and a trained operator is substantial. Crowd workers optimize for speed and episode count. Trained operators optimize for consistency and protocol adherence. For teleoperation data collection that will train real policies, the operator's skill is directly encoded in the training signal — every imprecision, hesitation, and inconsistency in their demonstrations becomes part of what the policy learns.
Step 5: Collect On-Site
Collecting in the actual target environment — or as close to it as possible — is essential for closing the domain gap. A manipulation policy for a warehouse pick station should be trained on data collected at a warehouse pick station, with the same shelving, bins, lighting, and objects. Collecting in a lab with studio lighting and a clean table produces a domain gap just as real as the sim-to-real gap.
On-site collection requires session management: a structured schedule of collection sessions with defined start and end times, calibration procedures at session start, operator rotation to prevent fatigue, and environment reconfiguration between task variants. Each session is logged with its metadata: date, time, operator, station, protocol version, calibration state, and any deviations from the standard configuration.
Real-time quality monitoring catches problems during collection rather than after. A monitoring system checks sensor streams in real time — flagging frame drops, synchronization errors, or calibration anomalies as they occur. This allows operators and session managers to address issues immediately rather than discovering them during post-collection QA, when the only remedy is re-collection.
Batch management organizes episodes into coherent batches that correspond to specific task variants, environmental conditions, or object sets. A batch of 200 bin-picking episodes with automotive connectors is a meaningful unit that can be annotated, QA'd, and delivered independently. This modularity allows parallel processing and incremental delivery — the first batch can enter annotation while the second batch is still being collected.
Step 6: Annotate and QA
Annotation transforms raw sensor recordings into structured training data. Every episode receives temporal segmentation into action phases (approach, grasp, lift, transport, place, release) with frame-precise boundaries. Each phase receives an action label from the task-specific taxonomy. Grasp events receive grasp type classification. Objects receive identity labels tracked across frames and episodes. Each episode receives a success/failure flag with failure mode classification for failed episodes.
QA runs in parallel with annotation. Automated checks flag episodes with calibration drift, frame synchronization errors, sensor dropout, or metadata gaps before annotation begins — there is no point annotating corrupted data. Annotation consistency checks verify temporal coverage (no gaps or overlaps in phase segmentation), valid phase sequencing, and cross-stream agreement (force signals confirm contact during labeled grasp phases).
Manual expert review samples a configurable percentage of annotated episodes (typically 15-25%) and evaluates protocol adherence, annotation accuracy, and semantic correctness. Episodes that fail review are returned to annotation with specific correction instructions. The manual review sampling rate can be adjusted based on annotator reliability — experienced annotators with high historical agreement scores require less frequent review. This pipeline ensures that the manufacturing robotics data or other task-specific data that exits QA meets a consistent quality bar.
Step 7: Deliver in a Usable Format
The delivery stage converts annotated, QA-passed episodes into the format your training pipeline consumes. This is not just file format conversion — it is the interface between your data infrastructure and your model training code.
HDF5 is the most flexible option: each episode is an HDF5 file with datasets for each observation modality (images, joint states, force-torque), actions, and annotation metadata. Custom dataloaders read directly from HDF5 with minimal preprocessing. RLDS (Reinforcement Learning Datasets) packages episodes as TensorFlow datasets with standardized field names, making them compatible with the Open X-Embodiment data pipeline and TensorFlow-based training codebases. LeRobot format stores episodes in a structure optimized for the Hugging Face LeRobot library, with built-in support for action chunking transformers and diffusion policy training.
Beyond format conversion, delivery includes loading utilities: Python scripts or library wrappers that allow a training engineer to instantiate a dataset with a single function call, specifying which observation modalities to load, which episodes to include (filtered by task variant, success flag, or collection date), and any preprocessing (image resizing, normalization, coordinate frame transforms). It includes dataset documentation: a datasheet specifying sensor specifications, annotation schema with definitions, episode counts by task variant and success/failure, known limitations, and recommended train/validation/test splits.
Versioning is critical for ongoing collection campaigns. As new batches are collected, annotated, and delivered, the dataset version increments. Each version has a changelog documenting additions, re-annotations, and removals. Training experiments reference a specific dataset version for reproducibility. Pipeline integration scripts handle incremental updates — appending new episodes to an existing training dataset without reprocessing the entire collection. For teams working with egocentric data collection for robotics or any other modality, this versioning infrastructure prevents the chaos of undocumented dataset mutations.
Once datasets are delivered, a data explorer provides a web-based interface for browsing episodes, inspecting synchronized sensor streams, and verifying that annotations and metadata are correct before data enters the training pipeline. This inspection step catches delivery-stage issues — format mismatches, missing channels, annotation drift — that would otherwise surface as unexplained training regressions.
Build vs. Outsource: When to Use a Collection Platform
The honest analysis: building in-house makes sense in specific circumstances. If your team has unique hardware that no external provider supports, if your total data need is small (under 500 episodes), if you have available operator bandwidth on your research team, and if your task is a one-time collection rather than an ongoing campaign — then in-house collection may be the most efficient path. You control every detail, iterate quickly, and avoid coordination overhead.
A dedicated collection platform becomes the better choice when the conditions change. Scale: when you need thousands of episodes, the operational logistics of multi-station collection with trained operators, session scheduling, and batch management exceed what a research team can manage alongside model development. Multi-site collection: when your model needs data from diverse environments — three different warehouse layouts, five different kitchen configurations — replicating calibrated collection infrastructure at each site is expensive and error-prone. Specialized environments: when you need data from a specific industrial setting (food processing line, electronics assembly station) that you cannot replicate in your lab.
Ongoing campaigns are perhaps the strongest argument for a platform. Most real products require continuous data collection — new object types, new environment configurations, new task variants — over months or years. Maintaining a standing collection operation with trained operators, calibrated hardware, and an annotation pipeline is a permanent cost. A platform amortizes that cost across customers and provides collection on demand. For teams focused on model development and deployment, the calculation is straightforward: does the time your engineers spend on data infrastructure produce more value than the time they would spend improving model architecture, training procedures, and deployment systems?
Collecting real-world robot training data is an infrastructure problem, not a one-time project. Humaid handles the full pipeline — protocol design, sensor setup, operator training, on-site collection, annotation, QA, and delivery in HDF5, RLDS, or LeRobot format — so your team can focus on what it does best: building the models that bring robots into the real world.