Blog · 2026-03-24 · 12 min read

Building a Robotics Data Pipeline: From Sensors to Training Data

By Humaid Team

A robotics data pipeline is the infrastructure between your sensors and your model's training loop. Most teams underestimate it — they treat data as a one-time collection problem, not an ongoing engineering system. The result: brittle scripts, inconsistent formats, and months of debugging data issues instead of training models.

The pipeline problem is deceptively simple on paper. You have sensors producing data. You have models that consume data. Connect the two. In practice, the space between sensor output and model input contains dozens of failure modes — desynchronized timestamps, uncalibrated depth streams, missing metadata, annotation inconsistencies, format incompatibilities — each capable of silently degrading your training data and, by extension, your policy performance. A team that spends three months collecting demonstrations with an ATI Mini45 force-torque sensor only to discover that their force readings drifted due to thermal expansion has not just lost three months of collection time. They have lost the downstream training runs, the evaluation cycles, and the deployment timeline that depended on that data.

This article walks through each layer of a production robotics data pipeline — from raw sensor capture through annotation, quality assurance, and delivery into training formats like RLDS, HDF5, and LeRobot. The goal is not to describe a theoretical ideal but to document the architecture that works when you need thousands of episodes across multiple environments, operators, and task variants.

Pipeline Architecture Overview

A production robotics data pipeline consists of six layers, each with defined inputs, outputs, and failure modes. Understanding the architecture before building any component prevents the most common mistake: optimizing one layer while creating bottlenecks in another.

Layer 1: Sensor Layer — Inputs: physical environment, robot state. Outputs: raw sensor streams (RGB-D frames, joint encoders, force-torque readings, IMU data). Failure modes: miscalibration, hardware faults, electromagnetic interference on force-torque sensors.

Layer 2: Capture Layer — Inputs: raw sensor streams. Outputs: synchronized, timestamped episode recordings in MCAP format with metadata sidecars. Failure modes: timestamp drift between streams, dropped frames under CPU load, incorrect episode boundaries.

Layer 3: Storage Layer — Inputs: episode recordings. Outputs: organized, versioned, backed-up datasets on object storage. Failure modes: data loss from single-disk storage, version confusion when episodes are re-collected, orphaned files from interrupted uploads.

Layer 4: Annotation Layer — Inputs: stored episodes. Outputs: temporal segmentation labels, action annotations, success/failure flags, object identity tags. Failure modes: annotator disagreement on phase boundaries, inconsistent grasp type classification, mislabeled failure modes.

Layer 5: QC Layer — Inputs: annotated episodes. Outputs: validated episodes with quality scores, rejected episodes with coded rejection reasons. Failure modes: QC checks that are too permissive (letting bad data through) or too aggressive (rejecting borderline but useful episodes).

Layer 6: Delivery Layer — Inputs: QC-passed episodes. Outputs: model-ready datasets in HDF5, RLDS, or LeRobot format with loading scripts and documentation. Failure modes: format conversion errors, missing observation channels, action-space mismatches between collection and training.

Data flows forward through these layers, but feedback flows backward. QC results inform annotation guidelines. Annotation difficulties inform capture protocols. Capture issues inform sensor configuration. This bidirectional flow is what makes a pipeline an engineering system rather than a sequence of scripts.

Sensor Layer: What to Capture and How

The sensor layer defines the raw observations your pipeline will process. For manipulation tasks, the standard sensor stack includes RGB-D cameras, joint encoders, force-torque sensors, and optionally IMU for mobile platforms. Each sensor type introduces specific engineering requirements.

RGB-D cameras serve two roles: wrist-mounted cameras (Intel RealSense D405, optimized for close-range depth at roughly 7cm–50cm) capture close-up hand-object interaction, while external cameras (Intel RealSense D435, with its wider field of view and 0.3m–3m operating range) provide scene context. The D405's compact form factor (42mm x 42mm x 23mm) makes it suitable for wrist mounting without obstructing the gripper workspace. The D435 is better suited for fixed-position external views where its wider baseline improves depth accuracy at 0.5m–2m ranges. Both produce synchronized RGB + depth streams, but their depth technologies differ — the D405 uses active IR stereo optimized for close range, while the D435 uses active IR stereo with a wider baseline for room-scale depth.

Force-torque sensors such as the ATI Mini45 measure 6-axis forces and torques at the wrist, providing critical contact information for tasks like insertion, assembly, and deformable object manipulation. The Mini45 outputs at up to 7000Hz, though most pipelines downsample to 100–500Hz to match joint encoder rates. Thermal drift is a real concern — the Mini45's specifications note a drift of approximately 0.05% of full scale per degree Celsius. For a sensor rated at 145N on Fx/Fy, that is roughly 0.07N per degree, which accumulates over multi-hour collection sessions. Taring the sensor at the start of each episode (not just each session) is essential.
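The drift arithmetic and per-episode taring above can be sketched in a few lines. This is an illustrative stdlib-only sketch, not vendor code; the function names are hypothetical, and the constants come from the figures quoted in this article (145N full scale, ~0.05% of full scale per degree Celsius):

```python
# Sketch of per-episode force-torque taring with a worst-case drift estimate.
# Constants follow the article's Mini45 figures; names are illustrative.

FULL_SCALE_N = 145.0           # Fx/Fy rating quoted above
DRIFT_FRACTION_PER_C = 0.0005  # ~0.05% of full scale per degree Celsius

def expected_drift_newtons(delta_temp_c: float) -> float:
    """Worst-case zero drift after a temperature change of delta_temp_c."""
    return FULL_SCALE_N * DRIFT_FRACTION_PER_C * abs(delta_temp_c)

def tare(readings: list[float], bias_samples: int = 100) -> list[float]:
    """Subtract the mean of the first bias_samples readings (captured while
    the sensor is unloaded at episode start) from the whole stream."""
    bias = sum(readings[:bias_samples]) / min(bias_samples, len(readings))
    return [r - bias for r in readings]
```

A one-degree rise already costs about 0.07N of zero offset, which is why the tare happens at episode granularity rather than once per session.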

Joint encoders provide proprioceptive state — joint positions and velocities — at rates from 100Hz to 1000Hz depending on the robot controller. These are the action labels for behavior cloning: the joint commands issued during teleoperation data collection become the supervision signal for training.

IMU sensors provide ego-motion estimation for head-mounted or body-mounted capture rigs, critical for egocentric data collection. A 9-axis IMU (accelerometer + gyroscope + magnetometer) at 200Hz or higher enables motion compensation and provides context about operator body dynamics during demonstrations.

Hardware vs. software timestamps — this distinction matters at 100Hz and above. Software timestamps are assigned by the recording application when it receives the data, introducing jitter from OS scheduling, USB bus contention, and CPU load. Hardware timestamps are assigned by the sensor itself, using either an onboard clock or a shared PTP (Precision Time Protocol) clock. The difference can be 10–50ms under load — at 100Hz, that is 1–5 frames of desynchronization between streams. For behavior cloning, where the model learns the mapping from observations at time t to actions at time t, a 30ms offset between the camera frame and the joint state means the model is learning from misaligned observation-action pairs.

Calibration encompasses three types. Intrinsic calibration maps pixel coordinates to metric 3D coordinates for each camera (focal length, principal point, distortion coefficients). Extrinsic calibration maps the spatial relationship between sensors — camera-to-robot-base, camera-to-camera, force-torque-sensor-to-end-effector. Hand-eye calibration specifically establishes the transform between a wrist-mounted camera and the robot's end-effector frame. All three must be verified at the start of each collection session and logged as metadata attached to every episode.

Capture Layer: Recording and Synchronization

The capture layer takes raw sensor streams and produces structured, synchronized episode recordings. The primary engineering challenges are stream synchronization, episode boundary definition, and storage format selection.

MCAP as the recording format. MCAP has emerged as the preferred format for multi-sensor robotic data capture, replacing legacy ROS bag files. The advantages are concrete: MCAP supports schema-aware serialization (each channel has a defined message schema), efficient indexed access (you can seek to a specific timestamp without scanning the entire file), and language-agnostic readers (Python, C++, Rust, TypeScript). Unlike ROS bags, MCAP does not require a ROS installation to read. A single MCAP file can contain RGB frames, depth maps, joint states, force-torque readings, and metadata — all with per-message timestamps and channel-level indexing. For a typical manipulation episode with two RGB-D cameras at 30fps, joint states at 100Hz, and force-torque at 500Hz, an MCAP file runs approximately 50–150MB for a 30-second episode.

Stream synchronization across sensors operating at different frame rates requires a synchronization strategy. The simplest approach is nearest-timestamp matching: for each target timestamp (typically driven by the camera frame rate), find the nearest message in every other stream. This works when stream rates differ by less than 3x and the highest-rate stream has a period shorter than the acceptable synchronization error. For wider rate disparities — e.g., 30Hz cameras and 500Hz force-torque — interpolation of the high-rate stream to camera timestamps is preferred. The capture layer should log the actual synchronization error per frame pair so downstream consumers can filter episodes with excessive timing errors.
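Both strategies described above can be sketched with the standard library. This is a minimal scalar version for illustration (real streams carry vector messages, and the interpolation assumes the high-rate stream brackets the target timestamp):

```python
import bisect

def nearest_match(target_ts: float, stream_ts: list[float]) -> tuple[int, float]:
    """Return (index, abs error) of the stream message closest to target_ts.
    stream_ts must be sorted ascending."""
    i = bisect.bisect_left(stream_ts, target_ts)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(stream_ts)]
    best = min(candidates, key=lambda j: abs(stream_ts[j] - target_ts))
    return best, abs(stream_ts[best] - target_ts)

def interpolate(target_ts: float, stream_ts: list[float],
                values: list[float]) -> float:
    """Linearly interpolate a high-rate scalar stream (e.g. 500Hz F/T)
    at a camera timestamp."""
    i = bisect.bisect_left(stream_ts, target_ts)
    i = max(1, min(i, len(stream_ts) - 1))
    t0, t1 = stream_ts[i - 1], stream_ts[i]
    w = (target_ts - t0) / (t1 - t0)
    return values[i - 1] * (1 - w) + values[i] * w
```

Logging the error returned by `nearest_match` per frame pair is what lets downstream QC filter episodes with excessive timing error.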

Episode boundaries define when recording starts and stops. There are three common approaches: (1) manual triggering by the operator (press a button to start and stop), (2) automatic triggering based on robot state (recording starts when the arm leaves the home position and stops when it returns), and (3) continuous recording with post-hoc segmentation. Manual triggering is simplest but introduces variability — operators forget to start recording, or include excessive idle time before the task begins. Automatic triggering is more consistent but requires careful definition of boundary conditions. Continuous recording avoids boundary issues but creates massive files that require segmentation before annotation.
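Automatic triggering based on robot state is typically implemented as a small hysteresis state machine, so noise near the threshold does not chatter the recorder on and off. A sketch, with thresholds chosen arbitrarily for illustration:

```python
def segment_episodes(distances_from_home: list[float],
                     leave_thresh: float = 0.05,
                     return_thresh: float = 0.02) -> list[tuple[int, int]]:
    """Hysteresis state machine: an episode starts when the end-effector
    moves more than leave_thresh (meters) from home, and ends when it
    returns within return_thresh. Returns (start, end) frame indices."""
    episodes: list[tuple[int, int]] = []
    start, recording = 0, False
    for i, d in enumerate(distances_from_home):
        if not recording and d > leave_thresh:
            recording, start = True, i
        elif recording and d < return_thresh:
            episodes.append((start, i))
            recording = False
    return episodes
```

The two-threshold design (leave at 5cm, return at 2cm) is the "careful definition of boundary conditions" the paragraph above refers to: a single threshold would split one episode into many whenever the arm hovers near it.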

Metadata tagging at capture time is the most overlooked component of the capture layer. Every episode must be tagged with: operator ID, session ID, environment configuration ID, task variant, robot serial number, calibration file reference, protocol version, and any notes about anomalies. This metadata is not optional — it is the foundation for every downstream quality check and every dataset query. When a training run produces poor results on a specific task variant, metadata enables you to trace the problem back to the specific episodes, operators, and conditions that contributed. Without it, you are debugging blind. A well-designed robotics data collection platform enforces metadata collection at capture time — not as an afterthought.
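Enforcing metadata at capture time can be as simple as refusing to save an episode whose metadata fails validation. A minimal sketch, with field names mirroring the list above (the exact schema is project-specific):

```python
REQUIRED_FIELDS = {
    "operator_id", "session_id", "environment_config_id", "task_variant",
    "robot_serial", "calibration_ref", "protocol_version",
}

def validate_metadata(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the episode
    may be saved. Field names are illustrative."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - meta.keys())]
    problems += [f"empty field: {k}"
                 for k in sorted(REQUIRED_FIELDS & meta.keys())
                 if not str(meta[k]).strip()]
    return problems
```

The capture application calls this before closing each MCAP file and blocks the next episode until the problems are resolved, which is how "not as an afterthought" becomes enforceable.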

Storage and Versioning

Once episodes are captured, the storage layer must handle three requirements: durability (data cannot be lost), organization (data must be findable), and versioning (datasets must be reproducible).

Object storage vs. file systems. For datasets under 1TB, a well-organized file system with structured directories (organized by project, task, environment, and session) can work. Beyond 1TB, object storage (S3, GCS, or MinIO for on-premise) becomes necessary. Object storage provides durability through replication, scales without filesystem limits, and supports metadata queries. The tradeoff is latency — random access to individual frames within an MCAP file on object storage is slower than on local NVMe. A common pattern is to use object storage as the system of record and local NVMe as a working cache during annotation and training.

Dataset versioning is essential because robotics datasets are not static. New episodes are collected, episodes are rejected during QC, annotations are refined, and format requirements change. Every training run must reference a specific, immutable dataset version. DVC (Data Version Control) provides git-like versioning for large files, storing file hashes in git while keeping the actual data in remote storage. For teams that outgrow DVC, custom versioning systems built on object storage tags and a metadata database (PostgreSQL or similar) provide more flexibility. The key requirement is that given a dataset version identifier, you can reconstruct the exact set of episodes, annotations, and metadata that comprised that version — even if the underlying data has since changed.
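The "given a version identifier, reconstruct the exact episode set" requirement is usually met by content-addressing a manifest. A sketch of the idea, independent of DVC or any particular tool:

```python
import hashlib
import json

def manifest_digest(episode_hashes: dict[str, str]) -> str:
    """Content-address a dataset version: hash the sorted mapping of
    episode ID -> file hash. Any change to the episode set or to any
    file's contents produces a new version identifier."""
    canonical = json.dumps(sorted(episode_hashes.items())).encode()
    return hashlib.sha256(canonical).hexdigest()
```

Because the digest depends only on the sorted contents, two manifests with the same episodes and hashes always resolve to the same version, and a single re-collected episode changes the identifier, so a training run can never silently reference a mutated dataset.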

Lineage tracking connects every episode to the conditions under which it was collected. Which calibration file was active? Which protocol version? Who was the operator? What was the ambient temperature (relevant for force-torque sensor drift)? Lineage tracking is not metadata — metadata describes the episode, lineage describes the provenance. When you discover that episodes collected during a specific week have systematically higher force-torque noise, lineage tracking lets you determine whether the cause was a sensor issue, an operator issue, or an environmental issue. Without lineage, you can only discard the affected episodes. With lineage, you can diagnose and prevent recurrence.

Annotation Layer

Annotation for robotics data is fundamentally different from annotation for computer vision or NLP. Image annotation operates on single frames. Text annotation operates on token sequences. Robotics annotation operates on multi-modal temporal sequences — an annotator must simultaneously consider RGB video, depth maps, joint position traces, and force-torque signals to segment an episode into meaningful phases and label each phase with action semantics.

Temporal segmentation is the core annotation task. A typical pick-and-place episode is segmented into phases: approach (end-effector moves toward the object), pre-grasp alignment (fine positioning of the gripper relative to the object), grasp (gripper closes, contact force rises), lift (object leaves the surface, detected by force-torque Z-axis change), transport (object is moved to the target location), place (object is lowered to the target surface), and release (gripper opens). Each phase boundary is a specific frame index in the synchronized data streams. Annotators must identify these boundaries across all modalities — the force-torque signal often reveals the exact contact frame more precisely than the RGB stream.

Action labels go beyond phase segmentation to describe what the robot is doing at each timestep. For behavior cloning and diffusion policy training, the primary action labels are the robot's commanded joint positions or end-effector velocities — these come directly from the capture data and do not require manual annotation. But higher-level semantic labels (grasp type, manipulation strategy, error recovery action) require human judgment.

Grasp type classification is critical for dexterous manipulation datasets. The taxonomy typically includes: power grasp (full hand wrap around the object), precision pinch (thumb-index opposition on small features), lateral pinch (thumb against the side of the index finger, used for thin objects like cards), three-jaw chuck (thumb, index, and middle finger forming a triangular support), and tool-use grasps (holding a tool with specific grip requirements). Each grasp type has different force profiles, different visual signatures, and different failure modes. Models trained on data with grasp type labels can condition their policy on the desired grasp, enabling task-appropriate manipulation strategies.

Success/failure labeling includes not just a binary flag but a failure taxonomy: missed grasp (gripper closed without contacting the object), slip (object slid from the grasp during transport), collision (unintended contact with the environment), incorrect placement (object placed outside the target tolerance), and timeout (task not completed within the time limit). Failure episodes are not discarded — they are valuable training signal for robot training data pipelines that use contrastive learning or hindsight relabeling.

Tooling gaps. The robotics community lacks mature annotation tools for multi-modal temporal data. Computer vision annotation platforms (CVAT, Label Studio, Supervisely) support video annotation but do not natively handle synchronized force-torque traces, joint position overlays, or depth stream visualization alongside RGB. Most teams build custom annotation interfaces — a significant infrastructure investment. The interface must allow an annotator to scrub through an episode while viewing all sensor streams simultaneously, mark phase boundaries with frame-level precision, and apply labels from a predefined taxonomy. This is an area where the tooling gap significantly impacts annotation throughput and quality.

Quality Assurance

Quality assurance is the layer that determines whether your dataset is a reliable training signal or a source of noise that will degrade policy performance. QA operates at three levels: automated checks, manual review, and statistical analysis.

Automated checks catch systematic technical issues. Calibration drift detection compares ArUco marker or checkerboard reprojection errors across episodes within a session. A reprojection error exceeding 2mm indicates that the camera has shifted — either physically (a bump displaced the mount) or thermally (the housing expanded). Episodes collected after a drift event should be flagged for review, and typically rejected if the error exceeds 3mm. Frame synchronization verification measures the actual timestamp offset between paired observations across streams. If the RGB frame and the joint state used to form a single training observation have a timestamp delta exceeding one camera frame period (33ms at 30fps), the pair is unreliable. Sensor dropout detection identifies gaps in any stream — a force-torque dropout during a grasp phase means the contact information for that episode is incomplete. Label consistency checks verify that annotation labels conform to the defined taxonomy and that phase boundaries are temporally ordered (you cannot have a "place" phase before a "lift" phase).
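Two of the checks above — label consistency and frame synchronization — reduce to short, mechanical validators. A sketch using the phase taxonomy from the annotation section (labels and thresholds are illustrative):

```python
PHASE_ORDER = ["approach", "pre_grasp", "grasp", "lift",
               "transport", "place", "release"]

def check_phase_order(phases: list[tuple[str, int]]) -> list[str]:
    """Flag unknown labels, out-of-order phases (e.g. 'place' before
    'lift'), and non-increasing boundary frames."""
    errors: list[str] = []
    last_rank, last_frame = -1, -1
    for label, frame in phases:
        if label not in PHASE_ORDER:
            errors.append(f"unknown phase label: {label}")
            continue
        rank = PHASE_ORDER.index(label)
        if rank < last_rank:
            errors.append(f"phase '{label}' out of order")
        if frame <= last_frame:
            errors.append(f"non-increasing boundary at frame {frame}")
        last_rank, last_frame = rank, frame
    return errors

def check_sync(deltas_ms: list[float], max_delta_ms: float = 33.0) -> bool:
    """Pass only if every paired observation is within one camera
    frame period (33ms at 30fps)."""
    return all(abs(d) <= max_delta_ms for d in deltas_ms)
```

Checks like these run on every episode at ingest, so a taxonomy violation is caught minutes after annotation rather than weeks later in a training run.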

Manual review catches semantic issues that no automated system can detect. An experienced reviewer watches a sample of episodes (typically 10–20% of each batch) and checks: Does the operator follow the protocol? Are approach angles varied or stereotyped? Does the grasp strategy match what the task requires? Are there near-failures that should be flagged as edge cases? Manual review also validates annotation quality — are phase boundaries placed at the correct frames, or are they systematically early or late?

Statistical analysis operates across the dataset rather than on individual episodes. Distribution analysis per operator reveals whether some operators produce biased demonstrations (e.g., always approaching from the left). Distribution analysis per environment reveals whether lighting or layout changes have shifted the visual distribution. Success rate tracking per session detects operator fatigue — if success rates drop below threshold in the second half of a session, those episodes need additional scrutiny. Action distribution analysis ensures that the dataset covers the full range of task-relevant behaviors, not just the easiest execution strategy. These statistical checks feed back into protocol design and operator training, closing the loop between data quality and collection practice. Teams collecting manufacturing robotics data at production scale find that statistical QA catches issues that episode-level checks miss entirely.
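Per-operator success rate tracking is the simplest of these dataset-level checks to implement. A sketch with illustrative field names and an arbitrary 80% review threshold:

```python
from statistics import mean

def operator_success_rates(episodes: list[dict]) -> dict[str, float]:
    """Group episodes by operator and compute a success rate per operator.
    Episode dicts carry 'operator_id' and boolean 'success' (field names
    are illustrative)."""
    by_op: dict[str, list[bool]] = {}
    for ep in episodes:
        by_op.setdefault(ep["operator_id"], []).append(ep["success"])
    return {op: mean(1.0 if s else 0.0 for s in flags)
            for op, flags in by_op.items()}

def flag_operators(rates: dict[str, float], floor: float = 0.8) -> list[str]:
    """Operators whose success rate falls below the review threshold."""
    return sorted(op for op, r in rates.items() if r < floor)
```

The same grouping pattern extends to per-environment and per-session slices; the point is that the statistic is computed over metadata fields enforced at capture time, which is why the metadata discipline earlier in the pipeline pays off here.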

Delivery: Getting Data Into Training

The delivery layer converts QA-passed episodes from their capture format into the formats that training pipelines consume. This is where MCAP files become HDF5 datasets, RLDS episodes, or LeRobot-compatible directories. The conversion is not trivial — each format has different assumptions about data layout, observation and action spaces, and episode structure.

MCAP to HDF5 conversion flattens the multi-stream MCAP structure into HDF5 groups. A typical layout stores each episode as a top-level group containing datasets for observations (images, depth maps, proprioceptive state), actions (joint commands or end-effector velocities), and metadata (task ID, success flag, annotation labels). HDF5 supports chunked storage and compression (LZ4 or GZIP), enabling efficient random access to individual timesteps without loading entire episodes. This format is preferred by teams with custom PyTorch or JAX dataloaders.

RLDS (Reinforcement Learning Datasets) organizes data as TensorFlow datasets with a nested episode structure. Each episode is a sequence of steps, where each step contains an observation dict, an action array, a reward scalar, and a discount factor. RLDS is the native format for Google's RT-X and related projects. Converting manipulation demonstrations to RLDS requires defining the observation and action feature specifications — which camera streams are included, what resolution, whether depth is included, and how actions are represented (joint space vs. task space).

LeRobot format is gaining adoption in the open-source robotics learning community, particularly for teams using diffusion policy and ACT (Action Chunking with Transformers) implementations. LeRobot expects parquet files for structured data (joint states, actions, episode indices) and video files for image observations. The format is designed around the LeRobot library's dataloader, which handles frame sampling, action chunking, and observation stacking internally. Converting to LeRobot format requires mapping your sensor streams to LeRobot's expected observation keys and ensuring that action representations match the training configuration.

Preprocessing pipelines applied during delivery include image resizing (training typically uses 224x224 or 256x256, not the native camera resolution), depth normalization (converting raw depth values to a standardized range), proprioceptive state normalization (zero-mean, unit-variance based on dataset statistics), and action space normalization. These preprocessing steps must be documented and version-controlled alongside the dataset — a model trained on data normalized with one set of statistics cannot be evaluated on data normalized with different statistics.
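The "same statistics at train and eval time" constraint is worth making concrete. A minimal scalar sketch: statistics are fit once, persisted with the dataset version, and loaded (never recomputed) wherever the transform is applied:

```python
import math

def fit_stats(values: list[float]) -> dict:
    """Compute dataset statistics once; persist the result (e.g. as a
    stats.json shipped with the dataset version)."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return {"mean": m, "std": math.sqrt(var)}

def normalize(v: float, stats: dict) -> float:
    """Zero-mean, unit-variance using the *stored* statistics, so
    training and evaluation use identical transforms."""
    return (v - stats["mean"]) / stats["std"]
```

Real pipelines do this per proprioceptive dimension and per action dimension, but the failure mode is the same at any scale: recomputing `fit_stats` on an evaluation split silently shifts every input the policy sees.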

Backward compatibility is a concern when datasets are versioned and appended over time. Version 2 of a dataset may include additional sensor streams not present in version 1. Version 3 may use a revised annotation taxonomy. Training pipelines that consume these datasets must handle schema evolution gracefully. The delivery layer should include a compatibility matrix documenting which dataset versions are compatible with which training configurations, and loading scripts should validate compatibility at load time rather than failing silently mid-training.
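Validating compatibility at load time rather than failing mid-training can be a one-function gate at the top of the loading script. A sketch; the schema-version set and channel names are illustrative, not a real format's fields:

```python
SUPPORTED_SCHEMAS = {"v1", "v2"}  # schema versions this trainer understands

def validate_compatibility(dataset_meta: dict) -> None:
    """Fail fast before any data is loaded. dataset_meta is the metadata
    shipped with the dataset version (field names are illustrative)."""
    schema = dataset_meta.get("schema_version")
    if schema not in SUPPORTED_SCHEMAS:
        raise ValueError(
            f"dataset schema {schema!r} not in supported set "
            f"{sorted(SUPPORTED_SCHEMAS)}")
    missing = {"observations", "actions"} - set(dataset_meta.get("channels", []))
    if missing:
        raise ValueError(f"dataset missing required channels: {sorted(missing)}")
```

An exception raised here, with the compatibility matrix named in the error message, costs seconds; the alternative is a KeyError hours into a training run.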

Documentation shipped with each dataset version includes: sensor specifications (make, model, resolution, frame rate), calibration parameters, annotation schema with label definitions, episode counts by task variant and success/failure, data splits (train/validation/test) with the splitting methodology, known limitations or biases, and example loading code. This documentation is not optional — it is the interface contract between the data pipeline and the training pipeline.

A data pipeline is not a script — it is infrastructure. Every layer from sensor calibration through format delivery requires engineering investment that compounds over time. Teams that treat data as a one-off collection project spend more time debugging pipeline issues than training models.

Humaid provides the full pipeline from sensor capture to model-ready delivery — calibrated hardware, trained operators, systematic QA, and datasets in HDF5, RLDS, or LeRobot format. Your team focuses on training, not plumbing.