Blog · 2026-03-24 · 11 min read

Why Simulation Is Not Enough for Robotics Training Data

By Humaid Team

Simulation is the default starting point for every robotics team. It is fast — you can generate a million grasp attempts overnight. It is parallelizable — spin up a thousand environments on a GPU cluster. And it is free of hardware failures — no broken grippers, no worn cables, no sensors that lose calibration mid-session. For good reason, simulation has produced some of the most impressive results in robotic manipulation research.

But every team that ships a real product eventually discovers the same thing: simulation is training wheels, not a destination. The policy that achieves 95% success rate in Isaac Sim drops to 40% on the real robot. The grasping network that handles every object in the simulated bin fails on the first oil-coated metal part it encounters. The navigation stack that works flawlessly in a modeled warehouse collides with a pallet that the LiDAR sees differently than the simulated sensor model predicted.

This article breaks down exactly where simulation fails, why the costs of closing the sim-to-real gap are higher than most teams estimate, and how to architect a training pipeline that uses simulation for what it does well while relying on robot training data from the real world for what simulation cannot provide.

Where Simulation Excels

Before cataloguing simulation's limitations, it is important to be precise about where it genuinely works — because these strengths are real and should be exploited.

Rigid-body dynamics. For tasks involving rigid objects with well-characterized material properties — picking a machined aluminum block from a flat surface, pushing a wooden block across a table — modern physics engines like MuJoCo and NVIDIA Isaac Sim produce dynamics that transfer reasonably well to real hardware. The contact models for rigid-rigid interaction are mature, and the sim-to-real gap for these scenarios is manageable with moderate domain randomization.

Reward shaping and rapid iteration. Simulation allows reinforcement learning teams to iterate on reward functions in hours rather than weeks. You can test whether a shaped reward produces the desired emergent behavior — does the robot learn to rotate an object before grasping it, or does it just slam into it? — without risking hardware damage or waiting for a human operator. This rapid iteration on the learning signal is genuinely valuable and has no real-world equivalent.

Pre-training for visual representations. Large-scale simulation with domain randomization can pre-train visual encoders that extract useful features — object boundaries, depth discontinuities, surface normals — that transfer to real images after fine-tuning. The visual encoder does not need to be perfectly calibrated to reality; it needs to learn features that are structurally similar to those present in real images. Domain randomization over textures, lighting, and camera parameters accomplishes this reasonably well.
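To make the idea concrete, here is a minimal sketch of appearance randomization applied to a rendered frame. All parameter ranges (channel gains, gamma, noise level) are illustrative placeholders, not values from any particular pipeline:

```python
import numpy as np

def randomize_image(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply simple appearance randomization to an HxWx3 float image in [0, 1]."""
    # Random per-channel gain loosely stands in for lighting color-temperature shifts.
    gain = rng.uniform(0.7, 1.3, size=3)
    # Random gamma loosely stands in for exposure differences between renders.
    gamma = rng.uniform(0.8, 1.25)
    # Additive Gaussian noise loosely stands in for camera sensor noise.
    noise = rng.normal(0.0, 0.02, size=img.shape)
    return np.clip((img * gain) ** gamma + noise, 0.0, 1.0)

rng = np.random.default_rng(0)
frame = rng.uniform(0.0, 1.0, size=(64, 64, 3))  # stand-in for a rendered frame
augmented = randomize_image(frame, rng)
```

In a real pipeline these draws happen per episode (or per frame), and the ranges should be wide enough to cover the deployment environment's actual variation.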

Safety validation and stress testing. Simulation is the only practical way to test a policy against thousands of adversarial scenarios before deploying it on real hardware. What happens when an object is placed at the extreme edge of the workspace? When two objects are stacked in an unstable configuration? When the gripper approaches at an unusual angle? Running these tests in simulation is fast and risk-free, and the results identify failure modes to address before any real-world deployment.

The Physics Gap: What Simulation Gets Wrong

The physics gap is the most fundamental limitation of simulation, and it is far wider than most teams appreciate until they attempt transfer.

Deformable objects. Food items, fabric, cables, plastic packaging, rubber gaskets, foam inserts — the real world is full of deformable materials that simulation handles poorly. A simulated rubber gasket compresses perfectly uniformly under a parallel-jaw gripper; a real gasket deforms asymmetrically based on manufacturing tolerances, material aging, temperature, and the exact grip location relative to its geometry. A simulated cable follows a smooth spline model; a real cable has memory from its previous coiling, variable stiffness along its length from manufacturing variation, and friction against the surface that depends on the cable sheath material and surface contamination. Finite element methods (FEM) can model deformation more accurately, but the computation cost makes them impractical for the millions of rollouts that RL training requires, and they still require material parameters that must be measured from the actual objects.

Friction and surface interaction. Friction in simulation is typically modeled with a single coefficient per material pair — rubber on steel gets a number, plastic on wood gets a different number. Real friction depends on surface finish (machined vs. cast vs. stamped), contamination (oil, dust, moisture), contact area (which changes dynamically as a compliant gripper pad deforms), and sliding velocity. A bin picking task involving metal parts with machining oil residue exposes this gap immediately: the simulated friction coefficient does not capture the way oil creates a thin hydrodynamic film that reduces friction at low speeds but behaves differently at high speeds. Domain randomization over friction coefficients helps, but randomizing a scalar does not approximate a velocity-dependent, contamination-dependent, geometry-dependent physical phenomenon.
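The gap between a scalar coefficient and velocity-dependent friction can be illustrated with a textbook Stribeck-style model. The parameter values below are illustrative, not measured from any real surface pair:

```python
import math

def coulomb_mu(_v: float, mu_k: float = 0.3) -> float:
    """Typical sim model: one kinetic coefficient, independent of sliding speed."""
    return mu_k

def stribeck_mu(v: float, mu_s: float = 0.5, mu_k: float = 0.15,
                v_s: float = 0.01, k_v: float = 0.2) -> float:
    """Velocity-dependent friction: static peak, Stribeck dip, viscous rise.

    mu_s: static coefficient, mu_k: kinetic coefficient,
    v_s: Stribeck velocity scale (m/s), k_v: viscous term.
    All values here are placeholders for illustration.
    """
    return mu_k + (mu_s - mu_k) * math.exp(-(v / v_s) ** 2) + k_v * v

for v in (0.001, 0.01, 0.1, 1.0):
    print(f"v={v:6.3f} m/s  coulomb={coulomb_mu(v):.3f}  stribeck={stribeck_mu(v):.3f}")
```

The Coulomb model returns the same number at every speed, while the Stribeck curve is high near standstill, dips at intermediate speeds, and rises again with velocity. Randomizing the Coulomb scalar can never reproduce that shape, which is the point the paragraph above makes.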

Transparent and reflective materials. Depth sensors — the backbone of most manipulation perception stacks — fail on transparent objects (glass bottles, clear plastic containers, shrink wrap) and highly reflective surfaces (polished metal, chrome fixtures, wet surfaces). Simulation depth rendering does not replicate these failures because the simulated depth sensor returns the geometrically correct depth regardless of material properties. A policy trained in simulation to pick up a glass bottle sees a perfectly rendered depth map; the real depth sensor returns holes, noise, and multipath artifacts where the glass is. Teams training on physical AI data collection from real sensors learn to handle these failure modes because they are present in every training example.
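One partial mitigation is to corrupt simulated depth maps so the policy at least sees dropout during training. The sketch below injects random holes and range noise; the hole fraction and noise level are illustrative, and this simple model does not capture the multipath or edge artifacts real structured-light sensors produce:

```python
import numpy as np

def corrupt_depth(depth: np.ndarray, rng: np.random.Generator,
                  hole_fraction: float = 0.05, noise_std: float = 0.005) -> np.ndarray:
    """Inject dropout holes and range noise into a clean simulated depth map (meters)."""
    out = depth + rng.normal(0.0, noise_std, size=depth.shape)
    # Randomly zero out pixels, mimicking missing returns on glass or polished metal.
    holes = rng.uniform(size=depth.shape) < hole_fraction
    out[holes] = 0.0
    return out

rng = np.random.default_rng(1)
clean = np.full((120, 160), 0.8)  # flat simulated depth map at 0.8 m
noisy = corrupt_depth(clean, rng)
```

Even so, uniformly random holes differ from real dropout, which is spatially correlated with material boundaries; this augmentation narrows the gap but does not close it.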

Soft contacts and compliance. When a robot presses a rubber seal into a groove, inserts a flexible clip into a housing, or grasps a ripe tomato without crushing it, the contact dynamics involve material compliance that simulation approximates poorly. The force-displacement relationship is nonlinear, hysteretic, and rate-dependent. A simulated policy learns the wrong force profile for insertion tasks, leading to either insufficient force (the connector does not seat) or excessive force (the connector or housing is damaged) when deployed on real hardware.

The Perception Gap: Visual Domain Shift

Even with photorealistic rendering — and modern renderers like NVIDIA Omniverse produce genuinely impressive images — simulation misses the visual details that dominate real production environments.

Surface contamination and wear. Real objects in manufacturing and warehouse environments are not pristine. Metal parts have scratches from handling, oil stains from machining, and corrosion spots from storage. Cardboard boxes have dents, tape residue, and label fragments from previous shipping cycles. Plastic containers have scuffs, UV yellowing, and adhesive marks. These are not cosmetic details — they change the visual texture that a perception model uses for object recognition, pose estimation, and grasp point selection. Generating realistic contamination and wear patterns in simulation would require a procedural damage model for every material type, which no current simulation pipeline implements at scale.

Lighting. A simulation scene has lighting defined by a fixed set of light sources with known positions, intensities, and color temperatures. A real manufacturing floor has overhead fluorescents that flicker at 100-120 Hz (creating subtle banding in camera images), skylights that change intensity with cloud cover, task lights that cast hard shadows, and ambient reflections from metallic surfaces that shift as the robot arm moves through the workspace. The camera's auto-exposure and white balance algorithms — which differ between camera models and firmware versions — respond to these lighting changes in ways that simulation does not model.

Motion blur from vibrating machinery. Robots deployed near stamping presses, CNC mills, conveyor drives, or compressors experience transmitted vibration through the mounting surface. This produces consistent low-amplitude motion blur in camera images — typically 1-3 pixels at 30 fps — that is absent in simulation. A perception model trained on perfectly sharp simulated images may fail to detect small objects or estimate precise poses when every real image has this baseline blur. Domain randomization with synthetic motion blur helps but does not replicate the frequency-specific vibration patterns of actual industrial equipment.
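A first-order version of that synthetic-blur augmentation is a small horizontal box blur, sketched below. This is a deliberately crude stand-in: real vibration blur is frequency-specific and direction-varying, which is exactly the residual gap the paragraph describes.

```python
import numpy as np

def horizontal_blur(img: np.ndarray, kernel_px: int = 3) -> np.ndarray:
    """Approximate low-amplitude vibration blur with a horizontal box filter.

    kernel_px roughly corresponds to the 1-3 pixel blur mentioned in the text.
    """
    kernel = np.ones(kernel_px) / kernel_px
    # Convolve each image row (per channel); mode="same" preserves the width.
    return np.apply_along_axis(lambda row: np.convolve(row, kernel, mode="same"),
                               axis=1, arr=img)

rng = np.random.default_rng(0)
img = rng.uniform(size=(48, 64, 3))  # stand-in for a camera frame
blurred = horizontal_blur(img)
```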

Sensor-specific artifacts. Every camera model produces characteristic artifacts: infrared dot projection patterns from structured-light depth sensors (visible in some RGB images), depth quantization steps at specific ranges, color channel crosstalk, and lens distortion that varies across the field of view. These artifacts are sensor fingerprints that simulation does not reproduce, and they create a systematic domain gap between simulated and real observations that affects every frame in every episode.

The Edge Case Gap: What Humans Find That Simulation Misses

The long-tail distribution of real-world scenarios is perhaps the most underappreciated argument for real-world data collection. Simulation environments contain exactly what the simulation engineer scripts into them — nothing more. Real environments contain everything that actually happens.

A human operator performing egocentric data collection in a warehouse encounters a partially crushed box that changed shape since it was stocked. A torn shrink-wrap package with contents partially spilling out. A misplaced pallet that blocks the normal approach path. Two items stuck together with leaked adhesive. A barcode label that has been printed off-center, folded over an edge, and partially obscured by packing tape. None of these scenarios were scripted. All of them occur regularly — in a busy warehouse, some edge case occurs in every tenth to twentieth pick operation.

In manufacturing, the edge cases are different but equally unscriptable. A conveyor jam that leaves parts in non-standard orientations. A fixture that has drifted slightly from its calibrated position due to thermal expansion over an eight-hour shift. A batch of parts that is slightly out of spec — within the official tolerance band but different enough from the nominal geometry that the standard grasp approach fails. Tool wear that changes the part surface finish across a production run.

These edge cases matter disproportionately for deployed systems. A policy that handles the nominal case with 99% success and fails on every edge case will have an effective success rate of 85-90% in production — which is the difference between a viable product and an unreliable curiosity. The only practical way to include edge cases in training data is to collect in real environments where they occur naturally, or to have operators deliberately create them according to a protocol that specifies the known edge case categories. Either way, real-world collection is required.

The Cost Illusion: Why Sim-to-Real Transfer Is More Expensive Than It Looks

Simulation appears cheap because the marginal cost of generating an additional episode is near zero. But the total cost of achieving reliable sim-to-real transfer is dominated by engineering time, not compute — and most teams significantly underestimate this engineering cost.

Domain randomization engineering. Effective domain randomization requires identifying every visual and physical parameter that varies between simulation and reality, then randomizing each one over a distribution that covers the real-world range. For a bin picking task, this means randomizing: object textures, lighting direction and intensity and color temperature, camera noise models, friction coefficients per material pair, object mass and center of mass, bin wall friction, gripper pad compliance, and depth sensor noise. Each parameter must be randomized over a realistic range — too narrow and the policy overfits to simulation; too wide and the policy receives a training signal that is too noisy to learn from. Determining the correct randomization ranges requires — paradoxically — real-world measurements, because you cannot know the distribution of real-world friction coefficients without measuring real objects.
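In practice this randomization usually takes the shape of a per-episode parameter draw. The sketch below shows the pattern for a few of the parameters listed above; the ranges are placeholders, and the paragraph's point stands: the correct ranges can only come from real-world measurement.

```python
import random
from dataclasses import dataclass

@dataclass
class EpisodeParams:
    """One randomized draw of physical/visual parameters for a sim episode."""
    friction: float          # coefficient for the active material pair
    object_mass_kg: float    # mass of the manipulated object
    light_intensity: float   # relative scene brightness
    depth_noise_std: float   # depth sensor noise model (meters)

def sample_params(rng: random.Random) -> EpisodeParams:
    # Ranges below are illustrative; real ranges must be measured, not guessed.
    return EpisodeParams(
        friction=rng.uniform(0.2, 0.8),
        object_mass_kg=rng.uniform(0.05, 1.5),
        light_intensity=rng.uniform(0.3, 2.0),
        depth_noise_std=rng.uniform(0.0, 0.01),
    )

rng = random.Random(42)
params = [sample_params(rng) for _ in range(1000)]
```

The engineering cost lives in choosing the ranges and deciding which parameters matter for the task, not in the sampling code itself.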

Reward function tuning. Reinforcement learning in simulation requires a reward function that produces the desired emergent behavior. For anything beyond the simplest pick-and-place, this is an iterative, time-consuming process. The reward for a cable insertion task must balance: reaching the insertion point, aligning the cable end, applying appropriate insertion force, detecting successful seating, and avoiding cable damage. Each component needs a weight that is tuned through repeated training runs and qualitative evaluation. Teams routinely spend two to four months on reward engineering for a single task.
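The structure of such a shaped reward can be sketched as a weighted sum of the components listed above. Every weight and threshold below is a placeholder; tuning these numbers through repeated training runs is precisely the months-long process the paragraph describes.

```python
def insertion_reward(dist_to_goal: float, alignment_err: float,
                     insertion_force: float, seated: bool,
                     w_reach: float = 1.0, w_align: float = 0.5,
                     w_force: float = 0.1, bonus_seated: float = 10.0,
                     force_limit: float = 15.0) -> float:
    """Illustrative shaped reward for a cable insertion task.

    Weights (w_*) and the hypothetical 15 N damage threshold are the
    quantities that must be tuned; these defaults are not tuned values.
    """
    r = -w_reach * dist_to_goal                          # reach the insertion point
    r -= w_align * alignment_err                         # align the cable end
    r -= w_force * max(0.0, insertion_force - force_limit)  # avoid cable damage
    if seated:
        r += bonus_seated                                # sparse success bonus
    return r
```

Each term pulls the policy in a different direction, so a bad weight on any one of them can produce a policy that, for example, seats the connector reliably but bends the cable doing so.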

Transfer failure debugging. When a sim-trained policy fails on real hardware, diagnosing the cause is extremely difficult. Is the failure due to a visual domain gap (the perception model misidentifies the object), a physical domain gap (the grasp force is wrong), an action space mismatch (the sim controller dynamics differ from the real controller), or an out-of-distribution scenario (the real environment contains something the sim never generated)? Isolating the cause requires systematic ablation experiments, which take weeks. Many teams report spending six to twelve months on sim-to-real transfer engineering before concluding that they need real-world data to close the remaining gap.

When you account for the full cost — domain randomization engineering, reward tuning, transfer debugging, and the opportunity cost of the engineering team's time — sim-to-real transfer for complex tasks often exceeds the cost of a well-organized human-in-the-loop data collection campaign that produces directly usable training data.

The Right Architecture: Simulation + Real-World Data

The most effective approach is not simulation versus real data — it is a deliberate combination that uses each for its strengths.

Pre-train in simulation, fine-tune on real data. Train visual encoders and coarse motor policies in simulation using domain randomization. This produces a model that understands the general structure of the task — approach the object, close the gripper, lift — even if the specific parameters are wrong. Then fine-tune on real-world demonstrations collected through teleoperation data collection or egocentric capture. The fine-tuning corrects the domain-specific errors: the correct friction response for the actual objects, the correct visual features for the actual environment, the correct force profiles for the actual contact dynamics. In practice, a model pre-trained on 100,000 simulated episodes and fine-tuned on 500-1,000 real episodes typically outperforms a model trained on either 100,000 simulated episodes alone or 500-1,000 real episodes alone.
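The pre-train/fine-tune split can be sketched in miniature: a frozen encoder (standing in for the sim-pretrained visual backbone) plus a small head refit on real demonstrations. Here a closed-form ridge regression stands in for a few epochs of behavior-cloning gradient steps, and all the data is synthetic stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a sim-pretrained visual encoder: a frozen random projection.
W_enc = rng.normal(size=(128, 32))
def encode(x: np.ndarray) -> np.ndarray:
    return np.tanh(x @ W_enc)  # frozen during fine-tuning

# Hypothetical real demonstrations: 500 observations with 4-DoF action targets.
X_real = rng.normal(size=(500, 128))
Y_real = rng.normal(size=(500, 4))

# Fine-tune only the action head on real data (ridge regression as a
# stand-in for behavior-cloning updates with the encoder frozen).
F = encode(X_real)
W_head = np.linalg.solve(F.T @ F + 1e-3 * np.eye(32), F.T @ Y_real)
pred = F @ W_head
```

The design choice this illustrates is where the real-data budget goes: the frozen encoder keeps the structure learned from cheap simulated episodes, while the scarce real episodes are spent correcting the task-specific mapping.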

Use simulation for exploration, real data for calibration. Simulation is excellent for discovering the space of possible strategies — different approach angles, grasp points, motion trajectories — because it can explore thousands of variations safely. Real data then calibrates which strategies actually work given real physics. A diffusion policy pre-trained on diverse simulated trajectories and fine-tuned on real demonstrations inherits the diversity of simulation and the physical accuracy of real data.

Concrete dataset size ratios. Based on published results and operational experience, the following ratios represent reasonable starting points for behavior cloning with diffusion policies:

- Simple pick-and-place (single rigid object, fixed environment): 50,000 sim episodes + 200-500 real episodes.
- Multi-object manipulation (varying objects, cluttered scenes): 100,000 sim episodes + 1,000-3,000 real episodes.
- Contact-rich assembly (insertion, snap-fit, screw driving): 200,000 sim episodes + 2,000-5,000 real episodes, with force-torque data in the real episodes being essential.
- Deformable object manipulation (food, fabric, cables): simulation contributes minimally; the budget should be concentrated on 3,000-10,000 real episodes, because the sim-to-real gap for deformables is too wide for simulation to provide useful pre-training.
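For teams budgeting a collection campaign, these starting points can be kept as a simple lookup. The category names and numbers below just transcribe the heuristics above; they are starting estimates to revise per task, not prescriptions:

```python
# Heuristic starting budgets (episodes) per task category, from the text above.
# "real" is a (low, high) range; revise all of these against your own task.
DATA_BUDGETS = {
    "pick_and_place": {"sim": 50_000,  "real": (200, 500)},
    "multi_object":   {"sim": 100_000, "real": (1_000, 3_000)},
    "contact_rich":   {"sim": 200_000, "real": (2_000, 5_000)},
    "deformable":     {"sim": 0,       "real": (3_000, 10_000)},
}

def starting_budget(task: str) -> dict:
    """Return the heuristic sim/real episode budget for a task category."""
    return DATA_BUDGETS[task]
```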

These ratios are task-dependent and should be treated as starting estimates, not prescriptions. The key insight is that real-world data is the high-leverage investment. Doubling the simulation budget from 100,000 to 200,000 episodes produces marginal improvements; doubling the real data budget from 500 to 1,000 episodes often produces substantial improvements. Real data is the scarce resource, and collection infrastructure is what determines how much of it you can produce at what quality level.

Real-world data is the calibration signal that makes simulation useful. Humaid collects the real-world demonstration data your models need — in your target environment, with calibrated sensors and trained operators. Whether you need 500 teleoperation episodes to fine-tune a sim-trained policy or 5,000 egocentric demonstrations to train from scratch, we deliver training-ready datasets in HDF5, RLDS, or LeRobot format.