Benchmarking the Effects of Object Pose Estimation
and Reconstruction on Robotic Grasping Success

Varun Burde, Pavel Burget, Torsten Sattler
Czech Technical University in Prague
IEEE ICRA 2026

18.8M+ simulated grasps across 9 grippers × 21 YCB-V objects × 8 reconstruction methods × 2 pose estimators.

Abstract

3D reconstruction serves as the foundational layer for numerous robotic perception tasks, including 6D object pose estimation and grasp pose generation. Modern 3D reconstruction methods for objects can produce visually and geometrically impressive meshes from multi-view images, yet standard geometric evaluations do not reflect how reconstruction quality influences downstream tasks such as robotic manipulation. This paper addresses this gap by introducing a large-scale, physics-based benchmark that evaluates 6D pose estimators and 3D mesh models based on their functional efficacy in grasping. We analyze the impact of model fidelity by generating grasps on various reconstructed 3D meshes and executing them on the ground-truth model, simulating how grasp poses generated with an imperfect model affect interaction with the real object. This assesses the combined impact of pose error, grasp robustness, and geometric inaccuracies from 3D reconstruction. Our results show that reconstruction artifacts significantly decrease the number of grasp pose candidates but have a negligible effect on grasping performance given an accurately estimated pose. Our results also reveal that the relationship between grasp success and pose error is dominated by spatial error: for symmetric objects, even a simple translation-error metric is predictive of grasp success. This work clarifies how the quality of perception systems translates into robotic object manipulation performance.

Method

Benchmark pipeline
Pipeline. An object's mesh and a 6D pose estimate are composed into a planned grasp; the grasp is then executed in PyBullet on the ground-truth object pose.

We formalise the benchmark as a transformation chain through four frames: World (W), Camera (C), Object (O), and Gripper (G):

$$T_{w \to g}^{\text{gt}}\;=\;T_{w \to c}\,\cdot\,T_{c \to o}^{\text{gt}}\,\cdot\,T_{o \to g}$$ $$T_{w \to g}^{\text{est}}\;=\;T_{w \to c}\,\cdot\,T_{c \to o}^{\text{est}}\,\cdot\,T_{o \to g}$$

$T_{o \to g}$ is drawn from a precomputed canonical grasp library; $T_{c \to o}^{\text{est}}$ comes from a 6D pose estimator. The benchmark always executes the grasp on the ground-truth object pose but plans it from $T_{w \to g}^{\text{est}}$, which isolates perception error from real-world clutter and dynamics.
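The transformation chain above can be sketched with plain 4×4 homogeneous transforms. This is a minimal illustration, not the benchmark's code; the camera height, object poses, and the single canonical grasp are made-up numbers.

```python
import numpy as np

def make_T(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R (3x3) and translation t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical fixed camera extrinsics: camera 0.5 m above the world origin.
T_w_c = make_T(np.eye(3), [0.0, 0.0, 0.5])

# Ground-truth object pose in the camera frame vs. a noisy estimate (5 mm offset in x).
T_c_o_gt  = make_T(np.eye(3), [0.100, 0.0, 0.30])
T_c_o_est = make_T(np.eye(3), [0.105, 0.0, 0.30])

# One canonical grasp from the precomputed library, expressed in the object frame.
T_o_g = make_T(np.eye(3), [0.0, 0.0, 0.05])

# The grasp is planned from the estimate ...
T_w_g_est = T_w_c @ T_c_o_est @ T_o_g
# ... but executed against the ground-truth object pose.
T_w_g_gt = T_w_c @ T_c_o_gt @ T_o_g

# The planned gripper pose differs from the ideal one by exactly the pose-estimation error.
print(np.linalg.norm(T_w_g_est[:3, 3] - T_w_g_gt[:3, 3]))  # ~0.005 (5 mm)
```

Because the chain is linear in the object pose, any error in $T_{c \to o}^{\text{est}}$ propagates rigidly into the gripper pose, which is what the benchmark measures.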

Three experimental conditions

Condition                        Grasp mesh   Pose mesh   What it measures
Oracle / Oracle                  GT           GT          Ideal-baseline grasp performance
Oracle / Reconstructed           GT           rec         Pure pose-estimation error impact
Reconstructed / Reconstructed    rec          rec         End-to-end realistic perception
Grasp candidate visualization grid
Sample grasp candidates across objects and grippers, drawn from the antipodal sampler.
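The antipodal condition behind those candidates can be sketched in a few lines: a contact pair is valid if the line connecting the contacts lies inside both friction cones and fits the gripper opening. This is a toy illustration on a point cloud with normals, not the burg-toolkit sampler; `mu` and `max_width` are made-up values.

```python
import numpy as np

def antipodal_pairs(points, normals, mu=0.5, max_width=0.08):
    """Return index pairs (i, j) of contacts satisfying the antipodal condition.

    normals are outward surface normals; the closing direction at each contact
    must lie inside the friction cone of half-angle atan(mu) around the inward
    normal, and the grasp width must fit within the gripper opening.
    """
    cos_limit = np.cos(np.arctan(mu))
    pairs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = points[j] - points[i]
            width = np.linalg.norm(d)
            if width < 1e-9 or width > max_width:
                continue
            d = d / width
            # Closing direction is d at contact i and -d at contact j.
            if np.dot(d, -normals[i]) >= cos_limit and np.dot(-d, -normals[j]) >= cos_limit:
                pairs.append((i, j))
    return pairs

# Toy check on a 3 cm sphere: only the diametrically opposite points qualify.
pts = np.array([[0.03, 0.0, 0.0], [-0.03, 0.0, 0.0], [0.0, 0.03, 0.0]])
nrm = pts / np.linalg.norm(pts, axis=1, keepdims=True)
print(antipodal_pairs(pts, nrm))  # [(0, 1)]
```

A full sampler would additionally place a gripper frame on each pair and reject poses that collide with the object, which is where the benchmark's Collision failures originate.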

Headline findings

1 Pose accuracy dominates grasp success, not mesh quality.

FoundationPose reaches ~89.9% average estimated success; MegaPose ~59.4%. Even with a reconstructed mesh used for both grasp generation and pose estimation, accurate pose estimation can compensate for moderate mesh flaws.

Per-method success comparison

2 3D spatial metrics correlate with grasp success; 2D projection error does not.

MSSD, ADD, ADI, and translation error track grasp success closely. MSPD (2D projection) and pure rotation error are poor predictors: many high-MSPD grasps still succeed. This validates that 2D metrics miss what matters for physical contact.
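For reference, the spatial metrics in question are straightforward to compute. A minimal sketch of ADD, ADI, and translation error (the definitions follow the BOP conventions, but this is an illustration, not the BOP toolkit implementation):

```python
import numpy as np

def transform(T, pts):
    """Apply a 4x4 pose to an (N, 3) point array."""
    return pts @ T[:3, :3].T + T[:3, 3]

def add_error(T_gt, T_est, model_pts):
    """ADD: mean distance between corresponding model points under GT and estimated poses."""
    return np.mean(np.linalg.norm(transform(T_gt, model_pts) - transform(T_est, model_pts), axis=1))

def adi_error(T_gt, T_est, model_pts):
    """ADI: symmetry-aware variant; each GT point matches its nearest estimated point."""
    p_gt, p_est = transform(T_gt, model_pts), transform(T_est, model_pts)
    d = np.linalg.norm(p_gt[:, None, :] - p_est[None, :, :], axis=2)
    return np.mean(d.min(axis=1))

def translation_error(T_gt, T_est):
    return np.linalg.norm(T_gt[:3, 3] - T_est[:3, 3])

# Sanity check: identity pose vs. a pure 5 mm shift along x on a toy point cloud.
rng = np.random.default_rng(0)
model = rng.uniform(-0.05, 0.05, size=(200, 3))
T_gt = np.eye(4)
T_est = np.eye(4)
T_est[0, 3] = 0.005
print(add_error(T_gt, T_est, model), translation_error(T_gt, T_est))  # both 5 mm
```

Note that ADI is never larger than ADD, which is why it is the preferred variant for symmetric objects.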

Estimated success vs. error metrics scatter

3 Mesh artifacts reduce candidate count, not executed-grasp success.

Reconstructed meshes produce more Collision failures during sampling (Instant-NGP suffers most), but given an accurate pose the executed grasps still succeed. UniSurf, with smoother surfaces, occasionally beats the oracle CAD mesh by avoiding edge-case collisions.

Failure mode breakdown across methods

4 Symmetry matters.

Rotation error around an object's symmetry axis has negligible effect on grasp success; rotation error orthogonal to the symmetry axis degrades performance sharply.
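The split between harmless and harmful rotation error can be made concrete with a twist-swing decomposition: the twist is the rotation component about the symmetry axis, the swing is the residual. This is a sketch using SciPy's `Rotation`; the function name and the degenerate-case handling (near-180° swings) are our own simplifications.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def twist_swing_angles(R_err, axis):
    """Split a rotation-error matrix into (twist_deg, swing_deg) w.r.t. `axis`.

    For an object symmetric about `axis`, only the swing component should
    degrade grasp success; the twist is absorbed by the symmetry.
    """
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    q = R.from_matrix(R_err).as_quat()        # scipy order: (x, y, z, w)
    v, w = q[:3], q[3]
    p = np.dot(v, axis) * axis                # vector part projected onto the axis
    twist_q = np.array([p[0], p[1], p[2], w])
    twist = R.from_quat(twist_q / np.linalg.norm(twist_q))
    swing = R.from_matrix(R_err) * twist.inv()
    return np.degrees(twist.magnitude()), np.degrees(swing.magnitude())

# A 30 deg error about the symmetry axis is pure twist; 20 deg about an
# orthogonal axis is pure swing.
err_about = R.from_euler("z", 30, degrees=True).as_matrix()
err_ortho = R.from_euler("x", 20, degrees=True).as_matrix()
print(twist_swing_angles(err_about, [0, 0, 1]))  # roughly (30.0, 0.0)
print(twist_swing_angles(err_ortho, [0, 0, 1]))  # roughly (0.0, 20.0)
```

Plotting grasp success against the swing angle alone would reproduce the sharp degradation reported above while filtering out the benign twist component.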

5 No single best gripper for all objects.

EZGripper wins on ~76% of objects in the YCB-V suite under ideal conditions; WSG-32 and Robotiq 2F-140 cover the rest. The choice is object-dependent; a four-panel summary below shows per-object winners and the failure-mode mix.

Gripper analysis 4-panel summary

Reproduce in 5 commands

Pre-prepared assets (~7 GB) are mirrored on Hugging Face (varunburde/perceptpick). You can skip asset preparation entirely and jump straight to the simulation stages.

# 1. Clone + install
git clone https://github.com/varunburde/perceptpick.git && cd perceptpick
pixi install && pixi shell

# 2. Pull pre-prepared assets directly into ./assets
hf download varunburde/perceptpick --repo-type=dataset --local-dir assets

# 3. Stage B β€” sample antipodal grasps for every (object, gripper)
pixi run python scripts/02_grasp_sweep.py --dataset ycbv --mesh-source GT --n-grasps 5000

# 4. Stage C β€” Oracle / Oracle (ideal baseline)
pixi run python scripts/04_evaluate.py --dataset ycbv --gt-mesh GT --est-mesh GT \
    --pose-csv FoundationPose.csv --gripper auto --workers 4 --resume --headless

# 5. Stage C β€” Reconstructed / Reconstructed (end-to-end realistic)
pixi run python scripts/04_evaluate.py --dataset ycbv --gt-mesh BakedSDF --est-mesh BakedSDF \
    --pose-csv FoundationPose.csv --gripper auto --workers 4 --resume --headless

BibTeX

@inproceedings{burde2026perceptpick,
  title     = {Benchmarking the Effects of Object Pose Estimation and
               Reconstruction on Robotic Grasping Success},
  author    = {Burde, Varun and Burget, Pavel and Sattler, Torsten},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2026}
}

Acknowledgments

This benchmark is heavily derived from burg-toolkit (MIT, © 2022 Martin Rudorfer): gripper URDFs and meshes, the antipodal grasp sampler, the PyBullet simulation harness, and most core utility modules trace back to that codebase. The pose-error metrics come from the BOP toolkit by Tomáš Hodaň (CTU Prague). YCB-V meshes and test scenes are from the BOP Challenge.