Across the three test environments, OCULAR conditions calibration on velocity, action, and semantic observations, enabling uncertainty estimates to transfer to new maps without calibration data from the deployment environment.

Abstract

We introduce Observation-aware Conformal Uncertainty Local-Calibration (OCULAR), a conformal prediction-based algorithm that uses perception information to provide uncertainty quantification guarantees in unseen test-time environments. Previous conformal approaches cannot discriminate between regions of the state-action space that lead to higher or lower model mismatch, and they require environment-specific data. In contrast, our method uses data collected from visually similar environments to provably calibrate a given linear Gaussian dynamics model of arbitrary fidelity. The prediction regions generated by OCULAR are guaranteed to contain the future system states with at least a user-set probability, despite both aleatoric and epistemic uncertainty. Our guarantees are non-asymptotic and distribution-free, and do not require strong assumptions about the unknown real system dynamics. Our calibration procedure distinguishes between observation-velocity-action inputs leading to higher and lower next-state uncertainty, which is helpful for probabilistically safe planning. We validate our algorithm on a double-integrator system subject to random perturbations and significant model mismatch, using both a simplified sensor and a more realistic simulated camera. Our approach appropriately quantifies uncertainty both in-distribution and out-of-distribution, while remaining comparably volume-efficient to baselines that require environment-specific data.

Problem Statement

Let $s_t=(p_t,v_t)\in\mathcal S$ and $a_t\in\mathcal A$ denote a robot’s state and action. We consider discrete-time stochastic systems evolving according to unknown dynamics $s_{t+1}\sim f(s_t,a_t)$ , while the robot observes depth and semantic images $o_t=(o_t^{\mathrm{depth}},o_t^{\mathrm{semantics}})$ . The planner uses a fixed approximate probabilistic dynamics model $\tilde f$ of arbitrary fidelity; here, $\tilde f$ is a linear Gaussian model whose uncertainty can be uncalibrated due to aleatoric disturbances and epistemic model mismatch.
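As a rough illustration of what a linear Gaussian model $\tilde f$ does, the sketch below propagates a Gaussian belief one step through a one-dimensional double integrator. The matrices, the time step `dt`, the process-noise level `q`, and the function name `propagate` are all illustrative assumptions and are not taken from the paper.

```python
def propagate(mean, cov, action, dt=0.1, q=0.01):
    """One-step linear Gaussian prediction for a 1D double integrator.

    State is (position, velocity); the action is an acceleration.
    A, B, dt and q are assumed values chosen for illustration only.
    """
    A = [[1.0, dt], [0.0, 1.0]]
    B = [0.5 * dt * dt, dt]
    # Mean update: A @ mean + B * action
    new_mean = [A[i][0] * mean[0] + A[i][1] * mean[1] + B[i] * action
                for i in range(2)]
    # Covariance update: A @ cov @ A^T + q * I
    a_cov = [[sum(A[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    new_cov = [[sum(a_cov[i][k] * A[j][k] for k in range(2))
                + (q if i == j else 0.0)
                for j in range(2)]
               for i in range(2)]
    return new_mean, new_cov


mean, cov = propagate([0.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], action=2.0)
print(mean, cov)
```

If the real system behaves differently (for example, on a lower-friction surface), the covariance produced this way can be too small, which is the miscalibration the conformal step is meant to correct.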

Given a finite exchangeable calibration dataset $D_{\mathrm{cal}}:=\{(s_t,a_t,o_t,s_{t+1})_i\}_{i=1}^{n}$ from environments different from, but visually similar to, the deployment environment, our aim is to construct an adaptive and volume-efficient prediction region $\hat{\mathcal C}(X)$ over $Y:=s_{t+1}$ . For a user-selected acceptable failure rate $\alpha\in(0,1)$ , OCULAR calibrates $\tilde f$ without test-environment calibration data so that

$\mathbb P\!\left(Y \in \hat{\mathcal C}(X)\right) \ge 1-\alpha.$
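A coverage guarantee of this form is what standard split conformal prediction provides under exchangeability. The minimal sketch below shows the usual finite-sample threshold, the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score; the function name and the toy scores are hypothetical.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal threshold for a list of calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when n is too small to support the requested coverage level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


# Toy scores, for illustration only.
scores = [0.2, 1.5, 0.7, 3.1, 0.9, 2.4, 1.1, 0.4, 1.8]
q_hat = conformal_quantile(scores, alpha=0.25)
print(q_hat)
```

A new test score falls at or below `q_hat` with probability at least $1-\alpha$ when it is exchangeable with the calibration scores.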

Method: OCULAR

OCULAR performs local conformal calibration using robot-frame perception, body velocity, and action information. It projects each observation into a planar semantic footprint $o'_t=\varphi(o_t)$ , encodes that footprint with a CAE, and partitions $X:=\mathrm{process}(X^{\mathrm{raw}})=(v_t,a_t,\mathrm{Encode}(o'_t))$ using a DTree trained on nonconformity scores. SplitCP is performed per leaf, yielding an input-dependent threshold $\hat q_k$ and covariance scaling factor $\xi_k$ for the approximate Gaussian prediction.

Figure: OCULAR offline calibration pipeline. **Offline component of OCULAR.** (1) Observations $o_i$ from $D_{cal}^{part}$ are projected into planar footprints $O_i$ by $p(\cdot)$. A CAE is trained to reconstruct $O_i$, and the decoder is then discarded. (2) All data in $D_{cal}$ is processed by $\mathrm{process}(\cdot)$ into a learned representation $X_i$, and nonconformity scores $s_i$ are computed. (3) A Decision Tree (DTree) is trained on $D_{cal}^{part}$ to partition the learned input space $\mathcal{X}$ into regions of approximately constant score. (4) The held-out processed $D_{cal}^{CP}$ data is fed through the DTree, and scores are grouped per leaf node $k$. SplitCP is performed on each input-space partition $\mathcal{X}_k$ to obtain an input-dependent probabilistic threshold $\hat{q}_k$.
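The final offline step, grouping held-out scores by leaf and running split conformal prediction in each group, can be sketched as follows. This assumes the leaf index of every calibration point is already available from a trained tree; the function name and the toy values are hypothetical.

```python
import math
from collections import defaultdict


def per_leaf_thresholds(leaf_ids, scores, alpha):
    """Compute one split-conformal threshold per decision-tree leaf.

    leaf_ids[i] is the leaf that calibration point i falls into (assumed to
    come from an already-trained tree); scores[i] is its nonconformity score.
    """
    grouped = defaultdict(list)
    for k, s in zip(leaf_ids, scores):
        grouped[k].append(s)

    thresholds = {}
    for k, leaf_scores in grouped.items():
        n_k = len(leaf_scores)
        rank = math.ceil((n_k + 1) * (1 - alpha))  # 1-indexed order statistic
        # Leaves with too few points get an infinite (vacuous) threshold.
        thresholds[k] = sorted(leaf_scores)[rank - 1] if rank <= n_k else math.inf
    return thresholds


leaf_ids = [0, 0, 0, 1, 1, 1, 0, 1]
scores = [1.0, 3.0, 2.0, 10.0, 30.0, 20.0, 4.0, 40.0]
print(per_leaf_thresholds(leaf_ids, scores, alpha=0.5))
```

Leaves that collect larger scores receive larger thresholds, which is what makes the resulting prediction regions input-dependent.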

Figure: OCULAR online uncertainty calibration pipeline. **Online component of OCULAR.** Given an estimated Gaussian at time $t$, a desired action $a_t$, and an observation $o_t$, we create an approximate next-step Gaussian $\tilde{\mathcal N}_{t+1}$ via the approximate model $\tilde{f}$. The current-time information $X_i^{raw}$ is processed, and the learned representation $X_i$ is passed to the Decision Tree. The resulting leaf node $\mathcal{X}_k$ has an associated $\hat{q}_k$, which is multiplied by a fixed constant to obtain $\xi_k$. The approximate uncertainty estimate is then calibrated by scaling its covariance by $\xi_k$, and the output is passed to the subsequent planning step.
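To make the covariance-scaling step concrete, here is a minimal sketch under two loudly stated assumptions that the text above does not confirm: the nonconformity score is the squared Mahalanobis distance under the predicted Gaussian, and the fixed constant is the reciprocal of the $(1-\alpha)$ chi-squared quantile. A two-dimensional state is used only because that quantile then has the closed form $-2\ln\alpha$.

```python
import math


def calibrate_covariance(cov, q_hat_k, alpha):
    """Scale a 2x2 predicted covariance by a leaf-specific factor xi_k.

    Assumptions (not confirmed by the source): the score is a squared
    Mahalanobis distance and xi_k = q_hat_k / chi2_quantile(1 - alpha, dof=2).
    """
    chi2_2dof = -2.0 * math.log(alpha)  # exact (1 - alpha) quantile for 2 dof
    xi_k = q_hat_k / chi2_2dof
    scaled = [[xi_k * entry for entry in row] for row in cov]
    return xi_k, scaled


cov = [[1.0, 0.0], [0.0, 4.0]]
q_hat_k = 2.0 * (-2.0 * math.log(0.1))  # hypothetical leaf threshold
xi_k, calibrated = calibrate_covariance(cov, q_hat_k, alpha=0.1)
print(xi_k, calibrated)
```

Under these assumptions, the $(1-\alpha)$ confidence ellipse of the scaled Gaussian coincides with the set of states whose score is at most $\hat q_k$, so a planner can keep treating the output as an ordinary Gaussian.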

Experiments

We validate OCULAR on a double-integrator system in Isaac Sim, using depth and semantic segmentation cameras, across three snowy T-section environments. The white lower-friction regions are out-of-distribution (OOD) relative to the linear Gaussian model, while the remaining regions are in-distribution (ID). OCULAR uses calibration data only from visually similar maps other than the tested map.

The three tested Isaac Sim environments (based on Rivermark).

The individual rollout picker shows per-method inference visualizations for selected test episodes.

The grid comparison synchronizes all six methods on the same selected map and episode.

Isaac Sim Test Cases

For held-out test cases, we propagate 10,000 Monte Carlo particles under the true dynamics and report marginal coverage separately for in-distribution (ID) and out-of-distribution (OOD) regions. Relative volume is measured against an oracle Gaussian scaling that achieves $90\%$ coverage.
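The two metrics can be sketched as follows for a two-dimensional Gaussian region. This assumes the prediction region is the ellipse of states whose squared Mahalanobis distance is below a threshold, and uses the fact that scaling a covariance by $\xi$ scales the ellipsoid volume by $\xi^{d/2}$. The function names are hypothetical and the sampled particles merely stand in for simulator rollouts.

```python
import math
import random


def empirical_coverage(particles, mean, cov, threshold):
    """Fraction of 2D particles whose squared Mahalanobis distance <= threshold."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    inside = 0
    for x, y in particles:
        dx, dy = x - mean[0], y - mean[1]
        m2 = (dx * (inv[0][0] * dx + inv[0][1] * dy)
              + dy * (inv[1][0] * dx + inv[1][1] * dy))
        if m2 <= threshold:
            inside += 1
    return inside / len(particles)


def relative_volume(xi, xi_oracle, dim):
    """Volume ratio of two ellipsoids whose covariances differ only by scale."""
    return (xi / xi_oracle) ** (dim / 2)


rng = random.Random(0)
particles = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(10_000)]
threshold = -2.0 * math.log(0.1)  # exact 90% chi-squared quantile for 2 dof
coverage = empirical_coverage(particles, (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]], threshold)
print(f"coverage ~ {coverage:.3f}, relative volume = {relative_volume(1.2, 1.0, 2):.2f}")
```

A relative volume below 1 paired with coverage below the target indicates an overconfident region, while a value well above 1 indicates an overly conservative one.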

Test-case results across three Isaac Sim roads.
Metric	Method	Tested map not in $D_{cal}$?	icySide (ID)	icySide (OOD)	icyMain (ID)	icyMain (OOD)	icyMiddle (ID)	icyMiddle (OOD)
Marginal coverage (%)	No CP	N/A	90.0	56.7	90.0	56.7	90.0	56.7
	SplitCP	×	99.5	89.6	100.0	98.4	99.1	85.9
	LUCCa	×	91.1	91.5	90.1	91.4	90.1	90.9
	OCULAR (ours)	✓	91.5	90.1	90.4	90.1	91.1	90.6
Median volume (relative to oracle) ↓	No CP	N/A	1.00	0.28	1.00	0.28	1.00	0.28
	SplitCP	×	3.73	1.03	8.68	2.40	3.07	0.85
	LUCCa	×	1.08	1.13	1.02	1.10	1.02	1.13
	OCULAR (ours)	✓	1.03	1.02	1.02	1.02	1.06	1.06

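In the original table, red marks coverage below 90% (here, the OOD entries for No CP on all maps and for SplitCP on icySide and icyMiddle). Each map has 4464 test transitions.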

Isaac Sim Planning Performance

We run 30 planning trials on each Isaac Sim map. Success means reaching all subgoals without collisions; failures were collisions rather than timeouts.

Planning results across three Isaac Sim roads.
Method	Tested map not in $D_{cal}$?	Success (%) ↑, icySide	Success (%) ↑, icyMain	Success (%) ↑, icyMiddle	Steps to completion (mean ± std) ↓, icySide	Steps to completion (mean ± std) ↓, icyMain	Steps to completion (mean ± std) ↓, icyMiddle
No CP	N/A	0	0	0	--	--	--
SplitCP	×	0	0	0	--	--	--
LUCCa	×	100	100	100	401.4 ± 21.2	517.7 ± 9.5	302.3 ± 9.1
OCULAR (ours)	✓	100	100	100	238.6 ± 6.5	302.7 ± 5.6	294.7 ± 11.7

BibTeX (cite this!)

@misc{marques2026localconformalcalibrationdynamics,
      title={Local Conformal Calibration of Dynamics Uncertainty from Semantic Images},
      author={Luís Marques and Dmitry Berenson},
      year={2026},
      eprint={2605.13028},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.13028},
}

Local Conformal Calibration of Dynamics Uncertainty from Semantic Images

Luís Marques, Dmitry Berenson

17th International Workshop on the Algorithmic Foundations of Robotics (WAFR) 2026