VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

1University of California, Berkeley 2Carnegie Mellon University 3University of Hong Kong 4Peking University 5Stony Brook University 6The University of North Carolina at Chapel Hill

Abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full retraining to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. We then fine-tune only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Code and checkpoints are available in the supplementary materials.
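To make the pretraining phase concrete, below is a minimal PyTorch sketch of how multiple frozen teacher VFMs could be distilled into a vision expert library on top of a shared backbone. The module names (VisionExpertLibrary, per-teacher projection heads) and the exact loss are our assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionExpertLibrary(nn.Module):
    """A bank of lightweight experts on top of a shared backbone (illustrative)."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens, expert_idx):
        # tokens: (B, N, dim) patch features from the Base Vision Transformer
        return self.experts[expert_idx](tokens)

def distillation_loss(base_vit, library, heads, teachers, images):
    # Each expert (through a per-teacher projection head) is trained to match
    # the patch features of one frozen teacher VFM (e.g. DINOv2, ViT, CLIP).
    tokens = base_vit(images)                          # (B, N, dim) shared representation
    loss = 0.0
    for i, teacher in enumerate(teachers):
        with torch.no_grad():
            target = teacher(images)                   # (B, N, teacher_dim), frozen teacher
        pred = heads[i](library(tokens, i))            # project expert output to teacher_dim
        loss = loss + F.mse_loss(pred, target) \
                    + (1 - F.cosine_similarity(pred, target, dim=-1)).mean()
    return loss / len(teachers)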


Overall structure of VER. VER comprises two key components: the Base Vision Transformer (BVT), which processes images into unified representations, and the Vision Expert Library (VEL), which stores a diverse set of specialized vision experts and selectively utilizes them to mimic teacher vision foundation models and enhance performance in downstream robotic tasks. Our framework consists of two phases: (1) Pretraining, where we distill multiple foundation models (DINOv2, ViT, CLIP) into VER; (2) Downstream Robotic Tasks, where we freeze the experts and train a lightweight Robot Router (<0.4% parameters) that dynamically selects task-relevant visual features to guide the policy head in generating appropriate robotic actions. This two-stage approach enables efficient knowledge distillation from diverse vision foundation models and adaptive feature selection for robotic tasks.
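The sketch below illustrates the downstream phase under the same assumptions: the Base Vision Transformer and expert library stay frozen, and a small gating network scores the experts for every patch (Patchwise Expert Routing) and mixes the top-K of them. The class name RobotRouter and the dense evaluate-then-gather implementation are ours; they show the routing mechanism, not the authors' code.

import torch
import torch.nn as nn

class RobotRouter(nn.Module):
    """Lightweight per-patch gate over a frozen expert library (illustrative)."""
    def __init__(self, dim, num_experts, top_k):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)   # the only trainable vision parameters (<0.4%)
        self.top_k = top_k

    def forward(self, tokens, library):
        # tokens: (B, N, dim) from the frozen Base Vision Transformer;
        # library: the (frozen) VisionExpertLibrary from the previous sketch.
        B, N, D = tokens.shape
        logits = self.gate(tokens)                               # (B, N, E) per-patch expert scores
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)     # keep the K best experts per patch
        weights = topk_val.softmax(dim=-1)                       # renormalize over selected experts
        # Run every expert once and gather; dense evaluation is fine for a small library.
        all_out = torch.stack(
            [library(tokens, e) for e in range(len(library.experts))], dim=2
        )                                                        # (B, N, E, D)
        chosen = all_out.gather(2, topk_idx.unsqueeze(-1).expand(-1, -1, -1, D))  # (B, N, K, D)
        return (weights.unsqueeze(-1) * chosen).sum(dim=2)       # (B, N, D) routed patch features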

Patch Feature Visualization

Through lightweight Robot Router training, VER learns to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. Patch feature visualization shows that VER concentrates on task-relevant patches, significantly reducing extreme outliers in task-irrelevant regions (e.g., background).

Patch feature visualization on the pen task across 10 training random seeds. We find that when only the pretrained vision foundation model experts are used, the features fail to focus on task-relevant patches. With Patchwise Expert Routing (PER) and Curriculum Top-K Annealing (CTA), VER dynamically and robustly selects suitable experts from the pretrained library, yielding patch features that concentrate on the task-relevant patches.
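Curriculum Top-K Annealing is not described in detail on this page, so the snippet below is only one plausible reading: the router starts by consulting many experts per patch, and K is annealed down over training so that selection becomes sparse and precise. The cosine schedule and the start/end values are assumptions, not the paper's exact procedure.

import math

def annealed_top_k(step, total_steps, k_start=8, k_end=2):
    # Cosine-anneal the number of experts consulted per patch (assumed schedule).
    progress = min(step / max(total_steps, 1), 1.0)
    k = k_end + 0.5 * (k_start - k_end) * (1.0 + math.cos(math.pi * progress))
    return max(k_end, int(round(k)))

# Hypothetical usage inside the router training loop:
#   router.top_k = annealed_top_k(step, total_steps)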

Patch feature visualization on the relocate task across 10 training random seeds.

Patch feature visualization on the bin-cup task. We compare Theia (left column), VER before expert selection (middle column), and VER after expert selection (right column). After expert selection, VER concentrates on task-relevant objects and suppresses features from robot-related and background patches.

Patch feature visualization on the cross-bin task.

Patch feature visualization on the cylinder-plate task. Most surprisingly, VER focuses on the correct patches at different task stages. In the in-hand camera view, when grasping the cylinder, VER focuses on the cylinder and ignores the plate; when moving to the plate, VER switches its focus to the plate.

Experiment results

We evaluate VER on different policy heads (flow matching policy, diffusion policy, behavior cloning policy) across 17 diverse robotic tasks. Results show that across all policy heads, VER consistently achieves strong performance.
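As an illustration of how such an encoder could feed a policy head, here is a minimal behavior-cloning setup reusing the sketches above: the backbone and expert library are frozen, while the Robot Router and a small MLP head are trained on robot demonstrations. The pooling choice and layer sizes are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Behavior-cloning head on top of frozen VER features (illustrative)."""
    def __init__(self, base_vit, library, router, feat_dim, action_dim):
        super().__init__()
        self.base_vit, self.library, self.router = base_vit, library, router
        # Freeze the backbone and the expert library; only the router and head train.
        for module in (self.base_vit, self.library):
            for p in module.parameters():
                p.requires_grad_(False)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, images):
        with torch.no_grad():
            tokens = self.base_vit(images)            # (B, N, feat_dim) frozen patch features
        routed = self.router(tokens, self.library)    # per-patch expert mixture
        return self.head(routed.mean(dim=1))          # mean-pool patches, predict an action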


Performance comparison with other vision encoders on 11 tasks from the Franka Kitchen, Meta-World, and Adroit environments. The same policy head from Theia is used for a fair comparison of vision encoders. Our approach, VER, achieves the highest average success rate (74.7%) among the compared encoders.


BibTeX

to be added