VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing

1University of California, Berkeley 2Carnegie Mellon University 3University of Hong Kong 4Peking University 5Stony Brook University 6The University of North Carolina at Chapel Hill

Abstract

Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full retraining to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. We then fine-tune only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Code and checkpoints are available in the supplementary materials.
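To make the pretraining phase concrete, below is a minimal PyTorch sketch of how multiple frozen teacher VFMs could be distilled into a vision expert library on top of a shared backbone. The module names (VisionExpertLibrary, per-teacher projection heads) and the exact loss are our assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionExpertLibrary(nn.Module):
    """A bank of lightweight experts on top of a shared backbone (illustrative)."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens, expert_idx):
        # tokens: (B, N, dim) patch features from the Base Vision Transformer
        return self.experts[expert_idx](tokens)

def distillation_loss(base_vit, library, heads, teachers, images):
    # Each expert (through a per-teacher projection head) is trained to match
    # the patch features of one frozen teacher VFM (e.g. DINOv2, ViT, CLIP).
    tokens = base_vit(images)                          # (B, N, dim) shared representation
    loss = 0.0
    for i, teacher in enumerate(teachers):
        with torch.no_grad():
            target = teacher(images)                   # (B, N, teacher_dim), frozen teacher
        pred = heads[i](library(tokens, i))            # project expert output to teacher_dim
        loss = loss + F.mse_loss(pred, target) \
                    + (1 - F.cosine_similarity(pred, target, dim=-1)).mean()
    return loss / len(teachers)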


Overall structure of VER. VER comprises two key components: the Base Vision Transformer (BVT), which processes images into unified representations, and the Vision Expert Library (VEL), which stores a diverse set of specialized vision experts and selectively utilizes them to mimic teacher vision foundation models and enhance performance in downstream robotic tasks. Our framework consists of two phases: (1) Pretraining, where we distill multiple foundation models (DINOv2, ViT, CLIP) into VER; (2) Downstream Robotic Tasks, where we freeze the experts and train a lightweight Robot Router (<0.4% parameters) that dynamically selects task-relevant visual features to guide the policy head in generating appropriate robotic actions. This two-stage approach enables efficient knowledge distillation from diverse vision foundation models and adaptive feature selection for robotic tasks.
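The sketch below illustrates the downstream phase under the same assumptions: the Base Vision Transformer and expert library stay frozen, and a small gating network scores the experts for every patch (Patchwise Expert Routing) and mixes the top-K of them. The class name RobotRouter and the dense evaluate-then-gather implementation are ours; they show the routing mechanism, not the authors' code.

import torch
import torch.nn as nn

class RobotRouter(nn.Module):
    """Lightweight per-patch gate over a frozen expert library (illustrative)."""
    def __init__(self, dim, num_experts, top_k):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)   # the only trainable vision parameters (<0.4%)
        self.top_k = top_k

    def forward(self, tokens, library):
        # tokens: (B, N, dim) from the frozen Base Vision Transformer;
        # library: the (frozen) VisionExpertLibrary from the previous sketch.
        B, N, D = tokens.shape
        logits = self.gate(tokens)                               # (B, N, E) per-patch expert scores
        topk_val, topk_idx = logits.topk(self.top_k, dim=-1)     # keep the K best experts per patch
        weights = topk_val.softmax(dim=-1)                       # renormalize over selected experts
        # Run every expert once and gather; dense evaluation is fine for a small library.
        all_out = torch.stack(
            [library(tokens, e) for e in range(len(library.experts))], dim=2
        )                                                        # (B, N, E, D)
        chosen = all_out.gather(2, topk_idx.unsqueeze(-1).expand(-1, -1, -1, D))  # (B, N, K, D)
        return (weights.unsqueeze(-1) * chosen).sum(dim=2)       # (B, N, D) routed patch features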

Patch Feature Visualization

Through lightweight Robot Router training, VER learns to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. Patch feature visualization shows that VER concentrates on task-relevant patches, significantly reducing extreme outliers in task-irrelevant regions (e.g., background).

Patch feature visualization on the pen task across 10 training random seeds. We find that when only the pretrained vision foundation model experts are used, the features fail to focus on task-relevant patches. With Patchwise Expert Routing (PER) and Curriculum Top-K Annealing (CTA), VER dynamically and robustly selects suitable experts from the pretrained library, yielding patch features that concentrate on the task-relevant patches.
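Curriculum Top-K Annealing is not described in detail on this page, so the snippet below is only one plausible reading: the router starts by consulting many experts per patch, and K is annealed down over training so that selection becomes sparse and precise. The cosine schedule and the start/end values are assumptions, not the paper's exact procedure.

import math

def annealed_top_k(step, total_steps, k_start=8, k_end=2):
    # Cosine-anneal the number of experts consulted per patch (assumed schedule).
    progress = min(step / max(total_steps, 1), 1.0)
    k = k_end + 0.5 * (k_start - k_end) * (1.0 + math.cos(math.pi * progress))
    return max(k_end, int(round(k)))

# Hypothetical usage inside the router training loop:
#   router.top_k = annealed_top_k(step, total_steps)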

Patch feature visualization on the relocate task across 10 training random seeds.

Patch feature visualization on the bin-cup task. We compare Theia (left column), VER before expert selection (middle column), and VER after expert selection (right column). After expert selection, VER concentrates on task-relevant objects and suppresses features from robot-related and background patches.

Patch feature visualization on the cross-bin task.

Patch feature visualization on the cylinder-plate task. Most surprisingly, VER focuses on the correct patches at different task stages. In the in-hand camera view, when grasping the cylinder, VER focuses on the cylinder and ignores the plate; when moving to the plate, VER switches its focus to the plate.

Experiment results

We evaluate VER on different policy heads (flow matching policy, diffusion policy, behavior cloning policy) across 17 diverse robotic tasks. Results show that across all policy heads, VER consistently achieves strong performance.
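As an illustration of how such an encoder could feed a policy head, here is a minimal behavior-cloning setup reusing the sketches above: the backbone and expert library are frozen, while the Robot Router and a small MLP head are trained on robot demonstrations. The pooling choice and layer sizes are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    """Behavior-cloning head on top of frozen VER features (illustrative)."""
    def __init__(self, base_vit, library, router, feat_dim, action_dim):
        super().__init__()
        self.base_vit, self.library, self.router = base_vit, library, router
        # Freeze the backbone and the expert library; only the router and head train.
        for module in (self.base_vit, self.library):
            for p in module.parameters():
                p.requires_grad_(False)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, images):
        with torch.no_grad():
            tokens = self.base_vit(images)            # (B, N, feat_dim) frozen patch features
        routed = self.router(tokens, self.library)    # per-patch expert mixture
        return self.head(routed.mean(dim=1))          # mean-pool patches, predict an action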


Performance comparison with other vision encoders on 11 tasks from the Franka Kitchen, Meta-World, and Adroit environments. The same policy head from Theia is used for a fair comparison of vision encoders. Our approach, VER, achieves the highest average success rate (74.7%) among the compared encoders.


BibTeX

to be added