Using renderings of 3D objects, we first perform multi-view fusion of DINOv2 features, followed by clustering, to obtain fine-grained semantic regions of objects; these regions are fed to a VLM, which proposes relevant tasks and their corresponding regions (a). The extracted affordances are then distilled by training a language-conditioned FiLM head atop frozen DINOv2 features (b). The learned task-conditioned affordance model provides in-the-wild predictions for diverse fine-grained regions, which serve as the observation space for manipulation policies (c).
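To make step (b) concrete, below is a minimal sketch of one way a language-conditioned FiLM head over frozen DINOv2 patch features could be structured. All module names, dimensions, and the choice of text encoder here are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of a language-conditioned FiLM head over frozen DINOv2
# patch features. Dimensions and module names are assumptions.
import torch
import torch.nn as nn

class FiLMAffordanceHead(nn.Module):
    def __init__(self, feat_dim=768, text_dim=512, hidden_dim=256):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the
        # task-instruction embedding.
        self.film = nn.Linear(text_dim, 2 * feat_dim)
        # Small decoder mapping modulated features to a per-patch
        # affordance score.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, patch_feats, text_emb):
        # patch_feats: (B, N, feat_dim) frozen DINOv2 patch features
        # text_emb:    (B, text_dim) embedding of the task instruction
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        modulated = gamma.unsqueeze(1) * patch_feats + beta.unsqueeze(1)
        # (B, N) per-patch affordance logits
        return self.decoder(modulated).squeeze(-1)

# Hypothetical usage: dinov2 and text_encoder stand in for any frozen
# vision backbone and instruction encoder.
#   feats = dinov2(images)                  # (B, N, 768), kept frozen
#   text_emb = text_encoder(["pour beer"])  # (B, 512)
#   logits = FiLMAffordanceHead()(feats, text_emb)  # (B, N)
```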
Simulation tasks and the corresponding generalization settings investigated in the paper.
Pouring: Training Scenario; Unseen Pose; Unseen Instance (beer bottle & bowl); Unseen Category (beer bottle → coke can); Unseen Instruction (pour beer → water plant)
Opening Drawer: Training Scenario; Unseen Pose; Unseen Instance; Unseen Category (cabinet → fridge)
Inserting Pen: Training Scenario; Unseen Pose; Unseen Instance (marker); Unseen Category (pen holder → cup); Unseen Instruction (insert pen → insert carrot)
Videos are played at 2x speed.
Watering Plant.
Opening Drawer.
Inserting Pen.