Guiding Long-Horizon Task and Motion Planning
with Vision Language Models

In Submission

MIT, NVIDIA

VLM-TAMP is a general method for solving long-horizon manipulation planning problems,
combining the strengths and overcoming the limitations of Task and Motion Planning (TAMP) and Vision Language Models (VLMs).

[Figure: VLM-TAMP overview]

Vision-Language Models (VLMs) can generate plausible high-level plans.
However, there is no guarantee that the actions predicted and grounded by VLMs/LLMs
are geometrically and kinematically feasible for a particular robot embodiment.
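To make this concrete, below is a toy reachability check in Python. It is our own illustration, not the paper's geometry pipeline: whether the same symbolic "place" step is feasible depends on the arm's reach from the robot base, and a real TAMP refiner would run inverse kinematics and collision checking instead of this distance test.

import math

# Toy stand-in for a kinematic feasibility test (illustrative assumption only):
# an action like place(arm, cabbage, pot) is treated as feasible if the target
# lies within that arm's reach from the robot base.
def within_reach(base_xy, target_xy, arm_reach_m):
    dx = target_xy[0] - base_xy[0]
    dy = target_xy[1] - base_xy[1]
    return math.hypot(dx, dy) <= arm_reach_m

# The same symbolic step "place the cabbage in the pot" can be feasible for a
# long-armed robot and infeasible for a shorter-armed one:
pot_xy, base_xy = (1.0, 0.4), (0.0, 0.0)
print(within_reach(base_xy, pot_xy, arm_reach_m=1.2))   # long arms  -> True
print(within_reach(base_xy, pot_xy, arm_reach_m=0.9))   # short arms -> False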

For example, when the VLM-predicted step is "Place the cabbage in the pot,"
the shortest plans for the following three robot embodiments differ (the same plans are written out as symbolic actions in the sketch after the table):

Dual-Arm Rummy (Long Arms):
    pick(left-arm, cabbage)
    place(left-arm, cabbage, pot)

Dual-Arm PR2 (Shorter Arms):
    pick(left-arm, cabbage)
    push-handle(right-arm, cabbage-drawer)
    place(left-arm, cabbage, pot)

Single-Arm PR2:
    pick(left-arm, cabbage)
    place(left-arm, cabbage, counter)
    pick(left-arm, cabbage)
    place(left-arm, cabbage, pot)

[Trajectory animations for each embodiment]
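For reference, the same three plans can be written as symbolic action sequences. The dictionary below is an illustrative encoding: the action names follow the table above, but the tuple format is our own, not the paper's exact representation.

PLANS = {
    "Dual-Arm Rummy (Long Arms)": [
        ("pick",  "left-arm", "cabbage"),
        ("place", "left-arm", "cabbage", "pot"),
    ],
    "Dual-Arm PR2 (Shorter Arms)": [
        ("pick",        "left-arm",  "cabbage"),
        ("push-handle", "right-arm", "cabbage-drawer"),
        ("place",       "left-arm",  "cabbage", "pot"),
    ],
    "Single-Arm PR2": [
        ("pick",  "left-arm", "cabbage"),
        ("place", "left-arm", "cabbage", "counter"),
        ("pick",  "left-arm", "cabbage"),
        ("place", "left-arm", "cabbage", "pot"),
    ],
}

for robot, plan in PLANS.items():
    print(f"{robot}: {len(plan)} actions")  # 2, 3, and 4 actions respectively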

We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate
both semantically meaningful and horizon-reducing subgoals that guide a task and motion planner.
When a subgoal cannot be refined, the VLM is queried again for replanning.
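A condensed sketch of this loop is given below. The functions vlm_propose_subgoals, tamp_refine, and execute are hypothetical stand-ins for the VLM subgoal query, the task and motion planner, and plan execution; they are assumptions for illustration, not the released API.

def vlm_tamp(goal, state, max_replans=5):
    for _ in range(max_replans):
        # 1. Ask the VLM for a sequence of semantically meaningful,
        #    horizon-reducing subgoals toward the goal.
        subgoals = vlm_propose_subgoals(goal, state)
        for subgoal in subgoals:
            # 2. Let the TAMP planner refine each subgoal into a
            #    geometrically and kinematically feasible plan.
            plan = tamp_refine(subgoal, state)
            if plan is None:
                # 3. Refinement failed: reprompt the VLM from the
                #    current state and start over.
                break
            state = execute(plan, state)
        else:
            return state  # all subgoals refined and executed
    return None  # gave up after max_replans reprompts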

[Interactive demo: example VLM-TAMP runs for different VLMs, query modes, robots, and initial states]

We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals
that require performing 30-50 actions in sequence and interacting with up to 21 objects.

VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences,
both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%).

[Figure: success rates and task completion percentages of VLM-TAMP versus baselines]

Reprompting the VLM visibly benefits VLM-TAMP, increasing task success rates by 47 to 55% on the harder problems,
while it does not help the baseline that directly predicts action sequences.

[Figure: effect of VLM reprompting on task success rates]

Video

BibTeX

@misc{yang2024guidinglonghorizontaskmotion,
      title={Guiding Long-Horizon Task and Motion Planning with Vision Language Models}, 
      author={Zhutian Yang and Caelan Garrett and Dieter Fox and Tomás Lozano-Pérez and Leslie Pack Kaelbling},
      year={2024},
      eprint={2410.02193},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.02193}, 
}