Guiding Long-Horizon Task and Motion Planning
with Vision Language Models

In Submission

MIT, NVIDIA

VLM-TAMP is a general method for solving long-horizon manipulation planning problems,
combining the strengths and overcoming the limitations of Task and Motion Planning (TAMP) and Vision Language Models (VLMs).

[Figure: VLM-TAMP overview]

Vision-Language Models (VLMs) can generate plausible high-level plans.
However, there is no guarantee that the actions predicted and grounded by VLMs/LLMs
are geometrically and kinematically feasible for a particular robot embodiment.
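To make this concrete, below is a toy reachability check in Python. It is our own illustration, not the paper's geometry pipeline: whether the same symbolic "place" step is feasible depends on the arm's reach from the robot base, and a real TAMP refiner would run inverse kinematics and collision checking instead of this distance test.

import math

# Toy stand-in for a kinematic feasibility test (illustrative assumption only):
# an action like place(arm, cabbage, pot) is treated as feasible if the target
# lies within that arm's reach from the robot base.
def within_reach(base_xy, target_xy, arm_reach_m):
    dx = target_xy[0] - base_xy[0]
    dy = target_xy[1] - base_xy[1]
    return math.hypot(dx, dy) <= arm_reach_m

# The same symbolic step "place the cabbage in the pot" can be feasible for a
# long-armed robot and infeasible for a shorter-armed one:
pot_xy, base_xy = (1.0, 0.4), (0.0, 0.0)
print(within_reach(base_xy, pot_xy, arm_reach_m=1.2))   # long arms  -> True
print(within_reach(base_xy, pot_xy, arm_reach_m=0.9))   # short arms -> False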

For example, when the VLM-predicted step is "Place the cabbage in the pot,"
the shortest plans for the following three robot embodiments differ (the same plans are written out as symbolic actions in the sketch after the table):

Dual-Arm Rummy (Long Arms):
    pick(left-arm, cabbage)
    place(left-arm, cabbage, pot)

Dual-Arm PR2 (Shorter Arms):
    pick(left-arm, cabbage)
    push-handle(right-arm, cabbage-drawer)
    place(left-arm, cabbage, pot)

Single-Arm PR2:
    pick(left-arm, cabbage)
    place(left-arm, cabbage, counter)
    pick(left-arm, cabbage)
    place(left-arm, cabbage, pot)

[Trajectory animations for each embodiment]
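For reference, the same three plans can be written as symbolic action sequences. The dictionary below is an illustrative encoding: the action names follow the table above, but the tuple format is our own, not the paper's exact representation.

PLANS = {
    "Dual-Arm Rummy (Long Arms)": [
        ("pick",  "left-arm", "cabbage"),
        ("place", "left-arm", "cabbage", "pot"),
    ],
    "Dual-Arm PR2 (Shorter Arms)": [
        ("pick",        "left-arm",  "cabbage"),
        ("push-handle", "right-arm", "cabbage-drawer"),
        ("place",       "left-arm",  "cabbage", "pot"),
    ],
    "Single-Arm PR2": [
        ("pick",  "left-arm", "cabbage"),
        ("place", "left-arm", "cabbage", "counter"),
        ("pick",  "left-arm", "cabbage"),
        ("place", "left-arm", "cabbage", "pot"),
    ],
}

for robot, plan in PLANS.items():
    print(f"{robot}: {len(plan)} actions")  # 2, 3, and 4 actions respectively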

We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate
both semantically meaningful and horizon-reducing subgoals that guide a task and motion planner.
When a subgoal cannot be refined, the VLM is queried again for replanning.
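A condensed sketch of this loop is given below. The functions vlm_propose_subgoals, tamp_refine, and execute are hypothetical stand-ins for the VLM subgoal query, the task and motion planner, and plan execution; they are assumptions for illustration, not the released API.

def vlm_tamp(goal, state, max_replans=5):
    for _ in range(max_replans):
        # 1. Ask the VLM for a sequence of semantically meaningful,
        #    horizon-reducing subgoals toward the goal.
        subgoals = vlm_propose_subgoals(goal, state)
        for subgoal in subgoals:
            # 2. Let the TAMP planner refine each subgoal into a
            #    geometrically and kinematically feasible plan.
            plan = tamp_refine(subgoal, state)
            if plan is None:
                # 3. Refinement failed: reprompt the VLM from the
                #    current state and start over.
                break
            state = execute(plan, state)
        else:
            return state  # all subgoals refined and executed
    return None  # gave up after max_replans reprompts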

[Interactive demo: example VLM-TAMP runs for different VLMs, query modes, robots, and initial states]

We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals
that require performing 30-50 actions in sequence and interacting with up to 21 objects.

VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences,
both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%).

[Figure: success rates and task completion percentages of VLM-TAMP versus baselines]

Reprompting the VLM visibly benefits VLM-TAMP, increasing task success rates by 47 to 55% on the harder problems,
while it does not help the baseline that directly predicts action sequences.

[Figure: effect of VLM reprompting on task success rates]

Video

BibTeX

@misc{yang2024guidinglonghorizontaskmotion,
      title={Guiding Long-Horizon Task and Motion Planning with Vision Language Models}, 
      author={Zhutian Yang and Caelan Garrett and Dieter Fox and Tomás Lozano-Pérez and Leslie Pack Kaelbling},
      year={2024},
      eprint={2410.02193},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.02193}, 
}