Intro
To make robots accessible to a broad audience, it is critical to endow them with the ability to take universal modes of communication, like commands given in natural language, and extract a concrete task specification defined in a formal language like linear temporal logic (LTL). For example, the command "go to the red room, then the blue room" might be formalized as F(red_room ∧ F blue_room).
In this paper, we present a learning-based approach that translates natural language commands into LTL specifications with very limited human-labeled training data by leveraging Large Language Models (LLMs). Our model translates natural language commands with 75% accuracy given only about 12 annotations and, when given full training data, achieves state-of-the-art performance. We also show how its outputs can be used to plan long-horizon, multi-stage tasks on a 12D quadrotor.
Approach
Given a predefined set of possible LTL formulas and atomic propositions, along with up to one natural language annotation per formula, we first translate the predefined formulas into (structured) English, and then use the paraphrasing abilities of modern LLMs to synthesize a large corpus of diverse natural language commands.
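To make this step concrete, here is a minimal sketch in Python. The operator templates, the formula encoding, and the commented-out `paraphrase` helper are illustrative assumptions, not the exact grammar or prompts used in the paper.

```python
# Template-based rendering of LTL operators as structured English.
# These templates are illustrative, not the paper's exact grammar.
TEMPLATES = {
    "F": "eventually {0}",
    "G": "always {0}",
    "!": "do not {0}",
    "&": "{0} and {1}",
    "U": "{0} until {1}",
}

def to_structured_english(ast):
    """Render an LTL AST (nested tuples; strings are atomic propositions)."""
    if isinstance(ast, str):
        return ast.replace("_", " ")
    op, *args = ast
    return TEMPLATES[op].format(*(to_structured_english(a) for a in args))

formula = ("F", ("&", "red_room", ("F", "blue_room")))
structured = to_structured_english(formula)
print(structured)  # -> "eventually red room and eventually blue room"

# An LLM then paraphrases each structured sentence many times, yielding a
# diverse synthetic corpus of (command, formula) training pairs.
# `paraphrase` is a hypothetical wrapper around any modern LLM API:
# corpus = [(paraphrase(structured), formula) for _ in range(100)]
```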
Given this corpus, we fine-tune an LLM on it. We explore two variants of the training labels: 1) the raw LTL formulas, or 2) a canonical form of the LTL formulas (an intermediate representation that is closer to English). At evaluation time, we use constrained decoding to guarantee that the LLM's output is a syntactically valid LTL formula.
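The sketch below illustrates one way to implement such constrained decoding with Hugging Face's `prefix_allowed_tokens_fn` generation hook, restricting every decoding step to a whitelist of LTL tokens. It is an API sketch under simplifying assumptions: the checkpoint is not fine-tuned, and a flat whitelist only enforces the vocabulary, whereas a full implementation would also track the grammar state.

```python
from transformers import BartForConditionalGeneration, BartTokenizerFast

tok = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Toy whitelist of LTL operator and proposition tokens (plus BOS/EOS, which
# the tokenizer adds automatically when encoding the string below).
ltl_vocab = "F G U X & | ! ( ) red_room blue_room"
allowed_ids = sorted(set(tok(ltl_vocab).input_ids))

def allowed_tokens(batch_id, input_ids):
    # Called at every decoding step: the next token must come from this set.
    return allowed_ids

inputs = tok("go to the red room, then the blue room", return_tensors="pt")
out = model.generate(
    **inputs,
    prefix_allowed_tokens_fn=allowed_tokens,
    max_new_tokens=32,
)
print(tok.decode(out[0], skip_special_tokens=True))
```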
Results
We evaluate our method on three datasets, each associated with a different task and environment. Our model translates natural language commands with 75% accuracy given only about 12 annotations and, when given full training data, consistently achieves state-of-the-art performance. The paper also presents comprehensive ablation studies that validate each of our design decisions.
Results in low-data regimes
In this setup, models are trained with at most 12 human annotations and evaluated on the entire original dataset. Our model significantly advances the state of the art in this regime, achieving 75% accuracy on average. All tables below report translation accuracy (%).
| Model architecture | Training data | Test data | Drone Dataset | Cleanup Dataset | Pick Dataset |
|---|---|---|---|---|---|
| RNN | synthetic | full golden | 22.41 | 52.54 | 32.39 |
| CopyNet | synthetic | full golden | 36.41 | 53.40 | 40.36 |
| BART-FT-Raw (ours) | synthetic | full golden | 69.39 | 78.00 | 81.45 |
| BART-FT-Canonical (ours) | synthetic | full golden | 68.38 | 77.90 | 78.23 |
Results in standard data regimes
In this setup, we follow the settings of previous works: models are evaluated by five-fold cross-validation on the entire dataset. Our model consistently outperforms the state of the art in this regime, improving accuracy by about 1% on average. (A minimal sketch of the cross-validation protocol follows the table.)
| Model architecture | Training data | Test data | Drone Dataset | Cleanup Dataset | Pick Dataset |
|---|---|---|---|---|---|
| RNN | 4/5 golden | 1/5 golden | 87.18 | 95.51 | 93.78 |
| CopyNet | 4/5 golden | 1/5 golden | 88.97 | 95.47 | 93.14 |
| BART-FT-Raw (ours) | 4/5 golden | 1/5 golden | 90.78 | 97.84 | 95.97 |
| BART-FT-Canonical (ours) | 4/5 golden | 1/5 golden | 90.86 | 97.81 | 95.70 |
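For reference, here is a minimal sketch of the five-fold protocol using scikit-learn's `KFold`; the `train` and `evaluate` helpers are hypothetical placeholders for the fine-tuning and exact-match accuracy computation described above, not our actual training code.

```python
from statistics import mean
from sklearn.model_selection import KFold

# Placeholder (command, formula) pairs standing in for a real dataset.
data = [("go to the red room", "F red_room")] * 10

def train(pairs):
    # Hypothetical stand-in for fine-tuning an LLM on (command, formula) pairs.
    return lambda command: "F red_room"

def evaluate(model, pairs):
    # Exact-match accuracy of predicted formulas on held-out pairs.
    return mean(model(c) == f for c, f in pairs)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(data):
    model = train([data[i] for i in train_idx])
    scores.append(evaluate(model, [data[i] for i in test_idx]))
print(f"5-fold mean accuracy: {mean(scores):.2%}")
```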
Demo
Finally, we show how the translated LTL specifications can be used to plan long-horizon, multi-stage tasks on a 12D quadrotor in simulation.
Cite
If you find this work useful, please cite our paper:
TO BE RELEASED