JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models

Abstract

Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks is potentially infinite, and they lack the ability to progressively improve task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans are ultimately dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual in-game survival experiences. JARVIS-1 is the most general Minecraft agent to date, capable of completing over 200 different tasks using a control and observation space similar to that of humans. These tasks range from short-horizon tasks, e.g., "chopping trees", to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well on short-horizon tasks, achieving nearly perfect performance. On the classic long-horizon task ObtainDiamondPickaxe, JARVIS-1 surpasses the reliability of current state-of-the-art agents by a factor of 5 and can successfully complete longer-horizon and more challenging tasks.

Self-Improving JARVIS-1

JARVIS-1 can self-improve under a life-long learning paradigm thanks to its growing multimodal memory, sparking more general intelligence and improved autonomy. Here we show the performance of JARVIS-1 at different learning stages on the same task. (One epoch means that JARVIS-1 has executed every task in the task pool once in the environment, regardless of success or failure.)
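The epoch definition above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the JARVIS-1 implementation: `Experience`, `run_epochs`, `plan_fn`, and `execute_fn` are hypothetical names, and storing every attempt together with its outcome is an assumption about how a growing memory might be kept.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experience:
    """One remembered attempt (hypothetical record layout)."""
    task: str
    plan: List[str]
    success: bool


def run_epochs(
    tasks: List[str],
    plan_fn: Callable[[str, List[Experience]], List[str]],
    execute_fn: Callable[[str, List[str]], bool],
    num_epochs: int,
) -> List[Experience]:
    """Run every task once per epoch and let the memory grow with each attempt."""
    memory: List[Experience] = []
    for _ in range(num_epochs):
        for task in tasks:
            # The planner may consult everything experienced so far.
            plan = plan_fn(task, memory)
            # A task counts toward the epoch whether it succeeds or fails.
            success = execute_fn(task, plan)
            memory.append(Experience(task, plan, success))
    return memory
```

Any planner and executor with these signatures can be plugged in; after `num_epochs` epochs the memory holds one entry per task per epoch.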

Given a Minecraft:iron_axe, smelt and craft Minecraft:shears with an empty inventory.

Epoch 1:

1) mine 3 logs
2) craft 12 planks
3) craft 1 crafting_table
4) craft 4 stick
5) craft 1 wooden_pickaxe
6) mine 3 cobblestone
7) craft 1 stone_pickaxe
8) mine 2 iron_ore
9) smelt 2 iron_ingot
10) craft 1 shears
(Fails: no furnace available for smelting)

Epoch 2:

1) mine 3 logs
2) craft 12 planks
3) craft 1 crafting_table
4) craft 4 sticks
5) craft 1 wooden_pickaxe
6) mine 8 cobblestone
7) craft 1 furnace
8) mine 3 cobblestone
9) craft 1 stone_pickaxe
10) mine 2 iron_ore
11) smelt 2 iron_ingot
12) craft 1 shears
(Sometimes fails for lack of fuel)

Epoch 3:

1) mine 4 logs (One more as fuel)
2) craft 12 planks
3) craft 1 crafting_table
4) craft 4 stick
5) craft 1 wooden_pickaxe
6) mine 11 cobblestone
7) craft 1 furnace
8) craft 1 stone_pickaxe
9) mine 2 iron_ore
10) smelt 2 iron_ingot
11) craft 1 shears
(More accurate and efficient!)
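The difference between the Epoch 1 and Epoch 3 plans can be checked mechanically by simulating the inventory step by step. The sketch below is illustrative only: the recipe table and tool requirements are simplified assumptions based on standard Minecraft recipes, `first_failure` is a hypothetical helper, and fuel is not modeled, so it cannot reproduce the intermittent Epoch 2 failure.

```python
import math
from collections import Counter

# Simplified recipe table (an assumption for illustration, not game data):
# item -> (inputs consumed per run, items produced per run, required station)
RECIPES = {
    "planks": ({"logs": 1}, 4, None),
    "crafting_table": ({"planks": 4}, 1, None),
    "stick": ({"planks": 2}, 4, None),
    "wooden_pickaxe": ({"planks": 3, "stick": 2}, 1, "crafting_table"),
    "stone_pickaxe": ({"cobblestone": 3, "stick": 2}, 1, "crafting_table"),
    "furnace": ({"cobblestone": 8}, 1, "crafting_table"),
    "iron_ingot": ({"iron_ore": 1}, 1, "furnace"),
    "shears": ({"iron_ingot": 2}, 1, None),
}
# Tool assumed necessary to mine each block (None means bare hands suffice).
MINE_TOOL = {"logs": None, "cobblestone": "wooden_pickaxe", "iron_ore": "stone_pickaxe"}


def first_failure(plan):
    """Return (step_number, reason) for the first infeasible step, or None."""
    inv = Counter()
    for i, (verb, count, item) in enumerate(plan, start=1):
        if verb == "mine":
            tool = MINE_TOOL[item]
            if tool and inv[tool] == 0:
                return i, f"missing {tool}"
            inv[item] += count
        else:  # "craft" and "smelt" both consume inputs at a station
            inputs, produced, station = RECIPES[item]
            if station and inv[station] == 0:
                return i, f"missing {station}"
            runs = math.ceil(count / produced)
            for name, need in inputs.items():
                if inv[name] < need * runs:
                    return i, f"not enough {name}"
            for name, need in inputs.items():
                inv[name] -= need * runs
            inv[item] += runs * produced
    return None


EPOCH_1 = [
    ("mine", 3, "logs"), ("craft", 12, "planks"), ("craft", 1, "crafting_table"),
    ("craft", 4, "stick"), ("craft", 1, "wooden_pickaxe"), ("mine", 3, "cobblestone"),
    ("craft", 1, "stone_pickaxe"), ("mine", 2, "iron_ore"),
    ("smelt", 2, "iron_ingot"), ("craft", 1, "shears"),
]
EPOCH_3 = [
    ("mine", 4, "logs"), ("craft", 12, "planks"), ("craft", 1, "crafting_table"),
    ("craft", 4, "stick"), ("craft", 1, "wooden_pickaxe"), ("mine", 11, "cobblestone"),
    ("craft", 1, "furnace"), ("craft", 1, "stone_pickaxe"), ("mine", 2, "iron_ore"),
    ("smelt", 2, "iron_ingot"), ("craft", 1, "shears"),
]

print(first_failure(EPOCH_1))  # (9, 'missing furnace')
print(first_failure(EPOCH_3))  # None
```

Under these assumptions the Epoch 1 plan stops at the smelting step because no furnace was crafted, while the Epoch 3 plan runs to completion.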


Instruction-Following in Diverse Biomes

JARVIS-1 can execute human instructions in diverse environments. We illustrate executions in different biomes below.

Pick up a Minecraft:wooden_pickaxe

Execution in Plains:

Execution in Birch Forest:

Execution in Jungle:

Execution in Savanna:

More Results

Below we share some additional results of JARVIS-1 on Minecraft.

Task groups: Wood, Stone, Iron, Gold, Diamond, Redstone, Blocks, Armor, Decoration, Food