IDCA News
14 Sep 2022
Watching videos helps AI learn how to play Minecraft
OpenAI trained a neural network to play Minecraft by exposing it to hours of unlabeled footage of human play, using only a small amount of labeled contractor data.
After some fine-tuning, the AI research and deployment company reports that the model can craft diamond tools, a task that typically takes a skilled human over 20 minutes (roughly 24,000 actions). Because the model uses the native human interface of keypresses and mouse movements, it is quite general, and it represents a step toward a general computer-using agent.
A spokesperson for the Microsoft-backed company said: "The internet contains many publicly released videos that we can learn from. However, while these videos show what happened, they do not capture precisely how it was achieved, that is, the exact sequence of mouse and keyboard commands used.
"If we would like to build large-scale foundation models in these domains as we've done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where "action labels" are simply the next words in a sentence."
To leverage the masses of unlabeled video data, OpenAI introduced a novel yet simple semi-supervised imitation learning method: Video PreTraining (VPT). The company began by gathering a small dataset from its contractors, recording their gameplay videos together with their actions, namely keypresses and mouse movements. With this data, it trained an inverse dynamics model (IDM) that predicts the action being taken at each step of a video. Crucially, the IDM can use both past and future information to infer each action.
The spokesperson added, "this task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning."
OpenAI said that VPT might pave the way for agents to learn to act by watching the vast number of videos on the internet.
"Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large-scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended, and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g., computer usage." The comment was made by a spokesperson.