IDCA News

4 May 2023

Databricks Offers Open-Sourced Dolly 2.0 as a ChatGPT-type Platform

Dolly is a large language model (LLM) designed to exhibit ChatGPT-like human interactivity, according to its developer, Databricks. The company recently released Dolly 2.0, which it says is the first open-source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for research and commercial use.

"Dolly 2.0 is a 12B parameter language model based on the EleutherAI pythia model family and fine-tuned exclusively on a new, high-quality human generated instruction following dataset, crowdsourced among Databricks employees," several of the company's employees wrote in a recent blog post.

"We are open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties."

Dolly 2.0's developers say they learned from an OpenAI research paper that the original InstructGPT model was trained on a dataset consisting of 13,000 demonstrations of instruction-following behavior, so they set out on a similar effort of their own. The company has more than 5,000 employees, "so we thought we could crowdsource among them to create an even higher quality dataset than the 40 labelers had created for OpenAI."

The team defined seven specific tasks, as outlined in their blog post:

  • Open Q&A: For instance, “Why do people like comedy movies?” or “What is the capital of France?” In some cases, there’s not a correct answer, and in others, it requires drawing on knowledge of the world at large.
  • Closed Q&A: These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, “What is the ratio between protons and neutrons in the nucleus?”
  • Extract information from Wikipedia: Here an annotator would copy a paragraph from Wikipedia and extract entities or other factual information such as weights or measurements from the passage.
  • Summarize information from Wikipedia: For this, annotators provided a passage from Wikipedia and were asked to distill it to a short summary.
  • Brainstorming: This task asked for open-ended ideation and an associated list of possible options. For instance, “What are some fun activities I can do with my friends this weekend?”
  • Classification: For this task, annotators were asked to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review.
  • Creative writing: This task would include things like writing a poem or a love letter.
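Each of these tasks yields a record pairing an instruction with a response, tagged by task type; the closed Q&A, extraction, and summarization tasks also carry a reference passage. The sketch below is illustrative only — the field and category names are assumptions modeled on the seven tasks above, not the published schema of the Dolly dataset:

```python
# Illustrative sketch of a crowdsourced instruction-following record.
# Category names below mirror the seven tasks described in the article;
# they are assumptions, not the dataset's actual labels.

CATEGORIES = {
    "open_qa", "closed_qa", "information_extraction", "summarization",
    "brainstorming", "classification", "creative_writing",
}

# Tasks that only make sense with a passage of reference text attached.
CONTEXT_REQUIRED = {"closed_qa", "information_extraction", "summarization"}

def validate_record(record: dict) -> bool:
    """Check that a record has an instruction, a response, and a known
    category, and that context-based tasks include a reference passage."""
    if not record.get("instruction") or not record.get("response"):
        return False
    if record.get("category") not in CATEGORIES:
        return False
    if record["category"] in CONTEXT_REQUIRED:
        return bool(record.get("context"))
    return True

example = {
    "instruction": "What are some fun activities I can do with my friends this weekend?",
    "context": "",
    "response": "You could go hiking, host a board-game night, or try a new restaurant.",
    "category": "brainstorming",
}
print(validate_record(example))  # True
```

A validation pass like this is one way a crowdsourcing effort can keep 5,000 contributors producing records in a consistent shape.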

They ended up with 15,000 results within a week, then moved on to creating the actual product. They say they're working to maintain an ethical approach, writing, "we believe that the important issues of bias, accountability and AI safety should be addressed by a broad community of diverse stakeholders rather than just a few large companies. Open-sourced datasets and models encourage commentary, research and innovation that will help to ensure everyone benefits from advances in artificial intelligence technology."

"[Furthermore,] as a technical and research artifact, we don't expect Dolly to be state-of-the-art in terms of effectiveness. However, we do expect Dolly and the open source dataset will act as the seed for a multitude of follow-on works, which may serve to bootstrap even more powerful language models."

People interested in Dolly 2.0 can start by visiting the Databricks Hugging Face page.
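Because the weights are published as an ordinary Hugging Face model, Dolly 2.0 can be loaded with the `transformers` library. The snippet below is a minimal sketch, assuming the model ID `databricks/dolly-v2-12b` and the standard `text-generation` pipeline; running it requires a GPU with enough memory for a 12B-parameter model, and the exact loading flags may differ from the model card's current instructions:

```python
def load_dolly(model_id: str = "databricks/dolly-v2-12b"):
    """Build a text-generation pipeline for Dolly 2.0.

    Sketch only: assumes the transformers, torch, and accelerate
    packages are installed and that the model ID above is correct.
    """
    import torch
    from transformers import pipeline

    return pipeline(
        "text-generation",
        model=model_id,
        torch_dtype=torch.bfloat16,   # halve memory use vs. float32
        trust_remote_code=True,       # Dolly ships custom pipeline code
        device_map="auto",            # spread layers across available GPUs
    )

# Usage (downloads the full model weights on first run):
# generator = load_dolly()
# print(generator("Explain the difference between nuclear fission and fusion."))
```

Since the model runs locally, this is exactly the mode of use the blog post highlights: no API fees, and no prompts or data leaving the organization.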