Answer generation quality differs very little between questions that are more similar to a training question and those that are less similar. Language models are trained with the same parameters described for Seq2Seq above, with 6 decoder layers. We did not train with 12 decoder layers, as we found the deeper Transformer model was harder to optimize and achieved worse results compared to a 6-layer language model. overlap between support document and human answer, the abstractive setting is much better at compensating for when the support document has lower relevance. Figure 3 shows an example of generation for the language model and the best Seq2Seq and extractive settings. We model a vocabulary of 52,863 tokens for answer generation.

At generation time, models produce a minimum of 200 words and a maximum of 500 words. from the information extraction problem of obtaining information from long, multi-document input to generating more coherent and accurate paragraph-length answers. Our task blends the inter-dependent challenges of retrieving information, reasoning, and writing long outputs.

  • We find only 19% of multi-task model answers are fully accurate; even if the model output answers the question, it can generate a sentence with an incorrect statement.
  • We evaluate accuracy ourselves with the support document in Figure 4, right.
  • Crowdworkers assessing accuracy do not have the support document.
  • In answer accuracy, there is a large gap between human performance and all models.
  • Similar to crowdworkers, we find 40% of extractive answers to be accurate.
  • The language model is almost never accurate, while the extractive model is slightly more so than the multi-task model.

Demystifying Model Interpretation Using ELI5

Next, we collect web sources for every question to provide relevant information that a system can draw upon when generating an answer. However, early experiments in our setting showed it to be insufficient to cover the wide range of topics present in ELI5 and to address the open-ended nature of the questions. involve long input and multi-sentence generation, but contain much less training data compared to ELI5.

rely less on pre-existing knowledge of the world and use simpler language that is easier to model.

ELI5: Long Form Question Answering

As can be seen from the classification report, the model is 84% accurate. But we want to know how the model is coming to this conclusion. Hence, let’s go ahead and try to use ELI5 to get some answers.
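To make "how the model is coming to this conclusion" concrete, here is a minimal sketch of the idea behind explaining a linear classifier: each feature contributes its weight times its value to the score, and the largest contributions are reported. This is not the ELI5 library's API. The function name `explain_linear_prediction` and the toy weights are hypothetical and purely illustrative.

```python
from typing import Dict, List, Tuple


def explain_linear_prediction(
    weights: Dict[str, float],
    bias: float,
    features: Dict[str, float],
    top: int = 3,
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the linear score and the `top` largest per-feature contributions."""
    # Contribution of a feature = learned weight * feature value.
    contributions = [
        (name, weights.get(name, 0.0) * value) for name, value in features.items()
    ]
    # Rank by absolute size so strong negative evidence is shown as well.
    contributions.sort(key=lambda c: abs(c[1]), reverse=True)
    score = bias + sum(c for _, c in contributions)
    return score, contributions[:top]


# Hypothetical weights of a tiny bag-of-words sentiment model (assumed values).
toy_weights = {"great": 1.5, "boring": -2.0, "movie": 0.1}
toy_features = {"great": 1.0, "movie": 2.0, "boring": 1.0}  # word counts
score, top_features = explain_linear_prediction(toy_weights, 0.2, toy_features)
print(score, top_features)
```

In this toy case the word "boring" has the largest absolute contribution, so it would be listed first as the main driver of the prediction.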

Compared to existing datasets, ELI5 comprises diverse questions requiring multi-sentence answers. We provide a large set of web documents to help answer the question. Automatic and human evaluations show that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline. However, our best model is still far from human performance since raters prefer gold responses in over 86% of cases, leaving ample opportunity for future improvement.

We introduce the first large-scale corpus for long-form question answering, a task requiring elaborate and in-depth answers to open-ended questions. The dataset comprises 270K threads from the Reddit forum “Explain Like I’m Five”, where an online community provides answers to questions that are comprehensible to five-year-olds.

The agreement of at least two of the annotators is almost 100% for all of our evaluated systems. The extractive model outputs human-written text, which is likely fluent but has the failure mode of concatenating unrelated sentences. The multi-task model performs similarly to the extractive model, which indicates that abstractive methods can generate coherent answers. We generate from abstractive models using beam search with a beam size of 5. For the full answer generation task, we tune a minimum and maximum length for generation on the validation set and apply these settings to the test set. We run this algorithm on our support document and on the full set of web sources for each validation and test question, selecting up to 10 sentences with a beam of size 10.
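The decoding setup described above, beam search with a minimum and maximum output length, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `next_log_probs` stands in for a trained model, `EOS` and `toy_model` are hypothetical, and lengths are counted in tokens and kept tiny so the example is easy to follow.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # hypothetical end-of-sequence token


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int = 5,
    min_len: int = 2,
    max_len: int = 6,
) -> List[str]:
    """Return the best-scoring token sequence (EOS excluded) within the length limits."""
    beams: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]  # (tokens, log-prob)
    finished: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                if tok == EOS:
                    # EOS is only allowed once the minimum length is reached.
                    if len(tokens) >= min_len:
                        finished.append((tokens, score + lp))
                    continue
                candidates.append((tokens + (tok,), score + lp))
        if not candidates:
            break
        # Keep only the `beam_size` highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    else:
        # Hypotheses that hit max_len without EOS are cut off and kept.
        finished.extend(beams)
    if not finished:
        return []
    best_tokens, _ = max(finished, key=lambda c: c[1])
    return list(best_tokens)


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    # Hypothetical distribution that strongly prefers to stop right away.
    return {EOS: math.log(0.6), "a": math.log(0.3), "b": math.log(0.1)}


print(beam_search(toy_model, beam_size=5, min_len=2, max_len=4))
```

Because the toy model prefers to stop immediately, the minimum length is what forces a non-empty output here. Raising `min_len` lengthens the result, while `max_len` caps it.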

Each crowdworker assessment is made by 3 different evaluators. The same questions are used for all models and must be at least 5 words long. 94.5% of gold answers fully address the question based on the information in the support document. Table 1 compares ELI5 to related datasets in terms of the length of the question, support document, and answer, as well as statistics on the question types. has answers an order of magnitude longer and more open-ended questions.


We train the multi-task model on 25%, 50%, 75%, and all of the data to compare performance. For a model to perform best, it would have to handle inputs tens of thousands of words long. In Table 3, we show that an oracle computed on the full web sources has much higher ROUGE than an oracle computed on the support document.

Seq2Seq multi-task score by amount of training data. Data size and initial selection.

However, both versions of the language model are still better at FILL-1. These results suggest that the Seq2Seq model is better than the language model at maintaining coherence and that Seq2Seq relies on information over many time steps.
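To illustrate why an oracle with access to more source text can reach higher ROUGE, here is a small sketch. It rests on simplifying assumptions: ROUGE-1 F1 is computed over lowercased whitespace tokens, and sentences are picked greedily rather than with the beam of size 10 mentioned earlier. The functions `rouge1_f` and `greedy_oracle` and the toy sentences are hypothetical and are not the evaluation code behind Table 3.

```python
from collections import Counter
from typing import List


def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between candidate and reference (whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(
    sentences: List[str], reference: str, max_sentences: int = 10
) -> List[str]:
    """Greedily add the sentence that most improves ROUGE-1 F1 against the reference."""
    selected: List[str] = []
    remaining = list(sentences)
    best = 0.0
    while remaining and len(selected) < max_sentences:
        scored = [
            (rouge1_f(" ".join(selected + [s]), reference), s) for s in remaining
        ]
        score, sentence = max(scored, key=lambda pair: pair[0])
        if score <= best:  # stop once no sentence improves the score
            break
        selected.append(sentence)
        remaining.remove(sentence)
        best = score
    return selected


# Toy data (assumed): the "web sources" are a superset of the "support document".
gold = "the sky is blue because air scatters blue light"
support = ["air scatters blue light", "bananas are yellow"]
web = support + ["the sky is blue"]

support_score = rouge1_f(" ".join(greedy_oracle(support, gold)), gold)
web_score = rouge1_f(" ".join(greedy_oracle(web, gold)), gold)
print(support_score, web_score)
```

In this toy case the oracle over the larger pool finds an extra sentence that covers more of the gold answer, so its score is higher than the oracle restricted to the support document.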

