Internships 2020 at Adarga: completed Question Answering project

What does an intern project actually entail at Adarga? Our 2020 interns have left us, but their parting gift was this write-up on the excellent work they did on a new Question Answering feature for the Adarga platform.

During our internship this summer, the intern team took on a project to build new functionality that could be a useful addition to Adarga’s capabilities. The Question Answering team created an end-to-end system for open-domain question answering.

The concept was that, given a well-formatted question, the system would first fetch a relevant document, then extract the most relevant paragraph, and finally find the answer within that paragraph. It would then return the answer to the user along with a reference to where it was found.
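To make the three stages concrete, here is a minimal sketch of such a pipeline. It is not the system we built: the word-overlap scoring, the sentence-level "answer" and every name in it (`answer_question`, `Result`, the toy `corpus`) are assumptions chosen purely for illustration, standing in for the learned retrieval, ranking and extraction models.

```python
import re
from dataclasses import dataclass


def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(question, text):
    """Number of distinct question words that also occur in the text."""
    return len(tokens(question) & tokens(text))


@dataclass
class Result:
    answer: str     # text returned to the user
    paragraph: str  # paragraph the answer was taken from
    doc_id: str     # reference to the source document


def answer_question(question, corpus):
    """corpus maps a document id to a list of paragraphs (hypothetical format)."""
    # Stage 1: fetch the document that shares the most words with the question.
    doc_id = max(corpus, key=lambda d: overlap(question, " ".join(corpus[d])))
    # Stage 2: pick the most relevant paragraph within that document.
    paragraph = max(corpus[doc_id], key=lambda p: overlap(question, p))
    # Stage 3 (toy): return the best-matching sentence instead of a learned span.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph)
    answer = max(sentences, key=lambda s: overlap(question, s))
    return Result(answer, paragraph, doc_id)


corpus = {
    "edinburgh": [
        "Edinburgh is the capital of Scotland. It lies on the Firth of Forth.",
        "The University of Edinburgh was founded in 1583.",
    ],
    "squad": [
        "SQuAD is a reading comprehension dataset built from Wikipedia articles.",
    ],
}
result = answer_question("When was the University of Edinburgh founded?", corpus)
print(result.doc_id, "|", result.answer)
```

The point of the structure is that each stage narrows the search space for the next one, and the returned `Result` carries the reference alongside the answer.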

Throughout the 12-week programme, we made steady progress towards this goal, and successfully built the system. We often exceeded our own expectations for the project and we demonstrated just how advanced the state of the art is in this field. Now, at the end of the internship, it is time to look back at the work that has been done: the challenges, the limitations and what we learnt in the process.

The first step was exploring well-known datasets for open-domain question answering to decide what kind of task should be the testbed for our system. We settled on the Stanford Question Answering Dataset (SQuAD), the most widely used dataset for benchmarking QA systems. The questions in SQuAD are written about Wikipedia articles and are all single-hop, meaning that the answer can be found in a single paragraph as a contiguous span of text.

SQuAD also has a second version, which adds questions that cannot be answered from the source texts, so that roughly a third of all questions are “unanswerable” ones. This is useful because, by providing negative examples, the system can be trained to recognise when it is impossible to provide an answer given the context.

We opted to use the second version of SQuAD because having a system that recognises when a question is answerable is an important feature: users should be given a clear indication of what can and cannot be learned from a set of documents.
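One common way for span-extraction models to decide answerability, sketched below, is to reserve the first token position for "no answer" and compare its score against the best real span. This is an illustrative assumption rather than a description of our system: the function names, the toy scores and the `threshold` value are all hypothetical.

```python
def best_span(start_scores, end_scores, max_len=30):
    """Return (score, start, end) of the best non-null span.

    Position 0 is reserved for the no-answer option, so spans start at 1.
    """
    best = (float("-inf"), 0, 0)
    for i in range(1, len(start_scores)):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start_scores[i] + end_scores[j]
            if score > best[0]:
                best = (score, i, j)
    return best


def predict(tokens, start_scores, end_scores, threshold=0.0):
    """Return the answer text, or None if the question looks unanswerable."""
    null_score = start_scores[0] + end_scores[0]
    score, i, j = best_span(start_scores, end_scores)
    # Abstain when "no answer" beats the best span by more than the threshold.
    if null_score - score > threshold:
        return None
    return " ".join(tokens[i:j + 1])


tokens = ["[CLS]", "founded", "in", "1583"]
answerable = predict(tokens, [0.1, 0.2, 0.3, 2.5], [0.2, 0.1, 0.4, 3.0])
unanswerable = predict(tokens, [4.0, 0.2, 0.3, 0.5], [3.5, 0.1, 0.4, 0.6])
print(answerable, unanswerable)
```

The threshold would normally be tuned on held-out data: raising it makes the system answer more often, lowering it makes the system abstain more readily.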

With the task set, the next step was to experiment with what kind of models would work. From exploring recent Question Answering literature, it became clear that transformer architectures are the current state of the art.

Transformers are designed to process sequential data in a highly parallelised fashion by leveraging attention mechanisms. Attention in this context means that the model does not give every part of the input equal importance but prioritises certain parts. For example, when translating between two languages, the word order may differ between the source and the target, so when a translated word is produced, more weight (or rather attention) should go to the relevant part of the source text than to the word at the current index, and more than an uninformative equal weighting would give it. Attention is combined with some other operations to form transformer blocks, and these blocks are stacked to create very deep neural networks. Transformer models have proven their effectiveness on many tasks in the past two years, which has led to various breakthroughs in NLP.
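The weighting idea can be shown with a small, dependency-free sketch of scaled dot-product attention, the form used in transformers. The vectors below are made-up toy values and the function names are our own; real models use learned projections, multiple heads and batched matrix operations.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[4.0, 0.0]]  # points in the same direction as the first key
outputs, weights = attention(queries, keys, values)
print(weights[0], outputs[0])
```

Because the query aligns with the first key, most of the weight goes to the first value vector rather than being split equally between the two.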

Studying papers helped us make these choices. Nevertheless, the final decisions for the answer extraction part of the pipeline were based on empirical evaluation on SQuAD. The passage-ranking part of our concept is built on similar models, but instead of SQuAD we evaluated it on the ranking task of MS MARCO, a Microsoft dataset put together from a vast number of Bing queries.
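For readers unfamiliar with ranking evaluation, the MS MARCO passage-ranking task is scored with mean reciprocal rank at 10 (MRR@10). The sketch below shows how that metric is computed; the query and passage ids are invented, and the function is an illustration rather than the official evaluation script.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage in the top k.

    rankings: query id -> list of passage ids, best first.
    relevant: query id -> set of relevant passage ids.
    """
    total = 0.0
    for qid, ranked in rankings.items():
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant.get(qid, set()):
                total += 1.0 / rank  # only the first hit counts
                break
    return total / len(rankings) if rankings else 0.0


rankings = {
    "q1": ["p3", "p1", "p2"],  # relevant passage at rank 2
    "q2": ["p5", "p4"],        # relevant passage at rank 1
    "q3": ["p9"],              # relevant passage not retrieved
}
relevant = {"q1": {"p1"}, "q2": {"p5"}, "q3": {"p7"}}
score = mrr_at_k(rankings, relevant)
print(score)
```

A query whose relevant passage is missing from the top k contributes zero, so the metric rewards placing the right passage as high as possible.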

After arduous paper digging, brainstorming and experimentation, we created an ensemble answer extractor that surpassed human-level performance on the SQuAD 2.0 task.
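An ensemble combines the predictions of several models. As a rough illustration, and not necessarily the scheme we used, one simple approach averages each model's score for every candidate answer and keeps the winner. The model outputs below are invented, and using the empty string for "no answer" is an assumption of this sketch.

```python
from collections import defaultdict


def ensemble(predictions_per_model):
    """Average candidate scores across models and return the best candidate.

    Each element maps a candidate answer to that model's score for it;
    the empty string stands for "no answer". A candidate that a model
    does not propose counts as a score of zero for that model.
    """
    totals = defaultdict(float)
    for preds in predictions_per_model:
        for answer, score in preds.items():
            totals[answer] += score
    n = len(predictions_per_model)
    averaged = {answer: total / n for answer, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged


model_a = {"1583": 0.6, "": 0.4}
model_b = {"1583": 0.3, "": 0.7}  # this model alone would abstain
model_c = {"1583": 0.8, "1582": 0.1, "": 0.1}
best, averaged = ensemble([model_a, model_b, model_c])
print(best, averaged)
```

Here one model on its own would have abstained, but the averaged scores still favour the answer supported by the other two.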

Plugging this into Adarga’s end-to-end system yielded a tool that met the original aims with flying colours. Given a question, the system shows not just the answer but also references and documents for further exploration, which is particularly useful when a satisfying answer is not found.

Caveats apply. In its current form the system is mostly adept at handling factual questions, and the formatting of the questions matters a lot as well. Refinements to handle causal reasoning will be developed further; however, the current state of the art does not do very well on datasets that test this type of question answering, so a lot of novel improvements would most likely be needed.

Overall, the project was a great learning experience. For me personally, it solidified my knowledge of the deep learning concepts I encountered during my last year at the University of Edinburgh. My knowledge of transformer architectures advanced greatly, and I am now much more confident about how to use them in production and how they can improve performance on various NLP tasks.

Years of machine learning courses, readings and the like have culminated in a piece of work that is truly impressive, showing me and my teammates that all that studying does lead to applicable knowledge that can solve real-world problems, improve products and make AI more widespread. So far, I have concentrated on the basics and explored different AI domains, but this internship has pushed me further towards specialising in NLP. At Adarga I glimpsed the rapid and exciting advances being made in this corner (or rather room) of deep learning, and language has always been one of my main interests.

It has been a pleasure working with the highly professional and friendly people at Adarga, and we received incredible support throughout the project. I am looking forward to seeing what amazing results will be produced here, and I hope that some of the other interns and I will have the chance to join again in the future to help scale AI, develop products and enhance human ingenuity.

We will be running another intern programme in summer 2021. Sign up to the Adarga newsletter for an update when we open for applications.