Background

In the modern Internet age, the volume of textual data keeps growing. We need a way to condense this data while preserving its information and meaning, and text summarisation addresses that need. Text summarisation is the process of automatically generating natural language summaries from an input document while retaining its important points. It enables fast and easy retrieval of information.

Experiment objective

The primary objective of this experiment is to apply advanced NLP techniques to generate grammatically correct and insightful summaries of pharma research articles. To accomplish this, we will test various publicly available transformer models for sequence-to-sequence (seq-to-seq) modelling and retrain them on the PubMed dataset.

Further, we will explore multiple approaches to inspecting generated summaries and build scalable scoring methods for measuring performance.

Business Use Cases and Applications

There are multiple direct and indirect applications of this experiment. Some of them include:

  • Financial research: Investment banking firms spend large amounts of money acquiring information to drive their decision-making, including automated stock trading. Financial analysts inevitably hit a wall and cannot read everything. Summarisation systems tailored to financial documents such as earnings reports and financial news can help analysts quickly derive market signals from content.
  • Market intelligence: Automated summarisation of key competitor content releases, news tracking, patent research, etc., to drive competitive advantage.
  • Pharma clinical phase intelligence: Scalable summarisation of ongoing research/clinical trials happening in a specific therapy area or domain. At any point there are thousands of such research papers being published and it’s slow and cumbersome for pharma research teams to keep on top of all of it.
  • Newsletters: Many weekly newsletters take the form of an introduction followed by a curated selection of relevant articles. Summarisation would allow organisations to further enrich newsletters with a stream of summaries (versus a list of links), which can be a particularly convenient format on mobile.

Dataset

Although the pre-trained models provided by Google and Facebook do a decent job of generating short summaries for news articles, we aim to transfer that learning to a specific domain. For this, we use the PubMed dataset.

SumPubMed is built from biomedical research papers in the PubMed database. PubMed is a central repository of 26 million citations, covering literature from MEDLINE, life science journals and online books. We took a small subset of the research documents and used it to retrain our model for this summarisation task.

Environment Setup

  • Python: for documentation, exploratory data analysis and preprocessing, using the reticulate package and Python regular expressions.
  • For training: AWS Deep Learning AMI (instance type: ps.2xlarge); given the high compute load that training requires to generate embeddings, we used GPU instances to keep training time manageable.
  • For inference: AWS Deep Learning AMI (instance type: t2.large or g4dn.xlarge); loading the trained model and using it for similarity scoring.
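
The regular-expression preprocessing mentioned in the first item above could look roughly like the following. This is only a minimal sketch: the actual cleaning rules used in the experiment are not described here, so the function name `clean_text` and the three patterns (bracketed citation markers, URLs, repeated whitespace) are illustrative assumptions, not the rules that were actually applied.

```python
import re

# Hypothetical cleaning rules; the real preprocessing steps are not documented.
CITATION = re.compile(r"\s*\[\d+(?:\s*[,-]\s*\d+)*\]")  # e.g. " [12]", " [3, 4]", " [1-3]"
URL = re.compile(r"https?://\S+")                        # bare web links
WHITESPACE = re.compile(r"\s+")                          # runs of spaces, tabs, newlines


def clean_text(text):
    """Strip citation markers and URLs, then collapse whitespace."""
    text = CITATION.sub("", text)
    text = URL.sub("", text)
    return WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    raw = "Aspirin reduces risk [12, 13].\n\nResults   were\tsignificant [1-3]."
    print(clean_text(raw))
```

Compiling the patterns once at module level keeps the per-document cost low when the same rules are applied across many articles.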

Experiment Outcomes

We were able to summarise the PubMed articles in our dataset with a ROUGE-1 score above 0.35 and a ROUGE-L score above 0.3, which is in line with publicly accepted benchmarks.
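
To make the two metrics concrete, here is a minimal pure-Python sketch of how ROUGE-1 and ROUGE-L can be computed for one generated summary against one reference. It assumes simple lowercase alphanumeric tokenisation and reports the F1 variant of each score. The scoring tool and settings actually used in the experiment are not stated, so the function names and choices below are illustrative, and the numbers will not necessarily match those of a standard ROUGE package.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def f1(matches, n_candidate, n_reference):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if matches == 0 or n_candidate == 0 or n_reference == 0:
        return 0.0
    precision = matches / n_candidate
    recall = matches / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_1(candidate, reference):
    """ROUGE-1 F1: unigram overlap, with counts clipped per token."""
    cand, ref = tokenize(candidate), tokenize(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f1(overlap, len(cand), len(ref))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1: based on the longest common subsequence of tokens."""
    cand, ref = tokenize(candidate), tokenize(reference)
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    generated = "the cat sat on the mat"
    reference = "the cat is on the mat"
    print(rouge_1(generated, reference), rouge_l(generated, reference))
```

ROUGE-1 ignores word order, while ROUGE-L rewards tokens that appear in the same order as in the reference, which is why the ROUGE-L score is never higher than the ROUGE-1 score under this formulation.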