A Comparative Analysis of Generative AI Systems for Document Summarisation

TLDR

This paper compares the summarising performance of four of the most popular GenAI tools (ChatGPT, Copilot, Gemini, and Claude) and finds that their summaries have specific characteristics, such as sentence length or the number of synonyms used, that are connected with the GenAI model on which each tool is based.

Abstract

Generative AI (GenAI) systems can support knowledge workers in managing their knowledge, for example, by effectively processing explicit knowledge through tasks such as document location, classification, integration, and summarisation. Document summarisation is especially useful because it allows users to quickly identify and understand the key information in a written document (a technical report, an academic paper, a user manual, etc.). In other words, effective summarisation distils the essential meaning, ideas, or information from a document. At present, the most widely used GenAI tools all support document summarisation. However, their performance differs because they are based on different Large Language Models. In this paper, we compare the summarising performance of a sample of the most popular GenAI tools, i.e., ChatGPT, Copilot, Gemini, and Claude. Our analysis compares the summaries of six documents (academic and non-academic, in English and Italian) provided by the four tools. The summaries were obtained through a prompt engineering process in which we specified the requirements they had to meet. They were then analysed using quantitative metrics, such as ROUGE and BERTScore, and qualitative criteria. By integrating both types of analysis, we achieved a comprehensive evaluation that reduces subjectivity and examines the summaries across multiple aspects. Our results do not allow us to conclude that any tool is “best-in-class” for summarisation. However, we find that the different summaries have specific characteristics, such as sentence length or the number of synonyms used, that are connected with the GenAI model on which each tool is based. Therefore, our results confirm that a correct understanding of how GenAI tools work is needed to use them consciously, exploit their potential, and mitigate their limitations. The study has some limitations. In particular, we compared only a limited number of tools on a likewise limited number of documents. Moreover, as these tools are constantly evolving, their performance continues to improve over time, so the results reported here should be read as a snapshot.
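
To make the quantitative side of the evaluation more concrete, the following is a minimal sketch of a ROUGE-1 style score, i.e., the unigram overlap between a reference text and a candidate summary. It is an illustration under stated assumptions, not the evaluation pipeline used in this study: the function name `rouge_1`, the lowercase whitespace tokenisation, and the example sentences are all hypothetical, and established ROUGE implementations typically add steps such as stemming.

```python
from collections import Counter


def rouge_1(reference: str, candidate: str) -> dict:
    """Illustrative ROUGE-1: clipped unigram overlap between two texts.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace, which is a simplification of real ROUGE tooling.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    # Counter intersection keeps, for each token, the minimum of its two
    # counts, so a repeated word is only credited as often as it appears
    # in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())

    precision = overlap / len(cand_tokens) if cand_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)

    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example texts, chosen only to show the calculation.
scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

In this example five of the six unigrams overlap, so precision, recall, and F1 are all 5/6. A score of this kind rewards lexical overlap only, which is one reason to pair it with an embedding-based metric such as BERTScore and with qualitative criteria.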