Accelerating Inference in Retrieval-Augmented Generation Models for Long-Form Question Answering via Dynamic Token Pruning

Wooseok Kim, Gyunyeop Kim, Sangwoo Kang

2025 · DOI: 10.3390/math13142231
Mathematics · 0 Citations

TLDR

This paper proposes a novel dynamic token pruning mechanism to alleviate the computational bottleneck of the FiD decoder; the mechanism selectively identifies and removes tokens predicted to contribute little to answer generation by jointly considering their contextual information and attention scores within the FiD encoder.

Abstract

Fusion-in-Decoder (FiD), a prominent retrieval-augmented generation model, has demonstrated outstanding performance in open-domain question answering by effectively leveraging multiple passages. However, processing multiple passages significantly increases computational costs at both the encoder and decoder. In particular, in Long-Form Question Answering (LFQA) scenarios, the decoder’s cross-attention computation scales proportionally with the length of the generated answer, severely impacting overall inference speed. In this paper, we propose a novel dynamic token pruning mechanism to alleviate the computational bottleneck of the FiD decoder. Our method selectively identifies and removes tokens predicted to contribute little to answer generation by jointly considering their contextual information and attention scores within the FiD encoder. The resulting pruned representations are then passed to the decoder, significantly reducing cross-attention computation and thereby accelerating inference. Experimental evaluations on two LFQA benchmarks, ASQA and CLAPNQ, demonstrate that the proposed method achieves up to a 1.74-fold speed-up with only minimal degradation in answer quality, effectively improving computational efficiency over the original FiD model.
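The core idea described in the abstract, scoring the encoder's output tokens and passing only the high-scoring subset to the decoder's cross-attention, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `prune_encoder_tokens`, the use of a precomputed per-token importance score, and the fixed keep ratio are all assumptions for demonstration.

```python
import numpy as np

def prune_encoder_tokens(encoder_states, token_scores, keep_ratio=0.5):
    """Illustrative sketch of attention-based token pruning.

    encoder_states: (num_tokens, hidden_dim) encoder output representations.
    token_scores:   (num_tokens,) per-token importance, e.g. attention mass
                    a token receives, aggregated over heads/layers (how the
                    paper combines context and attention is not shown here).
    keep_ratio:     fraction of tokens forwarded to the decoder.

    Returns the pruned states and the kept token indices (original order
    preserved, so positional structure survives for cross-attention).
    """
    num_tokens = encoder_states.shape[0]
    k = max(1, int(num_tokens * keep_ratio))
    # Take the k highest-scoring tokens, then re-sort by position.
    keep = np.sort(np.argsort(token_scores)[-k:])
    return encoder_states[keep], keep
```

Because decoder cross-attention cost is linear in the number of encoder tokens, keeping a fraction `keep_ratio` of them reduces that term of the per-step cost roughly proportionally, which is where the reported speed-up would come from.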
