
Doctopus: Budget-Aware Structural Table Extraction from Unstructured Documents

Chengliang Chai, Jiajun Li, 4 Authors, Lei Cao

2025 · DOI: 10.14778/3749646.3749647
Proceedings of the VLDB Endowment

TLDR

This paper presents Doctopus, a system for accurate attribute extraction from unstructured documents under a user-specified cost constraint. It combines LLMs with non-LLM strategies to achieve a good tradeoff between quality and cost.

Abstract

To realize the potential value of unstructured documents, it is critical to extract structural data (e.g., attributes) from them, which can benefit applications such as analytical SQL queries and decision-making. Multiple strategies, such as pre-trained language models (PLMs), can be employed for this task. However, these methods often struggle to achieve high-quality results, particularly when attribute extraction requires intricate reasoning or semantic comprehension. Recently, large language models (LLMs) have proven effective at extracting attributes, but they incur substantial costs from token consumption, making them impractical for large-scale document sets.

To best trade off quality and cost, we present Doctopus, a system designed for accurate attribute extraction from unstructured documents under a user-specified cost constraint. Overall, Doctopus combines LLMs with non-LLM strategies to achieve a good tradeoff. First, the system employs an index-based approach to efficiently identify and process only the relevant text chunks, thereby reducing the LLM cost. Next, it estimates the quality of multiple strategies for each attribute. Finally, based on the cost and estimated quality, Doctopus dynamically selects the optimal strategies through budget-aware optimization. We have built a comprehensive benchmark comprising 4 document sets with various characteristics, with ground truth manually labeled over 1,000 human hours. Extensive experiments on the benchmark show that, compared with state-of-the-art baselines, Doctopus can improve quality by 11% under the same cost constraint.
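The final step, choosing one strategy per attribute so that total estimated quality is maximized within a budget, can be illustrated with a small sketch. This is not the optimization used by Doctopus, which the abstract does not specify. It is a standard multiple-choice knapsack dynamic program, written under the assumptions that costs are integers in some hypothetical token unit, that quality estimates are already available, and that total quality is the sum over attributes. The function name `select_strategies`, the strategy names (`regex`, `plm`, `llm`), and all numbers are invented for illustration.

```python
from typing import Dict, List, Tuple

# (strategy name, integer cost in hypothetical token units, estimated quality in [0, 1])
Strategy = Tuple[str, int, float]


def select_strategies(
    options: Dict[str, List[Strategy]], budget: int
) -> Tuple[float, Dict[str, str]]:
    """Pick exactly one strategy per attribute, maximizing the summed
    estimated quality subject to total cost <= budget.

    Returns (total quality, {attribute: strategy name}). If no assignment
    fits the budget, returns (-inf, {}).
    """
    neg_inf = float("-inf")
    # best[b] = (best quality with total cost exactly b, chosen strategies)
    best: List[Tuple[float, Dict[str, str]]] = [(neg_inf, {})] * (budget + 1)
    best[0] = (0.0, {})

    for attr, strategies in options.items():
        nxt: List[Tuple[float, Dict[str, str]]] = [(neg_inf, {})] * (budget + 1)
        for spent, (quality, chosen) in enumerate(best):
            if quality == neg_inf:
                continue  # this spend level is unreachable so far
            for name, cost, est_quality in strategies:
                total = spent + cost
                if total <= budget and quality + est_quality > nxt[total][0]:
                    nxt[total] = (quality + est_quality, {**chosen, attr: name})
        best = nxt

    return max(best, key=lambda entry: entry[0])


# Illustrative inputs only: two attributes, each with a cheap and an LLM option.
example_options = {
    "date": [("regex", 1, 0.9), ("llm", 10, 0.95)],
    "verdict": [("plm", 2, 0.6), ("llm", 10, 0.9)],
}
quality, plan = select_strategies(example_options, budget=12)
print(plan, quality)
```

With a budget of 12, the sketch spends the LLM cost on the attribute where it helps most (`verdict`) and keeps the cheap strategy for `date`, which mirrors the idea of mixing LLM and non-LLM strategies under a cost constraint.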
