An 81-million-word multi-genre corpus of Arabic books

Andreas Hallberg

2025 · DOI: 10.1016/j.dib.2025.111456
Data in Brief · 1 citation

TLDR

The corpus was originally collected to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of freely licensed texts for training language models.
