An 81-million-word multi-genre corpus of Arabic books
Andreas Hallberg
2025 · DOI: 10.1016/j.dib.2025.111456
Data in Brief · 1 Citation
TLDR
The corpus was originally collected to investigate variation in the use of vowel diacritics across genres, but it is also suitable for other linguistic inquiries, especially those relating to genre, and as a source of texts published under free licenses for training language models.
