Extracting Grammatical Error Corrections from Wikipedia Revision History
Jhih-Jie Chen, Yiquan Wu, (3 more authors), Jason J. S. Chang
TLDR
The process of extracting and filtering Wikipedia revision history as a resource for grammatical error correction (GEC) is described. The resulting corpus is, to the authors' knowledge, the largest publicly available parallel corpus of possibly erroneous sentences paired with their corrections and annotated with error-type labels.
Abstract
This paper describes the process of extracting and filtering Wikipedia revision history as a resource for grammatical error correction (GEC). Edits in Wikipedia revision history vary widely and include grammatical error corrections, added information, formatting changes, and even vandalism. To extract only GEC-related revisions, we use an automated error annotation toolkit, ERRANT, and extend it to process large amounts of data efficiently in parallel. With error-type analysis, we can then identify GEC-related edits and omit unrelated ones, so that only the correction parts of a revision are retained. The resulting corpus is, to our knowledge, the largest publicly available parallel corpus of possibly erroneous sentences paired with their corrections and annotated with error-type labels.
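
The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Edit record, the UNRELATED_TYPES set, and both helper functions are hypothetical names introduced here, and the error-type labels in the example are assigned by hand rather than produced by ERRANT. The sketch assumes each edit between an original and a revised sentence already carries a token span and an error-type label, and it shows how keeping only GEC-related edits yields a target sentence that contains the grammatical correction but not the unrelated change.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Edit:
    """One token-level edit. Field names are illustrative, not ERRANT's API."""
    o_start: int  # start token index in the original sentence
    o_end: int    # end token index (exclusive) in the original sentence
    c_str: str    # replacement text ("" for a pure deletion)
    type: str     # error-type label, e.g. "R:VERB:SVA"


# Assumption: labels treated as unrelated to GEC. The paper's actual
# selection criteria are not stated in the abstract.
UNRELATED_TYPES: Set[str] = {"R:OTHER", "M:OTHER", "U:OTHER", "UNK"}


def keep_gec_edits(edits: List[Edit]) -> List[Edit]:
    """Drop edits whose label is in the assumed 'unrelated' set."""
    return [e for e in edits if e.type not in UNRELATED_TYPES]


def apply_edits(orig_tokens: List[str], edits: List[Edit]) -> List[str]:
    """Apply non-overlapping edits to the original tokens.

    Edits are applied from right to left so that the token indices of
    edits further to the left remain valid.
    """
    out = list(orig_tokens)
    for e in sorted(edits, key=lambda e: e.o_start, reverse=True):
        out[e.o_start:e.o_end] = e.c_str.split()
    return out


# Original sentence and the edits a reviser made (labels hand-assigned).
tokens = "He go to school in 1990 .".split()
edits = [
    Edit(1, 2, "goes", "R:VERB:SVA"),  # grammatical: subject-verb agreement
    Edit(5, 6, "1992", "R:OTHER"),     # factual change, not a grammar fix
]

kept = keep_gec_edits(edits)
print(" ".join(apply_edits(tokens, kept)))  # He goes to school in 1990 .
```

Applying only the kept edits leaves the year unchanged, so the resulting sentence pair differs from the original only by the grammatical correction. Because each sentence pair is handled independently, a step like this could be distributed across worker processes, which is one plausible reading of the parallel processing mentioned above.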
