Extracting Grammatical Error Corrections from Wikipedia Revision History

TLDR

This paper describes how Wikipedia revision history is extracted and filtered as a resource for grammatical error correction (GEC). To the authors' knowledge, the resulting corpus is the largest publicly available parallel corpus of possibly erroneous sentences and their corrections with error-type labels.

Abstract

This paper describes the process of extracting and filtering Wikipedia revision history as a resource for grammatical error correction (GEC). Edits in Wikipedia revision history vary widely: they include grammatical error corrections, added information, formatting changes, and even vandalism. To extract only GEC-related revisions, we use an automated error annotation toolkit, ERRANT, and extend it to process large amounts of data efficiently in parallel. Using error-type analysis, we then identify GEC-related edits and omit unrelated ones, so that only the corrections are retained. To our knowledge, the resulting corpus is the largest publicly available parallel corpus of possibly erroneous sentences and their corrections with error-type labels.
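The filtering step can be illustrated with a minimal sketch. It assumes that each revision pair has already been annotated with ERRANT-style edits (a token span in the source sentence, a replacement string, and an error-type label). The `Edit` class, the `NON_GEC_TYPES` set, and the `apply_gec_edits` function are hypothetical names introduced here for illustration; the abstract does not state which error types the authors treat as unrelated, so the excluded labels below are an assumption, not the paper's actual criteria.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Edit:
    start: int       # token offset in the source sentence (inclusive)
    end: int         # token offset in the source sentence (exclusive)
    correction: str  # replacement text; empty string means deletion
    type: str        # ERRANT-style label, e.g. "R:VERB:SVA"


# Assumption: labels treated as "not GEC-related" in this sketch only.
NON_GEC_TYPES: Set[str] = {"R:OTHER", "M:OTHER", "U:OTHER", "UNK"}


def apply_gec_edits(
    source_tokens: List[str],
    edits: Iterable[Edit],
    excluded: Set[str] = NON_GEC_TYPES,
) -> List[str]:
    """Rebuild the target sentence from the source, applying only the
    edits whose type is not excluded. Skipped edits leave the source
    tokens untouched, so only the corrections are retained."""
    out: List[str] = []
    cursor = 0
    for e in sorted(edits, key=lambda e: (e.start, e.end)):
        if e.type in excluded:
            continue  # unrelated edit: keep the original tokens
        if e.start < cursor:
            continue  # overlaps an edit already applied: skip it
        out.extend(source_tokens[cursor:e.start])
        if e.correction:
            out.extend(e.correction.split())
        cursor = e.end
    out.extend(source_tokens[cursor:])
    return out


# Hand-labelled example: a revision fixes agreement ("go" -> "goes")
# and also adds new information ("in London").
source = "She go to school .".split()
edits = [
    Edit(1, 2, "goes", "R:VERB:SVA"),
    Edit(4, 4, "in London", "M:OTHER"),
]
filtered_target = apply_gec_edits(source, edits)
print(" ".join(filtered_target))
```

In this example the agreement fix is applied while the added phrase is dropped, so the resulting sentence pair differs only by the grammatical correction.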