Toward Mitigating Adversarial Texts

TLDR

This paper proposes a defense against black-box adversarial attacks: a spell-checking system that uses frequency and contextual information to correct nonword misspellings. The system outperforms six publicly available, state-of-the-art spelling correction tools in average correction accuracy.

Abstract

Neural networks are frequently used for text classification, but they can be vulnerable to misclassification caused by adversarial examples: inputs produced by introducing small perturbations that cause the neural network to output an incorrect classification. Previous attempts to generate black-box adversarial texts have included variations of nonword misspellings, natural noise, synthetic noise, and lexical substitutions. This paper proposes a defense against black-box adversarial attacks using a spell-checking system that uses frequency and contextual information to correct nonword misspellings. The proposed defense is evaluated on the Yelp Reviews Polarity and Yelp Reviews Full datasets using adversarial texts generated by a variety of recent attacks. After detecting and recovering the adversarial texts, the proposed defense increases classification accuracy by an average of 26.56% on the Yelp Reviews Polarity dataset and 16.27% on the Yelp Reviews Full dataset. The approach further outperforms six publicly available, state-of-the-art spelling correction tools by at least 25.56% in average correction accuracy.
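
To make the idea of combining frequency and contextual information concrete, the following is a minimal illustrative sketch, not the system evaluated in the paper. It assumes a toy corpus, candidate generation by single-character edits, and a ranking that prefers bigram counts with the preceding word and falls back to unigram frequency. The names `edits1`, `build_counts`, and `correct_text` are hypothetical and chosen only for this example.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one edit (delete, transpose, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def build_counts(corpus_tokens):
    """Count unigrams (frequency) and adjacent-word bigrams (context)."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return unigrams, bigrams


def correct_text(tokens, unigrams, bigrams):
    """Replace each out-of-vocabulary token with its best in-vocabulary candidate."""
    corrected = []
    for tok in tokens:
        if tok in unigrams:  # known word: not a nonword misspelling
            corrected.append(tok)
            continue
        candidates = [c for c in edits1(tok) if c in unigrams]
        if not candidates:  # nothing within one edit: leave the token unchanged
            corrected.append(tok)
            continue
        prev = corrected[-1] if corrected else None
        # Assumed ranking: context (bigram count) first, then frequency,
        # then the candidate string itself as a deterministic tie-breaker.
        best = max(candidates, key=lambda c: (bigrams[(prev, c)], unigrams[c], c))
        corrected.append(best)
    return corrected


# Toy corpus, purely for illustration.
corpus = "the food was good the mood was bad i like good food".split()
unigrams, bigrams = build_counts(corpus)

print(correct_text("i like xood".split(), unigrams, bigrams))
print(correct_text("good xood".split(), unigrams, bigrams))
```

In this toy setup the nonword "xood" is one edit away from "food", "good", and "mood". Frequency alone cannot separate "food" from "good", but the preceding word can: after "like" the sketch picks "good", and after "good" it picks "food". A defense of the kind described above would apply such a correction step to the input text before passing it to the classifier.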