Comparison of Subword Tokenization Methods on Natural Language Inference for Filipino
John C. Ramirez, R. Cajote
Abstract
Tokenization plays a key role in any natural language processing (NLP) pipeline, particularly for low-resource and morphologically rich languages such as Filipino. This paper explores the impact of five subword tokenization strategies on the performance of a CNN-BiLSTM model for the Natural Language Inference (NLI) task: Byte Pair Encoding (BPE), WordPiece, SentencePiece-Unigram, and combinations of BPE and WordPiece with Whitespace pretokenization. All tokenizers were trained on the Bantay Wika corpus with a fixed vocabulary size of 32,000, and corresponding fastText embeddings were generated for each tokenized dataset. Experimental results on the NewsPH-NLI dataset demonstrate that tokenizers incorporating whitespace pretokenization significantly outperformed their counterparts, with the Whitespace+BPE tokenizer achieving the highest test accuracy of 86.66%. Qualitative analysis further reveals that SentencePiece-Unigram is better at handling morphological variants and preserving lemmas. These findings highlight the importance of selecting tokenization strategies that align with the linguistic structure of the target language and offer practical guidance for improving Filipino NLP models.
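As an illustration of the pipeline described in the abstract, the sketch below trains a Whitespace+BPE tokenizer (the best-performing configuration reported) and fits subword-level fastText embeddings on the tokenized output. The library choices (Hugging Face tokenizers, gensim's FastText), the corpus file path, the special tokens, and the embedding dimension are assumptions for illustration; the paper specifies only the training corpus (Bantay Wika) and the vocabulary size (32,000).

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from gensim.models import FastText

CORPUS = "bantay_wika.txt"  # placeholder path to the Bantay Wika corpus

# Train a BPE tokenizer with whitespace pretokenization and a fixed
# vocabulary of 32,000, matching the paper's setup. The special tokens
# here are an assumption.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=[CORPUS], trainer=trainer)

# Tokenize the corpus into subword sequences...
with open(CORPUS, encoding="utf-8") as f:
    tokenized = [tokenizer.encode(line.strip()).tokens
                 for line in f if line.strip()]

# ...and train fastText embeddings on the tokenized dataset.
# vector_size=300 is an assumed dimension, not a value from the paper.
embeddings = FastText(sentences=tokenized, vector_size=300,
                      window=5, min_count=1)
embeddings.save("whitespace_bpe_fasttext.model")
```

Under this setup, each of the five tokenizers would be trained the same way (swapping the model and pre-tokenizer), producing one embedding space per tokenized dataset for the downstream CNN-BiLSTM classifier.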
