While natural language processing (NLP) models have picked up pace in recent years, they are not quite there yet. Now, a team of researchers at MIT has created a framework that illustrates one of their remaining weaknesses.
The program, dubbed TextFooler, attacks NLP models by changing specific parts of a given sentence to "fool" them into making the wrong predictions. It targets two principal tasks, text classification and entailment, and perturbs the input text so that the target model's classification or entailment judgment no longer holds.
Jargon aside, the program swaps the most important words in a given input with synonyms to change how the model interprets the sentence as a whole. To a human reader the swapped words look ordinary and the sentence keeps essentially the same meaning, yet they lead the targeted models to interpret it differently (a minimal sketch of this procedure follows the example below).
One example is as follows:
- Input: The characters, cast in impossibly contrived situations, are totally estranged from reality.
- Output: The characters, cast in impossibly engineered circumstances, are fully estranged from reality.
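The greedy word-swap idea behind this is easy to sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `predict` stands in for any black-box classifier that returns class probabilities, and `SYNONYMS` is a toy lookup table in place of the embedding-based synonym search the real attack uses.

```python
from typing import Callable, Dict, List, Sequence

# Toy synonym table; the real attack derives candidates from word embeddings.
SYNONYMS: Dict[str, List[str]] = {
    "contrived": ["engineered", "fabricated"],
    "totally": ["fully", "utterly"],
    "situations": ["circumstances"],
}


def word_importance(words: Sequence[str], label: int,
                    predict: Callable[[str], List[float]]) -> List[float]:
    """Score each word by how much deleting it lowers the model's
    confidence in the original label."""
    base = predict(" ".join(words))[label]
    scores = []
    for i in range(len(words)):
        reduced = list(words[:i]) + list(words[i + 1:])
        scores.append(base - predict(" ".join(reduced))[label])
    return scores


def attack(sentence: str, predict: Callable[[str], List[float]]) -> str:
    """Greedily replace the most important words with synonyms until the
    predicted label flips (or the candidates run out)."""
    words = sentence.split()
    probs = predict(sentence)
    label = max(range(len(probs)), key=probs.__getitem__)

    # Rank positions by importance, most important first.
    scores = word_importance(words, label, predict)
    order = sorted(range(len(words)), key=scores.__getitem__, reverse=True)

    for i in order:
        for candidate in SYNONYMS.get(words[i].lower(), []):
            trial = words[:i] + [candidate] + words[i + 1:]
            trial_probs = predict(" ".join(trial))
            new_label = max(range(len(trial_probs)), key=trial_probs.__getitem__)
            if new_label != label:
                return " ".join(trial)  # prediction flipped: attack succeeded
            if trial_probs[label] < probs[label]:
                words, probs = trial, trial_probs  # keep the best perturbation
    return " ".join(words)
```

With a real model, `predict` would wrap the classifier's inference call, and the synonym lookup would come from nearest neighbours in a word-embedding space rather than a hand-written table.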
To evaluate TextFooler, the researchers used three criteria. First, whether the altered text changed the model's prediction for classification or entailment. Second, whether it seemed equivalent in meaning to the original example to a human reader. And third, whether the output text looked natural enough.
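Those three criteria can be expressed as a simple acceptance check. The sketch below is only illustrative: `embed` is a hypothetical sentence-encoder callable, `is_fluent` a hypothetical naturalness check, and the 0.8 similarity threshold is an assumed value, not one taken from the paper.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def accept_adversarial_example(
    original: str,
    adversarial: str,
    predict: Callable[[str], int],        # returns a class label
    embed: Callable[[str], List[float]],  # hypothetical sentence encoder
    is_fluent: Callable[[str], bool],     # hypothetical naturalness check
    min_similarity: float = 0.8,          # illustrative threshold
) -> bool:
    """Apply the three criteria: the prediction must change, the meaning must
    stay close to the original, and the text must still read naturally."""
    flipped = predict(adversarial) != predict(original)
    similar = cosine(embed(original), embed(adversarial)) >= min_similarity
    return flipped and similar and is_fluent(adversarial)
```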
The framework successfully attacked three well-known NLP models, including BERT. Interestingly, by changing only 10 percent of the words in a given input, TextFooler cut the accuracy of models that normally score above 90 percent to below 20 percent.
All in all, the team behind TextFooler said the research was undertaken to expose the vulnerabilities of current NLP systems so that they can be made more secure and robust in the future. They also hope that TextFooler will help current and upcoming models generalize to new, unseen data. The researchers plan to present their work at the AAAI Conference on Artificial Intelligence in New York.