Data Augmentation in NLP

Bhuvana Basapur
12 min read · Nov 28, 2021



The most important requirement for developing a good machine learning model is good-quality data. Good quality here means a fairly uniform distribution of data across all varieties. Such data is often scarce, and a technique called “Data Augmentation” is widely used to help alleviate this scarcity.

In simple terms, Data Augmentation is the creation of synthetic data without directly collecting more data. It encompasses methods that increase the amount of data by adding slightly modified copies of existing data or synthetic data newly created from existing data.

Data augmentation is widely applied in computer vision (for example, flipping and rotation) and was later introduced to the NLP field as well.

Categories of DA methods

Many methods for implementing Data Augmentation have been proposed recently, and this article covers a few of them.

One of the main aims of DA is to improve the diversity of the training data. With this in mind, DA methods can be broadly grouped into three categories:

  1. Paraphrasing: Generates paraphrases of the original data as augmented data. It makes very limited changes compared to the original data.
  2. Noising: Adds continuous or discrete noise to the original data and involves more changes.
  3. Sampling: Learns the distribution of the original data and samples new data from it as augmented data. It can generate very diverse data.

Data augmentation techniques include three categories. The examples of the original data and augmented data are on the left and right, respectively.


Paraphrasing Based methods

Paraphrases are alternative expressions that convey the same information as the original. In natural language, paraphrases can be generated at three levels: word level, phrase level and sentence level. Let’s look at some examples of paraphrasing-based DA techniques.

Data augmentation techniques by paraphrasing include three levels: word-level, phrase-level, and sentence-level.


Thesauruses

Some works replace words in the original text with their true synonyms and hypernyms, so as to obtain a new way of expression while keeping the semantics of the original text as unchanged as possible.

Paraphrasing by using thesauruses

WordNet, a well-known lexical database of semantic relations between words, sorts the synonyms of a word according to their similarity. For each sentence, all replaceable words are retrieved and a certain number of them (r) are randomly chosen to be replaced. The value of r can be drawn from a geometric distribution or chosen as a random number. (Note: stop words should not be chosen.) Instead of synonyms, hypernyms can also be used to replace some of the original words. VerbNet, a broad-coverage verb lexicon, can be used to retrieve synonyms, hypernyms and words of the same category.
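The replacement procedure described above can be sketched as follows. This is only a minimal illustration: the `THESAURUS` and `STOP_WORDS` tables are tiny hand-made stand-ins for a real resource such as WordNet, and the function names and the parameter `p` of the geometric distribution are assumptions made for this example.

```python
import random

# Toy stand-in for a thesaurus such as WordNet (hypothetical entries).
THESAURUS = {
    "good": ["fine", "great"],
    "movie": ["film", "picture"],
    "terrible": ["awful", "dreadful"],
    "late": ["tardy", "behind"],
}
# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "is", "a", "an", "was", "and", "of"}


def sample_geometric(p, rng):
    """Draw r >= 1 from a geometric distribution (requires 0 < p <= 1)."""
    r = 1
    while rng.random() > p:
        r += 1
    return r


def synonym_replace(sentence, p=0.5, rng=None):
    """Replace r randomly chosen replaceable words with a synonym."""
    rng = rng or random.Random()
    tokens = sentence.split()
    # Replaceable positions: not a stop word and present in the thesaurus.
    candidates = [
        i for i, tok in enumerate(tokens)
        if tok.lower() not in STOP_WORDS and tok.lower() in THESAURUS
    ]
    if not candidates:
        return sentence
    r = min(sample_geometric(p, rng), len(candidates))
    for i in rng.sample(candidates, r):
        tokens[i] = rng.choice(THESAURUS[tokens[i].lower()])
    return " ".join(tokens)
```

Calling `synonym_replace("the movie was good")` returns a sentence of the same length in which at least one of “movie” and “good” has been swapped for a synonym, while the stop words stay untouched.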

Semantic Embeddings

This method overcomes the limits on replacement range and word part-of-speech in the thesaurus-based method. It uses pre-trained word vectors, such as GloVe, Word2Vec and FastText, and replaces each original word with the word closest to it in the vector space.

Paraphrasing by using semantic embeddings

Instead of using only discrete words, word embeddings, frame embeddings or both can be used to replace words. With word embeddings, each original word in the sentence is replaced with one of its k-nearest-neighbor words under cosine similarity. For example, “Being late is terrible” becomes “Being behind is bad”. With frame-semantic embeddings, the sentences are parsed and a continuous bag-of-frame model, built with Word2Vec, represents each semantic frame. A thesaurus can also be used in conjunction with embeddings for balance.
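To make the k-nearest-neighbor replacement concrete, here is a small sketch. The vectors in `EMBEDDINGS` are made-up three-dimensional values standing in for real pretrained embeddings, and the function names are assumptions for this example.

```python
import math
import random

# Tiny hand-made vectors standing in for pretrained embeddings
# such as GloVe or Word2Vec (hypothetical values).
EMBEDDINGS = {
    "late": [0.9, 0.1, 0.0],
    "behind": [0.8, 0.2, 0.1],
    "tardy": [0.85, 0.15, 0.05],
    "terrible": [0.0, 0.9, 0.3],
    "bad": [0.1, 0.8, 0.4],
    "awful": [0.05, 0.85, 0.35],
    "happy": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, k):
    """Return the k vocabulary words most similar to `word`."""
    target = EMBEDDINGS[word]
    others = [w for w in EMBEDDINGS if w != word]
    others.sort(key=lambda w: cosine(target, EMBEDDINGS[w]), reverse=True)
    return others[:k]


def embedding_replace(sentence, k=2, rng=None):
    """Replace every known word with one of its k nearest neighbors."""
    rng = rng or random.Random()
    out = []
    for token in sentence.split():
        if token.lower() in EMBEDDINGS:
            out.append(rng.choice(nearest_neighbors(token.lower(), k)))
        else:
            out.append(token)  # words without a vector are kept as they are
    return " ".join(out)
```

With these toy vectors, `embedding_replace("Being late is terrible")` can yield “Being behind is bad”, matching the example above. In practice one would usually replace only a subset of the words rather than all of them.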

Language Models

Pre-trained language models have become mainstream in recent years thanks to their excellent performance. Through pre-training, masked language models (MLMs) such as BERT and RoBERTa learn to predict masked words in a text from their context, and this ability can be used for text data augmentation. This method alleviates the problem of ambiguity, since MLMs consider the whole context.

Paraphrasing by using language models

One approach, designed to obtain augmented data for task-specific distillation training, uses the BERT tokenizer to split words into word pieces and forms a candidate set for each word piece. Both word embeddings and masked language models are used for word replacement. Specifically, if a word piece is not a complete word (“est”, for example), its candidate set is made up of its K-nearest-neighbor words in GloVe. If the word piece is a complete word, it is replaced with [MASK] and BERT predicts K words to form the candidate set. Finally, each word piece is replaced by a random word from its candidate set with probability 0.4.
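The control flow of this procedure can be sketched without loading any model. In the sketch below, `mlm_candidates` and `embedding_candidates` are hypothetical callables that the caller supplies (for example, wrappers around a BERT masked-word predictor and a GloVe nearest-neighbor lookup), and the `is_complete` flags are assumed to come from the tokenizer. None of these names are taken from a specific library.

```python
import random


def augment_word_pieces(pieces, is_complete, mlm_candidates,
                        embedding_candidates, p=0.4, rng=None):
    """Replace each word piece with probability p by a random candidate.

    pieces               -- list of word pieces for one sentence
    is_complete          -- parallel list of bools: is the piece a whole word?
    mlm_candidates       -- hypothetical callable (masked_pieces, index) -> list
    embedding_candidates -- hypothetical callable (piece) -> list
    """
    rng = rng or random.Random()
    augmented = []
    for i, piece in enumerate(pieces):
        if is_complete[i]:
            # Complete word: mask it and ask the language model for K words.
            masked = pieces[:i] + ["[MASK]"] + pieces[i + 1:]
            candidates = mlm_candidates(masked, i)
        else:
            # Partial word piece: use embedding nearest neighbors instead.
            candidates = embedding_candidates(piece)
        if candidates and rng.random() < p:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(piece)
    return augmented
```

Keeping the two candidate sources as parameters makes the sketch independent of any particular model and easy to test with simple stand-in functions.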


Rules

This method requires some heuristics about natural language to ensure that sentence semantics are maintained.

Paraphrasing by using rules

One can use regular expressions to change the surface form without changing sentence semantics, for example by converting between the abbreviated and full forms of verbs, modal verbs and negations, such as replacing “is not” with “isn’t”. Replacements can also go in both directions between a group of words and its abbreviation, relying on word-pair dictionaries.

Another way is to change the voice of the sentence by building a dependency tree. For example, “Sally embraced Peter excitedly” can be replaced with “Peter was embraced excitedly by Sally”. A syntactic parser can be used to build a dependency tree for the original sentence. Then the paraphrase generator transforms this dependency tree into a new dependency tree, guided by a transformation grammar. The transformed dependency tree is then used to generate a paraphrase, which serves as the augmented data.
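A regular-expression rule of this kind might look like the sketch below. The `CONTRACTIONS` table is a small illustrative word-pair dictionary made up for this example, not a complete resource.

```python
import re

# Small illustrative word-pair dictionary (expanded form -> contraction).
CONTRACTIONS = {
    "is not": "isn't",
    "do not": "don't",
    "will not": "won't",
    "cannot": "can't",
}
EXPANSIONS = {short: full for full, short in CONTRACTIONS.items()}


def _make_rewriter(mapping):
    """Build a function that rewrites every key of `mapping` to its value."""
    # Longest keys first so longer phrases win over their prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        found = match.group(0)
        new = mapping[found.lower()]
        # Preserve an initial capital letter, e.g. "Do not" -> "Don't".
        return new[0].upper() + new[1:] if found[0].isupper() else new

    return lambda text: pattern.sub(replace, text)


contract = _make_rewriter(CONTRACTIONS)  # "is not" -> "isn't"
expand = _make_rewriter(EXPANSIONS)      # "isn't" -> "is not"
```

Because both directions are built from the same dictionary, `expand(contract(text))` restores the expanded wording for the phrases covered by the table.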

Machine Translation

Translation is a natural means of paraphrasing. With the development of machine translation models and the availability of online APIs, machine translation has become a popular augmentation method in many tasks. It can be further divided into back-translation and unidirectional translation.

Paraphrasing by machine translation

Back-translation: In this method, the original document is translated into another language and then translated back to obtain new text in the original language. Unlike word-level methods, back-translation does not directly replace individual words but rewrites the whole sentence generatively. Besides trained machine translation models, Google’s Cloud Translation API service can also be used as a back-translation tool. Back-translation can also be combined with adversarial training to synthesize diverse and informative augmented examples by organically integrating multiple transformations.

Unidirectional Translation: Unlike back-translation, unidirectional translation translates the original text into another language once, without translating it back to the original language. This method is usually used in multilingual settings. It can be applied to unsupervised Cross-Lingual Word Embedding (CLWE) tasks: an unsupervised machine translation (UMT) model is first trained on the source/target training corpora, and the corpora are then translated with the UMT model. The machine-translated corpus is concatenated with the original corpus, and monolingual word embeddings are learned independently for each language. Finally, the learned monolingual word embeddings are mapped to a shared CLWE space. This method both facilitates the structural similarity of the two monolingual embedding spaces and improves the quality of CLWEs in the unsupervised mapping method.
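The round trip behind back-translation is short enough to sketch directly. The `translate` argument below is a hypothetical callable `(text, source_lang, target_lang) -> text`; in practice it would wrap a trained translation model or an online translation API. The `toy_translate` function and its phrase table are made up purely so the example runs on its own.

```python
def back_translate(text, translate, source="en", pivot="fr"):
    """Translate source -> pivot -> source and return the new text."""
    pivot_text = translate(text, source, pivot)
    return translate(pivot_text, pivot, source)


# Toy phrase table standing in for a real translation system (hypothetical).
TOY_TABLES = {
    ("en", "fr"): {"the film was great": "le film était génial"},
    ("fr", "en"): {"le film était génial": "the movie was awesome"},
}


def toy_translate(text, source, target):
    """Look the sentence up in the toy table; unknown text is unchanged."""
    return TOY_TABLES.get((source, target), {}).get(text, text)


augmented = back_translate("the film was great", toy_translate)
```

Here `augmented` is “the movie was awesome”: the whole sentence is rewritten rather than a single word being replaced, which is the point of the technique.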

Model Generation

Seq2Seq models are employed to generate paraphrases directly. Given proper training objectives, these models output more diverse sentences.

Paraphrasing by model generation

One way to implement this is to feed the de-lexicalised input and a specified diversity rank k (e.g. 1, 2, 3) into the Seq2Seq model to generate a new input. Another is to use a Transformer as the base architecture: the masked original sentences and their label sequences are used to train a model that reconstructs the masked fragment as the augmented data.

Noising Based methods

Whereas paraphrasing tries to keep the semantics of the augmented data as close to the original as possible, noising-based methods add faint noise that does not seriously affect the semantics, so that the augmented data is appropriately different from the original data. Humans can greatly reduce the impact of weak noise on semantic understanding through their mastery of language phenomena and prior knowledge, but this noise may pose challenges to the model. Thus, this method not only expands the amount of training data but also improves model robustness.

The example of noising-based methods


Swapping

The semantics of natural language are sensitive to word order, yet text with slight changes in order is still readable to humans. Therefore, randomly swapping words, or even sentences, within a reasonable range can be used as a Data Augmentation method.

One can choose two words in the sentence and swap their positions and repeat this process n times (n should be proportional to the length of the sentence). Another way is to split the token sequence into segments according to labels, and then randomly choose some segments to shuffle the order of the tokens inside, with the label order unchanged.
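A minimal sketch of the word-swapping variant follows. The `ratio` parameter, which makes the number of swaps proportional to sentence length, and its default of 0.1 are assumptions chosen for illustration.

```python
import random


def random_swap(sentence, n=None, ratio=0.1, rng=None):
    """Swap two randomly chosen words, repeated n times."""
    rng = rng or random.Random()
    tokens = sentence.split()
    if len(tokens) < 2:
        return sentence
    if n is None:
        # Number of swaps proportional to the sentence length (at least one).
        n = max(1, round(ratio * len(tokens)))
    for _ in range(n):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return " ".join(tokens)
```

The output always contains exactly the same words as the input, only in a different order.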


Deletion

This method randomly deletes words in a sentence or sentences in a document. In word-level deletion, each word is deleted with probability p. Similarly, sentences in a document can be randomly deleted.
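Word-level deletion can be sketched in a few lines. Keeping at least one word when everything would otherwise be deleted is an assumption added here so the function never returns an empty sentence.

```python
import random


def random_delete(sentence, p=0.1, rng=None):
    """Delete each word independently with probability p."""
    rng = rng or random.Random()
    tokens = sentence.split()
    if len(tokens) <= 1:
        return sentence
    kept = [tok for tok in tokens if rng.random() >= p]
    if not kept:
        # Assumption: never return an empty sentence, keep one random word.
        kept = [rng.choice(tokens)]
    return " ".join(kept)
```

The same function works for sentence-level deletion if the document is first split into sentences and those are passed in place of words.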


Insertion

This method is similar to deletion, but instead of deleting, it selects random words in a sentence and inserts their synonyms at random positions, repeating the process several times. In sentence-level insertion, random sentences are chosen from other documents with the same classification and added at random positions in the document.
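Word-level insertion could be sketched as below. The `THESAURUS` table is again a tiny hand-made stand-in for a real synonym resource, and the function name is an assumption for this example.

```python
import random

# Toy synonym table standing in for a real thesaurus (hypothetical entries).
THESAURUS = {
    "good": ["fine", "great"],
    "movie": ["film", "picture"],
}


def random_insert(sentence, n=1, rng=None):
    """Insert a synonym of a random word at a random position, n times."""
    rng = rng or random.Random()
    tokens = sentence.split()
    for _ in range(n):
        with_synonyms = [t for t in tokens if t.lower() in THESAURUS]
        if not with_synonyms:
            break  # nothing in the sentence has a known synonym
        word = rng.choice(with_synonyms)
        synonym = rng.choice(THESAURUS[word.lower()])
        position = rng.randint(0, len(tokens))  # may also append at the end
        tokens.insert(position, synonym)
    return " ".join(tokens)
```

Unlike synonym replacement, all original words are kept, so the sentence grows by one word per insertion.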


Substitution

This method randomly replaces words or sentences with other strings. Unlike paraphrasing, it usually avoids strings that are semantically similar to the original data.

It can be implemented by substituting from existing external resources, such as a list of the most common misspellings in English, to generate augmented texts containing misspellings (for example, “across” is easily misspelled as “accross”). Another way is to replace original words with other words from the vocabulary, using the TF-IDF value and the unigram frequency to choose the replacements. A third way is the Multilingual Code-Switching method, which replaces original words in the source language with words from other languages.
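The misspelling variant can be sketched as follows. The `MISSPELLINGS` table is a short illustrative list assembled for this example; a real setup would load a much larger external resource.

```python
import random

# Short illustrative misspelling list (correct word -> common misspellings).
MISSPELLINGS = {
    "across": ["accross"],
    "separate": ["seperate"],
    "definitely": ["definately"],
    "receive": ["recieve"],
}


def misspell(sentence, p=0.3, rng=None):
    """Replace each word that has a known misspelling with probability p."""
    rng = rng or random.Random()
    out = []
    for token in sentence.split():
        options = MISSPELLINGS.get(token.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return " ".join(out)
```

With `p=1.0`, “we walked across the bridge” becomes “we walked accross the bridge”.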


Mixup

This technique is mostly used for classification tasks and works by interpolating between examples of the original dataset. There are two variants: wordMixup, which interpolates samples in word embedding space, and senMixup, which interpolates the hidden states of sentence encoders.
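As a rough sketch of the wordMixup idea, two examples are combined as lam * a + (1 - lam) * b, applied both to the embeddings and to the one-hot labels. The sketch below assumes both sentences have been padded to the same length, and it draws the mixing weight from a Beta(alpha, alpha) distribution, which is a common choice for Mixup but an assumption here. The function names are made up for this example.

```python
import random


def interpolate(a, b, lam):
    """Element-wise lam * a + (1 - lam) * b for two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def word_mixup(emb_a, label_a, emb_b, label_b, lam):
    """Mix two examples in word-embedding space.

    emb_a, emb_b     -- lists of word vectors, padded to the same length
    label_a, label_b -- one-hot (or soft) label vectors
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("sequences must be padded to the same length")
    mixed_emb = [interpolate(u, v, lam) for u, v in zip(emb_a, emb_b)]
    mixed_label = interpolate(label_a, label_b, lam)
    return mixed_emb, mixed_label


def sample_lambda(alpha=1.0, rng=None):
    """Draw the mixing weight from Beta(alpha, alpha) (assumed choice)."""
    rng = rng or random.Random()
    return rng.betavariate(alpha, alpha)
```

senMixup follows the same arithmetic, except that `interpolate` is applied to the sentence encoder's hidden state instead of to each word vector. Because the labels are mixed too, the augmented example carries a soft label that spans both classes.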

Sampling Based methods

Sampling-based methods learn the data distribution and sample novel points within it. Like paraphrasing-based methods, they also involve rules and trained models to generate augmented data. The difference is that sampling-based methods are task-specific and require task information such as labels and data format, whereas paraphrasing-based methods are task-independent and only require the original sentence as input.

Sampling-based models


Rules

This method uses rules to directly generate new augmented data, such as swapping the subject and object of the source sentence or converting predicate verbs into passive form. Heuristics about natural language and the corresponding labels are sometimes required to ensure the validity of the augmented data. An example is turning “This small collection contains 16 El Grecos.” into “16 El Grecos contain this small collection.”

Seq2Seq models

Some methods use non-pretrained models to generate augmented data. Such methods usually rely on the idea of back translation (BT), which is to train a target-to-source Seq2Seq model and use it to generate source sentences from target sentences, i.e., to construct pseudo-parallel sentences. Such a Seq2Seq model learns the internal mapping between the distributions of the target and the source. This differs from the model-generation-based paraphrasing method, whose augmented data shares similar semantics with the original data.

Language models

In recent years, pretrained language models like GPT-2 have been widely used and have been shown to contain knowledge. Thus, they can naturally be used as augmentation tools. Pretrained models like SC-GPT, SC-GPT-NLU and DistilBERT can also be used to generate augmented data, i.e. synthetic sentences. Among all the pretrained models available, GPT-2 is the most used for data augmentation.

Self training

In some scenarios, unlabeled raw data is easy to obtain. Thus, converting such data into valid data would greatly increase the amount of data.

One way is to train and fine-tune a model like BERT on a golden dataset and use the fine-tuned model to generate labels for unlabeled data. This augmented data, combined with the golden dataset, can be used to train another model like SBERT (similar to a student-teacher setup). Another way is to directly transfer existing models from other tasks to generate pseudo-parallel datasets (transfer learning).
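The pseudo-labelling step of self-training can be sketched independently of any particular model. In the sketch below, `teacher` is a hypothetical callable returning a `(label, confidence)` pair, standing in for a model fine-tuned on the golden dataset. The `min_confidence` filter is an extra assumption that is often used in practice but is not part of the description above. The keyword-based `toy_teacher` exists only to make the example runnable.

```python
def pseudo_label(teacher, unlabeled_texts, min_confidence=0.0):
    """Label raw texts with the teacher, keeping confident predictions."""
    labeled = []
    for text in unlabeled_texts:
        label, confidence = teacher(text)
        if confidence >= min_confidence:
            labeled.append((text, label))
    return labeled


def build_student_data(gold, teacher, unlabeled_texts, min_confidence=0.0):
    """Combine the golden dataset with the pseudo-labeled data."""
    return list(gold) + pseudo_label(teacher, unlabeled_texts, min_confidence)


def toy_teacher(text):
    """Keyword-based stand-in for a fine-tuned classifier (hypothetical)."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    words = set(text.lower().split())
    pos, neg = len(words & positive), len(words & negative)
    if pos == neg:
        return "neutral", 0.5
    label = "positive" if pos > neg else "negative"
    return label, max(pos, neg) / (pos + neg)
```

The list returned by `build_student_data` is what the second (student) model would then be trained on.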


Comparing a selection of DA methods by various aspects. Learnable denotes whether the methods involve model training; online and offline denote online learning and offline learning, respectively. Ext.Know refers to whether the methods require external knowledge resources to generate augmented data. Pretrain denotes whether the methods require a pretrained model. Task-related denotes whether the methods consider the label information, task format, and task requirements to generate augmented data. Level denotes the depth and extent to which elements of the instance/data are modified by the DA; t, e, and l denote text, embedding, and label, respectively. Granularity indicates the extent to which the method could augment; w, p, and s denote word, phrase, and sentence, respectively.
  • It is easy to see that nearly all paraphrasing-based and noising-based methods are not learnable, except for Seq2Seq and Mixup. However, most sampling-based methods are learnable, except for the rule-based ones. Learnable methods are usually more complex than non-learnable ones, so sampling-based methods generate more diverse and fluent data than the former two.
  • Among all learnable methods, Mixup is the only online one. That is to say, Mixup generates augmented data during downstream task model training, whereas for the other learnable methods the generation process is independent of that training. Thus, Mixup is the only one that outputs cross-label and discrete embedding from augmented data.
  • Comparing the learnable column and the resource column, we can see that most non-learnable methods require external knowledge resources that go beyond the original dataset and task definition. Commonly used resources include semantic thesauruses like WordNet and PPDB, and handmade resources like misspelling dictionaries and artificial heuristics.
  • Combining the first three columns, we can see that pretrained or non-pretrained models are widely used as DA methods in addition to external resources. This is because the knowledge in pretrained models and the training objectives play a similar role to external resources in guiding augmented data generation.
  • Comparing the learnable column and the task-related column, we could see that in the two categories of paraphrasing and noising, almost all methods are not task-related. They could generate augmented data given only original data without labels or task definition. However, all sampling methods are task-related because they adopt heuristics and model training to satisfy the needs of specific tasks.
  • Comparing the level column and the task-related column, we can see that they are related. The paraphrasing-based methods operate at the text level. So do the noising-based methods, except for Mixup, which changes the embeddings as well as the labels. All sampling-based methods operate at the text and label level, since the labels are also considered and constructed during augmentation.
  • Comparing the learnable column and the granularity column, we could see that almost all non-learnable methods could be used for word-level and phrase-level DA, but all learnable methods could only be applied for sentence-level DA. Although learnable methods generate high-quality augmented sentences, unfortunately, they do not work for document augmentation because of their weaker processing ability for documents. Thus, document augmentation still relies on simple non-learnable methods.