Data augmentation is a valuable technique for increasing the amount of data available for training machine learning models, particularly for unstructured data like images, audio, and text.
However, when applying data augmentation, it is important to make thoughtful decisions. Here are some best practices to consider:
- Goal of Data Augmentation: The purpose of data augmentation is to generate examples that challenge the learning algorithm while still being recognizable by humans or a baseline algorithm. The examples should be difficult enough to provide learning opportunities but not so distorted that they become uninterpretable.
- Designing Data Augmentation: It is crucial to consider the specific problem and the desired outcome. For example, in speech recognition, adding background noise such as cafe noise can be an effective form of data augmentation. The choice of background noise type and its loudness relative to the speech should be decided systematically rather than by trial and error; a minimal mixing sketch appears after this list.
- Evaluating Data Augmentation: Instead of relying on time-consuming training iterations to evaluate the effectiveness of data augmentation, a more efficient approach is to screen the generated data against a checklist: the examples are realistic, the mapping between inputs and outputs is clear (humans can still label them accurately), and the algorithm currently performs poorly on them. Meeting these criteria increases the likelihood that the augmentation leads to performance improvements; a screening sketch also follows this list.
- Data Iteration Loop: In some cases, a data-centric approach can be more effective than a model-centric approach. By iterating through a data iteration loop, which involves continuously improving the quality and diversity of the data and performing error analysis, it is possible to achieve faster improvements in the learning algorithm’s performance. A skeleton of this loop is sketched below.
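
To make the speech example concrete, here is a minimal NumPy sketch that mixes a noise clip into a speech clip at a chosen signal-to-noise ratio (SNR), so loudness can be swept systematically. The file names, the SNR values, and the use of the `soundfile` library for I/O are illustrative assumptions, not details from the original notes.

```python
import numpy as np
import soundfile as sf  # assumed I/O library; any WAV reader would do

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay noise on speech (both mono, same sample rate) at a target SNR."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose a scale so that 10 * log10(speech_power / (scale**2 * noise_power)) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    mixed = speech + scale * noise
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed  # avoid clipping on write

speech, sr = sf.read("speech.wav")       # hypothetical input files
noise, _ = sf.read("cafe_noise.wav")
for snr_db in (20, 10, 5):               # sweep loudness levels systematically
    sf.write(f"speech_cafe_snr{snr_db}.wav", mix_at_snr(speech, noise, snr_db), sr)
```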
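
The evaluation checklist can likewise be drafted as a quick screening function, run before committing to a full training cycle. This is only a sketch of the idea: the error-rate arguments stand in for whatever evaluation routine the project already has, and the thresholds are illustrative assumptions.

```python
def passes_augmentation_checklist(model_error_on_new: float,
                                  human_error_on_new: float,
                                  labels_verified: bool) -> bool:
    """Screen a batch of augmented data against the three checklist criteria."""
    # Realistic examples with a clear input-to-output mapping: humans can
    # still label them accurately (proxied here by a low human error rate).
    humans_can_label = labels_verified and human_error_on_new < 0.05
    # The algorithm does poorly on the new data, so there is room to learn.
    model_struggles = model_error_on_new > human_error_on_new + 0.05
    return humans_can_label and model_struggles

# Example: humans reach ~2% error on the noisy clips while the model is at 15%,
# so the mapping is clear and the model has headroom; worth adding the data.
print(passes_augmentation_checklist(0.15, 0.02, labels_verified=True))  # True
```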
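
Finally, the data iteration loop itself can be summarized as a skeleton. The callables below (`train`, `error_analysis`, `augment`) are hypothetical stand-ins for project-specific code, shown only to make the loop's shape concrete; nothing here is a specific library API.

```python
from typing import Callable

def data_iteration_loop(train_set: list, dev_set: list,
                        train: Callable, error_analysis: Callable, augment: Callable,
                        target_error: float, max_rounds: int = 10):
    """Data-centric loop: hold the model/code fixed and iterate on the data."""
    model = train(train_set)
    for _ in range(max_rounds):
        # Error analysis returns the overall dev error plus per-tag errors,
        # e.g. {"cafe noise": 0.12, "car noise": 0.04}.
        error, per_tag_error = error_analysis(model, dev_set)
        if error <= target_error:
            break
        worst_tag = max(per_tag_error, key=per_tag_error.get)
        # Improve the data, not the architecture: add targeted examples, retrain.
        train_set = train_set + augment(train_set, worst_tag)
        model = train(train_set)
    return model
```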
Can Adding Data Hurt?
Adding data to a training set can raise concerns about hurting the learning algorithm’s performance, especially when the training set distribution comes to differ from the dev and test set distributions. However, for unstructured data problems, adding accurately labeled data rarely harms accuracy, provided two conditions hold:
- Model Capacity: If the model is sufficiently large with low bias, such as a large neural network, it has the capacity to perform well on both the added data and the existing data, even when their distributions differ.
- Mapping Clarity: If the mapping from input to output is clear, meaning humans can accurately predict the output given the input, adding data through data augmentation does not typically harm performance. The rare exception is an ambiguous mapping: for instance, a cropped house-number image that could plausibly be read as either the digit 1 or the capital letter I, where adding many examples of one interpretation can skew predictions on the other.
While this corner case exists, it is uncommon for data augmentation or adding more data to harm the learning algorithm’s performance. Generally, as long as the model is sufficiently large, diverse data sources can be beneficial. Understanding these considerations should provide confidence in using data augmentation or collecting more data, even if it alters the training set distribution compared to the dev and test sets.