How long should you spend obtaining data?
- Define the data requirements: Decide what data the project needs, including the definitions of the target variable (y) and the input variables (x).
- Time investment: Weigh the time spent obtaining data against the iterative nature of machine learning. Enter the iteration loop quickly rather than spending a long time on data collection up front.
- Start with a small amount of data: Rather than spending a long time collecting a large dataset, train an initial model on a small one and conduct error analysis. The results show whether more data is needed.
- Inventory data sources: List the candidate sources, such as data you already own, crowdsourcing platforms, purchased data, and data labeling services, and estimate the cost and time required for each.
- Data quality and constraints: Factor in data quality, privacy, and regulatory constraints when selecting data sources and labeling options.
- Labeling options: Choose the labeling method that suits the application. Options include in-house labeling, outsourcing to specialized companies, and crowdsourcing.
- Gradual dataset expansion: Grow the dataset gradually, preferably by no more than 10x at a time. After each expansion, retrain the model, conduct error analysis, and reassess whether further data is needed.
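
The advice to start small and expand gradually can be sketched as a simple loop. This is only an illustration under stated assumptions, not a prescribed procedure: the names `next_size`, `expand_until_good_enough`, and `toy_train_and_evaluate` are hypothetical, the error target is an arbitrary example value, and the toy evaluation function merely stands in for real training followed by error analysis.

```python
from typing import Callable, List, Tuple

MAX_GROWTH = 10  # the "no more than 10x at a time" guideline from the notes


def next_size(current: int, available: int, growth: int = MAX_GROWTH) -> int:
    """Return the next dataset size: at most `growth` times the current one."""
    return min(current * growth, available)


def expand_until_good_enough(
    start_size: int,
    available: int,
    train_and_evaluate: Callable[[int], float],
    target_error: float,
) -> List[Tuple[int, float]]:
    """Train, check the error, and grow the dataset only while more data is needed."""
    if start_size < 1:
        raise ValueError("start_size must be at least 1")
    history: List[Tuple[int, float]] = []
    size = min(start_size, available)
    while True:
        # Train on `size` examples, then run error analysis (abstracted as one number).
        error = train_and_evaluate(size)
        history.append((size, error))
        # Stop once the model is good enough or no more data can be obtained.
        if error <= target_error or size >= available:
            break
        size = next_size(size, available)
    return history


# Hypothetical stand-in for real training: error shrinks as the dataset grows.
def toy_train_and_evaluate(n_examples: int) -> float:
    return 1.0 / (n_examples ** 0.5)


if __name__ == "__main__":
    steps = expand_until_good_enough(100, 1_000_000, toy_train_and_evaluate, 0.02)
    for size, error in steps:
        print(f"{size:>9} examples -> error {error:.4f}")
```

In a real project the single error number would be replaced by a proper error analysis, which may point to collecting a specific kind of data rather than simply more of it.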