By Sandeep Uttamchandani, Ph.D., both Product/Software Builder (VP of Engineering) and leader of Enterprise Data/AI initiatives (CDO)
Let’s start with the obvious: ML models can only be as good as the datasets used to build them! While a lot of attention is paid to building models and choosing algorithms, teams often don’t pay enough attention to selecting datasets.
In my experience, investing time in data set selection in advance saves endless hours later during model debugging and production deployment.
Nine deadly sins of ML dataset selection
1. Not handling outliers in the data correctly
Depending on the ML model being built, outliers can be either negligible noise or important signal. Outliers caused by collection errors should be discarded. ML algorithms also differ in their sensitivity to outliers: AdaBoost is more sensitive to outliers than XGBoost, which in turn is more sensitive than a decision tree, which simply counts an outlier as a misclassification. Proper handling of outliers requires understanding whether they can be ignored, as well as choosing an appropriate algorithm based on its sensitivity.
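As a minimal sketch (my own helper, not from the article), the classic 1.5×IQR rule can flag candidate outliers so you can decide whether each one is a collection error to drop or real signal to keep:

```python
# Hypothetical helper: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
import statistics

def iqr_outliers(values):
    """Return the values falling outside the 1.5*IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 looks like a collection error
print(iqr_outliers(data))  # [95]
```

Flagged values still need a human judgment call: a sensor glitch gets dropped, while a genuine rare event may be exactly what the model should learn.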
2. Using normalization instead of standardization when scaling feature values
To bring features onto the same scale, use normalization (MinMaxScaler) when the data is uniformly distributed, and standardization (StandardScaler) when the feature is approximately Gaussian. Before using a dataset, verify that it is IID, stationary (does not change over time), and identically distributed across training and test splits. Seasonality, which violates classical stationarity, is often overlooked.
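The two scaling options boil down to simple formulas; this sketch implements them in plain Python (scikit-learn's MinMaxScaler and StandardScaler do the equivalent over arrays, with fit/transform state):

```python
import statistics

def min_max_scale(xs):
    """Normalization: map values into [0, 1], good for uniform data."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardization: zero mean, unit variance, good for Gaussian data."""
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

print(min_max_scale([1, 2, 3]))  # [0.0, 0.5, 1.0]
```

Note that both learn parameters (min/max or mean/stdev) that must be fitted on the training split only and then reused on the test split, or the test data leaks into training.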
3. Not checking for duplicates in the training dataset
We often get excited about surprisingly high accuracy figures. A double check frequently reveals that many examples in the test set are duplicates of examples in the training set. In such scenarios, the model's performance measurements are meaningless. A related aspect is shuffling the training set: without shuffling, we may end up with splits that are not representative (e.g., all summer data used for training and the rest held out for testing). This can produce loss curves that require needless debugging.
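A cheap guard against both problems (illustrative helper of my own, under the assumption that examples are hashable rows) is to deduplicate and shuffle before splitting, then assert that train and test do not overlap:

```python
import random

def split_without_leakage(rows, test_frac=0.2, seed=42):
    """Dedupe, shuffle, then split, so no test example duplicates a training one."""
    unique = list(dict.fromkeys(rows))  # drop exact duplicates, keep order
    rng = random.Random(seed)
    rng.shuffle(unique)                 # avoid ordered (e.g., seasonal) splits
    cut = int(len(unique) * (1 - test_frac))
    return unique[:cut], unique[cut:]

rows = [("a", 1), ("b", 0), ("a", 1), ("c", 1), ("d", 0)]  # ("a", 1) repeats
train, test = split_without_leakage(rows)
assert not set(train) & set(test)  # leakage check
```

Real pipelines also need to catch near-duplicates (same record with trivial differences), which exact hashing misses.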
4. Not checking for inherent dataset bias
Data does not describe ground truth in a statistical sense. It only captures the attributes the application owners needed at the time for their use case. It is important to analyze datasets for bias and dropped data, and understanding the context of a dataset is supercritical. There are often one or more error patterns in the data. If these errors are random, they are less harmful to model training. But if an error causes a particular row or column to be systematically missing, it can skew the dataset. For example, if device data for customer clicks is missing for Android users due to a bug, the dataset becomes biased toward the actions of iPhone users.
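Systematic (as opposed to random) missingness is easy to surface by comparing missing-value rates across segments. This sketch (my own illustration, mirroring the Android example above) computes the per-group missing rate of a field:

```python
from collections import Counter

def missing_rate_by_group(rows, group_key, field):
    """Fraction of rows per group where `field` is None.
    A gap concentrated in one group signals systematic bias."""
    total, missing = Counter(), Counter()
    for r in rows:
        g = r[group_key]
        total[g] += 1
        if r.get(field) is None:
            missing[g] += 1
    return {g: missing[g] / total[g] for g in total}

clicks = [
    {"os": "android", "device": None},        # bug: device never logged
    {"os": "android", "device": None},
    {"os": "ios", "device": "iPhone 12"},
    {"os": "ios", "device": "iPhone 13"},
]
print(missing_rate_by_group(clicks, "os", "device"))
```

A uniform missing rate across groups suggests random errors; a rate near 1.0 for one group, as here for Android, means the dataset is biased and the feature (or the dataset) needs repair before training.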
5. No unit tests to validate input data
In traditional software development projects, it is best practice to write unit tests to validate code dependencies. ML projects should apply similar best practices to continuously test, verify, and monitor all incoming datasets. This includes ensuring that test sets produce statistically significant results and are representative of the dataset as a whole.
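A data unit test can be as simple as a schema of per-field predicates run against every incoming batch. This is a hand-rolled sketch (real pipelines often use a validation library, but the idea is the same):

```python
def validate_batch(rows, schema):
    """Check each incoming row against expected fields and value ranges,
    the data analogue of a unit test for code dependencies."""
    errors = []
    for i, row in enumerate(rows):
        for field, check in schema.items():
            if field not in row:
                errors.append(f"row {i}: missing {field}")
            elif not check(row[field]):
                errors.append(f"row {i}: bad {field}={row[field]!r}")
    return errors

schema = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "country": lambda v: isinstance(v, str) and len(v) == 2,
}
print(validate_batch([{"age": 34, "country": "US"}, {"age": -5}], schema))
```

Wiring such checks into CI (and into the ingestion pipeline itself) catches bad batches before they silently degrade a model.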
6. Wrong assumptions about the meaning of data attributes
Data attributes are typically never documented. Before the big-data era, data was curated before being added to the central data warehouse; this is known as schema-on-write. Today, the data-lake approach is to collect the data first and infer its meaning at the time of consumption; this is known as schema-on-read. A related issue is the existence of multiple definitions for a given business metric, i.e., the lack of standardization of business metrics. Even the simplest metrics can have multiple sources of truth and business definitions. For example, a basic metric such as "number of new customers" may have different definitions depending on whether it is counted by the sales, finance, marketing, or customer support team.
7. Uncoordinated changes to the data source
Changes to the source schema are often not coordinated with downstream teams. They range from schema changes (breaking existing pipelines) to hard-to-detect semantic changes in data attributes (very ugly when your model unexpectedly starts misbehaving!). And when business definitions change, the definitions are rarely versioned.
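One lightweight defense (a sketch of my own, not a tool named in the article) is to fingerprint the source schema, column names plus inferred types, and compare it against the last known fingerprint on every run; type drift in a column then fails fast instead of silently corrupting the model:

```python
import hashlib
import json

def schema_fingerprint(rows):
    """Hash of the sorted (column, type-name) pairs seen in a sample of rows.
    A changed fingerprint means an upstream schema change."""
    cols = sorted({(k, type(v).__name__) for r in rows for k, v in r.items()})
    return hashlib.sha256(json.dumps(cols).encode()).hexdigest()

v1 = schema_fingerprint([{"id": 1, "name": "a"}])
v2 = schema_fingerprint([{"id": "1", "name": "a"}])  # id silently became a string
assert v1 != v2  # breaking change detected
```

Semantic changes (same column, same type, new meaning) still slip past this; those need the drift monitoring discussed under sin 8.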
8. Using non-representative data
Data has an expiration date: customer behavior from ten years ago may not be representative today. In addition, verify that the data is IID (Independent and Identically Distributed) for model training, and account for the seasonal nature of the data. Datasets are also constantly evolving. Analyzing the distribution of data is not a one-time activity needed only at model-creation time; instead, datasets must be continuously monitored for drift, especially for online training. Furthermore, because of data silos, different teams manage and catalog different datasets, and a lot of tribal knowledge is needed to find the right data. Without due diligence, teams use the first available data and make the classic mistake of assuming all datasets are equally reliable. Source teams update and manage some datasets very carefully, while others are abandoned, not refreshed regularly, or fed by murky ETL pipelines.
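A common way to monitor for drift (one option among several; the article does not prescribe a metric) is the Population Stability Index, which compares the binned distribution of a fresh sample against a baseline. A self-contained sketch:

```python
import math

def psi(expected, actual, bins=5):
    """Population Stability Index between a baseline sample and a fresh one.
    Values above ~0.2 are commonly read as significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against constant data

    def fractions(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # floor avoids log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1 * i for i in range(100)]
assert psi(baseline, baseline) < 0.01            # no drift against itself
assert psi(baseline, [x + 5 for x in baseline]) > 0.2  # shifted data flags drift
```

Running such a check per feature on every incoming batch turns "analyze the distribution once" into continuous monitoring.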
9. Ad hoc sampling from large datasets
With very large datasets, sampling is typically ad hoc. Teams often just decide to use all the data for training. While more data helps build an accurate model, sometimes the dataset is huge, with billions of records; training on all of it costs both time and resources, and each training iteration takes longer, slowing down the project. Sampling needs to be used effectively, paying special attention to techniques such as importance sampling.
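The core idea of importance sampling, draw records with probability proportional to how informative they are rather than uniformly, can be sketched in a few lines (an illustrative toy of my own; production systems would sample in a streaming, weighted fashion):

```python
import random

def importance_sample(rows, weight_fn, k, seed=0):
    """Draw k rows with probability proportional to weight_fn(row),
    so rare-but-informative records survive downsampling."""
    rng = random.Random(seed)
    weights = [weight_fn(r) for r in rows]
    return rng.choices(rows, weights=weights, k=k)

# Toy: upweight the rare positive class 50x when downsampling.
rows = [{"label": 1}] * 2 + [{"label": 0}] * 98
sample = importance_sample(rows, lambda r: 50 if r["label"] == 1 else 1, k=20)
```

When the sample is used for training, the corresponding inverse weights should be applied to keep loss estimates unbiased; skipping that correction is its own subtle bug.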
In summary, be sure to apply this checklist to your dataset selection. While these steps add effort and may slow you down at first, they pay for themselves many times over later in the ML lifecycle!
Bio: Sandeep Uttamchandani, Ph.D.: Data + AI/ML; both Product/Software Builder (VP of Engineering) and leader of Enterprise Data/AI initiatives (CDO) | O’Reilly author | Founder of DataForHumanity (non-profit)
Original. Re-posted with permission.