Several notable AI thinkers tweeted earlier this year about how, in applied machine learning, data is more important than model architecture and optimization. François Chollet wrote:
ML researchers work with fixed benchmark datasets and spend all of their time searching over the knobs they do control: architecture and optimization. In the applied ML you do, you’ve probably spent most of your time collecting and labeling data – where your investment pays off.
– François Chollet (@fchollet) January 24, 2021
Andrew Ng chimed in that he agrees with Chollet and that more work needs to be done to disseminate best practices on data creation and organization.
Getting the data right is certainly critical. But Chollet and Ng seem to speak from an “ML-first” perspective, where it is axiomatic that a predictive model, and nothing else, is how applied ML projects deliver. In this mindset, data may be important, but it is only a means to an end, and that end is a predictive model.
I think problem formulation, the design of the problem itself, is even more important than either data or models. Christoph Molnar hit the nail right on the head:
The way you frame the problem
is more important than
the choice of ML algorithm you throw at it
– Christoph Molnar (@ChristophMolnar) January 24, 2021
Problem formulation is the process of designing a data-processing solution for a business problem. In this post I assume that the business problem is defined and given; finding ways for analytics or modeling to add value to the business is a different but related problem.
Molnar lists some of the elements of problem design in another tweet:
- “Choice of the prediction target”
- “What data is used”
- “What to do with the prediction”
My only addition would be to move the last item to the top: the first step of problem formulation is to plan how your system will be used and how it will solve the business problem.
Take churn prevention, for example. From an ML-first perspective, it looks straightforward: in a given month, each user either churns or does not. Bam, let’s train a boosted-tree binary classifier to predict churn for the coming month. Done.
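In code, this ML-first recipe takes only a few lines. A minimal sketch on synthetic stand-in data (a real project would use per-user behavioral features and observed monthly churn labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-user features X and monthly churn labels y
# (churn is the rarer class, hence the imbalanced class weights).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)

# A churn "probability" for each user in the validation set.
churn_scores = model.predict_proba(X_val)[:, 1]
```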
But then what?
The first obstacle we face is calibration. Many binary classifiers are trained and tuned in a way that depends only on the ranking of the predicted scores, not on their actual values.¹ Suppose there are 3 users in the validation set, with predictions from two models, A and B.
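For concreteness, here is one hypothetical set of labels and scores with the relevant property (the exact numbers are made up; what matters is that both models rank the users the same way):

```python
import numpy as np

y_true   = np.array([0,    0,    1])     # 1 = churned
scores_a = np.array([0.10, 0.30, 0.60])  # model A's predicted scores
scores_b = np.array([0.01, 0.02, 0.99])  # model B's predicted scores
```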
By many binary classification metrics, the predictions of models A and B would be equally good, since for each model we can choose a threshold that perfectly separates churned from retained users.²
In applications such as churn prevention, the values matter! The customer success team reads the scores directly, and is much more concerned about a user with an 80% chance of churning than about one with a 7% chance, even though we know the two models’ scores cannot be compared at face value. For the churn prediction model to be useful, we should calibrate the scores so that they carry real meaning.
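One way to do this, continuing the sketch above, is scikit-learn’s CalibratedClassifierCV, which maps a model’s scores onto empirical churn frequencies:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

# Isotonic regression is one common calibration method; Platt scaling
# ("sigmoid") is another.
calibrated = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Scores that can now be read as actual churn probabilities.
calibrated_scores = calibrated.predict_proba(X_val)[:, 1]
```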
In general, we need to understand how our audience will interpret our output, whether it is model predictions, BI dashboards, experiment results, or any other artifact.
OK, suppose we have solved the calibration problem and now have a flawless, well-calibrated churn prediction model. What should the customer success team do with it?
- Should they reach out to users with a 10% churn probability, or 50%, or 90%?
- What if each user responds differently to the customer success team’s actions? Does it matter whether a user’s churn probability is 20% or 80% if our intervention has no effect on that user?
- What if the customer success team contacts every customer anyway? In that case, our prediction model is useless; we should instead have modeled which intervention is most effective.
There are no universally right answers to these questions, because the business problem is to prevent churn, not to predict it. A predictive model alone cannot solve the problem.
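When the team will intervene regardless, a better-matched formulation is uplift modeling: estimate the effect of the intervention on each user rather than the user’s churn risk. A minimal sketch of one common approach, a “T-learner”, on synthetic stand-in data (real data would come from logged or randomized interventions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: per-user features, whether the customer success team
# contacted the user, and whether the user was retained.
X = rng.normal(size=(5000, 10))
contacted = rng.integers(0, 2, size=5000)
retained = (rng.random(5000) < 0.6 + 0.2 * contacted * (X[:, 0] > 0)).astype(int)

# T-learner: fit one outcome model per treatment group.
m_treat   = GradientBoostingClassifier().fit(X[contacted == 1], retained[contacted == 1])
m_control = GradientBoostingClassifier().fit(X[contacted == 0], retained[contacted == 0])

# Estimated uplift: how much contacting each user changes their retention.
uplift = m_treat.predict_proba(X)[:, 1] - m_control.predict_proba(X)[:, 1]
```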
Our article on trade-offs in conversion rate modeling shows another example of the importance of choosing the prediction target. The business task is to understand how well users convert from one step of a sequence (e.g., a funnel) to the next. We can formulate this as a data science task in two ways (a short code sketch follows the list):
- Treat the conversion as a binary outcome: choose a fixed-length time window in which to observe the result. If the user converts within the window, it is a success; otherwise, a failure.
- Model how long it takes users to convert. This is a right-censored numeric target.
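Here is a sketch of how each target could be constructed from a hypothetical event log (all column names are made up for illustration):

```python
import pandas as pd

# Hypothetical event log: one row per user.
events = pd.DataFrame({
    "signup_at":  pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-20"]),
    "convert_at": pd.to_datetime(["2021-01-03", None,         "2021-02-15"]),
})
snapshot = pd.Timestamp("2021-03-01")  # when the dataset is built

# Formulation 1: binary outcome within a fixed 14-day window.
window = pd.Timedelta(days=14)
events["converted_14d"] = (
    events["convert_at"].notna()
    & (events["convert_at"] - events["signup_at"] <= window)
)

# Formulation 2: right-censored time to conversion.
events["duration_days"] = (events["convert_at"].fillna(snapshot) - events["signup_at"]).dt.days
events["observed"] = events["convert_at"].notna()  # False means censored
```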
The choice of target variable dictates what data must be collected and what types of models are appropriate.
Excellent problem formulation is one of the clearest hallmarks of a strong data scientist and one of the most important things I look for in an interview. What makes someone good at it?
- Curiosity, to learn how the business works.
- Integrity, toward ourselves and our audience. We should be upfront about the limitations of our methods and the correct interpretation of our results, especially when we know that an approach does not completely solve a particular problem.
- Breadth of knowledge, built through lots of hands-on experience with real, applied problems and through reading about the experiences of other practitioners. To evaluate alternative formulations, we need to be aware of the alternatives.
- Foresight, to imagine a roadmap and an architecture for each potential solution and to identify the pros and cons before committing to one option.
There is often a gap between applied data science and the methodological focus of data science courses and bootcamps. Problem formulation sits at the heart of that gap, so mastering it is one of the best ways to ensure that your work adds value to your organization.
¹ Logistic regression is a notable exception that is calibrated by construction.
² The AUC is also identical for the two models’ outputs, because the true positive and false positive rates are the same across all thresholds.