Data science has been hot for many years now, attracting attention and talent. However, there is an ongoing thread of commentary that says the field’s core competency, statistical modeling, is overrated and that managers and aspiring data scientists should focus on engineering instead. Vicki Boykis’ 2019 blog post was the first article I remember in this vein. She wrote:
… Data science is moving asymptotically closer to engineering, and the skills that data scientists need going forward are less visualization- and statistics-based and more in line with traditional computer science curricula …
Following from this premise, her sensible advice was:
Don’t do a degree in data science, don’t do a bootcamp … It’s much easier to get into a data science and tech career through the “back door”, i.e. starting as a junior developer, or in DevOps, project management and, perhaps most relevantly, as a data analyst, data manager or similar…
Her list of skills an aspiring data scientist should learn consisted entirely of engineering, MLOps, and tooling, and she deliberately left out modeling, saying:
While tuning models, visualization, and analysis make up part of your time as a data scientist, data science is and has always been primarily about getting clean data into a single place to be used for interpolation.
More recently, Gartner’s 2020 Artificial Intelligence Hype Cycle report recognizes the role of data scientists, but says:
According to Gartner, developers are the greatest force in artificial intelligence.
Chris I. put it more directly in the article “Don’t Become a Data Scientist”:
Everyone and their grandmother wants to be a data scientist … I often get messages from new graduates and career changers asking me for advice on getting into data science. I tell them to become a software engineer.
Mihail Eric echoed the idea in an article titled “We Don’t Need Data Scientists, We Need Data Engineers”:
Nowadays, there is a bottleneck in helping companies get machine learning and modeling insights into production, and it centers on data problems … This may sound boring and unsexy, but old-school software engineering with a bent toward data may be what we really need right now … there will be fewer positions available for what appears to be a surfeit of newcomers to the market who are trained to do data science.
I agree with these articles that engineering and MLOps are important for data science work in applied industry, but I also believe that the core competency of data science, statistical modeling, is becoming more important, not less. Since we don’t have many opportunities for in-person conversations in this Covid era, here is how I imagine a conversation with these skeptics would go.
Skeptic: What does the term “data science” even mean? It’s such a broad, vague title, plus everyone calls themselves a data scientist these days, so it is completely diluted.
I: Data science is a big tent. When people talk about what the term should mean, it usually revolves around the core competency of statistical modeling. For example, Boykis mentions “machine learning, deep learning, and Bayesian simulations” as what junior data scientists expect to work on, compared to the “cleaning, editing data, and moving it from place to place” that they end up doing. Eric pictures the data scientist as “responsible for building models to study what can be learned from some data source, although often at the prototype rather than the production level.”
Statistical modeling is what is taught in most statistics, machine learning, and data science courses. It includes, among other things:
- Traditional predictive models, i.e. regression and classification. All the big hitters (linear models, boosted trees, neural networks, etc.) fall into this category
- Time series forecasting
- Experimental design and analysis
Skeptic: So data science is just training models? Isn’t that becoming obsolete anyway with the rise of AutoML and massive pre-trained models like GPT-3? As model building becomes a commodity, software engineers will do the job, not methodologists.
I: Statistical modeling involves much more than pressing “run” on a generic scikit-learn or PyTorch script. AutoML tools can help with some components, such as hyperparameter search and feature selection, but there is much more to it.
As I wrote a few weeks ago, the first thing a data scientist has to do is understand the business problem and formulate it as a modeling task. For example, you want to reduce churn, but should you treat it as a binary classification or a time-to-event problem? Is a predictive model enough, or do you need to draw cause-and-effect conclusions? How will you run experiments to make sure the model works?
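To make the framing question concrete, here is a minimal sketch of how the same cancellation records could be turned into either a binary classification target or a time-to-event target. Everything in it is an illustrative assumption: the `Customer` record, the dates, the 30-day horizon, and the helper names are hypothetical and not taken from any particular system.

```python
from datetime import date
from typing import NamedTuple, Optional, Tuple


class Customer(NamedTuple):
    # Hypothetical record layout, for illustration only.
    customer_id: str
    signup: date
    cancelled: Optional[date]  # None means no cancellation has been seen


def binary_label(c: Customer, snapshot: date, horizon_days: int) -> int:
    """Classification framing: 1 if the customer cancels within
    horizon_days after the snapshot date, else 0.
    (Customers already gone at the snapshot should be filtered out
    beforehand; that step is omitted here.)"""
    if c.cancelled is None:
        return 0
    days_after = (c.cancelled - snapshot).days
    return int(0 < days_after <= horizon_days)


def time_to_event(c: Customer, observation_end: date) -> Tuple[int, bool]:
    """Time-to-event framing: (duration in days, event observed?).
    Customers who have not cancelled by observation_end are
    right-censored: we only know they lasted at least this long."""
    if c.cancelled is not None and c.cancelled <= observation_end:
        return (c.cancelled - c.signup).days, True
    return (observation_end - c.signup).days, False


# Made-up example data.
customers = [
    Customer("a", date(2021, 1, 1), date(2021, 3, 2)),
    Customer("b", date(2021, 1, 1), None),
    Customer("c", date(2021, 2, 1), date(2021, 6, 1)),
]
snapshot = date(2021, 3, 1)
observation_end = date(2021, 6, 30)

binary_targets = [binary_label(c, snapshot, 30) for c in customers]
survival_targets = [time_to_event(c, observation_end) for c in customers]
```

The binary target discards when the cancellation happened and depends on an arbitrary horizon, while the time-to-event target keeps still-active customers as censored observations. Which one is appropriate depends on the decision the business wants to make, which is the kind of judgment call no generic script settles for you.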
The next step in modeling is to thoroughly understand and clean the data. This work often creates a lot of value by itself, because data scientists are often uniquely positioned to move between business logic and engineering and to detect problems.
The model fitting process is evolving, especially as model training platforms like MLflow, Comet, and Weights & Biases (among others) mature. However, many components still cannot be automated or split off. Data scientists need to decide how to evaluate the performance of a model, for example. Should we use a random or a temporal train-test split for a predictive model? Which evaluation metric best fits the way the business will use the model?
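The difference between a random and a temporal train-test split can be sketched in a few lines. This is only an illustration under assumed names: `random_split`, `temporal_split`, the `day` and `y` fields, and the synthetic rows are all hypothetical.

```python
import random
from datetime import date, timedelta


def random_split(rows, n_test, seed=0):
    """Shuffle the rows and hold out n_test of them.
    Test rows can be older than training rows."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(rows, cutoff):
    """Train on everything strictly before the cutoff date and
    test on everything on or after it."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test


# Synthetic data: one row per day for 100 days.
start = date(2021, 1, 1)
rows = [{"day": start + timedelta(days=i), "y": i} for i in range(100)]

rand_train, rand_test = random_split(rows, n_test=20)
time_train, time_test = temporal_split(rows, start + timedelta(days=80))
```

With the temporal split, every test row is later than every training row, which mimics how a deployed model only ever sees the future. The random split gives no such guarantee, so it can flatter a model that would be used for forecasting. Neither is correct in general; the choice follows from how the model will be used.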
The last part of the modeling process is communication. Engineering and MLOps teams need to know how to deploy the model in production (unless that is also the data scientist’s job). Business units need at least a basic intuition for how the model works, plus explanations for unexpected predictions.
As far as massive pre-trained models like GPT-3 go, data scientists at most companies shouldn’t waste time trying to build them from scratch. But these models cover only a small proportion of real-world use cases; most applications do not have a pre-trained model to build on.
Skeptic: I do understand. But I hear data scientists say over and over that modeling takes up only a small portion of their time. Even you said that data work should come before modeling. So if I were a hiring manager, shouldn’t I focus first on data and MLOps engineers? If I were choosing a career, wouldn’t engineering be the safer choice?
I: Let’s get on the same page first. Problem formulation, data retrieval, and data cleaning are part of statistical modeling. Understanding how data engineering and model deployment pipelines work is also part of statistical modeling (although designing and implementing those systems is not). Even data scientists who want to do only statistical modeling should embrace these tasks.
I agree that, from an organizational perspective, engineering is a higher priority than statistical modeling. Even experiment analyses, which do not need to be put into production, depend entirely on good instrumentation and data pipelines.
Data scientists at smaller, scrappier companies spend more time on engineering and MLOps. People who want to focus on statistical modeling should look for larger, better-funded companies with more specialized teams. I would warn against premature career specialization, though, because some engineering knowledge enables data scientists to act as a very valuable bridge between the technical and business sides of an organization. It also leaves open the option of moving to a role closer to engineering down the road.
Skeptic: Score one for the skeptics. I have also read that most data science projects fail, so I don’t understand why a company, especially a small, scrappy one, should waste resources on data scientists.
I: I’ve seen sources say 85 or 87 percent of projects fail, but they seem to pull these numbers out of thin air. Where is the data? I am skeptical of your skepticism!
More seriously, what does it mean for a data science project to fail? Kohavi, Tang, and Xu point out that most experiments fail in the sense that the proposed change does not prove to be better than the current system. However, this is not a business failure, as these experiments still lead to good decisions and a rapid pace of innovation.
More generally, the most valuable thing that statistical modelers bring to the table is their culture. Data scientists insist on justifying ideas with evidence rather than intuition, especially by quantifying model performance. Before running an experiment, we need to know what metrics we will use to evaluate a new idea. Before we introduce a complex forecasting model, we need to know what the baseline is. It’s probably the current deterministic, hard-coded system that you don’t even think of as a model, let alone measure! So while some projects fail, strong data scientists raise the bar for the entire organization.
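Measuring the baseline before introducing something more complex can be as simple as backtesting the existing rule with the same metric as the candidate. The sketch below assumes a hypothetical hard-coded rule ("tomorrow equals today"), a hypothetical candidate (a three-point moving average), and a made-up demand series; none of these come from a real system.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def last_value_rule(history):
    """Stand-in for the existing hard-coded system: repeat the last value."""
    return history[-1]


def moving_average_model(history, window=3):
    """Stand-in for the proposed model: mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def backtest(series, forecaster, start):
    """One-step-ahead forecasts for series[start:], where each forecast
    uses only the points that came before it."""
    preds = [forecaster(series[:t]) for t in range(start, len(series))]
    return mean_absolute_error(series[start:], preds)


# Made-up demand history.
demand = [10, 12, 11, 13, 12, 14, 13, 15]

baseline_mae = backtest(demand, last_value_rule, start=3)
candidate_mae = backtest(demand, moving_average_model, start=3)
improvement = baseline_mae - candidate_mae
```

On this toy series the candidate’s error comes out lower than the rule’s, but the point is the comparison itself: without `baseline_mae`, there is no way to say whether a new model is worth its added complexity.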
Modeling experts also speed up innovation by detecting potential modeling problems in advance. In recommendation systems, for example, it is important to think ahead about how to avoid closed feedback loops, how to address the cold start problem, and how to ensure algorithmic fairness.
However, not everything can be planned in a data science project. Unlike in other fields of engineering, we cannot promise a priori concrete results or even firm roadmaps to our partners, because we do not know what we will find in the data. A separate data science role helps communicate this limitation.
Skeptic: Maybe really good senior data scientists add value, but right now there is a glut of junior data scientists. These poor people end up at companies that are not ready for statistical modeling, wasting their education and talent.
I: As we said above, data cleaning is part of statistical modeling, and data science programs should place more emphasis on it. People who want to specialize in modeling should look for jobs at larger companies, although this is no panacea; understanding the data and how the pipelines are designed is always important.
It may be true that the supply of data science labor exceeds demand, at least for jobs with the explicit title of data scientist. But this perspective misses the forest for the trees. Aspiring data scientists may need to extend their job search to other titles or target roles in specific business units, but statistical modeling can and should be applied in virtually every industry. Regardless of title, people with good modeling skills are more effective and rise to the top.
However, it is important to learn statistical modeling, whether through a degree, a bootcamp, or self-study. You can’t focus entirely on engineering and MLOps just to get a job, and then hope to move onto a data science team later without modeling experience.
The field of data science has certainly received a lot of hype in the last 10 years, and a certain amount of pushback is inevitable, even productive. But let’s not forget the value that its core competency, statistical modeling, brings to the table.