Whenever we start out with machine learning, we often turn back to simple classification models. In fact, people outside the field have mostly seen those models at work. After all, image recognition has become the poster child of machine learning.
However, while classification is effective, it is limited. There are many tasks we want to automate that classification cannot handle. A great example is picking the best candidates (according to historical data) out of a crowd.
We are about to implement something similar at Oxylabs. Some web scraping operations can be optimized for customers through machine learning. At least, that's our theory.
There are many factors in web scraping that affect whether a website is likely to block you. Crawl patterns, user agents, request frequency, requests per day – all of these and more have an impact on the likelihood of receiving a block. In this case, we are interested in user agents.
We can take the correlation between user agents and blocking probability as a given. Based on our experience (some user agents have been blocked outright), we can safely say that some user agents perform better than others. Therefore, if we know which user agents are best suited for the task, we can receive fewer blocks.
However, there is an important caveat – the list is unlikely to be static. It would likely change over time and across data sources. Therefore, approaches based on static rules won't cut it if we want to optimize UA usage to the maximum.
Regression-based models are grounded in statistics. They take two (correlated) random variables and try to minimize a cost function. A simplified way to look at cost minimization is to find the line with the smallest average squared distance to all data points. Over time, machine learning models can begin to predict new data points.
Simple linear regression. There are many ways to draw a line through the data, but the goal is to find the most effective one. Source
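As a toy illustration of the least-squares idea (all numbers below are made up for the example), we can fit a line to noisy synthetic data with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known line, y = 2x + 1, plus Gaussian noise.
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# np.polyfit minimizes the sum of squared vertical distances to the line.
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 2), round(intercept, 2))  # close to 2.0 and 1.0
```

Because the data was generated from a known line, we can see the fitted slope and intercept land close to the true values.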
We have already assumed that the number of requests (which can be expressed in many ways and will be defined later) somehow correlates with the user agent sent when accessing the resource. As mentioned earlier, we know that a small number of UAs perform terribly. In addition, we know from experience that significantly more user agents show average performance.
Our last assumption may be obvious – we should expect some outliers that perform exceptionally well. Therefore, we accept that the distribution of requests per UA follows a bell curve. Our goal is to find the really good ones.
Note that the number of requests correlates with a number of other variables as well, making the actual picture much more complex.
Intuitively, our fit should look a little like this. Source
But why are user agents a problem at all? Well, technically, there is a virtually infinite number of potential user agents. For example, one database lists over 18 million UAs for Chrome alone. Moreover, that number grows by the minute as new versions of browsers and operating systems are released. Clearly, we can't use or test them all. We have to predict which ones are best.
Therefore, our goal with the machine learning model is to create a solution that can predict the efficiency of UAs (defined by the number of requests). We then take these predictions and create a set of the most efficient user agents to optimize scraping.
Often, the first line of defense is to send the user a CAPTCHA once they have sent too many requests. In our experience, continuing to scrape even after the CAPTCHA has been solved leads to a block fairly quickly.
Here, a CAPTCHA is defined as the first instance when such a test is displayed and requested to be solved. A block is defined as the loss of access to the regular content displayed on a website (regardless of whether a refused connection or something else is received).
Therefore, we can define the number of requests as the number of requests in a given time frame to a particular source that one UA can make before receiving a CAPTCHA. Such a definition is reasonably accurate without forcing us to sacrifice proxies.
However, to measure the performance of a particular UA, we need to know the expected value of the event. Fortunately, by the law of large numbers, we can conclude that after a large number of trials, the mean of the results approaches the expected value.
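The law of large numbers is easy to see in a quick simulation. Here the request counts are drawn from an invented distribution with a known expected value of 100, purely for illustration:

```python
import random

random.seed(7)

# Hypothetical model: the number of requests a UA makes before a CAPTCHA,
# drawn uniformly between 50 and 150, so the true expected value is 100.
def trial() -> float:
    return random.uniform(50, 150)

# The empirical mean drifts toward the expected value as the sample grows.
means = {n: sum(trial() for _ in range(n)) / n for n in (10, 1000, 100000)}
for n, mean in means.items():
    print(n, round(mean, 1))
```

With 100,000 trials the empirical mean sits within a fraction of a request of the true value, while the small samples can be noticeably off.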
Therefore, we only need to let our clients continue their day-to-day operations and measure the performance of each user agent according to our number-of-requests definition.
Because we have an unknown expected value that is deterministic (although noise does occur, we know that IP blocks are based on a defined set of rules), we take a mathematical liberty – we decide when the empirical mean is close enough to the expected value. Unfortunately, without the data, it is impossible to say in advance how many trials we will need.
How many experiments are needed before our empirical mean (i.e., the mean of the current sample) gets close to the expected value depends on the variance of the sample. Convergence of our random variable to a constant c can be written as follows:
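The formula image did not survive, so here is the standard statement being referenced – convergence in probability of the empirical mean to the constant c, together with the Chebyshev bound that ties the required sample size to the variance:

```latex
% Convergence in probability of the empirical mean \bar{X}_n to c:
\lim_{n \to \infty} P\!\left(\left|\bar{X}_n - c\right| \ge \varepsilon\right) = 0
\quad \text{for every } \varepsilon > 0.

% Chebyshev's inequality bounds the deviation via the variance \sigma^2:
P\!\left(\left|\bar{X}_n - c\right| \ge \varepsilon\right)
\le \frac{\sigma^2}{n \varepsilon^2}.
```

The second inequality is what makes the next observation concrete: for a fixed precision ε, a larger variance σ² demands a proportionally larger number of experiments n.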
From here, we can conclude that a larger sample variance (σ²) means more experiments are needed for convergence. Therefore, at this stage, it is impossible to predict how many experiments we would need to approach a reasonable average. In practice, however, the running average performance of a UA is not too difficult to monitor.
Knowing the average performance of a UA is a win in itself. Because we have a limited number of user agents per data source, we can use the average as a measuring bar for each combination. Basically, it allows us to eliminate underperforming user agents and try to find those that succeed.
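As a sketch of using the average as a cutoff (the UA names and numbers here are invented):

```python
# Hypothetical mean requests-before-CAPTCHA measured per user agent.
ua_means = {"ua_a": 120.0, "ua_b": 95.0, "ua_c": 140.0, "ua_d": 60.0}

# The overall average acts as the measuring bar for this data source.
bar = sum(ua_means.values()) / len(ua_means)

# Keep only the user agents that clear the bar.
keep = sorted(ua for ua, mean in ua_means.items() if mean >= bar)
print(bar, keep)  # 103.75 ['ua_a', 'ua_c']
```

No model is needed for this step – a running mean per UA and per data source is enough to start pruning the worst performers.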
Without machine learning, finding capable user agents would be guesswork for most data sources unless there are clear patterns (e.g., certain operating system versions). Outside of such cases, we would have little to go on.
There are numerous possible models and libraries to choose from, such as PyCaret or Scikit-Learn. Since we have guessed that the regression is polynomial, the only real requirement is that the model can fit such distributions.
I'm not going to get into the data collection part of the story. A more compelling and difficult task is data encoding. Most, if not all, regression-based machine learning models only accept numeric values as data points. User agents are strings.
In general, we might turn to hashing to automate the process. However, hashing destroys the relationships between similar UAs and may even result in two of them getting the same value. We can't have that.
There are other approaches. Creating a custom encoding algorithm for shorter strings may be an option. It can be done with a simple mathematical process:
- Create a custom base-n numbering system, where n is the number of all symbols used.
- Assign each symbol an integer, starting from 0.
- Take a string.
- Multiply each assigned integer by n^(x−1), where x is the position of the symbol in the string (counting from the right).
- Sum up the results.
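A minimal sketch of the steps above. The alphabet here is a stand-in; a real one would list every symbol that can appear in a UA:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789./ ()"  # assumed symbol set
BASE = len(ALPHABET)
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(s: str) -> int:
    # Treat the string as a number written in base `BASE`.
    value = 0
    for ch in s:
        value = value * BASE + INDEX[ch]
    return value

def decode(value: int, length: int) -> str:
    # Reverse the encoding; the length must be known, since leading
    # zero-symbols would otherwise be lost.
    chars = []
    for _ in range(length):
        value, idx = divmod(value, BASE)
        chars.append(ALPHABET[idx])
    return "".join(reversed(chars))

code = encode("mozilla/5.0")
print(code, decode(code, len("mozilla/5.0")))
```

Decoding is a matter of repeated division and remainder by the base, which recovers the original symbols one by one.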
Each result would be a unique integer. If necessary, the encoding can be reversed to recover the original string. However, user agents are fairly long strings, which produces enormous integers that can cause unexpected behavior in some environments.
A more manageable approach is to split user agents into their components and use the version numbers as identifiers. For example, we can create a simple table from the parts of an existing UA:
`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko)`

| Product | Version |
| --- | --- |
| Mozilla | 5.0 |
| Macintosh | – |
| Intel Mac OS X | 10_14_5 |
| AppleWebKit | 605.1.15 |
You may notice that there is no "Windows NT" product in the example. That's an important detail, because we want the resulting identifier strings to be as long as possible. Otherwise, we increase the probability of two user agents pointing to the same ID.
As long as a sufficient number of products are listed in the table, unique integers can easily be created by stripping the punctuation from the version numbers and concatenating them (e.g., 50101456051150). For products that do not have a version (such as Macintosh), a unique ID can be assigned, starting from 0.
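A sketch of that concatenation step. The product list, its order, the fallback IDs, and the regex are assumptions made for this example; a production version would maintain a stable, carefully versioned product table:

```python
import re

# Assumed product table; order determines the digit layout of the result.
PRODUCTS = ["Mozilla", "Intel Mac OS X", "AppleWebKit", "Macintosh"]
UNVERSIONED_IDS = {"Macintosh": "0"}  # unversioned products get IDs from 0

def ua_to_int(ua: str) -> int:
    parts = []
    for product in PRODUCTS:
        match = re.search(re.escape(product) + r"[/ ]([\d._]+)", ua)
        if match:
            # Keep only the digits of the version: "10_14_5" -> "10145".
            parts.append(re.sub(r"\D", "", match.group(1)))
        elif product in ua:
            parts.append(UNVERSIONED_IDS[product])
    return int("".join(parts))

ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) "
      "AppleWebKit/605.1.15 (KHTML, like Gecko)")
print(ua_to_int(ua))  # 50101456051150
```

The resulting integer is just the version digits laid side by side in a fixed product order, which keeps it compact and easy to regenerate.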
As long as the structure remains stable over time, creating and looking up these integers is easy. They are unlikely to cause overflows or other mischief.
Of course, careful consideration must be given before making changes, as changing the structure would lead to massive headaches. Leaving a few "blind spots" in case the table needs to be updated may be wise.
Once we have the performance data and a method for generating unique integers, the rest is relatively easy. Since we have assumed that the data may follow a bell curve distribution, we will likely have to fit a polynomial function to it. Then we can start feeding the data into the models.
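A minimal sketch of the fitting step with NumPy, using synthetic data from a known quadratic (the real inputs would be the encoded UAs and their measured request counts):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: feature value vs. requests-before-CAPTCHA,
# generated from y = -2x^2 + x + 50 plus noise so the fit can be checked.
x = np.linspace(-3, 3, 300)
y = -2.0 * x**2 + x + 50.0 + rng.normal(0, 1.0, x.size)

coeffs = np.polyfit(x, y, deg=2)  # least-squares polynomial fit
model = np.poly1d(coeffs)         # callable polynomial for predictions

print(np.round(coeffs, 1))   # close to the true [-2, 1, 50]
print(round(model(0.5), 1))  # prediction near -2*0.25 + 0.5 + 50 = 50.0
```

The same two calls work unchanged once `x` holds the encoded user agents and `y` their measured request counts; only the degree of the polynomial may need tuning.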
You don't even have to build a model to benefit from such an approach. Simply knowing the average performance of the sample and the number of requests of particular user agents lets you search for correlations. Of course, it takes quite a bit of effort before a machine learning model can do everything for you.