What is Azure Machine Learning?
Machine learning is a computing technique, under the broader umbrella of artificial intelligence, that allows computers to use historical data to predict future behaviors, outcomes, and trends. Through machine learning, computers learn to perform tasks without being explicitly programmed. Machine learning predictions can make applications and devices smarter. For example, when you shop online, machine learning helps recommend other products you might want based on what you’ve bought in the last six months. Or when your credit card is swiped, machine learning compares the transaction to a database of past transactions and helps detect fraud.
Azure Machine Learning is a service provided by Microsoft Azure that can be used for all kinds of machine learning tasks, from classical ML to deep learning, and from supervised to unsupervised and reinforcement learning. Whether you write code with the Python / R SDK or use ML Studio to build a low-code / no-code model, you can build, train, and track machine learning and deep learning models in an Azure Machine Learning workspace.
Azure lets you start training on your local machine and then scale out to the cloud. The service also works with popular open-source deep learning and reinforcement learning tools such as PyTorch, TensorFlow, scikit-learn and Ray RLlib.
Going through the workflow step by step
In the five steps described below, we perform all the required tasks, from data collection, processing, modeling and training through to building predictive analytics reports for the client. Before we dig into the details, all the steps are summarized with brief descriptions in the diagram below.
We start the workflow by defining a function that determines when the pipeline should start running. This is usually called a ‘trigger’: an event that kicks off a workflow. Every data pipeline needs data to start working. This data can come in the form of structured streams (.ss files), text files (.csv, .tsv, etc.) or event streams from Azure Event Hubs. Assume in this scenario that we have structured data streams that are exported to a COSMOS location (Cosmos is the NoSQL store in Azure) every 6 to 12 hours. To notify the machine learning pipeline that upstream files are available and new raw data has arrived, an Azure Function runs every hour to check for new files. The rest of the workflow executes only if this Azure Function returns True, indicating that new data is present.
a. The Azure Function runs every hour to check for new data.
b. If new data is available, it returns a Boolean True, which in turn triggers the rest of the workflow.
c. If there are no new files, it returns False and nothing happens. It runs again after one hour to perform the same check.
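As a sketch, the hourly trigger logic boils down to a simple set comparison. The function and state names below are illustrative assumptions, not the actual Azure Functions binding signature:

```python
from datetime import datetime, timezone

def new_files_present(current_files, processed_files):
    """Return True if the upstream location contains files we have not
    processed yet -- the Boolean the rest of the workflow keys off."""
    unseen = set(current_files) - set(processed_files)
    return len(unseen) > 0

def hourly_check(current_listing, state):
    """Body of the hourly timer trigger (names are illustrative)."""
    if new_files_present(current_listing, state["processed"]):
        state["last_trigger"] = datetime.now(timezone.utc).isoformat()
        return True   # downstream pipeline runs
    return False      # nothing new; check again in an hour
```

In a real deployment the `current_listing` would come from listing the COSMOS export location, and `state` would be persisted between runs.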
Azure Data Factory (ADF) supports all modern data structures, including structured and unstructured data flowing through storage services such as data lakes and warehouses. However, the best way to process data is to integrate ADF with Azure Databricks notebooks. A notebook is a typical Python environment that runs on top of a workspace created in Azure ** and can perform all the machine learning and computing functions that Python can.
When the Azure Function signals that new data has arrived, the Databricks cluster spins up and this notebook starts working. We import the data in Python and perform preliminary cleaning, sorting, and other preprocessing work, such as handling nulls and invalid entries and discarding data that is not used by the machine learning model. This processed, clean data, ready to be sent to the machine learning pipeline, is securely placed in an ADLS (Azure Data Lake Storage) Gen2 location. The data lake is a fast storage option for temporary and permanent storage needs, and can be accessed directly by the ML activity.
a. When the Azure Function returns True, the Databricks Python notebook starts ingesting the new data.
b. The notebook cleans and processes this data and prepares it for the ML model.
c. It is then pushed to the Data Lake Storage service, from where the ML activity can retrieve it and run the model.
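The cleaning step the notebook performs can be sketched in plain Python. The column names and validity rules below are illustrative assumptions, not the actual schema:

```python
def preprocess(rows):
    """Drop records with missing or invalid fields and keep only the
    columns the ML model consumes (column names are illustrative)."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:   # discard null entries
            continue
        if row["amount"] < 0:           # discard invalid entries
            continue
        cleaned.append({"id": row["id"], "amount": float(row["amount"])})
    return cleaned
```

In the real notebook this logic would typically run over a Spark DataFrame rather than a Python list, with the result written out to the ADLS Gen2 location.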
** The detailed steps for creating a Databricks workspace and processing data in Python using that workspace are explained in this tutorial – Transform data with a Databricks Notebook.
We are now at the heart of the machine learning workflow – a place where magic happens! Azure Machine Learning offers two ways to build ML pipelines.
You have the option of either building, training, and testing ML models using the Python or R SDK, or using Azure ML Studio to create a code-free, drag-and-drop pipeline.
If you want to start from scratch and follow the steps to create training and scoring pipelines for your ML model, a detailed example can be found in the repository below.
Regardless of the method chosen, ADF handles the pipeline the same way. The pipeline reads data from the ADLS storage account, executes training and prediction scripts against the new data, and updates the model on each run to fine-tune the trained algorithm. The result of this machine learning pipeline is a structured dataset stored as a daily output file in an Azure Blob Storage account.
a. Connect to the ADLS storage account to retrieve the processed data.
b. Run the ML model on the data.
c. Upload the output and prediction metrics to Azure Blob Storage.
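The score-then-update behavior of each daily run can be sketched independently of any Azure SDK. The "model" here is a trivial running mean used as a predictor, purely to illustrate how each run both scores the new data and folds it back in to fine-tune the model for the next run:

```python
def daily_run(model_state, new_data):
    """One pipeline run: score new_data with the current model, then
    update the model so the next run starts from a fine-tuned state.
    The model is a running mean -- purely illustrative."""
    predictions = [model_state["mean"] for _ in new_data]  # score first
    total = model_state["mean"] * model_state["n"] + sum(new_data)
    n = model_state["n"] + len(new_data)
    return predictions, {"mean": total / n, "n": n}        # updated model
```

A real pipeline would persist the trained model between runs (e.g. in the workspace's model registry) instead of passing a dict, but the read–score–retrain loop is the same shape.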
** An alternative to using machine learning pipelines is to implement MLflow directly in the Databricks environment. A detailed guide for this implementation can be found here – Track Azure Databricks ML experiments with MLflow.
Azure Blob Storage is a cloud-based service that lets you build data lakes for your analytics needs and provides storage for powerful cloud-native and mobile applications. Storage is scalable and pricing is proportional to the amount of storage used. Premium blobs are built on SSD-based storage architectures and are recommended for serverless applications such as Azure Functions. The main reason I use Blob Storage is its seamless connectivity to Power BI. A complete tutorial for setting this up can be found here.
a. Store the analyzed data securely in separate blobs.
b. Connect to Power BI to retrieve the data for visualization.
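One simple convention for the daily output files is a date-partitioned blob layout, so Power BI can always locate the latest run. The container and path names below are illustrative assumptions:

```python
from datetime import date

def daily_blob_name(container, run_date=None):
    """Build the name of the daily output blob, one per pipeline run
    (the layout is an illustrative convention, not a fixed scheme)."""
    d = run_date or date.today()
    return (
        f"{container}/predictions/"
        f"year={d.year}/month={d.month:02d}/day={d.day:02d}/output.csv"
    )
```

With this layout, each ADF run writes to a fresh partition and the Power BI dataset simply folders over `predictions/` to pick up all history.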
The best part about staying in the Microsoft Azure ecosystem is the availability of Power BI in the same stack. Power BI is a very powerful tool for creating reports and presenting insights from the model. Therefore, I personally end the workflow by keeping the Power BI dashboard updated whenever the ADF pipeline completes. It lets you refresh the data every day at a specific time, so all insights, tables, metrics, and visualizations stay up to date and in sync with the computing workflow.
Storytelling is the most important part of the analysis process.
The customer or consumer of the workflow is never interested in the code that brings the forecasts to the table. They are only interested in getting answers to the questions they asked when submitting the product requirements. That is why, in my dictionary, a workflow is called complete only once the end deliverable is in hand. That deliverable is a report that shows what the machine learning model has predicted each day, how the metrics have improved, and what improvements are still needed. This is the art of decision-making that machine learning empowers us to accomplish.
a. Connect Power BI to Azure Blob Storage to retrieve the data.
b. Set Power BI to refresh every day so all the information stays up to date.
Alarms, monitoring and anomaly detection
The most boring, yet extremely important, task in every DevOps or MLOps workflow is alerting and monitoring. These monitors must keep the data in check, safeguard data accuracy, and raise notifications when anomalies or unexpected results are detected at any stage of the analysis. Azure Data Factory (ADF) allows us to place multiple alerts on this pipeline directly from the ADF monitoring panels.
Examples of alerts to set:
- Staleness of the data received upstream.
- Preparation or execution fails, or takes longer than a given threshold.
- The results stored in the Data Lake (ADLS) deviate widely from previous data.
- File sizes received upstream or created by the workflow are exceptionally large or small.
Alerts can be configured to notify email group aliases or security groups, or to create incidents/bugs on your DevOps portal whenever anomalies or failures are detected.
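As an illustration, the file-size check from the last bullet above could be approximated with a simple z-score test against recent history. The threshold and the source of the size history are assumptions:

```python
import statistics

def file_size_anomalous(history, new_size, z_threshold=3.0):
    """Flag a file whose size deviates sharply from recent history
    (an illustrative version of the file-size alert condition)."""
    if len(history) < 2:
        return False                         # not enough data to judge
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_size != mean              # any change is anomalous
    return abs(new_size - mean) / stdev > z_threshold
```

A check like this could run inside the Azure Function or the Databricks notebook and feed the ADF alert, rather than relying on a fixed size threshold.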