The AutoML application you will be building today has fewer than 200 lines of code, 171 lines to be exact.

1.1. Technology stack

The web application is built in Python using the following libraries:

  • streamlit – the web framework
  • pandas – handles data frames
  • numpy – numerical data processing
  • base64 – encodes data for download
  • plotly – creates the interactive 3D plot
  • scikit-learn – builds the machine learning model and performs hyperparameter optimization

1.2. User interface

The web application has a simple interface consisting of two panels: (1) the left panel accepts the input CSV data and the parameter settings, while (2) the main panel displays the results, consisting of the input dataset's data frame, the model performance metrics, the best parameters from hyperparameter tuning, and a 3D contour plot of the tuned hyperparameters.

Screenshot of AutoML.

1.3. Introduction to AutoML

Let’s take a look at the web app in the two screenshots below so you can get a feel for the app you’re building.

1.3.1. An AutoML application using an example data set

The easiest way to try out the web application is to use the provided example dataset: click the Press to use Example Dataset button in the main panel, which loads the Diabetes dataset as the example file.

Screenshot of an AutoML application using an example data set.

1.3.2. An AutoML application that uses uploaded CSV data

Alternatively, you can also upload your own CSV dataset, either by dragging and dropping the file directly into the upload box (as shown in the screenshot below) or by clicking the Browse files button and selecting the input file to upload.

Screenshot of an AutoML application using a CSV input file.

In both of the above screenshots, once either the example file or an uploaded CSV dataset is provided, the application prints the dataset's data frame, automatically builds multiple machine learning models using the supplied learning parameters for hyperparameter optimization, and then prints the model performance metrics. Finally, the interactive 3D contour plot of the tuned hyperparameters is shown at the bottom of the main panel.

You can also test run the application by clicking the following link:

Let’s now dive into the internal workings of the AutoML application. As you can see, the entire application uses only 171 lines of code.

Note that all comments in the code (on lines containing the hash symbol #) are used to improve the readability of the code by documenting what each block of code does.

Lines 1-10

Import the necessary libraries that consist of streamlit, pandas, numpy, base64, plotly and scikit-learn.
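For reference, here is a minimal sketch of this import block; it includes everything used in the code sketches later in this article, so the exact list may differ slightly from the ml-opt-app.py file.

import streamlit as st
import pandas as pd
import numpy as np
import base64
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.datasets import load_diabetes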

Lines 15-16

The st.set_page_config() function allows us to specify the title of the web page via the page_title='The Machine Learning Hyperparameter Optimization App' input argument, as well as set the page layout to full-width mode via the layout='wide' input argument.
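In code, this amounts to a single call (a sketch, with the argument values taken verbatim from the text above):

st.set_page_config(page_title='The Machine Learning Hyperparameter Optimization App',
    layout='wide')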

Lines 19-25

Here we use the st.write() function in conjunction with Markdown syntax. On line 20 we write the title text of the web page by placing the # tag in front of the title text The Machine Learning Hyperparameter Optimization App. In the following lines, we write a description of the web application.
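A minimal sketch of this block, where the description text after the title line is illustrative:

st.write("""
# The Machine Learning Hyperparameter Optimization App

In this app, the hyperparameters of a random forest model are tuned
via the GridSearchCV() function.
""")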

Lines 29–58

These blocks of code create the left-panel input widgets that accept the user-uploaded CSV data and the model parameters; a sketch of these widgets appears after the list below.

  • Lines 29–33 – Line 29 prints the header text for the left sidebar panel via the st.sidebar.header() function, where the sidebar part of the function name dictates that the widget should be placed in the left sidebar. Line 30 accepts the user-uploaded CSV data via the st.sidebar.file_uploader() function. As we can see, there are 2 input arguments: the first is the widget label text Upload your input CSV file, while the second input argument type=["csv"] restricts acceptance to CSV files only. Lines 31-33 print a link to the example dataset in Markdown syntax via the st.sidebar.markdown() function.
  • Line 36 – Prints the header text Set Parameters via the st.sidebar.header() function.
  • Line 37 – Displays a slider bar via the st.sidebar.slider() function, which allows the user to set the data split ratio simply by adjusting the slider. The first input argument prints the widget label text Data split ratio (% for Training Set), while the next 4 values represent the minimum value, the maximum value, the default value, and the step size. Finally, the selected value is assigned to the split_size variable.
  • Lines 39-47 display the input widgets for the learning parameters, while lines 49-54 display the input widgets for the general parameters. As explained for line 37, these lines of code also use st.sidebar.slider() as the input widget for accepting user-specified values for the model parameters. Lines 56-58 combine the user-specified slider values into a grid format, which then serves as the input to the GridSearchCV() function responsible for hyperparameter tuning.
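Here is a minimal sketch of these sidebar widgets. The widget labels, value ranges, and defaults shown here are illustrative assumptions, and the example-dataset link is a placeholder:

st.sidebar.header('Upload your CSV data')
uploaded_file = st.sidebar.file_uploader("Upload your input CSV file", type=["csv"])
# Link to the example dataset (URL omitted here; see lines 31-33 of the app)
st.sidebar.markdown("[Example CSV input file](#)")

st.sidebar.header('Set Parameters')
split_size = st.sidebar.slider('Data split ratio (% for Training Set)', 10, 90, 80, 5)

st.sidebar.subheader('Learning Parameters')
# Passing a (min, max) tuple as the default value creates a range slider
parameter_n_estimators = st.sidebar.slider('Number of estimators (n_estimators)', 0, 500, (10, 50), 50)
parameter_n_estimators_step = st.sidebar.slider('Step size for n_estimators', 10, 50, 10, 10)
parameter_max_features = st.sidebar.slider('Max features (max_features)', 1, 50, (1, 3), 1)

st.sidebar.subheader('General Parameters')
parameter_random_state = st.sidebar.slider('Seed number (random_state)', 0, 1000, 42, 1)
parameter_min_samples_split = st.sidebar.slider('Minimum number of samples required to split an internal node (min_samples_split)', 2, 10, 2, 1)
parameter_min_samples_leaf = st.sidebar.slider('Minimum number of samples required to be at a leaf node (min_samples_leaf)', 1, 10, 2, 1)

# Combine the user-defined slider values into the grid searched by GridSearchCV() (lines 56-58)
n_estimators_range = np.arange(parameter_n_estimators[0],
    parameter_n_estimators[1] + parameter_n_estimators_step,
    parameter_n_estimators_step)
max_features_range = np.arange(parameter_max_features[0], parameter_max_features[1] + 1, 1)
param_grid = dict(max_features=max_features_range, n_estimators=n_estimators_range)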

Line 64

The Dataset subheader text is added above the input data frame via the st.subheader() function.

Lines 69-73

This block of code uses the base64 library to encode the model performance results so that they can be downloaded as a CSV file.
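A minimal sketch of this helper, where the download filename model_performance.csv is an assumption:

def filedownload(df):
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode()).decode()  # string <-> bytes conversion
    # The filename below is illustrative
    href = f'<a href="data:file/csv;base64,{b64}" download="model_performance.csv">Download CSV File</a>'
    return href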

Lines 75–153

At a high level, this block of code is the build_model() custom function, which takes the input data and performs model building and hyperparameter tuning using the user-specified parameters. A consolidated sketch of this function appears after the list below.

  • Lines 76-77 – The input data frame is separated into the X variable (dropping the last column, which is the Y variable) and the Y variable (selecting only the last column).
  • Line 79 – Here we inform the user via the st.markdown() function that the model is being built. Then, in line 80, the column name of the Y variable is printed inside an info box via the st.info() function.
  • Line 83 – Data splitting is performed via the train_test_split() function, using the X and Y variables as the input data, while the user-specified value for the split ratio comes from the split_size variable, which takes its value from the slider described for line 37.
  • Lines 87–95 – Instantiates the random forest model via the RandomForestRegressor() function and assigns it to the rf variable. As you can see, all of the model parameters defined in the RandomForestRegressor() function take their values from the user-defined input widgets described above for lines 29-58.
  • Lines 97-98 – Performs the hyperparameter tuning.
    → Line 97 – The random forest model stored in the rf variable above is given as the estimator input argument of the GridSearchCV() function, which performs the hyperparameter tuning. The range of hyperparameter values to explore during tuning is defined in the param_grid variable, which in turn takes its values from the user-specified slider values (lines 40-43) preprocessed into the param_grid variable (lines 56-58).
    → Line 98 – The hyperparameter tuning process begins, using X_train and Y_train as the input data.
  • Line 100 – Prints the Model Performance subheader text via the st.subheader() function. The following lines then print the model’s performance metrics.
  • Line 102 – The best model from the hyperparameter tuning process, stored in the grid variable, is used to make predictions on the X_test data.
  • Lines 103-104 – Prints the R2 score via the r2_score() function, with Y_test and Y_pred_test as the input arguments.
  • Lines 106-107 – Prints the MSE score via the mean_squared_error() function, with Y_test and Y_pred_test as the input arguments.
  • Lines 109-110 – Prints the best parameters, rounded to two decimal places. The best parameter values and score are obtained from the grid.best_params_ and grid.best_score_ variables.
  • Lines 112-113 – Line 112 prints the Model Parameters subheader via the st.subheader() function. Line 113 prints the model parameters stored in grid.get_params() via the st.write() function.
  • Lines 116-125 – The model performance metrics are obtained from grid.cv_results_ and reshaped into the x, y and z variables.
    → Line 116 – We selectively extract data from grid.cv_results_ to create a data frame containing the two-hyperparameter combinations and their corresponding performance metric, which in this case is the R2 score. Specifically, the pd.concat() function is used to combine the two hyperparameters (params) and the performance metric (mean_test_score).
    → Line 118 – Data reshaping is now performed to prepare the data in a format suitable for creating the contour plot. Specifically, the groupby() function from the pandas library is used to group the data frame by two columns (max_features and n_estimators), merging the contents of the first column (max_features).
  • Lines 120-122 – The data is now pivoted into an m × n matrix, so that the rows and columns correspond to max_features and n_estimators, respectively.
  • Lines 123-125 – Finally, the data is assigned to the x, y and z variables, which are then used to create the contour plot.
  • Lines 128-146 – These blocks of code create the interactive 3D contour plot from the x, y and z variables using the plotly library.
  • Lines 149-152 – The x, y and z variables are then combined into a df data frame.
  • Line 153 – The model performance results stored in the grid_results variable are made available for download via the filedownload() custom function (lines 69-73).
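Putting the steps above together, here is a consolidated sketch of the build_model() function. It reuses the variable names from the sidebar sketch earlier; details such as the exact set of fixed model parameters, the message texts, and the plot styling are illustrative assumptions:

def build_model(df):
    X = df.iloc[:, :-1]  # all columns except the last (the features)
    Y = df.iloc[:, -1]   # the last column only (the Y variable)

    st.markdown('A model is being built to predict the following **Y** variable:')
    st.info(Y.name)

    # Data splitting with the user-defined split ratio (line 83)
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=(100 - split_size) / 100)

    # Random forest model; fixed parameters come from the sidebar widgets (lines 87-95)
    rf = RandomForestRegressor(random_state=parameter_random_state,
        min_samples_split=parameter_min_samples_split,
        min_samples_leaf=parameter_min_samples_leaf)

    # Hyperparameter tuning over the user-defined grid (lines 97-98)
    grid = GridSearchCV(estimator=rf, param_grid=param_grid)
    grid.fit(X_train, Y_train)

    st.subheader('Model Performance')
    Y_pred_test = grid.predict(X_test)
    st.write('Coefficient of determination (R2):')
    st.info('%0.2f' % r2_score(Y_test, Y_pred_test))
    st.write('Error (MSE):')
    st.info('%0.2f' % mean_squared_error(Y_test, Y_pred_test))
    st.write('The best parameters are %s with a score of %0.2f'
             % (grid.best_params_, grid.best_score_))

    st.subheader('Model Parameters')
    st.write(grid.get_params())

    # Reshape grid.cv_results_ into x, y and z for the 3D plot (lines 116-125)
    grid_results = pd.concat([pd.DataFrame(grid.cv_results_['params']),
        pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['R2'])], axis=1)
    grid_contour = grid_results.groupby(['max_features', 'n_estimators']).mean()
    grid_pivot = grid_contour.reset_index().pivot(index='max_features', columns='n_estimators')
    x = grid_pivot.columns.levels[1].values  # n_estimators values
    y = grid_pivot.index.values              # max_features values
    z = grid_pivot.values                    # R2 scores as an m x n matrix

    # Interactive 3D plot of the tuned hyperparameters (lines 128-146)
    fig = go.Figure(data=[go.Surface(x=x, y=y, z=z)])
    fig.update_layout(scene=dict(xaxis_title='n_estimators',
        yaxis_title='max_features', zaxis_title='R2'))
    st.plotly_chart(fig)

    # Offer the performance results for download via filedownload() (line 153)
    st.markdown(filedownload(grid_results), unsafe_allow_html=True)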

Lines 156-171

At a high level, these blocks of code handle the logic of the application. They consist of two code blocks: the if code block (lines 156-159) and the else code block (lines 160-171). Whenever the web application loads, it runs the else code block by default, while the if code block is activated when an input CSV file is uploaded.

For both code blocks the logic is the same; the only difference is the contents of the df data frame (whether it comes from the uploaded CSV data or from the example data). Next, the contents of the df data frame are displayed via the st.write() function. Finally, the model building process is started via the build_model() custom function, as shown in the sketch below.
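A sketch of this logic, where the example branch loads the Diabetes dataset as described earlier (the st.info message text and the column name 'response' are illustrative):

if uploaded_file is not None:
    df = pd.read_csv(uploaded_file)
    st.write(df)
    build_model(df)
else:
    st.info('Awaiting CSV file to be uploaded.')
    if st.button('Press to use Example Dataset'):
        diabetes = load_diabetes()
        X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
        Y = pd.Series(diabetes.target, name='response')  # column name is illustrative
        df = pd.concat([X, Y], axis=1)
        st.markdown('The Diabetes dataset is used as the example.')
        st.write(df.head(5))
        build_model(df)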

Now that we’ve coded the app, let’s move on to launching it.

3.1. Create a conda environment

Let’s start by creating a new conda environment (to ensure code reproducibility).

First, create a new conda environment called automl as follows in the terminal command line:

conda create -n automl python=3.7.9

Second, we activate the automl environment:

conda activate automl

3.2. Install the required libraries

First, download the requirements.txt file:

wget https://raw.githubusercontent.com/dataprofessor/ml-opt-app/main/requirements.txt

Second, install the libraries as shown below:

pip install -r requirements.txt

3.3. Download application files

You can either download the web application files hosted in the Data Professor GitHub repo, or use the 171 lines of code above.

wget https://github.com/dataprofessor/ml-opt-app/archive/main.zip

Next, extract the contents of the file:

unzip main.zip

Now enter the main directory via the cd command:

cd main

Now that you’re inside the main directory, you should be able to see the ml-opt-app.py file.

3.4. Launch the web application

Launch the application by typing the following command in the terminal (that is, make sure the ml-opt-app.py file is in the current working directory):

streamlit run ml-opt-app.py

After a few seconds, the following message should appear in the terminal:

> streamlit run ml-opt-app.py

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://10.0.0.11:8501

Finally, the browser should open and the application will appear.

Screenshot of a locally launched AutoML application.

You can also test the AutoML app via the following link:

Now that you have created an AutoML application as described in this article, what’s next? You could modify the application to use another machine learning algorithm. Additional features, such as a feature importance plot, could also be added to the application. The possibilities are endless, and customizing the app is great fun! Drop a comment about how you have customized the app for your own projects.
