To illustrate the process, we use a stroke prediction dataset created by fedesoriano on Kaggle.

The data describe people with or without a stroke, along with indicators related to the disease. This is a supervised learning problem because each record carries a label that says whether or not the person has a stroke.

You can access the dataset here.

Captured by the author

Let’s import the data now. We can use the pandas library to load and process the dataset. Here is the code for importing and previewing the data:
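The original code is not reproduced here, so the following is a minimal sketch of the import step. It assumes a few hypothetical sample rows held in a string instead of the downloaded CSV file; the column names follow the ones used in this article, and the values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for the downloaded CSV file.
# With the real file, you would pass its path to pd.read_csv instead.
csv_text = """gender,age,hypertension,heart_disease,ever_married,work_type,residence_type,smoking_status,stroke
Male,67,0,1,Yes,Private,Urban,formerly smoked,1
Female,61,0,0,Yes,Self-employed,Rural,never smoked,1
Female,49,0,0,Yes,Private,Urban,smokes,1
Male,8,0,0,No,children,Rural,Unknown,0
"""

# Load the data into a data frame and preview the first rows.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```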

As you can see from the preview above, some columns are already in numeric format, such as the age, hypertension, and heart disease columns. Other columns are not numeric, such as the work type, gender, residence type, and smoking status columns.

For now, we focus on the non-numeric columns. Let’s split the data frame according to data type. Here is the code and its result:
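One way to split a data frame by data type is the select_dtypes method. The sketch below assumes a small hypothetical data frame; the name df_numeric is an assumption for illustration, while df_categorical matches the variable used later in this article.

```python
import pandas as pd

# Hypothetical sample data with the same kind of columns as the dataset.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "age": [67, 61, 26],
    "hypertension": [0, 0, 0],
    "ever_married": ["Yes", "Yes", "No"],
    "residence_type": ["Urban", "Rural", "Rural"],
})

# Keep the numeric columns in one frame and everything else in another.
df_numeric = df.select_dtypes(include="number")
df_categorical = df.select_dtypes(exclude="number")

print(df_categorical.head())
```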

Once we have split the data frame, we check the distinct values in each column. You can use the .unique method to retrieve the distinct values of a column. Here is the code and a preview of the result:

In the results above, two columns have two unique values, and three columns have more than two. Why does the number of unique values matter? Because we encode the columns differently depending on it.

For a column with two unique values, we can encode the column directly with label encoding. For a column with more than two unique values, we use one-hot encoding.
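As a rough sketch of this check, the loop below prints the distinct values of each categorical column. The sample values are hypothetical, and the nunique call is an addition that counts the distinct values per column.

```python
import pandas as pd

# Hypothetical categorical columns for illustration.
df_categorical = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female"],
    "ever_married": ["Yes", "Yes", "No", "No"],
    "smoking_status": ["smokes", "never smoked", "Unknown", "formerly smoked"],
})

# Print the distinct values found in each column.
for column in df_categorical.columns:
    print(column, df_categorical[column].unique())

# Count the distinct values per column to decide how to encode each one.
unique_counts = df_categorical.nunique()
print(unique_counts)
```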

Encode labels with label encoding

Now that we know the properties of each column, we can encode them. First, we encode the columns with two unique values: ever_married and residence_type. To do this, we can use the LabelEncoder object from scikit-learn.

Let’s now take the ever_married column. First, initialize the LabelEncoder object as follows:

We can then fit the object to the data like this:

Now we can convert the column to numeric format as follows:

Here is the result of running the code, showing the column before encoding, after encoding, and after inverting the encoding:

Well done! We have encoded the first column. Now let’s encode the next column, which is the residence_type column.

Recall that the previous code ran the fitting and transformation steps separately. In fact, we can combine them into one step with the .fit_transform method.
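The initialize, fit, and transform steps described above can be sketched as follows. The sample values and the variable names (le, encoded, decoded) are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for the ever_married column.
ever_married = pd.Series(["Yes", "No", "Yes", "Yes"], name="ever_married")

# Initialize the encoder, then learn the distinct labels from the column.
le = LabelEncoder()
le.fit(ever_married)

# Convert the labels to integers, then map the integers back to labels.
encoded = le.transform(ever_married)
decoded = le.inverse_transform(encoded)

print(ever_married.tolist())  # before encoding
print(encoded)                # after encoding
print(decoded)                # after inverting the encoding
```

LabelEncoder sorts the labels before numbering them, so in this sample "No" maps to 0 and "Yes" maps to 1.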

Here is the code and its results:
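A minimal sketch of the combined step, assuming hypothetical values for the residence_type column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for the residence_type column.
residence_type = pd.Series(["Urban", "Rural", "Rural", "Urban"], name="residence_type")

# fit_transform learns the labels and encodes the column in a single call.
le = LabelEncoder()
encoded = le.fit_transform(residence_type)

print(residence_type.tolist())
print(encoded)
```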

Encode labels with one-hot encoding

Nice! Let’s move on to the columns with more than two unique values. There are three of them: the gender, work_type, and smoking_status columns. To process these columns, we use a technique called one-hot encoding.

What is one-hot encoding? This process encodes a column by converting it into a matrix, where each matrix column represents one distinct value of the original column and each cell indicates whether or not the row has that value.

Here is an illustration of the one-hot encoding process:

Illustrated by the author

To accomplish this, we use the scikit-learn OneHotEncoder object to encode these columns.

Let’s take the gender column now. First, we initialize the OneHotEncoder object like this:

Next, we can use the .fit_transform method to fit and transform the data in one step. Here is the code:

Oops, we get an error. If we read the message, it is a ValueError: the function expects a 2-dimensional array as input.

To change the shape of a column, we can use the .reshape method. But we must first convert the column to a NumPy array. To do this, we can wrap the column with the np.array function.

Let’s repeat the process! Here is the code and the result before encoding, after encoding, and after inverting the one-hot encoding:

It works! Now let’s apply the same process to the other columns, smoking_status and work_type. Here is the code and the results of the one-hot encoding process:
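To make the shape change concrete, here is a small sketch assuming hypothetical values for the gender column. The name gender_2d is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values for the gender column.
gender = pd.Series(["Male", "Female", "Other", "Female"], name="gender")

# Wrap the column in a NumPy array, then reshape it to one column and
# as many rows as needed (-1 lets NumPy infer the row count).
gender_2d = np.array(gender).reshape(-1, 1)

print(gender.shape)     # 1-dimensional: one entry per row
print(gender_2d.shape)  # 2-dimensional: rows by one column
```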
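The full one-hot encoding step might look like the sketch below. The sample values and variable names are assumptions; the call to .toarray() is added here because OneHotEncoder returns a sparse matrix by default, which is harder to read when printed.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical values for the gender column, reshaped to 2 dimensions.
gender = pd.Series(["Male", "Female", "Other", "Female"], name="gender")
gender_2d = np.array(gender).reshape(-1, 1)

# Fit and transform in one call, then convert the sparse result to a
# dense array so the 0/1 matrix can be inspected directly.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(gender_2d).toarray()

# Map the one-hot matrix back to the original labels.
decoded = ohe.inverse_transform(encoded)

print(gender.tolist())  # before encoding
print(encoded)          # after encoding: one column per distinct value
print(decoded.ravel())  # after inverting the encoding
```

The matrix columns follow the sorted order of the distinct values, which is available in ohe.categories_.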

Well done! Now you have encoded all the columns.

Create an encoded data frame

Once we have encoded these columns, we can create a data frame from them. We initialize a DataFrame object for each encoded column, then combine them into a single data frame with the pd.concat function. Here is the code and the results of doing so:
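As a sketch of this step, the code below encodes one two-valued column and one multi-valued column, wraps each result in a DataFrame, and concatenates them. The sample data, the variable names, and the gender_ prefix for the one-hot column names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical data.
df_categorical = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "ever_married": ["Yes", "Yes", "No"],
})

# Label-encode the two-valued column and wrap it in a data frame.
married_encoded = LabelEncoder().fit_transform(df_categorical["ever_married"])
df_married = pd.DataFrame({"ever_married": married_encoded})

# One-hot encode the multi-valued column and name each matrix column
# after the distinct value it represents.
ohe = OneHotEncoder()
gender_2d = np.array(df_categorical["gender"]).reshape(-1, 1)
gender_encoded = ohe.fit_transform(gender_2d).toarray()
gender_columns = [f"gender_{value}" for value in ohe.categories_[0]]
df_gender = pd.DataFrame(gender_encoded, columns=gender_columns)

# Place the encoded frames side by side.
df_encoded = pd.concat([df_married, df_gender], axis=1)
print(df_encoded)
```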

Combine with numeric columns in a data frame

Great! The categorical columns are now in data frame format. Now let’s combine them with the numeric columns. Here is the code and the result:

A trick to shorten the process

Wow, that was a long process. In fact, there is a trick that lets you do all of this with one line of code: the get_dummies function from the pandas library.
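A minimal sketch of the combination, assuming two small hypothetical frames that share the same row order. The names df_numeric, df_encoded, and df_final are assumptions for illustration.

```python
import pandas as pd

# Hypothetical numeric columns.
df_numeric = pd.DataFrame({
    "age": [67, 61, 26],
    "hypertension": [0, 0, 0],
})

# Hypothetical encoded categorical columns for the same three rows.
df_encoded = pd.DataFrame({
    "ever_married": [1, 1, 0],
    "gender_Female": [0.0, 1.0, 0.0],
    "gender_Male": [1.0, 0.0, 0.0],
    "gender_Other": [0.0, 0.0, 1.0],
})

# axis=1 joins the frames column-wise, matching rows by index.
df_final = pd.concat([df_numeric, df_encoded], axis=1)
print(df_final)
```

Because pd.concat aligns rows by index, both frames need matching indexes; if they differ, calling reset_index(drop=True) on each frame first is one way to line them up.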

Recall the df_categorical variable, which contains all the categorical columns from the data frame. Here is the code that encodes the data frame and its result:

Now combine them with the numeric columns:

Simple, isn’t it? If you are short on time, the get_dummies function will help you right away!
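The shortcut can be sketched as follows, assuming a small hypothetical data frame. The names df_numeric, df_dummies, and df_final are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "age": [67, 61, 26],
    "hypertension": [0, 0, 0],
    "ever_married": ["Yes", "Yes", "No"],
})

# Split by data type, as earlier in the article.
df_numeric = df.select_dtypes(include="number")
df_categorical = df.select_dtypes(exclude="number")

# One call encodes every categorical column.
df_dummies = pd.get_dummies(df_categorical)

# Combine the encoded columns with the numeric ones.
df_final = pd.concat([df_numeric, df_dummies], axis=1)
print(df_final)
```

Note that get_dummies one-hot encodes every categorical column, so a two-valued column such as ever_married becomes two columns rather than the single 0/1 column that LabelEncoder produces. Passing drop_first=True is one way to drop the redundant column.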

