The most time-consuming part of a data analysis or data science task is preparing and configuring the data correctly. A model performs only as well as the data it is given, and many transformations may be needed before a data set is ready for model training. Over the years, I have put together a Notion page that highlights many of the common tasks practitioners need to perform during data preparation. I have listed a few of the examples below, and the full set can be found on the linked page. I will continue to expand that page, as my learning journey continues, with other common functions that I use repeatedly during EDA or Feature Engineering.
Note: All of these examples are in Python and mainly use the Pandas, NumPy, and scikit-learn libraries. Matplotlib or Seaborn is used for visualization.
- Checking for missing values in the DataFrame
- Dropping a column
- Applying a function to a column
- Plotting the value counts of a column
- Sort the DataFrame by column values
- Drop rows based on a column value
- Ordinal encoding
- Encoding a DataFrame with all categorical variables
- Additional resources
Chaining the pandas functions isnull() and sum() gives a count of the missing values in each column of the data set.
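As a minimal sketch of the missing-value check described above, the snippet below builds a small toy DataFrame and counts the missing cells per column. The column names and values are made up for illustration and are not from any real data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Austin", "Boston", None, None],
    "income": [52000, 61000, 58000, 75000],
})

# isnull() marks each missing cell as True; sum() then counts the Trues per column.
missing_counts = df.isnull().sum()
print(missing_counts)
```

Columns with a count of zero have no missing values, so the result doubles as a quick list of which columns need imputation or removal.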
To drop a column, use the pandas drop() function with the name of the selected column. To drop several columns, pass a list containing their names.
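The sketch below shows both cases, dropping one column and dropping several, on a hypothetical DataFrame whose column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Ann", "Ben", "Cara"],
    "score": [88, 92, 79],
    "notes": ["", "late", ""],
})

# Drop a single column. drop() returns a new DataFrame and leaves df unchanged.
df_one = df.drop(columns=["notes"])

# Drop several columns by passing all of their names in the list.
df_many = df.drop(columns=["id", "notes"])

print(df_one.columns.tolist())
print(df_many.columns.tolist())
```

Because drop() returns a copy by default, assign the result back to a variable if you want to keep the change.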
Many feature engineering tasks require encoding or transforming data, which can be done with ordinary Python functions. With the pandas apply() function, you can run a function you have written over an entire column, either to create a new column or to transform the selected one.
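Here is one way this could look, using two small helper functions that are hypothetical and written only for this example: one creates a new column, and the other overwrites an existing column.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "city": [" austin", "Boston ", "DENVER"],
    "temp_f": [32.0, 68.0, 212.0],
})

def fahrenheit_to_celsius(temp_f):
    """Convert one Fahrenheit reading to Celsius."""
    return (temp_f - 32) * 5 / 9

def clean_city(name):
    """Strip stray whitespace and normalize capitalization."""
    return name.strip().title()

# Create a new column from an existing one.
df["temp_c"] = df["temp_f"].apply(fahrenheit_to_celsius)

# Transform a column in place by assigning the result back to it.
df["city"] = df["city"].apply(clean_city)

print(df)
```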
A common task in feature engineering is to understand how balanced the data set is. For example, in a binary classification problem, if one class makes up nearly 90% of the data points and the other only 10%, the model will tend to predict the first class most of the time. To catch this early, it helps to visualize the counts of the response variable in particular. The pandas value_counts() function returns the number of occurrences of each value in a column, and the plot() function then lets you visualize those counts as a bar chart.
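The snippet below sketches this check on a made-up binary label with the 90/10 split from the example above. The column name "label" is an assumption; plotting requires Matplotlib to be installed.

```python
import pandas as pd

# Hypothetical imbalanced binary target: nine 0s and one 1.
df = pd.DataFrame({"label": [0] * 9 + [1]})

# Count how often each class occurs.
class_counts = df["label"].value_counts()

# normalize=True returns proportions instead of raw counts.
class_share = df["label"].value_counts(normalize=True)

# Draw the counts as a bar chart; plot() returns a Matplotlib Axes object.
ax = class_counts.plot(kind="bar")
ax.set_xlabel("label")
ax.set_ylabel("count")

print(class_counts)
print(class_share)
```

In a notebook the chart renders automatically; in a plain script you would typically call matplotlib.pyplot.show() afterwards.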
Sometimes, for data analysis, you want to view the rows ordered by a particular column, or by several columns at once. The sort_values() function does this for your DataFrame.
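A short sketch of both single-column and multi-column sorting follows; the data and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "dept": ["ops", "eng", "eng", "ops"],
    "salary": [95, 90, 70, 65],
})

# Sort by one column, highest value first.
by_salary = df.sort_values(by="salary", ascending=False)

# Sort by several columns: dept A-Z, then salary high-to-low within each dept.
by_dept_salary = df.sort_values(by=["dept", "salary"], ascending=[True, False])

print(by_salary)
print(by_dept_salary)
```

When sorting by multiple columns, the ascending list is matched to the by list position by position.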
If you ever want to drop a subset of rows based on the values in a certain column, you can do so by capturing the index of that specific set of rows. Once you have this index, you can use the drop() function to drop the rows you have identified.
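The index-then-drop approach described above can be sketched as follows. The condition used here (rows with zero units) and the column names are assumptions picked for the example.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "product": ["pen", "ink", "pad", "cap"],
    "units": [12, 0, 7, 0],
})

# Capture the index labels of the rows that match the condition.
zero_idx = df[df["units"] == 0].index

# Drop those rows by index label; the original df is left unchanged.
df_clean = df.drop(zero_idx)

# An equivalent alternative: keep only the rows that fail the condition.
df_clean_alt = df[df["units"] != 0]

print(df_clean)
```

Capturing the index first is handy when you want to inspect or log the rows before removing them; the boolean-mask form is shorter when you do not.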
Ordinal encoding is one of many ways to encode categorical data. There are several other encoding methods, such as one-hot encoding, that I have linked to here. Ordinal encoding is used when you want to preserve the order of a categorical variable, that is, when the values in your column follow a natural order.
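Below is a minimal sketch of ordinal encoding done two ways: with a plain mapping dictionary and with scikit-learn's OrdinalEncoder. The "size" column and its small/medium/large ordering are assumptions made for the example, and this is not necessarily how the original snippet was written.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data with a naturally ordered category.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# State the order explicitly, from lowest to highest.
size_order = ["small", "medium", "large"]

# Option 1: build a label -> rank mapping and apply it with map().
size_map = {label: rank for rank, label in enumerate(size_order)}
df["size_encoded"] = df["size"].map(size_map)

# Option 2: OrdinalEncoder with the same explicit category order.
# It expects 2D input and returns a 2D float array, hence ravel().
encoder = OrdinalEncoder(categories=[size_order])
df["size_encoded_sk"] = encoder.fit_transform(df[["size"]]).ravel()

print(df)
```

Passing the order explicitly matters: without it, OrdinalEncoder ranks categories alphabetically, which would put "large" before "medium".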
If your data set has only categorical columns, you may want to create a pipeline or function to encode the entire data set. Note that before you use such a function, you should identify whether order matters for each column you are working with.
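One possible shape for such a function is sketched below. The helper encode_categoricals and its ordinal_orders parameter are hypothetical names introduced for this example: columns listed in ordinal_orders are rank-encoded in the given order, and every other column is treated as unordered and one-hot encoded with pd.get_dummies().

```python
import pandas as pd

def encode_categoricals(df, ordinal_orders):
    """Encode an all-categorical DataFrame (hypothetical helper).

    ordinal_orders maps a column name to its categories, lowest to highest.
    Columns not listed there are assumed unordered and are one-hot encoded.
    """
    encoded = df.copy()

    # Ordered columns: replace each label with its rank in the given order.
    for col, order in ordinal_orders.items():
        mapping = {label: rank for rank, label in enumerate(order)}
        encoded[col] = encoded[col].map(mapping)

    # Unordered columns: one indicator column per category.
    nominal_cols = [c for c in df.columns if c not in ordinal_orders]
    if not nominal_cols:
        return encoded
    return pd.get_dummies(encoded, columns=nominal_cols)

# Hypothetical toy data: "size" has an order, "color" does not.
df = pd.DataFrame({
    "size": ["small", "large", "medium"],
    "color": ["red", "blue", "red"],
})

result = encode_categoricals(df, {"size": ["small", "medium", "large"]})
print(result)
```

Keeping the order decision in an explicit argument forces you to decide, column by column, whether order matters before anything is encoded.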
– – –
Data wrangling is essential for preparing data before it is used to train a model or fed to one. Python libraries such as Pandas, NumPy, and scikit-learn make it easy to manipulate and transform data as needed. With so many new ML algorithms entering the field, it is still essential to understand how to prepare data for the model you are using, whether that is a traditional model such as logistic regression or a model in a domain such as NLP.
I hope some of these examples have been helpful and will save time for anyone performing EDA or Feature Engineering on their own data sets. See the Notion link for all the other examples I have documented; it will continue to be updated. I have also included other Feature Engineering resources and cheatsheets that I have found useful above. Feel free to connect with me on LinkedIn or follow me on Medium for more of my writing. Share your thoughts or feedback, and thanks for reading!