2️⃣. Column mismatch between the train and test sets

When we want to train a model on the given employee information, we first split the dataset into train and test sets, holding out the test set so that the model never sees it during training.

from sklearn.model_selection import train_test_split
# Separate the features from the target column
X = df.drop('MonthlyIncome', axis=1)
y = df['MonthlyIncome']
# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

The next step is to encode the categorical variables in the training set and the test set.

pd.get_dummies(X_train)

As expected, both the Gender and EducationField attributes are encoded as numeric columns. We now apply the same process to the test data.

pd.get_dummies(X_test)

Wait! There is a column mismatch between the training and test sets. The number of columns in the training set is not the same as in the test set, which happens when a category appears in one split but not in the other, and this throws an error in the modeling process.
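To see the mismatch in isolation, here is a minimal sketch on made-up data. The toy_train and toy_test frames and their values are hypothetical, not the employee dataset used above; the only point is that a category present in one split and absent in the other yields a different set of dummy columns.

```python
import pandas as pd

# Hypothetical toy data: 'Other' appears only in the training rows.
toy_train = pd.DataFrame({
    "Age": [29, 41, 35],
    "Gender": ["Female", "Male", "Other"],
})
toy_test = pd.DataFrame({
    "Age": [52, 23],
    "Gender": ["Male", "Female"],
})

# Encode each split independently, as in the article
train_cols = pd.get_dummies(toy_train).columns
test_cols = pd.get_dummies(toy_test).columns

# Columns produced for the training rows but not for the test rows
missing_in_test = train_cols.difference(test_cols).tolist()
print(missing_in_test)
```

Comparing the two column indexes like this before fitting a model is a cheap way to catch the problem early.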

Solution 1: Handle unknown categories using .reindex() and .fillna()

One way to correct this column mismatch is to store the columns obtained after encoding the training set in a list. Then encode the test set in the usual way and align it to the columns of the encoded training set. Let's understand it with code:

# Dummy encoding Training set
X_train_encoded = pd.get_dummies(X_train)
# Saving the columns in a list
cols = X_train_encoded.columns.tolist()
# Viewing the first three rows of the encoded dataframe
X_train_encoded[:3]

Now encode the test set, then realign its columns to the training columns and fill all the missing values with zero.

X_test_encoded = pd.get_dummies(X_test)
X_test_encoded = X_test_encoded.reindex(columns=cols).fillna(0)
X_test_encoded

As you can see, both datasets now have the same number of columns.
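The same alignment can be tried end to end on a small example. This is only a sketch under assumptions: the toy frames are hypothetical, and it passes fill_value=0 to reindex instead of chaining .fillna(0), which should give the same zeros for the added columns. It also shows a side effect worth knowing: a category that exists only in the test set (here the made-up value 'Law') is dropped by the reindex.

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset from the article.
toy_train = pd.DataFrame({
    "Age": [29, 41, 35],
    "EducationField": ["Medical", "Marketing", "Medical"],
})
toy_test = pd.DataFrame({
    "Age": [52, 23],
    "EducationField": ["Medical", "Law"],
})

# Encode the training rows and remember their columns
train_encoded = pd.get_dummies(toy_train)
cols = train_encoded.columns.tolist()

# reindex adds the training columns the test set lacks (filled with 0)
# and drops test-only columns such as EducationField_Law.
test_encoded = pd.get_dummies(toy_test).reindex(columns=cols, fill_value=0)

# True when the test columns match the training columns, in the same order
print(test_encoded.columns.tolist() == cols)
```

Because the column list comes from the training set only, the test set never influences which features the model sees.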
