This project uses various machine learning techniques to predict house prices using a Kaggle dataset with 79 explanatory variables. The data cover sales of individual residential properties in Ames, Iowa from 2006 to 2010. The aim of the project is to predict the prices of 1,459 homes as accurately as possible and to identify the features that have the greatest impact on the sale price.

Part 1: Data retrieval, cleaning and pre-processing

The first step was to read the documentation and put the data in context. Real estate markets are cyclical, and the sales described in this dataset took place in the midst of the 2008 housing crisis. The following box plot shows sale prices broken down by month and year:

Interestingly, there is very little evidence here of a price bubble or collapse, and the seasonality is generally very small. We can move on to further analysis.

In the following plot, we see that the overall quality shows a strong linear correlation with the selling price.

The above-ground living area also has a roughly linear relationship with the sale price, as shown in the scatter plot below. The documentation states that there are five outliers in the data, three of which are partial sales and two of which are unusually large. I removed the two points circled in orange because they deviated substantially from the rest of the data.

Feature selection (and deselection)

Now that I have established that some variables have a linear relationship with the sale price, I know that linear regression is a good candidate model. The next step is to reduce the dimensionality of the data set. I looked at the correlation heat map to identify variables that correlate strongly with each other and dropped as many redundant columns as possible.

For example, the “GarageCars” feature correlates strongly with “GarageArea”, so we can drop it because it doesn’t provide any new information. I dropped “GarageYrBlt”, “TotRmsAbvGrd”, “1stFlrSF” and “2ndFlrSF” for similar reasons.
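The redundancy check described above can be sketched as follows. This is a minimal illustration on synthetic data, not the code used in the project: the helper `high_corr_pairs`, the 0.8 threshold and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, |corr|) for numeric column pairs at or above the threshold."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            if corr.iloc[i, j] >= threshold:
                pairs.append((cols[i], cols[j], float(corr.iloc[i, j])))
    return pairs


# Synthetic stand-in for the real data: GarageArea is built from GarageCars plus noise.
rng = np.random.default_rng(0)
garage_cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 30, size=200),
    "LotArea": rng.normal(10_000, 2_000, size=200),
})

pairs = high_corr_pairs(toy)
print(pairs)

# Keep one column of each redundant pair (which one to keep is a judgment call).
reduced = toy.drop(columns=["GarageCars"])
```

A list of pairs like this is a convenient complement to the heat map when there are many columns to scan.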

Missing value imputation

Another important issue was that 29 columns in the data set contain missing values. Of these, 20 relate to optional “bonus features” of a house, such as a pool, fireplace, or fence. I imputed zeros or “None” for these.

The 8 remaining columns with missing values were categorical attributes dominated by one value (e.g., most “SaleType” values were “Normal”). I imputed the mode for these columns. Then I dropped “Utilities” because all values were the same except for one.

Finally, the last column requiring imputation was “LotFrontage”, where 16.7% of the values were missing. I considered dropping the variable, but I figured that the frontage of the lot could greatly affect the curb appeal of the home, so I decided to impute it from the neighborhood values instead.
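The three imputation strategies above could look roughly like this in pandas. The toy rows are invented for illustration, and using the neighborhood median for “LotFrontage” is an assumption; the post does not say which neighborhood statistic was used.

```python
import numpy as np
import pandas as pd

# Tiny invented sample; column names follow the Kaggle data set.
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd", np.nan, np.nan],          # optional feature, categorical
    "GarageArea": [400.0, np.nan, 250.0, 300.0],       # optional feature, numeric
    "SaleType": ["WD", np.nan, "WD", "New"],           # dominated by one value
    "Neighborhood": ["NAmes", "NAmes", "Edwards", "Edwards"],
    "LotFrontage": [60.0, np.nan, 50.0, np.nan],
})

df = toy.copy()

# 1. Optional features: a missing value means the house does not have the feature.
df["PoolQC"] = df["PoolQC"].fillna("None")
df["GarageArea"] = df["GarageArea"].fillna(0)

# 2. Categorical columns dominated by one value: fill with the mode.
df["SaleType"] = df["SaleType"].fillna(df["SaleType"].mode()[0])

# 3. LotFrontage: fill from houses in the same neighborhood (median is an assumption).
neighborhood_median = df.groupby("Neighborhood")["LotFrontage"].transform("median")
df["LotFrontage"] = df["LotFrontage"].fillna(neighborhood_median)

print(df)
```

Grouping by neighborhood keeps the imputed frontage close to what is typical for comparable lots, rather than pulling every missing value toward a city-wide figure.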

Feature engineering

I created the following new features:

  • House age: derived from the year built, for better interpretability
  • YearsSinceRemod: derived from the remodel year, for better interpretability
  • Total bathrooms: combines the 4 columns for full baths, half baths and basement baths
  • PorchSF: combines the square footage of the different porch types
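A sketch of how these four features might be derived is shown below. The exact formulas are not given in the post, so two choices here are assumptions: ages are measured relative to the year of sale, and a half bath counts as 0.5. The new column names `HouseAge` and `TotalBathrooms` are illustrative, as are the two invented rows.

```python
import pandas as pd

# Two invented houses; source column names follow the Kaggle data set.
toy = pd.DataFrame({
    "YrSold": [2008, 2010],
    "YearBuilt": [1998, 1960],
    "YearRemodAdd": [2003, 1960],
    "FullBath": [2, 1],
    "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0],
    "BsmtHalfBath": [0, 1],
    "OpenPorchSF": [40, 0],
    "EnclosedPorch": [0, 120],
    "3SsnPorch": [0, 0],
    "ScreenPorch": [0, 60],
})

df = toy.copy()

# Ages relative to the sale year (assumption) are easier to read than raw years.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["YearsSinceRemod"] = df["YrSold"] - df["YearRemodAdd"]

# Four bathroom columns collapsed into one; half baths weighted 0.5 (assumption).
df["TotalBathrooms"] = (
    df["FullBath"] + df["BsmtFullBath"]
    + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"])
)

# Total porch area across the different porch types.
porch_cols = ["OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch"]
df["PorchSF"] = df[porch_cols].sum(axis=1)

print(df[["HouseAge", "YearsSinceRemod", "TotalBathrooms", "PorchSF"]])
```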

Encoding of categorical variables

14 categorical variables were ordinally encoded. Most of these were quality ratings that were easy to map to numerical values.

For linear models, I dummified the remaining categorical variables, including the month and year sold.
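The two encoding steps can be illustrated as follows. The five-level quality scale and the choice of `drop_first=True` are assumptions for this sketch; the post does not list the exact mappings or the dummy-coding options used.

```python
import pandas as pd

# Assumed mapping for the quality ratings in the data set (Po = poor ... Ex = excellent).
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex"],
    "Neighborhood": ["NAmes", "Edwards", "NAmes"],
    "MoSold": [6, 7, 6],
})

df = toy.copy()

# Ordinal encoding: the categories have a natural order, so map them to integers.
df["KitchenQual"] = df["KitchenQual"].map(quality_scale)

# Month sold is treated as a category, not a quantity, before dummy coding.
df["MoSold"] = df["MoSold"].astype(str)

# One-hot ("dummy") encoding; drop_first avoids perfectly collinear columns
# in a linear model with an intercept.
encoded = pd.get_dummies(df, columns=["Neighborhood", "MoSold"], drop_first=True)

print(encoded)
```

Treating the month and year sold as categories lets a linear model fit a separate effect per period instead of forcing a straight-line trend over time.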

Variable transformations

To improve the normality of strongly skewed variables, I log-transformed the dependent variable.

I also applied log transforms to explanatory variables with a skewness of at least 0.5.
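As a rough illustration of this step, the snippet below log-transforms a synthetic target and every synthetic feature whose skewness is at least 0.5. The use of `log1p` (which handles zeros safely) and the generated data are assumptions; only the 0.5 cutoff comes from the text.

```python
import numpy as np
import pandas as pd

# Synthetic data: two right-skewed columns and one roughly symmetric column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "SalePrice": rng.lognormal(mean=12, sigma=0.4, size=500),
    "LotArea": rng.lognormal(mean=9, sigma=0.5, size=500),
    "OverallQual": rng.integers(1, 11, size=500).astype(float),
})

df = toy.copy()

# Dependent variable: log(1 + x) pulls in the long right tail.
df["SalePrice"] = np.log1p(df["SalePrice"])

# Explanatory variables: transform only those with skewness >= 0.5.
features = df.drop(columns="SalePrice")
skewness = features.skew()
skewed_cols = skewness[skewness >= 0.5].index
df[skewed_cols] = np.log1p(df[skewed_cols])

print("Transformed features:", list(skewed_cols))
print(df.skew())
```

Because the model is then fit on the log of the price, its predictions have to be mapped back with `np.expm1` before they can be read as dollar amounts.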

Part 2: Model fitting and evaluation

I split the cleaned data set into train and test sets and evaluated the models using 10-fold cross-validation. I used the root mean squared error (RMSE) to evaluate performance.

Lasso-penalized linear regression performed best, with an RMSE of 0.1201 after tuning the alpha parameter.
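The evaluation setup described above might be sketched like this with scikit-learn. The data are synthetic, and the alpha grid, the scaling step and the split size are assumptions, so the numbers it prints have no relation to the 0.1201 reported for the real data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and the log sale price.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_coef = np.array([0.5, -0.3, 0.2, 0, 0, 0, 0, 0, 0, 0])
y = 12 + X @ true_coef + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale inside the pipeline so each CV fold is scaled using its own training part.
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# Tune alpha with 10-fold cross-validation, scored by (negated) RMSE.
search = GridSearchCV(
    pipeline,
    param_grid={"lasso__alpha": np.logspace(-4, 0, 20)},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

cv_rmse = -search.best_score_
test_rmse = float(np.sqrt(np.mean((search.predict(X_test) - y_test) ** 2)))

print("best alpha:", search.best_params_["lasso__alpha"])
print("CV RMSE:", round(cv_rmse, 4), "test RMSE:", round(test_rmse, 4))
```

Since the target is on a log scale, an RMSE computed this way measures relative rather than absolute price error.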

Next, I looked at the plots of the predictions and the residuals.

Visually, both plots show a reasonably good fit. We can move on to examining the coefficients to gain more insight.

From the coefficient plot, we can see that the most important factor in determining the sale price is the above-ground living area, which makes sense. After that comes overall condition.

In addition, six of the coefficients relate to the neighborhood, indicating that location is an important consideration for Ames buyers.

One surprise is that the largest negative coefficient belongs to the commercial zoning classification. This may warrant further investigation, especially since the coefficient is disproportionately large.
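Ranking coefficients in this way can be sketched as below. The features, their effect sizes and the alpha value are invented for the example and say nothing about the real model. Because the target is a log price, `exp(coef) - 1` gives the approximate relative price change for a one-unit increase in a feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic features on a common scale, with invented effects on the log price.
rng = np.random.default_rng(1)
feature_names = ["GrLivArea", "OverallCond", "Neighborhood_Edwards", "GarageArea"]
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=feature_names)
y = (
    12
    + 0.4 * X["GrLivArea"]
    + 0.1 * X["OverallCond"]
    - 0.2 * X["Neighborhood_Edwards"]
    + rng.normal(scale=0.05, size=400)
)

model = Lasso(alpha=0.001).fit(X, y)

# Sort coefficients by absolute size, keeping their signs.
coefs = pd.Series(model.coef_, index=feature_names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

# Log-price model: convert each coefficient to an approximate percent effect.
pct_effect = np.expm1(ranked) * 100

print(pd.DataFrame({"coef": ranked, "pct_effect": pct_effect}).round(3))
```

Coefficients are only comparable in size when the features share a scale, which is one reason a single unusually large coefficient deserves a closer look.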


The main takeaways from the model analysis are as follows:

  • A bigger house is not always a better house, but in Ames it is probably a more expensive house. Therefore, homeowners can significantly increase the value of their home by building a home extension if possible.
  • Neighborhoods have a significant impact on housing prices. Therefore, home buyers who want to save money on a more comfortable house should consider neighborhoods like Edwards and Old Town.

Opportunities for further analysis

My final predictions scored in the top 25% on Kaggle. The score could likely be improved by further tuning the model hyperparameters.

In addition, we may get even better results by stacking or blending several models. The trade-off is that this approach adds a layer of complexity and makes the model harder to interpret.
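For reference, a stacked model could be set up along these lines with scikit-learn's `StackingRegressor`. This was not part of the project; the choice of base models, the meta-model and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Lasso, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data with one linear and one non-linear effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 12 + 0.5 * X[:, 0] + 0.2 * np.abs(X[:, 1]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Base models make out-of-fold predictions; the final estimator learns how to
# combine them.
stack = StackingRegressor(
    estimators=[
        ("lasso", Lasso(alpha=0.001)),
        ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=RidgeCV(),
    cv=5,
)
stack.fit(X_train, y_train)

stack_rmse = float(np.sqrt(np.mean((stack.predict(X_test) - y_test) ** 2)))
print("stacked test RMSE:", round(stack_rmse, 4))
```

The combined prediction no longer maps to a single set of coefficients, which is the loss of interpretability mentioned above.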

Finally, there are limitless possibilities in feature engineering. We can try combining some categorical variables, applying different transformations to skewed variables, and adding or dropping different combinations of features until the end of time.

More information and code are available on GitHub.

