Baby’s First Linear Regression Model

Jacqueline Flanigan
4 min read · Aug 29, 2021

In my previous post I talked the big talk about wanting to get into Data Science, so now I have to walk the walk. Here I will describe the next project I took on, one that demonstrates my understanding of a linear regression model.

For those who are still just reading about Data Science and don’t want to get too technical, no worries! I’ll be describing my steps in layman’s terms and leaving the technical parts in my GitHub (which I will link down below). To begin, here is what a linear regression model actually is, put simply: it is a predictive model that uses a straight-line (linear) relationship between variables to estimate one value, such as a house’s price, from others, such as its size.
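
To make that definition a little more concrete, here is a minimal sketch in Python of the simplest case: one input, one output, and a straight line fitted through the points by ordinary least squares. The numbers are made up for illustration (they are not from the King County dataset), the function name `fit_line` is hypothetical, and this is not the project's actual code.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: square footage vs. sale price (hypothetical numbers)
sqft = [1000, 1500, 2000, 2500]
price = [250_000, 350_000, 450_000, 550_000]

slope, intercept = fit_line(sqft, price)

# Use the fitted line to estimate the price of an 1,800 sq ft house
predicted = slope * 1800 + intercept
```

In this toy data every extra square foot adds the same amount to the price, which is exactly the kind of straight-line relationship a linear regression looks for. Real data is messier, so the line is only the best available fit.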

That sentence may send shivers down your spine if you haven’t encountered these models before, but once you begin exploring different projects or examples, a lot of this will begin to make more sense. Without further ado, I’ll tell you about the first linear regression model I completed.

The scenario I chose for this project was building a model that would help potential homebuyers in the King County, WA area. Buying a home can be a very stressful and difficult process, especially for first-time buyers (as I found out myself not too long ago). For the model I created, I filtered the King County House Sales dataset to predict prices of houses in that area that fall within certain grade and condition ranges.

Since the dataset contained quite a healthy number of columns to choose from, I narrowed them down to simplify my model. I ended up dropping a few columns (mostly about surrounding areas rather than the individual houses) as well as limiting the price range to anything up to and including one million dollars. I can already hear you saying, “ONE MILLION DOLLARS?!” Yes, one million dollars. I know, the poor person in me also cried a little, but hey, it’s all about location, right?

Moving on, I focused on fine-tuning the grades and conditions of the houses, since most people likely want a first home that doesn’t need a lot of repairs or remodeling. Therefore, using the condition and grading system provided by King County, I kept houses with a grade of at least 6 (where there are a few lower-quality materials and more simplistic designs) up to a maximum of 10 (higher-quality features and usually more square footage). As for condition, the houses had to be at least a 3 or, in words, an average home that needs only minor repairs.
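
For the curious, the price cap and the grade and condition ranges described above might look something like the following pandas sketch. It assumes the columns are named `price`, `grade`, and `condition`, and it uses a tiny made-up table in place of the real data, so the actual project code may well differ.

```python
import pandas as pd

# Tiny made-up stand-in for the King County House Sales data
houses = pd.DataFrame({
    "price":     [450_000, 1_200_000, 300_000, 1_000_000, 700_000],
    "grade":     [7, 11, 5, 10, 6],
    "condition": [3, 4, 3, 5, 2],
})

# Keep homes at or under $1M, graded 6 through 10, in condition 3 or better
mask = (
    (houses["price"] <= 1_000_000)
    & houses["grade"].between(6, 10)   # inclusive on both ends by default
    & (houses["condition"] >= 3)
)
filtered = houses[mask]
```

Only the first and fourth rows survive here: the others are either over budget, graded outside the 6 to 10 range, or in below-average condition.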

Now, an important step when creating a model is to keep in mind that not everything can be compared on the same scale. For instance, if a property is on the waterfront, it is given a value of one, which can’t meaningfully be compared to square footage, which runs into the thousands. This is where transformations come into play. Transforming the data is an essential step so that this information can be compared without skewing the results of the model. For this model I used log transformations (to rein in large, skewed values) and dummy variables (to turn categories into 0/1 columns). It is also worth checking for interactions between variables, so that any interaction included in the model actually improves it.
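
Here is a minimal sketch of what log transformations and dummy variables can look like in pandas. Which columns were actually transformed in the project is not stated above, so the choice of `price`, `sqft_living`, and `condition` here is an assumption, and the data is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumed for illustration
houses = pd.DataFrame({
    "price":       [450_000, 1_000_000, 620_000],
    "sqft_living": [1_200, 3_400, 2_100],
    "condition":   [3, 5, 4],
})

# Log transform: compresses large, right-skewed values onto a smaller scale
houses["log_price"] = np.log(houses["price"])
houses["log_sqft_living"] = np.log(houses["sqft_living"])

# Dummy variables: turn each category into its own 0/1 column.
# drop_first=True drops one category so the columns are not redundant.
dummies = pd.get_dummies(houses["condition"], prefix="condition", drop_first=True)
houses = pd.concat([houses, dummies], axis=1)
```

After this step, a house in condition 5 is marked in a `condition_5` column instead of carrying the number 5, so the model no longer treats condition as a quantity on the same footing as square footage.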

With all of the above in mind, I proceeded to fine-tune my code to get the best results that I could. I tested and retested my model with both the stats and linear regression formulas, taking into account which changes I had made to see what kind of influence they had on my results. I checked my p-values (the p standing for probability) to make sure that none of them were above 0.05, and I also checked the RMSE (root mean squared error). These matter because they tell you different things: a p-value indicates whether a variable is actually impactful to your model, while the RMSE shows the typical difference between the model’s predicted values and the actual values, which helps indicate whether your model is performing as expected. See, the thing with models in the data science world is that anyone should be able to use them and get the results you had. Models aren’t really helpful to people if they can’t produce predictable and accurate outcomes.
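
As a rough illustration of the RMSE, here is a small sketch in plain Python. The prices are made up and the helper name `rmse` is hypothetical; the point is only to show what the calculation does.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction miss."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up sale prices and model predictions
actual_prices = [400_000, 500_000, 600_000, 700_000]
predicted_prices = [410_000, 430_000, 590_000, 770_000]

error = rmse(actual_prices, predicted_prices)  # 50,000 dollars for these numbers
```

The misses here are 10,000, 70,000, 10,000, and 70,000 dollars, which average out to 40,000, yet the RMSE is 50,000. Squaring the errors makes big misses count for more, which is why RMSE is a stricter measure than a plain average.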

After all this, I can hear you asking how well the model I made actually performs. Don’t you know it’s rude to ask a lady that? Don’t worry, I forgive you this time since you’re probably just curious. In the end, after the steps mentioned above (plus some I left out to keep this an easier read), the final model showed that 44.7% of the variation in price could be explained by the independent variables. In addition, the RMSE showed that the model’s errors on the training data and the test data were quite close, with the Train RMSE equaling roughly 143,595.06 dollars and the Test RMSE coming out to around 143,352.60 dollars. What this all means is that the model I created is fairly accurate, and behaves consistently on data it hasn’t seen, when predicting home prices for first-time buyers in the King County area of Washington.
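
The "percent of variation explained" figure is usually called R-squared. A minimal sketch of how it is calculated, using made-up numbers and a hypothetical helper name `r_squared`, could look like this:

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    # Variation the model failed to explain (squared prediction errors)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation around the average price
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Made-up sale prices and model predictions
actual_prices = [400_000, 500_000, 600_000, 700_000]
predicted_prices = [410_000, 430_000, 590_000, 770_000]

score = r_squared(actual_prices, predicted_prices)  # 0.8, i.e. 80% explained
```

A score of 1 would mean the predictions match every price exactly, while a score of 0 would mean the model does no better than always guessing the average price.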

If you’d like to see the steps I took to get these results in more depth, feel free to check out my GitHub here:

https://github.com/JacquelineFlanigan/HousingInsights

Thanks for reading, and I hope to see you back for my next adventure in Data Science!
