Data Science by Example: a case study of Airbnb Seattle

Andrew Sithole
Apr 12, 2021

Data science is a very broad topic with vast applications, and because of that everyone has their own view of what it is about. Despite the varying definitions, most people will agree on the goals of data science, which include learning more about the data we have and creating predictive models.

You might be asking, how can I understand such a broad topic? Well, lucky for you, data scientists across the world have come up with the “cross-industry standard process for data mining” (CRISP-DM) to help us standardise the processes that are practised in data science. In this post, we will look at the Airbnb dataset for Seattle and try to answer some questions of interest by applying the CRISP-DM model. If you wish to see the code I use in this post, check out this GitHub repo.

The CRISP-DM process
The CRISP-DM process. Source: Wikimedia

CRISP-DM stipulates that the life cycle of a data science project usually follows 6 steps, as shown above:
1. Understand the business
2. Understand the data
3. Prepare the data
4. Model the data
5. Evaluate the results
6. Deploy

It is important to note that although CRISP-DM has 6 steps, not every question of interest will need all six.

1. Understand the business

You probably know what Airbnb is and how it does business. If you do not, it is an online marketplace for lodging, primarily home-stays for vacation rentals, and for tourism activities. I’d encourage you to check out these YouTube videos that talk about it. First video and second one

2. Understand the data

The Airbnb dataset consists of 3 CSV files, which contain data about listings, reviews and the calendar. The listings dataset has 3818 rows and 92 columns. A look at the continuous columns of the listings data gives the statistical table below.
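A summary table of this kind can be approximated with a few lines of standard-library Python. This is only a sketch: the sample rows and the column names (`accommodates`, `reviews_per_month`, `price`) are invented for illustration, and the real analysis may well use a dataframe library instead. Note that a column stored as text, such as a price written with a currency symbol, is not treated as numeric here.

```python
import csv
import io
import statistics

# Invented sample rows; the real listings file has 3818 rows and 92 columns.
SAMPLE_CSV = """id,accommodates,reviews_per_month,price
1,2,0.5,$85.00
2,4,,$150.00
3,2,1.5,"$1,000.00"
"""

def to_float(text):
    """Return text as a float, or None if it is blank or not a plain number."""
    try:
        return float(text)
    except ValueError:
        return None

def describe(rows, column):
    """Summary statistics for one column, ignoring values that are not numeric."""
    values = [v for v in (to_float(r[column]) for r in rows) if v is not None]
    if not values:
        return None  # nothing numeric to summarise
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for column in rows[0]:
    print(column, describe(rows, column))
```

In this sketch `describe(rows, "price")` returns `None`, because every price value carries a `$` sign and so fails the numeric conversion.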

A histogram of the above columns would give us more insight into our data. You can zoom the image below to view the stats.

Some of the observations we can make from the data are that:
1. The properties get very few reviews per month
2. Most of the reviews are good
3. Most listings have nothing in the license field
4. Most listings allow very few guests

We can draw many other conclusions but for now let’s stop here.

Now that we have looked at the data, we may have noticed some issues with it. These are the interesting issues I have noticed:
1. Price fields don’t show up when I list the continuous variables. That is possibly caused by the $ sign, which makes the values read as text rather than numbers.
2. Some fields, like license, contain NaN values. They need to be cleaned up.
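Both issues can be handled with small helper functions. The sketch below assumes prices are written like `$1,250.00` (a `$` sign and optional thousands separators); the function names `parse_price` and `is_missing` are hypothetical and not taken from the repo.

```python
import math

def parse_price(text):
    """Convert a price string such as '$1,250.00' to a float.

    Returns None for blank or missing values.
    """
    if text is None:
        return None
    cleaned = text.replace("$", "").replace(",", "").strip()
    if cleaned == "" or cleaned.lower() == "nan":
        return None
    return float(cleaned)

def is_missing(value):
    """True for None, blank strings, and float NaN."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

print(parse_price("$85.00"))
print(parse_price("$1,250.00"))
print(is_missing(float("nan")))
```

Returning `None` rather than 0 for a missing price keeps missing values distinguishable from genuinely free listings, so a later step can decide whether to drop or impute them.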

I will now seek answers to some questions that popped into my head, applying the relevant steps of the CRISP-DM process to each question as needed. The 3 questions I have about Airbnb in Seattle are:

  1. Which Seattle neighbourhoods earn the highest revenue?
  2. Which accommodation size receives the highest number of bookings?
  3. How well can we predict listing price?

Which Seattle neighbourhoods earn the highest revenue?

We do not have a revenue column in our dataset, so I will assume that each review corresponds to one stay at a listing. This gives us an estimated number of stays at each listing, but we do not know how long each stay was. To estimate this, I will use the listing's minimum number of days per stay. In other words, a listing's estimated revenue is its price multiplied by its number of reviews and by its minimum stay. With this I can calculate the estimated revenue for each neighbourhood as illustrated below.

1. Clean up the data by removing the $ sign from the price values.

2. Calculate the estimated revenue for each listing, using the number of reviews as the number of stays, then take the average per neighbourhood. Below is the list showing the neighbourhoods with the highest estimated listing revenues.

Highest earning neighbourhoods
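The two steps above can be sketched as follows. The rows are invented, prices are assumed to be numeric already, and the field names (`neighbourhood`, `price`, `number_of_reviews`, `minimum_nights`) are assumptions about how the listings data is laid out rather than something confirmed in this post.

```python
from collections import defaultdict
from statistics import mean

# Invented rows for illustration; prices are assumed to be cleaned already.
listings = [
    {"neighbourhood": "A", "price": 100.0, "number_of_reviews": 10, "minimum_nights": 2},
    {"neighbourhood": "A", "price": 50.0, "number_of_reviews": 4, "minimum_nights": 1},
    {"neighbourhood": "B", "price": 80.0, "number_of_reviews": 5, "minimum_nights": 3},
]

def estimated_revenue(listing):
    """Reviews stand in for stays; each stay is assumed to last the minimum stay."""
    return listing["price"] * listing["number_of_reviews"] * listing["minimum_nights"]

def average_revenue_by_neighbourhood(rows):
    """Average the per-listing estimates within each neighbourhood, highest first."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["neighbourhood"]].append(estimated_revenue(row))
    averages = {name: mean(values) for name, values in grouped.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

ranking = average_revenue_by_neighbourhood(listings)
for name, revenue in ranking:
    print(f"{name}: {revenue:.2f}")
```

Because the estimate relies on reviews and minimum stays, it is best read as a way to rank neighbourhoods against each other, not as an actual revenue figure.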

Which accommodation size receives the highest number of bookings?

Again, assuming that each review was a successful booking, we can get the proportions of bookings plotted against the number of people a property can accommodate.

Distribution of listings by number of people they can accommodate

From this plot, we learn that properties that accommodate 2 people account for over half of the total bookings. This suggests that an investor who wants to build properties should look into units that accommodate 2 people, as they receive the highest number of bookings.
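The proportions behind such a plot can be computed along these lines. This is a sketch with invented (accommodates, number of reviews) pairs, using reviews as the stand-in for bookings as described above.

```python
from collections import Counter

# Invented (accommodates, number_of_reviews) pairs for illustration.
listings = [(2, 30), (2, 25), (4, 20), (6, 5), (2, 10), (4, 10)]

def booking_share(pairs):
    """Share of all reviews (the proxy for bookings) per accommodation size."""
    totals = Counter()
    for accommodates, reviews in pairs:
        totals[accommodates] += reviews
    grand_total = sum(totals.values())
    return {size: count / grand_total for size, count in sorted(totals.items())}

shares = booking_share(listings)
for size, share in shares.items():
    print(f"accommodates {size}: {share:.1%}")
```

Weighting by reviews rather than counting listings matters here: a size that is common among listings is not necessarily the one that is booked most often.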

How well can we predict listing price?

This one is the most interesting because it’s where we create the model. We first clean up the data by imputing values for the categorical variables. We then split the data into training and test sets. After this, we instantiate the linear regression model and fit it to our training data. We then test the model by predicting prices for the test data, which gave the output below. It shows that the predictions broadly track the actual values, although they do not match exactly.
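The split, fit and predict sequence can be illustrated with a deliberately small stand-in. The sketch below is not the model from the repo: it uses invented data, a single feature (number of guests), and a hand-written least squares fit instead of a library model, purely to show the shape of the workflow. The helper names `train_test_split` and `fit_line` are defined here and are not library calls.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: price grows with the number of guests, plus a little noise.
rng = random.Random(0)
data = [(guests, 40.0 + 30.0 * guests + rng.uniform(-10, 10))
        for guests in range(1, 9) for _ in range(10)]

train, test = train_test_split(data)
slope, intercept = fit_line(train)           # fit on training data only
predictions = [slope * x + intercept for x, _ in test]  # predict unseen rows
print(f"price = {slope:.1f} * guests + {intercept:.1f}")
```

The key point is that the test rows play no part in fitting, so the predictions on them give an honest picture of how the model behaves on data it has not seen.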

The R squared score was 0.5926903081943571 on the training dataset, and the test data gave a very similar score of 0.5975240944346538. The odd thing is that the test data got a slightly higher score than the training data. This could be because the outliers are better represented in the training data than in the test data, although a gap this small can also come down to how the random split happened to fall.

The root mean square error of the model was 3446.6221734160763 (a figure this large relative to nightly prices may in fact be the mean squared error, whose square root is about 58.7). Along with the R squared value, this suggests that the model has room to improve without much risk of overfitting. The improvement could come from adding more variables that we didn’t analyse.
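For reference, both metrics are easy to compute by hand. The sketch below uses invented actual and predicted values and plain Python rather than a library, and it keeps the mean squared error and its square root as separate functions because the two are easy to confuse: the former is in squared units of the target, the latter in the target's own units.

```python
import math

def r_squared(actual, predicted):
    """1 - SS_res / SS_tot: the share of variance explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def mean_squared_error(actual, predicted):
    """Average squared difference, in squared units of the target."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(actual, predicted))

# Invented values for illustration.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
print(r_squared(actual, predicted))
print(mean_squared_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
```

Here every prediction is off by 10, so the mean squared error is 100 while the root mean square error is 10, which is the figure that reads directly as "typically about 10 off".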

To better visualise the results, I created a chart that shows the actual values against the predicted values on the same axes. This was inspired by this post.

Conclusion

We have seen how we can get more insight from data and how we can predict variables, using CRISP-DM as our process guideline. If you wish to play with the data or see what others are doing with it, please check out Kaggle. If you want to follow my code, please check out my GitHub here.
