Web Scraping - SF Housing Data
Mar 21, 2023
As the child of an architect, I have always been curious about house designs and layouts. I enjoy exploring the different architectural styles across the city, and along the way I have come across many beautiful neighborhoods that one would assume to be uniformly expensive. However, even within those small areas, I was surprised to find a wide range of house prices. That made me want to understand what drives prices in the city I've grown to call home.
The San Francisco housing market is both unpredictable and highly sought after, given the prime location of several neighborhoods. As a tech hub, San Francisco draws people for its jobs (high employment), its start-up scene (talented workforce, VCs), its favorable climate, and its culture. It is also widely regarded as one of the most expensive housing markets in the US. As in other metropolitan areas, prices vary widely based on numerous factors, which makes it harder for players in the market (realtors, homeowners, buyers, tenants, banks, government, etc.) to track this information and quote optimal prices.
The goal of this project is to build a web scraping tool in Python, using the Beautiful Soup library, that extracts data on houses sold in San Francisco over the last ten months from Trulia.com and stores it in MongoDB. The data includes features such as number of bedrooms, house size, price, year built, apartment features, and more. Market stakeholders can use the tool to track this information and draw insights from it, such as which properties in the city command a premium and what price premium is paid per zip code. The data can also feed SaaS tools for customers: a dashboard to track housing metrics over time, recommender systems that match customers with houses based on their requirements, or machine learning models that predict prices.
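To make the pipeline concrete, here is a minimal sketch of the parse-and-store step. It is not the project's actual scraper: the sample HTML, the `data-testid` values, the field names, and the `housing`/`sf_sold` database and collection names are all assumptions made up for illustration, and Trulia's real markup will differ. The sketch parses an inline HTML snippet rather than fetching a live page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical listing markup: these data-testid values are invented for
# illustration and are not taken from Trulia's actual pages.
SAMPLE_HTML = """
<ul>
  <li data-testid="listing-card">
    <span data-testid="price">$1,250,000</span>
    <span data-testid="beds">3 bd</span>
    <span data-testid="sqft">1,480 sqft</span>
    <span data-testid="address">123 Example St, San Francisco, CA 94110</span>
  </li>
  <li data-testid="listing-card">
    <span data-testid="price">$899,000</span>
    <span data-testid="beds">2 bd</span>
    <span data-testid="address">456 Sample Ave, San Francisco, CA 94117</span>
  </li>
</ul>
"""


def to_int(text):
    """Return the first integer in text like '$1,250,000' or '3 bd', else None."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def field_text(card, name):
    """Return the stripped text of the tag with the given data-testid, or None."""
    tag = card.find(attrs={"data-testid": name})
    return tag.get_text(strip=True) if tag else None


def parse_listings(html):
    """Turn listing cards into a list of plain dicts, one per house."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.find_all(attrs={"data-testid": "listing-card"}):
        address = field_text(card, "address")
        # Assume the zip code is the five digits that end the address.
        zip_match = re.search(r"(\d{5})\s*$", address or "")
        records.append({
            "address": address,
            "zip_code": zip_match.group(1) if zip_match else None,
            "price_usd": to_int(field_text(card, "price")),
            "bedrooms": to_int(field_text(card, "beds")),
            "sqft": to_int(field_text(card, "sqft")),  # None when missing
        })
    return records


def store_listings(records, uri="mongodb://localhost:27017"):
    """Insert records into MongoDB (assumes a server is reachable at uri)."""
    from pymongo import MongoClient  # imported here so parsing works without it

    if records:  # insert_many rejects an empty list
        client = MongoClient(uri)
        client["housing"]["sf_sold"].insert_many(records)
        client.close()


if __name__ == "__main__":
    for record in parse_listings(SAMPLE_HTML):
        print(record)
```

Keeping parsing and storage in separate functions means the parsing logic can be checked against saved HTML without a database, and missing fields are stored as `None` rather than dropped so that gaps stay visible in later analysis.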