So, is it a good price?

With my oldest daughter heading off to college soon, we’ve realized that our family car doesn’t need to be as large as it used to be. We’ve had a great relationship with our local CarMax over the years, and we appreciate their no-haggle pricing model. My wife had her eyes set on a particular model: a 2019 Volvo XC90 T6 Momentum. The specific car she found was listed at $35,998, with 47,000 miles on the odometer.

But is the price good or bad? As a hacker/data scientist, I knew could get the data to make an informed decision but doing analysis at home is a great way to learn and use new technologies. The bottom line is that the predicted price would be $40,636 or 11.4% higher than the CarMax asking price. If I compare to the specific trim, the price should be $38,666. So the price is probably fair. Now how did I come up with that number?

# Calculations

Armed with Python and an array of web scraping tools, I embarked on a mission to collect data that would help me determine a fair value for our new car. I wrote a series of scripts to extract relevant information, such as price, age, and cost from various websites. This required a significant amount of Python work to convert the HTML data into a format that could be analyzed effectively.

Once I had amassed a good enough dataset (close to 200 cars), I began comparing different statistical techniques to find the most accurate pricing model. In this blog post, I’ll detail my journey through the world of logistic regression and compare it to more modern data science methods, revealing which technique ultimately led us to the fairest car price.

First, I did some basic web searching. According to Edmunds, the average price for a 2019 Volvo XC90 T6 Momentum with similar mileage is between $33,995 and $43,998 and my $35,998 falls within this range.

As for how the Momentum compares to other Volvo options and similar cars, there are a few things to consider. The Momentum is one of four trim levels available for the 2019 XC902. It comes with a number of standard features, including leather upholstery, a panoramic sunroof, and a 9-inch touchscreen infotainment system. Other trim levels offer additional features and options.

The 2019 Volvo XC90 comes in four trim levels: Momentum, R-Design, Inscription, and Excellence. The R-Design offers a sportier look and feel, while the Inscription adds more luxury features. The Excellence is the most luxurious and expensive option, with seating for four instead of seven. The Momentum is the most basic.

In terms of similar cars, some options to consider might include the Audi Q7 or the BMW X5. Both of these SUVs are similarly sized and priced to the XC90.

To get there, I had to do some web scraping, data cleaning, and built a basic logistic regression model, as well as other modern data science methods. To begin my data collection journey, I decided (in 2 seconds) to focus on three primary sources: Google’s search summary, Carvana, and Edmunds.

My first step was to search for Volvo XC90 on each of these websites. I then used the Google Chrome toolbar to inspect the webpage’s HTML structure and identify the `<div>`

element containing the desired data. By clicking through the pages, I was able to copy the relevant HTML and put this in a text file, enclosed within `<html>`

and `<body>`

tags. This format made it easier for me to work with the BeautifulSoup Python library, which allowed me to extract the data I needed and convert it into CSV files.

Since the data from each source varied, I had to run several regular expressions on many fields to further refine the information I collected. This process ensured that the data was clean and consistent, making it suitable for my upcoming analysis.

Finally, I combined all the data from the three sources into a single CSV file. This master dataset provided a solid foundation for my pricing analysis and allowed me to compare various data science techniques in order to determine the most accurate and fair price for the 2019 Volvo XC90 T6 Momentum.

In the following sections, I’ll delve deeper into the data analysis process and discuss the different statistical methods I employed to make our car-buying decision.

First, data from carvana looked like this:

<div class="tk-pane full-width"> <div class="inventory-type carvana-certified" data-qa="inventory-type">Carvana Certified </div> <div class="make-model" data-qa="make-model"> <div class="year-make">2020 Volvo XC90</div> </div> <div class="trim-mileage" data-qa="trim-mileage"><span>T6 Momentum</span> • <span>36,614 miles</span></div> </div> <div class="tk-pane middle-frame-pane"> <div class="flex flex-col h-full justify-end" data-qa="pricing"> <div data-qa="price" class="flex items-end font-bold mb-4 text-2xl">$44,990</div> </div> </div>

In this code snippet, I used the BeautifulSoup library to extract relevant data from the saved HTML file, which contained information on Volvo XC90 listings. The script below searches for specific `<div>`

elements containing the year, make, trim, mileage, and price details. It then cleans up the data by removing unnecessary whitespace and commas before storing it in a dictionary. Finally, the script compiles all the dictionaries into a list and exports the data to a CSV file for further analysis.

I could then repeat this process with Google to get a variety of local sources.

One challenge from the Google results, was that I had a lot of data in the images (they were base64 encoded) so wrote a bash script to clean up the tags using sed (pro tip: learn awk and sed)

When working with the Google search results, I had to take a slightly different approach compared to the strategies used for Carvana and Edmunds. Google results did not have a consistent HTML structure that could be easily parsed to extract the desired information. Instead, I focused on identifying patterns within the text format itself to retrieve the necessary details. By leveraging regular expressions, I was able to pinpoint and extract the specific pieces of information, such as the year, make, trim, mileage, and price, directly from the text. My scrape code is below.

Scraping Edmunds required both approaches of using formatting and structure.

All together, I got 174 records of used Volvo XC90s, I could easily get 10x this since the scripts exist and I could mine craigslist and other places. With the data I have, I can use R to explore the data:

# Load the readxl package library(readxl) library(scales) library(scatterplot3d) # Read the data from data.xlsx into a data frame df <- read_excel("data.xlsx") df$Price<-as.numeric(df$Price)/1000 # Select the columns you want to use df <- df[, c("Title", "Desc", "Mileage", "Price", "Year", "Source")] # Plot Year vs. Price with labeled axes and formatted y-axis plot(df$Year, df$Price, xlab = "Year", ylab = "Price ($ '000)", yaxt = "n") # Don't plot y-axis yet # Add horizontal grid lines grid() # Format y-axis as currency axis(side = 2, at = pretty(df$Price), labels = dollar(pretty(df$Price))) abline(lm(Price ~ Year, data = df), col = "red")

This code snippet employs the `scatterplot3d()`

function to show a 3D scatter plot that displays the relationship between three variables in the dataset. Additionally, the `lm()`

function is utilized to fit a linear regression model, which helps to identify trends and patterns within the data. To enhance the plot and provide a clearer representation of the fitted model, the `plane3d()`

function is used to add a plane that represents the linear regression model within the 3D scatter plot.

model <- lm(Price ~ Year + Mileage, data = df) # Plot the data and model s3d <- scatterplot3d(df$Year, df$Mileage, df$Price, xlab = "Year", ylab = "Mileage", zlab = "Price", color = "blue") s3d$plane3d(model, draw_polygon = TRUE)

So, we can now predict the price of 2019 Volvo XC90 T6 Momentum with 47K miles, which is $40,636 or 11.4% higher than the CarMax asking price of $35,998.

# Create a new data frame with the values for the independent variables new_data <- data.frame(Year = 2019, Mileage = 45000) # Use the model to predict the price of a 2019 car with 45000 miles predicted_price <- predict(model, new_data) # Print the predicted price print(predicted_price)

# Other Methods

Ok, so now let’s use “data science”. Besides linear regression, there are several other techniques that I can use to take into account the multiple variables (year, mileage, price) in your dataset. Here are some popular techniques:

**Decision Trees**: A decision tree is a tree-like model that uses a flowchart-like structure to make decisions based on the input features. It is a popular method for both classification and regression problems, and it can handle both categorical and numerical data.

**Random Forest**: Random forest is an ensemble learning technique that combines multiple decision trees to make predictions. It can handle both regression and classification problems and can handle missing data and noisy data.

**Support Vector Machines (SVM)**: SVM is a powerful machine learning algorithm that can be used for both classification and regression problems. It works by finding the best hyperplane that separates the data into different classes or groups based on the input features.

**Neural Networks**: Neural networks are a class of machine learning algorithms that are inspired by the structure and function of the human brain. They are powerful models that can handle both numerical and categorical data and can be used for both regression and classification problems.

**Gradient Boosting**: Gradient boosting is a technique that combines multiple weak models to create a stronger one. It works by iteratively adding weak models to a strong model, with each model focusing on the errors made by the previous model.

All of these techniques can take multiple variables into account, and each has its strengths and weaknesses. The choice of which technique to use will depend on the specific nature of your problem and your data. It is often a good idea to try several techniques and compare their performance to see which one works best for your data.

I’m going to use random forest and a decision tree model.

# Random Forest

# Load the randomForest package library(randomForest) # "Title", "Desc", "Mileage", "Price", "Year", "Source" # Split the data into training and testing sets set.seed(123) # For reproducibility train_index <- sample(1:nrow(df), size = 0.7 * nrow(df)) train_data <- df[train_index, ] test_data <- df[-train_index, ] # Fit a random forest model model <- randomForest(Price ~ Year + Mileage, data = train_data, ntree = 500) # Predict the prices for the test data predictions <- predict(model, test_data) # Calculate the mean squared error of the predictions mse <- mean((test_data$Price - predictions)^2) # Print the mean squared error cat("Mean Squared Error:", mse)

The output from the random forest model you provided indicates that the model has a mean squared error (MSE) of 17.14768 and a variance explained of 88.61%. A lower MSE value indicates a better fit of the model to the data, while a higher variance explained value indicates that the model can explain a larger portion of the variation in the target variable.

Overall, an MSE of 17.14768 is reasonably low and suggests that the model has a good fit to the training data. A variance explained of 88.61% suggests that the model is able to explain a large portion of the variation in the target variable, which is also a good sign.

However, the random forest method shows a predicted cost of $37,276.54.

I also tried cross-validation techniques to get a better understanding of the model’s overall performance (MSE 33.890). Changing to a new technique such as a decision tree model, turned MSE into 50.91. Logistic regression works just fine.

# Adding the Trim

However, I was worried that I was comparing the Momentum to the higher trim options. So to get the trim, I tried the following prompt in Gpt4 to translate the text to one of the four trims.

don't tell me the steps, just do it and show me the results. given this list add, a column (via csv) that categorizes each one into only five categories Momentum, R-Design, Inscription, Excellence, or Unknown

That worked perfectly and we can see that we have mostly Momentums.

Excellence | Inscription | Momentum | R-Design | Unknown | |
---|---|---|---|---|---|

Count | 0 | 68 | 87 | 8 | 9 |

Percent | 0.00% | 39.53% | 50.58% | 4.65% | 5.23% |

And this probably invalidates my analysis as Inscriptions (in blue) do have clearly higher prices:

We can see the average prices (in thousands). In 2019 Inscriptions cost less than Momentums? That is probably a small n problems since we only have 7 Inscriptions and 16 Momentum’s in our data set for 2019.

Year | R-Design | Inscription | Momentum |
---|---|---|---|

2014 | $19.99 | NA | NA |

2016 | $30.59 | $32.59 | $28.60 |

2017 | $32.79 | $32.97 | $31.22 |

2018 | $37.99 | $40.69 | $33.23 |

2019 | NA | $36.79 | $39.09 |

2020 | NA | $47.94 | $43.16 |

So, if we restrict our data set smaller, what would the predicted price of the 2019 Momentum be? Just adding a filter and running our regression code above we have **$38,666** which means we still have a good/reasonable price.

# Quick Excursion

One last thing I’m interested in: does mileage or age matter more. Let’s build a new model.

# Create Age variable df$Age <- 2023 - df$Year # Fit a linear regression model model <- lm(Price ~ Mileage + Age, data = df) # Print the coefficients summary(model)$coef

Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|

(Intercept) | 61.34913 | 0.69084 | 88.80372 | 2.28E-144 |

Mileage | -0.00022 | 2.44E-05 | -8.83869 | 1.18E-15 |

Age | -2.75459 | 0.27132 | -10.1525 | 3.15E-19 |

Based on the regression results, we can see that both Age and Mileage have a significant effect on Price, as their p-values are very small (<0.05). However, we can also see that Age has a larger absolute t-score (-10.15) than Mileage (-8.84), indicating that Age may have a slightly greater effect on Price than Mileage. Additionally, the estimates show that for every one-year increase in Age, the Price decreases by approximately 2.75 thousand dollars, while for every one-mile increase in Mileage, the Price decreases by approximately 0.0002 thousand dollars (or 20 cents). That is actually pretty interesting.

This isn’t that far off. According to the US government, a car depreciates by an average of $0.17 per mile driven. This is based on a five-year ownership period, during which time a car is expected to be driven approximately 12,000 miles per year, for a total of 60,000 miles.

In terms of depreciation per year, it can vary depending on factors such as make and model of the car, age, and condition. However, a general rule of thumb is that a car can lose anywhere from 15% to 25% of its value in the first year, and then between 5% and 15% per year after that. So on average, a car might depreciate by about 10% per year.

# Code

While initially in the original blog post, I moved all the code to the end.