## Quick(ish) Price Check on a Car

So, is it a good price?

With my oldest daughter heading off to college soon, we’ve realized that our family car doesn’t need to be as large as it used to be. We’ve had a great relationship with our local CarMax over the years, and we appreciate their no-haggle pricing model. My wife had her eyes set on a particular model: a 2019 Volvo XC90 T6 Momentum. The specific car she found was listed at \$35,998, with 47,000 miles on the odometer.

But is the price good or bad? As a hacker/data scientist, I knew I could get the data to make an informed decision, and doing analysis at home is a great way to learn and use new technologies. The bottom line is that the predicted price is \$40,636, or 11.4% higher than the CarMax asking price. If I compare against the specific trim, the predicted price is \$38,666. So the asking price is probably fair. Now, how did I come up with those numbers?

# Calculations

Armed with Python and an array of web scraping tools, I embarked on a mission to collect data that would help me determine a fair value for our new car. I wrote a series of scripts to extract relevant information, such as price, year, and mileage, from various websites. This required a significant amount of Python work to convert the HTML data into a format that could be analyzed effectively.

Once I had amassed a good enough dataset (close to 200 cars), I began comparing different statistical techniques to find the most accurate pricing model. In this blog post, I’ll detail my journey through the world of linear regression and compare it to more modern data science methods, revealing which technique ultimately led us to the fairest car price.

First, I did some basic web searching. According to Edmunds, the average price for a 2019 Volvo XC90 T6 Momentum with similar mileage is between \$33,995 and \$43,998 and my \$35,998 falls within this range.

As for how the Momentum compares to other Volvo options and similar cars, there are a few things to consider. The Momentum is one of four trim levels available for the 2019 XC90. It comes with a number of standard features, including leather upholstery, a panoramic sunroof, and a 9-inch touchscreen infotainment system. Other trim levels offer additional features and options.

The 2019 Volvo XC90 comes in four trim levels: Momentum, R-Design, Inscription, and Excellence. The R-Design offers a sportier look and feel, while the Inscription adds more luxury features. The Excellence is the most luxurious and expensive option, with seating for four instead of seven. The Momentum is the most basic.

In terms of similar cars, some options to consider might include the Audi Q7 or the BMW X5. Both of these SUVs are similarly sized and priced to the XC90.

To get there, I had to do some web scraping and data cleaning, and build a basic linear regression model, as well as try other modern data science methods. To begin my data collection journey, I decided (in 2 seconds) to focus on three primary sources: Google’s search summary, Carvana, and Edmunds.

My first step was to search for the Volvo XC90 on each of these websites. I then used Chrome’s developer tools to inspect each webpage’s HTML structure and identify the `<div>` elements containing the desired data. By clicking through the pages, I was able to copy the relevant HTML into a text file, enclosed within `<html>` and `<body>` tags. This format made it easier to work with the BeautifulSoup Python library, which allowed me to extract the data I needed and convert it into CSV files.

Since the data from each source varied, I had to run several regular expressions on many fields to further refine the information I collected. This process ensured that the data was clean and consistent, making it suitable for my upcoming analysis.
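As an illustration of that cleanup step, here is a minimal sketch; the raw field names and formats below are my assumptions about what the scrapers emitted, not the original script:

```python
import re

def clean_record(raw):
    """Normalize one scraped listing into numeric fields (illustrative sketch)."""
    mileage = int(re.sub(r"[^\d]", "", raw["mileage"]))          # "36,614 miles" -> 36614
    price = int(re.sub(r"[^\d]", "", raw["price"]))              # "$44,990" -> 44990
    year = int(re.match(r"(\d{4})", raw["year_make"]).group(1))  # "2020 Volvo XC90" -> 2020
    return {"Year": year, "Mileage": mileage, "Price": price}

print(clean_record({"year_make": "2020 Volvo XC90",
                    "mileage": "36,614 miles",
                    "price": "$44,990"}))
```

The same strip-everything-but-digits pattern handles commas, currency symbols, and unit suffixes in one pass.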

Finally, I combined all the data from the three sources into a single CSV file. This master dataset provided a solid foundation for my pricing analysis and allowed me to compare various data science techniques in order to determine the most accurate and fair price for the 2019 Volvo XC90 T6 Momentum.

In the following sections, I’ll delve deeper into the data analysis process and discuss the different statistical methods I employed to make our car-buying decision.

First, the data from Carvana looked like this:

```
<div class="tk-pane full-width">
<div class="inventory-type carvana-certified" data-qa="inventory-type">Carvana Certified
</div>
<div class="make-model" data-qa="make-model">
<div class="year-make">2020 Volvo XC90</div>
</div>
<div class="trim-mileage" data-qa="trim-mileage"><span>T6 Momentum</span> • <span>36,614
miles</span></div>
</div>
<div class="tk-pane middle-frame-pane">
<div class="flex flex-col h-full justify-end" data-qa="pricing">
<div data-qa="price" class="flex items-end font-bold mb-4 text-2xl">$44,990</div>
</div>
</div>
```

In this code snippet, I used the BeautifulSoup library to extract relevant data from the saved HTML file, which contained information on Volvo XC90 listings. The script below searches for specific `<div>` elements containing the year, make, trim, mileage, and price details. It then cleans up the data by removing unnecessary whitespace and commas before storing it in a dictionary. Finally, the script compiles all the dictionaries into a list and exports the data to a CSV file for further analysis.
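A condensed sketch along those lines (the class names come from the Carvana snippet above; the sample HTML and function name are mine, and the CSV export via `csv.DictWriter` is omitted for brevity):

```python
from bs4 import BeautifulSoup

# Tiny embedded sample so the sketch runs as-is; real input was the saved file.
SAMPLE_HTML = """
<html><body>
<div class="year-make">2020 Volvo XC90</div>
<div class="trim-mileage"><span>T6 Momentum</span> • <span>36,614 miles</span></div>
<div data-qa="price">$44,990</div>
</body></html>
"""

def parse_listings(html):
    soup = BeautifulSoup(html, "html.parser")
    years = soup.find_all("div", class_="year-make")
    trims = soup.find_all("div", class_="trim-mileage")
    prices = soup.find_all("div", attrs={"data-qa": "price"})
    rows = []
    for ym, tm, pr in zip(years, trims, prices):
        year, title = ym.get_text(strip=True).split(" ", 1)
        trim, miles = [s.get_text(" ", strip=True) for s in tm.find_all("span")]
        rows.append({
            "Year": int(year),
            "Title": title,
            "Desc": trim,
            # Keep only digits to drop commas and the "miles"/"$" decoration
            "Mileage": int("".join(ch for ch in miles if ch.isdigit())),
            "Price": int("".join(ch for ch in pr.get_text() if ch.isdigit())),
        })
    return rows

print(parse_listings(SAMPLE_HTML))
```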

I could then repeat this process with Google to get a variety of local sources.

One challenge with the Google results was that a lot of the data was embedded in images (base64 encoded), so I wrote a bash script to strip those tags using sed (pro tip: learn awk and sed).

When working with the Google search results, I had to take a slightly different approach compared to the strategies used for Carvana and Edmunds. Google results did not have a consistent HTML structure that could be easily parsed to extract the desired information. Instead, I focused on identifying patterns within the text format itself to retrieve the necessary details. By leveraging regular expressions, I was able to pinpoint and extract the specific pieces of information, such as the year, make, trim, mileage, and price, directly from the text. My scrape code is below.
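For example, a sketch of that pattern-matching approach; the snippet format matched here is an assumed example, not Google's actual result text:

```python
import re

# Assumed snippet shape: "<year> Volvo XC90 <trim> - $<price> - <mileage> mi"
pattern = re.compile(
    r"(?P<year>20\d{2}) Volvo XC90\s*(?P<trim>[^-$]*?)\s*-\s*"
    r"\$(?P<price>[\d,]+)\s*-\s*(?P<mileage>[\d,]+)\s*mi"
)

def extract(text):
    """Pull year, trim, price, and mileage out of free-form snippet text."""
    rows = []
    for m in pattern.finditer(text):
        rows.append({
            "Year": int(m.group("year")),
            "Trim": m.group("trim"),
            "Price": int(m.group("price").replace(",", "")),
            "Mileage": int(m.group("mileage").replace(",", "")),
        })
    return rows

print(extract("2019 Volvo XC90 T6 Momentum - $35,998 - 47,000 mi"))
```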

Scraping Edmunds required both approaches: parsing the HTML structure and matching patterns in the text.

Altogether, I got 174 records of used Volvo XC90s. I could easily get 10x this since the scripts exist and I could mine Craigslist and other places. With the data I have, I can use R to explore it:

```
# Load the packages we need
library(readxl)
library(scales)
library(scatterplot3d)

# Read the data from data.xlsx into a data frame
df <- read_excel("data.xlsx")
df$Price <- as.numeric(df$Price) / 1000

# Select the columns you want to use
df <- df[, c("Title", "Desc", "Mileage", "Price", "Year", "Source")]

# Plot Year vs. Price with labeled axes and formatted y-axis
plot(df$Year, df$Price, xlab = "Year", ylab = "Price ($ '000)",
     yaxt = "n")  # Don't plot y-axis yet

grid()

# Format y-axis as currency
axis(side = 2, at = pretty(df$Price), labels = dollar(pretty(df$Price)))

abline(lm(Price ~ Year, data = df), col = "red")
```

The next snippet fits a linear regression model with `lm()` using both Year and Mileage, then employs the `scatterplot3d()` function to show a 3D scatter plot displaying the relationship between the three variables. To provide a clearer representation of the fitted model, the `plane3d()` function adds the regression plane to the 3D scatter plot.

```
model <- lm(Price ~ Year + Mileage, data = df)

# Plot the data and model
s3d <- scatterplot3d(df$Year, df$Mileage, df$Price,
                     xlab = "Year", ylab = "Mileage", zlab = "Price",
                     color = "blue")
s3d$plane3d(model, draw_polygon = TRUE)
```

So, we can now predict the price of a 2019 Volvo XC90 T6 Momentum with 47K miles: \$40,636, or 11.4% higher than the CarMax asking price of \$35,998.

```
# Create a new data frame with the values for the independent variables
new_data <- data.frame(Year = 2019, Mileage = 45000)

# Use the model to predict the price of a 2019 car with 45000 miles
predicted_price <- predict(model, new_data)

# Print the predicted price
print(predicted_price)
```

# Other Methods

Ok, so now let’s use “data science”. Besides linear regression, there are several other techniques that can take into account the multiple variables (year, mileage, price) in my dataset. Here are some popular ones:

Decision Trees: A decision tree is a tree-like model that uses a flowchart-like structure to make decisions based on the input features. It is a popular method for both classification and regression problems, and it can handle both categorical and numerical data.

Random Forest: Random forest is an ensemble learning technique that combines multiple decision trees to make predictions. It can handle both regression and classification problems and can handle missing data and noisy data.

Support Vector Machines (SVM): SVM is a powerful machine learning algorithm that can be used for both classification and regression problems. It works by finding the best hyperplane that separates the data into different classes or groups based on the input features.

Neural Networks: Neural networks are a class of machine learning algorithms that are inspired by the structure and function of the human brain. They are powerful models that can handle both numerical and categorical data and can be used for both regression and classification problems.

Gradient Boosting: Gradient boosting is a technique that combines multiple weak models to create a stronger one. It works by iteratively adding weak models to a strong model, with each model focusing on the errors made by the previous model.

All of these techniques can take multiple variables into account, and each has its strengths and weaknesses. The choice of technique depends on the specific nature of the problem and the data. It is often a good idea to try several techniques and compare their performance to see which one works best.

I’m going to use random forest and a decision tree model.

# Random Forest

```
# Load the randomForest package
library(randomForest)

# Columns: "Title", "Desc", "Mileage", "Price", "Year", "Source"

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Fit a random forest model
model <- randomForest(Price ~ Year + Mileage, data = train_data, ntree = 500)

# Predict the prices for the test data
predictions <- predict(model, test_data)

# Calculate the mean squared error of the predictions
mse <- mean((test_data$Price - predictions)^2)

# Print the mean squared error
cat("Mean Squared Error:", mse)
```

The output from the random forest model indicates a mean squared error (MSE) of 17.14768 and a variance explained of 88.61%. A lower MSE indicates a better fit of the model to the data, while a higher variance explained indicates that the model accounts for a larger portion of the variation in the target variable.

Overall, an MSE of 17.14768 is reasonably low and suggests that the model has a good fit to the training data. A variance explained of 88.61% suggests that the model is able to explain a large portion of the variation in the target variable, which is also a good sign.

The random forest model, however, predicts a price of \$37,276.54.

I also tried cross-validation techniques to get a better understanding of the model’s overall performance (MSE 33.890). Switching to a decision tree model raised the MSE to 50.91. Plain linear regression works just fine.
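My analysis was in R, but the idea behind the cross-validation step can be sketched in Python with NumPy on synthetic stand-in data (the data-generating numbers below are assumptions, not my scraped dataset):

```python
import numpy as np

rng = np.random.default_rng(123)

# Synthetic stand-in data (NOT the scraped dataset): price in $'000
# rises with model year, falls with mileage, plus noise.
n = 174
year = rng.integers(2016, 2023, n)
mileage = rng.uniform(5_000, 90_000, n)
price = 40 + 2.5 * (year - 2016) - 0.0002 * mileage + rng.normal(0, 3, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), year, mileage])

def kfold_mse(X, y, k=5):
    """Average held-out MSE of an ordinary-least-squares fit over k folds."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        mses.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(mses))

print(f"5-fold CV MSE: {kfold_mse(X, price):.2f}")
```

Held-out MSE is the honest number: a model can fit its training data well (low in-sample MSE) and still generalize poorly.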

However, I was worried that I was comparing the Momentum to the higher trim options. So, to get the trim, I tried the following prompt in GPT-4 to map each listing’s description to one of the four trims.

```
don't tell me the steps, just do it and show me the results.
given this list add, a column (via csv) that categorizes each one into only five categories Momentum, R-Design, Inscription, Excellence, or Unknown
```

That worked perfectly and we can see that we have mostly Momentums.

And this probably invalidates my analysis as Inscriptions (in blue) do have clearly higher prices:

We can see the average prices (in thousands). In 2019, Inscriptions cost less than Momentums? That is probably a small-n problem, since we only have 7 Inscriptions and 16 Momentums in our dataset for 2019.

So, if we restrict our dataset to just Momentums, what would the predicted price of the 2019 Momentum be? Adding a filter and re-running the regression code above gives \$38,666, which means we still have a good/reasonable price.

# Quick Excursion

One last thing I’m interested in: does mileage or age matter more? Let’s build a new model.

```
# Create Age variable
df$Age <- 2023 - df$Year

# Fit a linear regression model
model <- lm(Price ~ Mileage + Age, data = df)

# Print the coefficients
summary(model)$coef
```

Based on the regression results, we can see that both Age and Mileage have a significant effect on Price, as their p-values are very small (<0.05). However, we can also see that Age has a larger absolute t-score (-10.15) than Mileage (-8.84), indicating that Age may have a slightly greater effect on Price than Mileage. Additionally, the estimates show that for every one-year increase in Age, the Price decreases by approximately 2.75 thousand dollars, while for every one-mile increase in Mileage, the Price decreases by approximately 0.0002 thousand dollars (or 20 cents). That is actually pretty interesting.
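As a quick unit-conversion check on those coefficients (recall the model’s price is in thousands of dollars):

```python
# Fitted coefficients taken from the regression output above; price is in $'000.
age_coef = -2.75        # $'000 per year of age
mileage_coef = -0.0002  # $'000 per mile

per_year = age_coef * 1000      # dollars per year of age
per_mile = mileage_coef * 1000  # dollars per mile driven

print(f"Depreciation: ${-per_year:.0f} per year and ${-per_mile:.2f} per mile")
```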

This isn’t that far off. According to the US government, a car depreciates by an average of \$0.17 per mile driven. This is based on a five-year ownership period, during which time a car is expected to be driven approximately 12,000 miles per year, for a total of 60,000 miles.

In terms of depreciation per year, it can vary depending on factors such as make and model of the car, age, and condition. However, a general rule of thumb is that a car can lose anywhere from 15% to 25% of its value in the first year, and then between 5% and 15% per year after that. So on average, a car might depreciate by about 10% per year.

# Code

The code originally appeared inline in the blog post; I’ve moved it all to the end.

## Basement Framing with the Shopbot

Framing around bulkheads is painful. It is hard to get everything straight and aligned, and I found the ShopBot to be very helpful. There were three problems I was trying to solve: (1) getting multiple corners straight across 30 feet, (2) having nearly no time, and (3) basic pine framing would sag over a 28″ run.

In fairness, the cuts did take a lot of time (about 2.5 hours of cutting), but I could do other work while the ShopBot milled out the pieces. I also had several hours of prep and installation, so I’m definitely slower than a skilled carpenter would be, but probably came out ahead by using this solution. Plus, I think the result is definitely straighter and more accurate. I especially need this, because my lack of skill means that I don’t have the bag of tricks available to deal with non-straight surfaces.

First, Autodesk Revit makes drawing ducts easy as part of an overall project model. The problem was that, given how the ducts were situated, the team working on the basement couldn’t simply build a frame that ran all the way to the wall because of an awkwardly placed door.

I was able to make a quick drawing in the model and cut out frames on the ShopBot. They only had to be aligned vertically, which was easy to do with the help of a laser level.

These were easy to cut out while I also made some parts for my daughter’s school project.

## Review: History of the World in Six Glasses by Tom Standage

I love history, but raw history can be a bit boring, and so I look for books that peer into the past with a different lens or narrative. Here, Tom Standage tells a popular history of the world through six beverages: beer, wine, spirits, coffee, tea, and Coca-Cola. Full of the anecdotes and stories that liven up an otherwise dry subject, I especially appreciated the new perspective added to the otherwise unrecognized history behind my drinks. The fact that water is so essential to our survival provides the necessary justification to put our drinks at the center of history. By introducing each beverage chronologically, he allows each one to tell the story of a period through local stories, global processes, and connections.

One of the first conclusions was that our beverages are much more than a means to satisfy our thirst or sweet tooth. The six glasses surveyed served as medicines, currency, social equalizers, revolutionary substances, status indicators, and nutritional supplements.

While a good book and an engaging read, I wouldn’t say my worldview was challenged or much expanded by this book. Books like this would make a fascinating magazine article (like one of those crazy long articles in the Atlantic), and I feel the story of each glass was stretched to fill a book. To save you the time, I tried to hit the highlights below and allow you to read something much more interesting like Sapiens or Abundance (review forthcoming).

# Beer

“In both cultures [Egypt and Mesopotamia], beer was a staple foodstuff without which no meal was complete. It was consumed by everyone, rich and poor, men and women, adults and children, from the top of the social pyramid to the bottom. It was truly the defining drink of these first great civilizations.” (Page 30)

Standage begins by discussing the history of beer while presenting the story of the domestication of cereal grains, the development of farming, early migrations, and the development of river valley societies in Egypt and Mesopotamia. He talks of beer as a discovery rather than an invention, and how it was first used alternately as a social drink with a shared vessel, as a form of edible money, and as a religious offering. As urban water supplies became contaminated, beer also became a safer drink. Beer became equated with civilization and was the beverage of choice from cradle to the grave. By discussing global processes such as the increase of agriculture, urban settlement, regional trade patterns, the evolution of writing, and health and nutrition, Standage provides the needed global historical context for the social evolution of beer.

# Wine

Thucydides: “the peoples of the Mediterranean began to emerge from barbarism when they learned to cultivate the olive and the vine.” (52-53)

Standage introduces wine through a discussion of early Greek and Roman society. Wine is initially associated with social class as it was exotic and scarce, being expensive to transport without breakage. The masses drank beer. Wine conveyed power, prestige, and privilege. Wine then came to embody Greek culture and became more widely available. It was used not only in the Symposium, the Greek drinking parties, but also medicinally to clean wounds and as a safer drink than water. Roman farmers combined Greek influence with their own farming background through viticulture, growing grapes instead of grain which they imported from colonies in North Africa. It became a symbol of social differentiation and a form of conspicuous consumption where the brand of the wine mattered. With the fall of the Roman Empire, wine continued to be associated with Christianity and the Mediterranean. Global processes highlighted here include the importance of geography, climate and locale, long distance trade, the rise and fall of empires, the movement of nomadic peoples, and the spread of religion.

# Spirits

“Rum was the liquid embodiment of both the triumph and the oppression of the first era of globalization.” (Page 111)

First, I needed this book to force me to consider the difference between beer, wine and spirits. Here is how I keep it straight:

As far as I can tell, there are three big divisions in the world of adult beverages: beers, wines, and spirits. These typically contain between 3% and 40% alcohol by volume (ABV).

Beer (alcohol content: generally 4%–6% ABV) and wine (alcohol content: 9%–16% ABV) are alcoholic beverages produced by fermentation.

Beer is generally composed of malted barley and/or wheat and wine is made using fermented grapes. Simple enough. Also remember that Ales are not Lagers. Ale yeasts ferment at warmer temperatures than do lager yeasts. Ales are sometimes referred to as top fermented beers, as ale yeasts tend to locate at the top of the fermenter during fermentation, while lagers are referred to as bottom-fermenting by the same logic.

Beer and wine have low alcohol content. (And I only drink these.) So, while they are alcoholic drinks, they aren’t included in the general definition of ‘liquor’, which is just a term for drinks with ABVs higher than 16 or so percent.

To be clear, fermentation is a metabolic process that converts sugar to acids, gases or alcohol. It occurs in yeast and bacteria, but also in oxygen-starved muscle cells, as in the case of lactic acid fermentation. Fermentation is also used more broadly to refer to the bulk growth of microorganisms on a growth medium, often with the goal of producing a specific chemical product.

By the way, it was news to me that Champagne is just a specific variant of wine. More specifically, Champagne is a sparkling (carbonated) wine produced from grapes grown in the Champagne region of France following rules that demand secondary fermentation of the wine in the bottle to create carbonation.

Now, back to this section of the book. Whisky, Rum, Brandy, Vodka, Tequila are all what we call ‘Spirits’ or ‘Liquor’ and they can really crank up the ABV.

Spirits (aka Liquor or Distilled beverage) are beverages prepared using distillation. Distillation is just further processing of fermented beverage to purify and remove any diluting components like water. This increases the proportion of their alcohol content and that’s why they are also commonly known as ‘Hard Liquor’. Distilled beverages like whisky may have up to 40% ABV. (wow)

This was strange for me. I’ve always considered wine production to be the highest art in beverage production, but you can think of distilled spirits as a more “refined” counterpart of the more “crude” fermented beverages.

Standage focuses less on the basic content above, and gives us the history that got us here. He introduces the fact that the process of distillation was developed by the Arabs in Cordoba to allow the miracle medicine of distilled wine to travel better. He talks of how this idea spread via the new printing press, leading to the development of whiskey and, later, brandy. Much detail is provided on the spirits, slaves, and sugar connection, where rum was used as a currency for slave payment. Sailors drank grog (watered-down rum), which helped to alleviate scurvy.

He argues that rum was the first globalized drink of oppression. Its popularity in the colonies, where there were few other alcoholic beverage choices, led to distilling in New England. This, he argues, began the trade wars which resulted in the molasses act, the sugar act, the boycotts of imports, and a refusal to pay taxes without representation. Indeed, he wonders whether it was rum rather than tea that started the American Revolution. He also discusses the impact of the whiskey rebellion. The French fur traders’ use of brandy, the British use of rum, and the Spanish use of pulque all point to how spirits were used to conquer territory in the Americas. Spirits became associated not only with slavery, but also with the exploitation and subjugation of natives on five continents as colonies and mercantilist economic theory was pursued.

For completeness, I wanted to summarize the difference between the different spirits out there.

Vodka is the simplest of spirits and consists almost entirely of water and ethanol. It’s distilled many times to a very high proof, removing almost all impurities, and then watered down to desired strength. Since just about all impurities are removed, I was surprised to find out that it can be made from just about anything. Potatoes, grain, or a mixture are most common. Flavored vodkas are made by adding flavors and sugars after the fact when the liquor is bottled.

Whiskey (which includes Scotches, Rye, and Bourbons) is specifically made from grain and is aged in wood casks. The grain is mixed with water and fermented to make beer and then distilled. (Yes, whiskey is first beer, a surprise to me.) The liquor comes out of the still white and is very much like vodka. The color is imparted by aging in wood casks. Different types of whiskey are separated by the grain they are made of, how they are aged, and specific regional processes. Scotches are from Scotland, are made mostly with barley, and are smoky from the way the barley is kiln-dried. Bourbons are made from at least half corn and are aged in charred barrels, which impart caramel and vanilla flavors. Rye is made from rye, and there are plenty more variations.

Gin, like the others made with grain, starts its life as beer, which is then distilled to a high proof like vodka. Aromatic herbs, including juniper berries and often gentian, angelica root, and a host of secret flavorings depending on the brand, are added to the pure spirit. The liquor is then distilled again. The second distillation leaves behind heavy bitter molecules which don’t vaporize readily, capturing only the lighter aromatics.

Rum is made by fermenting and distilling cane sugar. Traditionally made from less refined sugar, it contains aromas of the sugar cane. Originally it was an inadvertent by product of making sugar as runoff from the refinery quickly fermented. Like whiskey, some rums are aged, giving them an amber color. And, like other sprits there are regional variations with slightly different processes.

Brandy is a distilled spirit made from fruit, most commonly grapes.

Agave liquors, including tequila, mezcal, and sotol, are made from fermented sugars from the agave, a relative of aloes.

# Coffee (my favorite beverage)

Europe’s coffeehouses functioned as information exchanges for scientists, businessmen, writers and politicians. Like modern web sites. (Page 152)

Standage presents the history of coffee from its origins in the Arab world to Europe, addressing the initial controversy that the beverage generated in both locations. As a new and safe alternative to alcoholic drinks and water, some argued that it promoted rational enquiry and had medicinal qualities. Women felt threatened by it, however, arguing that due to its supposed deleterious effect on male potency, “The whole race is in danger of extinction.” Coffeehouses were places where men gathered to exchange news where social differences were left at the door. Some establishments specialized in particular topics such as the exchange of scientific and commercial ideas. Governments tried to suppress these institutions, since coffeehouses promoted freedom of speech and an open atmosphere for discussion amongst different classes of people–something many governments found threatening.

I had a weak appreciation for Coffee’s economic impact. Whole empires were built on coffee and coffeehouses formed the first stock exchanges. The Arabs had a monopoly on beans, while the Dutch were middlemen in the trade and then set up coffee plantations in Java and Suriname. The French began plantations in the West Indies and Haiti.

# Tea

The story of tea is the story of imperialism, industrialization and world domination one cup at a time. (Page 177)

# Coke

To my mind, I am in this damn mess as much to help keep the custom of drinking Cokes as I am to help preserve the million other benefits our country blesses its citizens with . . . (Page 253)

Similar to the other drinks Standage discusses, I was surprised to learn that Coca-Cola was initially a medicinal beverage. Soda water could be found in the soda fountains in apothecaries as early as 1820. John Pemberton, in Atlanta, Georgia, in 1886, developed a medicinal concoction using French wine, coca (from the Incas), and kola extract. However, he needed a non-alcoholic version because of the temperance movement, and thus Coca-Cola was born. Thanks to advertising and marketing using testimonials, a distinctive logo, and free samples, the syrup became profitable when added to existing soda fountains. By 1895 it was a national drink. Legal controversy forced it to let go of medicinal claims and left it as “delicious and refreshing.” Further challenges to the drink included the end of Prohibition, the Great Depression, and the rise of Pepsi.

With World War II, America ended isolationism and sent out 16 million servicemen with Coke in their hands. Coke sought to increase soldier morale by supplying a familiar drink to them abroad. To cut down on shipping costs, only the syrup was shipped, and bottling plants were set up wherever American servicemen went. Quickly, Coke became synonymous with patriotism. After the war, there were attacks of “Coca-colonization” by French communists in the midst of the Cold War. The company responded by arguing that “coca cola was the essence of capitalism,” representing a symbol of freedom since Pepsi had managed to get behind the “iron curtain.” Ideological divides continued as Coca-Cola was marketed in Israel while the Arab world became dominated by Pepsi. Coca-Cola represents the historical trend of the past century towards increased globalization, and its history raises reader awareness of global processes of industrialization, mass transportation, mass consumerism, global capitalism, conflict, the Cold War, and ideological battles.

# Water?

Standage concludes the book by posing the question of whether water will be the next drink whose story will need to be told. He cites not only the bottled water habit of the developed world, but the great divide in the world being over access to safe water. He also notes water’s role as the root of many global conflicts.

## Code beats Bureaucracy: Tax Form Automation With Ruby and FDF

The City of Kettering decided to tell me they wanted my Schedule E’s from 2007 to 2012 and an income tax return filled out for each of those years. We have a rental house there, and had no idea we needed to file a local tax return. I hate manual data entry and wanted to fill out my forms using Ruby and pdftk. Yes, this is Rube Goldberg at its finest, but I work a lot with PDFs and wanted to learn how to do this quickly. I’ve decided that programmatic PDF management is one of those modern skills, like typing, that I need to master, and I’ve already made an investment in Ruby. (Just learning to use the python script PDFconcat is a great lesson in how a little learning can save a lot of time.)

I started with (random) data in this form, which represents a yearly loss on my rental house. I was able to pull up my schedule E’s since we have been paperless since 2002. I use yep to assign tags for all my files so I could pull them up quickly. Data below is made up, but in the same format as the real data.

```2007|10
2008|12
2009|22
2010|20
2011|107
2012|388
```

And I need to populate [this form](http://dev.ci.kettering.oh.us/wp-content/uploads/2013/06/TAX_2013-Kettering-Individual-Return-No-Dates.pdf):

```wget http://dev.ci.kettering.oh.us/wp-content/uploads/2013/06/TAX_2013-Kettering-Individual-Return-No-Dates.pdf
```

Here is a log of my attempt (keeping a log helped me stay focused and finish as fast as possible).

### Start: 14:44 on Sunday PM

Several Google queries identified that I wanted to use pdftk plus nguyen, a very lightweight Ruby library that fills PDF forms using XFDF/FDF via pdftk.

I had to install an older version of Ruby (1.9.3-p448) and then clone the repo:

```
rvm install ruby-1.9.3-p448
git clone git@github.com:joneslee85/nguyen.git
```

### 14:54

Wow, the form is done pretty crappily:

```
irb(main):002:0> require '../../lib/nguyen'
=> true
irb(main):003:0> p = Nguyen::PdftkWrapper.new 'pdftk'
=> #<Nguyen::PdftkWrapper:0x007fa72d88def8 @pdftk="pdftk", @options={}>
irb(main):005:0> d = Nguyen::Pdf.new('tax.pdf', p)
=> #<Nguyen::Pdf:0x007fa72b126928 @path="tax.pdf", @pdftk=#<Nguyen::PdftkWrapper:0x007fa72d88def8 @pdftk="pdftk", @options={}>>
irb(main):006:0> d.fields
=> ["Occupation", "Occupation_2", "undefined", "undefined_2", "undefined_3", "undefined_4", "undefined_6", "undefined_7", "undefined_8", "undefined_9", "undefined_10", "undefined_11", "undefined_12", "undefined_14", "undefined_15", "undefined_16", "undefined_17", "undefined_18", "undefined_19", "Date", "Date_2", "Date_3", "undefined_21", "undefined_22", "undefined_23", "NAME_2", "ADDRESS", "ADDRESS_2", "undefined_24", "AMOUNTA", "AMOUNTB", "undefined_25", "undefined_26", "undefined_27", "undefined_28", "undefined_29", "undefined_30", "undefined_31", "undefined_32", "undefined_33", "Address", "l100", "l101", "l102", "l103", "l105", "l106", "undefined_5", "t101", "t102", "t103", "t104", "NAME", "t105", "t106", "t107", "t108", "t109", "t110", "t111", "t112", "l200", "l201", "l202", "l203", "t113", "t114", "cb1", "cb2", "cb3", "cb4", "t1", "undefined_13", "l1", "l104", "b1", "b2"]
```

### 15:02

Boom! You can figure out Acrobat form-field names through Forms -> Edit. Looking at this, I now feel good about writing a script because there is so much duplication. Here is the list of fields I need to fill (dummy data below):

• “TAX YEAR” -> current_year
• cb2 -> true
• t1 -> “Not aware”
• cb3 -> true
• Address -> “123 Main Street, Alexandria, VA 22304”
• l100 -> “123-45-1111”
• Occupation -> “USAF”
• “City of Income” -> “Alexandria, VA”
• l101 -> “245-28-2822”
• Occupation_2 -> “Physical Therapist”
• “City of Income_2” -> “Alexandria, VA”
• “Phone Number” -> “571-281-2822”
• undefined_4 -> amount_of_loss
• undefined_5 -> amount_of_loss
• l102 -> 0
• undefined_10 -> 0
• undefined_11 -> 0
• l103 -> 0
• l106 -> 0
• Date -> Date.today
• Date_2 -> Date.today
• NAME -> “Kettering Rental House”
• t106 -> “Kettering, OH 45202”
• l200 -> amount_of_loss
• undefined_24 -> amount_of_loss
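That mapping translates almost mechanically into a Ruby method. Below is a sketch of how I assembled the field hash (dummy values as above, some fields elided for brevity); the actual fill call is commented out because the `Nguyen::Xfdf` / `fill_form` usage is my assumption based on nguyen’s pdf-forms heritage, not something captured in this log:

```ruby
require 'date'

# Build the form-field hash for one tax year (same dummy values as the list above).
def build_fields(year, amount_of_loss)
  today = Date.today.to_s
  {
    'TAX YEAR'       => year,
    'cb2'            => true,
    't1'             => 'Not aware',
    'cb3'            => true,
    'Address'        => '123 Main Street, Alexandria, VA 22304',
    'l100'           => '123-45-1111',
    'Occupation'     => 'USAF',
    'City of Income' => 'Alexandria, VA',
    'undefined_4'    => amount_of_loss,
    'undefined_5'    => amount_of_loss,
    'l102' => 0, 'undefined_10' => 0, 'undefined_11' => 0,
    'l103' => 0, 'l106' => 0,
    'Date'   => today,
    'Date_2' => today,
    'NAME'   => 'Kettering Rental House',
    't106'   => 'Kettering, OH 45202',
    'l200'         => amount_of_loss,
    'undefined_24' => amount_of_loss,
    # ... remaining spouse/phone fields from the list above omitted for brevity
  }
end

# Producing one filled PDF per year would then look something like
# (untested sketch; Xfdf/fill_form signatures are my assumption):
#
#   pdftk = Nguyen::PdftkWrapper.new('pdftk')
#   losses.each do |year, loss|
#     xfdf = Nguyen::Xfdf.new(build_fields(year, loss))
#     pdftk.fill_form('tax.pdf', "kettering-#{year}.pdf", xfdf)
#   end
```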

### 16:20: frustrated, can’t get Ruby syntax to work with a heredoc

This was just silly. I should have known how to load an array of text . . .
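For posterity, the trick I was missing: Ruby lets you chain methods directly off the heredoc opener, so loading the lines into an array is a one-liner (a minimal sketch, not my original script):

```ruby
# Chain off the heredoc opener; each line becomes a [year, loss] pair.
rows = <<-DATA.each_line.map { |line| line.strip.split('|') }
2007|10
2008|12
2009|22
DATA

rows.first  # => ["2007", "10"]
```

The `<<-DATA` form (rather than `<<DATA`) also allows the terminator to be indented, which matters inside a method body.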

Pretty cool.

## Setting up the Aeon Labs Aeotec Z-Wave Smart Energy Monitor

I struggled for a while trying to set up the Aeon Labs Aeotec Z-Wave Smart Energy Monitor to track my electricity use. The manual, and really any instructions at all, were difficult to find online.

The one article that was absolutely necessary explained how to pair the device; after reading it, pairing was pretty trivial.

Useful references (part number: DSB09104-ZWUS):

• the developer’s manual (great details)
• the manufacturer’s site (marginally useful)
• the ‘manual’
• the Amazon page

Key advice is to wait a while after installation. I can’t get anything from the Watts reading yet, but I can read each clamp regularly. While I look into that, you can still see what is going on.

## WordPress Automatic Updates

If you run your own server and want WordPress to accept automatic updates, you want the flexibility to skip the manual steps every time an update arrives, without giving up good security. I had to do this recently on several sites and thought I would share my notes.

## Defense Acquisition Certification

Here is a post that I hope helps others who feel paralyzed about pursuing professional acquisition certifications. What is the official name and background of this program? The official name is the Acquisition Professional Development Program (APDP), which promotes the development and sustainment of a professional acquisition workforce; it is DoD-wide, though I went through it in the Air Force. You need it because certain jobs will require it, and good acquisition organizations take it seriously: it is an easy way to weed folks out of future jobs.

### Where are the best places for information? Here are the links I found useful:

• AF acquisition careers: you can find an overview of the program and the useful sites here.
• What are the requirements for each level? Follow the guidelines for your discipline here: dap.dau.mil.
• How do I know what level I am? Go to the Acquisition Career Management System (you might need to go through AFPC Secure first); the purpose is to view My Civilian APDP Record.
• What is the continuous learning requirement? 80 points over two years.

## Continuous Learning Status

My status is “CURRENT”. My last suspense was 2012-07-25 (for what?). POINTS TO DATE: 34 (what does this mean?). SUSPENSE: 2014-07-25 (this requires attention; what does that mean?)

### Current plan?

I need to take the following:

• LOG 103
• SYS 101 (just as a pre-req)
• SYS 202
• SYS 203
• CLE 003

```
Welcome TIMOTHY - here is a summary of your progress toward earning 80 Continuous Learning points (CLPs) every 24 months:

The Personnel System shows that you are in an Acquisition Coded position, and you are required to earn 80 CL points within 24 months.

Currently, your CL suspense date is:     7/25/2014
ACQNow CL points earned this period:     34
Points needed:                                    46

If you do not have any upcoming CL events scheduled, you might consider the following methods of earning points to help you meet the goal:
```

### What/where is a list of different types of certification levels you can get?

• Contracting
• Systems Engineering
• Financial Management
• Program Management
• Information Technology
• Logistics
• Scientific Research and Development
• Test and Evaluation
• Production, Manufacturing & Quality Assurance

### So I need to get certified in Systems Engineering

#### Level 1 (Done)

• Acq 101 (done)
• Sys 101 (done)
• CLE 001
• CLM 017

#### Level 2

• ACQ 201 A/B (done)
• LOG 103 (20 CLP) (working now)
• SYS 202 (9 CLP) (done)
• SYS 203 (36 CLP)
• CLE 003

• 2 Year Experience, BS

#### Level 3

• SYS 302
• CLE 012
• CLE 068
• CLL 008

• 4 year experience