All those wireless Mice! (from Taiwan)

If you are like us, you have too many mice and keyboards lying around. With five or six devices, and an equal number of receivers, you can just plug things in until you find what matches. That assumes the peripherals are working and haven't been lost. Can we interrogate the USB receivers and discover their associated hardware?

First, the findings: I tried four random USB receivers and discovered their purpose and matching peripherals (Amazon Basics and Logitech) via the Windows PowerShell commands I share below. The interesting thing is that each one was made by a different company from northern Taiwan. These are huge companies that I'd never heard of, each taking parts from the TSMC ecosystem and putting them into devices. The Amazon Basics companies were Pixart Imaging Inc., MOSART Semiconductor Corporation, and Chicony Electronics.

Pixart Imaging, Inc

If you want to follow the tech details below, I'm eager to show how I figured out, for the three different receivers, what these black boxes are for.

All three of these companies were close to each other on the north side of Taiwan.

What are the numbers that identify these receivers?

First, some background. Each device has a ClassGuid, a unique identifier assigned to a device class in the Windows operating system. A device class is a group of devices that have similar characteristics and perform similar functions. The ClassGuid provides a way for Windows to recognize and manage different types of devices connected to the system.

When you look into something small like this, the well goes deep. The Logitech part had external serial numbers and government spectrum identifiers; the Amazon Basics parts required me to use Windows PowerShell to learn about them.

A ClassGuid is a UUID, a format originally defined by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE).

Today, the UUID specification is maintained by the Internet Engineering Task Force (IETF) as part of the Request for Comments (RFC) series. The current UUID specification can be found in RFC 4122, titled “A Universally Unique IDentifier (UUID) URN Namespace.”

RFC 4122 defines the structure and generation algorithms for UUIDs, ensuring that the generated identifiers are unique across different systems and over time. It covers various UUID versions that use different methods for generating unique values, such as time-based, random-based, or name-based algorithms.

Each device class in Windows has its own ClassGuid, which is typically a hexadecimal string enclosed in braces (e.g. {4d36e967-e325-11ce-bfc1-08002be10318}). The ClassGuid is used to associate a device driver with the corresponding device class, allowing the operating system to communicate with the device and perform necessary functions.

For example, when you connect a USB device to your computer, Windows uses the ClassGuid to determine which driver to use to communicate with the device. If the device is a Human Interface Device (HID), such as a mouse or keyboard, Windows will use the HID driver associated with the {745a17a0-74d3-11d0-b6fe-00a0c90f57da} ClassGuid. If the device is a USB mass storage device, Windows will use the storage driver associated with the {4d36e967-e325-11ce-bfc1-08002be10318} ClassGuid.

The ClassGuid for USB receivers, or USB dongles, will depend on the type of device.

For example, if you have a USB Wi-Fi adapter or a Bluetooth dongle, their ClassGuids will be either:

  • USB Wi-Fi adapter: {4d36e972-e325-11ce-bfc1-08002be10318} (Network adapters)
  • Bluetooth dongle: {e0cbf06c-cd8b-4647-bb8a-263b43f0f974} (Bluetooth radios)

However, if you have a USB receiver for a wireless mouse or keyboard, its ClassGuid will be that of a Human Interface Device (HID), which is {745a17a0-74d3-11d0-b6fe-00a0c90f57da}.

First Dongle, by Logitech

First, you can look at the device and read a couple of numbers (I could only do this by taking pictures with my iPhone, zooming in, and using its character recognition).

Looking at the outside, I see FCC ID JNZCU0010, which refers to a specific device authorized by the Federal Communications Commission (FCC) in the United States. The FCC ID is an alphanumeric code assigned to devices that have been tested and comply with the FCC's requirements for radio frequency interference, safety, and other regulatory standards.

The FCC ID JNZCU0010 belongs to the Logitech Unifying Receiver, which is a wireless USB receiver that connects multiple compatible Logitech devices, such as keyboards and mice, to a single computer using a 2.4 GHz wireless connection. This receiver is particularly useful for reducing the number of USB ports occupied by input devices and for providing a seamless connection experience between compatible Logitech peripherals.

You can find more information about the device by searching the FCC’s equipment authorization database with the FCC ID. The database typically contains documents related to the device’s testing, internal and external photos, and user manuals. You can access the database here: https://www.fcc.gov/oet/ea/fccid

The “IC” code, in this case, “IC: 4418A-CU0010”, refers to an Industry Canada (IC) certification number for the device. Industry Canada is the Canadian government agency responsible for regulating and certifying radio frequency devices, ensuring that they meet the necessary requirements and do not cause harmful interference.

Similar to the FCC ID in the United States, the IC certification number identifies a specific device tested and approved for use in Canada. The IC number “4418A-CU0010” is associated with the same Logitech Unifying Receiver that the FCC ID JNZCU0010 refers to.

So, both the FCC ID JNZCU0010 and the IC number 4418A-CU0010 identify the Logitech Unifying Receiver, confirming that it has been tested and certified for use in both the United States and Canada. This is enough to learn all I need about this receiver. A quick Bing/Google search tells you it is compatible with all Logitech Unifying products, which include mice and keyboards that display the Unifying logo. You can connect up to six compatible keyboards and mice to one computer with a single Unifying receiver.

Active Interrogation for the unmarked receivers

While I'm a Linux user, I've learned a bit of Windows PowerShell since that is where the hardware is directly connected. Let's use Windows to find out about the unmarked black boxes that Amazon uses. Hint: it's going to take us into the crazy Taiwanese CMOS ecosystem.

The first command below uses the "Get-PnpDevice" cmdlet to get a list of all the Plug and Play (PnP) devices currently connected to your computer. The "-PresentOnly" parameter limits the results to devices that are currently connected. The "Where-Object" cmdlet then filters the results to devices with a ClassGuid that begins with '{', which is the format of all ClassGuids.

The “Select-Object” cmdlet is used to select the “ClassGuid”, “FriendlyName”, “Manufacturer”, “Description”, and “DeviceID” properties of each device. The “-Unique” parameter ensures that each device is only listed once.

When you run this command, you will see a list of all the ClassGuids associated with their respective devices. The output will include the device’s FriendlyName, which is a human-readable name for the device; the Manufacturer, which is the company that produced the device; the Description, which is a brief description of the device’s function; and the DeviceID, which is a unique identifier for the device. This additional information should give you a better understanding of what each connected device is and what its purpose is.

You can paste these commands into PowerShell and learn a lot:

Get-PnpDevice -PresentOnly | Where-Object {$_.ClassGuid -like '{*}'} | Select-Object ClassGuid, FriendlyName, Manufacturer, Description, DeviceID -Unique

or to find what we are looking for:

Get-PnpDevice -PresentOnly | Where-Object {($_.ClassGuid -eq '{745a17a0-74d3-11d0-b6fe-00a0c90f57da}') -or ($_.ClassGuid -eq '{4d36e96f-e325-11ce-bfc1-08002be10318}' -and $_.Description -like '*USB Receiver*')} | Select-Object ClassGuid, FriendlyName, Manufacturer, Description, DeviceID -Unique

The next command queries Win32_USBControllerDevice for USB devices and then retrieves detailed information on each one. The output displays the PNPDeviceID and the Description for each USB device.

You can then check the output for the relevant device and see if the model number (CU0010) is mentioned in the device description or the PNPDeviceID. If the model number is not explicitly mentioned, you might be able to find a unique identifier for the device that can help you confirm the model number through an online search or by checking the device documentation.

So if we run:

Get-WmiObject Win32_USBControllerDevice -Impersonation Impersonate | Foreach-Object { [Wmi]$_.Dependent } | Select-Object PNPDeviceID, Description

So I plug in a mystery receiver (a bad idea) and get this output on my machine. Note the VID (vendor ID) and PID (product ID). You can look up parts based on those.

USB\VID_3938&PID_1059\6&208E681&0&3                USB Composite Device
USB\VID_3938&PID_1059&MI_00\7&E30261B&0&0000       USB Input Device
HID\VID_3938&PID_1059&MI_00\8&3483BAC7&0&0000      HID Keyboard Device
USB\VID_3938&PID_1059&MI_01\7&E30261B&0&0001       USB Input Device
HID\VID_3938&PID_1059&MI_01&COL01\8&C190CFE&0&0000 HID-compliant mouse
HID\VID_3938&PID_1059&MI_01&COL02\8&C190CFE&0&0001 HID-compliant consumer control device
HID\VID_3938&PID_1059&MI_01&COL03\8&C190CFE&0&0002 HID-compliant system controller
HID\VID_3938&PID_1059&MI_01&COL04\8&C190CFE&0&0003 HID-compliant vendor-defined device
HID\VID_3938&PID_1059&MI_01&COL05\8&C190CFE&0&0004 HID-compliant device

The  USB Implementers Forum (USB-IF) assigns vendor ID 3938 to MOSART Semiconductor Corporation, a Taiwan-based company that designs and develops integrated circuits (ICs) such as consumer ICs, PC peripheral ICs, and wireless consumer ICs.
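Since the VID/PID pair is buried in each device ID string, a few lines of Python can pull them out for lookup (a quick sketch; the IDs are pasted from the output above):

import re

# Pull the vendor and product IDs out of Windows PNPDeviceID strings
device_ids = [
    r"USB\VID_3938&PID_1059\6&208E681&0&3",
    r"HID\VID_3938&PID_1059&MI_01&COL01\8&C190CFE&0&0000",
]

pattern = re.compile(r"VID_([0-9A-F]{4})&PID_([0-9A-F]{4})", re.IGNORECASE)

for dev in device_ids:
    m = pattern.search(dev)
    if m:
        vid, pid = m.groups()
        # Look these up in a USB ID database (the USB-IF assigns the vendor IDs)
        print(f"{dev}: vendor={vid}, product={pid}")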

Through some googling, I can find out that this is connected to my Amazon Basics keyboard:

Brand: Amazon Basics
Item model number: KS1-US
Operating System: Windows 7
Item Weight: 1.05 pounds
Product Dimensions: 5.61 x 1.13 x 17.83 inches
Item Dimensions LxWxH: 5.61 x 1.13 x 17.83 inches
Color: Black
Batteries: 2 AAA batteries required (included)
Manufacturer: Amazon Basics
ASIN: B07WV5WN7B
Country of Origin: China
Date First Available: November 11, 2019

It’s strange that most components are made in Taiwan, but Country of Origin is listed as China.

A random dongle

Let's try the next one. I connect a USB receiver with absolutely zero markings on it and find:

USB\VID_04F2&PID_1825\6&208E681&0&4           USB Input Device
HID\VID_04F2&PID_1825&COL01\7&295CC939&0&0000 HID-compliant mouse
HID\VID_04F2&PID_1825&COL02\7&295CC939&0&0001 HID-compliant vendor-defined device

The new device appears to be a USB Input Device with VID (Vendor ID) "04F2" and PID (Product ID) "1825". VID "04F2" belongs to Chicony Electronics, a company that manufactures computer peripherals such as keyboards and mice. So where is 群光電子股份有限公司 (Chicony Electronics Co., Ltd.) located?

Just another 50-story building in a faraway place.

In this case, the USB Input Device seems to be a multi-function device, as it includes both an HID-compliant mouse and an HID-compliant vendor-defined device. The device might be a mouse with additional features or a keyboard with an integrated touchpad. I had to set it aside and marvel at the world's CMOS supply chain.

Trying my last device:

USB\VID_093A&PID_2510\5&7993A9C&0&2 - USB Input Device
HID\VID_093A&PID_2510&COL01\6&5B1E42D&1&0000 - HID-compliant mouse
HID\VID_093A&PID_2510&COL02\6&5B1E42D&1&0001 - HID-compliant vendor-defined device
HID\VID_093A&PID_2510&COL03\6&5B1E42D&1&0002 - HID-compliant consumer control device
HID\VID_093A&PID_2510&COL04\6&5B1E42D&1&0003 - HID-compliant system controller

So the vendor ID belongs to Pixart Imaging, Inc. and the product ID corresponds to an "optical mouse".

Pixart Imaging is a Taiwan-based company founded in 1998 that specializes in designing and manufacturing CMOS (Complementary Metal-Oxide-Semiconductor) image sensors and related imaging application products. These components are commonly used in various devices, such as optical mice, digital cameras, webcams, and other consumer electronics.

One of the key products that Pixart Imaging is known for is its optical navigation sensors used in computer mice. These sensors replaced traditional mechanical ball-tracking mechanisms, allowing for more accurate and responsive cursor movement. The company’s optical sensors are widely used by different mouse manufacturers due to their high performance, low power consumption, and cost-effectiveness.

In addition to optical sensors, Pixart Imaging also offers a range of other products, such as capacitive touch controllers, fingerprint identification modules, and gesture recognition solutions. These components are utilized in a variety of applications, including smartphones, tablets, wearables, and IoT devices.

What are each of these entries?

  1. USB\VID_093A&PID_2510\5&7993A9C&0&2 – USB Input Device: This is a generic USB input device that can be used to transmit data between the connected device and the computer. It could represent a variety of peripherals, such as keyboards or game controllers.
  2. HID\VID_093A&PID_2510&COL01\6&5B1E42D&1&0000 – HID-compliant mouse: This entry represents a Human Interface Device (HID) compliant mouse. It follows the HID protocol to communicate with the computer, allowing for plug-and-play functionality and seamless interaction between the user and the computer.
  3. HID\VID_093A&PID_2510&COL02\6&5B1E42D&1&0001 – HID-compliant vendor-defined device: This entry represents a device that adheres to the HID protocol, but its specific function is defined by the vendor. It could be a custom input device or a specialized peripheral designed for a particular purpose.
  4. HID\VID_093A&PID_2510&COL03\6&5B1E42D&1&0002 – HID-compliant consumer control device: This is a Human Interface Device that is specifically designed for consumer electronics control, such as multimedia keyboards, remote controls, or other devices used to control media playback or volume.
  5. HID\VID_093A&PID_2510&COL04\6&5B1E42D&1&0003 – HID-compliant system controller: This entry represents a device that follows the HID protocol and serves as a system controller. It could be a device like a power management controller, system monitoring device, or another type of controller that helps manage various aspects of a computer system.

So, I was able to quickly find out this matched my Amazon Basics Mouse.

Time to catch my flight.

Quick(ish) Price Check on a Car

So, is it a good price?

With my oldest daughter heading off to college soon, we’ve realized that our family car doesn’t need to be as large as it used to be. We’ve had a great relationship with our local CarMax over the years, and we appreciate their no-haggle pricing model. My wife had her eyes set on a particular model: a 2019 Volvo XC90 T6 Momentum. The specific car she found was listed at $35,998, with 47,000 miles on the odometer.

But is the price good or bad? As a hacker/data scientist, I knew I could get the data to make an informed decision, and doing analysis at home is a great way to learn and use new technologies. The bottom line: the predicted price would be $40,636, which puts the CarMax asking price about 11.4% below the prediction. If I compare to the specific trim, the price should be $38,666. So the price is probably fair. Now how did I come up with that number?

Calculations

Armed with Python and an array of web scraping tools, I embarked on a mission to collect data that would help me determine a fair value for our new car. I wrote a series of scripts to extract relevant information, such as price, age, and cost from various websites. This required a significant amount of Python work to convert the HTML data into a format that could be analyzed effectively.

Once I had amassed a good enough dataset (close to 200 cars), I began comparing different statistical techniques to find the most accurate pricing model. In this blog post, I'll detail my journey through the world of linear regression and compare it to more modern data science methods, revealing which technique ultimately led us to the fairest car price.

First, I did some basic web searching. According to Edmunds, the average price for a 2019 Volvo XC90 T6 Momentum with similar mileage is between $33,995 and $43,998 and my $35,998 falls within this range.

As for how the Momentum compares to other Volvo options and similar cars, there are a few things to consider. The Momentum is one of four trim levels available for the 2019 XC90. It comes with a number of standard features, including leather upholstery, a panoramic sunroof, and a 9-inch touchscreen infotainment system. Other trim levels offer additional features and options.

The 2019 Volvo XC90 comes in four trim levels: Momentum, R-Design, Inscription, and Excellence. The R-Design offers a sportier look and feel, while the Inscription adds more luxury features. The Excellence is the most luxurious and expensive option, with seating for four instead of seven. The Momentum is the most basic.

In terms of similar cars, some options to consider might include the Audi Q7 or the BMW X5. Both of these SUVs are similarly sized and priced to the XC90.

To get there, I had to do some web scraping and data cleaning, and build a basic linear regression model, as well as try other modern data science methods. To begin my data collection journey, I decided (in 2 seconds) to focus on three primary sources: Google's search summary, Carvana, and Edmunds.

My first step was to search for Volvo XC90 on each of these websites. I then used the Google Chrome toolbar to inspect the webpage’s HTML structure and identify the <div> element containing the desired data. By clicking through the pages, I was able to copy the relevant HTML and put this in a text file, enclosed within <html> and <body> tags. This format made it easier for me to work with the BeautifulSoup Python library, which allowed me to extract the data I needed and convert it into CSV files.

Since the data from each source varied, I had to run several regular expressions on many fields to further refine the information I collected. This process ensured that the data was clean and consistent, making it suitable for my upcoming analysis.

Finally, I combined all the data from the three sources into a single CSV file. This master dataset provided a solid foundation for my pricing analysis and allowed me to compare various data science techniques in order to determine the most accurate and fair price for the 2019 Volvo XC90 T6 Momentum.

In the following sections, I’ll delve deeper into the data analysis process and discuss the different statistical methods I employed to make our car-buying decision.

First, data from Carvana looked like this:

<div class="tk-pane full-width">
    <div class="inventory-type carvana-certified" data-qa="inventory-type">Carvana Certified
    </div>
    <div class="make-model" data-qa="make-model">
        <div class="year-make">2020 Volvo XC90</div>
    </div>
    <div class="trim-mileage" data-qa="trim-mileage"><span>T6 Momentum</span> • <span>36,614
            miles</span></div>
</div>
<div class="tk-pane middle-frame-pane">
    <div class="flex flex-col h-full justify-end" data-qa="pricing">
        <div data-qa="price" class="flex items-end font-bold mb-4 text-2xl">$44,990</div>
    </div>
</div>

To parse this, I used the BeautifulSoup library to extract the relevant data from the saved HTML file, which contained information on Volvo XC90 listings. The script (the full version is in the Code section at the end) searches for specific <div> elements containing the year, make, trim, mileage, and price details. It then cleans up the data by removing unnecessary whitespace and commas before storing it in a dictionary. Finally, the script compiles all the dictionaries into a list and exports the data to a CSV file for further analysis.
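Here is a minimal sketch of that approach against the snippet above (the class names come from the Carvana HTML; the file names and everything else are illustrative):

import csv
from bs4 import BeautifulSoup

# Parse the saved Carvana HTML (class names from the snippet above)
with open("carvana.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

years = [d.get_text(strip=True) for d in soup.find_all("div", class_="year-make")]
trims = [d.get_text(" ", strip=True) for d in soup.find_all("div", class_="trim-mileage")]
prices = [d.get_text(strip=True) for d in soup.find_all("div", attrs={"data-qa": "price"})]

cars = []
for year_make, trim_mileage, price in zip(years, trims, prices):
    trim, _, mileage = trim_mileage.partition("•")
    cars.append({
        "year": year_make.split()[0],  # "2020 Volvo XC90" -> "2020"
        "trim": trim.strip(),
        "mileage": mileage.replace("miles", "").replace(",", "").strip(),
        "price": price.replace("$", "").replace(",", ""),
    })

with open("carvana.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["year", "trim", "mileage", "price"])
    writer.writeheader()
    writer.writerows(cars)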

I could then repeat this process with Google to get a variety of local sources.

One challenge with the Google results was that I had a lot of data in the images (they were base64 encoded), so I wrote a bash script to clean up the tags using sed (pro tip: learn awk and sed).

When working with the Google search results, I had to take a slightly different approach compared to the strategies used for Carvana and Edmunds. Google results did not have a consistent HTML structure that could be easily parsed to extract the desired information. Instead, I focused on identifying patterns within the text format itself to retrieve the necessary details. By leveraging regular expressions, I was able to pinpoint and extract the specific pieces of information, such as the year, make, trim, mileage, and price, directly from the text. My scrape code is in the Code section at the end.
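As an illustration, a pattern along these lines pulls the fields out of flat text (the exact regex depends on how the results render; the sample line is hypothetical):

import re

# Hypothetical flat-text listing; real Google results vary in format
text = "2019 Volvo XC90 T6 Momentum - 47,000 miles - $35,998"

pattern = re.compile(
    r"(?P<year>(?:19|20)\d{2})\s+Volvo XC90\s+(?P<trim>.+?)\s*-\s*"
    r"(?P<mileage>[\d,]+)\s*miles\s*-\s*\$(?P<price>[\d,]+)"
)

m = pattern.search(text)
if m:
    print(m.group("year"), m.group("trim"), m.group("mileage"), m.group("price"))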

Scraping Edmunds required both approaches of using formatting and structure.

Altogether, I got 174 records of used Volvo XC90s. I could easily get 10x this since the scripts exist and I could mine Craigslist and other places. With the data I have, I can use R to explore it:

# Load the readxl package
library(readxl)
library(scales)
library(scatterplot3d)

# Read the data from data.xlsx into a data frame
df <- read_excel("data.xlsx")

df$Price<-as.numeric(df$Price)/1000

# Select the columns you want to use
df <- df[, c("Title", "Desc", "Mileage", "Price", "Year", "Source")]

# Plot Year vs. Price with labeled axes and formatted y-axis
plot(df$Year, df$Price, xlab = "Year", ylab = "Price ($ '000)",
     yaxt = "n")  # Don't plot y-axis yet

# Add horizontal grid lines
grid()

# Format y-axis as currency
axis(side = 2, at = pretty(df$Price), labels = dollar(pretty(df$Price)))

abline(lm(Price ~ Year, data = df), col = "red")
Armed with this data, we can fit a linear regression model.

This code snippet employs the scatterplot3d() function to show a 3D scatter plot that displays the relationship between three variables in the dataset. Additionally, the lm() function is utilized to fit a linear regression model, which helps to identify trends and patterns within the data. To enhance the plot and provide a clearer representation of the fitted model, the plane3d() function is used to add a plane that represents the linear regression model within the 3D scatter plot.

model <- lm(Price ~ Year + Mileage, data = df)

# Plot the data and model
s3d <- scatterplot3d(df$Year, df$Mileage, df$Price,
                     xlab = "Year", ylab = "Mileage", zlab = "Price",
                     color = "blue")
s3d$plane3d(model, draw_polygon = TRUE)

So, we can now predict the price of a 2019 Volvo XC90 T6 Momentum. The model predicts $40,636 (the code below uses 45,000 miles, close to the 47K on the CarMax car), which makes the CarMax asking price of $35,998 about 11.4% below the prediction.

# Create a new data frame with the values for the independent variables
new_data <- data.frame(Year = 2019, Mileage = 45000)

# Use the model to predict the price of a 2019 car with 45000 miles
predicted_price <- predict(model, new_data)

# Print the predicted price
print(predicted_price)

Other Methods

Ok, so now let's use "data science". Besides linear regression, there are several other techniques that can take into account the multiple variables (year, mileage, price) in the dataset. Here are some popular techniques:

Decision Trees: A decision tree is a tree-like model that uses a flowchart-like structure to make decisions based on the input features. It is a popular method for both classification and regression problems, and it can handle both categorical and numerical data.

Random Forest: Random forest is an ensemble learning technique that combines multiple decision trees to make predictions. It can handle both regression and classification problems and can handle missing data and noisy data.

Support Vector Machines (SVM): SVM is a powerful machine learning algorithm that can be used for both classification and regression problems. It works by finding the best hyperplane that separates the data into different classes or groups based on the input features.

Neural Networks: Neural networks are a class of machine learning algorithms that are inspired by the structure and function of the human brain. They are powerful models that can handle both numerical and categorical data and can be used for both regression and classification problems.

Gradient Boosting: Gradient boosting is a technique that combines multiple weak models to create a stronger one. It works by iteratively adding weak models to a strong model, with each model focusing on the errors made by the previous model.

All of these techniques can take multiple variables into account, and each has its strengths and weaknesses. The choice of which technique to use will depend on the specific nature of your problem and your data. It is often a good idea to try several techniques and compare their performance to see which one works best for your data.

I’m going to use random forest and a decision tree model.

Random Forest

# Load the randomForest package
library(randomForest)

# "Title", "Desc", "Mileage", "Price", "Year", "Source"

# Split the data into training and testing sets
set.seed(123)  # For reproducibility
train_index <- sample(1:nrow(df), size = 0.7 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Fit a random forest model
model <- randomForest(Price ~ Year + Mileage, data = train_data, ntree = 500)

# Predict the prices for the test data
predictions <- predict(model, test_data)

# Calculate the mean squared error of the predictions
mse <- mean((test_data$Price - predictions)^2)

# Print the mean squared error
cat("Mean Squared Error:", mse)

The output from the random forest model indicates that it has a mean squared error (MSE) of 17.14768 and a variance explained of 88.61%. A lower MSE value indicates a better fit of the model to the data, while a higher variance explained value indicates that the model can explain a larger portion of the variation in the target variable.

Overall, an MSE of 17.14768 is reasonably low and suggests that the model has a good fit to the training data. A variance explained of 88.61% suggests that the model is able to explain a large portion of the variation in the target variable, which is also a good sign.

However, the random forest method shows a predicted cost of $37,276.54.

I also tried cross-validation techniques to get a better understanding of the model's overall performance (MSE 33.890). Changing to a decision tree model turned the MSE into 50.91. Linear regression works just fine.
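For the curious, the same comparison is a few lines in Python with scikit-learn (a sketch, assuming the dataset exported to a cars.csv with Year, Mileage, and Price columns):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

df = pd.read_csv("cars.csv")
X, y = df[["Year", "Mileage"]], df["Price"]

# Compare 5-fold cross-validated MSE across the three model families
for name, model in [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=123)),
    ("forest", RandomForestRegressor(n_estimators=500, random_state=123)),
]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: CV MSE = {mse:.2f}")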

Adding the Trim

However, I was worried that I was comparing the Momentum to the higher trim options. So to get the trim, I tried the following prompt in GPT-4 to translate the text into one of the four trims (or Unknown).

don't tell me the steps, just do it and show me the results.
given this list add, a column (via csv) that categorizes each one into only five categories Momentum, R-Design, Inscription, Excellence, or Unknown

That worked perfectly and, as the table below shows, we have mostly Momentums. (If you'd rather not depend on an LLM for this step, a simple keyword match does the same job; see the sketch below.)
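A minimal sketch of that keyword fallback (the function is mine, not part of the original analysis):

# Deterministic alternative to the GPT-4 step: match trim names in the listing text
TRIMS = ["Momentum", "R-Design", "Inscription", "Excellence"]

def categorize_trim(description: str) -> str:
    for trim in TRIMS:
        if trim.lower() in description.lower():
            return trim
    return "Unknown"

print(categorize_trim("2019 Volvo XC90 T6 Momentum"))  # Momentum
print(categorize_trim("2020 Volvo XC90 T6 AWD"))       # Unknown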

Trim      Excellence   Inscription   Momentum   R-Design   Unknown
Count     0            68            87         8          9
Percent   0.00%        39.53%        50.58%     4.65%      5.23%

Frequency and Count of Cars

And this probably invalidates my analysis as Inscriptions (in blue) do have clearly higher prices:

Plot of Price By Year

We can see the average prices (in thousands). In 2019, Inscriptions cost less than Momentums? That is probably a small-n problem, since we only have 7 Inscriptions and 16 Momentums in our data set for 2019.

Year    R-Design   Inscription   Momentum
2014    $19.99     NA            NA
2016    $30.59     $32.59        $28.60
2017    $32.79     $32.97        $31.22
2018    $37.99     $40.69        $33.23
2019    NA         $36.79        $39.09
2020    NA         $47.94        $43.16

Average Prices by Trim (in thousands of dollars)

So, if we restrict our data set to the one trim, what would the predicted price of the 2019 Momentum be? Just adding a filter and running our regression code above gives $38,666, which means we still have a good/reasonable price.

Quick Excursion

One last thing I'm interested in: does mileage or age matter more? Let's build a new model.

# Create Age variable
df$Age <- 2023 - df$Year

# Fit a linear regression model
model <- lm(Price ~ Mileage + Age, data = df)

# Print the coefficients
summary(model)$coef
              Estimate    Std. Error   t value     Pr(>|t|)
(Intercept)   61.34913    0.69084      88.80372    2.28e-144
Mileage       -0.00022    2.44e-05     -8.83869    1.18e-15
Age           -2.75459    0.27132      -10.1525    3.15e-19

Impact of Different Variables

Based on the regression results, we can see that both Age and Mileage have a significant effect on Price, as their p-values are very small (<0.05). However, we can also see that Age has a larger absolute t-score (-10.15) than Mileage (-8.84), indicating that Age may have a slightly greater effect on Price than Mileage. Additionally, the estimates show that for every one-year increase in Age, the Price decreases by approximately 2.75 thousand dollars, while for every one-mile increase in Mileage, the Price decreases by approximately 0.00022 thousand dollars (about 22 cents). That is actually pretty interesting.

This isn’t that far off. According to the US government, a car depreciates by an average of $0.17 per mile driven. This is based on a five-year ownership period, during which time a car is expected to be driven approximately 12,000 miles per year, for a total of 60,000 miles.

In terms of depreciation per year, it can vary depending on factors such as make and model of the car, age, and condition. However, a general rule of thumb is that a car can lose anywhere from 15% to 25% of its value in the first year, and then between 5% and 15% per year after that. So on average, a car might depreciate by about 10% per year.

Code

While the code was initially inline in the original blog post, I moved it all to the end.

Carvana Scrape Code

Cleaner Code

Google Scrape Code

Edmunds' Scrape Code

Otis on Leadership

In The Shawshank Redemption, Otis "Red" goes before the parole board with promises to be better. He tries to play by the rules and repeatedly gets denied. Finally, he has had enough and lets them have it: he is done with their game; they can keep him there forever. He speaks with the tired wisdom of an old man desperate to speak sense to his younger self, bereft of the hope that a future is possible. He doesn't care, and now they approve his release.

As amusing as this story is, I’m convinced it speaks to a deeper truth. The road to excellence in leadership doesn’t end with learning what to care about, but it definitely starts with a decision to follow your convictions over learning to do what others in power want.

When I was at DARPA, a PM's goal was to get to the tech council and get the funding to make our idea happen. In several startups, I've marched a similar path to get funding. Each time, I had to navigate a maze of intermediaries, each wanting to hear specific things before I could get to the decision maker. The temptation was always present to win their approval by focusing on learning and delivering what they wanted to hear, at the expense of my core vision. Keeping this temptation in check always helped in the end, and I came to know that conviction matters more than compliance.

Ok, that’s DARPA and startups. But most of my time has been in government and big companies. Big companies don’t just have intermediaries, they have systems, processes and whole organizations that test our compliance versus conviction trade. Conversations in every company start to change as everyone pays attention to where the winds are going. Conviction is still there, but it becomes bounded by what the boss or bosses emphasize. This is a natural consequence of what leadership means. We set the emphasis of our team and that includes culture and values. Shouldn’t we want our workforce to adopt what we are projecting?

Not at the expense of core individual convictions. To get a flavor of how this can lead us off the high road: a senior leader at a former company made (the fine topic of) female empowerment a core platform of his leadership and bombarded LinkedIn with his progress in this area at the expense of any other vision or message. I took pitches every day from vendors, and vendor pitches started to include slides at the beginning that highlighted their commitment to female empowerment. This is how you end up with technology pitches that don't emphasize technology. While it was great that the boss was speaking his conviction, it was sad to watch the ecosystem around the company pander and step outside their prime value.

Tech vendors sometimes do this even when they focus on technology. When a topic becomes hot, say blockchain or machine learning, you start to hear lots of references to high-concept phrases. Do you do X? Oh, yes, we know X very well. Do you integrate with Y? Oh yes. In my research work I get to bump into real thought leaders and it's a completely different story. They question my question: "Why would you do X?" They often disagree with what I'm saying and point out my misunderstanding. I like these conversations. I like these people. They have different incentives, but they get my call back.

The best conversations are not banal agreements. Listen to a couple on their first date as they try to please each other. It’s funny watching them try and agree. It’s also a boring conversation.

Then watch the verbal tennis match of two long time friends disagreeing. “No, that’s not the best, this is . . . You’re crazy, this is . . .” In such disagreements there is life and learning and love. They deeply care about each other, but they aren’t focused on pleasing the other person. They have transitioned to something greater.

There just isn’t room for multiple things at the top of your priorities. If you focus on playing the game and optimizing the system to your advantage, you not only hurt your chances of success, but you risk any gains you make leaving you empty and not really adding up to any real change. On the other hand, if you really bottom out your convictions and decide what you really want to do, you have to take on the system. The system will fight you and may beat you down. The collective goals of that system will differ from yours and people will defend their equities in ways that give you headaches, sleepless nights and may even break you.

However, some people, and in the right culture, the right people, will watch. They will know that you care more about the impact of your principles than personal gain. They may not agree with you, but they will respect you. And when you succeed your success will have meaning and will take root. It may grow as others are inspired by your conviction and the truth of your principles.

One of my heroes, John Boyd, said you have to make a fundamental decision to the question: do you want to do something or do you want to be somebody? The magic of this is you really can have both, but you have to pick the right door. Both doors will lead to frustration. The choice to be somebody will make you an expert in what people want to hear and where the system is going. It will feed your ego with each win and teach you how to navigate a system with the right partnerships, the right things to do and the right things to believe. Each year you will risk becoming less and less of the person you once were, even if the organization rewards you.

The choice to do something driven by your convictions alone may pit you against the world. John Boyd never became a general. He never developed executive presence and took on the assignments that would get his name at the top of the promotion lists. He lived in a small apartment and his Facebook and LinkedIn page would have been boring and unnoticed. (Hint: I’m pretty sure he wouldn’t have one.) He did change the art of war and gave us our F-16, A-10, and built the science of aerial combat. He made a real impact on national security, but his lasting impact was on his acolytes who went on to change the Air Force and the DoD. His impact lives on.

The choice couldn't be more clear. Integrity, conviction and meaning over advancement and ego. It's Cincinnatus returning to his farm. It's General Marshall telling President Roosevelt that General Eisenhower was the better man to lead D-Day. It's George Washington refusing to be crowned king. It's Socrates taking the cup of hemlock. It's Martin Luther walking down the streets of Wittenberg with a piece of paper in his hand that will change the world.

The greatest parable is Solomon’s test for the two mothers who both claimed parentage of the same child: who cares more about the baby? When you truly care, you can’t lose. Without Solomon’s wisdom, the true mother would have suffered greatly. She wouldn’t have her baby to touch, to teach and to watch grow. But she could take great joy in the child’s life. There would be a chance for the truth to break free and make all things right.

Even if the other mother “won”, she would live a lie. It would be a parenthood devoid of meaning, filled with guilt and deception. Not living the lie is what bringing your full self means. It means being fully aware of the trades you will make, who you are and what you stand for.

This is all simple when described here, who wouldn’t choose the greater good when it’s described this way? But this gets complicated in Monday’s staff meeting or in your strategy review. Who will you be when you aren’t focused on the questions above? Every day you make decisions that together comprise a life. If you come to forks in the road and you haven’t been intentional about your principles and practiced applying them, then your path will be filled with extra obstacles to find meaning and you risk ending in a tale of sound and fury, in the end signifying nothing. That’s a heavy price to pay for the trophies you get from a happy system.

Making PDFs editable

I automate everything and PDFs are so easy to fill in automatically. It’s a digital world and we just live in it.

It’s so frustrating to get a “protected form” that doesn’t allow for appending documents, automated form completion or even filling out fields. It’s silly because all of these forms are printable. Many years ago, I just created an official form from scratch myself because I couldn’t find it online. Without checking a signature, forms can’t be secure. For example, one could easily write a program that takes a picture and recreates the form. Remember, in the digital world something is either secure from math (crypto) or it isn’t. The security in the transaction is provided by authenticating the user’s email. Even better, let’s move past forms and use the authentication in a website or mobile app.

Let’s do the right thing and move past forms caught in the middle ground of insecure and not usable. I think folks lock forms to prevent changes to the document, don’t do that.

Below is the code to fix this. You will need to use Linux or Windows Subsystem for Linux. While I can't think of an evil use case, use this responsibly. (My goal in doing this is always to fill data into a form.)

Use a little bash to print with Ghostscript. Pro tip: this isn't "hacking", this is just printing.

#!/bin/bash
# usage: ./unlock.sh form.pdf   (the script name is mine; writes form_clean.pdf)
filename=$1
outfile="${filename%.*}_clean.pdf"
echo "$outfile"
# Note: the old "-c .setpdfwrite" token was a no-op and modern Ghostscript rejects it
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$outfile" -f "$filename"

New Jersey School Heat Map

This morning before church, I wrote some code to build a contour plot of schools in New Jersey. The “hot” reddish regions have the best schools and the cooler regions don’t.

Tech Stuff: How it’s made

This wasn't that easy. Since I had all the code from my previous post, it should have been very straightforward to make another contour map. However, to get this map done, I had three tech challenges to overcome: the GreatSchools API omitted schools when searching from geo-coördinates, a number of locations didn't have any schools, and I had to mix Python and Ruby code in a way that exchanged values.

Fixing GreatSchools Omitted Schools

One of the biggest challenges came from the GreatSchools API. While the GS nearby API provides schools within a specified radius of a geographic position, I noticed that schools weren't showing up within the radius specified. Fortunately, I was familiar with Google's Geocoding API, whose reverse geocode feature provides a town name for a given spot. This was far from straightforward since Google provides multiple layers of data for each coordinate. On top of that, town names can be locality, postal_town, administrative_area_level_2, or administrative_area_level_1. This necessitated the following code from Geoff Boeing:

def parse_city(geocode_data):
    if (not geocode_data is None) and ('address_components' in geocode_data):
        for component in geocode_data['address_components']:
            if 'locality' in component['types']:
                return component['long_name']
            elif 'postal_town' in component['types']:
                return component['long_name']
            elif 'administrative_area_level_2' in component['types']:
                return component['long_name']
            elif 'administrative_area_level_1' in component['types']:
                return component['long_name']
    return None

Fixing locations with no schools

Additionally, many points were not associated with a school. While bodies of water naturally had no school, the Great Schools API failed to report any school for many locations. Since these data didn't exist, I populated the matrix with NumPy null values or "NaNs". These showed up as blank regions on the map and wreaked havoc with the contour lines. To fix this, I interpolated in two dimensions using NumPy's ma.masked_invalid feature followed by SciPy's interpolate.griddata capability with cubic interpolation. (Wow, Python has a lot of math available on demand.)
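The pattern looks roughly like this (a sketch with a made-up ratings grid standing in for the real data):

import numpy as np
from scipy import interpolate

# Hypothetical grid of school ratings with gaps (NaN) where the API returned nothing
grid = np.array([[8.0, np.nan, 6.5, 7.0],
                 [7.5, 7.0, np.nan, 6.8],
                 [np.nan, 6.0, 5.5, 6.1],
                 [6.2, np.nan, 5.0, 5.8]])

masked = np.ma.masked_invalid(grid)
yy, xx = np.mgrid[0:grid.shape[0], 0:grid.shape[1]]
known = ~masked.mask

# Interpolate from the known points back onto the full grid
filled = interpolate.griddata(
    (yy[known], xx[known]), masked[known].ravel(),
    (yy, xx), method="cubic")
print(filled)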

Mixed Python and Ruby

The final challenge was connecting to the Great Schools API. I could connect with Python’s default tools, and parse the result with lxml, but that meant building each request. Fortunately, there was a ruby wrapper ready for use, but I had to call it from Python. Python’s subprocess with check_output did the trick, but due to the fickle nature of the API, I had to add a bunch of exception handling.
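That glue code amounted to something like this (a sketch; the Ruby script name and arguments are placeholders, not the real wrapper):

import subprocess

def schools_near(lat, lon, radius_miles=5):
    """Call the Ruby GreatSchools wrapper and return its raw output (or None)."""
    try:
        out = subprocess.check_output(
            ["ruby", "nearby_schools.rb", str(lat), str(lon), str(radius_miles)],
            stderr=subprocess.STDOUT, timeout=30)
        return out.decode("utf-8")
    except subprocess.CalledProcessError as exc:
        # The API fails intermittently, so log and move on
        print(f"lookup failed at ({lat}, {lon}): {exc.output.decode('utf-8', 'replace')}")
        return None
    except subprocess.TimeoutExpired:
        print(f"lookup timed out at ({lat}, {lon})")
        return None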

I’m hoping to make a map like this for northern Virginia soon.

https://gist.github.com/tbbooher/27092a4aa3d6f013e0e07a558c21f5fb


Tax time automation

Happy New Year! Bring in the new year with some code to transform USAA bank statements into a set of transactions for 2016. It is a little tricky because there are some 2015 transactions in the list and USAA puts transactions on two lines.

To use it, just save the bank statement PDF as text and run this script. Then you can open the resulting CSV in Excel. You will need a basic version of Ruby installed.
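The script itself is in Ruby, but the core idea is easy to sketch in Python (illustrative only; the real statement layout dictates the regex and file names):

import csv
import re

# Merge USAA's two-line transactions and keep only 2016
date_re = re.compile(r"^(\d{2}/\d{2}/(\d{4}))\s+(.*)")

transactions, pending = [], None
with open("statement.txt") as f:
    for line in f:
        line = line.rstrip()
        m = date_re.match(line)
        if m:
            if pending:
                transactions.append(pending)
            date, year, rest = m.group(1), m.group(2), m.group(3)
            # Drop the stray 2015 transactions in the list
            pending = [date, rest] if year == "2016" else None
        elif pending and line:
            pending[1] += " " + line.strip()  # second line of a transaction
    if pending:
        transactions.append(pending)

with open("transactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "description"])
    writer.writerows(transactions)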

July 4th Home Project: Thermostat Bracket

This post is about how to use design tools to save time and build nice stuff at home using computer controlled machines (CNC). In addition to describing my design process, I’ve also included the helpful references I found along the way.

Our old thermostat was too small for our wall. I could have replaced the drywall, but I needed a starter project to help me understand 3D CNC work with wood. Replacing the drywall would have taken a good bit of time because of the lath behind the old thermostat. The bracket took a long time because I had to learn wood CNC and spent way too long finishing the surface. In the end, this was a great use of a wood CNC machine. It would have been difficult to get the corners right and route out the pocket inside. Additionally, I could prototype and correct parts of my design with the rapid iteration that CNC machines provided.

We have a programmable thermostat with z-wave, the 2gig CT100 Z-Wave Touch Screen Programmable Thermostat. It works perfectly and is easy to control with our Mi Casa Verde VeraLite Home Controller. This gives us the ability to set our temperature from our phones or do nest-like things like learn our patterns and adjust temperature. We can also set up multiple thermostats to regulate temperature throughout the different regions of our house.

In case you are working with the CT100 or the VeraLite, you might find the following links helpful:

Design

I designed the bracket in Fusion 360. I’m still figuring out how to use Fusion, but it is a computer-aided design application for creating 3D digital prototypes including design, visualization and simulation. Fusion 360 is easy to use and it provides the ability to go from design, render, analysis and production in one tool. Most important, it is free for hobbyists.

The design was pretty straightforward. It is a one-inch offset with fillets that match the radius of the CT100. One problem with CNC routing is that I tend to design features that take advantage of what the CNC can do, and this tends to lead to more curves. I just had to get the measurements right. I shouldn't need to do this, but I used a laser cutter to cut the frame out of a piece of cardboard to check the fit. I'm glad I did, because I hadn't accounted for some of the curves and the opening was too small. In general, I love using the laser cutter to prototype designs. The prototype let me see how the final design would look on the wall. This would have been helpful to test different designs. Chrissy and I tend to like 18th-century English and 19th-century neoclassic millwork, but I didn't put too much thought into this design, partly because I could change it so easily.

Here is the final, dimensioned, design:

(Screenshot: the final dimensioned design)

Construction

I found a piece of scrap plywood at TechShop that I cut on the ShopBot buddy.

ShopBot Buddy

To cut the workpiece I used the 1/4″ High Speed Steel Two Flute Downcut bit. You can see the purchase page here. As this was my first cut, I had to understand the definitions and the different cutter parameters to build the tool in Fusion.

For the High Speed Steel Two Flute Downcut, the parameters are:

  • CED: 1/4
  • CEL: 1
  • SHK: 1/4
  • OAL: 3

Here are some terms that helped me:

CED: CED is short for cutting edge diameter, the width of the cut the tool makes through the workpiece. CED has a tolerance in thousandths of an inch, or .xxx decimal places.

CEL: CEL is short for cutting edge length, the maximum thickness of material the tool can cut. CEL has a tolerance in hundredths of an inch, or .xx decimal places.

SHK: SHK is short for shank diameter, the nominal size of the shank, which should match the collet size of the spindle the tool will be used in. SHK has a tolerance in the ten-thousandths of an inch, or .xxxx decimal places.

OAL: OAL is short for overall length, the total nominal length of the tool from end to end. OAL has a tolerance in hundredths of an inch, or .xx decimal places.

HSS: High Speed Steel, typical applications in Non-Abrasive Plastic, Solid Wood & Aluminum where keen edges perform best. High Speed Steel tools perform well in hand routing applications where a tough core is necessary to prevent tool breakage.

Carbide Tipped: Used for a variety of applications in composite woods, hardwoods, abrasive plastics and composites plastics to hard aluminum. Limited by geometry in some applications due to the brazed nature of the tool. Carbide Tipped tools work well in hand routing applications due to the tough HSS core and hard carbide cutting edges.

Solid Carbide: Typically used for the widest variety of applications, including man-made board, solid wood, abrasive plastics, and some aluminums. Solid carbide does not deflect, allowing the tool to be fed at higher feed rates than PCD or straight insert cutters, decreasing cycle times. Solid tools also have a major edge-keenness advantage once thought only possible in HSS.

Chipload: Chipload is simply the thickness of a chip formed during the machining of a material. Chipload is critical because if the chip is the proper size, the chip will carry away the heat, promoting long tool life. If the chip is too small, the heat is transferred to the cutting tool, causing premature dulling. Too high a chipload will cause an unsatisfactory edge finish or part movement.

The most important reason to understand cutter parameters, is to set the correct feed rates, which is a combination of rpm and cutting speed. In order to get this right, I consulted this reference from ShopBot and read up on end mills in general at makezine. I also was able to incorporate some information from destiny tool that was helpful to verify my settings.

These links also helped:
* hardwood cutting data from Onsrud
* A great video tutorial from ShopBot

After understanding endmills, I had to get everything in the right format for the machine. I found the open sbp reference to be very helpful and the command reference also taught me how to understand the resultant g-code.

I summarized my research below:

Name: 1/4″ Downcut Carbide End Mill
SB#: 13507
Onsrud Series: 57-910
Max Cut: 1 x D
Chip Load per leading edge: 0.005-0.007
Flutes: 2
Feed rate (ips): 3.0-4.2
RPM: 18,000

You can see the final product here:

(Photo: the final product)

CY2014 Quarter 1 Financial Review

Chrissy and I review our spending on a quarterly basis. Updating every 90 days isn't too long to correct mistakes and remember purchases, but it also allows for the busy multi-week sprints that life presents us. While we have used every financial management program available, I've found the most straightforward and flexible solution is to download historical transactions into Excel where I can assign categories and do the type of analysis you can see below. This works for me because I have complete control. All the other solutions I used (MS Money, Quicken, Mint, GnuCash) introduce errors that have required lots of time to fix (or that can't be fixed), but more importantly they constrain me to their interface, and I got used to exporting information into tools that could flexibly answer my questions.

My basic workflow is to download statements from all our bank accounts and credit cards and put them all into one spreadsheet, where I ensure a consistent list of categories. I can do this quickly by filtering and sorting, as most of our expenses are cyclical. Once everything is in the right format, I use lots of Excel SUMIF and SUMIFS functions to produce reports.
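For what it's worth, the reporting step is also a few lines of pandas if I ever want to script it (a sketch; the file and column names are mine):

import pandas as pd

# Quarterly totals by category from one combined transaction sheet
tx = pd.read_csv("all_transactions.csv", parse_dates=["Date"])

report = (tx.groupby([tx["Date"].dt.to_period("Q"), "Category"])["Amount"]
            .sum()
            .unstack(fill_value=0))
print(report)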

My financial review is intended to accomplish the following:

  • Quality check (Are we getting paid the right amounts? Any incorrect expenses?)
  • Spending feedback (Are we overpaying in any categories? Anything we need to rein in?)
  • Tax Production

While my tax production and quality check were very helpful to me, I wanted to share the results of the spend analysis in case my reports might be useful to others.

Spending feedback

In summary, we had a small rise in our overall Grocery and Dining out categories, but the major cost drivers were:

  • Ellie’s 12 cavities were very expensive (no dental insurance)
  • We bought a new espresso machine (major purchase for us)
  • We bought a new car
  • We went crazy on clothes
  • Committed (again) to Army Navy Country Club

(Chart: spending by category)

Where are we spending?

This doesn’t have a real effect on our spending, but I thought this was interesting. We don’t have saving/investments in here, this is just “spending”. I treated stuff like insurance, taxes, medical, fees, haircuts, etc as “cost of life” — things I feel we can’t avoid and don’t really have discretion in spending. Some other stuff that might fit this category (power bill) gets lumped into household (as does home maintenance and mortgage). I would love to do some more analysis and compare our spending to this article.

(Pie chart: share of spending by category)

Daily Feedback

The plot below has categories on the Y-axis and days on the X-axis. Intensity of color is the spend amount. I used MATLAB to produce this plot. I like it because the colormap filters everything in a way that comes out like a log scale, and that tells me what is a big deal and what is noise. The interesting dynamic is the frequency/magnitude trade that happens with spending: medical comes in seldom/big chunks while grocery expenses are a constant but smaller expense.
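The same plot is easy to reproduce in Python (a sketch with random data standing in for the real transactions):

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LogNorm

# Daily-spending heat map: categories on Y, days on X, color = dollars spent
rng = np.random.default_rng(0)
categories = ["Grocery", "Dining", "Medical", "Household", "Clothes"]
spend = rng.exponential(scale=20.0, size=(len(categories), 90))

fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(spend + 1, aspect="auto", norm=LogNorm())  # log scale separates big deals from noise
ax.set_yticks(range(len(categories)))
ax.set_yticklabels(categories)
ax.set_xlabel("Day of quarter")
fig.colorbar(im, label="Spend ($)")
plt.show()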

(Heat map: daily spending by category)

You can see that our daily spending has a huge variance: the standard deviation was twice our average spending, so big purchases had a pronounced effect. In the bar chart below I explore four levels of spending: totally discretionary (dining out), some and limited discretion (haircuts, medical), and non-discretionary (mortgage, tax) at the bottom.

(Bar chart: daily spending by level of discretion)

Weekly Feedback

(Chart: weekly spending)

So how much can we control this?

If I break down spending into four categories:

  • Committed — We have to pay it (i.e. Mortgage)
  • Limited Discretion — We can commit extra time to reduce it (i.e. Home and Car Maintenance)
  • Some Discretion — We can make choices to decrease our quality of purchase (i.e. Groceries)
  • Total Discretion — We can do without this if we have to (i.e. Dining Out/New Clothes)

It turns out that a third of our expenses are committed, while about a quarter each apply to limited and some discretion. Roughly 20% of our expenses are totally discretionary, and 70% of our expenses could be changed if we had to. The takeaway for me is to focus on eliminating the stuff we pay for but don't enjoy (fees) and the things that don't bring joy/reward for their cost.

Excel Sorting and Grouping

I had two tables downloaded from Amazon:

Items

Order Date  Order ID    Title   Category
1/26/14 102-4214073-2201835     Everyday Paleo Family Cookbook
1/13/14 115-8766132-0234619     Awesome Book A
1/13/14 115-8766132-0234619     Awesome Book B

and

Orders

Order Date  Order ID    Subtotal
1/6/14  102-6956821-1091413 $43.20
1/13/14 115-8766130-0234619 $19.42
1/16/14 109-8688911-2954602 $25.86

I’m building our Q1 2014 taxes and needed rows in the following format:

1/13/14 115-8766132-0234619 $22.43 Awesome Book A, Awesome Book B

In order to do this without using SQL, I did the following. If column B corresponds to Order ID and column C corresponds to the item Title, then I put the following formula in cell N3:

=IF(B3=B2, N2 & " | " & C3, C3)

and in cell O3, in a column which might be named "last before change?":

=IF(B3=B4, "", TRUE)

Then I could easily sort out the unwanted values. Done. Still, I would like to better automate this. Any thoughts appreciated.
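Since I asked: the same grouping is a few lines of pandas (a sketch; the file and column names mirror the tables above):

import pandas as pd

# Group item titles per order, then join with the order subtotals
items = pd.read_csv("items.csv")    # Order Date, Order ID, Title, Category
orders = pd.read_csv("orders.csv")  # Order Date, Order ID, Subtotal

titles = (items.groupby("Order ID")["Title"]
               .apply(", ".join)
               .reset_index())

merged = orders.merge(titles, on="Order ID", how="left")
merged.to_csv("orders_with_titles.csv", index=False)
print(merged.head())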