Autoscoring with Matlab

While I qualified a long time ago on the M9, I never really learned to be a good shot. Now, I’m trying to learn to shoot well and wanted to automatically score my targets, keep the data, and get better.

There are apps to do this, but none of them did what I wanted. One app by Thomas Gabrowski and Justa Mili works off a photo of the target to automatically calculate the score. They also have the capability to analyze shooting groups with Windage, Elevation, Mean Radius and Extreme Spread. They have capabilities to keep track of your previous shooting sessions and monitor progress. The App costs $17.

Developing my own costs my time, but gives me flexibility to work with my targets and my system. It’s also a thing I do: grab challenges that will teach me something. It’s served me well and Matlab makes this all accessible and fun. The other thing is that apps never work exactly right. What if I want the raw data so I can calculate if I’m aiming high or low over time? All this code is on github at

I told my children that two key skills every digital citizen needs are the ability to process text and images. By processing text, I’m able to tell any story from any reports in digital form. This is often bank statements, hotel stays, fitness stuff or uber rides. By processing images, I’m able to understand and report on things that are happening around me.

In looking around, I found this thesis filled with some good ideas. I reached out to the author and discussed the merits of edge detection vs template matching. He didn’t have his code available. There were several papers but none were really that helpful. It was easier to start building than to spend a lot of time reading other’s approaches.

I knew there would be three steps to this: (1) registering all images to the standard, fixed, image for consistent distance, (2) finding the bullet holes/center and (3) measuring the distances from the center each hole.

Image Registration

This was harder than I thought since most registration is for two similar images. I was used to the ease of Photoshop for rapid registration. It turns out it is a hard problem to register images of different pictures of what are really different scenes, even though the structure is common. Most image registration problems are pictures of the same scene that have been taken at different angles or distances. The picture below makes this clear:

Reference and Real Image

I found two approaches that worked for image registration. The first approach was to extract the red circle and then make the circles match. Here I had to calculate and align the centers, and rescale one image to the size of the other. Color thresholding and imfindcircle were quite useful.

For the more general case, I had to use fitgeotrans which takes the pairs of control points, movingPoints and fixedPoints, and uses them to infer the geometric transformation. It does this by taking the pairs of control points, movingPoints and fixedPoints, and uses them to infer the geometric transformation. After doing this I had a set of images that were all the same size, and all in the same orientation — with bullet holes.

Registered Images

Finding the bullet holes

I was able to use this matlab post to learn that I could sample some colors in photoshop, convert the image to HSV and find shades of gray using some code from Theodoros Giannakopoulos.

The next thing I had to do was create the ability to find the center. I did this by recognizing that the center X is red and pretty distinctive — ideal for template matching using normalized cross-correlation matlab has a great description of how this works here. With this accomplished, I can find the center in a few lines, by going off this template:


All together, I’m able to compute the measurements to make a picture like this (note the green circle in the middle on the X):


With the image registered, the center defined and all holes discovered, I could easily calculate a score of a mean distance to the bullseye.


The problem was that I couldn’t get good consistency. The shadows were a problem on some images, on others, shots very close to one another caused confusion. It turned out that I was really good at quickly seeing the holes, better than a template matching problem. Note that when I saved the image, I updated a xls file and saved the scores as EXIF data so the image had the exact locations of the holes that I could pull out later if needed. The code below works awesome and is ideal for my solution. Best of all, I learned a lot about how to manipulate and extract data from images.


So, is my shooting getting better? Not yet. In the plot below you can see my score is increasing, and the stDev of my shots is increasing as well. Now, the data aren’t uniform since I had the target at 5m and now have it at 6.5m on Oct 8. Sept 12 was with a suppressed 22 at 5m. Oct 8 was 9mm. Anyway, it’s better to know from data than to be guessing. I’m chalking this up to an improved technique that is taking some time to adjust to.

Common Mathmatical Libraries: Computation, History, and Trust

A common complaint among engineers is that modern computers obscure the details of the complicated math they produce. We cringe when children are exalted as “knowing computers” from a young age. Bill Gosper and Richard Greenblatt had perhaps the maximum benefit to “learn computers”. They didn’t have wolfram alpha or even spreadsheets, they had to understand arithmetic logic units and figure out how to optimize code to get basic math operations to run on the TRS-80. Today, there is a desire to re-learn hardware through arduino or other low-level hardware, but most folks start very high on the stack. That is a shame, because under the hood is a rich and exciting legacy of numerical computing software that forms the basis of the trust we place in computers.

My trust in computers started to erode when Mark Abramson gave a lecture on the possible errors of floating point arithmetic. All programmers are aware of the problems possible by a division by zero or a narrowing conversion that loses information. However, other times the cause can be the futile attempt of a software developer to round a floating-point number. This was shocking: one of the most basic operations in math, a thing that we learn to do before we can ride a bike, eludes the combined efforts of the finest engineers over the last 30 years. To me, this was something intuitively nonsensical: if computers can’t consistently and reliability round numbers, how can we trust them to do anything?

This turns out to be one of those things like the halting problem that proves it is impossible for a computer to understand computer code. Floating points are represented internally in the IEEE-754 specification. Just like the impossible realization of a random number1, there is no such thing as a true fractional number in a computer. The key thing is to recognize that floating types do not represent numbers exactly. Internally the value is not a continuous range of numbers; instead it is represented as an exponent multiplied by an arithmetical series. For example:

\frac{1}{2}^1 + \frac{1}{2}^2 + \frac{1}{2}^3 + \ldots + \frac{1}{2}^{30}

From the above, you can see that in any range of floating point numbers there are gaps. Between any two floating point numbers there is a difference at the granularity of the smallest element in the arithmetical series (here $\frac{1}{2}^{30}$). If a number falls in such a gap, the difference between the real number is the approximation error. This leads to a fascinating and esoteric subject, that is best covered by others.

In grad school, my faith in computers only decreased when I realized the challenges associated with Eigenvalue problems, Singular value decomposition and LU/Cholesky/QR/… decompositions. If you want to understand how these are handled, a rough understanding of high performance computing libraries is necessary. These libraries are important because of the issues described above. It is very important (and very hard) to have both an efficient and correct algorithm for matrix decomposition.

I spent my teens coding in BASIC, my college years in MATLAB, and professionally I’ve used lots of Julia, FORTRAN, Ruby, Python, C, Haskell, and x86 or MIPS assembly. As I’ve used different languages, I kept coming back to familiar libraries that every language used with strange names like LAPACK, BLAS, FFTW, and ATLAS.

MATLAB will always be the language that I call home. It is probably the only language I was paid at work to write production code with. In building and running large models, I had to get into the internals of MATLAB optimization and learned that MATLAB was basically a wrapper around excellent matrix optimization libraries. As Cleve Moler writes, the basic motivation for MATLAB was to make LINPACK, EISPACK easy to use. Finally, on a New Year’s holiday, I had the chance to pull together a quick summary of the main math libraries I’ve come across: LAPACK, BLAS, ATLAS and FFTW.

LAPACK – Linear Algebra Package

LAPACK is the most common library for numerical linear algebra. I’ve used its routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. It also includes routines to implement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition. LAPACK is written in FORTRAN and is powerful and popular since it handles both real and complex matrices in both single and double precision (subject to the caveats above).

LAPACK is the successor of EISPACK and LINPACK, both libraries for linear algebra algorithms, Developed in the early 70s (credits to Jack Dongarra, Jim Bunch, Cleve Moler, Pete Stewart). LINPACK is still used as benchmark for the most powerful supercomputers. The EISPACK and LINPACK software libraries were designed for early supercomputers (think the CDC-7600, Cyber 205, and Cray-1). These machines featured multiple functional units pipelined for increased performance. The CDC-7600 was basically a high-performance scalar computer, while the Cyber 205 and Cray-1 were early vector computers.2 One drawback to early “vector-based” architectures is the absence of optimized locality in data access. Consequently, EISPACK and LINPACK were subject to low performance on computers with deep memory hierarchies which became popular in the late 80s.

LAPACK was designed to specifically reimplement the algorithms as “block-based” to incorporate locality by Jim Demmel, Jack Dongarra and many others. (See a great history here.) Many computers have cache memory that is much faster than main memory; keeping matrix manipulations localized allows better usage of the cache. LAPACK was able to use this knowledge to run efficiently on shared memory and vector supercomputers. More recently, the ScaLAPACK software library extends the use of LAPACK to distributed memory concurrent supercomputers and all routines have been redesigned for distributed memory MIMD (multiple instruction, multiple data) parallel computers.

BLAS – Basic Linear Algebra Subroutines

LAPACK implemented on top of BLAS, a collection of low-level matrix and vector arithmetic operations. BLAS performs basic operations such as “multiply a vector by a scalar”, “multiply two matrices and add to a third matrix”. As a programmer, this is the type of thing I would definitely either get wrong or implement inefficiently. By contrast, LAPACK is a collection of higher-level linear algebra operations. LAPACK has routines for matrix factorizations (LU, LLt, QR, SVD, Schur, etc) that are used to answer more complex questions such as “find the eigenvalues of a matrix”, or potentially expensive operations like “find the singular values of a matrix”, or “solve a linear system”. LAPACK is built on top of the BLAS; many users of LAPACK only use the LAPACK interfaces and never need to be aware of the BLAS at all. LAPACK is generally compiled separately from the BLAS, and can use whatever highly-optimized BLAS implementation you have available.

In this sense, Basic Linear Algebra Subprograms (BLAS) is more of a specification than a specific software package. It prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. Due to the academic focus in this area, they are the de facto standard for low-level routines for linear algebra libraries. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. Modern BLAS implementations take advantage of special floating point hardware such as vector registers or Single instruction, multiple data instructions.

BLAS is often divided into three main areas:

  • BLAS1: vector-vector operations (e.g., vector sum)
  • BLAS2: matrix-vector operations (e.g., matrix-vector product)
  • BLAS3: matrix-matrix operations (mainly matrix-matrix product)

ATLAS — Automatically Tuned Linear Algebra Software

ATLAS is a modern attempt to make a BLAS implementation with higher performance and more portability. As excellent as BLAS is, it has to be specifically compiled and optimized for different hardware. ATLAS is a portable and reasonably good implementation of the BLAS interfaces, that also implements a few of the most commonly used LAPACK operations. ATLAS defines many BLAS operations in terms of some core routines and then tries to automatically tailor the core routines to have good performance. For example, it performs a search to choose good block sizes that may depend on the computer’s cache size and architecture. It also accounts for differing implementations by providing tests to see if copying arrays and vectors improves performance.3

FFTW — Fastest Fourier Transform in the West (FFTW)

For an example of a much more specific library, FFTW is a software library for computing discrete Fourier transforms (DFTs).4 According to by regular benchmarks, FFTW is known as the fastest free software implementation of the Fast Fourier transform. It can compute transforms of real and complex-valued arrays of arbitrary size and dimension in $O(n \log n)$ time.

The magic of FFTW is to choose the optimal algorithm from a wide array of options. It works best on arrays of sizes with small prime factors, with powers of two being optimal and large primes being worst case (but still $O(n \log n)$). For example, to decompose transforms of composite sizes into smaller transforms, it chooses among several variants of the Cooley–Tukey FFT algorithm, while for prime sizes it uses either Rader’s or Bluestein’s FFT algorithm. Once the transform has been broken up into subtransforms of sufficiently small sizes, FFTW uses hard-coded unrolled FFTs for these small sizes that were produced by code generation. It goes meta.

For more detail from some real experts, check out Numerical Analysis: Historical Developments in the 20th Century by Authors C. Brezinski, L. Wuytack

  1. It is impossible to get a truly random number from a computer. Even if the sequence never repeats, which is not guaranteed for random numbers, another run of the program with the same inputs will produce the same results. So, someone else can reproduce your random numbers at a later time, which means it wasn’t really random at all. This causes all kinds of problems, but the opposite case would cause many more. 
  2. In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, compared to scalar processors, whose instructions operate on single data items. 
  3. For example, it may be advantageous to copy arguments so that they are cache-line aligned so user-supplied routines can use SIMD instructions. Today, most commodity CPUs implement architectures that feature instructions for a form of vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). 
  4. Yay! FFTW was developed by Matteo Frigo and Steven G. Johnson at the Massachusetts Institute of Technology. 

The Hierarchical Dirichlet Process Hidden Semi-Markov Model

In my work at DARPA, I’ve been exposed to hidden Markov models in applications as diverse as temporal pattern recognition such as speech, handwriting, gesture recognition, musical score following, and bioinformatics. My background is in stochastic modeling and optimization, and hidden Markov models are a fascinating intersection between my background and my more recent work with machine learning. Recently, I’ve come across a new twist on the Markov model: the Hierarchical Dirichlet Process Hidden Markov Model.

What is a Markov model?

Say in DC, we have three types of weather: (1) sunny, (2) rainy and (3) foggy. Lets assume for the moment that the doesnt change from rainy to sunny in the middle of the day. Weather prediction is all about trying to guess what the weather will be like tomorrow based on a history of observations of weather. If we assume the days preceding today will give you a good weather prediction for today we need the probability for each state change:

$$ P(w_n | w_{n-1}, w_{n-2},\ldots, w_1) $$

So, if the last three days were sunny, sunny, foggy, we know that the probability that tomorrow would be rainy is given by:

$$ P(w_4 = \text{rainy}| w_3 = \text{foggy}, w_2 = \text{sunny}, w_1 = \text{sunny}) $$

This all works very well, but the state space grows very quickly. Just based on the above, we would need $3^4$ histories. So fix this we make the Markov Assumption that everything really depends on the previous state alone, or:

$$ P(w_n | w_{n-1}, w_{n-2},\ldots, w_1) \approx P(w_n| w_{n-1}) $$

which allows us to calculate the joint probability of weather in one day given we know the weather of the previous day:

$$ P(w_1, \ldots, w_n) = \prod_{i=1}^n P(w_i| w_{i-1})$$

and now we only have nine numbers to characterize statistically.

What is a hidden Markov model?

In keeping with the example above, suppose you were locked in a room and asked about the weather outside and the only evidence you have is that that ceiling drips or not from the rain outside. We are still in the same world with the same assumptions and the probability of each state is still given by:

$$ P(w_1, \ldots, w_n) = \prod_{i=1}^n P(w_i| w_{i-1})$$

but we have to factor that the actual weather is hidden from you. We can do that using Bayes’ rule where $u_i$ is true if the ceiling drips on day $i$ and false otherwise:

$$P(w_1, \ldots, w_n)| u_1,\ldots,u_n)=\frac{P(u_1,\ldots,u_n | w_1, \ldots, w_n))}{P(u_1,\ldots,u_n)}$$

Here the probability $P(u_1,\ldots,u_n)$ is the prior probability of seeing a particular sequence of ceiling leak events ${True,False,True}$. With this, you can answer questions like:

Suppose the day you were locked in it was sunny. The next day the ceiling leaked. Assuming that the prior probability of the caretaker carrying an umbrella on any day is 0.5, what is the probability that the second day was rainy?

So if Markov models consider states that are directly visible to the observer, the state transition probabilities are the only parameters. By contrast, in hidden Markov models (HMMs) the state is not directly visible, but the output, dependent on the state, is visible. A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (or hidden) states. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. In this context, ‘hidden’ refers to the state sequence through which the model passes, not to the parameters of the model; the model is still referred to as a ‘hidden’ Markov model even if these parameters are known exactly.

OK, so what is a Hierarchical Dirichlet Process Hidden Semi-Markov Model?

Hidden Markov models are generative models where the joint distribution of observations and hidden states, or equivalently both the prior distribution of hidden states (the transition probabilities) and conditional distribution of observations given states (the emission probabilities) are modeled. Instead of implicitly assuming a uniform prior distribution over the transition probabilities, it is also possible to create hidden Markov models with other types of prior distributions. An obvious candidate, given the categorical distribution of the transition probabilities, is the Dirichlet distribution, which is the conjugate prior distribution of the categorical distribution.

In fact, it is possible to use a Dirichlet process in place of a Dirichlet distribution. This type of model allows for an unknown and potentially infinite number of states. It is common to use a two-level Dirichlet process, similar to the previously described model with two levels of Dirichlet distributions. Such a model is called a hierarchical Dirichlet process hidden Markov model, or HDP-HMM for short or it is also called the “Infinite Hidden Markov Model”.

The Hierarchical Dirichlet Process Hidden Markov Model (HDP-HMM) is a natural Bayesian nonparametric extension of the traditional HMM. The single parameter of this distribution (termed the concentration parameter) controls the relative density or sparseness of the resulting transition matrix. By using the theory of Dirichlet processes it is possible to integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence.

This is really cool. If you formulate a HMMs with a countably infinite number of hidden states,
you would have infinitely many parameters in the state transition matrix. The key idea is that the theory of Dirichlet processes can implicitly integrate out all but the three parameters which define the prior over transition dynamics. It is also possible to use a two-level prior Dirichlet distribution, in which one Dirichlet distribution (the upper distribution) governs the parameters of another Dirichlet distribution (the lower distribution), which in turn governs the transition probabilities. The upper distribution governs the overall distribution of states, determining how likely each state is to occur; its concentration parameter determines the density or sparseness of states. Such a two-level prior distribution, where both concentration parameters are set to produce sparse distributions, might be useful for example in unsupervised part-of-speech tagging, where some parts of speech occur much more commonly than others; learning algorithms that assume a uniform prior distribution generally perform poorly on this task. The parameters of models of this sort, with non-uniform prior distributions, can be learned using Gibbs sampling or extended versions of the expectation-maximization algorithm.

So how can we use this?

A common problem in speech recognition is segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. This problem is often called speaker diarization. This is particularly challenging since you don’t know the number of people participating in the meeting and modified HDP-HMMs have been very effective at achieving state-of-the-art speaker diarization results.

Other interesting applications of HDP-HMMs have been modeling otherwise intractable linear dynamical systems which describing dynamical phenomena as diverse as human motion, financial time-series, maneuvering targets, and the dance of honey bees. (See this paper for more details. Results have shown that HDP-HMM can identify periods of higher volatility in the daily returns on the IBOVESPA stock index (Sao Paulo Stock Exchange). Most interesting to me was the application to using HDP-HMMs on a set of six dancing honey bee sequences aiming to segment the sequences into distinct dances.

You can see some other cool motion capture examples here.

One Time Pad Cryptography

This was much harder than it should have been. While this is the certainly the most trivial post on crypto-math on the webs, I wanted to share my MATLAB xor code in the hope that I save someone else’s time. It is a basic property of cryptography that a one time pad must be used only once. A example like this makes it very concrete:

Suppose you are told that the one time pad encryption of the message “attack at dawn” is 09e1c5f70a65ac519458e7e53f36 (the plaintext letters are encoded as 8-bit ASCII and the givenciphertext is written in hex). What would be the one time pad encryption of the message “attack at dusk” under the same OTP key?

Let $m_0$ be the message “attack at dawn”, then $m_1$ is the message “attack at dusk”, and $c_1$, $c_2$ the corresponding ciphertexts. If $p$ is the one-time pad (OTP) that encrypts the message, we then have:

$$ c_0 = m_0 \oplus p$$

So we can obtain the one-time pad by performing an XOR of the ciphertext with the plaintext:

$$ p = c_0 \oplus m_0 \text{.}$$

This enables us to encrypt the new message without using the OTP explicitly:

$$c_1 = m_1 \oplus p = m_1 \oplus \left(c_0 \oplus m_0 \right) = c_0 \oplus (m_0 \oplus m_1) \text{.}$$

You could truncate down to the only characters that are different, but since I’m writing a script for this, I didn’t bother.

In python this would be super short:

def strxor(s1,s2):
return ''.join(chr(ord(a) ^ ord(b)) for a,b in zip(s1,s2))

strxor(strxor("6c73d5240a948c86981bc294814d".decode('hex'), "attack at dawn"), "attack at dusk").encode('hex')


But, the government won’t allow me to have python on my laptop and I need to use matlab.

Some Helpful Links

  • My course slides:
  • Another Solution:
  • Some good general info:

Gradient Descent

It has been awhile since I’ve studied optimization, but gradient descent is always good to brush up on. Most optimization involves derivatives. Often known as the method of steepest descent, gradient descent works by taking steps proportional to the negative of the gradient of the function at the current point.

Mathematically, gradient descent is defined by an algorithm for two variables, say as repeated updates of:

$$ \theta_j := \theta_j – \alpha \frac{\partial J (\theta_0 , \theta_1) }{\partial \theta_j } $$

from the hypothesis:

$$ \begin{aligned}
h_\theta(x) = \sum_{i=1}^n \theta_i x_i = \theta^{T}x.
\end{aligned} $$

The is the learning rate. If is very large, we will take some huge steps downhill, small would mean baby steps downhill. If is too small, it might take too long to get our answer. If alpha is too large, we might step past our solution and never converge. Besides the pitfall of picking an , you have to have a cost function with existing derivatives (continuous) and a convex function.

Often for linear regression, we use batch gradient descent. In machine learning terminology, batch gradient descent uses all training examples. This means that for every step of gradient descent, we compare all residuals in the final least squares calculation.

Why is this relevant to machine learning? Of course there is always an analytical solution to any model, but this might involve a very large matrix inversion and be computationally infeasible.

$$ \theta = (X^T X)^{-1} X^T y $$

Gradient descent gets used because it is a numerical method that generally works and is easy to implement and a is very generic optimization technique. Also, analytical solutions are strongly connected to the model, so implementing them can be inefficient if you plan to generalize/change your models in the future. They are sometimes less efficient then their numerical approximations, and sometimes there are simply harder to implement.

To sum up, gradient descent is preferable over an analytical solution if:

  • you are considering changes in the model, generalizations, adding some more complex terms/regularization/modifications
  • you need a generic method because you do not know much about the future of the code and the model
  • analytical solution is more expensive computationaly, and you need efficiency
  • analytical solution requires more memory or processing power/time than you have or want to use
  • analytical solution is hard to implement and you need easy simple code

FlexEvents: WOD 3 Simulation

During a recent CrossFit competition, flex on the mall, we had to accomplish a team chipper in which we were able to pick the order of team members in order to minimize our workout time. The workout was simple: 50 burpees, 40 over the box jumps, and 30 kettle-bell snatches. Each exercise had to be accomplished in serial: no one could start an exercise until their previous team member had finished. This meant everyone waited while the first person started. As simple as this was, I got confused when figuring the optimal order for team members: should the slowest go first or last?

At the time, I thought the best strategy would be to have the fastest go first, to prevent the scenario where the fast folks were waiting and unable to help the team. I was focused on the idea that you didn’t want anyone to be waiting — so clear out the fast people first. My partners wanted the slowest to go first because the slower participants could rest, go super slow and not affect the final score.

While I was wrong, and they were correct, this didn’t make sense at the time, because I was viewing the whole thing as a linear operation where order didn’t matter, but waiting on someone would definitely slow down the overall time. It turns out that if you put the slowest last, no one is waiting, but the clock is rising totally based upon the slowest time, when you otherwise could have obscured their slow time. The workout came down to the following critical path: the time it took to do everyone’s burpees plus the last participant’s time.

However, this is only true if the time it took to do burpees was significantly more than the other events and there was not a significant difference in fitness between team members. After the competition, I wanted to understand the dynamics of this workout and build a quick model to understand where these assumptions held true and hone my intuition for stuff like this.

It turns out the worst thing to do is have the slowest person go last. The reason why is really simple: you are putting them in the critical path. In fact, an optimal strategy is to have the slowest always go first. If I assume all four members have different speeds, and made an assumption like this based on expected values of their workout times. Let’s assume four members have the following expected completion times (in notional units), where person 1 was the fastest and each successive participant was slower in all events.

Note: These are totally made up numbers and have nothing to do with our team . . . says the man whose wife now routinely kicks his scores to the curb.

person 1 2 3 4
burpees 9 10 15 17
box jumps 7 8 13 15
KB snatch 5 7 11 13


In this case, I wrote some Matlab to look through all (4! = 24) permutations. (Remember: 1 = fastest, 4 is slowest.)

CrossFit Order

This is such a simple event that you can see the basic building blocks without much introspection: 21 is the fastest time to complete the sequence and 45 is the slowest. If the team were all comprised of person 1 they could complete the whole thing in 48. As this fictional team is, burpees alone will account for 51 regardless of order, but if the fastest goes at the end, you can get away with only adding 15 on to the total time, versus adding 28 if the slowest person goes last.

Several things here. It is so simple to work the math of this with variables instead it probably should be done. A better analysis could show you how much variation you could tolerate between athletes before the problem dynamics above don’t apply. Maybe another lunch break for that.

OK, on some more thought, I thought I would look at several scenarios and show optimal strategies.

A team member is much slower on burpees, but faster on the other events.

In this case, it still makes sense for her to go last. The burpees are linear, but speed in the bottom two events determines the order. It helps me to assign a color to each team member and see all 24 strategies on one plot. The lightest shade is member one who has a burpee time of 40 compared to 4 for the others, but is faster in the other two events than the others. In the plot below, the fastest combinations are at the top where you can see that all the fastest outcomes have member one going last.

burpees slower on 1

Burpee times equal KB snatch times

So if their event times look like this where everyone’s KB snatch times equal their burpee times.

burpee times:    1   2   3   4
box jump times:  2   2   2   2
KB snatch times: 1   2   3   4

Then all outcomes are equal and you get a truly random pattern where all outcomes are equal regardless of order. So in this case it doesn’t matter what order you do the workout in, even though 4 is much slower than 1. Interesting and counter-intuitive.


My code is below for any of those interested.

Fun with Bessel Functions

Well, I certainly forget things faster than I learn them. Today is a quick review of Bessel functions and their applications to signal processing.

The Bessel functions appear in lots of situations (think wave propagation and static potentials), particularly those that involve cylindrical symmetry. While special types of what would later be known as Bessel functions were studied by Euler, Lagrange, and the Bernoullis, the Bessel functions were first used by F. W. Bessel to describe three body motion, with the Bessel functions appearing in the series expansion on planetary perturbation

First, I think they should be called Bernoulli-Bessel functions both because that sounds more pompous and because they were discovered by Daniel Bernoulli and generalized by Friedrich Bessel. While they sound (and can be) complicated, they are the canonical solutions of Bessel’s differential equation:

$$ x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 – \alpha^2)y = 0 $$

for an arbitrary complex number (where denotes the order of the Bessel function). The most important cases are for as an integer or half-integer. Since all math ties together, I find it pretty cool that Bessel functions appear in the solution to Laplace’s equation in cylindrical coordinates.

Although and − produce the same differential equation that a real does, it is conventional to define different Bessel functions for these two values in such a way that the Bessel functions are mostly smooth functions of .

Bessel functions of the first kind:

Bessel functions of the first kind, known as , are solutions of Bessel’s differential equation that are finite at the origin () for integer or positive , and diverge as x approaches zero for negative non-integer . It is possible to define the function by its Taylor series expansion around .

$$ J_\alpha(x) = \sum_{m=0}^\infty \frac{(-1)^m}{m! \, \Gamma(m+\alpha+1)} {\left(\frac{x}{2}\right)}^{2m+\alpha} $$

where is the gamma function, a shifted generalization of the factorial function to non-integer values.

Bessel functions of the second kind :

The Bessel functions of the second kind, denoted by are solutions of the Bessel differential equation that have a singularity at the origin.

For non-integer , is related to by:

$$ Y_\alpha(x) = \frac{J_\alpha(x) \cos(\alpha\pi) – J_{-\alpha}(x)}{\sin(\alpha\pi)} $$

In the case of integer order n, the function is defined by taking the limit as a non-integer tends to :

$$ Y_n(x) = \lim_{\alpha \to n} Y_\alpha(x). $$

There is also a corresponding integral formula (for Re(x) > 0),

$$ Y_n(x) =\frac{1}{\pi} \int_0^\pi \sin(x \sin\theta – n\theta) \, d\theta – \frac{1}{\pi} \int_0^\infty \left[ e^{n t} + (-1)^n e^{-n t} \right] e^{-x \sinh t} \, dt.$$

is necessary as the second linearly independent solution of the Bessel’s equation when is an integer. But can be considered as a ‘natural’ partner of .

When is an integer, moreover, as was similarly the case for the functions of the first kind, the following relationship is valid:

$$ Y_{-n}(x) = (-1)^n Y_n(x).\,$$

Both and are holomorphic functions of x on the complex plane cut along the negative real axis. When is an integer, the Bessel functions are entire functions of x. If x is held fixed, then the Bessel functions are entire functions of .

Bessel Filters

In electronics and signal processing, a Bessel filter is a type of linear filter with a maximally flat group delay. The Bessel filter is used because a low pass filter is characterized by transfer function. The denominator of the Bessel filter is a reverse Bessel polynomial. Bessel filters are often used in audio crossover systems. Analog Bessel filters are characterized by almost constant group delay across the entire passband, thus preserving the wave shape of filtered signals in the passband.

A low pass active filter with a Bessel response is used when the filter needs to exhibit minimum differential delay between the various frequency components of interest contained within the input signal being filtered. In essence this means that the fundamental frequency of say an applied squarewave experiences the same input-to-output delay as the other harmonics within the filter’s pass-band. This results in a high degree of fidelity of the output signal relative to the input signal.

Sampling Exploration

I needed to review the Nyquist–Shannon sampling theorem. The coolest thing about Nyquist is that it expresses the sample-rate in terms of the function’s bandwidth and leads to a formula for the mathematically ideal interpolation algorithm.

What is sampling?

Sampling is nothing more than converting a signal into a numeric sequence. Shannon states:

If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.

This is really simple: A sufficient sample-rate is therefore samples/second, or anything larger. The two thresholds, and , are respectively called the Nyquist rate and Nyquist frequency.

I can’t understand something unless I explore with it. So, I defined the following code.

First we have to make the following definitions for a given sample-rate of :

  • $$ \scriptstyle T \stackrel{\mathrm{def}}{=}\ 1/f_s $$ represents the interval between samples.

Say we have a simple sign wave modulated at a frequency of 1/8.

Simple Sin Wave

If we explore, we can get:



When the bandlimit is too high (or there is no bandlimit), the reconstruction exhibits imperfections known as aliasing.

Let be the Fourier transform of bandlimited function :

$$ X(f) \stackrel{\mathrm{def}}{=}\ \int_{-\infty}^{\infty} x(t) \ e^{- i 2 \pi f t} \ {\rm d}t $$


$$ X(f) = 0 \quad \text{for all} |f| > B $$

The Poisson summation formula shows that the samples, x(nT), of function x(t) are sufficient to create a periodic summation of function X(f). The result is:

$$ X_s(f)\ \stackrel{\mathrm{def}}{=} \sum_{k=-\infty}^{\infty} X\left(f – k f_s\right) = \sum_{n=-\infty}^{\infty} \underbrace{T\cdot x(nT)}_{x[n]}\ e^{-i 2\pi n T f}, $$

This function is also known as the discrete-time Fourier transform.

Questions I’m after:

  • How is it that the samples of several different sine waves can be identical, when at least one of them is at a frequency above half the sample rate?

You can see my source code here:

Wind speed calculation

I want to know the impact of wind on endurance.

So we have something like this:

This gives us 8 cases of wind direction:

From this we can infer that going to the center from one of the 8 directions.

So an overall flight path would look like this:

If each grid cell is a distance (s) in size, each time a diagonal transition is made the distance would be (sqrt{2} s). As an aircraft enters each cell, they encounter the wind-star as shown above with the expected value of the wind facing one of the eight possible directions. In general, if (W) is the wind velocity in direction (alpha), from grid direction (g in [1,8]) we have to find the expected value for the wind in direction (cos(alpha_{g,w})), by integrating over all values of (w).

V_{g,w} = sum_{1..12} cos(alpha_{g,w_{i}}) dot P_i

With these bins:

h3. Wind Directions ( w_{i} )
|Wind Direction|Angle|

h3. Grid Direction (g)
|Grid Direction||