A common complaint among engineers is that modern computers obscure the details of the complicated math they produce. We cringe when children are exalted as “knowing computers” from a young age. Bill Gosper and Richard Greenblatt had perhaps the maximum benefit to “learn computers”. They didn’t have wolfram alpha or even spreadsheets, they had to understand arithmetic logic units and figure out how to optimize code to get basic math operations to run on the TRS-80. Today, there is a desire to re-learn hardware through arduino or other low-level hardware, but most folks start very high on the stack. That is a shame, because under the hood is a rich and exciting legacy of numerical computing software that forms the basis of the trust we place in computers.

My trust in computers started to erode when Mark Abramson gave a lecture on the possible errors of floating point arithmetic. All programmers are aware of the problems possible by a division by zero or a narrowing conversion that loses information. However, other times the cause can be the futile attempt of a software developer to round a floating-point number. This was shocking: one of the most basic operations in math, a thing that we learn to do before we can ride a bike, eludes the combined efforts of the finest engineers over the last 30 years. To me, this was something intuitively nonsensical: if computers can’t consistently and reliability round numbers, how can we trust them to do anything?

This turns out to be one of those things like the halting problem that proves it is impossible for a computer to understand computer code. Floating points are represented internally in the IEEE-754 specification. Just like the impossible realization of a random number^{1}, there is no such thing as a true fractional number in a computer. The key thing is to recognize that floating types do not represent numbers exactly. Internally the value is not a continuous range of numbers; instead it is represented as an exponent multiplied by an arithmetical series. For example:

$$

\frac{1}{2}^1 + \frac{1}{2}^2 + \frac{1}{2}^3 + \ldots + \frac{1}{2}^{30}

$$

From the above, you can see that in any range of floating point numbers there are gaps. Between any two floating point numbers there is a difference at the granularity of the smallest element in the arithmetical series (here $\frac{1}{2}^{30}$). If a number falls in such a gap, the difference between the real number is the approximation error. This leads to a fascinating and esoteric subject, that is best covered by others.

In grad school, my faith in computers only decreased when I realized the challenges associated with Eigenvalue problems, Singular value decomposition and LU/Cholesky/QR/… decompositions. If you want to understand how these are handled, a rough understanding of high performance computing libraries is necessary. These libraries are important because of the issues described above. It is very important (and very hard) to have both an efficient and correct algorithm for matrix decomposition.

I spent my teens coding in BASIC, my college years in MATLAB, and professionally I’ve used lots of Julia, FORTRAN, Ruby, Python, C, Haskell, and x86 or MIPS assembly. As I’ve used different languages, I kept coming back to familiar libraries that every language used with strange names like LAPACK, BLAS, FFTW, and ATLAS.

MATLAB will always be the language that I call home. It is probably the only language I was paid at work to write production code with. In building and running large models, I had to get into the internals of MATLAB optimization and learned that MATLAB was basically a wrapper around excellent matrix optimization libraries. As Cleve Moler writes, the basic motivation for MATLAB was to make LINPACK, EISPACK easy to use. Finally, on a New Year’s holiday, I had the chance to pull together a quick summary of the main math libraries I’ve come across: LAPACK, BLAS, ATLAS and FFTW.

# LAPACK – Linear Algebra Package

LAPACK is the most common library for numerical linear algebra. I’ve used its routines for solving systems of linear equations and linear least squares, eigenvalue problems, and singular value decomposition. It also includes routines to implement the associated matrix factorizations such as LU, QR, Cholesky and Schur decomposition. LAPACK is written in FORTRAN and is powerful and popular since it handles both real and complex matrices in both single and double precision (subject to the caveats above).

LAPACK is the successor of EISPACK and LINPACK, both libraries for linear algebra algorithms, Developed in the early 70s (credits to Jack Dongarra, Jim Bunch, Cleve Moler, Pete Stewart). LINPACK is still used as benchmark for the most powerful supercomputers. The EISPACK and LINPACK software libraries were designed for early supercomputers (think the CDC-7600, Cyber 205, and Cray-1). These machines featured multiple functional units pipelined for increased performance. The CDC-7600 was basically a high-performance scalar computer, while the Cyber 205 and Cray-1 were early vector computers.^{2} One drawback to early “vector-based” architectures is the absence of optimized locality in data access. Consequently, EISPACK and LINPACK were subject to low performance on computers with deep memory hierarchies which became popular in the late 80s.

LAPACK was designed to specifically reimplement the algorithms as “block-based” to incorporate locality by Jim Demmel, Jack Dongarra and many others. (See a great history here.) Many computers have cache memory that is much faster than main memory; keeping matrix manipulations localized allows better usage of the cache. LAPACK was able to use this knowledge to run efficiently on shared memory and vector supercomputers. More recently, the ScaLAPACK software library extends the use of LAPACK to distributed memory concurrent supercomputers and all routines have been redesigned for distributed memory MIMD (multiple instruction, multiple data) parallel computers.

# BLAS – Basic Linear Algebra Subroutines

LAPACK implemented on top of BLAS, a collection of low-level matrix and vector arithmetic operations. BLAS performs basic operations such as “multiply a vector by a scalar”, “multiply two matrices and add to a third matrix”. As a programmer, this is the type of thing I would definitely either get wrong or implement inefficiently. By contrast, LAPACK is a collection of higher-level linear algebra operations. LAPACK has routines for matrix factorizations (LU, LLt, QR, SVD, Schur, etc) that are used to answer more complex questions such as “find the eigenvalues of a matrix”, or potentially expensive operations like “find the singular values of a matrix”, or “solve a linear system”. LAPACK is built on top of the BLAS; many users of LAPACK only use the LAPACK interfaces and never need to be aware of the BLAS at all. LAPACK is generally compiled separately from the BLAS, and can use whatever highly-optimized BLAS implementation you have available.

In this sense, Basic Linear Algebra Subprograms (BLAS) is more of a specification than a specific software package. It prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. Due to the academic focus in this area, they are the *de facto* standard for low-level routines for linear algebra libraries. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. Modern BLAS implementations take advantage of special floating point hardware such as vector registers or Single instruction, multiple data instructions.

BLAS is often divided into three main areas:

- BLAS1: vector-vector operations (e.g., vector sum)
- BLAS2: matrix-vector operations (e.g., matrix-vector product)
- BLAS3: matrix-matrix operations (mainly matrix-matrix product)

# ATLAS — Automatically Tuned Linear Algebra Software

ATLAS is a modern attempt to make a BLAS implementation with higher performance and more portability. As excellent as BLAS is, it has to be specifically compiled and optimized for different hardware. ATLAS is a portable and reasonably good implementation of the BLAS interfaces, that also implements a few of the most commonly used LAPACK operations. ATLAS defines many BLAS operations in terms of some core routines and then tries to automatically tailor the core routines to have good performance. For example, it performs a search to choose good block sizes that may depend on the computer’s cache size and architecture. It also accounts for differing implementations by providing tests to see if copying arrays and vectors improves performance.^{3}

# FFTW — Fastest Fourier Transform in the West (FFTW)

For an example of a much more specific library, FFTW is a software library for computing discrete Fourier transforms (DFTs).^{4} According to by regular benchmarks, FFTW is known as the fastest free software implementation of the Fast Fourier transform. It can compute transforms of real and complex-valued arrays of arbitrary size and dimension in $O(n \log n)$ time.

The magic of FFTW is to choose the optimal algorithm from a wide array of options. It works best on arrays of sizes with small prime factors, with powers of two being optimal and large primes being worst case (but still $O(n \log n)$). For example, to decompose transforms of composite sizes into smaller transforms, it chooses among several variants of the Cooley–Tukey FFT algorithm, while for prime sizes it uses either Rader’s or Bluestein’s FFT algorithm. Once the transform has been broken up into subtransforms of sufficiently small sizes, FFTW uses hard-coded unrolled FFTs for these small sizes that were produced by code generation. It goes meta.

For more detail from some real experts, check out Numerical Analysis: Historical Developments in the 20th Century by Authors C. Brezinski, L. Wuytack

- It is impossible to get a truly random number from a computer. Even if the sequence never repeats, which is not guaranteed for random numbers, another run of the program with the same inputs will produce the same results. So, someone else can reproduce your random numbers at a later time, which means it wasn’t really random at all. This causes all kinds of problems, but the opposite case would cause many more. ↩
- In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors, compared to scalar processors, whose instructions operate on single data items. ↩
- For example, it may be advantageous to copy arguments so that they are cache-line aligned so user-supplied routines can use SIMD instructions. Today, most commodity CPUs implement architectures that feature instructions for a form of vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). ↩
- Yay! FFTW was developed by Matteo Frigo and Steven G. Johnson at the Massachusetts Institute of Technology. ↩