## The Middle Way – a.k.

A few years ago we spent some time implementing a number of the sorting, searching and set manipulation algorithms from the standard C++ library in JavaScript. Since the latter doesn't support the former's abstraction of container access via iterators we were compelled to restrict ourselves to using native `Array` objects following the conventions of its methods, such as `slice` and `sort`.
In this post we shall take a look at an algorithm for finding the centrally ranked element, or median, of an array, which is strongly related to the `ak.nthElement` function, and then at a particular use for it.

## Testing Our Students – a.k.

Last time we saw how we can use the chi-squared distribution to test whether a sample of values is consistent with pre-supposed expectations. A couple of months ago we took a look at Student's t-distribution which we can use to test whether a set of observations of a normally distributed random variable are consistent with its having a given mean when its variance is unknown.

## Time For A Chi Test – a.k.

A few months ago we explored the chi-squared distribution which describes the properties of sums of squares of standard normally distributed random variables, being those that have means of zero and standard deviations of one.
Whilst I'm very much of the opinion that statistical distributions are worth describing in their own right, the chi-squared distribution plays a pivotal role in testing whether or not the categories into which a set of observations of some variable quantity fall are consistent with assumptions about the expected numbers in each category, which we shall take a look at in this post.

## A Jolly Student’s Tea Party – a.k.

Last time we took a look at the chi-squared distribution which describes the behaviour of sums of squares of standard normally distributed random variables, having means of zero and standard deviations of one.
Tangentially related is Student's t-distribution which governs the deviation of means of sets of independent observations of a normally distributed random variable from its known true mean, which we shall examine in this post.

## Chi Chi Again – a.k.

Several years ago we saw that, under some relatively easily met assumptions, the averages of independent observations of a random variable tend toward the normal distribution. Derived from that is the chi-squared distribution which describes the behaviour of sums of squares of independent standard normal random variables, having means of zero and standard deviations of one.
In this post we shall see how it is related to the gamma distribution and implement its various functions in terms of those of the latter.

## Adapt Or Try – a.k.

Over the last few months we have been looking at how we might approximate the solutions to ordinary differential equations, or ODEs, which define the derivative of one variable with respect to another with a function of them both. Firstly with the first order Euler method, then with the second order midpoint method and finally with the generalised Runge-Kutta method.
Unfortunately all of these approaches require the step length to be fixed and specified in advance, ignoring any information that we might use to adjust it during the iteration in order to better trade off the efficiency and accuracy of the approximation. In this post we shall try to automatically modify the step lengths to yield an optimal, or at least reasonable, balance.

## A Kutta Above The Rest – a.k.

We have recently been looking at ordinary differential equations, or ODEs, which relate the derivatives of one variable with respect to another to them with a function so that we cannot solve them with plain integration. Whilst there are a number of tricks for solving such equations if they have very specific forms, we typically have to resort to approximation algorithms such as the Euler method, with first order errors, and the midpoint method, with second order errors.
In this post we shall see that these are both examples of a general class of algorithms that can be accurate to still greater orders of magnitude.

## Finding The Middle Ground – a.k.

Last time we saw how we can use Euler's method to approximate the solutions of ordinary differential equations, or ODEs, which define the derivative of one variable with respect to another as a function of them both, so that they cannot be solved by direct integration. Specifically, it uses Taylor's theorem to estimate the change in the first variable that results from a small step in the second, iteratively accumulating the results for steps of a constant length to approximate the value of the former at some particular value of the latter.
Unfortunately it isn't very accurate, yielding an accumulated error proportional to the step length, and so this time we shall take a look at a way to improve it.

## Out Of The Ordinary – a.k.

Several years ago we saw how to use the trapezium rule to approximate integrals. This works by dividing the interval of integration into a set of equally spaced values, evaluating the function being integrated, or integrand, at each of them and calculating the area under the curve formed by connecting adjacent points with straight lines to form trapeziums.
This was an improvement over an even more rudimentary scheme which instead placed rectangles spanning adjacent values with heights equal to the values of the function at their midpoints to approximate the area. Whilst there really wasn't much point in implementing this since it offers no advantage over the trapezium rule, it is a reasonable first approach to approximating the solutions to another type of problem involving calculus; ordinary differential equations, or ODEs.

## Will They Blend? – a.k.

Last time we saw how we can create new random variables from sets of random variables with given probabilities of observation. To make an observation of such a random variable we randomly select one of its components, according to their probabilities, and make an observation of it. Furthermore, their associated probability density functions, or PDFs, cumulative distribution functions, or CDFs, and characteristic functions, or CFs, are simply sums of the component functions weighted by their probabilities of observation.
Now there is nothing about such distributions, known as mixture distributions, that requires that the components are univariate. Given that copulas are simply multivariate distributions with standard uniformly distributed marginals, being the distributions of each element considered independently of the others, we can use the same technique to create new copulas too.

## Mixing It Up – a.k.

Last year we took a look at basis function interpolation which fits a weighted sum of n independent functions, known as basis functions, through observations of an arbitrary function's values at a set of n points in order to approximate it at unobserved points. In particular, we saw that symmetric probability density functions, or PDFs, make reasonable basis functions for approximating both univariate and multivariate functions.
It is quite tempting, therefore, to use weighted sums of PDFs to construct new PDFs and in this post we shall see how we can use a simple probabilistic argument to do so.

## A PR Exercise – a.k.

In the last few posts we've been looking at the BFGS quasi-Newton algorithm for minimising multivariate functions. This uses iteratively updated approximations of the Hessian matrix of second partial derivatives in order to choose directions in which to search for univariate minima, saving the expense of calculating it explicitly. A particularly useful property of the algorithm is that if the line search satisfies the Wolfe conditions then the positive definiteness of the Hessian is preserved, meaning that the implied locally quadratic approximation of the function must have a minimum.
Unfortunately for large numbers of dimension the calculation of the approximation will still be relatively expensive and will require a significant amount of memory to store and so in this post we shall take a look at an algorithm that only uses the vector of first partial derivatives.

## Bring Out The Big Flipping GunS – a.k.

Last month we took a look at quasi-Newton multivariate function minimisation algorithms which use approximations of the Hessian matrix of second partial derivatives to choose line search directions. We demonstrated that the BFGS rule for updating the Hessian after each line search maintains its positive definiteness if they conform to the Wolfe conditions, ensuring that the locally quadratic approximation of the function defined by its value, the vector of first partial derivatives and the Hessian has a minimum.
Now that we've got the theoretical details out of the way it's time to get on with the implementation.

## Big Friendly GiantS – a.k.

In the previous post we saw how we could perform a univariate line search for a point that satisfies the Wolfe conditions meaning that it is reasonably close to a minimum and takes a lot less work to find than the minimum itself. Line searches are used in a class of multivariate minimisation algorithms which iteratively choose directions in which to proceed, in particular those that use approximations of the Hessian matrix of second partial derivatives of a function to do so, similarly to how the Levenberg-Marquardt multivariate inversion algorithm uses a diagonal matrix in place of the sum of the products of its Hessian matrices for each element and the error in that element's current value, and in this post we shall take a look at one of them.

## Wolfe It Down – a.k.

Last time we saw how we could efficiently invert a vector valued multivariate function with the Levenberg-Marquardt algorithm which replaces the sum of its second derivatives with respect to each element in its result multiplied by the difference from those of its target value with a diagonal matrix. Similarly there are minimisation algorithms that use approximations of the Hessian matrix of second partial derivatives to estimate directions in which the value of the function will decrease.
Before we take a look at them, however, we'll need a way to step toward minima in such directions, known as a line search, and in this post we shall see how we might reasonably do so.

## Found In Space – a.k.

Some time ago we saw how Newton's method used the derivative of a univariate scalar valued function to guide the search for an argument at which it took a specific value. A related problem is finding a vector at which a multivariate vector valued function takes one, or at least comes as close as possible to it. In particular, we should often like to fit an arbitrary parametrically defined scalar valued functional form to a set of points with possibly noisy values, much as we did using linear regression to find the best fitting weighted sum of a given set of functions, and in this post we shall see how we can generalise Newton's method to solve such problems.

## Smooth Operator – a.k.

Last time we took a look at linear regression which finds the linear function that minimises the differences between its results and values at a set of points that are presumed, possibly after applying some specified transformation, to be random deviations from a straight line or, in multiple dimensions, a flat plane. The purpose was to reveal the underlying relationship between the independent variable represented by the points and the dependent variable represented by the values at them.
This time we shall see how we can approximate the function that defines the relationship between them without actually revealing what it is.

## Regressive Tendencies – a.k.

Several months ago we saw how we could use basis functions to interpolate between points upon arbitrary curves or surfaces to approximate the values between them. Related to that is linear regression which fits a straight line, or a flat plane, though points that have values that are assumed to be the results of a linear function with independent random errors, having means of zero and equal standard deviations, in order to reveal the underlying relationship between them. Specifically, we want to find the linear function that minimises the differences between its results and the values at those points.

## What’s The Lucky Number? – a.k.

Over the last few months we have been looking at Bernoulli processes which are sequences of Bernoulli trails, being observations of a Bernoulli distributed random variable with a success probability of p. We have seen that the number of failures before the first success follows the geometric distribution and the number of failures before the rth success follows the negative binomial distribution, which are the discrete analogues of the exponential and gamma distributions respectively.
This time we shall take a look at the binomial distribution which governs the number of successes out of n trials and is the discrete version of the Poisson distribution.

## Bad Luck Comes In Ks – a.k.

Lately we have been looking at Bernoulli processes which are sequences of independent experiments, known as Bernoulli trials, whose successes or failures are given by observations of a Bernoulli distributed random variable. Last time we saw that the number of failures before the first success was governed by the geometric distribution which is the discrete analogue of the exponential distribution and, like it, is a memoryless waiting time distribution in the sense that the distribution for the number of failures before the next success is identical no matter how many failures have already occurred whilst we've been waiting.
This time we shall take a look at the distribution of the number of failures before a given number of successes, which is a discrete version of the gamma distribution which defines the probabilities of how long we must wait for multiple exponentially distributed events to occur.

## If At First You Don’t Succeed – a.k.

Last time we took a first look at Bernoulli processes which are formed from a sequence of independent experiments, known as Bernoulli trials, each of which is governed by the Bernoulli distribution with a probability p of success. Since the outcome of one trial has no effect upon the next, such processes are memoryless meaning that the number of trials that we need to perform before getting a success is independent of how many we have already performed whilst waiting for one.
We have already seen that if waiting times for memoryless events with fixed average arrival rates are continuous then they must be exponentially distributed and in this post we shall be looking at the discrete analogue.

## One Thing Or Another – a.k.

Several years ago we took a look at memoryless processes in which the probability that we should wait for any given length of time for an event to occur is independent of how long we have already been waiting. We found that this implied that the waiting time must be exponentially distributed, that the waiting time for several events must be gamma distributed and that the number of events occuring in a unit of time must be Poisson distributed.
These govern continuous memoryless processes in which events can occur at any time but not those in which events can only occur at specified times, such as the roll of a die coming up six, known as Bernoulli processes. Observations of such processes are known as Bernoulli trials and their successes and failures are governed by the Bernoulli distribution, which we shall take a look at in this post.

## Beta Animals – a.k.

Several years ago we took a look at the gamma function Γ, which is a generalisation of the factorial to non-integers, being equal to the factorial of a non-negative integer n when passed an argument of n+1 and smoothly interpolating between them. Like the normal cumulative distribution function Φ, it and its related functions are examples of special functions; so named because they frequently crop up in the solutions to interesting mathematical problems but can't be expressed as simple formulae, forcing us to resort to numerical approximation.
This time we shall take a look at another family of special functions derived from the beta function B.

## Slashing The Odds – a.k.

In the previous post we explored the Cauchy distribution, which, having undefined means and standard deviations, is an example of a pathological distribution. We saw that this is because it has a relatively high probability of generating extremely large values which we concluded was a consequence of its standard random variable being equal to the ratio of two independent standard normally distributed random variables, so that the magnitudes of observations of it can be significantly increased by the not particularly unlikely event that observations of the denominator are close to zero.
Whilst we didn't originally derive the Cauchy distribution in this way, there are others, known as ratio distributions, that are explicitly constructed in this manner and in this post we shall take a look at one of them.

## Moments Of Pathological Behaviour – a.k.

Last time we took a look at basis function interpolation with which we approximate functions from their values at given sets of arguments, known as nodes, using weighted sums of distinct functions, known as basis functions. We began by constructing approximations using polynomials before moving on to using bell shaped curves, such as the normal probability density function, centred at the nodes. The latter are particularly useful for approximating multi-dimensional functions, as we saw by using multivariate normal PDFs.
An easy way to create rotationally symmetric functions, known as radial basis functions, is to apply univariate functions that are symmetric about zero to the distance between the interpolation's argument and their associated nodes. PDFs are a rich source of such functions and, in fact, the second bell shaped curve that we considered is related to that of the Cauchy distribution, which has some rather interesting properties.

## All Your Basis Are Belong To Us – a.k.

A few years ago we saw how we could approximate a function f between pairs of points (xi, f(xi)) and (xi+1, f(xi+1)) by linear and cubic spline interpolation which connect them with straight lines and cubic polynomials respectively, the latter of which yield smooth curves at the cost of somewhat arbitrary choices about their exact shapes.
An alternative approach is to construct a single function that passes through all of the points and, given that nth order polynomials are uniquely defined by n+1 values at distinct xi, it's tempting to use them.

## The Spectral Apparition – a.k.

Over the last few months we have seen how we can efficiently implement the Householder transformations and shifted Givens rotations used by Francis's algorithm to diagonalise a real symmetric matrix M, yielding its eigensystem in a matrix V whose columns are its eigenvectors and a diagonal matrix Λ whose diagonal elements are their associated eigenvalues, which satisfy

M = V × Λ × VT

and together are known as the spectral decomposition of M.
In this post, we shall add it to the `ak` library using the `householder` and `givens` functions that we have put so much effort into optimising.

## Funky Givens – a.k.

We have recently been looking at how we can use a special case of Francis's QR transformation to reduce a real symmetric matrix M to a diagonal matrix Λ by first applying Householder transformations to put it in tridiagonal form and then using shifted Givens rotations to zero out the off diagonal elements.
The columns of the matrix of transformations V and the elements on the leading diagonal of Λ are the unit eigenvectors and eigenvalues of M respectively and they consequently satisfy

M × V = V × Λ

and, since the product of V and its transpose is the identity matrix

M = V × Λ × VT

which is known as the spectral decomposition of M.
Last time we saw how we could efficiently apply the Householder transformations in-place, replacing the elements of M with those of the matrix of accumulated transformations Q and creating a pair of arrays to represent the leading and off diagonal elements of the tridiagonal matrix. This time we shall see how we can similarly improve the implementation of the Givens rotations.

## A Well Managed Household – a.k.

Over the last few months we have seen how we can use a sequence of Householder transformations followed by a sequence of shifted Givens rotations to efficiently find the spectral decomposition of a symmetric real matrix M, formed from a matrix V and a diagonal matrix Λ satisfying

M × V = V × Λ

implying that the columns of V are the unit eigenvectors of M and their associated elements on the diagonal of Λ are their eigenvalues so that

V × VT = I

where I is the identity matrix, and therefore

M = V × Λ × VT

From a mathematical perspective the combination of Householder transformations and shifted Givens rotations is particularly appealing, converging on the spectral decomposition after relatively few matrix multiplications, but from an implementation perspective using `ak.matrix` multiplication operations is less than satisfactory since it wastefully creates new `ak.matrix` objects at each step and so in this post we shall start to see how we can do better.

## Spryer Francis – a.k.

Last time we saw how we could use a sequence of Householder transformations to reduce a symmetric real matrix M to a symmetric tridiagonal matrix, having zeros everywhere other than upon the leading, upper and lower diagonals, which we could then further reduce to a diagonal matrix Λ using a sequence of Givens rotations to iteratively transform the elements upon the upper and lower diagonals to zero so that the columns of the accumulated transformations V were the unit eigenvectors of M and the elements on the leading diagonal of the result were their associated eigenvalues, satisfying

M × V = V × Λ

and, since the transpose of V is its own inverse

M = V × Λ × VT

which is known as the spectral decomposition of M.
Unfortunately, the way that we used Givens rotations to diagonalise tridiagonal symmetric matrices wasn't particularly efficient and I concluded by stating that it could be significantly improved with a relatively minor change. In this post we shall see what it is and why it works.

## FAO The Householder – a.k.

Some years ago we saw how we could use the Jacobi algorithm to find the eigensystem of a real valued symmetric matrix M, which is defined as the set of pairs of non-zero vectors vi and scalars λi that satisfy

M × vi = λi × vi

known as the eigenvectors and the eigenvalues respectively, with the vectors typically restricted to those of unit length in which case we can define its spectral decomposition as the product

M = V × Λ × VT

where the columns of V are the unit eigenvectors, Λ is a diagonal matrix whose ith diagonal element is the eigenvalue associated with the ith column of V and the T superscript denotes the transpose, in which the rows and columns of the matrix are swapped.
You may recall that this is a particularly convenient representation of the matrix since we can use it to generalise any scalar function to it with

f(M) = V × f(Λ) × VT

where f(Λ) is the diagonal matrix whose ith diagonal element is the result of applying f to the ith diagonal element of Λ.
You may also recall that I suggested that there's a more efficient way to find eigensystems and I think that it's high time that we took a look at it.

## A Little Bit Slinky – a.k.

For several months we've for been taking a look at cluster analysis which seeks to partition sets of data into subsets of similar data, known as clusters. Most recently we have focused our attention on hierarchical clusterings, which are sequences of sets of clusters in which pairs of data that belong to the same cluster at one step belong to the same cluster in the next step.
A simple way of constructing them is to initially place each datum in its own cluster and then iteratively merge the closest pairs of clusters in each clustering to produce the next one in the sequence, stopping when all of the data belong to a single cluster. We have considered three ways of measuring the distance between pairs of clusters, the average distance between their members, the distance between their closest members and the distance between their farthest members, known as average linkage, single linkage and complete linkage respectively, and implemented a reasonably efficient algorithm for generating hierarchical clusterings defined with them, using a min-heap structure to cache the distances between clusters.
Finally, I claimed that there is a more efficient algorithm for generating single linkage hierarchical clusterings that would make the sorting of clusters by size in our `ak.clustering` type too expensive and so last time we implemented the `ak.rawClustering` type to represent clusterings without sorting their clusters which we shall now use in the implementation of that algorithm.

## Cut Price Clusterings – a.k.

Last month we saw how we could efficiently generate hierarchical clusterings, which are sequences of sets of clusters, which are themselves subsets of a set of data that each contain elements that are similar to each other, such that if a pair of data are in the same clustering at one step then they must be in the same clustering in the next which will always be the case if we move from one step to the next by merging the closest pairs of clusters. Specifically, we used our `ak.minHeap` implementation of the min-heap structure to cache the distances between clusters, saving us the expense of recalculating them for clusters that don't change from one step in the hierarchy to the next.
Recall that we used three different schemes for calculating the distance between a pair of clusters, the average distance between their members, known as average linkage, the distance between their closest members, known as single linkage, and the distance between their farthest members, known as complete linkage, and that I concluded by noting that our algorithm was about as efficient as possible in general but that there is a much more efficient scheme for single linkage clusterings; efficient enough that sorting the clusters in each clustering by size would be the most costly operation and so in this post we shall implement objects to represent clusterings that don't do that.

## Racing Up The Hierarchy – a.k.

In the previous post we saw how to identify subsets of a set of data that are in some sense similar to each other, known as clusters, by constructing sequences of clusterings starting with each datum in its own cluster and ending with all of the data in the same cluster, subject to the constraint that if a pair of data are in the same cluster in one clustering then they must also be in the same cluster in the next, which are known as hierarchical clusterings.
We did this by selecting the closest pairs of clusters in one clustering and merging them to create the next, using one of three different measures of the distance between a pair of clusters; the average distance between their members, the distance between their nearest members and the distance between their farthest members, known as average linkage, single linkage and complete linkage respectively.
Unfortunately our implementation came in at a rather costly O(n3) operations and so in this post we shall look at how we can improve its performance.

## A Place In The Hierarchy – a.k.

Last time we implemented the clusterings type to store a set of clustering objects in order to represent hierarchical clusterings, which are sequences of clusterings having the property that if a pair of data are in the same cluster in one clustering then they will be in the same cluster in the next, where clusters are subsets of a set of data that are in some sense similar to each other.
We then went on to define the `ak.clade` type to represent hierarchical clusterings as trees, so named because that's what they're called in biology when they are used to show the relationships between species and their common ancestors.
Now that we have those structures in place we're ready to see how to create hierarchical clusterings and so in this post we shall start with a simple, general purpose, but admittedly rather inefficient, way to do so.

## Making The Clade – a.k.

We have so far seen a couple of schemes for identifying clusters in sets of data which are subsets whose members are in some sense similar to each other. Specifically, we have looked at the k means and the shared near neighbours algorithms which implicitly define clusters by the closeness of each datum to the average of each cluster and by their closeness to each other respectively.
Note that both of these algorithms use a heuristic, or rule of thumb, to assign data to clusters, but there's another way to construct clusterings; define a heuristic to measure how close to each other a pair of clusters are and then, starting with each datum in a cluster of its own, progressively merge the closest pairs until we end up with a single cluster containing all of the data. This means that we'll end up with a sequence of clusterings and so before we can look at such algorithms we'll need a structure to represent them.

## Nearest And Dearest – a.k.

Last time we saw how we could use the list of the nearest neighbours of a datum in a set of clustered data to calculate its strengths of association with their clusters. Specifically, we used the k nearest neighbours algorithm to define those strengths as the proportions of its k nearest neighbours that were members of each cluster or with a generalisation of it that assigned weights to the neighbours according to their positions in the list.
This time we shall take a look at a clustering algorithm that uses nearest neighbours to identify clusters, contrasting it with the k means clustering algorithm that we covered about four years ago.

## In The Neighbourhood – a.k.

A little under four years ago we saw how we could use the k means algorithm to divide sets of data into distinct subsets, known as clusters, whose members are in some sense similar to each other. The interesting thing about clustering is that even though we find it easy to spot clusters, at least in two dimensions, it's incredibly difficult to give them a firm mathematical definition and so clustering algorithms invariably define them implicitly as the subsets identified by this algorithm.
The k means algorithm, for example, does so by first picking k different elements of the data as cluster representatives and then places every element in the cluster whose representative is nearest to it. The cluster representatives are then replaced by the means of the elements assign to it and the process is repeated iteratively until the clusters stop changing.
Now I'd like to introduce some more clustering algorithms but there are a few things that we'll need first.

## A Heap Of Stuff – a.k.

A little over a year ago we added a bunch of basic computer science algorithms to the `ak` library, taking inspiration from the C++ standard library. Now, JavaScript isn't just short of algorithms in its standard library, it's also lacking all but a few data structures. In fact, from the programmer's perspective, it only has hash maps which use a function to map strings to non-negative integers that are then used as indices into an array of sets of key-value pairs. Technically speaking, even arrays are hash maps of the string representation of their indices, although in practice the interpreter will use more efficient data structures whenever it can.
As clever as its implementation might be, it's unlikely that it will always figure out exactly what we're trying to do and pick the most appropriate data structure and so it will occasionally be worth explicitly implementing a data structure so that we can be certain of its performance.
The first such data structure that we're going to need is a min-heap which will allow us to efficiently add elements in any order and remove them in ascending order, according to some comparison function.

## Archimedean Crew – a.k.

We have recently seen how we can define dependencies between random variables with Archimedean copulas which calculate the probability that they each fall below given values by applying a generator function φ to the results of their cumulative distribution functions, or CDFs, for those values, and applying its inverse to their sum.
Like all copulas they are effectively the CDFs of vector valued random variables whose elements are uniformly distributed when considered independently. Whilst those Archimedean CDFs were relatively trivial to implement, we found that their probability density functions, or PDFs, were somewhat more difficult and that the random variables themselves required some not at all obvious mathematical manipulation to get right.
Having done all the hard work implementing the `ak.archimedeanCopula`, `ak.archimedeanCopulaDensity` and `ak.archimedeanCopulaRnd` functions we shall now use them to implement some specific families of Archimedean copulas.

## Archimedean Review – a.k.

In the last couple of posts we've been taking a look at Archimedean copulas which define the dependency between the elements of vector values of a multivariate random variable by applying a generator function φ to the values of the cumulative distribution functions, or CDFs, of their distributions when considered independently, known as their marginal distributions, and applying the inverse of the generator to the sum of the results to yield the value of the multivariate CDF.
We have seen that the densities of Archimedean copulas are rather trickier to calculate and that making random observations of them is trickier still. Last time we found an algorithm for the latter, albeit with an implementation that had troubling performance and numerical stability issues, and in this post we shall add an improved version to the `ak` library that addresses those issues.

## Archimedean View – a.k.

Last time we took a look at how we could define copulas to represent the dependency between random variables by summing the results of a generator function φ applied to the results of their cumulative distribution functions, or CDFs, and then applying the inverse of that function φ-1 to that sum.
These are known as Archimedean copulas and are valid whenever φ is strictly decreasing over the interval [0,1], equal to zero when its argument equals one and have nth derivatives that are non-negative over that interval when n is even and non-positive when it is odd, for n up to the number of random variables.
Whilst such copulas are relatively easy to implement we saw that their densities are a rather trickier job, in contrast to Gaussian copulas where the reverse is true. In this post we shall see how to draw random vectors from Archimedean copulas which is also much more difficult than doing so from Gaussian copulas.

## Archimedean Skew – a.k.

About a year and a half ago we saw how we could use Gaussian copulas to define dependencies between the elements of a vector valued multivariate random variable whose elements, when considered in isolation, were governed by arbitrary cumulative distribution functions, known as marginals. Whilst Gaussian copulas are quite flexible, they can't represent every possible dependency between those elements and in this post we shall take a look at some others defined by the Archimedean family of copulas.

## New Directions Of Interpolation – a.k.

We have spent a few months looking at how we might interpolate between sets of points (xi, yi), where the xi are known as nodes and the yi as values, to approximate values of y for values of x between the nodes, either by connecting them with straight lines or with cubic curves.
Last time, in preparation for interpolating between multidimensional vector nodes, we implemented the `ak.grid` type to store ticks on a set of axes and map their intersections to `ak.vector` objects to represent such nodes arranged at the corners of hyperdimensional rectangular cuboids.
With this in place we're ready to take a look at one of the simplest multidimensional interpolation schemes; multilinear interpolation.

## Cuboid Space Division – a.k.

Over the last few months we have been taking a look at algorithms for interpolating over a set of points (xi,yi) in order to approximate values of y between the nodes xi. We began with linear interpolation which connects the points with straight lines and is perhaps the simplest interpolation algorithm. Then we moved on to cubic spline interpolation which yields a smooth curve by specifying gradients at the nodes and fitting cubic polynomials between them that match both their values and their gradients. Next we saw how this could result in curves that change from increasing to decreasing, or vice versa, between the nodes and how we could fix this problem by adjusting those gradients.
I concluded by noting that, even with this improvement, the shape of a cubic spline interpolation is governed by choices that are not uniquely determined by the points themselves and that linear interpolation is consequently a more mathematically appropriate scheme, which is why I chose to generalise it to other arithmetic types for y, like complex numbers or matrices, but not to similarly generalise cubic spline interpolation.

The obvious next question is whether or not we can also generalise the nodes to other arithmetic types; in particular to vectors so that we can interpolate between nodes in more than one dimension.

## We’re Not For Turning – a.k.

We have seen how it is possible to smoothly interpolate between a set of points (xi, yi), with the xi known as nodes and the yi as values, by specifying the gradients gi at the nodes and calculating values between adjacent pairs using the uniquely defined cubic polynomials that match the values and gradients at them.
We have also seen how extrapolating such polynomials beyond the first and last nodes can yield less than satisfactory results, which we fixed by specifying the first and last gradients and then adding new first and last nodes to ensure that the first and last polynomials would represent straight lines.
Now we shall see how cubic spline interpolation can break down rather more dramatically and how we might fix it.

## Cubic Line Division – a.k.

Last time we took a look at how we can use linear interpolation to approximate a function from a set of points on its graph by connecting them with straight lines. As a consequence the result isn't smooth, meaning that its derivative isn't continuous and is undefined at the x values of the points, known as the nodes of the interpolation.
In this post we shall see how we can define a smooth interpolation by connecting the points with curves rather than straight lines.

## Chalk The Lines – a.k.

Given a set of points (xi,yi), a common problem in numerical analysis is trying to estimate values of y for values of x that aren't in the set. The simplest scheme is linear interpolation, which connects points with consecutive values of x with straight lines and then uses them to calculate values of y for values of x that lie between those of their endpoints.
On the face of it implementing this would seem to be a pretty trivial business, but doing so both accurately and efficiently is a surprisingly tricky affair, as we shall see in this post.

## A Measure Of Borel Weight – a.k.

In the last few posts we have implemented a type to represent Borel sets of the real numbers, which are the subsets of them that can be created with countable unions of intervals with closed or open lower and upper bounds. Whilst I would argue that doing so was a worthwhile exercise in its own right, you may be forgiven for wondering what Borel sets are actually for and so in this post I shall try to justify the effort that we have spent on them.

## A Borel Universe – a.k.

Last time we took a look at Borel sets of real numbers, which are subsets of the real numbers that can be represented as unions of countable sets of intervals Ii. We got as far as implementing the `ak.borelInterval` type to represent an interval as a pair of `ak.borelBound` objects holding its lower and upper bounds.
With these in place we're ready to implement a type to represent Borel sets and we shall do exactly that in this post.

## A Decent Borel Code – a.k.

A few posts ago we took a look at how we might implement various operations on sets represented as sorted arrays, such as the union, being the set of every element that is in either of two sets, and the intersection, being the set of every element that is in both of them, which we implemented with `ak.setUnion` and `ak.setIntersection` respectively.
Such arrays are necessarily both finite and discrete and so cannot represent continuous subsets of the real numbers such as intervals, which contain every real number within a given range. Of particular interest are unions of countable sets of intervals Ii, known as Borel sets, and so it's worth adding a type to the `ak` library to represent them.

## The After Strife – a.k.

As well as required arithmetic operations, such as addition, subtraction, multiplication and division, the IEEE 754 floating point standard has a number of recommended functions. For example `finite` determines whether its argument is neither infinite nor NaN and `isnan` determines whether its argument is NaN; behaviours that shouldn't be particularly surprising since they're more or less equivalent to JavaScript's `isFinite` and `isNaN` functions respectively.
One recommended function that JavaScript does not provide, and which I should like to add to the `ak` library, is `nextafter` which returns the first representable floating point number after its first argument in the direction towards its second.

## Let’s Talk About Sets – a.k.

In the last couple of posts we have seen various ways to partially or fully sort data and the kinds of queries that we can run against them once they have been. Such query operations make fully sorted arrays a convenient way to represent sets, or more accurately multisets which treat repeated elements as distinct from each other, and in this post we shall exploit this fact to implement some operations that we might wish to perform upon them.

## I Still Haven’t Found What I’m Looking For – a.k.

Last time we took a look at a selection of sorting operations that we can use to sort arrays, or ranges of elements within them. After defining some useful comparison functions satisfying JavaScript's requirement of returning a negative number when the first argument compares smaller than the second, zero when they compare equal and a positive number otherwise, and a function to map negative integers to indices read from the end of arrays in the same way that `Array.slice` does, we first implemented `ak.partition` which divides elements into two ranges; those elements that satisfy some given condition followed by those elements that don't. We saw how this could be used to implement the quicksort algorithm but instead defined `ak.sort` to sort a range of elements using `Array.sort`, slicing them out beforehand and splicing them back in again afterwards if they didn't represent whole arrays. We did use it, however, to implement `ak.nthElement` which puts a the correctly sorted element in a given position position within a range, putting before it elements that are no greater and after it elements that are no smaller. Finally, we implemented `ak.partialSort` which puts every element in a range up to, but not including, a given position into its correctly sorted place with all of the elements from that position onwards comparing no less than the last correctly sorted element.
This time we shall take a look at some of the ways that we can query data after we have manipulated it with these functions.

## We’re All Sorted From A To Z – a.k.

Something that I miss when programming in JavaScript is the wide variety of array manipulation functions available in my primary language, C++. We have, in fact, already implemented one of them with `ak.shuffle` which randomly rearranges the elements of an array. We shall be needing another one of them in the not too distant future and so I have decided to take a short break from numerical computing to add those of them that I use the most frequently to the `ak` library, starting with a selection of sorting operations.

## Do The Evolution – a.k.

In the last few posts we have taken a look at genetic algorithms, which use simple models of biological evolution to search for global maxima of functions, being those points at which they return their greatest possible values.
These models typically represent the arguments of the function as genes within the binary chromosomes of individuals whose fitnesses are the values of the function for those arguments, exchange genetic information between them with a crossover operator, make small random changes to them with a mutation operator and, most importantly, favour the fitter individuals in the population for reproduction into the next generation with a selection operator.
We used a theoretical analysis of a simple genetic algorithm to suggest improved versions of the crossover operator, as well as proposing more robust schemes for selection and the genetic encoding of the parameters.
In this post we shall use some of them to implement a genetic algorithm for the `ak` library.

## The Best Laid Schemata – a.k.

We have seen how we can exploit a simple model of biological evolution, known as a genetic algorithm, to search for global maxima of functions, being those points at which they return their greatest values.
This model treated the function being optimised as a non-negative measure of the fitness of individuals to survive and reproduce, replacing negative results with zero, and represented their chromosomes with arrays of bits which were mapped onto its arguments by treating subsets of them as integers that were linearly mapped to floating point numbers with given lower and upper bounds. It simulated sexual reproduction by splitting pairs of the chromosomes of randomly chosen individuals at a randomly chosen position and swapping their bits from it to their ends, and mutations by flipping randomly chosen bits from the chromosomes of randomly chosen individuals. Finally, and most crucially, it set the probability that an individual would be copied into the next generation to its fitness as a proportion of the total fitness of the population, ensuring that that total fitness would tend to increase from generation to generation.
I concluded by noting that, whilst the resulting algorithm was reasonably effective, it had some problems that a theoretical analysis would reveal and that is what we shall look into in this post.

## The Best Laid Schemata – a.k.

We have seen how we can exploit a simple model of biological evolution, known as a genetic algorithm, to search for global maxima of functions, being those points at which they return their greatest values.
This model treated the function being optimised as a non-negative measure of the fitness of individuals to survive and reproduce, replacing negative results with zero, and represented their chromosomes with arrays of bits which were mapped onto its arguments by treating subsets of them as integers that were linearly mapped to floating point numbers with given lower and upper bounds. It simulated sexual reproduction by splitting pairs of the chromosomes of randomly chosen individuals at a randomly chosen position and swapping their bits from it to their ends, and mutations by flipping randomly chosen bits from the chromosomes of randomly chosen individuals. Finally, and most crucially, it set the probability that an individual would be copied into the next generation to its fitness as a proportion of the total fitness of the population, ensuring that that total fitness would tend to increase from generation to generation.
I concluded by noting that, whilst the resulting algorithm was reasonably effective, it had some problems that a theoretical analysis would reveal and that is what we shall look into in this post.

## It’s All In The Genes – a.k.

Last time we took a look at the simulated annealing global minimisation algorithm which searches for points at which functions return their least possible values and which drew its inspiration from the metallurgical process of annealing which minimises the energy state of the crystalline structure of metals by first heating and then slowly cooling them.
Now as it happens, physics isn't the only branch of science from which we can draw inspiration for global optimisation algorithms. For example, in biology we have the process of evolution through which the myriad species of life on Earth have become extraordinarily well adapted to their environments. Put very simply this happens because offspring differ slightly from their parents and differences that reduce the chances that they will survive to have offspring of their own are less likely to be passed down through the generations than those that increase those chances.
Noting that extraordinarily well adapted is more or less synonymous with near maximally adapted, it's not unreasonable to suppose that we might exploit a mathematical model of evolution to search for global maxima of functions, being those points at which they return their greatest possible values.

## It’s All In The Genes – a.k.

Last time we took a look at the simulated annealing global minimisation algorithm which searches for points at which functions return their least possible values and which drew its inspiration from the metallurgical process of annealing which minimises the energy state of the crystalline structure of metals by first heating and then slowly cooling them.
Now as it happens, physics isn't the only branch of science from which we can draw inspiration for global optimisation algorithms. For example, in biology we have the process of evolution through which the myriad species of life on Earth have become extraordinarily well adapted to their environments. Put very simply this happens because offspring differ slightly from their parents and differences that reduce the chances that they will survive to have offspring of their own are less likely to be passed down through the generations than those that increase those chances.
Noting that extraordinarily well adapted is more or less synonymous with near maximally adapted, it's not unreasonable to suppose that we might exploit a mathematical model of evolution to search for global maxima of functions, being those points at which they return their greatest possible values.

## Annealing Down – a.k.

A few years ago we saw how we could search for a local minimum of a function, being a point for which it returns a lesser value than any in its immediate vicinity, by taking random steps and rejecting those that lead uphill; an algorithm that we dubbed the blindfolded hill climber. Whilst we have since seen that we could progress towards a minimum much more rapidly by choosing the directions in which to step deterministically, there is a way that we can use random steps to yield better results.

## Annealing Down – a.k.

A few years ago we saw how we could search for a local minimum of a function, being a point for which it returns a lesser value than any in its immediate vicinity, by taking random steps and rejecting those that lead uphill; an algorithm that we dubbed the blindfolded hill climber. Whilst we have since seen that we could progress towards a minimum much more rapidly by choosing the directions in which to step deterministically, there is a way that we can use random steps to yield better results.

## Copulating Normally – a.k.

Last year we took a look at multivariate uniformly distributed random variables, which generalise uniform random variables to multiple dimensions with random vectors whose elements are independently uniformly distributed. We have now seen how we can similarly generalise normally distributed random variables with the added property that the normally distributed elements of their vectors may be dependent upon each other; specifically that they may be correlated.
As it turns out, we can generalise this dependence to arbitrary sets of random variables with a fairly simple observation.

## Copulating Normally – a.k.

Last year we took a look at multivariate uniformly distributed random variables, which generalise uniform random variables to multiple dimensions with random vectors whose elements are independently uniformly distributed. We have now seen how we can similarly generalise normally distributed random variables with the added property that the normally distributed elements of their vectors may be dependent upon each other; specifically that they may be correlated.
As it turns out, we can generalise this dependence to arbitrary sets of random variables with a fairly simple observation.

## The Cumulative Distribution Unction – a.k.

We have previously seen how we can generalise normally distributed random variables to multiple dimensions by defining vectors with elements that are linear functions of independent standard normally distributed random variables, having means of zero and standard deviations of one, with

Z' = L × Z + μ

where L is a constant matrix, Z is a vector whose elements are the independent standard normally distributed random variables and μ is a constant vector.
So far we have derived and implemented the probability density function and the characteristic function of the multivariate normal distribution that governs such random vectors but have yet to do the same for its cumulative distribution function since it's a rather more difficult task and thus requires a dedicated treatment, which we shall have in this post.

## The Cumulative Distribution Unction – a.k.

We have previously seen how we can generalise normally distributed random variables to multiple dimensions by defining vectors with elements that are linear functions of independent standard normally distributed random variables, having means of zero and standard deviations of one, with

Z' = L × Z + μ

where L is a constant matrix, Z is a vector whose elements are the independent standard normally distributed random variables and μ is a constant vector.
So far we have derived and implemented the probability density function and the characteristic function of the multivariate normal distribution that governs such random vectors but have yet to do the same for its cumulative distribution function since it's a rather more difficult task and thus requires a dedicated treatment, which we shall have in this post.

## Multiple Multiply Normal Functions – a.k.

Last time we took a look at how we could define multivariate normally distributed random variables with linear functions of multiple independent standard univariate normal random variables.
Specifically, given a Z whose elements are independent standard univariate normal random variables, a constant vector μ and a constant matrix L

Z' = L × Z + μ

has linearly dependent normally distributed elements, a mean vector of μ and a covariance matrix of

Σ' = L × LT

where LT is the transpose of L in which the rows and columns are switched.
We got as far as deducing the characteristic function and the probability density function of the multivariate normal distribution, leaving its cumulative distribution function and its complement aside until we'd implemented both them and the random variable itself, which we shall do in this post.

## Multiple Multiply Normal Functions – a.k.

Last time we took a look at how we could define multivariate normally distributed random variables with linear functions of multiple independent standard univariate normal random variables.
Specifically, given a Z whose elements are independent standard univariate normal random variables, a constant vector μ and a constant matrix L

Z' = L × Z + μ

has linearly dependent normally distributed elements, a mean vector of μ and a covariance matrix of

Σ' = L × LT

where LT is the transpose of L in which the rows and columns are switched.
We got as far as deducing the characteristic function and the probability density function of the multivariate normal distribution, leaving its cumulative distribution function and its complement aside until we'd implemented both them and the random variable itself, which we shall do in this post.

## Every Which Way Is Normal – a.k.

A few months ago we saw how we could generalise the concept of a random variable to multiple dimensions by generating random vectors rather than numbers. Specifically we took a look at the multivariate uniform distribution which governs random vectors whose elements are independently uniformly distributed.
Whilst it demonstrated that we can find multivariate versions of distribution functions such as the probability density function, the cumulative distribution function and the characteristic function, the uniform distribution is fairly trivial and so, for a more interesting example, this time we shall look at generalising the normal distribution to multiple dimensions.

## Every Which Way Is Normal – a.k.

A few months ago we saw how we could generalise the concept of a random variable to multiple dimensions by generating random vectors rather than numbers. Specifically we took a look at the multivariate uniform distribution which governs random vectors whose elements are independently uniformly distributed.
Whilst it demonstrated that we can find multivariate versions of distribution functions such as the probability density function, the cumulative distribution function and the characteristic function, the uniform distribution is fairly trivial and so, for a more interesting example, this time we shall look at generalising the normal distribution to multiple dimensions.

## Integration, Quasi Mode – a.k.

Last time we saw how it was possible to use uniformly distributed random variables to approximate the integrals of univariate and multivariate functions, being those that take numbers and vectors as arguments respectively. Specifically, since the integral of a univariate function is equal to the net area under its graph within the interval of integration it must equal its average height multiplied by the length of that interval and, by definition, the expected value of that function for a uniformly distributed random variable on that interval is the average height and can be approximated by the average of a large number of samples of it. This is trivially generalised to multivariate functions with multidimensional volumes instead of areas and lengths.
We have also seen how quasi random sequences fill areas more evenly than pseudo random sequences and so you might be asking yourself whether we could do better by using the former rather than the latter to approximate integrals.

Clever you!

## Integration, Quasi Mode – a.k.

Last time we saw how it was possible to use uniformly distributed random variables to approximate the integrals of univariate and multivariate functions, being those that take numbers and vectors as arguments respectively. Specifically, since the integral of a univariate function is equal to the net area under its graph within the interval of integration it must equal its average height multiplied by the length of that interval and, by definition, the expected value of that function for a uniformly distributed random variable on that interval is the average height and can be approximated by the average of a large number of samples of it. This is trivially generalised to multivariate functions with multidimensional volumes instead of areas and lengths.
We have also seen how quasi random sequences fill areas more evenly than pseudo random sequences and so you might be asking yourself whether we could do better by using the former rather than the latter to approximate integrals.

Clever you!