In statistics, you frequently need to find the probability of a particular outcome given some distribution. The problem is that this functionality is surprisingly hard to come by in c/c++.
Of course, it is possible to rely on Boost, which contains everything you might need, but also represents a very heavy dependency. Boost comes with probability mass/density functions for all the common distributions. C++11 also comes standard with a host of distributions, but they have no functions for evaluating the probability of an outcome, only functions to generate random outcomes from the distributions. This makes sense because they are located in the random
header, but also doesn’t make sense, since there are many more things you might want to do with a distribution.
So, without relying on Boost, it would be pretty straight forward to just take the formula for the probability mass function (pmf) and implement in, right? Well, let’s take a closer look at the pmf of the Poisson distribution:
We immediately notice that the factorial () is potentially tricky to calculate precisely, but that doesn’t really matter, as we will compress the result down to a single double
to represent the probability anyway. We can just use the built-in gamma function (tgamma
from the cmath
header) to approximate the result since This does, however, not quite solve our problem, because the factorial function has the inherent problem of resulting in very large numbers, which means that a double
overflows when calculating the factorial of values around 170. This is an inherent limitation of the factorial/gamma function and double
s, and there is nothing we can do about it.
Instead, we have to return to the pmf we are trying to solve. We notice that we are not actually concerned with the result of the factorial function as such, but rather the ratio between the exponentially growing and the factorial function. When solving problems involving very large numbers, it is frequently useful to convert them into logarithmic domain (if you can stand the loss in precision, which we can, since we will compress the result into a single probability represented as a double
). The logarithm of the factorial function becomes a sum, which we will have no problems representing with a double, and which we can easily calculate with a call to the purpose-built function lgamma
, which also comes standard in the cmath
header.
By using the gamma function instead of the factorial, and solving the pmf in logarithmic domain, we get the following function:
Which avoids the overflow problems, because only the logarithms of the large numbers are being manipulated. It is also straight forward to implement. It even keeps within the confines of c, no c++ stuff needed! Here is a bare-bone version of the code:
#include <cmath>
double poisson_pmf(const double k, const double lambda) {
return exp(k * log(lambda) - lgamma(k + 1.0) - lambda);
}
The only difference between the c++ version above, and the c version of the code is that the c version includes math.h
instead of cmath
.
As an added benefit of using the gamma function rather than the factorial function, we have also generalized the pmf to reals, rather than merely be valid for integers. A similar approach is also useful for other probability mass/density functions.
comments powered by Disqus