# JOST A MON

Feb 26, 2009

## Goldbach's Comet

As I'm slightly at loose end at the moment, I'm flexing my fingers with a bit of R coding. To those unaware of the wondrous possibilities offered by this programming language, I can only say it's good stuff. Its capabilities have been outlined here and there, most recently in the New York Times, and is the programming environment of choice for statisticians and epidemiologists and others.

I'm intrigued by Goldbach's Comet, which - contrary to your astronomical visions - has less to do with the cosmos than with mathematics. In 1742, Christian Goldbach conjectured in a letter to Leonhard Euler that every even number greater than 2 can be expressed as the sum of two primes. In common with lots of theorems in number theory, this is more easily stated than proved. In fact, the closest anyone has come to proving this assertion is Olivier Ramaré, who in 1995 showed that every even number is the sum of at most 6 primes.

We can introduce the Goldbach function (or partition) at this point. G(n) is defined as the number of ways the number n can be expressed as the sum of two primes. If n is odd, we force G(n) = 0. For even n , e.g., G(10) = 2 because 10 = 3 + 7 = 5 + 5.

If we now plot the even numbers against their Goldbach numbers, we get a graph that looks like a fine spray tapering to a point, as in the figure below. This is Goldbach's Comet.

Now, what's the easiest way to compute G(n)? The brute-force way, of course. A computer doesn't groan and creak at repetitive exercise, so code such as the following will do the job - for small numbers n.
`library(gmp)library(gtools)goldbach <- function(n){  count <- 0  if (even(n)) {    for (i in seq(3, n / 2, by = 2)) {      if ((isprime(i) > 0) & (isprime(n - i) > 0)) {        count <- count + 1      }    }  }  count}`

Note that I'm not really doing any bounds checking here: the function will collapse for n < 6.

Now, the thing about R is that it's a vectorised language, and works rather well on arrays. It even provides a routine called Vectorize() to convert a function such as goldbach() that takes a single argument into one that takes an array. So if you wanted to compute goldbach() for 6, 8, ..., 3000, you don't have to write a loop to do it (although you could); instead, you'd just do the following
`n <- 3000x <- seq(6, n, by = 2)y <- Vectorize(goldbach)(x)`
And then you can plot the comet
`plot(x = x, y = y, type = "p", col = "red", lwd = 2)`
The limitations of the brute-force approach are quite evident even with this toy example. It takes my machine about 10.5 seconds to compute G(n) for even numbers up to 3000. For every argument n, it generates all the odd numbers x less than n / 2, and checks if each of x and n - x is prime. The total number of possible pairs tested for all even numbers up to n is, therefore, about the square of n. The primality test is another computational hog (isprime() does trial divisions up to a certain size of n, and probabilistic tests beyond that). So what quick improvements can we make?

For one thing, we can pre-generate the list of primes until n / 2, and then test if for every prime p in the list, (n - p) is also in the list. We immediately get rid of the multiple calls to isprime() in the original code. Here's the modified routine
`goldbach2 <- function(n){  count <- 0  if (even(n)) {    x <- 1 : n    x <- x[isprime(x) > 0]    p <- x[x <= n / 2]    for (i in p) {      if ((n - i) %in% x) {        count <- count + 1      }      i <- x[i + 1]    }  }  ifelse(n == 6, 1, count)}`
This does slightly better: 9.2 seconds, an improvement of about 12%. The operation that takes the longest is %in%, which searches the array of primes to see if every (n - i) is in it. It's not efficient to call it separately for every candidate in (n - p). It's optimised for vector operations, and here's where R's superb vector handling comes to the fore. We can replace the entire for-loop with just a couple of lines of code, to get
`goldbach3 <- function(n){  count <- 0  if (even(n)) {    x <- 1 : n    x <- x[isprime(x) > 0] # Generate all primes up to n    np <- x[x >= n / 2]    p <- x[x <= n / 2]    count <- sum((n - p) %in% np)  }  ifelse(n == 6, 1, count)}`
This does the same job in 3.3 seconds. Better, eh? Note here that (n - p) creates a vector of differences between the argument n and the list p of primes smaller than n / 2. The %in% operator is vectorised, as I said above, and it zips through the entire vector (n - p) to test for membership in the vector np. Wherever there's a match, R tags a TRUE (a boolean with numeric value 1) and everywhere else a FALSE (numerically 0). Summing the booleans gives us the count of primes that add up to n.

But if R is such a deliciously vectorisable language, why call the function goldbach3() for every even number? We could, instead, generate the entire sequence of Goldbach partitions within the function call itself, and do without using any loops at all, not even the artificial Vectorize() routine that we used on each of the previous tests. We will create the list of primes smaller than n as before. We then create all possible sums of pairs of these primes. The R function outer() enables us to do this rapidly. Of course, outer() creates a matrix, but we are only interested in the upper triangle of it (it is a symmetric matrix, obviously), so we extract that bit by means of the upperTriangle() routine. We can then bash it back into an array and remove all entries that exceed n. Finally, the R function hist() can be used to compute the number of times each entry in this array repeats itself, and that becomes the required series of Goldbach partitions
`goldbach4 <- function(n){    xx <- 1 : n    xx <- xx[isprime(xx) > 0]    xx <- xx[-1]    z <- as.numeric(upperTriangle(outer(xx, xx, "+"),                     diag = TRUE))    z <- z[z <= n]    hist(z, plot = FALSE,          breaks = seq(4, max(z), by = 2))\$counts}`
This takes 0.01 seconds to run! It is an incredible speedup from an admittedly shoddy original code. Of course, I note that I'm now trading space (i.e. memory) for time: I'm creating a large matrix internally within the function. As n increases, I'll sooner or later hit the largest size that R can support. I tried running this bit of code for n = 100000, and it collapsed with the message that R cannot allocate space for a vector of size 390MB. But for n = 70000, the code runs in 7.4 seconds.

I'm sure experienced R users will be able to improve this code still further. Mathematicians active in the area have even faster algorithms for the computation of Goldbach's function. For further details check out the papers below.

Richstein, J. Verifying the Goldbach Conjecture upto 4 . 10 14, Mathematics of Computation, Vol 70, Number 236, pp 1745-1749.

Liang W., et al, Fractals in the statistics of Goldbach partition, ArXiv, March 2006.

Romain Francois said...

Hi,

Well done, but this is not over, I posted a follow up on my blog

Romain

Romain Francois said...

... and a bit further using C code:

> system.time( out <- goldbach6(100000) )
user system elapsed
0.204 0.005 0.228
> system.time( out <- goldbach5(100000) )
user system elapsed
4.839 2.630 7.981
> system.time( out <- goldbach4(100000) )
user system elapsed
28.425 5.932 38.380