# Sieve of Eratosthenes - Finding Primes Python

Just to clarify, this is not a homework problem :)

I wanted to find primes for a math application I am building & came across Sieve of Eratosthenes approach.

I have written an implementation of it in Python. But it's terribly slow. For say, if I want to find all primes less than 2 million. It takes > 20 mins. (I stopped it at this point). How can I speed this up?

``````
def primes_sieve(limit):
    limitn = limit + 1
    primes = range(2, limitn)

    for i in primes:
        factors = range(i, limitn, i)
        for f in factors[1:]:
            if f in primes:
                primes.remove(f)
    return primes

print primes_sieve(2000)
``````

UPDATE: I ended up doing profiling on this code & found that quite a lot of time was spent on removing an element from the list. Quite understandable considering it has to traverse the entire list (worst-case) to find the element & then remove it and then readjust the list (maybe some copy goes on?). Anyway, I chucked out list for dictionary. My new implementation -

``````
def primes_sieve1(limit):
    limitn = limit + 1
    primes = dict()
    for i in range(2, limitn): primes[i] = True

    for i in primes:
        factors = range(i, limitn, i)
        for f in factors[1:]:
            primes[f] = False
    return [i for i in primes if primes[i] == True]

print primes_sieve1(2000000)
``````


You're not quite implementing the correct algorithm:

In your first example, `primes_sieve` doesn't maintain a list of primality flags to strike/unset (as in the algorithm), but instead resizes a list of integers continuously, which is very expensive: removing an item from a list requires shifting all subsequent items down by one.

In the second example, `primes_sieve1` maintains a dictionary of primality flags, which is a step in the right direction, but it iterates over the dictionary in undefined order, and redundantly strikes out factors of factors (instead of only factors of primes, as in the algorithm). You could fix this by sorting the keys, and skipping non-primes (which already makes it an order of magnitude faster), but it's still much more efficient to just use a list directly.
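For illustration, a minimal sketch of the dictionary version with those two fixes applied (the name `primes_sieve1_fixed` is mine): iterate the keys in sorted order, and skip entries already marked composite.

```python
def primes_sieve1_fixed(limit):
    # Primality flags keyed by number; True until struck out
    primes = {i: True for i in range(2, limit + 1)}

    # Visit candidates in increasing order, striking only multiples of primes
    for i in sorted(primes):
        if not primes[i]:
            continue  # composite: its multiples were already struck
        for f in range(i * i, limit + 1, i):
            primes[f] = False

    return [i for i in sorted(primes) if primes[i]]
```

This keeps the dictionary for the sake of comparison; as noted above, a plain list of flags is still faster.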

The correct algorithm (with a list instead of a dictionary) looks something like:

``````
def primes_sieve2(limit):
    a = [True] * limit                          # Initialize the primality list
    a[0] = a[1] = False                         # 0 and 1 are not primes

    for (i, isprime) in enumerate(a):
        if isprime:
            yield i
            for n in range(i*i, limit, i):      # Mark factors non-prime
                a[n] = False
``````

(Note that this also includes the algorithmic optimization of starting the non-prime marking at the prime's square (`i*i`) instead of its double.)
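To see why starting at the square is safe: any multiple `k*i` with `k < i` has a prime factor smaller than `i`, so it was already struck while that smaller prime was being sieved. A quick illustrative check for `i = 7`:

```python
# Every multiple of 7 below 7*7 = 49 is divisible by a smaller prime (2, 3 or 5),
# so it was already crossed out before the sieve reaches 7.
for m in range(14, 49, 7):
    assert any(m % p == 0 for p in (2, 3, 5)), m
```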

Tuesday, June 1, 2021

``````
void getPrimes(int num, int* count, int** array){
    (*count) = (num - 1);
    int sieve[num-1], primenums = 0, index, fillnum, multiple;
``````

You're declaring an array of `num-1` elements, for the numbers from 2 through `num`. That's okay.

``````
    //Fills the array with the numbers up to the user's ending number, num.
    for(index = 0, fillnum = 2; fillnum <= num; index++, fillnum++){
        sieve[index] = fillnum;
    }
``````

You're filling each slot with the number it is associated with; also okay.

``````
    /* Starts crossing out non prime numbers starting with 2 because 1 is not a prime.
       It then deletes all of those multiples and
       moves on to the next number that isn't crossed out, which is a prime. */
``````

You're stopping roughly at the square root; that's good.

``````
        {
            if (sieve[primenums] != 0){ //Checks if that number is NULL which means it's crossed out

                //If it is not crossed out it starts deleting its multiples.
                {  //Crossing multiples out and decrements count to move to next number
``````

There's a problem here. You only stop the loop when `multiple >= num`, but you're writing to the index `multiple + primenums`, and that is often past the end of the array. For example, with `num == 19` and `primenums == 1` (which crosses out the multiples of 3), the last write is to index `18 + 1`, but the last valid index is `num - 2 = 17`.

The indexing bias from point 1 has been fixed here. But

``````                            --(*count);
``````

you're unconditionally decrementing `*count` here; in the earlier code you only decremented it when `sieve[multiple]` was not already crossed out. That is the correct way. I suggest

``````
for(multiple = primenums + sieve[primenums]; multiple < num - 1; multiple += sieve[primenums]) {
    if (sieve[multiple]) {
        sieve[multiple] = 0;
        --(*count);
    }
}
``````

to have it a bit simpler.

``````
                }
            }
        }
        int k;
        for(k = 0; k < num; k++)
            printf("%d \n", sieve[k]);
``````

You're printing the contents of `sieve` regardless of whether it is `0` or not, and you also print out `sieve[num - 1]`, which does not exist. Make it

``````
for(k = 0; k < num-1; ++k) {
    if (sieve[k]) {
        printf("%d\n", sieve[k]);
    }
}
``````
``````

to print only the primes.

``````
        printf("%d \n", *count);
        (*array) = malloc(sizeof(int) * (num + 1));
        assert(*array);
        (*array) = sieve;
}
``````

Replace the `(*array) = sieve` with

``````
int i = 0;
for(k = 0; k < num-1; ++k) {
    if (sieve[k]) {
        (*array)[i] = sieve[k];
        ++i;
    }
}
``````

to write only the primes to `*array`. Also, you need not allocate `(num + 1)*sizeof(int)`, but only `*count * sizeof(int)` for that.

Friday, July 30, 2021


Increment by 2, not 1. That's the minimal optimization you should always use - working with odds only. No need to bother with the evens.

In C++, use `vector<bool>` for the sieve array. It gets automatically bit-packed.

Pre-calculate your core primes, then continue with a segmented sieve: work by segments big enough to fit in your cache, without adding new primes to the core list. For each core prime `p`, maintain an additional `long long int` value: its current multiple (starting from the prime's square, of course). The step is `2*p` in value, or `p` as an offset into the odds-packed sieve array, where the `i`-th entry stands for the number `o + 2i`, `o` being the least odd not below the segment start. There is no need to sort the primes by their multiples' values; the upper bound of the core primes in use rises monotonically.

sqrt(0xFFFFFFFFFF) ≈ 1048576, and PrimePi(1048576) = 82025 primes is all you need in your core primes list. That's peanuts.

Integer arithmetic on `long long int`s should work just fine to find the modulo, and so the smallest multiple in range, when you first start (or resume your work).
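The core idea can be sketched in Python (an illustrative sketch under my own names, without the odds-packing or the per-prime multiple bookkeeping described above): precompute the core primes up to `sqrt(limit)` once, then sieve fixed-size segments with them.

```python
import math

def segmented_primes(limit, segment_size=32768):
    """Sieve [2, limit) one segment at a time, reusing core primes
    up to sqrt(limit). Illustrative only: no odds-packing here."""
    if limit <= 2:
        return []
    root = math.isqrt(limit - 1)

    # Core primes via a plain small sieve
    core_flags = [True] * (root + 1)
    core = []
    for p in range(2, root + 1):
        if core_flags[p]:
            core.append(p)
            for m in range(p * p, root + 1, p):
                core_flags[m] = False

    primes = []
    lo = 2
    while lo < limit:
        hi = min(lo + segment_size, limit)
        seg = [True] * (hi - lo)
        for p in core:
            if p * p >= hi:
                break
            # First multiple of p inside [lo, hi), never below p*p
            start = max(p * p, ((lo + p - 1) // p) * p)
            for m in range(start, hi, p):
                seg[m - lo] = False
        primes.extend(v for v in range(lo, hi) if seg[v - lo])
        lo = hi
    return primes
```

With a realistic, cache-sized segment the working set stays hot, which is the whole point of segmenting.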

Sunday, August 15, 2021


You are given

1 <= m <= n <= 1000000000, n-m<=100000

these are very small numbers. To sieve a range with an upper bound of `n`, you need the primes to `√n`. Here you know `n <= 10^9`, so `√n < 31623`, so you need at worst the primes to 31621. There are 3401. You can generate them with a standard sieve in a few microseconds.

Then you can simply sieve the small range from `m` to `n` by marking the multiples of the primes you've sieved before, stopping when the prime exceeds `√n`. Some speedup can be gained by eliminating the multiples of some small primes from the sieve, but the logic becomes more complicated (you need to treat sieves with small `m` specially).

``````
public int[] chunk(int m, int n) {
    if (n < 2) return null;
    if (m < 2) m = 2;
    if (n < m) throw new IllegalArgumentException("Borked");
    int root = (int)Math.sqrt((double)n);
    boolean[] sieve = new boolean[n-m+1];
    // primes is the global array of primes to 31621 populated earlier
    // primeCount is the number of primes stored in primes, i.e. 3401
    // We ignore even numbers, but keep them in the sieve to avoid index arithmetic.
    // It would be very simple to omit them, though.
    for(int i = 1, p; i < primeCount; ++i) {
        if ((p = primes[i]) > root) break;
        int mult;
        if (p*p < m) {
            mult = (m-1)/p+1;
            if (mult % 2 == 0) ++mult;
            mult = p*mult;
        } else {
            mult = p*p;
        }
        for(; mult <= n; mult += 2*p) {
            sieve[mult-m] = true;
        }
    }
    int count = m == 2 ? 1 : 0;
    for(int i = 1 - m%2; i <= n-m; i += 2) {
        if (!sieve[i]) ++count;
    }
    int sievedPrimes[] = new int[count];
    int pi = 0;
    if (m == 2) {
        sievedPrimes[0] = 2;
        pi = 1;
    }
    for(int i = 1 - m%2; i <= n-m; i += 2) {
        if (!sieve[i]) {
            sievedPrimes[pi++] = m+i;
        }
    }
    return sievedPrimes;
}
``````

Using a `BitSet` or any other type of packed flag-array would reduce the memory usage and thus may give a significant speed-up due to better cache-locality.

Tuesday, August 17, 2021


Since there weren't any takers for the 'Sundaram vs. Eratosthenes' problem, I sat down and analysed it. Result: classic Sundaram's has strictly higher overdraw than an odds-only Eratosthenes; if you apply an obvious, small optimisation then the overdraw is exactly the same - for reasons that shall become obvious. If you fix Sundaram's to avoid overdraw entirely then you get something like Pritchard's Sieve, which is massively more complicated.

The exposition of Sundaram's Sieve in Lucky's Notes is probably the best by far; slightly rewritten to use a hypothetical (i.e. non-standard and not supplied here) type `bitmap_t`, it looks somewhat like this. In order to measure overdraw, the bitmap type needs an operation corresponding to the `BTS` (bit test and set) CPU instruction, which is available via the `_bittestandset()` intrinsic with Wintel compilers and with MinGW editions of gcc. The intrinsics are very bad for performance but very convenient for counting overdraw.

Note: for sieving all primes up to N one would call the sieve with max_bit = N/2; if bit i of the resulting bitmap is set then the number (2 * i + 1) is composite. The function has '31' in its name because the index math breaks for bitmaps greater than 2^31; hence this code can only sieve numbers up to 2^32-1 (corresponding to max_bit <= 2^31-1).

``````uint64_t Sundaram31_a (bitmap_t &bm, uint32_t max_bit)
{
assert( max_bit <= UINT32_MAX / 2 );

uint32_t m = max_bit;
uint64_t overdraw = 0;

bm.set_all(0);

for (uint32_t i = 1; i < m / 2; ++i)
{
for (uint32_t j = i; j <= (m - i) / (2 * i + 1); ++j)
{
uint32_t k = i + j + 2 * i * j;

overdraw += bm.bts(k);
}
}

return overdraw;
}
``````

Lucky's bound on `j` is exact but the one on `i` is very loose. Tightening it and losing the `m` alias that I had added to make the code look more like common expositions on the net, we get:

``````uint64_t Sundaram31_b (bitmap_t &bm, uint32_t max_bit)
{
uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
uint64_t overdraw = 0;

bm.set_all(0);

for (uint32_t i = 1; i <= i_max; ++i)
{
for (uint32_t j = i; j <= (max_bit - i) / (2 * i + 1); ++j)
{
uint32_t k = i + j + 2 * i * j;

overdraw += bm.bts(k);
}
}

return overdraw;
}
``````

The `assert` was dumped in order to reduce noise but it is actually still valid and necessary. Now it's time for a bit of strength reduction, turning multiplication into iterated addition:

``````uint64_t Sundaram31_c (bitmap_t &bm, uint32_t max_bit)
{
uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
uint64_t overdraw = 0;

bm.set_all(0);

for (uint32_t i = 1; i <= i_max; ++i)
{
uint32_t n = 2 * i + 1;
uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max
uint32_t j_max = (max_bit - i) / n;

for (uint32_t j = i; j <= j_max; ++j, k += n)
{
overdraw += bm.bts(k);
}
}

return overdraw;
}
``````

Transforming the loop condition to use `k` allows us to lose `j`; things should be looking exceedingly familiar by now...

``````uint64_t Sundaram31_d (bitmap_t &bm, uint32_t max_bit)
{
uint32_t i_max = uint32_t(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
uint64_t overdraw = 0;

bm.set_all(0);

for (uint32_t i = 1; i <= i_max; ++i)
{
uint32_t n = 2 * i + 1;
uint32_t k = n * i + i;   // <= max_bit because that's how we computed i_max

for ( ; k <= max_bit; k += n)
{
overdraw += bm.bts(k);
}
}

return overdraw;
}
``````

With things looking the way they do, it's time to analyse whether a certain obvious small change is warranted by the math. The proof is left as an exercise for the reader...

``````uint64_t Sundaram31_e (bitmap_t &bm, uint32_t max_bit)
{
uint32_t i_max = unsigned(std::sqrt(double(2 * max_bit + 1)) - 1) / 2;
uint64_t overdraw = 0;

bm.set_all(0);

for (uint32_t i = 1; i <= i_max; ++i)
{
if (bm.bt(i))  continue;

uint32_t n = 2 * i + 1;
uint32_t k = n * i + i;   // <= m because we computed i_max to get this bound

for ( ; k <= max_bit; k += n)
{
overdraw += bm.bts(k);
}
}

return overdraw;
}
``````

The only thing that still differs from classic odds-only Eratosthenes (apart from the name) is the initial value for `k`, which is normally `(n * n) / 2` for the old Greek. However, substituting `2 * i + 1` for `n` the difference turns out to be 1/2, which rounds to 0. Hence, Sundaram's is odds-only Eratosthenes without the 'optimisation' of skipping composites to avoid at least some crossings-out of already crossed-out numbers. The value for `i_max` is the same as the Greek's `max_factor_bit`, just arrived at using completely different logical steps and computed using a marginally different formula.
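As a quick cross-check of this equivalence (an illustrative sketch; the function names are mine): classic Sundaram and a plain Eratosthenes sieve agree on the primes they produce, even though Sundaram performs many more crossings-out.

```python
def sundaram(max_bit):
    # Bit i set means 2*i + 1 is composite (for i >= 1)
    bm = [False] * (max_bit + 1)
    for i in range(1, max_bit // 2 + 1):
        j = i
        while i + j + 2 * i * j <= max_bit:
            bm[i + j + 2 * i * j] = True
            j += 1
    return [2] + [2 * i + 1 for i in range(1, max_bit + 1) if not bm[i]]

def eratosthenes(n):
    # Plain reference sieve of the numbers up to n inclusive
    flags = [True] * (n + 1)
    flags[0] = flags[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if flags[p]:
            for m in range(p * p, n + 1, p):
                flags[m] = False
    return [v for v in range(2, n + 1) if flags[v]]

# Sieving max_bit = 500 covers the numbers up to 2*500 + 1 = 1001
assert sundaram(500) == eratosthenes(1001)
```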

P.S.: after seeing `overdraw` in the code so many times, people will probably want to know what it actually is... Sieving the numbers up to 2^32-1 (i.e. a full 2^31 bit bitmap) Sundaram's has an overdraw of 8,643,678,027 (roughly 2 * 2^32) or 444.6%; with the small fix that turns it into odds-only Eratosthenes the overdraw becomes 2,610,022,328 or 134.2%.

Thursday, October 21, 2021