Category Archives: Python

Project Euler Problems 7-8

View and download this notebook from nbviewer
In [1]:
from IPython.display import display
from IPython.display import HTML

Problem 7

By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.
What is the 10 001st prime number?


Another prime question, this time asking for the 10,001st prime. We wrote a primality test for problem 3, so it's time to reuse it.

Method 1: brute force

In [84]:
def isPrime(x):
    if (x==1):
        return False
    for i in range(2,x):
        if x%i==0:
            return False
    return True

def getPrimes(maxValue):
    primes = []
    for i in range(1,maxValue):
        if isPrime(i):
            primes.append(i)
    return primes

primes = getPrimes(10000)
In [86]:
%%timeit
getPrimes(10000)
1 loops, best of 3: 1.25 s per loop

In [85]:
len(primes)
Out[85]:
1229

The brute force solution takes more than a second to calculate the primes up to 10000. And how many primes did that yield? Only 1229! This doesn't look like a reasonable way to find the 10,001st prime. Luckily, there is a very simple and clever algorithm that can do this job much faster.

Method 2: Sieve of Eratosthenes

The basic notion of the sieve of Eratosthenes is to pre-allocate a list of numbers up to n, and then, taking a prime (starting with 2), cross out every multiple of that prime, as those multiples clearly can't be primes. The next prime is then the next unmarked value in the list. The process repeats until there are no more primes to be found.

In [123]:
def showState(l, p, nx):
    numbers = ''
    for n in l:
        style=''
        if n<0:
            style+='text-decoration: line-through; background-color: rgb(171, 231, 255);'
        if n==p:
            style+='background-color: rgb(230,255,95);'
        if n==nx:
            style+='background-color: rgb(150, 233, 150);'
        if n==0:
            style+='background-color: rgb(220,220,220); color: rgb(220,220,220);'    
        #the HTML markup was stripped when this page was extracted;
        #a styled <span> per number is assumed here
        numbers+='<span style="{0}">{1}</span> '.format(style, abs(n))
    #the wrapping element is likewise an assumption
    s = """<div>{0}</div>""".format(numbers)
    h = HTML(s)
    display(h)

def sieve(size, showStates=True):
    l = list(range(2,size+1)) #generate the candidate set
    idx = lambda x: x-2 #just a simple mapping from number in list to list index
    p = 2 #seed with initial prime
    for iteration in range(len(l)):
        #mark every multiple of p
        for i in range(p*2, size+1, p):
            l[idx(i)] = -i
        #find the next unmarked value, that's the next p
        nextPrime = 0
        for i in l[idx(p+1):]:
            if i>0:
                nextPrime = i
                break
        if (showStates):
            showState(l, p, nextPrime)
            for i in range(p*2, size+1, p):
                l[idx(i)] = 0
        p = nextPrime
        #if we haven't found any unmarked values, we're done
        if p == 0:
            break
    #return all unmarked values
    return filter(lambda x: x>0, l)

sieve(38, True)
234567891011121314151617181920212223242526272829303132333435363738
23056709011121301501718190210232425027029303103303536370
230507001011013015017019200023025000293031000350370
2305070001101314001701902102300002829031000350370
23050700011013000170190022230000029031033000370
2305070001101300017019000230026002903100000370
2305070001101300017019000230000029031003400370
2305070001101300017019000230000029031000003738
230507000110130001701900023000002903100000370
230507000110130001701900023000002903100000370
230507000110130001701900023000002903100000370
230507000110130001701900023000002903100000370
Out[123]:
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

Above is the state of the preallocated list at each iteration of sifting primes up to 38.

Starting with a fully unmarked list, and the first prime, 2 (shown in yellow), every multiple of 2 is marked off in the list (shown in blue). The next prime (green) is then found by moving up the list until the first unmarked number.

The next iteration starts at the newly found prime, 3, and proceeds to mark off every multiple of 3 in the list, and so forth.

Finally, the last iteration attempts to find unmarked values to the right of 37 and finds none. At that point the algorithm can terminate and return the remaining unmarked values in the list.

In [58]:
%%timeit
v = sieve(10000, False)
10 loops, best of 3: 55.2 ms per loop

In [72]:
len(sieve(10000, False))
Out[72]:
1229

At less than 60ms to find all primes less than 10000, this algorithm is more than 20 times faster than brute force.

It can be further optimized by recognizing that if one factor of a number is greater than its square root, then the other factor must be less than its square root. Every composite number up to n therefore has a prime factor no greater than \(\sqrt{n}\), so multiples of primes greater than \(\sqrt{n}\) need not be considered[1]. The sieve function can be trivially modified to use this knowledge by limiting the marking phase to roughly \(\sqrt{n}\) passes.

[1] http://britton.disted.camosun.bc.ca/jberatosthenes.htm

In [73]:
#comments removed for brevity
def sieve(size, showStates=True):
    l = list(range(2,size+1)) 
    idx = lambda x: x-2 
    p = 2 
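    #cap the number of passes at about sqrt(n); the count of primes below
    #sqrt(n) never exceeds this, so every composite still gets marked
    for iteration in range(int(0.5+len(l)**0.5)):
        #mark every multiple of p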
        for i in range(p*2, size+1, p):
            l[idx(i)] = -i
        nextPrime = 0
        for i in l[idx(p+1):]:
            if i>0:
                nextPrime = i
                break
        if (showStates):
            showState(l, p, nextPrime)
            for i in range(p*2, size+1, p):
                l[idx(i)] = 0
        p = nextPrime
        if p == 0:
            break
    return filter(lambda x: x>0, l)
In [74]:
%%timeit
v = sieve(10000, False)
100 loops, best of 3: 13.2 ms per loop
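
For comparison, here is a minimal sketch of a sieve that stops marking as soon as p*p exceeds the limit, rather than capping the number of passes. It is written for Python 3, and the name sieve_sqrt is hypothetical; it is not the notebook's function, just one way the square root observation could be applied.

```python
def sieve_sqrt(limit):
    """Return all primes <= limit (sketch: marking stops once p*p > limit)."""
    if limit < 2:
        return []
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # smaller multiples of p were already crossed out by smaller primes,
            # so marking can start at p*p
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    return [n for n, flag in enumerate(is_prime) if flag]
```

Calling sieve_sqrt(38) should give the same twelve primes as the walkthrough above.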

So the Eratosthenes sieve is very fast at finding primes up to some limit m. At m=10000, we find n=1229 primes. What range do we have to sieve to actually get our n=10001 primes?

Rosser's theorem[2] provides a useful inequality that establishes bounds on the value of the nth prime number:

\(\ln n + \ln\ln n - 1 < \frac{p_n}{n} < \ln n + \ln \ln n \quad\text{for } n \ge 6\)

[2] http://en.wikipedia.org/wiki/Prime_number_theorem#Approximations_for_the_nth_prime_number

In [114]:
from math import log

def maxPrime(n):
    return int(0.5+(float(n)*log(n)+ n*log(log(n))))
    
limit = maxPrime(10000)
print('The 10000th prime has a value < {0}'.format(limit))
The 10000th prime has a value < 114307

In [115]:
primes = sieve(limit, False)
len(primes)
Out[115]:
10816

The upper bound function appears to have done its job and netted just over 10000 primes. We can now obtain the 10001st prime, which sits at index 10000 of the zero-based list.

In [118]:
primes[10000]
Out[118]:
104743
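
The two steps above (bound the nth prime, then sieve up to that bound) can be combined into one helper. The following is a sketch in Python 3 with hypothetical names (primes_up_to, nth_prime); the fallback limit for n < 6 is an assumption added because Rosser's inequality is only stated for n ≥ 6.

```python
from math import log

def primes_up_to(limit):
    """Plain sieve of Eratosthenes returning all primes <= limit."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]

def nth_prime(n):
    """Return the n-th prime (1-based), sieving up to Rosser's upper bound."""
    if n < 6:
        limit = 11  # assumption: the first five primes all fit below 12
    else:
        limit = int(n * (log(n) + log(log(n)))) + 1
    return primes_up_to(limit)[n - 1]
```

With this sketch, nth_prime(6) returns 13, matching the problem statement.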

Problem 8

The four adjacent digits in the 1000-digit number that have the greatest product are 9 × 9 × 8 × 9 = 5832.

73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450

Find the thirteen adjacent digits in the 1000-digit number that have the greatest product. What is the value of this product?


In [121]:
source = '''
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
'''.replace('\n','')

#break the source string into a series of 13 character long slices at every possible position
window_size = 13
slices = [source[x:x+window_size] for x in range(len(source) - window_size + 1)]

#compute the product of each slice
products = [product(map(int, row), dtype='int64') for row in slices]

max(products)
Out[121]:
23514624000
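
The cell above relies on a numerical library's product function with a 64-bit dtype, presumably to avoid overflow. As an alternative sketch, Python 3.8+ can do the same with math.prod and its arbitrary-precision integers. The function name max_window_product is hypothetical.

```python
from math import prod

def max_window_product(digits, window_size):
    """Largest product of window_size adjacent digits in the string digits."""
    best = 0
    for start in range(len(digits) - window_size + 1):
        window = digits[start:start + window_size]
        # Python integers never overflow, so no dtype is needed
        best = max(best, prod(int(ch) for ch in window))
    return best
```

Passing the 1000-digit source string and a window of 13 should reproduce the result above; the four-digit example from the problem statement works as a quick check, since 9 × 9 × 8 × 9 = 5832.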

Project Euler Problems 5-6

View and download this notebook from nbviewer

Problem 5

2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder.

What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?


This is an interesting problem!

First things first: we can establish that \(1 \times 2 \times 3 \times \ldots \times 20\), or simply \(20!\), is an upper bound on the answer, since it is divisible by every number in the range. We can work our way down by repeatedly dividing this upper bound by any number in the range [1,20] and checking that the result is still evenly divisible by every number in that range.

The number of successful divisions is at most log2(n!), so this approach takes O(log(n!)) division steps, better known as O(n log n).

In [16]:
import math

factors = 20

upper = math.factorial(factors)
divisors = range(2, factors+1)
current = upper

#repeatedly attempt to divide the current number by a candidate divisor,
#trying the largest first, for as long as the quotient is still divisible
#by every number in the range
while True:
    found = False
    for p in reversed(divisors):
        if current % p != 0:
            continue
        c = current // p
        if all(c % d == 0 for d in divisors):
            found = True
            current = c
            break
            
    if not found:
       break
        
    print 'divided by', p, 'got', current
divided by 20 got 121645100408832000
divided by 20 got 6082255020441600
divided by 20 got 304112751022080
divided by 18 got 16895152834560
divided by 18 got 938619601920
divided by 18 got 52145533440
divided by 16 got 3259095840
divided by 14 got 232792560
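
As a cross-check, the smallest number divisible by everything from 1 to n is the least common multiple of that range, which can be built up pairwise with the greatest common divisor. This is a different technique from the division approach above, sketched in Python 3 with hypothetical names (lcm, smallest_multiple).

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a // gcd(a, b) * b

def smallest_multiple(n):
    """Smallest positive number evenly divisible by every integer in 1..n."""
    return reduce(lcm, range(1, n + 1), 1)
```

For n = 10 this gives 2520, as in the problem statement, and for n = 20 it gives 232792560.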

Problem 6

The sum of the squares of the first ten natural numbers is, \(1^2 + 2^2 + \ldots + 10^2 = 385\)

The square of the sum of the first ten natural numbers is, \((1 + 2 + \ldots + 10)^2 = 55^2 = 3025\)

Hence the difference between the sum of the squares of the first ten natural numbers and the square of the sum is 3025 − 385 = 2640.

Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.


Method 1: brute force

Complexity: O(N)

In [10]:
def squareDiff(x):
    s = range(1, x+1)
    sumSquares = sum([x*x for x in s])
    squareSum = math.pow(sum(s),2)
    diff = squareSum - sumSquares
    return diff

squareDiff(100)
Out[10]:
25164150.0

Easy enough, however it's well known that the sum of a series of natural numbers up to n can be calculated as \(\frac{n(n+1)}{2}\)

Is it possible that the sum of a series of natural numbers squared up to n can be calculated in constant time as well? I didn't know the answer and cheated by using a genetic algorithm to attempt to fit an equation to match sumSquares for the first 140 inputs.

Amazingly, it came back with a polynomial that had 0 residual error: \(\frac{1}{6}n + \frac{1}{2}n^2 + \frac{1}{3}n^3\)

Let's plot this polynomial to double check

In [11]:
brute = lambda n: sum([x*x for x in xrange(1,n+1)])
poly = lambda n: round(1./6 * n + 1./2 * pow(n, 2) + 1./3 * pow(n, 3))

x = np.array(range(1,2200))
brute_y = np.array([brute(t) for t in x])
poly_y = np.array([poly(t) for t in x])

plt.plot(x, brute_y, label='Brute Force', color='blue')
plt.plot(x, poly_y, label='Polynomial', color='red')
plt.legend()

print 'max error:', max(brute_y - poly_y)
max error: 1431655765.0

Looks like the polynomial solution suffers from integer overflow at around \(n \approx 1300\), earlier than the brute force solution. This is understandable considering the polynomial solution deals with \(n^3\) while brute force only deals with \(n^2\). We'll switch to floats to overcome overflow issues in both cases.

In [12]:
brute = lambda n: sum([1.*x*x for x in xrange(1,n+1)])
poly = lambda n: round(1./6 * n + 1./2 * pow(n, 2.) + 1./3 * pow(n, 3.))

x = np.array(range(1,2200))
brute_y = np.array([brute(t) for t in x])
poly_y = np.array([poly(t) for t in x])

plt.plot(x, brute_y, label='Brute Force', color='blue')
plt.plot(x, poly_y, label='Polynomial', color='red')
plt.legend()

print 'max error:', max(brute_y - poly_y)
max error: 0.0

A maximum error of 0 across a small input set is promising. However, I got in touch with a friend to check, and he promptly came back with a proof!


All credit to Jonah Schreiber for the below proof:

\[\sum_{x=1}^{n}{x^2}=\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6}\]

Base

Here, we show that the formula is correct for \(n=1\). On the left-hand side, we have \(1\), and on the right-hand side, we have \(\frac{1^3}{3}+\frac{1^2}{2}+\frac{1}{6}=1\), so the base case is true.

Assumption

Assume that

\[\sum_{x=1}^{n}{x^2}=\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6}\]

is true.

Induction

Show then that it is true for \(n+1\), that is,

\[\sum_{x=1}^{n+1}{x^2}=\frac{(n+1)^3}{3}+\frac{(n+1)^2}{2}+\frac{n+1}{6}\]

Let us break out the last term on the left-hand side, and expand the right-hand side:

\[\sum_{x=1}^{n}{x^2}+(n+1)^2=\frac{n^3+3n^2+3n+1}{3}+\frac{n^2+2n+1}{2}+\frac{n+1}{6}\]

We already know the sum on the left-hand side, which we insert.

\[\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6}+n^2+2n+1=\frac{n^3+3n^2+3n+1}{3}+\frac{n^2+2n+1}{2}+\frac{n+1}{6}\]

Collecting terms, we get

\[\frac{n^3}{3}+\frac{3n^2}{2}+\frac{13n}{6}+1=\frac{n^3}{3}+\frac{3n^2}{2}+\frac{13n}{6}+1\]

The two sides are equal, so the formula is correct.

Thanks again to Jonah Schreiber for the above proof.


See http://www.trans4mind.com/personal_development/mathematics/series/sumNaturalSquares.htm for several derivations of this formula.


And so, finally,

Method 2: Arithmetic

Complexity: O(1)

In [13]:
def squareDiff2(x):
    sumSquares = round(1./6 * x + 1./2 * pow(x, 2.) + 1./3 * pow(x, 3.))
    squareSum = pow(x*(x+1)/2., 2)
    return squareSum - sumSquares

squareDiff2(100)
Out[13]:
25164150.0
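
The polynomial can also be written as \(\frac{n(n+1)(2n+1)}{6}\), which expands to the same three terms. A sketch in Python 3 that uses this factored form with integer arithmetic only, so neither rounding nor float precision comes into play (the name square_diff_exact is hypothetical):

```python
def square_diff_exact(n):
    """Square of the sum minus sum of the squares for 1..n, in exact integers."""
    # n(n+1)(2n+1) is always divisible by 6, so floor division is exact
    sum_squares = n * (n + 1) * (2 * n + 1) // 6
    # n(n+1) is always even, so this division is exact as well
    square_of_sum = (n * (n + 1) // 2) ** 2
    return square_of_sum - sum_squares
```

For n = 10 this returns 2640, matching the worked example in the problem statement.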

Project Euler Problems 1-4

View and download this notebook from nbviewer

Problem 1

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.


In [1]:
#trivial with python's comprehensions
sum(x for x in xrange(1,1000) if x%5==0 or x%3==0)
Out[1]:
233168
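
The same sum can be had without a loop, using the arithmetic series formula and inclusion-exclusion: add the multiples of 3 and of 5, then subtract the multiples of 15 that were counted twice. A sketch in Python 3 with hypothetical function names:

```python
def sum_of_multiples_below(limit, k):
    """Sum of all positive multiples of k that are strictly below limit."""
    count = (limit - 1) // k
    # k + 2k + ... + count*k = k * count * (count + 1) / 2
    return k * count * (count + 1) // 2

def sum_3_or_5_below(limit):
    """Sum of all natural numbers below limit divisible by 3 or 5."""
    return (sum_of_multiples_below(limit, 3)
            + sum_of_multiples_below(limit, 5)
            - sum_of_multiples_below(limit, 15))
```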

Problem 2

Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...

By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.


Method 1: memoization

In [2]:
#set base case
fibs = {0:0, 1:1}

def fib(n):
    ret = fibs.get(n, None)
    if ret != None:
        return ret
    ret = fib(n-2) + fib(n-1)
    fibs[n] = ret
    return ret

def getEvenSum(upperBound = 4000000):
    evenValued = []
    for i in range(upperBound):
        v = fib(i)
        if v > upperBound:
            break
        if v%2 == 0:
            evenValued.append(v)

    return sum(evenValued)

getEvenSum()
Out[2]:
4613732
In [3]:
%%timeit
getEvenSum()
10 loops, best of 3: 65.6 ms per loop

This isn't terrible, but we can do a lot better by iterating from the bottom up.

Method 2: iterative

In [4]:
def getEvenSum(upperBound = 4000000):
    previous = 0
    total = 1
    even = 0

    #check each term against the bound before adding it,
    #so a term above the bound can never be counted
    while total <= upperBound:
        if total %2 == 0:
            even += total
        previous, total = total, total + previous

    return even

getEvenSum()
Out[4]:
4613732
In [5]:
%%timeit
getEvenSum()
100000 loops, best of 3: 3.88 µs per loop
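
Every third Fibonacci number is even, and the even terms satisfy their own recurrence, E(k) = 4·E(k−1) + E(k−2), for example 34 = 4·8 + 2. A sketch in Python 3 that walks only the even terms (the name even_fib_sum is hypothetical, and this is a variation on the iterative method, not the notebook's code):

```python
def even_fib_sum(upper_bound):
    """Sum of the even Fibonacci terms that do not exceed upper_bound."""
    total = 0
    a, b = 2, 8  # the first two even terms
    while a <= upper_bound:
        total += a
        # next even term from the previous two even terms
        a, b = b, 4 * b + a
    return total
```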

Problem 3

The prime factors of 13195 are 5, 7, 13 and 29.

What is the largest prime factor of the number 600851475143 ?


This is an uninspired brute force solution; see problem 7 for a better way to find primes.

In [6]:
def isPrime(x):
    if (x==1):
        return False
    for i in range(2,x):
        if x%i==0:
            return False
    return True
    
#build a list of all primes <=20000
primes = []
for i in range(1,20000):
    if isPrime(i):
        primes.append(i)
        
primes[-3:]
Out[6]:
[19991, 19993, 19997]
In [7]:
#find the complete prime factorisation for the target number
target = 600851475143
factors = []
for p in reversed(primes):
    if target % p == 0:
        factors.append(p)
factors
Out[7]:
[6857, 1471, 839, 71]
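
The approach above works here because every prime factor of the target happens to be below 20000 and none is repeated. A more general sketch, in Python 3, divides factors out of the number as they are found, so no prime list is needed and repeated factors are handled (the name prime_factors is hypothetical):

```python
def prime_factors(n):
    """Return the prime factorisation of n as an ascending list."""
    factors = []
    d = 2
    while d * d <= n:
        # d can only divide n here if d is prime, because all smaller
        # factors have already been divided out
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors
```

The largest prime factor is then the last element of the returned list.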

Problem 4

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99.

Find the largest palindrome made from the product of two 3-digit numbers.


In [8]:
#nothing clever here, brute force over the search space and pick out all the palindromes
palindromes = []

def isPalindrome(x):
    s = str(x)
    return s == s[::-1]

#note the inner loop doesn't cover the entire search space but starts at x
#this halves work by not calculating redundant products like (1*2, 2*1)
for x in reversed(range(100, 1000)):
    for y in reversed(range(x, 1000)):
        n = x*y
        if (isPalindrome(n)):
            palindromes.append(n)
            
max(palindromes)
Out[8]:
906609

Making matplotlib look like ggplot

When I first started using matplotlib, the output looked very crisp and polished compared to Excel. However, after seeing ggplot2, I realized that matplotlib’s default presentation settings leave a lot to be desired. I have put together a quick script that will restyle an axes to look more or less like ggplot2’s.

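import matplotlib
import matplotlib.pyplot as plt
import pylab
from matplotlib import rcParams
from matplotlib.patches import Polygon
from matplotlib.ticker import MultipleLocator

def rstyle(ax):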
    """Styles an axes to appear like ggplot2
    Must be called after all plot and axis manipulation operations have been carried out (needs to know final tick spacing)
    """

    #set the style of the major and minor grid lines, filled blocks
    ax.grid(True, 'major', color='w', linestyle='-', linewidth=1.4)
    ax.grid(True, 'minor', color='0.92', linestyle='-', linewidth=0.7)
    ax.patch.set_facecolor('0.85')
    ax.set_axisbelow(True)
   
    #set minor tick spacing to 1/2 of the major ticks
    ax.xaxis.set_minor_locator(MultipleLocator( (ax.get_xticks()[1]-ax.get_xticks()[0]) / 2.0 ))
    ax.yaxis.set_minor_locator(MultipleLocator( (ax.get_yticks()[1]-ax.get_yticks()[0]) / 2.0 ))
   
    #remove axis border
    for child in ax.get_children():
        if isinstance(child, matplotlib.spines.Spine):
            child.set_alpha(0)
       
    #restyle the tick lines
    for line in ax.get_xticklines() + ax.get_yticklines():
        line.set_markersize(5)
        line.set_color("gray")
        line.set_markeredgewidth(1.4)
   
    #remove the minor tick lines    
    for line in ax.xaxis.get_ticklines(minor=True) + ax.yaxis.get_ticklines(minor=True):
        line.set_markersize(0)
   
    #only show bottom left ticks, pointing out of axis
    rcParams['xtick.direction'] = 'out'
    rcParams['ytick.direction'] = 'out'
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
   
   
    if ax.legend_ is not None:
        lg = ax.legend_
        lg.get_frame().set_linewidth(0)
        lg.get_frame().set_alpha(0.5)
       
       
def rhist(ax, data, **keywords):
    """Creates a histogram with default style parameters to look like ggplot2
    Is equivalent to calling ax.hist and accepts the same keyword parameters.
    If style parameters are explicitly defined, they will not be overwritten
    """

   
    defaults = {
                'facecolor' : '0.3',
                'edgecolor' : '0.28',
                'linewidth' : 1,
                'bins' : 100
                }
   
    for k, v in defaults.items():
        if k not in keywords: keywords[k] = v
   
    return ax.hist(data, **keywords)


def rbox(ax, data, **keywords):
    """Creates a ggplot2 style boxplot, is equivalent to calling ax.boxplot with the following additions:
   
    Keyword arguments:
    colors -- array-like collection of colours for box fills
    names -- array-like collection of box names which are passed on as tick labels

    """


    hasColors = 'colors' in keywords
    if hasColors:
        colors = keywords['colors']
        keywords.pop('colors')
       
    if 'names' in keywords:
        ax.tickNames = plt.setp(ax, xticklabels=keywords['names'] )
        keywords.pop('names')
   
    bp = ax.boxplot(data, **keywords)
    pylab.setp(bp['boxes'], color='black')
    pylab.setp(bp['whiskers'], color='black', linestyle = 'solid')
    pylab.setp(bp['fliers'], color='black', alpha = 0.9, marker= 'o', markersize = 3)
    pylab.setp(bp['medians'], color='black')
   
    numBoxes = len(data)
    for i in range(numBoxes):
        box = bp['boxes'][i]
        boxX = []
        boxY = []
        for j in range(5):
            boxX.append(box.get_xdata()[j])
            boxY.append(box.get_ydata()[j])
        boxCoords = list(zip(boxX,boxY))
       
        if hasColors:
            boxPolygon = Polygon(boxCoords, facecolor = colors[i % len(colors)])
        else:
            boxPolygon = Polygon(boxCoords, facecolor = '0.95')
           
        ax.add_patch(boxPolygon)
    return bp

Usage is very simple: call rstyle(axes) just before showing or saving your figure. It is key to call it after all drawing and axis manipulation have been done, because it reads the major tick positions to work out where to put the minors.

from pylab import *
import scipy.stats

t = arange(0.0, 100.0, 0.1)
s = sin(0.1*pi*t)*exp(-t*0.01)
fig = plt.figure()
ax = fig.add_subplot(111)

plot(t,s, label = "Original")
plot(t,s*2, label = "Doubled")

ax.legend()
rstyle(ax)
plt.show()

I have also included a function that creates a ggplot style histogram for you. This is nothing more than setting some default parameters to the hist function.

from pylab import *
import scipy.stats

t = arange(0.0, 100.0, 0.1)
s = sin(0.1*pi*t)*exp(-t*0.01)

fig = plt.figure()
ax = fig.add_subplot(111)

data = scipy.stats.norm.rvs(size = 1000)
rhist(ax, data, label = "Histogram")
ax.legend()
rstyle(ax)
plt.show()

There is also a slightly more involved boxplot function which handles fill colours and names for you.

from pylab import *
import scipy.stats
data = [scipy.stats.norm.rvs(size = 100), scipy.stats.norm.rvs(size = 100), scipy.stats.norm.rvs(size = 100)]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.legend()
rbox(ax, data, names = ("One", "Two", "Three"), colors = ('white', 'cyan'))
rstyle(ax)

Finally, with a bit of help from Justin Peel over at StackOverflow, you can get some really nice graphics going that you won’t be ashamed to put in your published material or presentation.

I have only used these scripts in my fairly limited scenario, and there are several obvious limitations, such as the requirement to pass an axes, the enforcement of minor ticks at 1/2 majors, and the fact that I haven’t really done much with the legend. Still, it should be enough to get you started in your projects. Happy visualizating!