Project Euler Problems 7-8

View and download this notebook from nbviewer
In [1]:
from IPython.display import display
from IPython.display import HTML

Problem 7

By listing the first six prime numbers: 2, 3, 5, 7, 11, and 13, we can see that the 6th prime is 13.
What is the 10 001st prime number?


Another prime question, this time asking for the 10,001st prime. We tested for primality before in problem 3, so it's time to reuse that code.

Method 1: brute force

In [84]:
def isPrime(x):
    if (x==1):
        return False
    for i in range(2,x):
        if x%i==0:
            return False
    return True

def getPrimes(maxValue):
    primes = []
    for i in range(1,maxValue):
        if isPrime(i):
            primes.append(i)
    return primes

primes = getPrimes(10000)
In [86]:
%%timeit
getPrimes(10000)
1 loops, best of 3: 1.25 s per loop

In [85]:
len(primes)
Out[85]:
1229

The brute force solution takes more than a second to calculate the primes below 10000. And how many primes did that yield? Only 1229! This doesn't look like a reasonable way to find 10001 primes. Luckily, there is a very simple and clever algorithm that can do this job much faster.

Method 2: Sieve of Eratosthenes

The basic notion of the sieve of Eratosthenes is to pre-allocate a list of numbers up to n, and then, taking a prime (starting with 2), cross out every multiple of that prime, as those multiples clearly can't be primes. The next prime is then the next unmarked value in the list. The process repeats until there are no more primes to be found.

In [123]:
def showState(l, p, nx):
    numbers = ''
    for n in l:
        style=''
        if n<0:
            style+='text-decoration: line-through; background-color: rgb(171, 231, 255);'
        if n==p:
            style+='background-color: rgb(230,255,95);'
        if n==nx:
            style+='background-color: rgb(150, 233, 150);'
        if n==0:
            style+='background-color: rgb(220,220,220); color: rgb(220,220,220);'    
        numbers+='<span style="{0}">{1}</span>'.format(style, abs(n))
    s = """    {0}
""".format(numbers)
    h = HTML(s)
    display(h)

def sieve(size, showStates=True):
    l = list(range(2,size+1)) #generate the candidate set
    idx = lambda x: x-2 #just a simple mapping from number in list to list index
    p = 2 #seed with initial prime
    for iteration in range(len(l)):
        #mark every multiple of p
        for i in range(p*2, size+1, p):
            l[idx(i)] = -i
        #find the next unmarked value, that's the next p
        nextPrime = 0
        for i in l[idx(p+1):]:
            if i>0:
                nextPrime = i
                break
        if (showStates):
            showState(l, p, nextPrime)
            for i in range(p*2, size+1, p):
                l[idx(i)] = 0
        p = nextPrime
        #if we haven't found any unmarked values, we're done
        if p == 0:
            break
    #return all unmarked values
    return filter(lambda x: x>0, l)

sieve(38, True)
[Output: one row per iteration listing the numbers 2 to 38, with the current prime highlighted in yellow, its multiples struck through in blue, the next prime highlighted in green, and previously eliminated values greyed out]
Out[123]:
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

Above is the state of the preallocated list at each iteration of sifting primes up to 38.

Starting with a fully unmarked list, and the first prime, 2 (shown in yellow), every multiple of 2 is marked off in the list (shown in blue). The next prime (green) is then found by moving up the list until the first unmarked number.

The next iteration starts at the newly found prime, 3, and proceeds to mark off every multiple of 3 in the list, and so forth.

Finally, the last iteration attempts to find unmarked values to the right of 37 and finds none. At that point the algorithm can terminate and return the remaining unmarked values in the list.

In [58]:
%%timeit
v = sieve(10000, False)
10 loops, best of 3: 55.2 ms per loop

In [72]:
len(sieve(10000, False))
Out[72]:
1229

At less than 60ms to find all primes less than 10000, this algorithm is more than 20 times faster than the brute force approach.

It can be further optimized by recognizing that if one factor of a number is greater than its square root, then the other factor must be less than its square root. Hence every composite number up to n has already been marked by the time we pass \(\sqrt{n}\), and multiples of primes greater than \(\sqrt{n}\) need not be considered[1]. The sieve function can be trivially modified to use this knowledge by limiting the marking phase to \(\sqrt{n}\).

[1] http://britton.disted.camosun.bc.ca/jberatosthenes.htm

In [73]:
#comments removed for brevity
def sieve(size, showStates=True):
    l = list(range(2,size+1)) 
    idx = lambda x: x-2 
    p = 2 
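    #there are fewer than sqrt(n) primes below sqrt(n), so capping the number
    #of iterations at sqrt(n) still processes every prime up to sqrt(n)
    for iteration in range(int(0.5+len(l)**0.5)):
        #mark every multiple of p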
        for i in range(p*2, size+1, p):
            l[idx(i)] = -i
        nextPrime = 0
        for i in l[idx(p+1):]:
            if i>0:
                nextPrime = i
                break
        if (showStates):
            showState(l, p, nextPrime)
            for i in range(p*2, size+1, p):
                l[idx(i)] = 0
        p = nextPrime
        if p == 0:
            break
    return filter(lambda x: x>0, l)
In [74]:
%%timeit
v = sieve(10000, False)
100 loops, best of 3: 13.2 ms per loop
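
As a rough illustration of the square-root idea, here is a minimal sketch written for Python 3. It is not the notebook's `sieve` function: the name `sieve_sqrt` and the use of a `bytearray` of flags are assumptions made for this example. It stops as soon as the candidate prime exceeds \(\sqrt{n}\) and starts crossing out at \(p^2\), since smaller multiples of p have already been removed by smaller primes.

```python
def sieve_sqrt(limit):
    """Return all primes <= limit, marking multiples only for p <= sqrt(limit)."""
    if limit < 2:
        return []
    # is_prime[i] == 1 means "i has not been crossed out yet"
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            # multiples below p*p were already crossed out by smaller primes
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = 0
        p += 1
    return [i for i in range(2, limit + 1) if is_prime[i]]

print(sieve_sqrt(38))
print(len(sieve_sqrt(10000)))
```

For a limit of 38 this should reproduce the list of twelve primes shown earlier, and for 10000 it should again find 1229 primes.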

So the Eratosthenes sieve is very fast at finding primes up to some limit m. At m=10000, we find n=1229 primes. What range do we have to sieve to actually get our n=10001 primes?

Rosser's theorem[2] provides a useful inequality that establishes bounds on the value of the nth prime number:

\(\ln n + \ln\ln n - 1 < \frac{p_n}{n} < \ln n + \ln \ln n \quad\text{for } n \ge 6\)

[2] http://en.wikipedia.org/wiki/Prime_number_theorem#Approximations_for_the_nth_prime_number
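
To make the inequality concrete, the sketch below (Python 3) evaluates both sides for a given n. The helper name `rosser_bounds` is an assumption for this example; it simply multiplies the two bounds on \(p_n/n\) by n.

```python
from math import log

def rosser_bounds(n):
    """Return (lower, upper) bounds on the nth prime; stated for n >= 6."""
    if n < 6:
        raise ValueError("the inequality is only stated for n >= 6")
    base = log(n) + log(log(n))
    return n * (base - 1), n * base

lower, upper = rosser_bounds(10001)
print(lower, upper)
```

Sieving up to the upper bound is therefore guaranteed to include the 10001st prime, while the lower bound shows how much of that range is strictly necessary.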

In [114]:
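from math import log

def maxPrime(n):
    #upper bound on the nth prime from Rosser's theorem, rounded to the nearest integer
    return int(0.5+(float(n)*log(n)+ n*log(log(n))))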
    
limit = maxPrime(10000)
print('The 10000th prime has a value < {0}'.format(limit))
The 10000th prime has a value < 114307

In [115]:
primes = sieve(limit, False)
len(primes)
Out[115]:
10816

The upper bound function appears to have done its job and netted just over 10000 primes. We can now obtain the 10001st prime, which sits at index 10000 of the zero-based list.

In [118]:
primes[10000]
Out[118]:
104743

Problem 8

The four adjacent digits in the 1000-digit number that have the greatest product are 9 × 9 × 8 × 9 = 5832.

73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450

Find the thirteen adjacent digits in the 1000-digit number that have the greatest product. What is the value of this product?


In [121]:
source = '''
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
'''.replace('\n','')

#break the source string into a series of 13 character long slices at every possible position
window_size = 13
slices = [source[x:x+window_size] for x in range(len(source) - window_size + 1)]

#compute the product of each slice (int64 avoids overflow on platforms with 32-bit default ints)
from numpy import prod
products = [prod(list(map(int, row)), dtype='int64') for row in slices]

max(products)
Out[121]:
23514624000
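
The same sliding-window idea can be sketched without numpy, relying on Python's arbitrary-precision integers instead of an explicit `int64` dtype. This is a Python 3 sketch; the function name `max_window_product` and the short test string are made up for illustration.

```python
from functools import reduce
from operator import mul

def max_window_product(digits, window_size):
    """Largest product of window_size adjacent digits in the string digits."""
    best = 0
    for start in range(len(digits) - window_size + 1):
        window = digits[start:start + window_size]
        if '0' in window:
            continue  # any window containing a zero has product 0
        best = max(best, reduce(mul, map(int, window)))
    return best

print(max_window_product('3675356291', 5))  # 3150
```

Skipping windows that contain a zero is only a small shortcut; the result is the same as multiplying every window out.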

Project Euler Problems 5-6


Problem 5

2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder.

What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?


This is an interesting problem!

First things first: we can establish an upper bound. \(1 \times 2 \times 3 \times \dots \times 20\), or simply \(20!\), is divisible by every number from 1 to 20, so the answer can be no larger than that. We can work our way down by repeatedly dividing this upper bound by any number in the range [2,20] and keeping the result as long as it is still evenly divisible by every number from 1 to 20.

Each successful division shrinks the number by at least a factor of 2, so there are at most \(\log_2(n!)\) of them, which is O(n log n).

In [16]:
import math

factors = 20

upper = math.factorial(factors)
divisors = range(2, factors+1)
current = upper

#repeatedly attempt to divide the current number by a divisor, trying the
#largest first, and keep the quotient only if it is still evenly divisible
#by every number in the range
while True:
    found = False
    for p in reversed(divisors):
        c = current // p
        if all(c % d == 0 for d in divisors):
            found = True
            current = c
            break
            
    if not found:
        break
        
    print 'divided by', p, 'got', current
divided by 20 got 121645100408832000
divided by 20 got 6082255020441600
divided by 20 got 304112751022080
divided by 18 got 16895152834560
divided by 18 got 938619601920
divided by 18 got 52145533440
divided by 16 got 3259095840
divided by 14 got 232792560
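
The smallest number divisible by everything from 1 to n is the least common multiple of that range, so it can also be computed directly by folding an LCM over the range. The sketch below is Python 3 and is not part of the original notebook; `lcm` and `smallest_multiple` are names chosen for this example, and `math.gcd` is the standard library's greatest common divisor.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def smallest_multiple(n):
    """Smallest positive number evenly divisible by every integer in 1..n."""
    return reduce(lcm, range(1, n + 1), 1)

print(smallest_multiple(10))  # 2520
print(smallest_multiple(20))  # 232792560
```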

Problem 6

The sum of the squares of the first ten natural numbers is, \(1^2 + 2^2 + \dots + 10^2 = 385\)

The square of the sum of the first ten natural numbers is, \((1 + 2 + \dots + 10)^2 = 55^2 = 3025\)

Hence the difference between the sum of the squares of the first ten natural numbers and the square of the sum is 3025 − 385 = 2640.

Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.


Method 1: brute force

Complexity: O(N)

In [10]:
def squareDiff(x):
    s = range(1, x+1)
    sumSquares = sum([x*x for x in s])
    squareSum = math.pow(sum(s),2)
    diff = squareSum - sumSquares
    return diff

squareDiff(100)
Out[10]:
25164150.0

Easy enough; however, it's well known that the sum of the natural numbers up to n can be calculated in constant time as \(\frac{n(n+1)}{2}\).

Is it possible that the sum of a series of natural numbers squared up to n can be calculated in constant time as well? I didn't know the answer and cheated by using a genetic algorithm to attempt to fit an equation to match sumSquares for the first 140 inputs.

Amazingly, it came back with a polynomial that had 0 residual error: \(\frac{1}{6}n + \frac{1}{2}n^2 + \frac{1}{3}n^3\)

Let's plot this polynomial to double check

In [11]:
brute = lambda n: sum([x*x for x in xrange(1,n+1)])
poly = lambda n: round(1./6 * n + 1./2 * pow(n, 2) + 1./3 * pow(n, 3))

x = np.array(range(1,2200))
brute_y = np.array([brute(t) for t in x])
poly_y = np.array([poly(t) for t in x])

plt.plot(x, brute_y, label='Brute Force', color='blue')
plt.plot(x, poly_y, label='Polynomial', color='red')
plt.legend()

print 'max error:', max(brute_y - poly_y)
max error: 1431655765.0

Looks like the polynomial solution suffers from integer overflow at around \(n \approx 1300\), earlier than the brute force solution. This is understandable considering the polynomial solution deals with \(n^3\) while brute force only deals with \(n^2\). We'll switch to floats to overcome overflow issues in both cases.

In [12]:
brute = lambda n: sum([1.*x*x for x in xrange(1,n+1)])
poly = lambda n: round(1./6 * n + 1./2 * pow(n, 2.) + 1./3 * pow(n, 3.))

x = np.array(range(1,2200))
brute_y = np.array([brute(t) for t in x])
poly_y = np.array([poly(t) for t in x])

plt.plot(x, brute_y, label='Brute Force', color='blue')
plt.plot(x, poly_y, label='Polynomial', color='red')
plt.legend()

print 'max error:', max(brute_y - poly_y)
max error: 0.0

A maximum error of 0 across a small input set is promising; however, I got in touch with a friend to check, and he promptly came back with a proof!


All credit to Jonah Schreiber for the below proof:

\[\sum_{x=1}^{n}{x^2}=\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6}\]

Base

Here, we show that the formula is correct for \(n=1\). On the left-hand side, we have \(1\), and on the right-hand side, we have \(\frac{1^3}{3}+\frac{1^2}{2}+\frac{1}{6}=1\), so the base case is true.

Assumption

Assume that

\[\sum_{x=1}^{n}{x^2}=\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6}\]

is true.

Induction

Show then that it is true for \(n+1\), that is,

\[\sum_{x=1}^{n+1}{x^2}=\frac{(n+1)^3}{3}+\frac{(n+1)^2}{2}+\frac{n+1}{6}\]

Let us break out the last term on the left-hand side, and expand the right-hand side:

\[\sum_{x=1}^{n}{x^2}+(n+1)^2=\frac{n^3+3n^2+3n+1}{3}+\frac{n^2+2n+1}{2}+\frac{n+1}{6}\]

We already know the sum on the left-hand side, which we insert.

\[\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6}+n^2+2n+1=\frac{n^3+3n^2+3n+1}{3}+\frac{n^2+2n+1}{2}+\frac{n+1}{6}\]

Collecting terms, we get

\[\frac{n^3}{3}+\frac{3n^2}{2}+\frac{13n}{6}+1=\frac{n^3}{3}+\frac{3n^2}{2}+\frac{13n}{6}+1\]

The two sides are equal, so the formula is correct.

Thanks again to Jonah Schreiber for the above proof.


See http://www.trans4mind.com/personal_development/mathematics/series/sumNaturalSquares.htm for several derivations of this formula.


And so, finally,

Method 2: Arithmetic

Complexity: O(1)

In [13]:
def squareDiff2(x):
    sumSquares = round(1./6 * x + 1./2 * pow(x, 2.) + 1./3 * pow(x, 3.))
    squareSum = pow(x*(x+1)/2., 2)
    return squareSum - sumSquares

squareDiff2(100)
Out[13]:
25164150.0
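
Because \(\frac{n^3}{3}+\frac{n^2}{2}+\frac{n}{6} = \frac{n(n+1)(2n+1)}{6}\), the same calculation can be done entirely in integers, which sidesteps both rounding and overflow. The following is a Python 3 sketch; `square_diff_exact` is a name made up for this example.

```python
def square_diff_exact(n):
    """Square of the sum minus sum of the squares for 1..n, in integer arithmetic."""
    sum_of_squares = n * (n + 1) * (2 * n + 1) // 6   # factored form of the polynomial
    square_of_sum = (n * (n + 1) // 2) ** 2
    return square_of_sum - sum_of_squares

print(square_diff_exact(10))   # 2640
print(square_diff_exact(100))  # 25164150
```

Both divisions are exact: n(n+1) is always even, and n(n+1)(2n+1) is always a multiple of 6.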

Project Euler Problems 1-4


Problem 1

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.


In [1]:
#trivial with python's comprehensions
sum(x for x in xrange(1,1000) if x%5==0 or x%3==0)
Out[1]:
233168

Problem 2

Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...

By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.


Method 1: memoization

In [2]:
#set base case
fibs = {0:0, 1:1}

def fib(n):
    ret = fibs.get(n, None)
    if ret != None:
        return ret
    ret = fib(n-2) + fib(n-1)
    fibs[n] = ret
    return ret

def getEvenSum(upperBound = 4000000):
    evenValued = []
    for i in range(upperBound):
        v = fib(i)
        if v > upperBound:
            break
        if v%2 == 0:
            evenValued.append(v)

    return sum(evenValued)

getEvenSum()
Out[2]:
4613732
In [3]:
%%timeit
getEvenSum()
10 loops, best of 3: 65.6 ms per loop

This isn't terrible, but we can do a lot better by iterating from the bottom up.

Method 2: iterative

In [4]:
def getEvenSum(upperBound = 4000000):
    previous = 0
    current = 1
    even = 0

    #test each term before advancing, so a term above the bound is never added
    while current <= upperBound:
        if current % 2 == 0:
            even += current
        previous, current = current, previous + current

    return even

getEvenSum()
Out[4]:
4613732
In [5]:
%%timeit
getEvenSum()
100000 loops, best of 3: 3.88 µs per loop
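
A further shortcut, not used in the notebook, is that every third Fibonacci term is even and the even terms satisfy their own recurrence, \(E_k = 4E_{k-1} + E_{k-2}\) (for example \(34 = 4 \times 8 + 2\)). A Python 3 sketch under that assumption, with `even_fib_sum` as a hypothetical name:

```python
def even_fib_sum(upper_bound=4000000):
    """Sum the even Fibonacci terms that do not exceed upper_bound."""
    a, b = 2, 8          # the first two even Fibonacci terms
    total = 0
    while a <= upper_bound:
        total += a
        a, b = b, 4 * b + a   # step directly to the next even term
    return total

print(even_fib_sum())  # 4613732
```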

Problem 3

The prime factors of 13195 are 5, 7, 13 and 29.

What is the largest prime factor of the number 600851475143 ?


This is an uninspired brute force solution; see problem 7 for a better way to find primes.

In [6]:
def isPrime(x):
    if (x==1):
        return False
    for i in range(2,x):
        if x%i==0:
            return False
    return True
    
#build a list of all primes < 20000
primes = []
for i in range(1,20000):
    if isPrime(i):
        primes.append(i)
        
primes[-3:]
Out[6]:
[19991, 19993, 19997]
In [7]:
#find the prime factors of the target number among the primes below 20000, largest first
target = 600851475143
factors = []
for p in reversed(primes):
    if target % p == 0:
        factors.append(p)
factors
Out[7]:
[6857, 1471, 839, 71]
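
The prime list is not strictly needed here. A common alternative is to divide factors out of the target as they are found, so that any divisor that succeeds is automatically prime. This Python 3 sketch is one such variant rather than the notebook's method; `largest_prime_factor` is a name chosen for this example.

```python
def largest_prime_factor(n):
    """Largest prime factor of n (n >= 2) by repeated trial division."""
    factor = 2
    largest = None
    while factor * factor <= n:
        while n % factor == 0:
            largest = factor
            n //= factor      # strip this factor out completely
        factor += 1
    if n > 1:
        largest = n           # whatever remains is itself prime
    return largest

print(largest_prime_factor(13195))         # 29
print(largest_prime_factor(600851475143))  # 6857
```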

Problem 4

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99.

Find the largest palindrome made from the product of two 3-digit numbers.


In [8]:
#nothing clever here, brute force over the search space and pick out all the palindromes
palindromes = []

def isPalindrome(x):
    s = str(x)
    return s == s[::-1]

#note the inner loop doesn't cover the entire search space but starts at x
#this halves work by not calculating redundant products like (1*2, 2*1)
for x in reversed(range(100, 1000)):
    for y in reversed(range(x, 1000)):
        n = x*y
        if (isPalindrome(n)):
            palindromes.append(n)
            
max(palindromes)
Out[8]:
906609

My Experience with the Surface Pro 2 as a Software Developer

The recently released surface pro 2 has been on my mind for a while. It addresses the biggest issues with the original surface; most importantly, the battery life is now comparable to laptops in the same class – even if still poor. The type cover has also been slightly improved and I found that I can touch type on it just as fast as I do on my desktop.

I was hoping that I could replace my laptop with the surface for a super mobile setup that can move between work and home. The surface pro 2 certainly has enough horsepower and ram to get serious work done, and with a full size keyboard and dual screens at work and home, I shouldn’t have to deal with the tiny keyboard and onboard screen too much. In return I’d get a bonus tablet and a full fledged core i5 pc in a tiny form factor, a pretty sweet deal.

After spending a week with the surface, I have to say that it didn’t work as well as I hoped. The following are the highlights of my experience.

The type cover is not good enough
The increased key travel on the type cover 2 is a welcome improvement, but it comes with a step backwards on the track-pad which used to have physical clickable buttons and a rubbery surface. These have been replaced with capacitive buttons and a felt finish, which is starting to fade in high traffic areas after a mere week of use. The bottom line is that despite the improvements, it’s still just too small a keyboard to do serious work with. I got the full fledged pc experience I wanted at home and work, but without a keyboard attached the usefulness of the surface is severely crippled.

The biggest redeeming factor here is the active digitizer pen, which is amazing. While it doesn’t replace a mouse, it’s a great supplement and feels very natural to use. The touch screen simply can’t compare to the fine control you get with the pen, and the active digitizer means you can do things equivalent to mouse movements without clicking by hovering the pen over the screen. I actually miss having the pen when using other computers now; I hope to see a laptop come with this feature in the future.

The pixel density on the screen is too high, kinda
The show-stopping problem to me stemmed from the small screen, but not for the obvious reason. I was quite aware that a 10″ screen isn’t much to do real work on, and prepared to supplement it with dual monitors. What I wasn’t prepared for is how terrible DPI scaling is on windows.

Here’s the problem: the screen is a full hd 1920×1080 panel packed into a mere 10 inches. Applications not specifically designed to deal with high dpi displays render tiny text and tiny buttons which are very difficult to read and impossible to click on with your finger. Windows alleviates this with a dpi scaling option which forces applications to increase the size of what they’re rendering. Unfortunately unless applications were written to deal with this, it looks like they just get upscaled with what looks to be a bilinear filter. The result is that everything is blurry! This affects just about every application I’ve used except for internet explorer and visual studio. Even chrome and firefox don’t support dpi scaling and can at best be hacked by being run in compatibility mode to prevent scaling, followed by increasing page zoom or default font size. This tends to mess up some site layouts and still leaves you with tiny unreadable tabs and other native UI components.

Here’s IE vs what you get with chrome out of the box:

IE and chrome with dpi scaling


And here’s opera hacked to work sort-of okay vs chrome out of the box. Note the tiny tabs and broken visuals on opera.

Opera with dpi scaling disabled + font tweaks vs chrome with dpi scaling


I was also surprised to find that .net winforms applications that I’m developing have the same scaling problems. I assumed that Microsoft would definitely make sure that apps built with visual studio are ready to run on the surface out of the box. As a developer, I had no idea about this being a problem until I was on the receiving end, and I suspect that that’s the case with a lot of applications out there.

The problem is made even worse by the fact that the dpi scaling setting is global across all monitors in a multi-monitor setup. This means that when I plug the surface in at work, I either get giant scaled graphics on the large screens, or tiny unreadable graphics on the surface. On top of that, the hacky application setups to prevent blurring that I described above, are also carried over to the large screens. This just doesn’t work at all.

It’s not a replacement for the iPad
After a week of use it’s pretty clear to me that the surface pro is not really a tablet. I own an iPad and I continued to prefer it as my tablet, both hardware and software wise. The aspect ratio on the surface doesn’t lend itself very well to the tablet experience. The browser is worse than on the iPad, and IE is the only browser that works ok in a tablet fashion. I found that I actually preferred IE over chrome, which is pretty depressing.

I also encountered a pretty serious issue where the surface would randomly refuse to wake from sleep sometimes and reboot instead. Basically every time I closed the lid, I risked losing all my unsaved work. Microsoft’s tech support walked me through all the scripts that covered anything related to this issue, including a full factory reset, but nothing helped. I do know that other users have reported the same problem and suspect it’s caused by some application that I installed. Sadly I only installed the bare basics for work such as visual studio, vmware, sublime text, and office, so it looks like another show stopping problem.

You can forget about using any desktop applications in portrait orientation or without the type cover, the experience is just painful. Also, every time you go to portrait mode, your desktop icons are rearranged to fit horizontally and don’t go back when you return to landscape.

Last but not least, it’s just too thick and heavy to comfortably hold as a tablet. This is excusable if you account for the fact that you’re actually holding a high end laptop worth of horsepower, but that doesn’t change the fact that it’s a poor tablet experience.

Wrapping it up
All in all, the surface pro is an incredible piece of hardware at a really good price point, and I really want to like it, but in the end it can’t replace my laptop and it can’t replace my iPad. I would love to own one in addition to a laptop+tablet, but I just can’t justify 1500 dollars on a device that doesn’t have a clear purpose. I might reconsider in the future when high dpi screens become prevalent and application developers are forced to support them.

Despite the issues I’ve had, I’m going to be sad to part with the surface. It looks and feels amazing and I imagine it’s a great device for lighter work. I was also surprised that despite my strong dislike for windows 8 based on previous experience, after a week I’m not only used to it but actually prefer it in many ways. I thought the first thing I’ll be doing is reverting the start menu back to 7, but you know what? The windows 8 start is actually really good if you give it a chance, and doubles as a solid replacement for Launchy.

Follow-up
After a few more days of taking it to work, I ended up returning my surface for a refund. My general experience was that the small screen and the type cover simply weren’t good enough for prolonged serious work. In particular, there wasn’t enough screen real-estate to have a decent Visual Studio workspace and I found myself constantly trying to find balance between making the text too small or not being able to see enough code at once.
This was compounded by the fact that the dual screen experience was terrible, and my original notion of coming to work/home and docking the surface for serious work was simply not viable due to the hidpi scaling issues mentioned above. This was the selling point of the surface for me, and it simply didn’t deliver.

Instead, I picked up the Lenovo Yoga 2 Pro and couldn’t be happier
Despite being larger, I think this laptop is on equal footing in terms of mobility; I feel that it’s actually better because you can comfortably plop it in your lap – which is rather difficult with the surface + type cover, and you can stand it up on any angle rather than the surface’s 2 predefined kickstand modes.

In exchange for the higher price tag and larger form factor, you get a real keyboard, a good trackpad, 2 USB ports, and a screen that’s large enough to comfortably use busy tools like VS for extended periods of time. The dual screen experience is just as bad as with the surface, but at least the Yoga 2 is a perfectly usable development machine on its own.

Since the time of the original post, hidpi software adoption has also made great progress, and you can now expect a lot of tools to work out of the box. Two notable exceptions are Adobe Photoshop which is completely unusable, and Remote Desktop which doesn’t support scaling of the remote display. An alternative to the latter is Remote Desktop Connection Manager which is free and supports scaling.

Making matplotlib look like ggplot

When I first started using matplotlib, the output looked very crisp and polished compared to Excel; however, after seeing ggplot2, I realized that matplotlib’s default presentation settings leave a lot to be desired. I have put together a quick script that will restyle an axes to look more or less like ggplot2’s.

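import matplotlib
import matplotlib.spines
import matplotlib.pyplot as plt
import pylab
from matplotlib import rcParams
from matplotlib.patches import Polygon
from matplotlib.ticker import MultipleLocator

def rstyle(ax):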
    """Styles an axes to appear like ggplot2
    Must be called after all plot and axis manipulation operations have been carried out (needs to know final tick spacing)
    """

    #set the style of the major and minor grid lines, filled blocks
    ax.grid(True, 'major', color='w', linestyle='-', linewidth=1.4)
    ax.grid(True, 'minor', color='0.92', linestyle='-', linewidth=0.7)
    ax.patch.set_facecolor('0.85')
    ax.set_axisbelow(True)
   
    #set minor tick spacing to 1/2 of the major ticks
    ax.xaxis.set_minor_locator(MultipleLocator( (plt.xticks()[0][1]-plt.xticks()[0][0]) / 2.0 ))
    ax.yaxis.set_minor_locator(MultipleLocator( (plt.yticks()[0][1]-plt.yticks()[0][0]) / 2.0 ))
   
    #remove axis border
    for child in ax.get_children():
        if isinstance(child, matplotlib.spines.Spine):
            child.set_alpha(0)
       
    #restyle the tick lines
    for line in ax.get_xticklines() + ax.get_yticklines():
        line.set_markersize(5)
        line.set_color("gray")
        line.set_markeredgewidth(1.4)
   
    #remove the minor tick lines    
    for line in ax.xaxis.get_ticklines(minor=True) + ax.yaxis.get_ticklines(minor=True):
        line.set_markersize(0)
   
    #only show bottom left ticks, pointing out of axis
    rcParams['xtick.direction'] = 'out'
    rcParams['ytick.direction'] = 'out'
    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('left')
   
   
    if ax.legend_ is not None:
        lg = ax.legend_
        lg.get_frame().set_linewidth(0)
        lg.get_frame().set_alpha(0.5)
       
       
def rhist(ax, data, **keywords):
    """Creates a histogram with default style parameters to look like ggplot2
    Is equivalent to calling ax.hist and accepts the same keyword parameters.
    If style parameters are explicitly defined, they will not be overwritten
    """

   
    defaults = {
                'facecolor' : '0.3',
                'edgecolor' : '0.28',
                'linewidth' : 1,
                'bins' : 100
                }
   
    for k, v in defaults.items():
        if k not in keywords: keywords[k] = v
   
    return ax.hist(data, **keywords)


def rbox(ax, data, **keywords):
    """Creates a ggplot2 style boxplot, is equivalent to calling ax.boxplot with the following additions:
   
    Keyword arguments:
    colors -- array-like collection of colours for box fills
    names -- array-like collection of box names which are passed on as tick labels

    """


    hasColors = 'colors' in keywords
    if hasColors:
        colors = keywords['colors']
        keywords.pop('colors')
       
    if 'names' in keywords:
        ax.tickNames = plt.setp(ax, xticklabels=keywords['names'] )
        keywords.pop('names')
   
    bp = ax.boxplot(data, **keywords)
    pylab.setp(bp['boxes'], color='black')
    pylab.setp(bp['whiskers'], color='black', linestyle = 'solid')
    pylab.setp(bp['fliers'], color='black', alpha = 0.9, marker= 'o', markersize = 3)
    pylab.setp(bp['medians'], color='black')
   
    numBoxes = len(data)
    for i in range(numBoxes):
        box = bp['boxes'][i]
        boxX = []
        boxY = []
        for j in range(5):
            boxX.append(box.get_xdata()[j])
            boxY.append(box.get_ydata()[j])
        boxCoords = list(zip(boxX,boxY))
       
        if hasColors:
            boxPolygon = Polygon(boxCoords, facecolor = colors[i % len(colors)])
        else:
            boxPolygon = Polygon(boxCoords, facecolor = '0.95')
           
        ax.add_patch(boxPolygon)
    return bp

Usage is very simple: call rstyle(axes) just before showing or saving your figure. It is key to call it after all drawing and axis manipulation has been done, because it reads the major tick positions to work out where to put the minors.

from pylab import *
import scipy.stats

t = arange(0.0, 100.0, 0.1)
s = sin(0.1*pi*t)*exp(-t*0.01)
fig = plt.figure()
ax = fig.add_subplot(111)

plot(t,s, label = "Original")
plot(t,s*2, label = "Doubled")

ax.legend()
rstyle(ax)
plt.show()

I have also included a function that creates a ggplot style histogram for you. This is nothing more than setting some default parameters to the hist function.

from pylab import *
import scipy.stats

t = arange(0.0, 100.0, 0.1)
s = sin(0.1*pi*t)*exp(-t*0.01)

fig = plt.figure()
ax = fig.add_subplot(111)

data = scipy.stats.norm.rvs(size = 1000)
rhist(ax, data, label = "Histogram")
ax.legend()
rstyle(ax)
plt.show()

There is also a slightly more involved boxplot function which handles fill colours and names for you.

from pylab import *
import scipy.stats
data = [scipy.stats.norm.rvs(size = 100), scipy.stats.norm.rvs(size = 100), scipy.stats.norm.rvs(size = 100)]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.legend()
rbox(ax, data, names = ("One", "Two", "Three"), colors = ('white', 'cyan'))
rstyle(ax)
plt.show()

Finally, with a bit of help from Justin Peel over at StackOverflow, you can get some really nice graphics going that you won’t be ashamed to put in your published material or presentation.

I have only used these scripts in my fairly limited scenario and there are several obvious limitations, such as the requirement to pass an axes, the enforcement of minor ticks at 1/2 majors, and the fact that I haven’t really done much with the legend, but it should be enough to get you started in your projects. Happy visualizing!