Visualize This chapter 2: scrape weather data using Python

Cover of Visualize This by Nathan Yau
This weekend I started working through the book Visualize This: The FlowingData Guide to Design, Visualization and Statistics by Nathan Yau, the writer and data visualisation expert behind flowingdata.com.

I’ve been interested in this field for some time but haven’t invested the time in learning the new technical skills required to really push myself on, partly through fear of programming. So I sat down with the book, opened a text editor on my Mac and fired up (gulp) Terminal, that weird text-only window with a language and behaviour all its own (and baffling to me).

The first practical exercise in Yau’s book looks at using Python and the Beautiful Soup Python library to scrape historical temperature data from Weather Underground (chapter 2, pages 30-37). The principle is to automatically scrape the maximum temperature in one location for every day in 2009, rather than load a separate web page for each day manually and record the maximum temperature yourself. Doing that 365 times would be tedious.
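To give a feel for what "every day in 2009" means in code, here is a minimal sketch in Python 3 that builds one page address per day. The example.com address and the name daily_urls are placeholders I have made up for illustration; they are not the URL or the code from Yau’s book.

```python
from datetime import date, timedelta

# Hypothetical URL pattern, for illustration only (not Weather Underground's).
URL_TEMPLATE = "https://example.com/history/{year}/{month}/{day}"


def daily_urls(year):
    """Return one URL per calendar day of the given year."""
    urls = []
    current = date(year, 1, 1)
    while current.year == year:
        urls.append(URL_TEMPLATE.format(
            year=current.year, month=current.month, day=current.day))
        current += timedelta(days=1)
    return urls


if __name__ == "__main__":
    urls = daily_urls(2009)
    print(len(urls), "pages to load")
    print(urls[0])
    print(urls[-1])
```

A real scraper would then load each of those addresses in turn and pick the maximum temperature out of the page, which is the part Beautiful Soup helps with.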

However, it probably would have been faster than the 2-3 hours I spent trying to get this first introductory exercise to work on my Mac!

As someone completely new to Python, to installing things via Terminal and to basic Terminal commands, I found this exercise tricky. Rather than doing the actual task, I had to learn things around it first. That’s ok; that’s what learning is.

I found this first exercise difficult, though, so in the hope of saving other people confusion, here are my notes, set in context with quotes from Yau’s instructions and filling out some of the detail that may help newbies get through it.

Install Python

Page 30:

If you work on Mac OS X, you should have Python installed already. Open the Terminal application and type python to start.

If you’ve never done any programming before – and I suspect plenty of the book’s readers haven’t – you might be thinking ‘what the dickens is the Terminal application?’

  1. Open a Finder window.
  2. Go to Applications > Utilities.
  3. The Terminal application is in here.
  4. Double-click on it.
  5. The Terminal application looks like a basic text editor window.
  6. In the new Terminal window, type python and press Return.
  7. You should now see something like this:

Screenshot of Terminal: install Python

Download Beautiful Soup

Page 30:

Next, you need to download Beautiful Soup, which can help you read web pages quickly and easily. Save the Beautiful Soup Python (.py) file in the directory that you plan to save your code in. If you know your way around Python, you can also put Beautiful Soup in your library path, but it’ll work the same either way.

I went to the Beautiful Soup website and downloaded Beautiful Soup 4.1.3, the current version at the time of writing.

It’s a folder, not a file, but I kept going and saved this folder to where I was planning to do the weather data scraping:

Screenshot: Beautiful Soup 4.1.3 directory contents in Finder

So far, so good.

Full script for get-weather-data.py

This is where I got stuck. Beautiful Soup has changed since the book was published in 2011. The first two lines of code in Yau’s get-weather-data.py file are:

import urllib2
from BeautifulSoup import BeautifulSoup

It turns out that this won’t work with Beautiful Soup version 4 onwards, because the package is now imported under the name bs4. Thankfully, I found Dikei’s solution on Stack Overflow. You need to edit the second line in get-weather-data.py to read:

from bs4 import BeautifulSoup
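If you are not sure which version of Beautiful Soup you have, a sketch like the one below tries the new import name first and falls back to the old one. It is written for Python 3, and the name find_beautiful_soup is my own invention rather than anything from the book or the library. (One related thing to be aware of: urllib2 exists only in Python 2; in Python 3 its job is done by urllib.request.)

```python
import importlib


def find_beautiful_soup():
    """Return the BeautifulSoup class, or None if it is not installed.

    Tries the version 4 package name (bs4) first, then the older
    version 3 module name (BeautifulSoup).
    """
    for module_name in ("bs4", "BeautifulSoup"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue
        return getattr(module, "BeautifulSoup", None)
    return None


if __name__ == "__main__":
    soup_class = find_beautiful_soup()
    if soup_class is None:
        print("Beautiful Soup is not installed where Python can find it")
    else:
        print("Found Beautiful Soup in module", soup_class.__module__)
```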

Run the code

Page 37:

The only thing left to do now is to run the code, which you do in your terminal using:

$ python get-weather-data.py

Taking this literally, back in your Terminal window, you get an error message:

Screenshot of Terminal: run Python script error

There’s a crucial step missing from Yau’s instructions: in Terminal you first need to navigate to the directory where you saved get-weather-data.py (in my case, the Beautiful Soup folder).

Firstly, if you still have the same Terminal window open from when you started Python, you’ll get stuck again, most likely because that window is still running Python rather than the normal Terminal prompt. This is what I did:

  1. In the Terminal menu, go to Shell > New Window.
  2. In the Terminal window, type cd followed by the path to the directory where you have saved your get-weather-data.py file. For example:
cd /Users/gavinwray/Documents/Books/Visualize-This-by-Nathan-Yau/Gavin/ch02/weather-scrape/beautifulsoup4-4.1.3

Now that’s a very long path name and easy to make a typo with. Two things I learnt here that might save you frustration:

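  1. In my experience, the get-weather-data.py script would only run when the path to the directory where it is saved contained no spaces. For example, if you’ve got a directory called /Visualize This/chapter 02/ in your file path and type it as it stands, the script won’t run. (A path with spaces can also be made to work by wrapping it in quotes or putting a backslash before each space.)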
  2. To save typing the full path, you can copy it by right-clicking on the directory name in Finder and choosing Copy. This copies the file path and you can paste it directly into your Terminal window. (Thanks Max Woolf for showing me this shortcut.)
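On the point about spaces in path names: the shell treats a space as the gap between two separate arguments, which is the usual reason an unquoted path with spaces confuses cd. As a sketch, Python 3’s standard shlex.quote function shows how such a path can be written safely. The helper name cd_command is made up for this example.

```python
import shlex


def cd_command(path):
    """Build a cd command with the path quoted for the shell."""
    return "cd " + shlex.quote(path)


if __name__ == "__main__":
    # A path containing spaces gets wrapped in single quotes.
    print(cd_command("/Visualize This/chapter 02/"))
    # A path without spaces needs no quoting and is left unchanged.
    print(cd_command("/Users/gavinwray/Documents"))
```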

So, you should now see your Terminal window showing something like this:

Screenshot of Terminal: changed directory

Now back to Yau’s instruction to run the code. Exposing my own numptiness and taking the instructions literally, I kept typing in that first dollar sign:

$ python get-weather-data.py

Turns out that you don’t type in the dollar sign yourself. It stands for the prompt that Terminal already shows at the start of the line, so you just need to type in:

python get-weather-data.py

Be patient

Page 37:

It takes a little while to run, so be patient. In the process of running, your computer is essentially loading 365 pages, one for each day of 2009. You should have a file called wunder-data.txt in your working directory when the script is done running.

The file wunder-data.txt was written to my directory immediately and the Terminal window appeared to hang, like this:

Screenshot in Terminal: get weather data

Now you really do have to be patient. The script took about 15 minutes to complete on my machine, more than enough time to think that the script wasn’t working, complain grumpily on Twitter and post the problem to Flickr as a precursor to asking for help.
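One way to make a wait like this less nerve-racking is to have a script report its progress as it goes. The sketch below is not Yau’s code: fetch_day is a stand-in that does no real downloading, and all the names are made up for illustration.

```python
def progress_message(done, total):
    """Describe how far through the job we are."""
    percent = 100 * done // total
    return "{} of {} pages loaded ({}%)".format(done, total, percent)


def fetch_day(day_number):
    """Stand-in for loading one web page; does no real work."""
    return day_number


def run(total_days=365, report_every=50):
    """Loop over every day, printing a progress line now and then."""
    for day in range(1, total_days + 1):
        fetch_day(day)
        if day % report_every == 0 or day == total_days:
            print(progress_message(day, total_days))


if __name__ == "__main__":
    run()
```

With something like this in place, a quiet Terminal window would mean a real problem rather than a slow script.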

15 minutes later, huzzah, I have a comma-delimited file with 365 rows, one for each day in 2009 (now in Google Spreadsheets) showing the maximum temperature reached that day in Birmingham.
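To check the result without opening a spreadsheet, a sketch like this counts the rows in the file. It is written for Python 3 and assumes only that wunder-data.txt is comma-delimited with one line per day, as described above. The name count_rows is made up.

```python
import csv


def count_rows(path):
    """Count the non-empty rows in a comma-delimited text file."""
    with open(path, newline="") as handle:
        return sum(1 for row in csv.reader(handle) if row)


if __name__ == "__main__":
    try:
        print(count_rows("wunder-data.txt"), "rows found")
    except FileNotFoundError:
        print("wunder-data.txt is not in the current directory")
```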

Prior to yesterday, I would never have thought I could be overjoyed at such a result. Maybe my geek kudos has just gone up a few points while see-sawing another personality rating rapidly down. Hopefully, though, what I learnt yesterday will save people trying out this exercise the confusion I went through.

