EchoDitto Blog

Free Zipcode-to-Representative Matching Database! Come and Get It!

August 17, 2006 - 5:02pm

In a previous, much crappier professional life, I worked as a programmer for the government. Most DC-area geeks do, in fact, although the work is usually so secret and/or boring that you don't hear much about it. But the federal government spends a staggering amount on IT. The only thing more astounding than the scale of the enterprise is how little direct good it does for the public.

I'll be the first to admit that not every project is a good candidate for release into citizens' hands. But there's a lot of code and data that could and should be released. It isn't.

Now that I work on the advocacy side of things, I know that one prime example is the difficulty involved in helping a user find their congresswoman. Matching a zip code to a congressional district is a pretty obvious and simple capability that the government could make available to developers for very little cost. This would presumably facilitate conversations between constituents and representatives — if you believe in representative democracy, it's pretty hard to say that this would be anything other than a good thing.

But instead, developers usually have to buy this information from a vendor, for hundreds or even thousands of dollars. To me, it seems obvious that this information ought to be free.

Fortunately, from my time in the belly of the beast I know that the government actually does make this information available... at least, sort of. There's a collection of webpages on house.gov that provide the necessary data for a given zipcode, but they relay it in a thoroughly unusable form. If you're a developer, the obvious answer is to write some scripts to chew through that output, turning it into easily digestible SQL. Then you repeat the process for every zip code that you need to match to congressional districts. As part of another project, that's exactly what I did. I figured that other people might find it useful. At the very least, the price is right.
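To make the "chew through the output and emit SQL" step a bit more concrete, here is a minimal Python sketch. It is not the actual script from the archive. It assumes the scraping has already produced (zip5, zip4, district) records, and it collapses runs of consecutive zip4 values into ranges, following the zip4min/zip4max layout described in the comments below. The table name `zip_districts`, the `zip5` and `district` column names, and the "XX-01" district labels are all made up for illustration.

```python
def collapse_to_ranges(records):
    """Collapse (zip5, zip4, district) records into contiguous zip4 ranges.

    Returns a list of (zip5, zip4min, zip4max, district) tuples.
    The record shape is an assumption, not the real scraper's output.
    """
    ranges = []
    for zip5, zip4, district in sorted(set(records)):
        last = ranges[-1] if ranges else None
        if (last and last[0] == zip5 and last[3] == district
                and zip4 == last[2] + 1):
            # Same zip5 and district, next consecutive zip4: extend the range.
            ranges[-1] = (zip5, last[1], zip4, district)
        else:
            # Anything else starts a new single-value range.
            ranges.append((zip5, zip4, zip4, district))
    return ranges


def to_sql(ranges, table="zip_districts"):
    """Render the ranges as SQL text (hypothetical table and column names).

    Plain string formatting is acceptable here only because the values come
    from our own scraper, not from user input.
    """
    lines = [
        f"CREATE TABLE {table} "
        "(zip5 TEXT, zip4min INTEGER, zip4max INTEGER, district TEXT);"
    ]
    for zip5, lo, hi, district in ranges:
        lines.append(
            f"INSERT INTO {table} VALUES ('{zip5}', {lo}, {hi}, '{district}');"
        )
    return "\n".join(lines)


# Invented sample records; real ones would come from the scraped pages.
sample = [
    ("99999", 1, "XX-01"),
    ("99999", 2, "XX-01"),
    ("99999", 3, "XX-02"),
    ("99999", 5, "XX-02"),
]
sql_text = to_sql(collapse_to_ranges(sample))
```

Storing ranges instead of one row per zip+4 keeps the table small, which is probably why the finished database fits in a couple of megabytes.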

So! If you might find this database useful, have at it. You've got two options: first, you can download the scripts and use them to recreate the database. But the house.gov people probably wouldn't like that very much, and I'd hate to have them shut the door on this valuable data. Besides, it takes several hours to spider the necessary information.

The other option is to visit our charmingly ad-hoc bittorrent tracker and download the whole database. That archive includes the scripts, too, so that you can rebuild the database when the spiritual descendants of Tom DeLay inevitably gerrymander us further into oblivion.

Enjoy! Now if we could just get the postal service to loosen their restrictions on zip+4 matching...

UPDATE: As pointed out in comments, it's slightly silly of me to have offered this via a torrent. Besides, the initial traffic has now died down. So: if you need the larger database file, you can download it here.

UPDATE 2: Since I'm still getting occasional emails about this entry, I thought I'd add a note about where everything stands. Unfortunately this data is for the 109th Congress, and the 110th altered the site format in a way that broke the screen-scraping scripts. So this database is outdated and probably won't be much use to you.

But there's good news! The Sunlight Foundation is now offering this functionality via a free API. You can find information about it here. Good luck!

( categories: Open Source )

Fantastic! This is surely going to be a huge asset for open-source online advocacy! Thanks so much for releasing this to the world, Tom.

Submitted by Jon Stahl on August 18, 2006 - 12:51pm.

Very cool...look forward to checking that out...

Tim

Submitted by TIm C on August 18, 2006 - 1:00pm.

Thanks! This is great.

Other than the obvious fun challenge of running your own charmingly ad-hoc bittorrent tracker, why use BT for this? It's only a 2 MB file...

I mean, if it were 50 MB, sure... but 2? We all put up video files bigger than that all the time...

Not complaining, just wondering.

Submitted by Kari Chisholm on August 18, 2006 - 2:55pm.
Submitted by Todd on August 18, 2006 - 5:10pm.

Kari,

Mostly for the reason you surmised. I wanted an excuse to get a BT tracker up on one of our machines (although I was disappointed to see that the last BlogTorrent release has a broken Mac client).

But the other reason is an overabundance of caution -- we were burned on a big bandwidth bill not too long ago. Two megabyte downloads for a developer-centric archive aren't likely to add up to much, but why risk it? In hindsight I might as well have just used the Coral CDN, though.

Submitted by Tom on August 18, 2006 - 5:38pm.

What if you could upload this data (in an appropriate format) into www.civicfootprint.org and then anyone could geocode addresses against it using the civicfootprint API?

Imagine how many other people have TIGER/Census data and data about local city council districts. If they were to upload that data and get it back through a geocoding utility, that would be pretty tremendous. Eventually you could generate a community-driven and maintained resource for geocoding most political boundaries in the US.

Anyone interested in contributing some coding time to this effort, we are putting together a team of volunteers to make the necessary modifications to the civicfootprint code. Contact me at dgeilhufe ATT civicspacelabs DOTT org.

Submitted by David Geilhufe on August 19, 2006 - 12:06pm.

hey tom:

this is awesome news. much appreciated. we'll figure out how to integrate and use within CiviCRM shortly :)

a stupid question: i've got the bittorrent client up and running, however it does not make any progress with this torrent file. has anyone else successfully downloaded it?

lobo

Submitted by Donald Lobo on August 20, 2006 - 10:51am.

Tom, thanks - that's cool. The only problem is the one that Donald identifies... if there's no one seeding the file, the torrent goes nowhere. I've been trying to download for almost two days now -- and while I've got between 3 and 12 peers, we're not making any progress because there's no seed.

Submitted by Kari Chisholm on August 20, 2006 - 6:59pm.

Huh! That's odd. My apologies -- blogtorrent is supposed to provide a guaranteed seed, but is failing to do so for some reason. Unfortunately, I'm unable to get to a bt-ready machine at the moment, but will begin seeding the file within a few hours.

Submitted by Tom on August 20, 2006 - 7:59pm.

bless you. this is stupendous.

i love the idea of it getting integrated into civicfootprint or something similar. and i can't wait to see it in civicrm.

i am going to be looking at it right away for our darfur scorecard.

Submitted by ivan on August 20, 2006 - 8:50pm.

Tom:

Thanx for the quick fix, i've got the file on my laptop. now onto figuring out routers, firewall and the NAT error :)

Once again thanx for doing this. we'll figure out how to use it in CiviCRM land in the next release :)
lobo

Submitted by Donald Lobo on August 20, 2006 - 11:48pm.

Tom:

Thank you so much! This is a huge breakthrough for open-source online advocacy. Your hard work is very much appreciated.

Lobo:

Thank you, too! As you know, we've been interested in having this sort of functionality with CiviCRM for a long time. We even looked into spending the funds to purchase one of the databases Tom mentioned. Having finally completed our migration to CiviCRM, we can't wait to see this implemented (particularly in Joomla). This is great news!!

Submitted by Devin Burghart on August 21, 2006 - 1:26am.

Tom,

Excellent work! Thanks for sharing this with the community, too. Very cool.

How would you compare this data set to GovTrack's free Zip-to-district data set? Are there data you're grabbing that they missed?

Submitted by Jason Lefkowitz on August 21, 2006 - 7:24pm.

Jason: to be honest, I wasn't familiar with GovTrack's offerings until just now. From taking a quick look at their site, it appears that they list the +4 portion of the zipcode as a delimited string. Obviously you could process this further, but out of the box, it seems likely that queries against it will offer poorer performance. The database on offer here uses zip4min and zip4max columns, defining a range for each row -- which represents a block of one or more contiguous zip codes (same zip5, different zip4s) within a congressional district.

Additionally, GovTrack says their data isn't as good as the commercial DBs. I don't know what shortcomings they're referring to -- maybe they're just being modest. I really couldn't say how this package compares, but I believe it's complete for the zip codes that were queried (specified in a text file in the tarball). And of course, if you find missing zip codes, you can use the scripts to grab data for them that is as good as what the House of Representatives' own web folks use.
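For anyone wondering what a lookup against the zip4min/zip4max layout looks like, here is a rough Python sketch using the standard sqlite3 module. Only the `zip4min` and `zip4max` column names come from the actual database. The table name `zip_districts`, the `zip5` and `district` columns, the index, and the sample rows are assumptions made up for the demo, so adjust them to whatever the real schema uses.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with invented rows (not real district data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE zip_districts "
        "(zip5 TEXT, zip4min INTEGER, zip4max INTEGER, district TEXT)"
    )
    # An index on zip5 keeps each lookup to the handful of rows for one zip5.
    conn.execute("CREATE INDEX idx_zip5 ON zip_districts (zip5)")
    conn.executemany(
        "INSERT INTO zip_districts VALUES (?, ?, ?, ?)",
        [
            ("99999", 1, 2500, "XX-01"),
            ("99999", 2501, 9999, "XX-02"),
        ],
    )
    return conn


def find_district(conn, zip5, zip4):
    """Return the district whose zip4 range contains zip4, or None."""
    row = conn.execute(
        "SELECT district FROM zip_districts "
        "WHERE zip5 = ? AND ? BETWEEN zip4min AND zip4max",
        (zip5, zip4),
    ).fetchone()
    return row[0] if row else None


conn = build_demo_db()
district = find_district(conn, "99999", 3000)
```

The range comparison is ordinary integer arithmetic, so the database never has to split or scan a delimited string at query time.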

Submitted by Tom on August 21, 2006 - 9:20pm.

McCain: What up with that, Nico...

Submitted by Anonymous on August 24, 2006 - 12:03am.

This is fantastic! I'm the author of an open-source online petition software that's in use by a couple of websites, and I've been searching for something like this for a while now so that I can improve my software and make it more usable by the general public.

Thank you!!!

Submitted by adam on August 25, 2006 - 11:39am.

I'm sure McCain's campaign will love this too.

Submitted by Matt Taylor on August 25, 2006 - 3:58pm.

Torrent seems dead. No seeders or peers listed by my client, though your tracker seems to think there are 14 of each...

Submitted by Josh Koenig on September 6, 2006 - 6:26pm.