Tuesday, July 14, 2009

Garbage In, Garbage Out: Why Scraping Doesn't Work for Local Search

As we've mentioned before, there are three main (non-exclusive) ways to collect data for local search: 1) license a database from a third party, 2) collect it yourself (I'll include user-generated content under this category), or, most commonly, 3) use a web scraper to get data from other sites.

We've tried scraping ourselves in the past (yes, it's perfectly legal), and initially populated our New York listings this way. However, the quality of the data from scraping is so poor that this has actually created more work for us than collecting the data ourselves, as we now have to go through these listings by hand and remove all the junk from our database (incorrect data, businesses that closed years ago, businesses that never existed, etc.).

A case in point: Angeno's Pizza in Maple Grove, MN (my local pizza place while growing up) closed years ago. We marked it as closed, but Google still shows it as open, as does Yahoo Local. If you can't trust the two largest sites on the internet, who can you trust?

The problems with incorrect data become clear if you take a look at a search for the address of the Pink Floyd Cafe in Prague.

At some point, someone at some local search or travel site entered the address of Pink Floyd Cafe incorrectly as Americká 18 (it's actually 16), and other sites copied this data either manually or (more likely) by way of a web scraper. This error then propagated through the internet, and now there are just as many references to the incorrect address as there are to the correct one. Anyone looking at the Google search results for "Pink Floyd Cafe" in Prague might reasonably conclude that Nelso has the wrong address. Even a sophisticated web scraper would have trouble contending with this kind of contradictory data.
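To make the problem concrete, here is a minimal Python sketch of the kind of "majority vote" a scraper might use to reconcile conflicting values. It is an illustration, not how any particular scraper works: the function name `consensus`, the 60% threshold, and the sample lists are all assumptions made up for this example.

```python
from collections import Counter


def consensus(values, threshold=0.6):
    """Return the most common value if its share of all values meets
    the threshold; return None when the sources are too evenly split.

    Hypothetical helper: the name and the 0.6 threshold are assumptions.
    """
    if not values:
        return None
    counts = Counter(values)
    value, votes = counts.most_common(1)[0]
    return value if votes / len(values) >= threshold else None


# Made-up scrape results: half the sites carry the copied error.
evenly_split = ["Americká 16", "Americká 18", "Americká 18", "Americká 16"]
print(consensus(evenly_split))  # None: the vote cannot pick a winner

# If the error is copied more often than the truth, the vote picks the error.
mostly_wrong = ["Americká 18", "Americká 18", "Americká 18", "Americká 16"]
print(consensus(mostly_wrong))  # Americká 18
```

Counting references only measures how often a value was copied, not whether it is true, so the vote either gives up or confidently returns the wrong address.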

We even fell into the same trap ourselves, and almost changed the address to #18 (our manual data entry process shows us references to the same place on other sites). However, if you look at the photo (click through for a very high-resolution version), it's clear that the address is Americká 16, not Americká 18.

The lesson we take away from this is that it's best to collect our own data, and ideally to back up that data with a photograph of the business. This of course doesn't mean that our data is always correct, or that we don't occasionally show closed businesses as still open. But because we show the date of the photos to all users of the site, we can with certainty say that Kolkovna was open for business on July 8, 2009, and at the address we display on the site.
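As a rough illustration of the idea, a listing record could carry the date of its supporting photo so that the site only claims what the photo actually proves. This is a sketch under assumptions, not a description of our actual schema: the `Listing` class, its field names, and the placeholder address are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Listing:
    """Hypothetical listing record; field names are assumptions."""
    name: str
    address: str
    photo_date: Optional[date] = None  # date the supporting photo was taken

    def verification_note(self) -> str:
        """Describe only what the photo, if any, can support."""
        if self.photo_date is None:
            return f"{self.name}: no photo on file"
        return (f"{self.name}: photographed at {self.address} "
                f"on {self.photo_date.isoformat()}")


# The address here is a placeholder, not the real one.
kolkovna = Listing("Kolkovna", "(address shown on the site)", date(2009, 7, 8))
print(kolkovna.verification_note())
```

The note deliberately says "photographed on" rather than "open": the photo date is evidence for one point in time, not a guarantee about today.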

4 comments:

Yaacov said...

I bought a mailing list once and over 20% of the businesses had bad addresses. My mailman hated me!

This time around, I'm using Amazon's Mechanical Turk to find businesses with websites and using the primary source as my data. For five to ten cents, I'm getting great data plus a description of the business.

Anonymous said...

Garbage in, garbage out is correct, but when you can find non-garbage sources for local resources, you can make a damn good site. http://www.streeteasy.com is one example.

Mick T. said...

Interesting site. However, it doesn't seem to work well with US territories such as Puerto Rico. It either gives me a map of somewhere in Texas, or I get a time-out:

http://www.nelso.com/search/?what=&where=san+juan%2C+pr%2C+US

Easybee said...

I have to say that I am a fan of scraping data. If you set it up correctly, your data will be good. I harvested all of the data for chocolate.org and created the largest directory of chocolatiers in the world. It only took me about 2 hours using Mozenda. Done right, it saves a TON of time. Just use the right tool and be careful how you implement the data.