Scraping

Scraping data from PDFs and the web

So, somebody gave you a scanned photocopy of a document as a PDF? Or a website has some great data, but it’s hidden behind an awful JavaScript-heavy interface? No fear – there are some great tools at your disposal to scrape that data and get it into a nicely machine-readable format.

As always, the School of Data has an excellent series on the ins and outs of extracting data from PDFs and scraping websites – A gentle Introduction into Extracting Data – with many useful recommendations of the best tools to use for the job.

tl;dr? Well, there are a few standout tools…

Tabula is getting a lot of notice for making the process of extracting tabular data from PDFs a (relative) breeze. Download, install, point it at some PDFs and it’ll extract any tabular data in them to a nicely machine-readable CSV or XLS file for you. For a more in-depth look, read through Introducing Tabula (Source News).
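
If you’d rather script that workflow than click through the GUI, the separate tabula-py project wraps Tabula’s extraction engine for Python – note this is a community wrapper, not part of the Tabula app itself, and it needs Java installed. A minimal sketch, with a made-up input filename:

```python
# pip install tabula-py  (drives Tabula's Java engine, so Java is required)
import tabula

# Pull every table in the PDF into a list of pandas DataFrames.
tables = tabula.read_pdf("report.pdf", pages="all")  # "report.pdf" is hypothetical
print(tables[0].head())

# Or skip pandas entirely and dump straight to CSV.
tabula.convert_into("report.pdf", "report.csv", output_format="csv", pages="all")
```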

Apache Tika, the elder statesman of the PDF-extraction world, is great for extracting text and metadata from a pile of document formats (PDF, XLS, PPT, …) – even PDFs containing text in scanned images. Useful, the Practical Data Journalism blog, has a good walkthrough: Getting Text Out Of Anything (docs, PDFs, Images) Using Apache Tika.
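
Tika itself is a Java library, but the tika package on PyPI wraps it behind a simple parser interface if you want to call it from Python (it spins up a local Tika server under the hood, so Java is required; OCR of scanned images additionally needs Tesseract installed). A quick sketch with a hypothetical filename:

```python
# pip install tika  (runs a local Tika server on first use; Java required)
from tika import parser

parsed = parser.from_file("annual_report.pdf")  # filename is hypothetical
print(parsed["metadata"])  # document metadata: content type, author, dates, ...
print(parsed["content"])   # the extracted plain text
```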

Worth a mention as well is PDF Tables, a web-based tool from the folks behind ScraperWiki that pretty much does what it says on the box – pulls tabular data out of any PDFs you provide.
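
PDF Tables also offers an API with a Python client, pdftables-api. The snippet below follows the pattern in that client’s README, but treat the method names as assumptions to verify against their docs – you’ll need an API key either way:

```python
# pip install pdftables-api  (assumed client API; check the PDF Tables docs)
import pdftables_api

client = pdftables_api.Client("my-api-key")   # the key is a placeholder
client.csv("statement.pdf", "statement.csv")  # filenames are hypothetical
```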

On the website-scraping end of the equation there are a few desktop and web-based tools around – import.io, UiPath (free trial), and 80legs – but sometimes you just need to write code to do it properly.
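
“Write code” can mean as little as a dozen lines: fetch a page with requests, parse it with BeautifulSoup, write out CSV. A minimal sketch – the URL and the a.report selector are made up for illustration:

```python
# pip install requests beautifulsoup4
import csv

import requests
from bs4 import BeautifulSoup

resp = requests.get("http://example.com/reports")  # hypothetical URL
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

with open("reports.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])
    for link in soup.select("a.report"):  # hypothetical CSS selector
        writer.writerow([link.get_text(strip=True), link["href"]])
```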

Morph.io, which arose out of the demise of ScraperWiki Classic, offers a lightweight scraping framework (Python, PHP, Ruby, or Perl) and a whole web platform and community around scrapers (think Heroku for web scraping).
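
Scrapers on morph.io are ordinary scripts that save rows into a local SQLite database; in Python the convention is to use the scraperwiki library. A rough sketch – the URL and table selector are invented for illustration:

```python
# pip install scraperwiki lxml cssselect
import scraperwiki
import lxml.html

html = scraperwiki.scrape("http://example.com/members")  # hypothetical URL
doc = lxml.html.fromstring(html)

for row in doc.cssselect("table#members tr"):  # hypothetical selector
    cells = [td.text_content().strip() for td in row.cssselect("td")]
    if cells:
        # Rows land in data.sqlite, which morph.io then serves via its API.
        scraperwiki.sqlite.save(unique_keys=["name"],
                                data={"name": cells[0], "party": cells[1]})
```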

In Python-land there’s Scrapy – a neat framework for extracting data from the web, with a strong community and an easily extensible codebase. You can think of Scrapy as the next level up from libraries like BeautifulSoup and lxml (which excel at parsing HTML and XML) in that it adds higher-level scraping concepts like spiders, selectors, and items.
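
A taste of those higher-level concepts: a complete Scrapy spider is just a class with a name, some start URLs, and a parse callback that yields items and follow-up requests. This one targets quotes.toscrape.com, the demo site Scrapy’s own tutorial uses:

```python
# pip install scrapy; run with: scrapy runspider quotes_spider.py -o quotes.csv
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Selectors pull items out of the page...
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # ...and the spider crawls onward by yielding follow-up requests.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```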

Likewise in Python, Scrapekit is well worth a look, bundling a range of advanced features such as caching, multi-threading, and logging.
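
A rough sketch of the Scrapekit style, adapted from the task pattern in its documentation – treat the details (the .html() response helper, the task queueing API) as assumptions to verify against the current docs:

```python
# pip install scrapekit  (assumed API, based on scrapekit's documented task pattern)
from urllib.parse import urljoin

import scrapekit

scraper = scrapekit.Scraper("demo")  # caching, threading and logging hang off this


@scraper.task
def scrape_page(url):
    doc = scraper.get(url).html()  # assumed: responses expose an lxml .html() helper
    print(url, doc.findtext(".//title"))


@scraper.task
def scrape_index(url):
    doc = scraper.get(url).html()
    for link in doc.findall(".//a"):
        # assumed: .queue() hands each URL to scrapekit's worker threads
        scrape_page.queue(urljoin(url, link.get("href")))


scrape_index.run("http://example.com/")  # hypothetical starting URL
```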

This Quora thread has good suggestions for scraping frameworks in a variety of languages.