Unstructured Data

Unstructured data covers much of the data you will come across – from data buried in PDFs and websites to data mined from social networks – and all of it requires analysis to extract meaning. There are many tools for getting at the data – see the previous section on scraping data for a range of tools – but the Sunlight Foundation’s Text Analysis in Transparency talk is a great introduction to the world of text analysis and natural language processing.

Extracting meaning from text

Once you have your data in a nicer format you may well need to tackle the problem of pulling something meaningful out of it. Fortunately, there are a lot of good analysis and natural language processing libraries around these days that will allow you to automatically find the meaningful keywords in a body of text.
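To give a feel for what "finding the meaningful keywords" means, here is a minimal sketch that simply counts the most frequent words once common filler words are removed. It uses only the Python standard library; the stop-word list, the `top_keywords` function and the sample text are all made up for illustration, and the libraries mentioned in this section do a far more thorough job.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list (an assumption for this sketch;
# real NLP libraries ship much larger, language-specific lists).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}


def top_keywords(text, n=5):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    # Lower-case the text and pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Skip stop words and very short tokens, then count what is left.
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)


sample = (
    "The council published the budget. The budget shows spending on roads, "
    "and spending on schools is in the budget too."
)
print(top_keywords(sample, 2))  # prints [('budget', 3), ('spending', 2)]
```

Raw frequency is a blunt instrument, but during a hackathon it is often enough to see what a document is about before reaching for a full NLP toolkit.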

Natural language processing may be a bit of a heavy topic to dive into during a hackathon, but if you’re feeling brave there are a few good tutorials on the subject to get you started (if you’d like some more academic articles check this StackOverflow question).

As always, there are web-based tools – such as TextRazor and Yahoo Content Analysis – that may be able to save you the trouble of diving into code and learning too much about the theory and practice of NLP whilst time is tight.

There are a surprising number of good NLP libraries around for all of the major languages though:

Java has OpenNLP and Apache UIMA; Python has NLTK, Pattern, and TextBlob; and this StackOverflow question has a good discussion of NLP libraries for both languages.

In .NET-land the Stanford NLP Group have made parts of their software available, and SharpNLP and Abodit NLP are worth a look too.

Beyond the world of NLP you might consider going straight to a search engine that provides similar text interrogation capabilities along with a database to store your data and APIs to query it. Solr and Elastic (formerly ElasticSearch) are pretty well known in this space, and Sphinx and Constellio are worthy entries too.

Lastly, for a spot of lightweight text mining, Latent Semantic Analysis in Ruby and Simple Text Mining with R will point you in the right direction.
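As a rough illustration of lightweight text mining, the sketch below computes TF-IDF weights in plain Python. Note that this is not Latent Semantic Analysis itself: TF-IDF is a common term-weighting step, while LSA goes on to factorise a term-document matrix. The `tokenize` and `tf_idf` functions and the sample documents are hypothetical names and data chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "council budget roads",
    "council budget schools",
    "council planning parks",
]
scores = tf_idf(docs)
# The term that best distinguishes the first document from the others.
print(max(scores[0], key=scores[0].get))  # prints roads
```

A word that appears in every document ("council" here) gets a weight of zero, which is the point: TF-IDF favours terms that set one document apart from the rest.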

Visualising unstructured text

Being able to visualise unstructured information is key to making sense of it – be it a word tree of a text blob, a whole web page, or a social media feed – tools like Word Tree, Overview, and even Google Charts will help you turn out some quick visualisations. On the academic end of the spectrum the National Science Foundation has made its Jigsaw toolbox available.
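If you are curious what sits behind a word tree, the following is a minimal sketch of the underlying data structure: a nested dictionary of the words that follow a chosen root word, with counts. It is not how any of the tools above are implemented; `build_word_tree`, `print_tree` and the sample text are hypothetical, and the sketch ignores sentence boundaries for simplicity.

```python
import re


def build_word_tree(text, root, depth=3):
    """Collect the phrases that follow `root` into a nested dict of counts."""
    words = re.findall(r"[a-z']+", text.lower())
    tree = {}
    for i, word in enumerate(words):
        if word != root:
            continue
        # Walk the next `depth` words, adding one level of the tree per word.
        # Simplification: phrases may run across sentence boundaries.
        node = tree
        for follower in words[i + 1 : i + 1 + depth]:
            entry = node.setdefault(follower, {"count": 0, "children": {}})
            entry["count"] += 1
            node = entry["children"]
    return tree


def print_tree(tree, indent=0):
    """Print the tree as an indented outline, most frequent branch first."""
    for word, entry in sorted(tree.items(), key=lambda kv: -kv[1]["count"]):
        print("  " * indent + f"{word} ({entry['count']})")
        print_tree(entry["children"], indent + 1)


text = "We need better roads. We need better schools. We want lower taxes."
tree = build_word_tree(text, "we", depth=2)
print_tree(tree)
```

For the sample text this prints "need (2)" with "better (2)" nested beneath it, followed by "want (1)" with "lower (1)", which is the same branching structure a word tree visualisation draws graphically.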

Check out See Text in Whole New Way: Text Visualization Tools from Princeton University for a range of other tools.