Working with Tabular Data

Working with tabular data

At its simplest, tabular data is data that is stored in rows and columns (hence the name “tabular” i.e. in tables), either in a flat file or a database, and usually consists of simple alphanumeric values. CSV/TSV, JSON, XLS(X), and XML are some of the more common formats you’ll find tabular data in, though unfortunately it does still often appear in non machine-readable formats like PDF and DOC and most first be extracted and cleaned before being used.

Converting between data formats

There’s a good chance that you’re going to want to convert your data from the format you’ve found into something a little more modern and useful (like JSON). Mr. Data Converter is a simple web-based tool for converting from Excel, CSV, and TSV to JSON, HTML, MySQL, PHP, Python, Ruby, and more.

For batching up the conversion process to run across many datasets consider the dataconvert command-line tool developed by OKFN Labs for converting from CSV, XLS(X), and JSON to CSV.

If you need even more control consider the Python libraries pandas (which provides a whole data analysis and modelling framework as well), tablib, or any of the Science and Data Analysis libraries listed on Awesome Python.

Cleaning your data

If your data has had humans involved in entering it then it’s probably full of all sorts of small variations in how the data have been entered that you’ll need to clean up before it becomes machine-readable. Fortunately, there are a couple of great tools.

OpenRefine (formerly Google Refine) is a powerful desktop tool for cleaning messy data, transforming it between different tabular data formats, and even integrates with web services via some simple connectors so you can, for example, geocode a bunch of addresses using Google directly in OpenRefine. Check out the School of Data’s simple tutorial on using OpenRefine to see it in action.

Depending on how badly munged your data is a simple old spreadsheet application may get you most of the way to having clean data – as per the excellent A Gentle Introduction to Data Cleaning series from the School of Data. Their Cleaning Data with Spreadsheets walkthrough may also fit the bill.

If out-of-the-box tools aren’t cutting it and you need to dive into code take a look back at some of the Python libraries, like pandas, that we recommended in Converting between data formats. If you’re feeling brave take a look at dedupe, which leverages machine-learning to perform de-duplication and cleansing of data.

And if all else fails you can always fall back to reliable command-line tools like grep, awk, and sed combined with regular expressions. If you need to upskill your regex foo Debuggex and Regexpr should set you on the right path.

Analysing tabular data

So you’ve got a nice clean dataset and now you want to do some analysis on it to understand if reality matches your hypothesis!

Spreadsheets

Sometimes the simplest tools are the best and a spreadsheet is all you need – Excel is, afterall, the world’s most widely used IDE!

The School of Data has an excellent tutorial Using Excel to do precision data journalism. If you really need to get back to first principles their Data Fundamentals series would be well worth a look.

The Sunlight Foundation has a set of good videos as an intro to Data Visualisation in Google Docs which also covers analysis. And finally, check out this rundown of Excel plugins for analysing and visualising data.

Databases

When datasets get larger, or the analysis requirements get more complex, you’ll probably find yourself reaching for a database to do the heavy lifting.

The School of Data has a neat little tutorial on Using SQL for Lightweight Data Analysis that’ll get you started. If you’re playing in PostgreSQL you may find its window functions of great use to perform calculations across sets within your data.

For a deep dive on data analysis in PostgreSQL, R, and Python check out this blog post from Zev Ross.

R (Arrr!)

R provides a platform for advanced data analysis to let you discover and visualise trends even in large datasets. If you’re new to R you should start with The Guerilla Guide to R, basic statistics and graphs in R, and the official Introduction to R. To ease the learning curve check out some of the IDEs for R – RStudio, Rattle, and Deducer.

The true value of R lies in its huge array of libraries and addons, such as bigvis (visualise up to 10 million data points in mere seconds) and the big list of 10 R packages I wish I knew about earlier.

To get started with charting in R check out the handy Getting Started with Charts in R guide, Simple charts in R tutorial, or some fun putting pictures of Pokemon where their power level is on an X/Y axis.

When it comes to sharing your analysis with the world check out Knitr, for quick and easy report generation, googleVis for making R and Google Charts talk nicely, and Shiny for a full-blown web app framework for R to turn your awesome analyses into a shiny interactive web app (such as this demo).

Visualising tabular data

We’ve already touched on visualisation in a previous section on Resources for building data visualisations.