Overview of data cleaning
Many times when you download data, it is not quite ready to use. It could be formatted wrong. There could be extra headers. The values may be inconsistent. There may be many tweaks that you want to make before using or publishing the data.
Most of the adjustments can be accomplished with a text editor or a spreadsheet. But, the time spent cutting and pasting hundreds or thousands of cells and correcting their spelling can be prohibitively time consuming. If you are a pro with spreadsheet or script programming, you can reduce the amount of time spent. However, the programs listed here will simplify the tasks and reduce that time even further.
Data can be in a variety of formats. Typically, they are in ascii format in a tab or comma separated document, or in a spreadsheet. The data cleaning programs work on text and numeric values. If your spreadsheet has formulas in the cells, they will be converted into static values. (Copy, paste as value). After you import the data into one of these programs, you select the cells or column and tell the program what global changes you want to make. The program then processes through all the data and makes the changes you specify. If you don't like the results, it is easy to undo them. You can also copy the list of actions and apply them to another set of data.
The programs are fast because they work on only just the few cells that are displayed will you work. When you are ready to export the spreadsheet, then all the actions are applied to you data at once.
The 2 programs that are available at this time (2011 Fall) are Data Wrangler and Freebase Gridworks AKA Google refine. Both DataWrangler and FreebaseGridworks/Google-refine are works in progress. They use a browser for the user interface, but don't work on all browsers. They both work best in Google Chrome and neither will work with Microsoft Internet Explorer. Firefox and Safari may work, but not as well.
The types of things you can do:
Split cells based on pattern matching of the contents. Every cell in a column can be split on the nth character, or between words that you specify. The text on the left is placed in one column and the text after the split is placed in another column.
Cut will remove characters that match a pattern from your data.
Extract portions of cells based on a pattern that you specify and place those portions in a new column.
Edit the values in a specific cell
Fill copies values from adjacent cells that matches the patter you specify.
Delete rows or columns. (Drop)
Merge the values from separate columns into a single column.
Move the values from a row into the names of the columns. (Promote)
Convert columns into extra rows and visa versa. (Fold/Unfold)
Transpose the values using special rules.
Shift cells in any direction. (Translate)
Cluster cells that have similar entries and allow you to make all cells in a cluster have the same text.
Create a column based on information that is looked up on the internet. (URL or freebase)
Data Wrangler has a visual user interface which makes it the easiest to use.
Google-refine requires the user to enter bits of code to make most things happen.
Both are faster and easier than changing the data in a spreadsheet by hand.