Data Cleaning Examples

When working with a set of digital data, a phase of data cleaning is often necessary before the data is ready for analysis.

Examples of situations that call for data cleaning:

Duplicates - May occur in many ways and under different circumstances, but are a frequent side effect of automated data scraping or harvesting. For example, a link collection scraped from a number of websites is likely to contain duplicate links; a harvest of a number of websites will often contain several copies of the same page (and if there are subtle differences between the copies, which one should be kept?); and posts collected from social media with different sets of search criteria may overlap. A short description of duplicate removal in Excel can be found in the manual on the page for Screaming Frog SEO.
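As an illustration of the scraped-link case above, the following is a minimal Python sketch of duplicate removal. The normalization rules (lower-casing the host, ignoring a trailing slash and the fragment) and the function names normalize_url and remove_duplicates are assumptions made for this example; which differences count as "the same link" depends on your data and research question.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Build a comparison key for a URL (hypothetical rules, adjust as needed)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()          # host names are case-insensitive
    path = parts.path.rstrip("/") or "/"   # treat /page and /page/ as the same
    # The fragment (#...) is dropped because it does not identify a different page.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def remove_duplicates(urls):
    """Keep the first occurrence of each link, preserving the original order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


links = [
    "https://cc.au.dk/en/cdmm/tools-and-tutorials",
    "https://CC.AU.DK/en/cdmm/tools-and-tutorials/",
    "https://cc.au.dk/en/cdmm/tools-and-tutorials#top",
    "https://cc.au.dk/en/cdmm",
]
unique_links = remove_duplicates(links)
print(unique_links)
```

Keeping the first occurrence is only one possible policy; for harvested pages with subtle changes you may instead want to keep the most recent copy.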

Special characters - Some characters may be misrepresented after the original source has been copied. This happens when text is saved in one character encoding and read in another, and it typically affects non-ASCII characters such as Æ æ Ø ø Å å é á ñ. For example, ñ may be changed to Ã±, so that Señorita becomes SeÃ±orita. Depending on how and where you copy a URL you may also get unwanted encodings; for example, https://cc.au.dk/en/cdmm/tools-and-tutorials may have changed to https:%2F%2Fcc.au.dk%2Fen%2Fcdmm%2Ftools-and-tutorials in your copy, %2F being the URL encoding of the forward slash (/). If your data set contains such misrepresentations, you may need to search for and replace the affected characters before a proper analysis can begin.
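The two repairs described above can be sketched in Python as follows. The sketch assumes that the garbled text is UTF-8 that was mistakenly read as Latin-1, which is a common cause but not the only one; the helper name repair_mojibake is made up for this example. The URL is decoded with unquote from the standard library.

```python
from urllib.parse import unquote


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Latin-1 (assumed cause).

    If the text cannot be converted back, it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Written with escapes so the example survives copying; displays as "SeÃ±orita".
garbled = "Se\u00c3\u00b1orita"
repaired = repair_mojibake(garbled)
print(repaired)

# Percent-encoded URL from the example above.
encoded_url = "https:%2F%2Fcc.au.dk%2Fen%2Fcdmm%2Ftools-and-tutorials"
decoded_url = unquote(encoded_url)
print(decoded_url)
```

Text that is already correct is left alone, because converting it back fails and the original string is returned. If the data was garbled through a different pair of encodings, the names "latin-1" and "utf-8" would need to be changed accordingly.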

Irrelevant data, legal or ethical issues - In some situations your data set may contain irrelevant data that can disturb your analysis, and some columns may simply need to be deleted. In other cases you may have to go through your data set in order to delete, blur or change sensitive personal data.
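A minimal Python sketch of both operations, deleting an irrelevant column and replacing sensitive personal values, could look like this. The sample data, the column names, the SALT value and the helper names are all hypothetical. Replacing values with a salted hash is only one way of "changing" the data (pseudonymization); whether it is sufficient depends on the legal framework and ethical concerns that apply to your project.

```python
import csv
import hashlib
import io

# Hypothetical sample data; the column names are assumptions for illustration.
raw_csv = """name,email,post,internal_id
Anna Jensen,anna@example.org,Great article!,17
Bo Larsen,bo@example.org,I disagree.,18
"""

DROP_COLUMNS = {"internal_id"}            # irrelevant for the analysis
PSEUDONYMIZE_COLUMNS = {"name", "email"}  # sensitive personal data
SALT = "replace-with-a-secret-value"      # assumption: kept outside the data set


def pseudonym(value: str) -> str:
    """Replace a value with a short, stable code derived from a salted hash."""
    digest = hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()
    return digest[:12]


def clean_rows(rows):
    """Drop irrelevant columns and pseudonymize sensitive ones."""
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if column in DROP_COLUMNS:
                continue
            if column in PSEUDONYMIZE_COLUMNS:
                value = pseudonym(value)
            new_row[column] = value
        cleaned.append(new_row)
    return cleaned


rows = list(csv.DictReader(io.StringIO(raw_csv)))
cleaned_rows = clean_rows(rows)
for row in cleaned_rows:
    print(row)
```

Because the same input always gives the same code, posts by the same person can still be grouped in the analysis without the name or e-mail address being visible.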

This is not a complete list; other needs for data cleaning may arise depending on the type of data, the research question(s), the legal framework or ethical concerns.