When working with a set of digital data, a phase of data cleaning will often be necessary before the data is ready for analysis.
Examples of needs for data cleaning:
Duplicates - May occur in many ways and under different circumstances, but are a frequent side effect of automated data scraping or harvesting: having scraped a link collection from a number of websites, duplicate links are likely to occur; having harvested a number of websites, copies of the same page will often turn up (and if there are subtle changes between them, which copy should be kept?); having collected posts from social media with different sets of search criteria, the result sets may overlap; etc. A short description of duplicate removal in Excel can be found in the manual on the page for Screaming Frog SEO.
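As a minimal sketch of duplicate removal outside Excel, exact duplicates in a scraped link collection can be removed in a few lines of Python (the sample URLs below are invented for illustration):

```python
# Removing exact duplicate links from a scraped collection.
scraped_links = [
    "https://example.org/page-a",
    "https://example.org/page-b",
    "https://example.org/page-a",  # duplicate from a second scrape
    "https://example.org/page-c",
    "https://example.org/page-b",
]

# dict.fromkeys keeps the first occurrence of each link and
# preserves the original order (guaranteed from Python 3.7).
unique_links = list(dict.fromkeys(scraped_links))
print(unique_links)
```

Note that this only catches exact duplicates; near-duplicates (trailing slashes, http vs. https, subtly changed page copies) would need normalisation or manual inspection first.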
Special signs - Some textual signs may be misrepresented after the original source has been copied. This occurs because of a mismatch between character encodings and may cause trouble with special signs such as Æ æ Ø ø Å å é á ñ. For example, ñ may be changed to Ã±, so Señorita becomes SeÃ±orita. Depending on how and where you copy a URL you may also get unwanted encodings; for example cc.au.dk/en/cdmm/tools-and-tutorials may have changed to https:%2F%2Fcc.au.dk%2Fen%2Fcdmm%2Ftools-and-tutorials in your copy, %2F being the URL encoding of the forward slash (/). If you have a data set with such misrepresentations you may need to search/replace the misrepresented signs before a proper analysis can be started.
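Both kinds of misrepresentation can often be reversed programmatically rather than by hand. The sketch below assumes the most common case, UTF-8 text that was mistakenly decoded as Latin-1, and uses Python's standard library to repair it and to decode a percent-encoded URL:

```python
from urllib.parse import unquote

# Repairing common mojibake: UTF-8 bytes that were decoded as Latin-1.
# Re-encoding as Latin-1 and decoding as UTF-8 reverses the damage.
garbled = "SeÃ±orita"
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # Señorita

# Decoding a percent-encoded URL: %2F is the encoding of "/".
encoded_url = "https:%2F%2Fcc.au.dk%2Fen%2Fcdmm%2Ftools-and-tutorials"
print(unquote(encoded_url))
```

This round-trip only works when the mismatch really is UTF-8 read as Latin-1; other encoding mix-ups need the corresponding pair of encodings.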
Irrelevant data, legal or ethical issues - In some situations you may have irrelevant data that disturbs your analysis, and some columns of data may simply need deletion. In other cases you may have to go through your data set in order to delete, blur or change personally sensitive data.
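One way to blur sensitive data in free-text material, such as collected social media posts, is to replace anything that looks like personal identifiers with a placeholder. The sketch below masks e-mail addresses with a deliberately simple regular expression (the pattern and the sample post are illustrative assumptions, and the pattern will not catch every address form):

```python
import re

# Simple, illustrative pattern for e-mail-like strings.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_emails(text: str) -> str:
    """Replace anything that looks like an e-mail address with a placeholder."""
    return EMAIL_RE.sub("[email removed]", text)

post = "Contact me at jane.doe@example.com for the dataset."
print(mask_emails(post))
```

Automated masking of this kind is a first pass, not a guarantee; a manual check is usually still needed before the data can be considered safe to share or publish.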
This is not a complete list; other needs for data cleaning may arise depending on the type of data, the research question(s), the legal framework or ethical concerns.