I have quite a bit of experience dealing with messy data. Wildlife field work is sometimes conducted with volunteers and inexperienced students who don’t always understand the importance of rigorous and consistent data collection. Cleaning up messy data sets has been a laborious and tedious process. But that was when I was working with spreadsheet software.

Python is transformative for a scientist. I have been excited to begin cleaning data sets with it for months. Yes, I did write that: excited to begin cleaning data sets.

A recent bootcamp assignment required that I clean up a data set and then produce some basic descriptive statistics. But the real purpose of the assignment was to practice cleaning messy data. The data was provided by the bootcamp as a csv file.

Load data

First I loaded the csv file into a Pandas dataframe.

import pandas as pd

open_access_pubs = pd.read_csv('/home/Documents/filename.csv')

Which returned this error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 
    0xa3 in position 6: invalid start byte

This wasn’t a surprise since the assignment instructions alluded to potential problems with encoding. I followed the bootcamp’s recommendation and read this post on StackOverflow, which led me to David Zentgraf’s article What every programmer absolutely, positively needs to know about encodings and character sets to work with text. The article was an interesting and thorough explanation of encoding, but I still didn’t know what to do. I opened the file in Atom and saw the pound sterling symbol had been replaced by a question mark.

Question marks replaced the pound sterling symbol in this csv file.

I noticed a tab at the bottom of the window labeled UTF-8.

‘UTF-8’ at bottom of the Atom window.

When I clicked on the tab I saw a menu of encodings.

Encoding drop-down menu in Atom text editor.

I chose “Autodetect” and learned the file had been encoded in ISO 8859-2 (Latin-2). After a quick search I added encoding='iso-8859-2' as an argument to the Pandas read_csv method.

open_access_pubs = pd.read_csv('/home/Documents/filename.csv',
                               encoding='iso-8859-2')

Which did the trick. pd.read_csv no longer returned an error, and the question mark was replaced by an ‘Ł’ (an ‘L’ with a stroke through it).

The pound sterling symbol is now represented by an ‘L’ with a stroke through it.
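If you would rather detect the encoding without leaving Python, a library such as chardet can make a similar guess from the raw bytes. This is only a sketch of an alternative to the Atom approach, not what I actually used.

import chardet

# Read the raw bytes and let chardet guess the encoding.
with open('/home/Documents/filename.csv', 'rb') as f:
    guess = chardet.detect(f.read())

print(guess)  # A dict with 'encoding' and 'confidence' keys.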

Exploratory data analysis (EDA)

Cleaning

Now on to the data. I like to start by looking at the dataframe and the data types.

display(open_access_pubs.head())
open_access_pubs.info()

The cost column name is quite unwieldy with eight words in it.

I changed the name to ‘cost’, making it easier to call methods on the column. I changed the rest of the column names while I was at it.

open_access_pubs.columns = ['id', 'publisher', 'journal', 'article', 'cost']
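If you only want to shorten the one unwieldy column, rename works too. The long column name below is a stand-in, not the actual eight-word name from the file.

# The original column name here is hypothetical; substitute the real one.
open_access_pubs = open_access_pubs.rename(
    columns={'cost in pounds charged to the funder including VAT': 'cost'})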

Also, the ‘cost’ column is not ready to use since it’s of data type object. I split the column into two, with one column containing the monetary unit and the other containing the number. In the same step I extracted these two columns into a new dataframe called cost_split.

cost_split = open_access_pubs['cost'].str.extract(r'(\D+)(.*)')

The line above uses regular expressions (\D+ and .*) to match the pound symbol and the number and split the string between them. Check out RexEgg for more information about regular expressions.
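Here is a quick toy example (the values are made up, not from the data set) of what that extract call produces:

sample = pd.Series(['£1035.50', '$2040.00'])
print(sample.str.extract(r'(\D+)(.*)'))
#    0        1
# 0  £  1035.50
# 1  $  2040.00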

I gave names to the columns in this new dataframe.

cost_split.columns = ['cost_unit', 'price']

And then added this new dataframe to the previous one. I displayed the head to make sure combining the dataframes worked and the info to check the data type for the ‘price’ column.

pubs_parsed = pd.concat([open_access_pubs, cost_split], axis=1)
display(pubs_parsed.head())  # To make sure the concat worked.
pubs_parsed.info()

‘Price’ is an object so I changed it to a float for analysis.

pubs_parsed['price'] = pd.to_numeric(pubs_parsed['price'], errors='coerce')
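As a quick illustration (again with made-up values), errors='coerce' turns anything that can’t be parsed as a number into NaN instead of raising an error:

print(pd.to_numeric(pd.Series(['1035.50', 'n/a', '999999']), errors='coerce'))
# 0      1035.5
# 1         NaN
# 2    999999.0
# dtype: float64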

I checked for null values.

pubs_parsed.isnull().sum() * 100 / pubs_parsed.isnull().count()
Percent null values for each variable.

I dropped rows with null values in “price” only because I am interested in the average price of a journal article for each journal.

pubs_parsed_2 = pubs_parsed.dropna(subset=['price'])

There wasn’t any control over journal name entry, so the same journal could be listed multiple times under slightly different names. I visually inspected the journal list to see if there were any duplicates. I used the value_counts method because it also serves as a baseline for checking later whether cleaning up the journal names worked. If you look at the list below, you can see that one journal is listed four times: PLOS One.

pubs_parsed.journal.value_counts()
Excerpt of journal list and the number of articles in this database.

Pandas didn’t display the entire list of journals because the list is too long, so I searched within the csv file for other journal name duplicates. Recently I stumbled upon the options that let you see all of the rows and columns in a dataframe, which can spare you the tedium of scanning the csv file.

pd.options.display.max_rows = y
pd.options.display.max_columns = x
#Where y is the number of rows and x the number of 
#columns you choose to display.
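A related trick, if you only need the full listing once: pandas can lift the limits temporarily instead of globally. This is just an aside, not part of my original workflow.

# Temporarily lift the display limits for this one statement.
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(pubs_parsed.journal.value_counts())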

I changed the journal names to all lowercase so case wouldn’t affect value_counts. And I replaced ‘plosone’ with ‘plos one’.

pubs_parsed['journal'] = pubs_parsed['journal'].str.lower()
pubs_parsed.replace(to_replace ="plosone", 
                 value ="plos one", inplace=True)
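To confirm the cleanup worked, the earlier value_counts call can be run again and compared against the baseline above.

# The PLOS One variants should now be counted under a single name.
pubs_parsed.journal.value_counts().head()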

Basic Analysis

As soon as I began the preliminary data analysis I had to return to cleaning. The describe method returns descriptive statistics for continuous variables, in this case ‘price’. The table below shows a max price for publishing an article of Ł999,999, which is impossible.

pubs_parsed.describe()

Today PLOS One charges US$1,595 per regular article and US$2,095 for special issue ones. The current exchange rate is US$1 = Ł0.82, so US$2,095 = Ł1,723. Since the exchange rate fluctuates and the price may have been higher when the data was collected, I set an upper limit of Ł2,000.
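The plos_one_only dataframe used below holds just the PLOS One rows; a simple filter along these lines produces it (a sketch of the step, since the exact code isn’t shown above).

# Assumption: pull the PLOS One rows out of pubs_parsed into their own dataframe.
plos_one_only = pubs_parsed[pubs_parsed['journal'] == 'plos one']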

plos_one_cleaned = plos_one_only[plos_one_only.price < 2000]
plos_one_cleaned.sort_values(by=['price'], ascending=False, na_position='first')

I repeated this process with the five journals with the most publications, choosing an upper limit based on the current charges set by each journal. Then I created a table with the statistical output from the describe method for each of these journals.

# Create table from statistical output.
prolific_journals = pd.DataFrame({
    'Journal': ['PLOS One', 'Journal of Biological Chemistry', 'Neuroimage',
                'PLOS Pathogens', 'PLOS Genetics'],
    'No. of Articles': ['199', '53', '29', '24', '24'],
    'Mean': ['Ł931.91', 'Ł1,423.59', 'Ł2,215.17', 'Ł1,590.93', 'Ł1,643.11'],
    'Standard Deviation': ['198.01', '411.95', '266.65', '183.17', '153.37'],
    'Median': ['Ł894.07', 'Ł1,301.14', 'Ł2,326.43', 'Ł1,623.51', 'Ł1,712.73']})
display(prolific_journals)
Five most prolific scientific journals in the database.
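If you would rather not type the numbers out by hand, a groupby can produce the same kind of summary directly. This is only a sketch of an alternative; note that it does not apply the per-journal price caps described above.

# Summary statistics for the five journals with the most articles.
top_five = pubs_parsed['journal'].value_counts().head(5).index
(pubs_parsed[pubs_parsed['journal'].isin(top_five)]
    .groupby('journal')['price']
    .agg(['count', 'mean', 'std', 'median']))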

That’s it until the next time I have a quirky data set…which won’t be long. If you have alternative ways of accomplishing any of the cleaning described in this post, please share them in the comments.