Victoria Vazquez

A data scientist with a background in wildlife and disease.

My Raspberry Pi 4 Has Arrived

My first Raspberry Pi arrived the other day, and I have been so excited to check it out. I promised my daughter I would wait until the weekend so we could open the package together. In the spirit of pop culture, we filmed ourselves opening the package and all its contents. Check out the video below. I liked the clear case I found at Vilros, so I bought everything from them. The Raspberry Pi Foundation website lists a variety of vendors too.

Our first project will be creating a wifi router to use with our home’s smart devices. I have wanted a dedicated router for the devices for over a year but kept balking at the cost of a new one. When I was introduced to the Raspberry Pi, I realized I could just make a router: the Raspberry Pi is cheap and runs free, open-source software. Creating the router appears easy based on the Instructables article Use Raspberry Pi 3 As Router, which looks like it will work for the Pi 4 as well. I’ll tell you next time if it works.

Messy Data

I have quite a bit of experience dealing with messy data. Wildlife field work is sometimes conducted with volunteers and inexperienced students who don’t always understand the importance of rigorous and consistent data collection. Cleaning up messy data sets has been a laborious and tedious process. But that was when I was working with spreadsheet software.

Python is transformative for a scientist. I have been excited to begin cleaning data sets with it for months. Yes, I did write that: excited to begin cleaning data sets.

A recent bootcamp assignment required me to clean up a data set and then produce some basic descriptive statistics. But the real purpose of the assignment was to practice cleaning messy data. The data was provided by the bootcamp as a CSV file.

Load data

First I loaded the CSV file into a pandas dataframe.

import pandas as pd

open_access_pubs = pd.read_csv('/home/Documents/filename.csv')

Which returned this error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 6: invalid start byte

This wasn’t a surprise, since the assignment instructions alluded to potential problems with encoding. I followed the bootcamp’s recommendation and read this post on Stack Overflow, which led me to David Zentgraf’s article What every programmer absolutely, positively needs to know about encodings and character sets to work with text. The article was an interesting and thorough explanation of encoding, but I still didn’t know what to do. I opened the file in Atom and saw the pound sterling symbol had been replaced by a question mark.
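To see why that particular byte chokes the UTF-8 codec, here is a minimal sketch. The byte value 0xa3 comes from the error message above; the surrounding digits are made up.

```python
# 0xa3 is the byte the UTF-8 codec complained about.
raw = b'\xa3100'

# UTF-8 treats 0xa3 as an invalid start byte and raises an error.
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    print(err)

# The same byte maps to different characters in different single-byte encodings.
print(raw.decode('latin-1'))    # pound sterling: £100
print(raw.decode('iso8859-2'))  # L with stroke: Ł100
```

This is also why the same raw file can look different depending on the encoding the editor guesses.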

Question marks replaced the pound sterling symbol in this CSV file.

I noticed a tab at the bottom of the window labeled UTF-8.

‘UTF-8’ at bottom of the Atom window.

When I clicked on the tab I saw a menu of encodings.

Encoding drop-down menu in Atom text editor.

I chose “Autodetect” and learned the file had been encoded in ISO 8859-2 (Latin Alphabet No. 2). After a quick search I added encoding='iso-8859-2' as an argument to the pandas read method.

open_access_pubs = pd.read_csv('/home/Documents/filename.csv',
                               encoding='iso-8859-2')

That did the trick: pd.read_csv no longer returned an error, and the question mark was replaced by an ‘Ł’, an ‘L’ with a stroke through it.

The pound sterling symbol is now represented by an ‘L’ with a stroke through it.

Exploratory data analysis (EDA)

Cleaning

Now on to the data. I like to start by looking at the dataframe and the data types.

display(open_access_pubs.head())
open_access_pubs.info()

The cost column name is quite unwieldy with eight words in it.

I changed the name to ‘cost’, making it easier to perform methods on the column. I changed the rest of the column names while I was at it.

open_access_pubs.columns = ['id', 'publisher', 'journal', 'article', 'cost']

Also, the ‘cost’ column is not ready to use since it’s of data type object. I split the column into two, with one column holding the monetary unit and the other holding the number. In the same step I extracted these two columns into a new dataframe called cost_split.

cost_split = open_access_pubs['cost'].str.extract(r'(\D+)(.*)')

The line above uses regular expressions (\D+ and .*) to match the pound symbol and the number and split between them. Check out RexEgg for more information about regular expressions.
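Here is a quick sketch of how that pattern behaves, using made-up cost strings rather than the bootcamp’s actual data:

```python
import pandas as pd

# Hypothetical cost strings mimicking the assignment's 'cost' column.
costs = pd.Series(['£1710.00', '£2381.04', '$2392.20'])

# (\D+) grabs the leading non-digit characters (the currency symbol);
# (.*) grabs everything after it (the number).
cost_split = costs.str.extract(r'(\D+)(.*)')
cost_split.columns = ['cost_unit', 'price']
print(cost_split)
```

Because \D+ is greedy but stops at the first digit, the currency symbol and the number land cleanly in separate columns.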

I gave names to the columns in this new dataframe.

cost_split.columns = ['cost_unit', 'price']

And then added this new dataframe to the previous one. I displayed the head to make sure combining the dataframes worked and the info to check the data type for the ‘price’ column.

pubs_parsed = pd.concat([open_access_pubs, cost_split], axis=1)
display(pubs_parsed.head())  # To make sure it worked.
pubs_parsed.info()

‘Price’ is an object so I changed it to a float for analysis.

pubs_parsed['price'] = pd.to_numeric(pubs_parsed['price'], errors='coerce')
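A small sketch of what errors='coerce' does, using made-up price strings:

```python
import pandas as pd

# Hypothetical price strings; the last one isn't a number.
prices = pd.Series(['1710.00', '2381.04', 'not stated'])

# With errors='coerce', unparseable values become NaN instead of raising.
numeric_prices = pd.to_numeric(prices, errors='coerce')
print(numeric_prices)
```

Those NaN values are exactly what the null check in the next step picks up.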

I checked for null values.

pubs_parsed.isnull().sum() * 100 / pubs_parsed.isnull().count()

Percent null values for each variable.

I dropped rows with null values in “price” only because I am interested in the average price of a journal article for each journal.

pubs_parsed_2 = pubs_parsed.dropna(subset=['price'])

There wasn’t any control over journal name entry, so the same journal could be listed multiple times. I visually inspected the journal list for duplicates. I used the value_counts method because it also serves as a baseline for later checking whether cleaning up the journal names worked. If you look at the list below, you can see that one journal is listed four times: PLOS One.

pubs_parsed.journal.value_counts()

Excerpt of the journal list and the number of articles in this database.

Pandas didn’t display the entire list of journals because the list is too long, so I searched within the CSV file for other journal name duplicates. Recently I stumbled upon the options that let you see all of the rows and columns in a dataframe, so you can be spared the tedium of scanning the CSV file.

pd.options.display.max_rows = y     # y = number of rows to display
pd.options.display.max_columns = x  # x = number of columns to display

I changed the journal names to all lowercase so that case wouldn’t affect value_counts. And I replaced ‘plosone’ with ‘plos one’.

pubs_parsed['journal'] = pubs_parsed['journal'].str.lower()
pubs_parsed.replace(to_replace='plosone',
                    value='plos one', inplace=True)
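To check the idea on a toy example (these case variants are made up, but they mirror the kind of inconsistency in the data):

```python
import pandas as pd

# Hypothetical journal-name variants that all refer to the same journal.
journals = pd.Series(['PLoS ONE', 'PLOS ONE', 'PLosONE', 'PLOS One'])

journals = journals.str.lower()                     # normalize case
journals = journals.replace('plosone', 'plos one')  # merge the no-space variant
print(journals.value_counts())  # all four rows now count as 'plos one'
```

Comparing value_counts before and after the cleanup confirms the variants were merged.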

Basic Analysis

As soon as I began the preliminary data analysis I had to return to cleaning. The describe method returns descriptive statistics for continuous variables, in this case ‘price’. The table below shows a max price for publishing an article of £999,999, which is impossible.

pubs_parsed.describe()

Today PLOS One charges US$1,595 per regular article and US$2,095 for special-issue ones. The current exchange rate is US$1 = £0.82, so US$2,095 = £1,723. Since the exchange rate fluctuates and the price may have been higher when the data was collected, I set an upper limit of £2,000.

plos_one_cleaned = plos_one_only[plos_one_only.price < 2000]
plos_one_cleaned.sort_values(by=['price'], ascending=False,
                             na_position='first')

I repeated this process with the five journals with the most publications, choosing an upper limit based on the current charges set by each journal. Then I created a table with the statistical output from the describe method for each of these journals.

# Create table from statistical output.
prolific_journals = pd.DataFrame(
    {'Journal': ['PLOS One', 'Journal of Biological Chemistry',
                 'Neuroimage', 'PLOS Pathogens', 'PLOS Genetics'],
     'No. of Articles': ['199', '53', '29', '24', '24'],
     'Mean': ['£931.91', '£1,423.59', '£2,215.17',
              '£1,590.93', '£1,643.11'],
     'Standard Deviation': ['198.01', '411.95', '266.65',
                            '183.17', '153.37'],
     'Median': ['£894.07', '£1,301.14', '£2,326.43',
                '£1,623.51', '£1,712.73']})
display(prolific_journals)
Five most prolific scientific journals in the database.

That’s it until the next time I have a quirky data set… which won’t be long. If you have alternative ways of accomplishing any of the cleaning described in this post, please share them in the comments.

Treading Water in the Deep End

I jumped feet first into the pool of data science and am treading water. It may not look pretty, but I’m getting the job done.

I have been enrolled in the Thinkful Data Science bootcamp for almost two months. I oscillate daily between excitement and frustration depending on the success of my Python code and PostgreSQL queries. I have become quite cozy with Stack Overflow, PostgreSQL Tutorial, and Kaggle datasets, and underwhelmed with the Python, Seaborn, and Matplotlib documentation, the latter likely because of my urgency to begin my new career. It is difficult to find thorough explanations of all the components of a plot that make sense to a newbie like me. Like Frankenstein, I piece together snippets of code from the documentation and around the internet, including from Thinkful’s mentors, which takes forever.

Here is an example. I obtained rain data for Australia from Kaggle to complete a bootcamp assignment on Exploring Data with Seaborn Plots. To start, I wanted to plot the maximum temperature for Sydney, Australia.

# Plot MaxTemp by date

sns.set_palette("husl") #Choose color palette.
plt.figure(figsize=(18,6)) #Set figure size.
g = sns.lineplot(x="Date_f", y="MaxTemp",
                 data=sub_weather)
plt.ylabel('Maximum Daily Temperature (Celsius)')
plt.xlabel('Date')
plt.title('Daily Maximum Temperature in Sydney, '
          'Australia')

I chose the color palette from Pydata.org’s Choosing Color Palettes documentation.

Husl color palette with eight colors from seaborn.pydata.org.

If you don’t set the figure size as I have done on line 4, the graph will be horizontally compressed, making it difficult to view the data. Below is the graph with the default width.

The plot width was not set for this plot.

I chose the figure size by trial and error. Note that this line (#4) didn’t work when placed below the lineplot code (#5 & 6).

Figure size is set to 18 wide and 6 tall.

This is the actual plotting line. sns is the standard alias for Seaborn. In the parentheses we set our desired x-axis variable (in this case “Date_f”), the y-axis variable (in this case “MaxTemp”), and the data set from which we are getting the data (in this case sub_weather). The name ‘g’ on line 5 is arbitrary; lineplot returns a matplotlib Axes object that you can assign to any variable.


On lines 7 and 8 I changed the labels of the axes to make them more explanatory, including the unit of measurement for the temperature.


And, finally, I set the title for the plot on line 9.

This plot isn’t presentation-ready yet. The titles and axis labels are tiny and there are no annotations to point out any important aspects of the plot. But I’ll cover those tweaks on another day.

Career Shift, Not Change

Since I began on my career journey, my path has been anything but straight. But the step I’m preparing to take feels like the biggest one yet. I am considering shifting from wildlife conservation to data science. Wait, scratch that hesitancy. I am switching careers.

I didn’t lose my passion for wildlife conservation. But it became clear that my brain craved problem solving over which I had some semblance of control. Organizational health, on the other hand, is a complicated, multifaceted problem that I am not in a position to suggest solutions for, let alone solve. But using data to answer questions? That I can do. I just need to go back to taking classes, this time remotely.

I am a problem solver by nature, so when I had the revelation last Wednesday that I wanted to be a data scientist, I began searching immediately for courses, certificate programs, and other resources to augment my learning. By Saturday night I was taking an introductory Python course on Udemy with two other courses in the queue. By Sunday I was listening to the Becoming a Data Scientist podcast while shopping for groceries. Now if only I had more hours in a day.
