Pandas & Cleaning
Continued from Part One
Right, so you've done a bunch of scraping, and now you have a lot of data. Sweet, let's get our feet up and chill.
Wrong, you're summoned back in by the boss.
You could have scraped that data a bit quicker, but whatever. Now I need you to do some analysis on that data. Here's what we want to look at:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?
It would be excellent if we could get these answers straight away, but our data is nowhere near up to snuff.
There's a real handy library for dealing with lots of data all at once: it's called Pandas. Pandas lets us load our dictionary of values from the scrape into an object called a DataFrame. We can think of the dataframe like a sheet in Excel, but a lot more versatile; it makes our lives a lot easier by putting a lot of functions within easy reach. The first thing we're going to do is save the data from memory to our hard drive, then we can breathe a lot easier.
import pandas as pd
# Create a dataframe from our results
df = pd.DataFrame(data=results)
# Save the data to a .csv file
df.to_csv('indeed_scrape.csv')
From here on, we will interact with the data through our dataframe 'df'.
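If the session ever gets closed, the saved file can be pulled straight back into a dataframe. A quick sketch of the round trip, using made-up stand-in values for the real scrape results:

```python
import pandas as pd

# Hypothetical stand-in for the scrape results dictionary --
# the real one has many more fields and rows
results = {'title': ['Data Scientist', 'Data Analyst'],
           'salary': ['$90,000 a year', 'No Salary']}

df = pd.DataFrame(data=results)
df.to_csv('indeed_scrape.csv')

# Later, read the saved file straight back in; index_col=0 stops the
# saved index from turning into an extra unnamed column
df = pd.read_csv('indeed_scrape.csv', index_col=0)
```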
When we ran the scrape, if a feature wasn't present in a listing, we stored a placeholder string, for example "No Salary". Unfortunately, if we want to predict a salary or salary range, those listings give us nothing to build a model on. We are going to have to remove these rows from our dataframe.
df = df[df.salary != "No Salary"]
We can see that the salary postings vary quite a lot by city; not much luck in Pittsburgh.
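A minimal sketch of how you might count salary postings per city, assuming the scrape stored each listing's location in a column named 'city' (that name, and the values below, are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the scrape
df = pd.DataFrame({
    'city': ['New York', 'New York', 'Pittsburgh', 'Chicago'],
    'salary': ['$120,000 a year', '$95,000 a year', 'No Salary', 'No Salary'],
})

# Keep only listings that actually quote a salary, then count per city
with_salary = df[df.salary != 'No Salary']
print(with_salary['city'].value_counts())
```

With the toy data above, Pittsburgh drops out entirely, since its only posting had no salary.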
So for us to do anything with the salary information, we are going to need to strip out the extraneous text, keeping only the numerical data. We see that most of the values indicate a range, and some quote the figure by hour, month or year, so we are going to have to fix that too.
Before we get rid of the extra info, we can record some indicators in new columns: yearly, monthly, hourly. We'll use a lambda function here, but for the time being we won't delve into lambdas too much.
df['yearly'] = df['salary'].map(lambda x: 1 if 'year' in x.lower() else 0)
df['monthly'] = df['salary'].map(lambda x: 1 if 'month' in x.lower() else 0)
df['hourly'] = df['salary'].map(lambda x: 1 if 'hour' in x.lower() else 0)
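As a quick sanity check on those new columns, here's a sketch with made-up salary strings confirming that every posting lands in exactly one pay-period bucket:

```python
import pandas as pd

# Hypothetical salary strings, one per pay period
df = pd.DataFrame({'salary': ['$100,000 a year', '$8,000 a month', '$45 an hour']})

# Same lambda-map trick as above: 1 if the keyword appears, else 0
df['yearly'] = df['salary'].map(lambda x: 1 if 'year' in x.lower() else 0)
df['monthly'] = df['salary'].map(lambda x: 1 if 'month' in x.lower() else 0)
df['hourly'] = df['salary'].map(lambda x: 1 if 'hour' in x.lower() else 0)

# Each row should have exactly one of the three flags set
print((df[['yearly', 'monthly', 'hourly']].sum(axis=1) == 1).all())  # True
```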
def string_to_salary(x):
    # Some rows may already be numeric; pass them straight through
    if type(x) == int:
        return x
    # Check if salary is given as yearly
    elif 'a year' in str(x).lower():
        # Check if salary is given as a range
        if '-' in str(x):
            # Remove dollar signs from the string (str.replace returns
            # a new string, so the result must be assigned back)
            x = x.replace('$', '')