Data Cleaning Introduction with Python
Data Cleaning: a four-step process.
1. Finding missing data and fixing it
2. Finding invalid values and fixing them
3. Finding and dealing with duplicates
4. Splitting and cleaning text columns
Fundamentals
import numpy as np
import pandas as pd

pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))
The code above will return something like:
|   | Column A | Column B | Column C |
|---|----------|----------|----------|
| 0 | False    | True     | True     |
| 1 | True     | False    | False    |
| 2 | False    | False    | True     |
So the lesson is: pd.isnull() accepts either a Series or a DataFrame.
Pandas Operations with Missing Values
nans will no longer behave as "viruses" in pandas: instead of propagating through every calculation, operations simply ignore them.
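For example (a quick sketch using the same s that appears later in this post; it assumes the numpy/pandas imports above):

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

s.sum()    # 10.0 - the two nans are skipped
s.mean()   # 2.5  - mean of the four valid values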
Filtering missing data
pd.notnull(s).sum()  # counts the non-null fields; s.count() does the same
How to show all the not-null data?
s[pd.notnull(s)]
or,
s[s.notnull()]
Dropping null values in a DataFrame
How to drop all the rows containing any null value:
df.dropna()
# drops every row in which any null value is present
df.dropna(axis=1)
# drops every column in which any null value is present
How to drop rows only if all of their values are null:
df.dropna(how="all")
How to drop a row if any of its values are null:
df.dropna(how="any")
# default behavior
You can also use the thresh parameter to indicate a threshold (a minimum number) of non-null values for the row/column to be kept:
df.dropna(thresh=3)  # keep only rows with at least 3 non-null values
df.dropna(thresh=3, axis='columns')  # keep only columns with at least 3 non-null values
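Here's a quick sketch of these behaviors (the DataFrame values are made up for illustration):

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
})

df.dropna()          # keeps only row 2, the only row with no nulls
df.dropna(thresh=2)  # keeps rows 0, 1 and 2, which each have at least 2 non-null values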
Filling null values
Sometimes instead of dropping the null values, we might need to replace them with some other value. This highly depends on your context and the dataset you're currently working with. Sometimes a nan can be replaced with a 0, sometimes with the mean of the sample, and other times with the closest value. Again, it depends on the context. We'll show you the different methods and mechanisms and you can then apply them to your own problem.
Let's say we have a Series called s:
s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
If you print s, you will see two null values. We can replace them with 1000 or any other arbitrary value:
s.fillna(1000)  # pass whatever fill value fits your context
Filling null values with the mean of the series:
s.fillna(s.mean())
Filling null values with the value that comes after them (a backward fill):
s.fillna(method="bfill")
Filling null values with the value that comes before them (a forward fill):
s.fillna(method="ffill")
(Newer pandas versions prefer s.bfill() and s.ffill(), since the method parameter has been deprecated.)
# these methods have a drawback: a null at the very start (for ffill) or the very end (for bfill) has nothing to copy from. Run these to see it:
pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')
pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')
Therefore, when missing data sits at the beginning or the end of the series, a directional fill alone leaves nulls behind; a value-based fill such as s.fillna(s.mean()) covers those cases.
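For instance, one way to cover edge nulls (a sketch; s2 is a made-up example series) is to chain a directional fill with a value-based one:

s2 = pd.Series([np.nan, 3, np.nan, 9])

s2.fillna(method='ffill')                    # the leading nan stays: nothing comes before it
s2.fillna(method='ffill').fillna(s2.mean())  # the leftover edge null gets the mean (6.0)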
How to fill the null fields of a column with that column's mean (or with an arbitrary value)?
Let's say you have the following DataFrame:
df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})
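To fill each column's nulls with that column's mean, a minimal sketch using the df above: df.mean() returns per-column means (nans skipped), and fillna aligns them by column label.

df.fillna(df.mean())

# or one column at a time:
df['Column A'].fillna(df['Column A'].mean())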
How to check if there is any missing value or null?
missing_values = len(s.dropna()) != len(s)
missing_values
# s.dropna() is shorter than s exactly when nulls are present, so we check whether its length differs from len(s)
s.isnull().any()  # this does the same check
s.isnull().values.any()  # checks the underlying numpy array; useful for collapsing a whole DataFrame to one boolean
Alternative code:
missing_values = s.count() != len(s)  # s.count() works like len(s.dropna()): both return the number of non-null values
missing_values
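The same checks scale up to a whole DataFrame (a quick sketch using the df defined above):

df.isnull().any()         # one boolean per column
df.isnull().any().any()   # a single boolean for the whole frame
df.isnull().values.any()  # same idea, via the underlying numpy array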
Cleaning not-null values
Always start with these imports:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# because you never know when you need what
# let's make a DataFrame
df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})
df
Finding Unique Values
The first step in cleaning invalid values is to notice them, then identify them, and finally handle them appropriately (remove them, replace them, etc.). Usually, for a "categorical" type of field (like Sex, which only takes values from a discrete set such as 'M' and 'F'), we start by analyzing the variety of values present. For that, we use the unique() method:
How to see the unique values that exist:
df['Sex'].unique()  # array(['M', 'F', 'D', '?'], dtype=object)
How to replace values in a particular column?
df['Sex'].replace({'D': 'F', 'N': 'M'})  # returns a new Series; assign it back to keep the change
How to replace values across several columns at once?
df.replace({
    'Sex': {       # keys of the outer dict are column names
        'D': 'F',
        'N': 'M'
    },
    'Age': {       # each inner dict maps old value to new value
        290: 29
    }
})  # a dictionary of dictionaries
I explicitly replaced 290 with 29 (assuming it was just an extra 0 entered at the data-entry phase). But what if you'd like to remove all the extra 0s from the age column (for example, 150 → 15, 490 → 49)?
The first step would be to set the limit of the "not possible" age. Is it 100? 120? Let's say that anything above 100 isn't credible for our dataset. We can then combine boolean selection with the division operation:
Let's see how many rows have this extra-0 mistake:
df[df['Age'] > 100]
Now let's fix them:
df.loc[df['Age'] > 100, 'Age'] = df.loc[df['Age'] > 100, 'Age'] / 10
# dividing the ages above 100 by 10 turns a superhuman age into a plausible one
Finding and dealing with duplicates
Checking duplicate values is extremely simple. It'll behave differently between Series and DataFrames.
Duplicates in Series:
Let's start with Series. As an example, say we're throwing a fancy party and inviting ambassadors from Europe, but we can only invite one ambassador per country. This is our original list; as you can see, both the UK and Germany have duplicated ambassadors:
ambassadors = pd.Series([
    'France',
    'United Kingdom',
    'United Kingdom',
    'Italy',
    'Germany',
    'Germany',
    'Germany',
], index=[
    'Gérard Araud',
    'Kim Darroch',
    'Peter Westmacott',
    'Armando Varricchio',
    'Peter Wittig',
    'Peter Ammon',
    'Klaus Scharioth'
])
The two most important methods for dealing with duplicates are duplicated() (which tells you which values are duplicates) and drop_duplicates() (which just gets rid of them):
ambassadors.duplicated()
By default, it considers the first occurrence of a value the original and marks the second one as the duplicate. You can change this behavior with the keep parameter:
ambassadors.duplicated(keep='last')
# this treats the last occurrence as the original; all earlier ones are flagged as duplicates
A similar method is drop_duplicates, which just excludes the duplicated values and also accepts the keep parameter:
ambassadors.drop_duplicates()
Just like duplicated(), by default it keeps the first instance as the original and drops the rest of the recurring values.
You can also change the default behavior:
ambassadors.drop_duplicates(keep='last')
ambassadors.drop_duplicates(keep=False)  # keeps only the values that were never repeated
Duplicates in DataFrames:
Conceptually speaking, duplicates in a DataFrame happen at "row" level. Two rows with exactly the same values are considered to be duplicates:
Let's make a DataFrame:
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'LeBron James',
        'Kobe Bryant',
        'Carmelo Anthony',
        'Kobe Bryant',
    ],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF'
    ]
})
In the DataFrame we clearly see that Kobe is duplicated, but he appears with two different positions. What does duplicated() say?
players.duplicated()
"duplicated" means "all the column values should be duplicates". We can customize this with the subset
parameter:
players.duplicated(subset=['Name'])
How to drop rows that are duplicated in a given column?
players.drop_duplicates(subset=['Name'])
Splitting Columns
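The examples below assume a DataFrame with a single 'Data' column holding underscore-separated fields (year, sex, country, number of children). The values here are made up to illustrate the pattern, including the stray '?' and space that the later examples clean up:

df = pd.DataFrame({
    'Data': [
        '1987_M_US_1',
        '1990?_M_UK_1',
        '1992_F_US _2',  # note the stray space in the country field
        '1970?_M_IT_1',
    ]
})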
The basic split:
df['Data'].str.split('_')
How to create a DataFrame out of that?
df['Data'].str.split('_', expand=True)
How to set the column names?
First save the new DataFrame:
df = df['Data'].str.split('_', expand=True)
Then set the columns:
df.columns = ['Year', 'Sex', 'Country', 'No Children']
How to check if a column contains something?
df['Year'].str.contains(r'\?')
df['Country'].str.contains('U')
How to remove a space or a special character from a field?
df['Country'].str.replace(' ', '')
df['Year'].str.replace(r'(?P<year>\d{4})\?', lambda m: m.group('year'), regex=True)  # regex=True is required when passing a function
My preferred way of doing this:
df['Year'] = df['Year'].str.replace(r'\?', '', regex=True)