Thursday, 20 May 2021

Data Cleaning Introduction with Python


Code Source

     Data Cleaning: a four-step process.


1. Finding missing data and fixing it

2. Fixing invalid values

3. Finding and dealing with duplicates

4. Splitting columns




Fundamentals

import numpy as np
import pandas as pd

pd.isnull(pd.DataFrame({
    'Column A': [1, np.nan, 7],
    'Column B': [np.nan, 2, 3],
    'Column C': [np.nan, 2, np.nan]
}))


The code above returns something like:

   Column A  Column B  Column C
0     False      True      True
1      True     False     False
2     False     False      True

So the lesson is: pd.isnull() accepts either a Series or a DataFrame.

pd.isnull(obj)  # obj can be a Series or a DataFrame
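
For instance, the same call works on a plain Series (values here are just illustrative):

pd.isnull(pd.Series([1, np.nan, 7]))
# 0    False
# 1     True
# 2    False
# dtype: bool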


Pandas Operations with Missing Values


Pandas manages missing values more gracefully than NumPy. NaNs will no longer behave as "viruses"; most operations simply skip them:

pd.Series([1, 2, np.nan]).count()
# 2

pd.Series([2, 2, np.nan]).mean()
# 2.0
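
For comparison, here is a quick sketch of why this matters; with a plain NumPy array a single NaN contaminates the aggregate, while pandas skips it by default:

np.array([1, 2, np.nan]).sum()   # nan  -> the NaN "infects" the whole result
pd.Series([1, 2, np.nan]).sum()  # 3.0  -> pandas ignores the NaN by default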


Filtering missing data

How to count the null values in a DataFrame / Series?

Let's make a Series first (a Series can represent a single column of a DataFrame):

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])
pd.isnull(s).sum()
# 2




How to count the not-null values in a DataFrame / Series?

pd.notnull(s).sum()

# this counts the non-null values; s.count() gives the same result

How to show all the not-null data?

s[pd.notnull(s)]

or, 

s[s.notnull()]

Dropping null values in a DataFrame
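
The df.dropna() examples below need a DataFrame to operate on. As a sketch, assume the same small DataFrame that is used later in the filling section:

df = pd.DataFrame({
    'Column A': [1, np.nan, 30, np.nan],
    'Column B': [2, 8, 31, np.nan],
    'Column C': [np.nan, 9, 32, 100],
    'Column D': [5, 8, 34, 110],
})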

How to drop all the rows containing any null value:

df.dropna()
# the default behavior drops every row in which any null value is present

df.dropna(axis=1)
# this drops every column that contains at least one null value


How to drop a row only if all of its values are null:

df.dropna(how="all")

How to drop a row if any of its values is null:

df.dropna(how="any")  # this is the default behavior

You can also use the thresh parameter to indicate a threshold (a minimum number) of non-null values for the row/column to be kept:


df.dropna(thresh=3)  # keep only the rows that have at least 3 non-null values

df.dropna(thresh=3, axis='columns')  # keep only the columns that have at least 3 non-null values
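
With the sketch DataFrame above, the rows contain 3, 3, 4 and 2 non-null values respectively, and 'Column A' is the only column with fewer than 3 non-null values, so the thresh calls behave roughly like this:

df.dropna(thresh=3)                  # keeps rows 0, 1 and 2; row 3 has only 2 valid values
df.dropna(thresh=3, axis='columns')  # drops 'Column A' (2 valid values); keeps B, C and D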


Filling null values

Sometimes, instead of dropping the null values, we might need to replace them with some other value. This depends heavily on your context and the dataset you're currently working with. Sometimes a NaN can be replaced with a 0, sometimes with the mean of the sample, and other times with the closest value. Again, it depends on the context. We'll show you the different methods and mechanisms, and you can then apply them to your own problem.

Let's say we have a Series called s:

s = pd.Series([1, 2, 3, np.nan, np.nan, 4])

If you print s, you will see two null values. We can replace them with 1000 or any other arbitrary value:

s.fillna(1000)  # pass the replacement value as the argument


Filling null values with the mean of the column:

s.fillna(s.mean())

Filling a null value with the next valid value after it (backward fill):

s.fillna(method="bfill")

Filling a null value with the last valid value before it (forward fill):

s.fillna(method="ffill")


These method-based fills also have a drawback. To see it, run the following:

pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')

pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')
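
Roughly what those two calls return; a leading null has nothing before it to forward-fill from, and trailing nulls have nothing after them to back-fill from:

pd.Series([np.nan, 3, np.nan, 9]).fillna(method='ffill')
# 0    NaN   <- nothing before it, stays null
# 1    3.0
# 2    3.0
# 3    9.0

pd.Series([1, np.nan, 3, np.nan, np.nan]).fillna(method='bfill')
# 0    1.0
# 1    3.0
# 2    3.0
# 3    NaN   <- nothing after them, stay null
# 4    NaN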


So ffill cannot fill a leading null and bfill cannot fill a trailing null. For missing data at the very beginning or end of a Series, a value-based fill such as s.fillna(s.mean()) is usually the safer choice. (In newer pandas versions, s.ffill() and s.bfill() are the preferred spellings of these method-based fills.)

 

How to fill the null fields of each column with its own replacement (an arbitrary value, or that column's mean)?


Lets say you have the following dataframe:

df = pd.DataFrame({

    'Column A': [1, np.nan, 30, np.nan],

    'Column B': [2, 8, 31, np.nan],

    'Column C': [np.nan, 9, 32, 100],

    'Column D': [5, 8, 34, 110],

})

# now run these:


df.fillna({'Column A': 0, 'Column B': 99, 'Column C': df['Column C'].mean()})
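
With the DataFrame above, the mean of 'Column C' is (9 + 32 + 100) / 3 = 47.0, so the result looks roughly like this:

#    Column A  Column B  Column C  Column D
# 0       1.0       2.0      47.0         5
# 1       0.0       8.0       9.0         8
# 2      30.0      31.0      32.0        34
# 3       0.0      99.0     100.0       110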

# you can also use the fill methods here:

df.fillna(method='ffill', axis=0)  # fills from top to bottom, down each column

df.fillna(method='ffill', axis=1)  # fills from left to right, across each row

How to check if there is any missing value or null?

missing_values = len(s.dropna()) != len(s)

missing_values

# len(s.dropna()) is smaller than len(s) only when some values are missing, so we check whether the two lengths differ

s.isnull().any()  # this code also does the job

s.isnull().values.any()  # works on the underlying NumPy array; for a DataFrame this returns a single True/False instead of one value per column

Alternative code:

missing_values = s.count() != len(s)  # s.count() works like len(s.dropna()); both return the same integer

missing_values
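
For a whole DataFrame, a per-column count of missing values is often more informative. Using the DataFrame defined in the filling section above, df.isnull().sum() gives one count per column:

df.isnull().sum()
# Column A    2
# Column B    1
# Column C    1
# Column D    0
# dtype: int64

df.isnull().sum().sum()  # 4 -> total number of missing cells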


Cleaning not-null values

Always start with these imports:


import numpy as np

import pandas as pd

import matplotlib.pyplot as plt


%matplotlib inline

# because you never know when you'll need what


# let's make a DataFrame

df = pd.DataFrame({

    'Sex': ['M', 'F', 'F', 'D', '?'],

    'Age': [29, 30, 24, 290, 25],

})

df




Finding Unique Values

The first step to clean invalid values is to notice them, then identify them and finally handle them appropriately (remove them, replace them, etc). Usually, for a "categorical" type of field (like Sex, which only takes values of a discrete set ('M', 'F')), we start by analyzing the variety of values present. For that, we use the unique() method:


How to see which unique values exist:

df['Sex'].unique()
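
With the df defined above, this returns the distinct values in order of appearance, and value_counts() additionally shows how often each value appears (so rare values like 'D' and '?' stand out); roughly:

df['Sex'].unique()
# array(['M', 'F', 'D', '?'], dtype=object)

df['Sex'].value_counts()
# F    2
# M    1
# D    1
# ?    1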

How to replace values in a particular column of a DataFrame?

df['Sex'].replace({'D': 'F', 'N': 'M'})

How to replace values in multiple columns of a DataFrame?

df.replace({

    'Sex': {             #sex is the column

        'D': 'F',

        'N': 'M'

    },

    'Age': {            #age is another column

        290: 29

    }

})  # this is a dictionary of dictionaries: column name -> {old value: new value}


I explicitly replaced 290 with 29 (assuming it was just an extra 0 entered at the data-entry phase). But what if you'd like to remove all the extra 0s from the Age column (for example, 150 becomes 15 and 490 becomes 49)?

The first step would be to just set the limit of the "not possible" age. Is it 100? 120? Let's say that anything above 100 isn't credible for our dataset. We can then combine boolean selection with the operation:


Let's first see how many records have this extra-0 mistake:

df[df['Age'] > 100]  # shows the rows with an impossible age

Now let's fix them:

df.loc[df['Age'] > 100, 'Age'] = df.loc[df['Age'] > 100, 'Age'] / 10

# here, every age above 100 is divided by 10, turning the superhuman ages into plausible ones
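
After that assignment, the 'Age' column ends up with plausible values (depending on your pandas version the column may be upcast to float, so 29 may display as 29.0):

df['Age']
# 0    29
# 1    30
# 2    24
# 3    29
# 4    25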


Finding and dealing with duplicates

Checking duplicate values is extremely simple. It'll behave differently between Series and DataFrames.


Duplicates in Series:


Let's start with Series. As an example, let's say we're throwing a fancy party and we're inviting ambassadors from Europe, but we can only invite one ambassador per country. This is our original list, and as you can see, both the UK and Germany have duplicated ambassadors:


ambassadors = pd.Series([

    'France',

    'United Kingdom',

    'United Kingdom',

    'Italy',

    'Germany',

    'Germany',

    'Germany',

], index=[

    'Gérard Araud',

    'Kim Darroch',

    'Peter Westmacott',

    'Armando Varricchio',

    'Peter Wittig',

    'Peter Ammon',

    'Klaus Scharioth '

])

The two most important methods to deal with duplicates are df.duplicated() (that will tell you which values are duplicates) and df.drop_duplicates() (which will just get rid of duplicates):

ambassadors.duplicated()
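
The result marks the second 'United Kingdom' entry and the last two 'Germany' entries as duplicates, roughly:

# Gérard Araud          False
# Kim Darroch           False
# Peter Westmacott       True
# Armando Varricchio    False
# Peter Wittig          False
# Peter Ammon            True
# Klaus Scharioth        True
# dtype: bool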



By default, it considers the first occurrence of a value as not a duplicate and every later occurrence as a duplicate. You can change this behavior with the keep parameter:

ambassadors.duplicated(keep='last')

# keep='last' treats the last occurrence as the original, and the earlier occurrences are marked as duplicates

A similar method is drop_duplicates, which just excludes the duplicated values and also accepts the keep parameter:

ambassadors.drop_duplicates()

Just like .duplicated(), by default it keeps the first occurrence and drops the rest of the recurring values.

You can also change the default behavior:

ambassadors.drop_duplicates(keep='last')

ambassadors.drop_duplicates(keep=False)  # keeps only the values that were never repeated


Duplicates in DataFrames:


Conceptually speaking, duplicates in a DataFrame happen at "row" level. Two rows with exactly the same values are considered to be duplicates:

let's make a dataframe:


players = pd.DataFrame({

    'Name': [

        'Kobe Bryant',

        'LeBron James',

        'Kobe Bryant',

        'Carmelo Anthony',

        'Kobe Bryant',

    ],

    'Pos': [

        'SG',

        'SF',

        'SG',

        'SF',

        'SF'

    ]

})


In the DataFrame, we clearly see that Kobe is duplicated; but he appears with two different positions. What does duplicated say?

players.duplicated() 

By default, a row counts as duplicated only if all of its column values match another row. We can customize this with the subset parameter:

players.duplicated(subset=['Name'])


How to drop rows that are duplicated in a given column?

players.drop_duplicates(subset=['Name'])


Splitting Columns


You will have to do this a lot in your data analytics career.

Let's say you have a DataFrame:

df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_   IT_1',
        '1985_F_I  T_2'
    ]
})

The main method is str.split():

df['Data'].str.split('_')

How to create a DataFrame out of that?

df['Data'].str.split('_', expand=True)


How to set the columns?

First save the new DataFrame:

df = df['Data'].str.split('_', expand=True)

then set the columns:

df.columns = ['Year', 'Sex', 'Country', 'No Children']
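
After the split and the column assignment, the DataFrame looks roughly like this (note the stray spaces and '?' characters that still need cleaning):

#     Year Sex Country No Children
# 0   1987   M     US            1
# 1  1990?   M      UK           1
# 2   1992   F      US           2
# 3  1970?   M      IT           1
# 4   1985   F    I  T           2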


How to check if a column contains something?

df['Year'].str.contains(r'\?')  # '?' has to be escaped because contains() interprets the pattern as a regex

df['Country'].str.contains('U')


How to remove a space or special character from a field?

df['Country'].str.replace(' ', '')


df['Year'].str.replace(r'(?P<year>\d{4})\?', lambda m: m.group('year'), regex=True)  # regex=True is required when the replacement is a callable


My preferred way of doing this: 

df['Year'] = df['Year'].str.replace(r"\?", "", regex=True)
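
Putting it together, one possible end-to-end cleanup of this split DataFrame (converting the numeric columns with astype() is an extra, optional step not shown above):

df['Year'] = df['Year'].str.replace(r'\?', '', regex=True).astype(int)
df['Country'] = df['Country'].str.replace(' ', '')
df['No Children'] = df['No Children'].astype(int)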



