Python for Data Analysis: Introduction to Pandas

Introduction to Pandas

Code Source: Google Drive Link

s= pd.Series

A pandas Series is a one-dimensional, labeled data structure. You can build a Series from a Python list, and you can pass a second list to use as its index.

How do you set the index of a Series g7_pop manually?

g7_pop.index = [

'Canada',

'France',

'Germany',

'Italy',

'Japan',

'United Kingdom',

'United States',

]
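As a minimal sketch of the two steps above, the snippet below builds the Series from a list and then replaces its default index. It assumes the population figures (in millions) listed in the Indexing section of this post; the variable default_labels is only there for illustration.

```python
import pandas as pd

# Population figures reused from the Indexing section of this post.
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

# Without an explicit index, pandas labels the values 0 to 6.
default_labels = list(g7_pop.index)

# Assigning a list of the same length replaces those labels.
g7_pop.index = [
    'Canada', 'France', 'Germany', 'Italy',
    'Japan', 'United Kingdom', 'United States',
]
print(g7_pop)
```

The list assigned to .index must have exactly as many items as the Series has values, otherwise pandas raises an error.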

A NumPy array can be built from a list of lists (a nested list), which gives a two-dimensional array.

arr = np.array([ [1, 2], [3, 4] ])

A pandas Series can be built from a dictionary or from lists.

s = pd.Series(arr, index=arr1)
#considering arr (the values) and arr1 (the index labels) as python lists of the same length.

Python Dictionary

data = {'a' : 0., 'b' : 1., 'c' : 2.}
data = {'key2': ['Geeks', 'For', 'Geeks'], 'key1': [1, 2]}
#a dict value can be a list, and string keys are written in quotes.

Create a Series from ndarray:

#import the pandas library and alias it as pd (Example 1)
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print (s)

#import the pandas library and alias it as pd (Example 2)
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

Create a Series from Dict:

A dict can be passed as input and if no index is specified, then dict keys are taken as index. If index is passed, the values in data corresponding to the labels of the index will be pulled out.

#import the pandas library and alias it as pd; the dict keys become the index
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

#import the pandas library and alias it as pd; the index we pass decides the labels and their order ('d' is not in the dict, so it gets NaN)
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print (s)
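A short sketch of what the explicit index does in the last example: the values are looked up by label, so the order follows the index you pass, and a label that is missing from the dict comes back as NaN. The variable n_missing is just an illustrative name.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}

# The index controls both the order and which labels appear.
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)

# 'd' is not a key of the dict, so its value is NaN.
n_missing = int(s.isna().sum())
print(n_missing)
```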

Indexing:

Let's do the following:

g7_pop = [35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523]

l = ['Canada',

'France',

'Germany',

'Italy',

'Japan',

'United Kingdom',

'United States']

bothseries = pd.Series(g7_pop, index = l)

#making a series from two lists: the values and the index labels.

bothseries["France"] #indexing by label

#to select several labels at once, pass a list of labels

bothseries[["Italy", "France"]]

#unlike python list slicing, slicing a series by label includes the end label as well.

bothseries['Canada': 'Italy']
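To make the difference concrete, here is a small sketch comparing a label slice with a positional slice. It uses a shortened, four-country version of the Series above under the hypothetical name pop.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'],
)

# Label slice: the end label 'Germany' is included.
by_label = pop['Canada':'Germany']

# Positional slice: position 2 is excluded, like a Python list.
by_position = pop.iloc[0:2]

# A list of labels selects exactly those rows, in the order given.
several = pop[['Italy', 'France']]

print(len(by_label), len(by_position))
print(several)
```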

#you can always use google search to learn anything you will need in a project.

Conditional Selection:

bothseries[bothseries > 70]

bothseries['France': 'Italy'].mean()

Operations and methods:

bothseries * 1_000_000

bothseries['France': 'Italy'].mean()

np.log(bothseries)

#and (&), or (|), not (~); wrap each condition in parentheses

bothseries[(bothseries > 80) & (bothseries < 200)]

bothseries[~(bothseries > 80)]

bothseries[(bothseries > 80) | (bothseries < 200)]
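The selections above all work the same way: a comparison produces a Series of True/False values (a mask), and indexing with that mask keeps the rows where it is True. A minimal sketch, using a four-country subset under the hypothetical name pop:

```python
import pandas as pd

pop = pd.Series(
    [35.467, 80.940, 127.061, 318.523],
    index=['Canada', 'Germany', 'Japan', 'United States'],
)

mask = pop > 80                          # boolean Series, one value per row
between = pop[(pop > 80) & (pop < 200)]  # both conditions must hold
not_above = pop[~(pop > 80)]             # rows where the condition is False
either = pop[(pop > 80) | (pop < 200)]   # at least one condition holds

print(mask)
print(between)
```

Python's own and, or and not do not work element by element on a Series, which is why the operators &, | and ~ are used here.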

Modifying the series: 

bothseries['Canada'] = 40.5

bothseries[bothseries < 70] = 99.99

#the line of code will modify the items below 70 to 99.99

DataFrame

Basics:

How to create a DataFrame from scratch?

# Let's make a dictionary first,

#because you can make a DataFrame from a dictionary. You just need to pass the dictionary to pd.DataFrame().

dtx = {
    'Population': [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
    'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
    'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
    'Continent': ['America', 'Europe', 'Europe', 'Europe', 'Asia', 'Europe', 'America']
}

#each key of this dictionary is a column name, and each value is the list of that column's values.

df=pd.DataFrame(dtx)

#here I dropped the python dictionary into the pandas data frame pd.DataFrame() and we named the DataFrame df.

#output: a table with one column per dictionary key and rows numbered 0 to 6.

But we want to use the country names as the index of the DataFrame.

df.index = [

'Canada',

'France',

'Germany',

'Italy',

'Japan',

'United Kingdom',

'United States',

]

See the new index has replaced the numbers.

Indexing:

df.loc["Canada"]

#this returns the same row, selected by position instead of by label

df.iloc[0]

df["Population"] #this returns the column you ask for.

#each of the results above is a Series.

#to get a two-dimensional result, select rows and columns together
df.loc['France': 'Italy', ['Population', 'GDP']]
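The sketch below checks these claims on a cut-down, three-country version of the DataFrame: .loc and .iloc return the same row, a single row or column is a Series, and selecting rows and columns together gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Population': [35.467, 63.951, 80.940],
        'GDP': [1785387, 2833687, 3874437],
    },
    index=['Canada', 'France', 'Germany'],
)

row_by_label = df.loc['Canada']   # select a row by its index label
row_by_pos = df.iloc[0]           # select the same row by position
column = df['Population']         # select one column

# Rows 'France' to 'Germany' (end label included), one chosen column.
block = df.loc['France':'Germany', ['Population']]

print(type(column).__name__, type(block).__name__)
```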

#Dropping stuff:
df.drop(['Italy', 'Canada'], axis=0)#dropping rows

df.drop(['Population', 'HDI'], axis=1)#dropping column

df3.drop(['Unnamed: 0'], axis=1)#dropping unnamed column
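One detail worth knowing about the drop calls above: by default, drop returns a new DataFrame and leaves the original unchanged, so you need to keep the result. A minimal sketch on a small made-up frame (smaller and fewer_cols are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 60.665], 'HDI': [0.913, 0.888, 0.873]},
    index=['Canada', 'France', 'Italy'],
)

# Dropping rows by label; the result is a new DataFrame.
smaller = df.drop(['Italy', 'Canada'], axis=0)

# Dropping a column; columns=... is equivalent to axis=1.
fewer_cols = df.drop(columns=['HDI'])

print(smaller)
print(list(df.index))  # the original still has all three rows
```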

#adding rows and values to dataframe using pandas.Series

df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})

#slicing a range of columns for one row (the result is a Series, so to_frame() turns it into a DataFrame)
df.loc["Canada", "GDP":"HDI"].to_frame()

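#how to reset the index, and how to make an index from a column

df.reset_index() #moves the current index into a regular column and numbers the rows 0, 1, 2, ...

df.set_index("column_name") #uses the given column as the new index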
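A small sketch of both calls, on a two-row frame made up for illustration. Both return a new DataFrame rather than changing df in place.

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951], 'Continent': ['America', 'Europe']},
    index=['Canada', 'France'],
)

# The old index becomes a regular column; an unnamed index lands in a column called 'index'.
flat = df.reset_index()

# The chosen column becomes the index and is removed from the columns.
by_continent = df.set_index('Continent')

print(flat)
print(by_continent)
```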

#adding a new column from one column

df['gdp per capita']=df['GDP'] / df['Population']

#adding a new column to the DataFrame from a list (the list needs one value per row)

df['GDP Per Capita'] = ["20","30","40","50","60", "50","80"]

#another way to do this:

langs = pd.Series(

['French', 'German', 'Italian'],

index=['France', 'Germany', 'Italy'],

name='Language'

) #first make a series.

df['Language'] = langs #then make a column out of series.
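When a column is assigned from a Series, pandas matches the values by index label, not by position. A minimal sketch on a three-country frame, where the Series has no entry for Canada:

```python
import pandas as pd

df = pd.DataFrame(
    {'Population': [35.467, 63.951, 80.940]},
    index=['Canada', 'France', 'Germany'],
)

langs = pd.Series(
    ['French', 'German'],
    index=['France', 'Germany'],
    name='Language',
)

# Values are placed on the rows with matching labels.
df['Language'] = langs
print(df)
```

Canada has no matching label in langs, so its Language value is NaN.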

#Here the values are assigned to particular rows by index label; rows that are not in the series get NaN.

#adding new indexes/rows with values (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):

nx = pd.concat([df, pd.Series({

'Population': 3,

'GDP': 5

}, name='China').to_frame().T])

#Better Way to do this:

df.loc['China'] = pd.Series({'Population': 1_400_000_000, 'Continent': 'Asia'})

#How do you rename the columns and index labels of a DataFrame?

df.rename(

columns={

'HDI': 'Human Development Index',

'Anual Popcorn Consumption': 'APC'

}, index={

'United States': 'USA',

'United Kingdom': 'UK',

'Argentina': 'AR'

})
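In the call above, some of the keys do not exist in the DataFrame. With the default settings, rename simply ignores such keys, and it returns a new DataFrame instead of changing df. A minimal sketch on a two-row frame, where 'Not A Column' is a made-up key:

```python
import pandas as pd

df = pd.DataFrame(
    {'HDI': [0.915, 0.907]},
    index=['United States', 'United Kingdom'],
)

renamed = df.rename(
    columns={'HDI': 'Human Development Index', 'Not A Column': 'X'},
    index={'United States': 'USA', 'Argentina': 'AR'},
)

print(renamed)
print(list(df.columns))  # the original is unchanged
```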

#select all the rows where the Population column is above 70

df.loc[df['Population'] > 70]

#select only the Population column for those rows

df.loc[df['Population'] > 70, 'Population']

#Searching with a condition in a DataFrame(SQL like search)

#lets make a dataframe from a dictionary first.
#dict:

raw_data = {'first_name': ['Sheldon', 'Raj', 'Leonard', 'Howard', 'Amy'],

'last_name': ['Cooper', 'Koothrappali', 'Hofstadter', 'Wolowitz', 'Fowler'],

'age': [42, 38, 36, 41, 35],

'Comedy_Score': [9, 7, 8, 8, 5],

'Rating_Score': [25, 25, 49, 62, 70]}

#DataFrame:

dfr = pd.DataFrame(raw_data)

#SQL-like search: where() keeps every row and puts NaN where the condition is False
print(dfr['Comedy_Score'].where(dfr['Rating_Score'] < 50))

#if you want to print both columns,

print(dfr[['Comedy_Score', 'Rating_Score']].where(dfr['Rating_Score'] < 50))
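Note that where does not remove rows the way a SQL WHERE clause does. The sketch below contrasts it with boolean indexing through .loc, which returns only the matching rows; kept_shape and filtered are illustrative names.

```python
import pandas as pd

dfr = pd.DataFrame({
    'Comedy_Score': [9, 7, 8, 8, 5],
    'Rating_Score': [25, 25, 49, 62, 70],
})

condition = dfr['Rating_Score'] < 50

# where(): same length as the original, NaN where the condition is False.
kept_shape = dfr['Comedy_Score'].where(condition)

# Boolean indexing: only the rows where the condition is True.
filtered = dfr.loc[condition, 'Comedy_Score']

print(kept_shape)
print(filtered)
```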

pd.read_csv

If we want to import .csv files:

dftab1=pd.read_csv("group7.csv", index_col=0)

#index_col=0 sets the first column as the index.

#if the CSV file uses a comma as the thousands separator, pass thousands=",":
pd.read_csv("nx.csv", index_col=0, thousands=",")

tab1 = pd.read_csv("group7.csv", thousands=',', index_col=0)

or remove the commas from the column after loading:

df['col2'] = df['col2'].replace(',', '', regex=True)
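Both approaches can be tried without a file on disk by reading from a string. This is only a sketch: the CSV text and the column name gdp are made up for the example, and io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Made-up CSV text; the quoted numbers use a comma as thousands separator.
csv_text = 'country,gdp\nCanada,"1,785,387"\nFrance,"2,833,687"\n'

# Option 1: let read_csv handle the separator while parsing.
parsed = pd.read_csv(io.StringIO(csv_text), index_col=0, thousands=',')

# Option 2: read the column as text, strip the commas, then convert.
raw = pd.read_csv(io.StringIO(csv_text), index_col=0)
cleaned = raw['gdp'].str.replace(',', '', regex=False).astype(int)

print(parsed)
print(cleaned)
```

Removing the commas alone leaves the column as text, which is why the second option ends with a type conversion.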

How to save data into a new CSV file:
dftab1.to_csv("new.csv", encoding='utf-8')

How to open a CSV file that has no header row:

df=pd.read_csv("data/btc-market-price.csv", header=None)

How to set column names if the file has no header:

df.columns = ['Timestamp', 'Price']

How to convert a column of date strings to the datetime dtype:

df['Timestamp']= pd.to_datetime(df['Timestamp'])

How to set the first/any column as index:

df.set_index('Timestamp', inplace=True)

#but there is a better way to do all of this in a single call:

df = pd.read_csv(

'data/btc-market-price.csv',

header=None, #because the file has no header row

names=['Timestamp', 'Price'], #we are passing the column names ourselves

index_col=0, #use the first column as the index

parse_dates=True

)
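Here is a self-contained sketch of the same call that reads from a string instead of data/btc-market-price.csv. The two data rows are made-up sample values, not real prices.

```python
import io
import pandas as pd

# Made-up sample rows in "timestamp,price" form, with no header line.
csv_text = "2017-04-02 00:00:00,100.0\n2017-04-03 00:00:00,110.5\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,                    # the data has no header row
    names=['Timestamp', 'Price'],   # so we supply the column names
    index_col=0,                    # use the first column as the index
    parse_dates=True,               # parse the index as dates
)

print(df)
print(type(df.index).__name__)
```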

eth = pd.read_csv('data/eth-price.csv', parse_dates=True, index_col=0) #parse_dates=True only parses the index column, so it works when the index holds the date and time. Otherwise, pass a list with the name of the date column instead of True:

pd.read_csv('data/eth-price.csv', parse_dates=["column_name"])

If you want to change the dtype of a column to numeric:

1. Run this code for the column:

k = pd.to_numeric(df["column_name"], errors='coerce') #values that cannot be parsed become NaN

another way to do this:

n = df["column_name"].astype(str).astype(float) #raises an error if any value is not a number

2. Replace the column with the new column k

df["column_name"]=k #here k is our new column.

or,

df["column_name"]=n #here n is our new column.
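A minimal sketch of the first approach on a made-up column that contains one value that is not a number:

```python
import pandas as pd

# 'n/a' cannot be read as a number.
df = pd.DataFrame({'column_name': ['1.5', '2', 'n/a']})

# errors='coerce' turns unparseable values into NaN instead of raising.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

print(df)
print(df['column_name'].dtype)
```

On the same data, the astype(float) approach would raise a ValueError because of the 'n/a' entry, so it fits best when the column is already clean.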

Plotting:

Always import the right libraries first:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

There are two ways to plot a DataFrame df.

df.plot()#this will plot every numeric column of the table

plt.plot(df.index, df["column_name"])#this lets you select the column you want to plot.

How to plot x and y?

plt.plot(x, y)

How to change the shape and title of a plot?

Each plt function alters the global state. To configure the figure itself (for example, its size), call plt.figure first. Functions such as plt.plot and plt.title then keep modifying that same current plot:

plt.figure(figsize=(14, 7))

plt.plot(x, x ** 2)

plt.plot(x, -1 * (x ** 2))

plt.title('My Nice Plot')
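The snippet above does not define x. The sketch below fills that in with an assumed range of integers from -10 to 10 and reads the result back from the current axes. The Agg backend is selected only so that it runs outside a notebook, and the names fig and ax are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-10, 11)  # assumed sample data for the example

fig = plt.figure(figsize=(14, 7))  # creates the figure and makes it current
plt.plot(x, x ** 2)                # first line on the current axes
plt.plot(x, -1 * (x ** 2))         # second line on the same axes
plt.title('My Nice Plot')          # title of the same axes

ax = plt.gca()  # the axes that the calls above were drawing on
print(len(ax.lines), ax.get_title())
```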

Some of the arguments in plt.figure and plt.plot are available in the pandas' plot interface:

df.plot(figsize=(16, 9), title='Bitcoin Price 2017-2018')

How to import a CSV file and parse the dates immediately?

pd.read_csv('data/eth-price.csv', parse_dates=["column_name"])

Don't forget that:

eth = pd.read_csv('data/eth-price.csv', parse_dates=True, index_col=0) #this will parse the index of the DataFrame as dates.

How to reuse the index of another DataFrame in a new DataFrame?

prices=pd.DataFrame(index=df.index)

How to copy columns from other DataFrames into the new one?

prices['bitcoin']=df['Price']

prices['ether']=eth['Value']
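Copying columns this way aligns the values on the index, so a date that is missing from one source simply ends up as NaN. A sketch with two tiny made-up frames standing in for df and eth; the new frame is called prices, the name the plotting code that follows uses.

```python
import pandas as pd

# Made-up stand-ins for the Bitcoin and Ether data.
idx = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])
df = pd.DataFrame({'Price': [1.0, 2.0, 3.0]}, index=idx)
eth = pd.DataFrame({'Value': [10.0, 30.0]}, index=idx[[0, 2]])  # 2018-01-02 is missing

prices = pd.DataFrame(index=df.index)  # empty frame that reuses df's index
prices['bitcoin'] = df['Price']
prices['ether'] = eth['Value']         # aligned by date

print(prices)
```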

We can now try plotting both values:

prices.plot(figsize=(12, 6))

How to plot only a range of index values of a DataFrame?

prices.loc['2017-12-01':'2018-01-01'].plot(figsize=(12, 6))

another way is:

plt.figure(figsize=(16, 6))

plt.plot(prices.loc['2017-12-01':'2018-01-01'])

How to plot a range of index values for one specific column?

plt.figure(figsize=(16, 6))

plt.plot(df["col"].loc['index5':'index10'])


Sunday, 16 May 2021
