Friday, 16 June 2023

Filtering using pandas


The code below filters the rows where the 'Year' column is 2017, selects only the 'Year' and 'Revenue (GBP Millions)' columns, and prints the filtered rows with the corresponding values for both columns.

Make sure to replace 'Year' and 'Revenue (GBP Millions)' with the actual column names in your dataframe. If you have more columns that you want to include in the filtered result, you can add them within the double square brackets, separated by commas.

Note: If you want to perform any calculations or further data analysis on the filtered data, it's recommended to assign it to a new variable, like filtered_data, rather than modifying the original dataframe directly.

filtered_data = df[df['Year'] == 2017][['Year', 'Revenue (GBP Millions)']]

print(filtered_data)
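As a quick, self-contained check, the same filter can be run on a toy dataframe (the column values and the extra 'Region' column here are made up purely for illustration):

```python
import pandas as pd

# Toy dataframe standing in for the real data (values are illustrative)
df = pd.DataFrame({
    'Year': [2015, 2016, 2017, 2017],
    'Revenue (GBP Millions)': [100.0, 120.5, 150.2, 98.7],
    'Region': ['UK', 'UK', 'UK', 'EU'],
})

# Keep only the 2017 rows, then select the two columns of interest
filtered_data = df[df['Year'] == 2017][['Year', 'Revenue (GBP Millions)']]
print(filtered_data)
```

Only the two 2017 rows survive, and the 'Region' column is dropped because it is not listed in the double square brackets.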


Scalar Subquery in Case When SQL

Let's first create a table:

CREATE DATABASE blog_portal;

USE blog_portal;

-- Create the Employee table
CREATE TABLE Employee (
    employee_id INT,
    team_id INT,
    PRIMARY KEY (employee_id)
);


-- Insert the data into the Employee table
INSERT INTO Employee (employee_id, team_id)
VALUES (1, 8),
       (2, 8),
       (3, 8),
       (4, 7),
       (5, 9),
       (6, 9);

Write an SQL query to find the team size for each of the employees. Return the result table in any order.

First intuitive way to solve the problem:

WITH CTE AS (
    SELECT team_id, COUNT(team_id) AS team_mem_count
    FROM Employee
    GROUP BY team_id
)
SELECT e.employee_id, CTE.team_mem_count
FROM Employee AS e
LEFT JOIN CTE ON CTE.team_id = e.team_id
ORDER BY e.employee_id;

(A LEFT JOIN from Employee keeps every employee in the result, even one whose team_id has no match in the CTE.)

Best Way: Case When Scalar Grouping

SELECT employee_id,
       team_id,
       CASE
           WHEN team_id IS NULL THEN 0
           ELSE (SELECT COUNT(*) FROM Employee WHERE team_id = e.team_id)
       END AS team_size
FROM Employee AS e;


In this case, the subquery acts as a scalar subquery, which can be used in the SELECT statement without requiring a GROUP BY clause. The subquery is evaluated independently for each row, providing the corresponding team_size for each employee without the need for explicit grouping.

For example, when the subquery's condition WHERE team_id = e.team_id runs, it matches only the team_id of the current outer row (from Employee AS e) against the team_ids in the inner Employee table. Therefore, keep in mind that for each row, COUNT(*) counts only the rows whose values satisfy that condition.




Sunday, 11 June 2023

Importing CSV into MySQL

 


conda install -c anaconda sqlalchemy
pip install pandas


 Way 1:

from sqlalchemy import create_engine as ce
import pandas as pd

# The mysql+mysqldb driver requires the mysqlclient package
mysql_engine = ce("mysql+mysqldb://root:root@127.0.0.1:3306/cancer_data")

# Read without a header so the columns can be renamed later if needed (see Way 4),
# then drop the header row itself
df = pd.read_csv("D:/DownloadsD/new/health_statistics.csv", header=None)
df = df.iloc[1:]
#print(df.columns)

df.to_sql('health_statistics', mysql_engine, if_exists='append', index=False)


Way 2: 

import MySQLdb as m

# Connect to the MySQL database
mysql_connection = m.connect(
    user="root",
    password="root",
    database="cancer_data"
)

# Create a cursor object to execute SQL queries
cursor = mysql_connection.cursor()

# Path to your CSV file
csv_file = 'C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/health_statistics.csv'

# Name of the table in the database
table_name = 'health_statistics'

# SQL query to load data from CSV into the table
# (the backslash is escaped so MySQL receives the literal two-character '\n',
# rather than an actual newline baked in by the Python f-string)
load_query = f"""
    LOAD DATA INFILE '{csv_file}'
    INTO TABLE {table_name}
    FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 ROWS
"""

# Execute the query
cursor.execute(load_query)

# Commit the changes
mysql_connection.commit()

# Close the cursor and database connection
cursor.close()
mysql_connection.close()

Way 3: 

1. Put the CSV file in the Uploads folder of the MySQL ProgramData directory (typically the server's secure_file_priv location).

2. Create the table.

3. Use the following code:

    LOAD DATA INFILE '{csv_file_path}'
    INTO TABLE {table_name}
    FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 ROWS

Way 4: (When Previous use cases fail)

This is similar to Way 1, but one thing to remember: this step needs to be added to Way 1 if the CSV file's column names differ from the table's. There are multiple ways of dealing with it. We could open the CSV file and change the columns one by one, but if that isn't practical to do every day, we can use the following piece of code.

column_mapping = {
    0: 'column1_name',
    1: 'column2_name',
    # Add more mappings as needed for the specific columns
}

# Rename the columns in the DataFrame using the mapping dictionary
df = df.rename(columns=column_mapping)
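A self-contained sketch of the renaming step, using an in-memory CSV instead of a file (the CSV contents and the target column names 'Year' and 'Revenue' are hypothetical):

```python
import io
import pandas as pd

# Stand-in for a CSV whose header row doesn't match the table's column names
csv_text = "yr,rev\n2017,150.2\n2018,160.0\n"

# Read without a header (columns come back as 0, 1, ...),
# then drop the original header row
df = pd.read_csv(io.StringIO(csv_text), header=None)
df = df.iloc[1:]

# Map the positional column indices to the names the database table expects
column_mapping = {0: 'Year', 1: 'Revenue'}
df = df.rename(columns=column_mapping)

print(df.columns.tolist())  # ['Year', 'Revenue']
```

After the rename, df.to_sql() from Way 1 will line the dataframe's columns up with the table's.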

Happy Coding.