Monday, 18 July 2022

12 steps to get into data analytics without paying a single penny



1. Learn SQL with sqlbolt.com
2. Download public datasets from tableau.com
3. Use PowerBI to practice visualization skills
4. Build a of portfolio of 2-3 pet projects
5. Take Python course on alison.com
6. Get Python certified on alison.com
7. Look for interview tips on LinkedIn or Youtube
8. Be as original as possible in cover letter
9. Apply for at least 10 openings
10. Get invited
11. Showcase acquired skills in interview
12. Get hired

Monday, 11 July 2022

SQL for Business Decision

 Starts here.... 

Working with dates. In the EDA course I have already learnt this though. I will try again. 

Data format

The format of dates is 'YYYY-MM-DD'.

Select all movies rented on October 9th, 2018. From the renting table.

Columns of renting table:

renting_idcustomer_idmovie_idratingdate_renting
SELECT *
FROM renting
Where date_renting = '2018-10-09'; 
-- Movies rented on October 9th, 2018


  • Select all records of movie rentals between beginning of April 2018 till end of August 201
SELECT *
FROM renting
where date_renting between '2018-04-01' and '2018-08-31';
-- from beginning April 2018 to end August 2018

  • Put the most recent records of movie rentals on top of the resulting table and order them in decreasing order.
SELECT *
FROM renting
WHERE date_renting BETWEEN '2018-04-01' AND '2018-08-31'
order by date_renting desc; -- Order by recency in decreasing order

Selecting movies

The table movies contains all movies available on the online platform.

Columns in table Movies: 

movie_idtitlegenreruntimeyear_of_releaserenting_price
SELECT *
FROM movies
where genre <> 'Drama'; -- All genres except drama

  • Select the movies 'Showtime', 'Love Actually' and 'The Fighter'.
SELECT *
FROM movies
where title in ('Showtime', 'Love Actually', 'The Fighter'); 
-- Select all movies with the given titles

  • Order the movies by increasing renting price.

SELECT *
FROM movies
order by renting_price desc ; 
-- Order the movies by increasing renting price



  • Select from table renting all movie rentals from 2018.
  • Filter only those records which have a movie rating.

SELECT *
FROM renting
-- Renting in 2018
WHERE date_renting between '2018-01-01' AND '2018-12-31' 
AND rating Is Not Null; -- Rating exists


Summarizing customer information

In most business decisions customers are analyzed in groups, such as customers per country or customers per age group.

Customers table: 

customer_idnamecountrygenderdate_of_birthdate_account_start

Count the number of total customers born in 1980s

SELECT count(*) -- Count the total number of customers
FROM customers
WHERE date_of_birth between '1980-01-01' and '1989-12-31'; 
-- Select customers born between 1980-01-01 and 1989-12-31

Count( distinct country) where movieNow has customers:

number of countries where movie now has customers.

SELECT count(distinct country) -- Count the number of countries
FROM customers;


  • Select all movie rentals of the movie with movie_id 25 from the table renting.
  • For those records, calculate the minimum, maximum and average rating and count the number of ratings for this movie.
  • SELECT min(rating) as min_rating,
    -- Calculate the minimum rating and use alias min_rating
    max(rating) as max_rating,
    -- Calculate the maximum rating and use alias max_rating
    avg(rating) as avg_rating,
    -- Calculate the average rating and use alias avg_rating
    count(rating) number_ratings
    -- Count the number of ratings and use alias number_ratings
    -- we are specifying the column because we are not counting
    -- the null values 
    FROM renting
    where movie_id = 15;
    -- Select all records of the movie with ID 25


Examining annual rentals

You are asked to provide a report about the development of the company. Specifically, your manager is interested in the total number of movie rentals, the total number of ratings and the average rating of all movies since the beginning of 2019.

  • First, select all records of movie rentals since January 1st 2019.
Select * -- Select all records of movie rentals since January 1st 2019
from renting
where date_renting >= '2019-01-01';

  • Now, count the number of movie rentals and calculate the average rating since the beginning of 2019.
SELECT
count(*), -- Count the total number of rented movies
AVG(rating) -- Add the average rating
FROM renting
WHERE date_renting >= '2019-01-01';

count how many ratings exist since 2019-01-01.

SELECT
COUNT(*) AS number_renting,
AVG(rating) AS average_rating,
count(rating) AS number_ratings 
-- Add the total number of ratings here.
FROM renting
WHERE date_renting >= '2019-01-01';

Group By Applications:

In order to understand the business we can group by country or gender, 

We can investigate the popularity of movies by genre or year of release(this year, all movies were popular

that genre is usually popular...)

Average price of movies by genre. (what is the average price of movies in the so and so genre)

movies_selected table:

title, genre, renting_price

Average renting price per genre

Select genre,

        avg(renting_price) as avg_price  -- because we want to average the whole pricing column...

        Count(*) as number_movies -- will add another column to genre.

        -- used the count all to count everything but now the 'group by' enables us to count only the 

        --number of movies in each genre. 

From movies_selected

group by genre

Having Count(*) >= 2; 

table columns: 

genre, avg_price, number_movies

Wait, what is the difference between having and where?

Having filters by placing the condition on the group by.

Where filters by placing the condition at the row level. 


Conduct an analysis to see when the first customer accounts were created for each country.

  • Create a table with a row for each country and columns for the country name and the date when the first customer account was created.
  • Use the alias first_accountfor the column with the dates.
  • Order by date in ascending order.
SELECT country, 
-- (For each country report the earliest
-- date when an account was created)
min(date_account_start) AS first_account
FROM customers
GROUP BY country
ORDER BY first_account;


Which movie had the highest income?

First we will select movies title.


select

        rm.title,

        SUM(rm.renting_price)

from

        (Select 

        r.renting_price,

        m.title

        from renting as r

       left join movies as m

       on r.movie_id=m.movie_id) as rm

Group by rm.title

Order By income_movies


Age of actors from the USA

Now you will explore the age of American actors and actresses. Report the date of birth of the oldest and youngest US actor and actress.

  • Create a subsequent SELECT statements in the FROM clause to get all information about actors from the USA.
  • Give the subsequent SELECT statement the alias a.
  • Report for actors from the USA the year of birth of the oldest and the year of birth of the youngest actor and actress
SELECT Gender, -- Report for male and female actors from the USA
Min(a.year_of_birth), -- The year of birth of the oldest actor
Max(a.year_of_birth) -- The year of birth of the youngest actor
FROM
(SELECT * 
-- Use a subsequen SELECT to get all information about actors from the USA
From
actors
Where nationality = 'USA') as a -- Give the table the name a
GROUP BY Gender;


Who is the favorite actor for a certain group, in our case by Gender Male?

Select

a.name,

count(*)

from

renting as r

left join customers as c

On r.customer_id = c.customer_id

Left Join actsin as ai

on r.movie_id=ai.movie_id

Left join actors as a

On ai.actor_id=a.actor_id

where c.gender = 'male'

Group by a.name;


Who is the favorite actor with the highest average rating for a certain group, in our case by Gender Male?

(Insert the code later)

Identify favorite movies for a group of customers

Which is the favorite movie on MovieNow? Answer this question for a specific group of customers: for all customers born in the 70s.

  • Augment the table renting with customer information and information about the movies.
  • For each join use the first letter of the table name as alias.
  • Select only those records of customers born in the 70s.
  • For each movie, report the number of times it was rented, as well as the average rating. Limit your results to customers born in the 1970s.
  • Remove those movies from the table with only one rental.
  • Order the result table such that movies with highest rating come first.

SELECT m.title,
COUNT(*),
AVG(r.rating)
FROM renting AS r
LEFT JOIN customers AS c
ON c.customer_id = r.customer_id
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
WHERE c.date_of_birth BETWEEN '1970-01-01' AND '1979-12-31'
GROUP BY m.title
Having count(*) > 1 -- Remove movies with only one rental
Order by Avg(r.rating) Desc; -- Order with highest rating first


Identify favorite actors for Spain

You're now going to explore actor popularity in Spain. Use as alias the first letter of the table, except for the table actsin use ai instead.

  • Augment the table renting with information about customers and actors.
  • Report the number of movie rentals and the average rating for each actor, separately for male and female customers.
  • Report only actors with more than 5 movie rentals.
  • Now, report the favorite actors only for customers from Spain. 
  • Hint
    • In GROUP BY two column names can be listed.
    • Always use where clause after the joins.


    SELECT a.name, c.gender,
    COUNT(*) AS number_views,
    AVG(r.rating) AS avg_rating
    FROM renting as r
    LEFT JOIN customers AS c
    ON r.customer_id = c.customer_id
    LEFT JOIN actsin as ai
    ON r.movie_id = ai.movie_id
    LEFT JOIN actors as a
    ON ai.actor_id = a.actor_id
    WHERE c.country = 'Spain' -- Select only customers from Spain
    GROUP BY a.name, c.gender -- we are grouping by two columns
    HAVING AVG(r.rating) IS NOT NULL
    AND COUNT(*) > 5
    ORDER BY avg_rating DESC, number_views DESC;


  • KPIs per country

    In chapter 1 you were asked to provide a report about the development of the company. This time you have to prepare a similar report with KPIs for each country separately. Your manager is interested in the total number of movie rentals, the average rating of all movies and the total revenue for each country since the beginning of 2019.


  • Augment the table renting with information about customers and movies.
  • Use as alias the first latter of the table name.
  • Select only records about rentals since beginning of 2019.
  • Calculate the number of movie rentals.
  • Calculate the average rating.
  • Calculate the revenue from movie rentals.
  • Report these KPIs for each country.
SELECT
c.country, -- For each country report
COUNT(*) AS number_renting, -- The number of movie rentals
avg(r.rating) AS average_rating, -- The average rating
SUM(m.renting_price) AS revenue -- The revenue from movie rentals
FROM renting AS r
LEFT JOIN customers AS c
ON c.customer_id = r.customer_id
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
WHERE date_renting >= '2019-01-01'
Group By c.country;



Analyzing customer behavior

A new advertising campaign is going to focus on customers who rented fewer than 5 movies. Use a correlated query to extract all customer information for the customers of interest.

Correlated nested subquery

-- Select customers with less than 5 movie rentals
SELECT *
FROM customers as c
WHERE 5 > 
    (SELECT count(*)
    FROM renting as r
    WHERE r.customer_id = c.customer_id);

Customers who gave low ratings

Identify customers who were not satisfied with movies they watched on MovieNow. Report a list of customers with minimum rating smaller than 4.

SELECT *
FROM customers as c
WHERE 4 >  -- Select all customers with a minimum rating smaller than 4 
    (SELECT MIN(rating)
    FROM renting AS r
    WHERE r.customer_id = c.customer_id);


 

Customers with at least one rating

Having active customers is a key performance indicator for MovieNow. Make a list of customers who gave at least one rating.

Select all customers with at least one rating. Use the first letter of the table as an alias.

SELECT *
FROM customers as c -- Select all customers with at least one rating
WHERE Exists
    (SELECT *
    FROM renting AS r
    WHERE rating IS NOT NULL -- there might be at
    --least one rating but we keep that. 
    AND r.customer_id = c.customer_id);


Nationality of actors

In order to analyze the diversity of actors in comedies, first, report a list of actors who play in comedies and then, the number of actors for each nationality playing in comedies.

Select a.nationality,
count(actir.actor_id) as nacount
from
(SELECT actor_id  -- Select the records of all actors who play in a Comedy
FROM actsin AS ai
left join movies as m
on ai.movie_id = m.movie_id
WHERE m.genre = 'Comedy') as actir
Join actors as a
On actir.actor_id = a.actor_id
Group by a.nationality;


But this was my solution. There is another solution which I provided.

SELECT a.nationality,
count(*) -- Report the nationality and the number of actors for each nationality
FROM actors AS a
WHERE EXISTS
    (SELECT ai.actor_id
     FROM actsin AS ai
     LEFT JOIN movies AS m
     ON m.movie_id = ai.movie_id
     WHERE m.genre = 'Comedy'
     AND ai.actor_id = a.actor_id)
Group by a.nationality;


Union and Intersect in SQL

As a #dataanalyst , you might not need to use UNION as often as you might think, but I find UNION and INTERSECTS to be really cool.

In #postgresql , we want to select names, year of birth and also number of icecreams he purchased.
We want to select only those kids who was born in 1996 and also purchased more than 3 ice-creams.
This is how I would solve using UNION or INTERSECT.

SELECT name, 
       year_of_birth,
icecream_count
FROM icecreamtable
WHERE icecream_count > 3
INTERSECT -- we will use intersect because we need name of kid who was bron in 1996 and also he has purchase more than 3 icecreams.
SELECT name,  
       year_of_birth,
icecream_count
FROM actors
WHERE year_of_birth = 1996;

How would you have solved it if you were in my place? #data #SQL
#postgresql

 

Rollup


we drop a level of detail at each step. 

we can have two aggregations in rollup

 

Generate a table with the total number of customers, the number of customers for each country, and the number of female and male customers for each country. 

 

-- Count the total number of customers, the number of customers for each country, and the number of female and male customers for each country
SELECT country,
       gender,
       COUNT(*)
FROM customers
Group BY Rollup (country, gender)
Order BY country, gender; -- Order the result by country and gender

You are asked to study the preferences of genres across countries. Are there particular genres which are more popular in specific countries? Evaluate the preferences of customers by averaging their ratings and counting the number of movies rented from each genre.
this means, it wants specific country wise genre aggregation and also overall rating for
country only. If you flip the oder in rollup, we will still get the insights, but we will not get the 
total counts and numbers for country-only.





-- Group by each county and genre with OLAP extension
SELECT 
    c.country, 
    m.genre, 
    AVG(r.rating) AS avg_rating, 
    COUNT(*) AS num_rating
FROM renting AS r
LEFT JOIN movies AS m
ON m.movie_id = r.movie_id
LEFT JOIN customers AS c
ON r.customer_id = c.customer_id
GROUP BY rollup (c.country, m.genre)
ORDER BY c.country, m.genre;








Grouping Sets



  Now you will investigate the average rating of customers aggregated by country and gender.
SELECT 
    c.country, 
    c.gender,
    AVG(r.rating)
FROM renting AS r
LEFT JOIN customers AS c
ON r.customer_id = c.customer_id
-- Report all info from a Pivot table for country and gender
GROUP BY GROUPING SETS ((country, gender), (c.country), (c.gender), ());