Monday, 15 August 2022

Analyzing business with SQL

 

 

Revenue per customer

You've been hired at Delivr as a data analyst! A customer just called Delivr's Customer Support team; she wants to double-check whether her receipts add up. Going by her receipts, she calculated that her total bill on Delivr is $271, and she wants to make sure of that. Her user ID is 15.

Help the Customer Support team by calculating her total bill! Sum up everything she's spent on Delivr orders; in other words, calculate the total revenue that Delivr has generated from her.

 

-- Calculate revenue
SELECT SUM(orders.order_quantity * meals.meal_price) AS revenue
  FROM meals
  JOIN orders ON meals.meal_id = orders.meal_id
-- Keep only the records of customer ID 15
WHERE orders.user_id = 15;

 

Revenue per week

Delivr's first full month of operations was June 2018. At launch, the Marketing team ran an ad campaign on popular food channels on TV, with the number of ads increasing each week through the end of the month. The Head of Marketing asks you to help her assess that campaign's success.

Get the revenue per week for each week in June and check whether there's any consistent growth in revenue.

Note: Don't be surprised if you get a date in May in the result. DATE_TRUNC('week', '2018-06-02') returns '2018-05-28', since '2018-06-02' is a Saturday and the preceding Monday is on '2018-05-28'.

  • Write the expression for revenue.
  • Keep only the records of June 2018.
SELECT DATE_TRUNC('week', order_date) :: DATE AS delivr_week,
       -- Calculate revenue
       Sum(meals.meal_price * orders.order_quantity) AS revenue
  FROM meals
  JOIN orders ON meals.meal_id = orders.meal_id
-- Keep only the records in June 2018
WHERE orders.order_date between '2018-06-01' And '2018-06-30'
GROUP BY delivr_week
ORDER BY delivr_week ASC;

 

Top meals by cost

Alice from Finance wants to know what Delivr's top 5 meals are by overall cost; in other words, Alice wants to know the 5 meals that Delivr has spent the most on for stocking.

You're provided with an aggregate query; you'll need to fill in the blanks to get the output Alice needs.

Note: Recall that in the meals table, meal_price is what the user pays Delivr for the meal, while meal_cost is what Delivr pays its eateries to stock this meal. 

SELECT
-- Calculate cost per meal ID
meals.meal_id,
sum(meals.meal_cost * stock.stocked_quantity) AS cost
FROM meals
JOIN stock ON meals.meal_id = stock.meal_id
GROUP BY meals.meal_id
ORDER BY cost DESC
-- Only the top 5 meal IDs by purchase cost

LIMIT 5; 

 

 

 

Growth Rate and Delta

 

WITH orders AS (
  SELECT
    DATE_TRUNC('month', order_date) :: DATE AS delivr_month,
    --  Count the unique order IDs
    count (distinct order_id) AS orders
  FROM orders
  GROUP BY delivr_month),

  orders_with_lag AS (
  SELECT
    delivr_month,
    -- Fetch each month's current and previous orders
    orders,
    COALESCE(
      lag(orders) over(order by delivr_month),
    1) AS last_orders
  FROM orders)

SELECT
  delivr_month,
  -- Calculate the MoM order growth rate
  ROUND(
    (orders - last_orders) :: numeric / last_orders,
  2) AS growth
FROM orders_with_lag
ORDER BY delivr_month ASC;

 

 

Retention rate

Bob's requested your help again now that you're done with Carol's MAU monitor. His meeting with potential investors is fast approaching, and he wants to wrap up his pitch deck. You've already helped him with the registrations running total by month and MAU line charts; the investors, Bob says, would be convinced that Delivr is growing both in new users and in MAUs.

However, Bob wants to show that Delivr not only attracts new users but also retains existing users. Send him a table of MoM retention rates so that he can highlight Delivr's high user loyalty.

 

WITH user_monthly_activity AS (
  SELECT DISTINCT
    DATE_TRUNC('month', order_date) :: DATE AS delivr_month,
    user_id
  FROM orders)

SELECT
  -- Calculate the MoM retention rates
  previous.delivr_month,
  round(
(count(DISTINCT current.user_id))::numeric / greatest( 
count(DISTINCT previous.user_id),1),
 2) as retention_rate
FROM user_monthly_activity AS previous
LEFT JOIN user_monthly_activity AS current
-- Fill in the user and month join conditions
ON previous.user_id = current.user_id 
AND previous.delivr_month = (current.delivr_month - interval '1 month')
GROUP BY previous.delivr_month
ORDER BY previous.delivr_month ASC;

 

 

Average revenue per user

Dave from Finance wants to study Delivr's performance in revenue and orders per each of its user base. In other words, he wants to understand its unit economics.

Help Dave kick off his study by calculating the overall average revenue per user (ARPU) using the first way discussed in Lesson 3.1.

 

-- Create a CTE named kpi
with kpi as (
  SELECT
    -- Select the user ID and calculate revenue
    user_id,
    SUM(m.meal_price * o.order_quantity) AS revenue
  FROM meals AS m
  JOIN orders AS o ON m.meal_id = o.meal_id
  GROUP BY user_id)
-- Calculate ARPU
SELECT ROUND(avg(revenue) :: numeric, 2) AS arpu
FROM kpi;

 

ARPU per week

Next, Dave wants to see whether ARPU has increased over time. Even if Delivr's revenue is increasing, it's not scaling well if its ARPU is decreasing—it's generating less revenue from each of its customers.

Send Dave a table of ARPU by week using the second way discussed in Lesson 3.1.

 

WITH kpi AS (
  SELECT
    -- Select the week, revenue, and count of users
    date_trunc('week', o.order_date) :: date AS delivr_week,
    sum(m.meal_price * o.order_quantity ) AS revenue,
    count(distinct o.user_id ) AS users
  FROM meals AS m
  JOIN orders AS o ON m.meal_id = o.meal_id
  GROUP BY delivr_week)

SELECT
  delivr_week,
  -- Calculate ARPU
  ROUND(
    revenue :: numeric / users,
  2) AS arpu
FROM kpi
-- Order by week in ascending order
ORDER BY delivr_week ASC;

 

Average orders per user

Dave wants to add the average orders per user value to his unit economics study, since more orders usually correspond to more revenue.

Calculate the average orders per user for Dave.

Note: The count of distinct orders is different than the sum of ordered meals. One order can have many meals within it. Average orders per user depends on the count of orders, not the sum of ordered meals.

 

WITH kpi AS (
  SELECT
    -- Select the count of orders and users
    count(distinct order_id) AS orders,
    count(distinct user_id) AS users
  FROM orders)

SELECT
  -- Calculate the average orders per user
  ROUND(
    orders :: numeric / users,
  2) AS arpu
FROM kpi;

 

Histogram of revenue

After determining that Delivr is doing well at scaling its business model, Dave wants to explore the distribution of revenues. He wants to see whether the distribution is U-shaped or normal to see how best to categorize users by the revenue they generate.

Send Dave a frequency table of revenues by user.

 

WITH user_revenues AS (
  SELECT
    -- Select the user ID and revenue
    o.user_id,
    sum(m.meal_price * o.order_quantity) AS revenue
  FROM meals AS m
  JOIN orders AS o ON m.meal_id = o.meal_id
  GROUP BY user_id)

SELECT
  -- Return the frequency table of revenues by user
  round(revenue::numeric,-2) AS revenue_100,
  count(distinct user_id) AS users
FROM user_revenues
GROUP BY revenue_100
ORDER BY revenue_100 ASC;

 

Histogram of orders

Dave also wants to plot the histogram of orders to see if it matches the shape of the histogram of revenues.

Send Dave a frequency table of orders by user.

 

WITH user_orders AS (
  SELECT
    user_id,
    COUNT(DISTINCT order_id) AS orders
  FROM orders
  GROUP BY user_id)

SELECT
  -- Return the frequency table of orders by user
  orders,
  count(distinct user_id) AS users
FROM user_orders
GROUP BY orders
ORDER BY orders ASC;

 

Bucketing users by revenue

Based on his analysis, Dave identified that $150 is a good cut-off for low-revenue users, and $300 is a good cut-off for mid-revenue users. He wants to find the number of users in each category to tweak Delivr's business model.

Split the users into low, mid, and high-revenue buckets, and return the count of users in each group.

 

with user_revenues as(
  select
  user_id,
  sum(m.meal_price * o.order_quantity) as revenue
  from meals as m
  join orders as o
  on m.meal_id = o.meal_id
  group by o.user_id
)

select
CASE 
when revenue <150 then 'Low-revenue users'
when revenue <300 then 'Mid-revenue users'
Else 'High-revenue users'
END as revenue_group,
count(distinct user_id) as users

from user_revenues
group by revenue_group; 

 

 

Bucketing users by orders

Dave is repeating his bucketing analysis on orders to have a more complete profile of each group. He determined that 8 orders is a good cut-off for the low-orders group, and 15 is a good cut-off for the medium orders group.

Send Dave a table of each order group and how many users are in it.

 

-- Store each user's count of orders in a CTE named user_orders
with user_orders as (
  SELECT
    user_id,
    count(distinct order_id) AS orders
  FROM orders
  GROUP BY user_id)

SELECT
  -- Write the conditions for the three buckets
  CASE
    WHEN orders<8 THEN 'Low-orders users'
    WHEN orders<15 THEN 'Mid-orders users'
    ELSE 'High-orders users'
  END AS order_group,
  -- Count the distinct users in each bucket
  count(distinct user_id) AS users
FROM user_orders
GROUP BY order_group;

 How to find percentiles/quartiles.

 

 

Revenue quartiles

Dave is wrapping up his study, and wants to calculate a few more figures. He wants to find out the first, second, and third revenue quartiles. He also wants to find the average to see in which direction the data is skewed.

Calculate the first, second, and third revenue quartiles, as well as the average.

Note: You can calculate the 30th percentile for a column named column_a by using PERCENTILE_CONT(0.30) WITHIN GROUP (ORDER BY column_a ASC).

 

WITH user_revenues AS (
  -- Select the user IDs and their revenues
  SELECT
    o.user_id,
    sum(m.meal_price * o.order_quantity) AS revenue
  FROM meals AS m
  JOIN orders AS o ON m.meal_id = o.meal_id
  GROUP BY user_id)

SELECT
  -- Calculate the first, second, and third quartile
  ROUND(
    percentile_cont(0.25) within group (order by revenue ASC) :: numeric,
  2) AS revenue_p25,
  ROUND(
    percentile_cont(0.50) within group (order by revenue ASC) :: numeric,
  2) AS revenue_p50,
  ROUND(
    percentile_cont (0.75) within group (order by revenue ASC):: numeric,
  2) AS revenue_p75,
  -- Calculate the average
  ROUND(avg(revenue) :: numeric, 2) AS avg_revenue
FROM user_revenues;

 

 25 percent of the users have revenue below 120.69. To interpret the data, we must remember that we always want to find the opposite of the column we are working with. 120 is the revenue. so the opposite column is users. so 25 percent users are bellow 120... it must be always percentile + opposite column + below + result.

 

Interquartile range

The final value that Dave wants is the count of users in the revenue interquartile range (IQR). Users outside the revenue IQR are outliers, and Dave wants to know the number of "typical" users.

Return the count of users in the revenue IQR.


WITH user_revenues AS (
  SELECT
    -- Select user_id and calculate revenue by user 
    user_id,
    SUM(m.meal_price * o.order_quantity) AS revenue
  FROM meals AS m
  JOIN orders AS o 
    ON m.meal_id = o.meal_id
  GROUP BY user_id),

  quartiles AS (
  SELECT
    -- Calculate the first and third revenue quartiles
    ROUND(
      PERCENTILE_CONT(0.25) WITHIN GROUP
      (ORDER BY revenue ASC) :: NUMERIC,
    2) AS revenue_p25,
    ROUND(
      PERCENTILE_CONT(0.75) WITHIN GROUP
      (ORDER BY revenue ASC) :: NUMERIC,
    2) AS revenue_p75
  FROM user_revenues)

SELECT
  -- Count the number of users in the IQR
  count(distinct user_id) AS users
FROM user_revenues
CROSS JOIN quartiles
-- Only keep users with revenues in the IQR range
WHERE revenue :: NUMERIC >= revenue_p25
  AND revenue :: NUMERIC <= revenue_p75;

 

 Important window functions


to_char function to convert date into full description

 

Rank users by their count of orders

Eve tells you that she wants to report which user IDs have the most orders each month. She doesn't want to display long numbers, which will only distract C-level execs, so she wants to display only their ranks. The top 1 rank goes to the user with the most orders, the second-top 2 rank goes to the user with the second-most orders, and so on.

Send Eve a list of the top 3 user IDs by orders in August 2018 with their ranks.

 -- Set up the user_count_orders CTE

WITH user_count_orders AS (
  SELECT
    user_id,
    COUNT(DISTINCT order_id) AS count_orders
  FROM orders
  -- Only keep orders in August 2018
  WHERE DATE_TRUNC('month', order_date) = '2018-08-01'
  GROUP BY user_id)

SELECT
  -- Select user ID, and rank user ID by count_orders
  user_id,
  RANK() OVER (ORDER BY count_orders DESC) AS count_orders_rank
FROM user_count_orders
ORDER BY count_orders_rank ASC
-- Limit the user IDs selected to 3
LIMIT 3;

 

Pivoting user revenues by month

Next, Eve tells you that the C-level execs prefer wide tables over long ones because they're easier to scan. She prepared a sample report of user revenues by month, detailing the first 5 user IDs' revenues from June to August 2018. The execs told her to pivot the table by month. She's passed that task off to you.

Pivot the user revenues by month query so that the user ID is a row and each month from June to August 2018 is a column.

-- Import tablefunc
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT * FROM CROSSTAB($$
  SELECT
    user_id,
    DATE_TRUNC('month', order_date) :: DATE AS delivr_month,
    SUM(meal_price * order_quantity) :: FLOAT AS revenue
  FROM meals
  JOIN orders ON meals.meal_id = orders.meal_id
 WHERE user_id IN (0, 1, 2, 3, 4)
   AND order_date < '2018-09-01'
 GROUP BY user_id, delivr_month
 ORDER BY user_id, delivr_month;
$$)
-- Select user ID and the months from June to August 2018
AS ct (user_id INT,
       "2018-06-01" FLOAT,
       "2018-07-01" FLOAT,
       "2018-08-01" FLOAT)
ORDER BY user_id ASC;

Costs

The C-level execs next tell Eve that they want a report on the total costs by eatery in the last two months.

First, write a query to get the total costs by eatery in November and December 2018, then pivot by month.

Note: Recall from Chapter 1 that total cost is the sum of each meal's cost times its stocking quantity.

[we have to note something interesting that is. We can get a clue of what we are looking for by looking at the question. The question will tell us what we need to pivot by and for what measures.- in this question, it already mentioned that we are looking to build a table with " totals costs of each eatery by last two months]

-- Import tablefunc
create extension if not exists tablefunc;

SELECT * FROM CROSSTAB($$
  SELECT
    -- Select eatery and calculate total cost
    eatery,
    DATE_TRUNC('month', stocking_date) :: DATE AS delivr_month,
    SUM(meal_cost * stocked_quantity) :: FLOAT AS cost
  FROM meals
  JOIN stock ON meals.meal_id = stock.meal_id
  -- Keep only the records after October 2018
  WHERE DATE_TRUNC('month', stocking_date) > '2018-10-01'
  GROUP BY eatery, delivr_month
  ORDER BY eatery, delivr_month;
$$)

-- Select the eatery and November and December 2018 as columns
AS ct (eatery TEXT,
       "2018-11-01" FLOAT,
       "2018-12-01" FLOAT)
ORDER BY eatery ASC;

results 


Producing executive reports

Executive summary 1: 

a) aggregation and formatting 

b) use of rank

c) crosstab for pivoting the table

Executive report

Eve wants to produce a final executive report about the rankings of eateries by the number of unique users who order from them by quarter. She said she'll handle the pivoting, so you only need to prepare the source table for her to pivot.

Send Eve a table of unique ordering users by eatery and by quarter.

visit the link to practice on datacamp: Executive report | SQL (datacamp.com)

-- Import tablefunc
Create extension if not exists tablefunc;

-- Pivot the previous query by quarter
select * from crosstab ($$
  WITH eatery_users AS  (
    SELECT
      eatery,
      -- Format the order date so "2018-06-01" becomes "Q2 2018"
      TO_CHAR(order_date, '"Q"Q YYYY') AS delivr_quarter,
      -- Count unique users
      COUNT(DISTINCT user_id) AS users
    FROM meals
    JOIN orders ON meals.meal_id = orders.meal_id
    GROUP BY eatery, delivr_quarter
    ORDER BY delivr_quarter, users)

  SELECT
    -- Select eatery and quarter
    eatery,
    delivr_quarter,
    -- Rank rows, partition by quarter and order by users
    RANK() OVER
      (PARTITION BY delivr_quarter
       ORDER BY users DESC) :: INT AS users_rank
  FROM eatery_users
  ORDER BY eatery, delivr_quarter;
$$)
-- Select the columns of the pivoted table
AS  ct (eatery TEXT,
        "Q2 2018" INT,
        "Q3 2018" INT,
        "Q4 2018" INT)
ORDER BY "Q4 2018";

result