Tuesday, 7 June 2022

SQL Notes

--We want to look at the number of charts created per month, grouped by month

--then we want to add another column holding the previous month's count, using lag(count(*), 1) over (order by date_trunc('month', timestamp))

--then we want to express the month-over-month change as a percentage: 100 * (count(*) - lag(count(*), 1) over (order by date_trunc('month', timestamp))) / lag(count(*), 1) over (order by date_trunc('month', timestamp))


select date_trunc('month', timestamp) as date,
       count(*) as count
from events
where event_name = 'created charts'
group by 1
order by 1;
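
A sketch of the full query the comments above describe, assuming the same events table; the aliases prev_count and pct_change are illustrative, and 100.0 is used so the division is not truncated to an integer:

-- Sketch: monthly chart-creation counts, the previous month's count, and the percent change
select date_trunc('month', timestamp) as date,
       count(*) as count,
       lag(count(*), 1) over (order by date_trunc('month', timestamp)) as prev_count,
       100.0 * (count(*) - lag(count(*), 1) over (order by date_trunc('month', timestamp)))
             / lag(count(*), 1) over (order by date_trunc('month', timestamp)) as pct_change
from events
where event_name = 'created charts'
group by 1
order by 1;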



How many distinct values of zip appear in at least 100 rows?


first of all we will group by zip,

then count all rows in each group,

and use a HAVING clause to keep only the zip values with at least 100 records.

note that NULL is treated as one of the groups as well.

we will select zip and the count of all rows.


select zip, count(*)

from evanston311

group by zip

having count(*)>=100;
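
If you want the answer as a single number rather than reading the list, one option (a sketch, not part of the course answer) is to wrap the grouped query in an outer count; the alias zip_counts is illustrative:

-- Sketch: count how many zip values appear in at least 100 rows
select count(*)
from (select zip
      from evanston311
      group by zip
      having count(*) >= 100) as zip_counts;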


How many distinct values of source appear in at least 100 rows?


we will group by source,

count all records within each source group,

and select source along with the count of all records, keeping only sources with at least 100 rows.


-- Find values of source that appear in at least 100 rows

-- Also get the count of each value

SELECT source, count(*)

  FROM evanston311

 group by source

having count(*)>=100;


-- Find the 5 most common values of street and the count of each

--...

-- we will first group by street,

-- then count how many times each street name is repeated,

-- order the street counts in descending order,

-- and select street along with its count, keeping the top 5.

SELECT street, count(*)

  FROM evanston311

 group by street

 order by count(*) Desc

 limit 5;

Upper and lower
We convert values to a single case so that differences in capitalization do not cause us to miss matches:

select *
from potheads
where lower(names) = 'shariful';

But what if we are still missing some of the sharifuls that have a space at the beginning or at the end? In that case, we might want to use the LIKE operator:

In our case, the LIKE operator would be written similar to the below:
WHERE name LIKE '%shariful%';

(but the above only matches sharifuls that are all lowercase)

WHERE name ILIKE '%shariful%';
If we add I to LIKE, the match becomes case-insensitive.
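
A small sketch pulling these together, assuming the same potheads table and names column as above; the first query cleans and matches exactly, the second matches the pattern anywhere, case-insensitively:

-- Sketch: exact match after trimming spaces and lowering the case
select *
from potheads
where lower(trim(names)) = 'shariful';

-- Sketch: case-insensitive pattern match anywhere in the value
select *
from potheads
where names ILIKE '%shariful%';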

SELECT trim(' abc ');
-- result: 'abc'. Leading and trailing spaces are trimmed; ltrim removes only from the left and rtrim removes only from the right.
You can also trim other characters, such as a letter or an exclamation mark, by passing them to the trim functions.

select trim('Wow!', '!'); -- the '!' is removed, leaving 'Wow'

We can combine other functions as well.
select trim(lower('Wow!'), '!w');
-- the result is 'o': lower() turns the string into 'wow!', and trim then strips the 'w' and '!' characters from both ends.
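
Since ltrim and rtrim are mentioned above but not shown, a minimal sketch of how they behave on the same input:

select ltrim(' abc ');  -- 'abc ' : only the leading space is removed
select rtrim(' abc ');  -- ' abc' : only the trailing space is removed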

Some of the street values in evanston311 include house numbers with # or / in them. In addition, some street values end in a '.'.

Remove the house numbers, extra punctuation, and any spaces from the beginning and end of the street values as a first attempt at cleaning up the values.

Answer:
SELECT DISTINCT street,
       trim(street, '0123456789 #/. ') AS cleaned_street
FROM evanston311
ORDER BY cleaned_street; -- order by the new, cleaned street column
Building up the query through the steps below, find inquiries that mention trash or garbage in the description without trash or garbage being in the category. What are the most frequent categories for such inquiries?
Let's break down the question:
we have to find the records (i.e. inquiries) that have the words trash or garbage in the description but not in the category.
Which of these categories are most frequent? In other words, which categories have the highest count of rows mentioning trash or garbage in the description?

-- first count the rows where the description contains trash or garbage.
select count (*)
from evanston311
where description ILIKE '%trash%'
or description ILIKE '%garbage%' ;


-- secondly, we select the categories that do not contain trash or garbage, while keeping the previous condition on the description.
SELECT category, count(*)
FROM evanston311
WHERE (description ILIKE '%Trash%'
    OR description ILIKE '%Garbage%')
AND category NOT LIKE '%Trash%'
AND category NOT LIKE '%Garbage%' -- note the AND: we do not want either of the words trash/garbage in the category
GROUP BY category;

--Lastly, we order by the count in descending order to get the most frequent categories that mention trash or garbage in the description:

-- Count rows with each category
SELECT category, count(*)
FROM evanston311
WHERE (description ILIKE '%trash%'
OR description ILIKE '%garbage%')
AND category NOT LIKE '%Trash%' -- neither of the two words can appear here
AND category NOT LIKE '%Garbage%' -- neither of the two words can appear here
-- What are you counting?
GROUP BY category
--- order by most frequent values
ORDER BY count desc
LIMIT 10;


Splitting and concatenating

Great! When joining values that might be NULL with a separator between them, consider using the concat_ws() function, which you can read about in the PostgreSQL documentation, to avoid duplicate or unnecessary separators in the result.


SELECT trim(concat_ws(' ', house_num, street)) AS address
FROM evanston311;
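
To see why concat_ws() helps here, a small sketch comparing it with plain concat() when house_num is NULL; the street value shown in the comments is just an illustration:

-- Sketch: concat() treats NULL as an empty string, leaving a stray leading space;
-- concat_ws() skips NULL arguments entirely
SELECT concat(house_num, ' ', street)    AS with_concat,    -- e.g. ' Main St' when house_num is NULL
       concat_ws(' ', house_num, street) AS with_concat_ws  -- e.g. 'Main St'
FROM evanston311;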


select
left('abcde', 2),  -- keeps the first two characters = 'ab'
right('abcde', 2); -- keeps the last two characters = 'de'

select substring('abcde' from 2 for 3); -- = 'bcd'
-- substring starts at position 2 (so only the first character is skipped) and returns the next three characters

-- Select the first word of the street value
SELECT split_part(street,' ',1) AS street_name,
count(*)
FROM evanston311
GROUP BY street_name
ORDER BY count DESC
LIMIT 20;


The description column of evanston311 can be very long. You can get the length of a string with the length() function.

For displaying or quickly reviewing the data, you might want to only display the first few characters. You can use the left() function to get a specified number of characters at the start of each value. 

To indicate that more data is available, concatenate '...' to the end of any shortened description. To do this, you can use a CASE WHEN statement to add '...' only when the string length is greater than 50.

Select the first 50 characters of description when description starts with the word "I".


-- Select the first 50 chars when length is greater than 50
SELECT CASE WHEN length(description)>50 --length of the description character 50 theke beshi
THEN left(description, 50) || '...'
-- otherwise just select description
ELSE description
END
FROM evanston311
-- limit to descriptions that start with the word I
WHERE description LIKE 'I %' -- I word diye shuru jegula hoi khali ogulai nibo
ORDER BY description;-- like can be uttered as has (i mean where this
has I as a starting word)



If we want to summarize Evanston 311 requests by zip code, it would be useful to group all of the low frequency zip codes together in an "other" category. 

Which of the following values, when substituted for ??? in the query, would give the result below?

Query:

SELECT CASE WHEN zipcount < 100 THEN 'other'
       ELSE zip
       END AS zip_recoded,
       sum(zipcount) AS zipsum
  FROM (SELECT zip, count(*) AS zipcount
          FROM evanston311
         GROUP BY zip) AS fullcounts
 GROUP BY zip_recoded
 ORDER BY zipsum DESC;

Group and recode values

There are almost 150 distinct values of evanston311.category. But some of these categories are similar, with the form "Main Category - Details". We can get a better sense of what requests are common if we aggregate by the main category. 

To do this, create a temporary table recode mapping distinct category values to new, standardized values. Make the standardized values the part of the category before a dash ('-'). Extract this value with the split_part() function:

split_part(string text, delimiter text, field int)
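
For example, a quick sketch of what split_part() does to one of these category strings (the category value here is illustrative):

select split_part('Trash - Missed Pickup', '-', 1);  -- 'Trash ' (note the trailing space, which is why rtrim is used below)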

You'll also need to do some additional cleanup of a few cases that don't fit this pattern. 

Then the evanston311 table can be joined to recode to group requests by the new standardized category values.

-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
SELECT DISTINCT category,
       rtrim(split_part(category, '-', 1)) AS standardized
FROM evanston311; -- two columns: the original category and the cleaned value
UPDATE recode SET standardized = 'Trash Cart'
WHERE standardized LIKE 'Trash%Cart'; -- rows where both Trash and Cart appear
UPDATE recode SET standardized = 'Snow Removal'
WHERE standardized LIKE 'Snow%Removal%'; -- rows that mention Snow Removal
UPDATE recode SET standardized = 'UNUSED'
WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart',
                       '(DO NOT USE) Water Bill',
                       'DO NOT USE Trash', 'NO LONGER IN USE'); -- rows with any of these inactive labels

-- Select the recoded categories and the count of each
-- we need a join because the temp table alone cannot give the counts
SELECT standardized, count(evanston311.category)
-- From the original table and the table with recoded values
FROM evanston311
LEFT JOIN recode
-- What column do they have in common?
ON evanston311.category = recode.category -- join on the original category column,
-- because the standardized column has nothing in common with the original table
-- What do you need to group by to count?
GROUP BY standardized
-- Display the most common values first
ORDER BY count(evanston311.category) DESC;

Logical Steps:

-- the result will show the new corrected column and a count column for the number of requests in each category.

-- to do so, we need a temp table with one original column and another corrected column,

-- then clean the temp table's corrected column,

-- then join the temp table's original column to the actual table's original column, because we need to count entries from the actual table; the temp table only holds distinct values,

-- then group the result by the temp table's corrected column (standardized in our case) and order by the count of the original table's category column.


We have solved the problem in four steps:

1. Create recode with a standardized column; use split_part() and then rtrim() to remove any remaining whitespace on the result of split_part().

2. UPDATE standardized values LIKE 'Trash%Cart' to 'Trash Cart'.
  • UPDATE standardized values of 'Snow Removal/Concerns' and 'Snow/Ice/Hazard Removal' to 'Snow Removal'.

3. UPDATE recode by setting standardized values of 'THIS REQUEST IS INACTIVE...Trash Cart', '(DO NOT USE) Water Bill', 'DO NOT USE Trash', and 'NO LONGER IN USE' to 'UNUSED'.

4. Now, join the evanston311 and recode tables to count the number of requests with each of the standardized values
  • List the most common standardized values first.

  • This is what the result will look like:

    standardized                        count
    Broken Parking Meter                 6092
    Trash                                3699
    Ask A Question / Send A Message      2595
    Trash Cart                           1902
    Tree Ev ........

Create a table with indicator variables

Determine whether medium and high priority requests in the evanston311 data are more likely to contain requesters' contact information: an email address or phone number.

  • Emails contain an @.
  • Phone numbers have the pattern of three characters, dash, three characters, dash, four characters. For example: 555-555-1212.

Use LIKE to match these patterns. Remember % matches any number of characters (even 0), and _ matches a single character. Enclosing a pattern in % (i.e. before and after your pattern) allows you to locate it within other text. 

For example, '%___.com%' would allow you to search for a reference to a website with the top-level domain '.com' and at least three characters preceding it.

Create and store indicator variables for email and phone in a temporary table.  LIKE produces True or False as a result, but casting a boolean (True or False) as an integer converts True to 1 and False to 0. This makes the values easier to summarize later.
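
A tiny sketch of the boolean-to-integer cast described above; the literal strings are made up for illustration:

-- Sketch: LIKE returns a boolean; casting to integer gives 1 for True and 0 for False
SELECT ('call me at 555-555-1212' LIKE '%___-___-____%')::integer AS phone,  -- 1
       ('no contact info here' LIKE '%@%')::integer AS email;                -- 0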


How will I think to solve the problem?

- We have a priority column and we need to find out which priority group is more likely to contain an email address or phone number.

- We need to make a temp table of id, email, and phone indicators.

- This is what we need to show: selecting from the main table, find what proportion of rows contain an email and what proportion contain a phone number. The formula is: rows with email / total number of rows, and the same for phone.

- After we have created the three columns we want in our final result, we join the temp table to the original table, because all of the ids are available in the original table.

- Group by priority at the end.


Steps: Step 1
  • Create a temp table indicators from evanston311 with three columns: id, email, and phone.

  • Use LIKE comparisons to detect the email and phone patterns that are in the description, and cast the result as an integer with CAST().

    • Your phone indicator should use a combination of underscores _ and dashes - to represent a standard 10-digit phone number format.
    • Remember to start and end your patterns with % so that you can locate the pattern within other text!
Step 2
  • Join the indicators table to evanston311, selecting the proportion of reports including an email or phone, grouped by priority.
  • Include adjustments to account for issues arising from integer division.
  • -- To clear table if it already exists
    DROP TABLE IF EXISTS indicators;

    -- Create the temp table
    CREATE TEMP TABLE indicators AS
    SELECT id,
    CAST (description LIKE '%@%' AS integer) AS email,
    CAST (description LIKE '%___-___-____%' AS integer) AS phone
    FROM evanston311;
    -- Select the column you'll group by
    SELECT priority,
    -- Compute the proportion of rows with each indicator
    SUM(email)/count(*)::numeric AS email_prop,
    SUM(phone)/count(*)::numeric AS phone_prop
    -- Tables to select from
    FROM evanston311
    left JOIN indicators
    -- Joining condition
    ON evanston311.id=indicators.id
    -- What are you grouping by?
    GROUP BY priority;
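
Because the indicators are already 0/1 integers, an alternative sketch (not the course answer) is to let avg() compute the proportion directly, which also avoids the integer-division issue:

-- Sketch: the average of a 0/1 indicator is the proportion of rows where it is 1
SELECT priority,
       avg(email) AS email_prop,
       avg(phone) AS phone_prop
FROM evanston311
LEFT JOIN indicators
ON evanston311.id = indicators.id
GROUP BY priority;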
Time Series:

Date comparisons

When working with timestamps, sometimes you want to find all observations on a given day. However, if you specify only a date in a comparison, you may get unexpected results. This query:

SELECT count(*) 
  FROM evanston311
 WHERE date_created = '2018-01-02';

returns 0, even though there were 49 requests on January 2, 2018. 

This is because dates are automatically converted to timestamps when compared to a timestamp. The time fields are all set to zero:

SELECT '2018-01-02'::timestamp;
 2018-01-02 00:00:00

When working with both timestamps and dates, you'll need to keep this in mind.

  • Count the number of Evanston 311 requests created on January 31, 2017 by casting date_created to a date.
-- Count requests created on January 31, 2017
SELECT count(*)
FROM evanston311
WHERE date_created::date = '2017-01-31';
  • Count the number of Evanston 311 requests created on February 29, 2016 by using >= and < operators.
-- Count requests created on February 29, 2016
SELECT count(*)
FROM evanston311
WHERE date_created >= '2016-02-29'
AND date_created < '2016-03-01';

.......
