--We want to look at number of the charts created monthly by group of months
--then we will want to add another column to start counting the charts with the offset of 1 using lag(count(*),1) over timestamp
--we will want to calculate the difference of changes in the percentage format. 100*(count(*)- lag(count(*),1) over(order by timestamp)) / lag(count(*),1)
select date_trunc('month',timestamp) as date,
count(*) as count,
from events
where event_name = 'created charts'
group by 1
order by 1
How many distinct values of zip appear in at least 100 rows?
first of all we will group by zip
then count all rows
use having clause to find distinct zip above 100 counts of all records.
this will include null as well.
we will select zip and count all.
select zip, count(*)
from evanston311
group by zip
having count(*)>=100;
How many distinct values of source appear in at least 100 rows?
we will group by source,
we will count all records, these counts of records will be put in the groups of source,
we will select source and counts of all records.
-- Find values of source that appear in at least 100 rows
-- Also get the count of each value
SELECT source, count(*)
FROM evanston311
group by source
having count(*)>=100;
-- Find the 5 most common values of street and the count of each
--...
-- we will first group by street, and put street counts into these groups
-- find out how many times a street name repeated
-- we will order the streets counts in desc order
-- we will then select street and count the number of the streets.
SELECT street, count(*)
FROM evanston311
group by street
order by count(*) Desc
limit 5;
Upper and lower
We convert all upper and lowercases in order to remove confusions:
select *
from potheads
where upper(names)='shariful';
but, what if we say we are still missing some of the sharifuls that has space at the beginning or at the end? in that case, we might want to use like operators:
so in our case, like operators would be writter similar to bellow:
WHERE name LIKE '%shariful%' ;
(but, the above sharifuls be all lower cases )
WHERE name ILIKE '%shariful%' ;
If we add i to the like, it will become insensitive to the cases.
SELECT trim(' abc ');
--results: 'abc' spaces are trimmed ltrim removes the left and rtrim removes the space from right side of it.
you can also trim other things such as a letter, exclamation marks and likes of these using the trim functions.
select trim('Wow!, '!'); -- from here, ! -mark will be removed.
We can combine other functions as well.
select trim(lower('Wow!'), '!w');
--results is going to be -o; because we have been able to remove all the lower case 'double-U's.
Some of the street
values in evanston311
include house numbers with #
or /
in them. In addition, some street values end in a .
.
Remove the house numbers, extra punctuation, and any spaces from the beginning and end of the street
values as a first attempt at cleaning up the values.
Answer:
Select Distinct Street,
trim(street, '0123456789 #/. ') AS cleaned_street
From evanston311
Order By cleaned_street; --because this is our new streets column.
Building up the query through the steps below, find inquires that
mention trash or garbage in the description without trash or garbage
being in the category. What are the most frequent categories for such
inquiries?
lets breakdown the question:
we will have to find out records (i.e inquiries) those have trash or garbage these two words in the description but not in the category.
which of these categories were very frequent? in other words, which categories have higher count for having trash and garbage in the description.
-- first count the rows where description contains trash and garbage.
select count (*)
from evanston311
where description ILIKE '%trash%'
or description ILIKE '%garbage%' ;
-- secondly we will select the categories where categories do not contain trash and garbage, at the same time previous condition is added to it.
Select category, count(*)
from evanston311
where (Description ILIKE '%Trash%'
Or Description ILIKE '%Garbage%')
And category NOT LIKE '%Trash%'
And category NOT LIKE '%Garbage%' -- look at this, I have used AND operator because we do not want either of these two words trash/garbage in our category.
--Lastly, we want to order by count of all in desc order and get the top most categories containing trash and garbage in the description:
-- Count rows with each category
SELECT category, count(*)
FROM evanston311
WHERE (description ILIKE '%trash%'
OR description ILIKE '%garbage%')
AND category NOT LIKE '%Trash%' ---none of the two words can be here
AND category NOT LIKE '%Garbage%' --- none the two words can be here
-- What are you counting?
GROUP BY category
--- order by most frequent values
ORDER BY count desc
LIMIT 10;
Splitting and concating
Great! When joining values that might be NULL
with a separator between them, consider using the concat_ws()
function, which you can read about in the PostgreSQL documentation, to avoid duplicate or unnecessary separators in the result.
SELECT trim(concat_ws(' ', house_num, street)) AS address
FROM evanston311;
select
left('abcde',2) --- =will keep first two character = ab
right('abcde', 2) --- = will keep last two character = de
select substring('abcde' from 2 for 3) --- = cde
-- in this case, the first two will be skipped and next three will be returned
-- Select the first word of the street value
SELECT split_part(street,' ',1) AS street_name,
count(*)
FROM evanston311
GROUP BY street_name
ORDER BY count DESC
LIMIT 20;
The description
column of evanston311
can be very long. You can get the length of a string with the length()
function.
For displaying or quickly reviewing the data, you might want to only display the first few characters. You can use the left()
function to get a specified number of characters at the start of each value.
To indicate that more data is available, concatenate '...'
to the end of any shortened description
. To do this, you can use a CASE WHEN
statement to add '...'
only when the string length is greater than 50.
Select the first 50 characters of description
when description
starts with the word "I".
-- Select the first 50 chars when length is greater than 50
SELECT CASE WHEN length(description)>50 --length of the description character 50 theke beshi
THEN left(description, 50) || '...'
-- otherwise just select description
ELSE description
END
FROM evanston311
-- limit to descriptions that start with the word I
WHERE description LIKE 'I %' -- I word diye shuru jegula hoi khali ogulai nibo
ORDER BY description;-- like can be uttered as has (i mean where this
has I as a starting word)
select case when length(description)>50
then left(description,50) || '...'
ELSE description
END
from evanston311
where description like 'I %'
Order by description;
If we want to summarize Evanston 311 requests by zip code, it would be useful to group all of the low frequency zip codes together in an "other" category.
Which of the following values, when substituted for ???
in the query, would give the result below?
Query:
SELECT CASE WHEN zipcount < 100 THEN 'other'
ELSE zip
END AS zip_recoded,
sum(zipcount) AS zipsum
FROM (SELECT zip, count(*) AS zipcount
FROM evanston311
GROUP BY zip) AS fullcounts
GROUP BY zip_recoded
ORDER BY zipsum DESC;
Group and recode values
There are almost 150 distinct values of evanston311.category
. But some of these categories are similar, with the form "Main Category - Details". We can get a better sense of what requests are common if we aggregate by the main category.
To do this, create a temporary table recode
mapping distinct category
values to new, standardized
values. Make the standardized
values the part of the category before a dash ('-'
). Extract this value with the split_part()
function:
split_part(string text, delimiter text, field int)
You'll also need to do some additional cleanup of a few cases that don't fit this pattern.
Then the evanston311
table can be joined to recode
to group requests by the new standardized
category values.
-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
SELECT DISTINCT category,
rtrim(split_part(category, '-', 1)) AS standardized
FROM evanston311;-- made two columns and cleaned
UPDATE recode SET standardized='Trash Cart'
WHERE standardized LIKE 'Trash%Cart';-- je line a trash ar cart ase
UPDATE recode SET standardized='Snow Removal'
WHERE standardized LIKE 'Snow%Removal%';-- jekhane snow removal ase
UPDATE recode SET standardized='UNUSED'
WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart',
'(DO NOT USE) Water Bill',
'DO NOT USE Trash', 'NO LONGER IN USE');-- jekhan
--egula ase
-- Select the recoded categories and the count of each
-- we will conduct a join because temp table cannot print the result
--alone.
SELECT standardized, count(evanston311.category)
-- From the original table and table with recoded values
FROM evanston311
Left JOIN recode
-- What column do they have in common?
ON evanston311.category=recode.category --joining with oldone
--because new one has nothing in common witht the old one.
-- What do you need to group by to count?
GROUP BY standardized
-- Display the most common val values first
ORDER BY count(evanston311.category) desc;
Logical Steps:
-- query result table will show new corrected column and the count column for the count of category.
--for doing so... Need to make a temp table with one original column and another with corrected column
--then clean the temp table's correctedcolumn
--then join the temp table's original column with the actual table's original column. Because, we need to count number of entries from the actual table. In the temp table, we only have distinct values.
-- then group the result table by new corrected column of the temp table (standardized column in our case) and order by the counts of the old table's original column.
We have solved the problem in four steps:
1. Create recode
with a standardized
column; use split_part()
and then rtrim()
to remove any remaining whitespace on the result of split_part()
.
2. UPDATE
standardized
values LIKE
'Trash%Cart'
to 'Trash Cart'.
UPDATE
standardized
values of 'Snow Removal/Concerns' and 'Snow/Ice/Hazard Removal' to 'Snow Removal'.
3. UPDATE
recode
by setting standardized
values of 'THIS REQUEST IS INACTIVE…Trash Cart', '(DO NOT USE) Water Bill', 'DO NOT USE Trash', and 'NO LONGER IN USE' to 'UNUSED'.
4. Now, join the evanston311
and recode
tables to count the number of requests with each of the standardized
values
- List the most common
standardized
values first.
- This is what the result will look like
Broken Parking Meter | 6092 |
Trash | 3699 |
Ask A Question / Send A Message | 2595 |
Trash Cart | 1902 |
Tree Ev
........
Create a table with indicator variablesDetermine whether medium and high priority requests in the evanston311 data are more likely to contain requesters' contact information: an email address or phone number. - Emails contain an @.
- Phone numbers have the pattern of three characters, dash, three characters, dash, four characters. For example: 555-555-1212.
Use LIKE to match these patterns. Remember % matches any number of characters (even 0), and _ matches a single character. Enclosing a pattern in % (i.e. before and after your pattern) allows you to locate it within other text. For example, '%___.com%' would allow you to search for a reference to a website with the top-level domain '.com' and at least three characters preceding it. Create and store indicator variables for email and phone in a temporary table. LIKE produces True or False as a result, but casting a boolean (True or False) as an integer converts True to 1 and False to 0. This makes the values easier to summarize later.
How will I think to solve the problem? - We have a column of priority and we need to find out which priority group more likely to contain email and phone number. - We need to make a temp table of id, email and phone number. -This is what we need to show: we need to select from the main table and find out how many rows contain emails and how many rows contain phone number in a proportion. The formula is : rows with email/total number of rows and same for the phone number
- After we have created the three columns we want to see in our final result, we need to join temp table and original table. Because all the ids are available in the original column. -Group by the priority at the end.
Steps:
Step1 Create a temp table indicators from evanston311 with three columns: id , email , and phone . Use LIKE comparisons to detect the email and phone patterns that are in the description , and cast the result as an integer with CAST() . - Your phone indicator should use a combination of underscores
_ and dashes - to represent a standard 10-digit phone number format. - Remember to start and end your patterns with
% so that you can locate the pattern within other text!
Step2
- Join the
indicators table to evanston311 , selecting the proportion of reports including an email or phone grouped by priority . - Include adjustments to account for issues arising from integer division.
-- To clear table if it already exists DROP TABLE IF EXISTS indicators; -- Create the temp table CREATE TEMP TABLE indicators AS SELECT id, CAST (description LIKE '%@%' AS integer) AS email, CAST (description LIKE '%___-___-____%' AS integer) AS phone FROM evanston311; -- Select the column you'll group by SELECT priority, -- Compute the proportion of rows with each indicator SUM(email)/count(*)::numeric AS email_prop, SUM(phone)/count(*)::numeric AS phone_prop -- Tables to select from FROM evanston311 left JOIN indicators -- Joining condition ON evanston311.id=indicators.id -- What are you grouping by? GROUP BY priority;
Time Series:
Date comparisonsWhen working with timestamps, sometimes you want to find all observations on a given day. However, if you specify only a date in a comparison, you may get unexpected results. This query: SELECT count(*)
FROM evanston311
WHERE date_created = '2018-01-02';
returns 0, even though there were 49 requests on January 2, 2018. This is because dates are automatically converted to timestamps when compared to a timestamp. The time fields are all set to zero: SELECT '2018-01-02'::timestamp;
2018-01-02 00:00:00
When working with both timestamps and dates, you'll need to keep this in mind. - Count the number of Evanston 311 requests created on January 31, 2017 by casting
date_created to a date .
-- Count requests created on January 31, 2017 SELECT count(*) FROM evanston311 WHERE date_created::date = '2017-01-31';
- Count the number of Evanston 311 requests created on February 29, 2016 by using
>= and < operators. - -- Count requests created on February 29, 2016
SELECT count(*) FROM evanston311 WHERE date_created >= '2016-02-29' AND date_created < '2016-03-01';
|
....... |
No comments:
Post a Comment