My SQL Progress Throughout the Summer (Part 1)
Table of Contents
Introduction
As summer comes to an end, I want to recount all the things that I've learned, specifically for a language called SQL, which allows a user to communicate with a database. I started learning SQL in the beginning of the summer because, next to Excel, it is one of the most in-demand skills that are required to become a data scientist. I've been using this Udemy course called "The Complete SQL Bootcamp 2022: Go from Zero to Hero" that has helped me learn and practice the essential SQL concepts. I want to provide a quick summary, or cheat sheet, of some important keywords that you can use for the SELECT statement! This is a Part 1 of my post, and I will create a Part 2 in the future covering creating and modifying a table, so stay tuned for that!
Anyways, without furtherado, here is a recap of a few important things about the SELECT statement:
The SELECT Statement
The SELECT keyword is the bread and butter for writing queries. It is very simple yet powerful. You can do so many things with the SELECT statement, but I will save those for later.
For right now, here's the syntax for selecting one column:
SELECT column_name FROM table_name;
Syntax for selecting multiple columns:
SELECT col_name1, col_name2 FROM table_name;
Syntax for selecting everything:
SELECT * FROM table_name;
What a SELECT statement will do is return the column(s) that you want from a table. From here on out, things are going to be more complicated than this...
SELECT DISTINCT
Now it's getting a bit more complicated. You can use the DISTINCT keyword combined with the SELECT statement to return all the UNIQUE VALUES from a single column of a table.
Syntax for SELECT DISTINCT:
SELECT DISTINCT(col_name) FROM table_name;
Let's say that we have a table called customers that stores who the customers from a certain store are:
customer_id | customer_name |
---|---|
1 | David |
2 | Mary |
3 | Sarah |
4 | David |
If we want to find all of the unique customer names, we can use a SELECT DISTINCT statement like so:
SELECT DISTINCT(customer_name) FROM customers;
Running this query will return a list of unique names:
customer_name |
---|
David |
Mary |
Sarah |
Note that this list only returns 3 names, whereas the original table had 4 names. This is because "David" appears in the table twice, and since SELECT DISTINCT only returns the unique values, "David" is only returned once.
COUNT
The COUNT keyword does just what it sounds. It counts the rows in the table. In more complex terms, it returns the number of input rows that match a specific condition of a query.
You use COUNT with the SELECT statement. Here's what the syntax looks like:
SELECT COUNT(*) FROM table_name;
Let's use the customers table from the previous example again:
customer_id | customer_name |
---|---|
1 | David |
2 | Mary |
3 | Sarah |
4 | David |
To use the COUNT keyword, we can do this in our query:
SELECT COUNT(*) FROM customers;
What this code does is return the number of rows in the table. It's that simple.
Now you can make it more complex by using COUNT and DISTINCT together. Doing this will count the number of unique values in the table.
For example,
SELECT COUNT(DISTINCT(customer_name)) FROM customers;
will return 3 because there are 3 unique values in the column called "customer_name".
In most cases, using the COUNT keyword won't be this easy. You will often use a WHERE statement to specify the condition for the count of number of rows.
SELECT WHERE
The SELECT WHERE statement is probably the most important out of all the other keywords that I talk about in this post.
What the WHERE statement does is that it allows you to specify the conditions for the rows in the table to be returned. If you have some experience in a programming lanaguage such as Python, then think of WHERE as an "if" statement.
General syntax:
SELECT column_name
FROM table_name
WHERE condition(s);
Let's use an example to make this more clear, using a table called "customer" (similar to the previous example):
customer_id | customer_name | customer_age |
---|---|---|
1 | David | 25 |
2 | Mary | 40 |
3 | Sarah | 60 |
4 | David | 65 |
Let's say we say something like this:
SELECT customer_name
FROM customer
WHERE customer_age >= 60;
This query will return two names: "Sarah" and "David". This is because the customers with a name of Sarah and David have an age at least 60, meaning that they meet the condition in the WHERE statement.
Another cool thing we can do is use something called the AND keyword. This basically allows you to combine multiple conditions in a single WHERE statement. Keep in mind that if you use AND, then all of the conditions must be true for the rows to be returned.
Using the same example above, if we say
SELECT customer_name
FROM customer
WHERE customer_age < 63 AND customer_age > 35;
the query will return "Mary" and "Sarah". This is because those people are the only ones that meet BOTH of the age requirements listed in the WHERE statement.
GROUP BY
In order to use the GROUP BY keyword, you must understand what aggregation functions are.
Aggregation functions take multiple inputs from a whole column and return a single result. Examples include AVG(), COUNT(), MAX(), MIN(), SUM().
Let's have a sample table called "players" to show how aggregation functions work:
player_id | player_height_inches | player_location |
---|---|---|
1 | 70 | Seattle, Washington |
2 | 75 | San Francisco, California |
3 | 80 | Toronto, Ontario |
4 | 67 | Boston, Massachusetts |
Doing
SELECT AVG(player_height_inches)
FROM players
will return 73 because the mean was returned.
Ok, now back to the GROUP BY.
Basically, GROUP BY groups the values in a CATEGORICAL column and perfoms an aggregation function on a QUANTITATIVE column and returns the aggregated quantitative values for each value in the categorical column.
If that sounded confusing, then consider this table called "players":
player_id | player_height_inches | player_location |
---|---|---|
1 | 70 | Seattle, Washington |
2 | 75 | San Francisco, California |
3 | 80 | Toronto, Ontario |
4 | 67 | Boston, Massachusetts |
5 | 65 | Seattle, Washington |
6 | 71 | Seattle, Washington |
7 | 74 | Toronto, Ontario |
8 | 66 | San Francisco, California |
9 | 69 | Toronto, Ontario |
10 | 73 | Boston, Massachusetts |
If we wanted to find the maximum height for each player from a specific location, then we can use GROUP BY. Here's how we do it:
SELECT player_location, MAX(player_height_inches)
FROM players
GROUP BY player_location;
This will return something that looks like this:
player_location | player_height_inches |
---|---|
Seattle, Washington | 71 |
San Francisco, California | 75 |
Toronto, Ontario | 80 |
Boston, Massachusetts | 73 |
Notice that in the query, the categorical column that is NOT being aggregated "player_location" is the one that is being grouped. NEVER group by the quantitative column, which in this case is "player_height_inches".
Also, the GROUP BY keyword should come right after the FROM statement. While the query above looks quite simple, it can get pretty complicated once add WHERE statements and other SQL commands.
Conclusion
I hope you've learned some of the basics of SQL in this post! All of these things are NOT a comprehensive list of the whole language. I only covered a glimpse of SQL, and to dive deeper, you can view a Postgres documentation at https://www.postgresql.org/docs/15/index.html. If you want, please comment down below what SQL commands you use the most often in your job.
Thank you for reading!