Table of Contents
Introduction to Databases and SQL
1.1 What is a Database?
1.2 Why SQL for Data Analysis?
1.3 Types of Databases: Relational vs. Non-Relational
1.4 Real-World Example: Managing a Retail Inventory Database
Basic SQL Queries: SELECT, WHERE, ORDER BY, GROUP BY
2.1 Understanding SELECT Statements
2.2 Filtering with WHERE
2.3 Sorting with ORDER BY
2.4 Aggregating with GROUP BY
2.5 Real-World Example: Analyzing Sales Data
2.6 Best Practices and Exception Handling
Joins: INNER, LEFT, RIGHT, FULL
3.1 What are Joins?
3.2 INNER JOIN: Combining Matching Data
3.3 LEFT, RIGHT, and FULL OUTER Joins
3.4 Real-World Example: Customer and Order Analysis
3.5 Best Practices and Exception Handling
Aggregations and Subqueries
4.1 Aggregate Functions: COUNT, SUM, AVG, MAX, MIN
4.2 Writing Subqueries
4.3 Real-World Example: Employee Performance Metrics
4.4 Best Practices and Exception Handling
Window Functions and Advanced SQL Features (SQL 2022/2025)
5.1 Introduction to Window Functions
5.2 Common Window Functions: ROW_NUMBER, RANK, DENSE_RANK
5.3 Advanced SQL Features in 2022/2025
5.4 Real-World Example: Sales Ranking and Moving Averages
5.5 Best Practices and Exception Handling
Connecting Python with SQL: sqlite3, SQLAlchemy, PyODBC
6.1 Why Connect Python with SQL?
6.2 Using sqlite3 for Lightweight Databases
6.3 SQLAlchemy: ORM for Flexibility
6.4 PyODBC for Enterprise Databases
6.5 Real-World Example: Automating Financial Reports
6.6 Best Practices and Exception Handling
6.7 Pros, Cons, and Alternatives
Conclusion
7.1 Recap of Key SQL Skills
7.2 Next Steps in Your Data Analysis Journey
1. Introduction to Databases and SQL
1.1 What is a Database?
A database is an organized collection of data, typically stored and accessed electronically from a computer system. Databases allow businesses, analysts, and developers to store, manage, and retrieve data efficiently. Relational databases, which use tables to store data, are the most common for data analysis due to their structured nature and compatibility with SQL (Structured Query Language).
Key Characteristics of Databases:
Structured Data: Organized into tables with rows and columns.
Scalability: Handles large datasets, from small applications to enterprise systems.
Querying: Allows users to extract specific data using SQL.
Example: A retail company uses a database to store customer information, product inventory, and sales transactions, enabling quick analysis of purchasing trends.
1.2 Why SQL for Data Analysis?
SQL is the standard language for querying and manipulating data in relational databases. It’s widely used by data analysts, data scientists, and engineers due to its:
Simplicity: Easy-to-learn syntax for querying data.
Versatility: Works with various database systems like MySQL, PostgreSQL, SQL Server, and SQLite.
Efficiency: Optimized for handling large datasets and complex queries.
SQL is essential for tasks like filtering data, aggregating metrics, and joining tables to uncover insights, making it a cornerstone of data analysis.
1.3 Types of Databases: Relational vs. Non-Relational
Relational Databases:
Store data in tables with predefined schemas.
Use SQL for querying (e.g., MySQL, PostgreSQL, SQL Server).
Ideal for structured data with clear relationships (e.g., customer orders).
Non-Relational Databases:
Store unstructured or semi-structured data (e.g., MongoDB, Cassandra).
Use NoSQL for flexible querying.
Suitable for big data, real-time analytics, or unstructured data like social media posts.
When to Use SQL:
Analyzing structured data (e.g., sales records, employee data).
Performing complex joins and aggregations.
Working with enterprise systems requiring robust querying.
1.4 Real-World Example: Managing a Retail Inventory Database
Scenario: A retail chain, "ShopEasy," manages its inventory using a relational database. The database contains tables for products, suppliers, and sales. Analysts use SQL to:
Track stock levels.
Identify top-selling products.
Optimize supplier orders based on sales trends.
Sample Database Schema:
-- Products table
CREATE TABLE products (
product_id INT PRIMARY KEY,
product_name VARCHAR(100),
category VARCHAR(50),
price DECIMAL(10, 2),
stock_quantity INT
);
-- Suppliers table
CREATE TABLE suppliers (
supplier_id INT PRIMARY KEY,
supplier_name VARCHAR(100),
contact_info VARCHAR(100)
);
-- Sales table
CREATE TABLE sales (
sale_id INT PRIMARY KEY,
product_id INT,
sale_date DATE,
quantity_sold INT,
total_amount DECIMAL(10, 2),
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
Explanation:
The products table stores product details like name, category, price, and stock.
The suppliers table tracks supplier information.
The sales table records sales transactions, linked to products via product_id.
This schema allows analysts to query inventory levels, sales performance, and supplier relationships, demonstrating SQL’s power in real-world applications.
2. Basic SQL Queries: SELECT, WHERE, ORDER BY, GROUP BY
2.1 Understanding SELECT Statements
The SELECT statement retrieves data from one or more tables. It’s the foundation of SQL querying, allowing analysts to specify columns and conditions.
Syntax:
SELECT column1, column2
FROM table_name
WHERE condition;
Example:
SELECT product_name, price
FROM products
WHERE category = 'Electronics';
Explanation:
Retrieves product_name and price for products in the "Electronics" category.
The FROM clause specifies the table, and WHERE filters the results.
2.2 Filtering with WHERE
The WHERE clause filters rows based on conditions, such as equality, ranges, or logical operators (AND, OR, NOT).
Example:
SELECT product_name, stock_quantity
FROM products
WHERE stock_quantity < 10
AND category = 'Clothing';
Explanation:
Filters products with low stock (stock_quantity < 10) in the "Clothing" category.
Combines conditions using AND for precise filtering.
2.3 Sorting with ORDER BY
The ORDER BY clause sorts query results by one or more columns, in ascending (ASC) or descending (DESC) order.
Example:
SELECT product_name, price
FROM products
ORDER BY price DESC;
Explanation:
Sorts products by price in descending order (highest to lowest).
Useful for identifying top-priced items.
2.4 Aggregating with GROUP BY
The GROUP BY clause groups rows with similar values and applies aggregate functions like COUNT, SUM, AVG, MAX, or MIN.
Example:
SELECT category, COUNT(*) as product_count
FROM products
GROUP BY category;
Explanation:
Groups products by category and counts the number of products in each category.
The alias product_count makes the output user-friendly.
2.5 Real-World Example: Analyzing Sales Data
Scenario: ShopEasy wants to analyze sales performance to identify top-selling categories and low-stock products for restocking.
Code Example:
-- Top-selling categories by total sales amount
SELECT p.category, SUM(s.total_amount) as total_sales
FROM products p
JOIN sales s ON p.product_id = s.product_id
WHERE s.sale_date >= '2025-01-01'
GROUP BY p.category
ORDER BY total_sales DESC;
-- Low-stock products
SELECT product_name, stock_quantity
FROM products
WHERE stock_quantity < 10
ORDER BY stock_quantity ASC;
Explanation:
The first query joins products and sales tables, filters sales from 2025, groups by category, and sorts by total sales.
The second query identifies products with stock below 10, sorted by stock quantity.
Real-World Application: Helps ShopEasy prioritize restocking and focus marketing on high-performing categories.
2.6 Best Practices and Exception Handling
Best Practices:
Use Aliases: Improve readability (e.g., COUNT(*) as product_count).
Explicit Column Names: Avoid SELECT * for clarity and performance.
Filter Early: Apply WHERE before GROUP BY to reduce processed rows.
Indexing: Create indexes on frequently queried columns (e.g., product_id) for faster queries.
Exception Handling:
NULL Values: Handle NULLs with COALESCE or IS NULL.
SELECT product_name, COALESCE(stock_quantity, 0) as stock FROM products WHERE stock_quantity IS NULL;
Error Checking: Verify table/column existence before querying.
Query Optimization: Avoid overly complex queries that strain database performance.
3. Joins: INNER, LEFT, RIGHT, FULL
3.1 What are Joins?
Joins combine data from multiple tables based on a related column (e.g., a foreign key). They’re essential for analyzing relationships, like linking customers to their orders.
Types of Joins:
INNER JOIN: Returns only matching rows from both tables.
LEFT JOIN: Returns all rows from the left table, with NULLs for non-matching rows in the right table.
RIGHT JOIN: Returns all rows from the right table, with NULLs for non-matching rows in the left table.
FULL JOIN: Returns all rows from both tables, with NULLs for non-matches.
3.2 INNER JOIN: Combining Matching Data
Syntax:
SELECT columns
FROM table1
INNER JOIN table2 ON table1.column = table2.column;
Example:
SELECT p.product_name, s.sale_date, s.total_amount
FROM products p
INNER JOIN sales s ON p.product_id = s.product_id;
Explanation:
Retrieves product names and sales details for products that have been sold.
Only includes rows where product_id matches in both tables.
3.3 LEFT, RIGHT, and FULL OUTER Joins
LEFT JOIN Example:
SELECT p.product_name, s.sale_date
FROM products p
LEFT JOIN sales s ON p.product_id = s.product_id;
Explanation:
Returns all products, including those without sales (where sale_date is NULL).
Useful for identifying unsold products.
RIGHT JOIN Example:
SELECT p.product_name, s.sale_date
FROM products p
RIGHT JOIN sales s ON p.product_id = s.product_id;
Explanation:
Returns all sales, even if the product no longer exists in the products table.
Rarely used but helpful for auditing sales data.
FULL JOIN Example:
SELECT p.product_name, s.sale_date
FROM products p
FULL JOIN sales s ON p.product_id = s.product_id;
Explanation:
Returns all products and sales, with NULLs for non-matches in either table.
Useful for comprehensive data reconciliation.
3.4 Real-World Example: Customer and Order Analysis
Scenario: ShopEasy wants to analyze customer purchasing behavior by linking customer data with their orders.
Sample Schema:
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(100),
email VARCHAR(100)
);
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
total_amount DECIMAL(10, 2),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
Code Example:
-- Customers with their orders (including customers without orders)
SELECT c.customer_name, o.order_date, o.total_amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
-- Orders without matching customers (data integrity check)
SELECT c.customer_name, o.order_date
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id IS NULL;
Explanation:
The LEFT JOIN query lists all customers, showing NULL for those without orders, helping identify inactive customers.
The RIGHT JOIN query checks for orders without matching customers, flagging potential data issues.
Real-World Application: Helps ShopEasy target inactive customers for marketing and clean up orphaned order records.
3.5 Best Practices and Exception Handling
Best Practices:
Specify Join Conditions Clearly: Use explicit ON clauses to avoid ambiguous joins.
Use Aliases: Shorten table names (e.g., products p) for readability.
Optimize Joins: Ensure indexes on join columns (e.g., product_id, customer_id).
Choose the Right Join: Use INNER JOIN for strict matches, LEFT JOIN for inclusive queries.
Exception Handling:
Handle NULLs: Account for NULL values in outer joins.
SELECT c.customer_name, COALESCE(o.total_amount, 0) as total FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id;
Check Data Integrity: Validate foreign key relationships before joining.
Avoid Cartesian Products: Ensure ON conditions prevent unintended row multiplication.
4. Aggregations and Subqueries
4.1 Aggregate Functions: COUNT, SUM, AVG, MAX, MIN
Aggregate functions summarize data across rows, often used with GROUP BY.
Common Functions:
COUNT(*): Counts rows.
SUM(column): Sums values.
AVG(column): Calculates average.
MAX(column): Finds maximum value.
MIN(column): Finds minimum value.
Example:
SELECT category, AVG(price) as avg_price
FROM products
GROUP BY category;
Explanation:
Calculates the average price per product category.
Groups results by category for meaningful insights.
4.2 Writing Subqueries
Subqueries are nested queries that return results for use in the outer query.
Example:
SELECT product_name, price
FROM products
WHERE price > (SELECT AVG(price) FROM products);
Explanation:
The subquery calculates the average price across all products.
The outer query selects products with above-average prices.
4.3 Real-World Example: Employee Performance Metrics
Scenario: A company tracks employee sales performance to award bonuses.
Sample Schema:
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(100),
department VARCHAR(50)
);
CREATE TABLE sales_records (
sale_id INT PRIMARY KEY,
employee_id INT,
sale_amount DECIMAL(10, 2),
sale_date DATE,
FOREIGN KEY (employee_id) REFERENCES employees(employee_id)
);
Code Example:
-- Top-performing employees by total sales
SELECT e.employee_name, SUM(sr.sale_amount) as total_sales
FROM employees e
JOIN sales_records sr ON e.employee_id = sr.employee_id
WHERE sr.sale_date >= '2025-01-01'
GROUP BY e.employee_name
HAVING SUM(sr.sale_amount) > (SELECT AVG(SUM(sale_amount))
FROM sales_records
GROUP BY employee_id)
ORDER BY total_sales DESC;
Explanation:
Joins employees and sales_records to calculate total sales per employee.
Uses a subquery to find the average total sales across employees.
Filters for employees with above-average sales using HAVING.
Real-World Application: Identifies employees eligible for bonuses based on performance.
4.4 Best Practices and Exception Handling
Best Practices:
Use HAVING for Aggregates: Filter grouped results with HAVING instead of WHERE.
Optimize Subqueries: Replace subqueries with joins where possible for performance.
Clear Aliasing: Use meaningful aliases for aggregate columns (e.g., total_sales).
Exception Handling:
Handle NULLs in Aggregates: Use COALESCE to manage NULL values.
SELECT category, SUM(COALESCE(price, 0)) as total_price FROM products GROUP BY category;
Avoid Division by Zero: Check denominators in AVG or custom calculations.
Subquery Errors: Ensure subqueries return a single value when used in comparisons.
5. Window Functions and Advanced SQL Features (SQL 2022/2025)
5.1 Introduction to Window Functions
Window functions perform calculations across a set of rows (a "window") without collapsing the result set, unlike GROUP BY.
Syntax:
function() OVER (PARTITION BY column ORDER BY column)
Common Window Functions:
ROW_NUMBER(): Assigns a unique number to each row.
RANK(): Assigns a rank, with ties receiving the same rank.
DENSE_RANK(): Similar to RANK, but no gaps in ranking numbers.
SUM() OVER(): Calculates a running total.
5.2 Common Window Functions: ROW_NUMBER, RANK, DENSE_RANK
Example:
SELECT product_name, price,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) as price_rank
FROM products;
Explanation:
Assigns a rank to products within each category based on price.
PARTITION BY category groups rows by category; ORDER BY price DESC ranks by price.
5.3 Advanced SQL Features in 2022/2025
Modern SQL standards (SQL 2022/2025) introduce features like:
JSON Support: Query and manipulate JSON data within SQL.
SELECT JSON_VALUE(product_details, '$.color') as color FROM products WHERE JSON_VALUE(product_details, '$.color') = 'Blue';
APPROX_COUNT_DISTINCT: Estimates unique values for large datasets.
SELECT APPROX_COUNT_DISTINCT(customer_id) as unique_customers FROM orders;
Enhanced Window Functions: Improved syntax for complex analytics.
Note: Availability depends on the database system (e.g., SQL Server 2022, PostgreSQL).
5.4 Real-World Example: Sales Ranking and Moving Averages
Scenario: ShopEasy wants to rank products by sales within each category and calculate a 30-day moving average of sales.
Code Example:
SELECT
p.product_name,
p.category,
SUM(s.total_amount) as total_sales,
RANK() OVER (PARTITION BY p.category ORDER BY SUM(s.total_amount) DESC) as sales_rank,
AVG(s.total_amount) OVER (
PARTITION BY p.product_id
ORDER BY s.sale_date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as moving_avg
FROM products p
JOIN sales s ON p.product_id = s.product_id
WHERE s.sale_date >= '2025-01-01'
GROUP BY p.product_id, p.product_name, p.category, s.sale_date;
Explanation:
Ranks products by total sales within each category using RANK().
Calculates a 30-day moving average of sales per product using AVG() OVER.
Real-World Application: Helps ShopEasy identify top performers and monitor sales trends for inventory planning.
5.5 Best Practices and Exception Handling
Best Practices:
Define Window Scope Clearly: Specify PARTITION BY and ORDER BY for accurate results.
Optimize Performance: Limit window function scope to necessary rows.
Use Descriptive Aliases: Name window function columns clearly (e.g., sales_rank).
Exception Handling:
Handle Large Windows: Avoid excessive row processing in window functions.
Check Database Support: Ensure window functions are supported (e.g., not all features in SQLite).
NULL Handling:
SELECT product_name, COALESCE(SUM(total_amount) OVER (PARTITION BY category), 0) as total_sales FROM products p JOIN sales s ON p.product_id = s.product_id;
6. Connecting Python with SQL: sqlite3, SQLAlchemy, PyODBC
6.1 Why Connect Python with SQL?
Connecting Python with SQL enables automated data analysis, combining SQL’s querying power with Python’s data manipulation and visualization capabilities. Use cases include:
Automating reports.
Building data pipelines.
Integrating with visualization tools like Matplotlib or Seaborn.
6.2 Using sqlite3 for Lightweight Databases
The sqlite3 module is Python’s built-in library for working with SQLite databases, ideal for small-scale applications.
Example:
import sqlite3
# Connect to database
try:
conn = sqlite3.connect('shopeasy.db')
cursor = conn.cursor()
# Query data
cursor.execute("SELECT product_name, price FROM products WHERE category = 'Electronics'")
results = cursor.fetchall()
for row in results:
print(f"Product: {row[0]}, Price: {row[1]}")
except sqlite3.Error as e:
print(f"Database error: {e}")
finally:
conn.close()
Explanation:
Connects to an SQLite database (shopeasy.db).
Executes a query to retrieve electronics products.
Handles errors using try-except.
Closes the connection to prevent resource leaks.
6.3 SQLAlchemy: ORM for Flexibility
SQLAlchemy is a Python ORM (Object-Relational Mapping) library that simplifies database interactions and supports multiple databases (MySQL, PostgreSQL, etc.).
Example:
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError
# Create engine
engine = create_engine('sqlite:///shopeasy.db')
# Execute query
try:
with engine.connect() as conn:
result = conn.execute(text("SELECT category, COUNT(*) as product_count FROM products GROUP BY category"))
for row in result:
print(f"Category: {row[0]}, Count: {row[1]}")
except SQLAlchemyError as e:
print(f"SQLAlchemy error: {e}")
Explanation:
Uses create_engine to connect to an SQLite database.
Executes a grouped query using text for raw SQL.
Handles errors with SQLAlchemyError.
6.4 PyODBC for Enterprise Databases
PyODBC connects Python to enterprise databases like SQL Server or Oracle, commonly used in corporate environments.
Example:
import pyodbc
from pyodbc import Error
# Connection string
conn_str = (
"DRIVER={SQL Server};"
"SERVER=server_name;"
"DATABASE=shopeasy;"
"Trusted_Connection=yes;"
)
try:
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()
cursor.execute("SELECT product_name, stock_quantity FROM products WHERE stock_quantity < 10")
for row in cursor.fetchall():
print(f"Product: {row[0]}, Stock: {row[1]}")
except Error as e:
print(f"PyODBC error: {e}")
finally:
conn.close()
Explanation:
Connects to a SQL Server database using a connection string.
Queries low-stock products.
Handles errors and ensures the connection is closed.
6.5 Real-World Example: Automating Financial Reports
Scenario: ShopEasy needs an automated monthly sales report generated from its database and visualized using Python.
Code Example:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy.exc import SQLAlchemyError
# Connect to database
try:
conn = sqlite3.connect('shopeasy.db')
# Query sales data
query = """
SELECT p.category, SUM(s.total_amount) as total_sales
FROM products p
JOIN sales s ON p.product_id = s.product_id
WHERE s.sale_date LIKE '2025-08%'
GROUP BY p.category
"""
# Load data into DataFrame
df = pd.read_sql_query(query, conn)
# Visualize results
plt.bar(df['category'], df['total_sales'])
plt.title('August 2025 Sales by Category')
plt.xlabel('Category')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('sales_report.png')
plt.show()
except sqlite3.Error as e:
print(f"Database error: {e}")
finally:
conn.close()
Explanation:
Queries total sales by category for August 2025.
Uses pandas to load data into a DataFrame.
Creates a bar chart with Matplotlib for visualization.
Real-World Application: Automates report generation for ShopEasy’s management team.
6.6 Best Practices and Exception Handling
Best Practices:
Close Connections: Always close database connections to avoid leaks.
Use Context Managers: Employ with statements for automatic resource management.
Parameterize Queries: Prevent SQL injection with parameterized queries.
cursor.execute("SELECT * FROM products WHERE category = ?", ('Electronics',))
Choose the Right Tool: Use sqlite3 for lightweight tasks, SQLAlchemy for flexibility, and PyODBC for enterprise systems.
Exception Handling:
Catch Specific Errors: Handle sqlite3.Error, SQLAlchemyError, or pyodbc.Error.
Log Errors: Use logging to track issues in production.
import logging logging.basicConfig(level=logging.ERROR) try: conn = sqlite3.connect('shopeasy.db') # Query code except sqlite3.Error as e: logging.error(f"Database error: {e}")
Validate Inputs: Check query parameters to avoid runtime errors.
6.7 Pros, Cons, and Alternatives
sqlite3:
Pros: Lightweight, no server setup, built into Python.
Cons: Limited scalability, not suitable for concurrent access.
Alternatives: MySQL Connector, psycopg2 for PostgreSQL.
SQLAlchemy:
Pros: ORM support, database-agnostic, robust for complex applications.
Cons: Steeper learning curve, overhead for simple tasks.
Alternatives: Django ORM, Peewee.
PyODBC:
Pros: Works with enterprise databases, reliable for Windows environments.
Cons: Requires ODBC drivers, complex setup.
Alternatives: pymssql, cx_Oracle.
7. Conclusion
7.1 Recap of Key SQL Skills
Module 3 covers essential SQL skills for data analysis:
Basic Queries: Master SELECT, WHERE, ORDER BY, and GROUP BY for data retrieval and summarization.
Joins: Use INNER, LEFT, RIGHT, and FULL joins to combine data.
Aggregations and Subqueries: Summarize data and nest queries for complex analysis.
Window Functions: Leverage advanced analytics with ROW_NUMBER, RANK, and moving averages.
Python Integration: Connect SQL with Python using sqlite3, SQLAlchemy, or PyODBC for automation and visualization.
No comments:
Post a Comment
Thanks for your valuable comment...........
Md. Mominul Islam