Md Mominul Islam | Software and Data Engineering | SQL Server, .NET, Power BI, Azure Blog

while(!(succeed=try()));


Friday, August 22, 2025

Master Data Analysis: Module 9 – Combining Python & SQL for Powerful Data Workflows

Table of Contents

  1. Introduction to Combining Python & SQL

    • Why Combine Python and SQL?

    • Real-World Applications

    • Tools and Setup

  2. Extracting Data from SQL to Python DataFrames

    • Connecting to Databases

    • Fetching Data with SQL Queries

    • Converting to Pandas DataFrames

    • Example: Retail Sales Analysis

    • Best Practices and Exception Handling

  3. Performing Transformations Using Pandas

    • Data Cleaning and Preprocessing

    • Aggregations and GroupBy Operations

    • Example: Customer Segmentation

    • Pros, Cons, and Alternatives

  4. Joining SQL Data with External Datasets

    • Merging SQL Data with CSV/Excel

    • Handling Data Discrepancies

    • Example: Market Basket Analysis

    • Best Practices and Exception Handling

  5. Automating SQL Queries Using Python

    • Creating Reusable Query Functions

    • Scheduling Automated Workflows

    • Example: Daily Sales Reporting

    • Pros, Cons, and Alternatives

  6. Building End-to-End Analysis Workflows

    • Designing a Complete Pipeline

    • Example: E-Commerce Analytics Dashboard

    • Best Practices and Exception Handling

  7. Conclusion

    • Key Takeaways

    • Next Steps in Your Data Analysis Journey


1. Introduction to Combining Python & SQL

Why Combine Python and SQL?

Python and SQL are a powerhouse duo in data analysis. SQL excels at querying large datasets stored in relational databases, while Python, with libraries like Pandas, offers unmatched flexibility for data manipulation, visualization, and automation. Combining them allows you to leverage SQL’s efficiency for data retrieval and Python’s versatility for complex transformations and automation.

Real-World Applications

  • Retail: Analyze sales data from a SQL database, enrich it with external demographic data, and automate daily reports.

  • Finance: Extract transaction data, perform risk analysis with Pandas, and schedule automated compliance checks.

  • Healthcare: Combine patient records from SQL with external research data for predictive analytics.

  • E-Commerce: Build recommendation systems by joining SQL order data with user behavior logs.

Tools and Setup

  • Python Libraries: pandas, sqlalchemy, psycopg2 (for PostgreSQL), pyodbc (for SQL Server).

  • Databases: PostgreSQL, MySQL, SQLite, or SQL Server.

  • Setup: Install Python (3.8+), the required libraries, and a database system. Install the libraries with pip:

pip install pandas sqlalchemy psycopg2-binary

2. Extracting Data from SQL to Python DataFrames

Connecting to Databases

Use sqlalchemy to create a connection to your database. It supports multiple database backends through a single interface and manages connection pooling for you.

from sqlalchemy import create_engine
import pandas as pd

# Database connection
try:
    engine = create_engine('postgresql://username:password@localhost:5432/retail_db')
except Exception as e:
    print(f"Connection failed: {e}")

Fetching Data with SQL Queries

Write SQL queries to extract specific data. Use pandas.read_sql() to execute queries and load results into a DataFrame.

# Fetch sales data
query = """
SELECT order_id, customer_id, order_date, total_amount
FROM orders
WHERE order_date >= '2024-01-01';
"""
try:
    df = pd.read_sql(query, engine)
    print(df.head())
except Exception as e:
    print(f"Query failed: {e}")

Converting to Pandas DataFrames

The read_sql() function automatically converts SQL query results into a Pandas DataFrame, ready for further analysis.
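read_sql() also accepts a parse_dates argument to convert date columns during the read, and a chunksize argument that returns an iterator of smaller DataFrames for large result sets. A brief sketch, reusing the query and engine from above:

# Parse order_date while reading and confirm the resulting dtypes
df = pd.read_sql(query, engine, parse_dates=['order_date'])
print(df.dtypes)

# For very large results, stream the data in chunks instead of one big DataFrame
for chunk in pd.read_sql(query, engine, chunksize=50000):
    print(len(chunk))  # process each chunk here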

Example: Retail Sales Analysis

Scenario: A retail chain wants to analyze sales performance for 2024. The data resides in a PostgreSQL database.

# Retail sales analysis
from sqlalchemy import create_engine
import pandas as pd

# Connect to database
try:
    engine = create_engine('postgresql://user:pass@localhost:5432/retail_db')
except Exception as e:
    print(f"Connection error: {e}")
    exit()

# Fetch data
query = """
SELECT o.order_id, o.customer_id, o.order_date, o.total_amount, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31';
"""
try:
    sales_df = pd.read_sql(query, engine)
    print(sales_df.head())
except Exception as e:
    print(f"Query error: {e}")

Output:

   order_id  customer_id  order_date  total_amount  customer_name
0         1         101  2024-01-02        150.50       John Doe
1         2         102  2024-01-03        200.75      Jane Smith
...

Best Practices and Exception Handling

  • Best Practices:

    • Use parameterized queries to prevent SQL injection (see the sketch after this list).

    • Close connections using engine.dispose() to free resources.

    • Validate data types before processing.

  • Exception Handling:

    • Catch connection errors (OperationalError).

    • Handle query syntax errors (ProgrammingError).

  • Alternatives:

    • Use psycopg2 directly for low-level control.

    • Use pyodbc for SQL Server databases.
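The sketch below ties these points together: a parameterized query built with sqlalchemy.text(), the specific SQLAlchemy exception classes, and engine.dispose() in a finally block. The connection string and date value are placeholders; adjust them for your environment.

from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError, ProgrammingError
import pandas as pd

engine = create_engine('postgresql://username:password@localhost:5432/retail_db')

# Bound parameter keeps the date value out of the SQL string
query = text("""
SELECT order_id, customer_id, order_date, total_amount
FROM orders
WHERE order_date >= :start_date;
""")

try:
    df = pd.read_sql(query, engine, params={'start_date': '2024-01-01'})
except OperationalError as e:
    print(f"Connection problem: {e}")
except ProgrammingError as e:
    print(f"SQL syntax or object error: {e}")
finally:
    engine.dispose()  # return pooled connections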


3. Performing Transformations Using Pandas

Data Cleaning and Preprocessing

Pandas excels at cleaning and transforming data. Common tasks include handling missing values, converting data types, and filtering.

# Clean sales data
sales_df['order_date'] = pd.to_datetime(sales_df['order_date'])
sales_df['total_amount'] = sales_df['total_amount'].fillna(sales_df['total_amount'].mean())
sales_df = sales_df[sales_df['total_amount'] > 0]

Aggregations and GroupBy Operations

Aggregate data to summarize insights, such as total sales by customer or month.

# Aggregate sales by customer
customer_sales = sales_df.groupby('customer_name')['total_amount'].sum().reset_index()
print(customer_sales.head())
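The same pattern gives the by-month summary mentioned above; a short sketch, assuming order_date has already been converted to datetime:

# Aggregate sales by month
monthly_sales = sales_df.groupby(sales_df['order_date'].dt.to_period('M'))['total_amount'].sum().reset_index()
print(monthly_sales.head())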

Example: Customer Segmentation

Scenario: Segment customers based on their purchase frequency and total spend.

# Customer segmentation
from sqlalchemy import create_engine
import pandas as pd

# Fetch data
engine = create_engine('postgresql://user:pass@localhost:5432/retail_db')
query = """
SELECT o.order_id, o.customer_id, c.customer_name, o.order_date, o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
"""
try:
    df = pd.read_sql(query, engine)
except Exception as e:
    print(f"Query error: {e}")
    exit()

# Transformations
df['order_date'] = pd.to_datetime(df['order_date'])
df['total_amount'] = df['total_amount'].fillna(df['total_amount'].mean())

# Calculate purchase frequency and total spend
customer_summary = df.groupby('customer_name').agg({
    'order_id': 'count',
    'total_amount': 'sum'
}).rename(columns={'order_id': 'purchase_count'}).reset_index()

# Segment customers
def segment_customer(row):
    if row['purchase_count'] > 10 and row['total_amount'] > 1000:
        return 'High-Value'
    elif row['purchase_count'] > 5:
        return 'Regular'
    else:
        return 'Occasional'

customer_summary['segment'] = customer_summary.apply(segment_customer, axis=1)
print(customer_summary.head())

Output:

  customer_name  purchase_count  total_amount     segment
0      John Doe              12        1500.75  High-Value
1     Jane Smith               7         800.50     Regular
...

Pros, Cons, and Alternatives

  • Pros:

    • Pandas is intuitive and flexible for complex transformations.

    • Integrates seamlessly with visualization libraries like Matplotlib.

  • Cons:

    • Memory-intensive for large datasets.

    • Slower than SQL for simple aggregations.

  • Alternatives:

    • Use SQL for aggregations when possible to reduce memory usage (see the sketch after this list).

    • Consider Dask or Polars for large-scale data processing.
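As an illustration of the first alternative, the per-customer totals from the segmentation example can be computed by the database so that only the summarized rows travel to Pandas (a sketch, reusing the engine and the orders/customers tables from earlier):

# Push the aggregation into PostgreSQL; only the summary crosses the wire
summary_query = """
SELECT c.customer_name,
       COUNT(o.order_id)   AS purchase_count,
       SUM(o.total_amount) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
"""
customer_summary = pd.read_sql(summary_query, engine)
print(customer_summary.head())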


4. Joining SQL Data with External Datasets

Merging SQL Data with CSV/Excel

Combine SQL data with external datasets (e.g., CSV files) using Pandas’ merge function.

# Merge SQL data with CSV
external_data = pd.read_csv('demographics.csv')
merged_df = pd.merge(sales_df, external_data, on='customer_id', how='left')

Handling Data Discrepancies

  • Check for missing keys or mismatched data types (a validation sketch follows this list).

  • Use how='left' or how='inner' to control merge behavior.
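One quick way to check both points is to align the key dtypes before merging and use the merge indicator to see which rows found no match (a sketch, assuming demographics.csv shares the customer_id key with sales_df):

external_data = pd.read_csv('demographics.csv')

# Align key dtypes so the join keys compare equal
external_data['customer_id'] = external_data['customer_id'].astype(sales_df['customer_id'].dtype)

# indicator=True adds a _merge column showing where each row came from
checked = pd.merge(sales_df, external_data, on='customer_id', how='left', indicator=True)
print(checked['_merge'].value_counts())  # 'left_only' rows had no demographic match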

Example: Market Basket Analysis

Scenario: Combine SQL order data with a CSV file containing product categories to analyze purchase patterns.

# Market basket analysis
from sqlalchemy import create_engine
import pandas as pd

# Fetch SQL data
engine = create_engine('postgresql://user:pass@localhost:5432/retail_db')
query = """
SELECT order_id, customer_id, product_id, quantity
FROM order_details;
"""
try:
    order_df = pd.read_sql(query, engine)
except Exception as e:
    print(f"Query error: {e}")
    exit()

# Load external data
product_data = pd.read_csv('product_categories.csv')

# Merge datasets
merged_df = pd.merge(order_df, product_data, on='product_id', how='left')

# Analyze purchase patterns
basket_analysis = merged_df.groupby(['order_id', 'category'])['quantity'].sum().unstack().fillna(0)
print(basket_analysis.head())

Output:

category     Electronics  Clothing  Home
order_id
1                  2.0       0.0   1.0
2                  0.0       3.0   0.0
...

Best Practices and Exception Handling

  • Best Practices:

    • Validate key columns before merging.

    • Use how='left' to preserve SQL data.

  • Exception Handling:

    • Catch KeyError for missing columns.

    • Handle mismatched data types with astype().


5. Automating SQL Queries Using Python

Creating Reusable Query Functions

Encapsulate SQL queries in Python functions for reusability.

from sqlalchemy import text

def fetch_sales_data(engine, start_date, end_date):
    # Bind the dates as parameters instead of formatting them into the SQL string
    query = text("""
    SELECT order_id, customer_id, order_date, total_amount
    FROM orders
    WHERE order_date BETWEEN :start_date AND :end_date;
    """)
    try:
        return pd.read_sql(query, engine, params={'start_date': start_date, 'end_date': end_date})
    except Exception as e:
        print(f"Query error: {e}")
        return None

Scheduling Automated Workflows

Use schedule or APScheduler to run queries periodically.

from apscheduler.schedulers.background import BackgroundScheduler
import datetime

# Schedule daily report
scheduler = BackgroundScheduler()
scheduler.add_job(
    lambda: fetch_sales_data(engine, '2024-01-01', datetime.date.today()),
    'interval',
    days=1
)
scheduler.start()
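The schedule library mentioned above is a simpler, blocking alternative when a background scheduler is more than you need. A minimal sketch (assumes pip install schedule, and reuses engine, datetime, and fetch_sales_data from this section):

import time
import schedule

# Run the sales query every day at 07:00; the loop below blocks the process
schedule.every().day.at("07:00").do(
    lambda: fetch_sales_data(engine, '2024-01-01', datetime.date.today())
)

while True:
    schedule.run_pending()
    time.sleep(60)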

Example: Daily Sales Reporting

Scenario: Automate a daily sales report emailed to stakeholders.

from sqlalchemy import create_engine
import pandas as pd
import smtplib
from email.mime.text import MIMEText
from apscheduler.schedulers.background import BackgroundScheduler
import datetime

# Database connection
engine = create_engine('postgresql://user:pass@localhost:5432/retail_db')

# Fetch daily sales
def fetch_daily_sales():
    query = f"""
    SELECT order_date, SUM(total_amount) as total_sales
    FROM orders
    WHERE order_date = '{datetime.date.today()}'
    GROUP BY order_date;
    """
    try:
        df = pd.read_sql(query, engine)
        return df
    except Exception as e:
        print(f"Query error: {e}")
        return None

# Send email
def send_report():
    df = fetch_daily_sales()
    if df is not None and not df.empty:
        msg = MIMEText(df.to_string())
        msg['Subject'] = 'Daily Sales Report'
        msg['From'] = 'your_email@example.com'
        msg['To'] = 'stakeholder@example.com'
        
        try:
            with smtplib.SMTP('smtp.example.com', 587) as server:
                server.starttls()
                server.login('your_email@example.com', 'password')
                server.send_message(msg)
        except Exception as e:
            print(f"Email error: {e}")

# Schedule report
scheduler = BackgroundScheduler()
scheduler.add_job(send_report, 'interval', days=1)
scheduler.start()

Pros, Cons, and Alternatives

  • Pros:

    • Saves time by automating repetitive tasks.

    • Reduces human error in report generation.

  • Cons:

    • Requires robust error handling for reliability.

    • Email server setup can be complex.

  • Alternatives:

    • Use cron jobs for scheduling on Linux.

    • Use cloud services like AWS Lambda for serverless automation.


6. Building End-to-End Analysis Workflows

Designing a Complete Pipeline

An end-to-end workflow involves:

  1. Extracting data from SQL.

  2. Transforming data with Pandas.

  3. Joining with external data.

  4. Automating the process.

  5. Visualizing or exporting results.

Example: E-Commerce Analytics Dashboard

Scenario: Build a dashboard to visualize sales trends, customer segments, and product performance.

from sqlalchemy import create_engine
import pandas as pd
import plotly.express as px
from apscheduler.schedulers.background import BackgroundScheduler
import datetime

# Database connection
engine = create_engine('postgresql://user:pass@localhost:5432/retail_db')

# Fetch and process data
def build_dashboard_data():
    # Fetch sales data
    sales_query = """
    SELECT o.order_id, o.customer_id, o.order_date, o.total_amount, c.customer_name
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    WHERE o.order_date >= '2024-01-01';
    """
    try:
        sales_df = pd.read_sql(sales_query, engine)
    except Exception as e:
        print(f"Query error: {e}")
        return None

    # Fetch product data
    product_query = """
    SELECT order_id, product_id, quantity
    FROM order_details;
    """
    try:
        product_df = pd.read_sql(product_query, engine)
    except Exception as e:
        print(f"Query error: {e}")
        return None

    # Merge with external data
    external_data = pd.read_csv('product_categories.csv')
    merged_df = pd.merge(product_df, external_data, on='product_id', how='left')

    # Transform data
    sales_df['order_date'] = pd.to_datetime(sales_df['order_date'])
    sales_trend = sales_df.groupby(sales_df['order_date'].dt.to_period('M'))['total_amount'].sum().reset_index()
    sales_trend['order_date'] = sales_trend['order_date'].dt.to_timestamp()  # Period values do not plot directly

    # Create visualizations
    fig = px.line(sales_trend, x='order_date', y='total_amount', title='Monthly Sales Trend')
    fig.write_html('sales_trend.html')

    # Save results
    sales_trend.to_csv('sales_trend.csv')
    return sales_trend

# Schedule dashboard update
scheduler = BackgroundScheduler()
scheduler.add_job(build_dashboard_data, 'interval', days=1)
scheduler.start()

Best Practices and Exception Handling

  • Best Practices:

    • Modularize code into functions for reusability.

    • Log errors to a file for debugging (see the logging sketch after this list).

  • Exception Handling:

    • Handle database connection failures.

    • Validate data before visualization.
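For the logging recommendation, the standard library's logging module pointed at a file is enough; the except blocks inside build_dashboard_data could then call logging.exception instead of print. A minimal sketch:

import logging

# Write timestamped messages to a file the team can inspect later
logging.basicConfig(
    filename='pipeline.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s'
)

try:
    result = build_dashboard_data()
    if result is None:
        logging.warning("Dashboard data build returned no result")
except Exception:
    logging.exception("Dashboard pipeline failed")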


7. Conclusion

Key Takeaways

  • Combining Python and SQL enables powerful, automated data workflows.

  • Use sqlalchemy and Pandas for seamless data extraction and transformation.

  • Automate repetitive tasks to save time and reduce errors.

  • Build robust pipelines with proper exception handling.
