Friday, August 22, 2025

Master Data Analysis: Module 5 - Comprehensive Guide to Data Cleaning & Preprocessing in Python & SQL



Table of Contents

  1. Introduction to Data Cleaning & Preprocessing

    • Importance in Data Analysis

    • Real-World Relevance

  2. Handling Missing and Null Values

    • Understanding Missing Data

    • Techniques for Handling Missing Values

    • Python and SQL Examples

    • Best Practices and Alternatives

  3. Data Type Conversion and Normalization

    • Importance of Data Types

    • Conversion Techniques

    • Normalization Methods

    • Real-Life Examples and Code

  4. Handling Categorical Data

    • One-Hot Encoding

    • Label Encoding

    • Practical Examples in Python and SQL

    • Pros, Cons, and Alternatives

  5. Feature Scaling and Standardization

    • Why Scale Features?

    • Techniques: Min-Max Scaling, Standardization

    • Python and SQL Implementations

    • Real-World Applications

  6. Working with Dates and Timestamps

    • Parsing and Formatting Dates

    • Time Zone Handling

    • Python and SQL Code Examples

    • Best Practices

  7. Data Validation and Integrity Checks

    • Ensuring Data Quality

    • Validation Techniques

    • Python and SQL Implementations

    • Exception Handling

  8. Conclusion

    • Recap of Key Concepts

    • Next Steps in Data Analysis


1. Introduction to Data Cleaning & Preprocessing

Importance in Data Analysis

Data cleaning and preprocessing are critical steps in the data analysis pipeline, ensuring that datasets are accurate, consistent, and ready for modeling. Raw data often contains inconsistencies, missing values, or incorrect formats that can skew results or break algorithms. In Module 5 of our Master Data Analysis series, we dive deep into preparing data using Python and SQL, two powerful tools for modern data analysts.

Real-World Relevance

Imagine you're a data analyst at an e-commerce company analyzing customer purchase data. The dataset might include missing entries for customer addresses, inconsistent date formats, or categorical variables like product categories that need encoding. Without proper cleaning and preprocessing, your analysis could lead to incorrect insights, such as misidentifying top-selling products or failing to segment customers accurately.

This guide provides detailed, real-world examples, such as cleaning sales data for a retail chain or preprocessing patient records for a hospital, to make the concepts relatable and actionable. We’ll use Python’s pandas and scikit-learn libraries alongside SQL queries to demonstrate robust, production-ready solutions.


2. Handling Missing and Null Values

Understanding Missing Data

Missing data can arise due to human error, system failures, or incomplete data collection. For example, in a healthcare dataset, patient blood pressure readings might be missing due to equipment failure. Missing values can be:

  • MCAR (Missing Completely at Random): No pattern to missingness.

  • MAR (Missing at Random): Missingness depends on observed data.

  • MNAR (Missing Not at Random): Missingness depends on unobserved data.

Techniques for Handling Missing Values

  1. Removal: Drop rows or columns with missing values if they’re insignificant.

  2. Imputation: Replace missing values with statistical measures (mean, median, mode) or predictive models.

  3. Flagging: Add a binary column to indicate missingness.

Python Example: Handling Missing Values in a Retail Dataset

Let’s consider a retail dataset with customer purchases, where some entries for price and customer_rating are missing.

import pandas as pd

# Sample retail dataset
data = {
    'product_id': [1, 2, 3, 4, 5],
    'product_name': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'price': [999.99, None, 299.99, 199.99, None],
    'customer_rating': [4.5, 4.0, None, 3.8, 4.2]
}
df = pd.DataFrame(data)

# 1. Drop rows with any missing values
df_dropped = df.dropna()
print("After dropping rows with missing values:\n", df_dropped)

# Record which prices are missing before imputing; afterwards the flag would always be 0
price_was_missing = df['price'].isna()

# 2. Impute missing price with mean (plain assignment; inplace fillna on a single column is deprecated)
df['price'] = df['price'].fillna(df['price'].mean())
print("\nAfter imputing price with mean:\n", df)

# 3. Impute missing rating with median
df['customer_rating'] = df['customer_rating'].fillna(df['customer_rating'].median())
print("\nAfter imputing rating with median:\n", df)

# 4. Add missing flag for price, based on the pre-imputation mask
df['price_missing'] = price_was_missing.astype(int)
print("\nWith missing flag:\n", df)

Output:

After dropping rows with missing values:
    product_id product_name   price  customer_rating
0            1       Laptop  999.99              4.5
3            4      Monitor  199.99              3.8

After imputing price with mean:
    product_id product_name   price  customer_rating
0            1       Laptop  999.99              4.5
1            2        Phone  499.99              4.0
2            3       Tablet  299.99              NaN
3            4      Monitor  199.99              3.8
4            5     Keyboard  499.99              4.2

After imputing rating with median:
    product_id product_name   price  customer_rating
0            1       Laptop  999.99              4.5
1            2        Phone  499.99              4.0
2            3       Tablet  299.99              4.1
3            4      Monitor  199.99              3.8
4            5     Keyboard  499.99              4.2

With missing flag:
    product_id product_name   price  customer_rating  price_missing
0            1       Laptop  999.99              4.5              0
1            2        Phone  499.99              4.0              1
2            3       Tablet  299.99              4.1              0
3            4      Monitor  199.99              3.8              0
4            5     Keyboard  499.99              4.2              1

SQL Example: Handling Missing Values

In a SQL database, you can handle missing values with COALESCE or CASE expressions. The SQL examples in this guide use PostgreSQL-style syntax (for example the ~ regex operator and TO_TIMESTAMP later on); adapt the functions as needed for SQL Server or MySQL.

-- Sample table
CREATE TABLE retail (
    product_id INT,
    product_name VARCHAR(50),
    price DECIMAL(10, 2),
    customer_rating DECIMAL(3, 1)
);

INSERT INTO retail VALUES
(1, 'Laptop', 999.99, 4.5),
(2, 'Phone', NULL, 4.0),
(3, 'Tablet', 299.99, NULL),
(4, 'Monitor', 199.99, 3.8),
(5, 'Keyboard', NULL, 4.2);

-- Flag missing prices first (after imputation, price IS NULL is never true and the flag would always be 0)
ALTER TABLE retail ADD price_missing INT;
UPDATE retail SET price_missing = CASE WHEN price IS NULL THEN 1 ELSE 0 END;

-- Impute missing price with average
UPDATE retail
SET price = (SELECT AVG(price) FROM retail WHERE price IS NOT NULL)
WHERE price IS NULL;

-- Impute missing rating with median (for an even row count this picks the lower of the two middle values)
UPDATE retail
SET customer_rating = (
    SELECT customer_rating
    FROM (
        SELECT customer_rating,
               ROW_NUMBER() OVER (ORDER BY customer_rating) AS rn,
               COUNT(*) OVER () AS cnt
        FROM retail
        WHERE customer_rating IS NOT NULL
    ) t
    WHERE rn = (cnt + 1) / 2
)
WHERE customer_rating IS NULL;

SELECT * FROM retail;

Best Practices

  • Assess Missingness: Use visualization (for example, the missingno library in Python) to understand patterns; a sketch follows this list.

  • Choose Imputation Wisely: Use the mean for roughly symmetric numerical data, the median for skewed data, and the mode for categorical data.

  • Document Changes: Keep track of imputation methods for reproducibility.
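
As a minimal sketch of the missingness visualization mentioned above, assuming the optional missingno package is installed alongside pandas and matplotlib:

import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

# Reuse the small retail columns from the example above
df = pd.DataFrame({
    'price': [999.99, None, 299.99, 199.99, None],
    'customer_rating': [4.5, 4.0, None, 3.8, 4.2]
})

# Matrix plot: white gaps mark missing values, so column- or row-wise patterns stand out
msno.matrix(df)
plt.show()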

Pros, Cons, and Alternatives

  • Pros of Imputation: Preserves data size, maintains statistical power.

  • Cons: Can introduce bias if missingness is MNAR.

  • Alternatives: Advanced methods like K-Nearest Neighbors (KNN) imputation or predictive modeling.
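
As one concrete example of the KNN alternative, scikit-learn ships a KNNImputer that fills each gap from the most similar rows; a minimal sketch on the same retail columns:

import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'price': [999.99, None, 299.99, 199.99, None],
    'customer_rating': [4.5, 4.0, None, 3.8, 4.2]
})

# Fill each missing value from the 2 nearest rows in feature space (numeric columns only)
imputer = KNNImputer(n_neighbors=2)
df[['price', 'customer_rating']] = imputer.fit_transform(df[['price', 'customer_rating']])
print(df)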


3. Data Type Conversion and Normalization

Importance of Data Types

Correct data types ensure algorithms process data accurately. For instance, a price column stored as a string ("$999.99") needs conversion to a float for calculations.

Conversion Techniques

  • Python: Use pandas methods like astype() or to_numeric().

  • SQL: Use CAST or CONVERT.

Normalization Methods

Normalization scales data to a specific range, often [0, 1], to ensure consistency.
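
For reference, min-max normalization maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1.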

Real-Life Example: Normalizing Sales Data

In a sales dataset, revenue might be in dollars, but machine learning models require normalized values.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample sales dataset
data = {
    'order_id': [101, 102, 103, 104],
    'revenue': ['1000.50', '2500.75', 'invalid', '3000.00'],
    'quantity': ['5', '10', '8', '15']
}
df = pd.DataFrame(data)

# Convert revenue to float, handle errors
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
print("After type conversion:\n", df)

# Normalize revenue
scaler = MinMaxScaler()
df['revenue_normalized'] = scaler.fit_transform(df[['revenue']].fillna(df['revenue'].mean()))
print("\nAfter normalization:\n", df)

Output:

After type conversion:
   order_id  revenue quantity
0      101  1000.50       5
1      102  2500.75      10
2      103      NaN       8
3      104  3000.00      15

After normalization:
   order_id  revenue quantity  revenue_normalized
0       101  1000.50        5            0.000000
1       102  2500.75       10            0.750313
2       103      NaN        8            0.583438
3       104  3000.00       15            1.000000

SQL Example

-- Sample sales table
CREATE TABLE sales (
    order_id INT,
    revenue VARCHAR(10),
    quantity VARCHAR(5)
);

INSERT INTO sales VALUES
(101, '1000.50', '5'),
(102, '2500.75', '10'),
(103, 'invalid', '8'),
(104, '3000.00', '15');

-- Convert revenue to DECIMAL, leaving values that are not numeric as NULL
ALTER TABLE sales ADD revenue_numeric DECIMAL(10, 2);
UPDATE sales
SET revenue_numeric = CAST(CASE WHEN revenue ~ '^[0-9]+\.?[0-9]*$' THEN revenue ELSE NULL END AS DECIMAL(10, 2));

-- Normalize revenue into a separate column so the converted values are preserved
ALTER TABLE sales ADD revenue_normalized DECIMAL(10, 6);
UPDATE sales
SET revenue_normalized = (
    (revenue_numeric - (SELECT MIN(revenue_numeric) FROM sales))
    / (SELECT MAX(revenue_numeric) - MIN(revenue_numeric) FROM sales)
)
WHERE revenue_numeric IS NOT NULL;

SELECT * FROM sales;

Best Practices

  • Validate Before Conversion: Check for invalid entries using regex or error handling.

  • Preserve Original Data: Create new columns for converted values.

  • Handle Errors: Use errors='coerce' in pandas or CASE in SQL.

Pros, Cons, and Alternatives

  • Pros: Ensures compatibility with algorithms, improves model performance.

  • Cons: Risk of data loss during coercion.

  • Alternatives: RobustScaler for outlier-heavy data.


4. Handling Categorical Data

One-Hot Encoding

Converts categories into binary columns (e.g., color: [red, blue] becomes is_red, is_blue).

Label Encoding

Assigns integers to categories (e.g., color: [red, blue] becomes 0, 1).

Real-Life Example: Encoding Product Categories

In an e-commerce dataset, category (e.g., Electronics, Clothing) needs encoding for machine learning.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample dataset
data = {
    'product_id': [1, 2, 3, 4],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books']
}
df = pd.DataFrame(data)

# One-hot encoding
df_onehot = pd.get_dummies(df, columns=['category'], prefix='cat')
print("One-hot encoded:\n", df_onehot)

# Label encoding
le = LabelEncoder()
df['category_label'] = le.fit_transform(df['category'])
print("\nLabel encoded:\n", df)

Output:

One-hot encoded:
   product_id  cat_Books  cat_Clothing  cat_Electronics
0          1          0             0                1
1          2          0             1                0
2          3          0             0                1
3          4          1             0                0

Label encoded:
   product_id     category  category_label
0           1  Electronics               2
1           2     Clothing               1
2           3  Electronics               2
3           4        Books               0

SQL Example

-- Sample table
CREATE TABLE products (
    product_id INT,
    category VARCHAR(50)
);

INSERT INTO products VALUES
(1, 'Electronics'),
(2, 'Clothing'),
(3, 'Electronics'),
(4, 'Books');

-- One-hot encoding
ALTER TABLE products
ADD cat_electronics INT,
ADD cat_clothing INT,
ADD cat_books INT;

UPDATE products
SET cat_electronics = CASE WHEN category = 'Electronics' THEN 1 ELSE 0 END,
    cat_clothing = CASE WHEN category = 'Clothing' THEN 1 ELSE 0 END,
    cat_books = CASE WHEN category = 'Books' THEN 1 ELSE 0 END;

-- Label encoding (matching scikit-learn's alphabetical assignment: Books = 0, Clothing = 1, Electronics = 2)
ALTER TABLE products ADD category_label INT;
UPDATE products
SET category_label = CASE
    WHEN category = 'Books' THEN 0
    WHEN category = 'Clothing' THEN 1
    WHEN category = 'Electronics' THEN 2
END;

SELECT * FROM products;

Pros, Cons, and Alternatives

  • One-Hot Encoding Pros: No ordinal assumption, works with most algorithms.

  • Cons: Increases dimensionality, not ideal for high-cardinality data.

  • Label Encoding Pros: Compact representation.

  • Cons: Implies ordinality, unsuitable for non-ordinal categories.

  • Alternatives: Target encoding, frequency encoding.
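
As a hedged sketch of one alternative, frequency encoding replaces each category with its relative frequency, keeping a single column even for high-cardinality data; assuming the same products data as above:

import pandas as pd

df = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books']
})

# Map each category to its share of the rows: Electronics -> 0.50, Clothing -> 0.25, Books -> 0.25
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)
print(df)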


5. Feature Scaling and Standardization

Why Scale Features?

Machine learning algorithms like SVM or neural networks are sensitive to feature scales. For example, a salary column (range: 30,000–150,000) and an age column (range: 18–80) need scaling to ensure equal contribution.

Techniques

  • Min-Max Scaling: Scales to [0, 1].

  • Standardization: Centers data around mean with unit variance.
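
For reference, min-max scaling computes x_scaled = (x - min) / (max - min), as in Section 3, while standardization computes z = (x - mean) / std, using the column's mean and standard deviation.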

Real-Life Example: Scaling Customer Data

In a marketing dataset, income and age need scaling for a clustering model.

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample dataset
data = {
    'customer_id': [1, 2, 3, 4],
    'age': [25, 45, 35, 60],
    'income': [30000, 75000, 50000, 120000]
}
df = pd.DataFrame(data)

# Min-Max Scaling
minmax_scaler = MinMaxScaler()
df[['age_minmax', 'income_minmax']] = minmax_scaler.fit_transform(df[['age', 'income']])
print("Min-Max Scaled:\n", df)

# Standardization
std_scaler = StandardScaler()
df[['age_std', 'income_std']] = std_scaler.fit_transform(df[['age', 'income']])
print("\nStandardized:\n", df)

Output:

Min-Max Scaled:
   customer_id  age  income  age_minmax  income_minmax
0          1   25   30000    0.000000       0.000000
1          2   45   75000    0.571429       0.500000
2          3   35   50000    0.285714       0.222222
3          4   60  120000    1.000000       1.000000

Standardized:
   customer_id  age  income  age_minmax  income_minmax  age_std  income_std
0            1   25   30000    0.000000       0.000000    -1.26       -1.15
1            2   45   75000    0.571429       0.500000     0.29        0.19
2            3   35   50000    0.285714       0.222222    -0.48       -0.56
3            4   60  120000    1.000000       1.000000     1.45        1.52

SQL Example

-- Sample table
CREATE TABLE customers (
    customer_id INT,
    age INT,
    income DECIMAL(10, 2)
);

INSERT INTO customers VALUES
(1, 25, 30000),
(2, 45, 75000),
(3, 35, 50000),
(4, 60, 120000);

-- Min-Max Scaling
ALTER TABLE customers
ADD age_minmax DECIMAL(5, 4),
ADD income_minmax DECIMAL(5, 4);

-- Cast age to DECIMAL first: age is an INT, and integer division would truncate the result to 0 or 1
UPDATE customers
SET age_minmax = CAST(age - (SELECT MIN(age) FROM customers) AS DECIMAL(10, 4))
                 / (SELECT MAX(age) - MIN(age) FROM customers),
    income_minmax = (income - (SELECT MIN(income) FROM customers)) / (SELECT MAX(income) - MIN(income) FROM customers);

-- Standardization (STDDEV is the sample standard deviation; scikit-learn's StandardScaler uses the population version)
ALTER TABLE customers
ADD age_std DECIMAL(5, 2),
ADD income_std DECIMAL(5, 2);

UPDATE customers
SET age_std = (age - (SELECT AVG(age) FROM customers)) / (SELECT STDDEV(age) FROM customers),
    income_std = (income - (SELECT AVG(income) FROM customers)) / (SELECT STDDEV(income) FROM customers);

SELECT * FROM customers;

Best Practices

  • Choose Appropriate Scaler: Use Min-Max for bounded ranges, standardization for unbounded data.

  • Fit on Training Data Only: Avoid data leakage by fitting scalers on the training set and reusing them to transform the test set (see the sketch after this list).

  • Handle Outliers: Use RobustScaler for outlier-heavy data.
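
A minimal sketch of the fit-on-training-data-only pattern, assuming a scikit-learn workflow with hypothetical arrays X and y:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix [age, income] and a dummy target
X = np.array([[25, 30000], [45, 75000], [35, 50000], [60, 120000]])
y = np.array([0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit only on the training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics; never refit on test data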

Pros, Cons, and Alternatives

  • Pros: Improves model convergence, ensures fair feature contribution.

  • Cons: Min-Max scaling is sensitive to outliers; standardization is most meaningful when features are roughly Gaussian.

  • Alternatives: RobustScaler, QuantileTransformer.
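
As a brief sketch of the RobustScaler alternative, which centers on the median and scales by the interquartile range so a single extreme value distorts the result far less than Min-Max scaling:

import pandas as pd
from sklearn.preprocessing import RobustScaler

# Income column with one extreme outlier
df = pd.DataFrame({'income': [30000, 75000, 50000, 120000, 1000000]})

# RobustScaler subtracts the median and divides by the IQR (75th - 25th percentile)
df[['income_robust']] = RobustScaler().fit_transform(df[['income']])
print(df)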


6. Working with Dates and Timestamps

Parsing and Formatting Dates

Dates often come in inconsistent formats (e.g., "2023-01-01" vs. "01/01/23"). Parsing standardizes them for analysis.

Time Zone Handling

For global datasets, handling time zones is crucial to align timestamps.

Real-Life Example: Analyzing Sales Timestamps

In a sales dataset, order_date needs parsing and time zone conversion.

import pandas as pd

# Sample dataset
data = {
    'order_id': [1, 2, 3],
    'order_date': ['2023-01-01 10:00', '01/02/2023 15:30', '2023-01-03']
}
df = pd.DataFrame(data)

# Parse dates: format='mixed' (pandas 2.0+) infers each value's format; errors='coerce' turns bad strings into NaT
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce', format='mixed')
print("Parsed dates:\n", df)

# Convert to UTC
df['order_date_utc'] = df['order_date'].dt.tz_localize('US/Pacific').dt.tz_convert('UTC')
print("\nWith UTC conversion:\n", df)

# Extract components
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
print("\nWith extracted components:\n", df)

Output:

Parsed dates:
   order_id order_date
0        1 2023-01-01 10:00:00
1        2 2023-01-02 15:30:00
2        3 2023-01-03 00:00:00

With UTC conversion:
   order_id order_date         order_date_utc
0        1 2023-01-01 10:00:00 2023-01-01 18:00:00+00:00
1        2 2023-01-02 15:30:00 2023-01-02 23:30:00+00:00
2        3 2023-01-03 00:00:00 2023-01-03 08:00:00+00:00

With extracted components:
   order_id order_date         order_date_utc  year  month
0        1 2023-01-01 10:00:00 2023-01-01 18:00:00+00:00  2023      1
1        2 2023-01-02 15:30:00 2023-01-02 23:30:00+00:00  2023      1
2        3 2023-01-03 00:00:00 2023-01-03 08:00:00+00:00  2023      1

SQL Example

-- Sample table
CREATE TABLE sales_orders (
    order_id INT,
    order_date VARCHAR(50)
);

INSERT INTO sales_orders VALUES
(1, '2023-01-01 10:00'),
(2, '01/02/2023 15:30'),
(3, '2023-01-03');

-- Parse dates (a single format string cannot handle the mixed inputs, so match each pattern explicitly)
ALTER TABLE sales_orders ADD order_date_parsed TIMESTAMP;
UPDATE sales_orders
SET order_date_parsed = CASE
    WHEN order_date ~ '^\d{4}-\d{2}-\d{2} \d{2}:\d{2}$' THEN TO_TIMESTAMP(order_date, 'YYYY-MM-DD HH24:MI')
    WHEN order_date ~ '^\d{2}/\d{2}/\d{4} \d{2}:\d{2}$' THEN TO_TIMESTAMP(order_date, 'MM/DD/YYYY HH24:MI')
    WHEN order_date ~ '^\d{4}-\d{2}-\d{2}$'             THEN TO_TIMESTAMP(order_date, 'YYYY-MM-DD')
END;

-- Convert to UTC (assuming Pacific time)
ALTER TABLE sales_orders ADD order_date_utc TIMESTAMP;
UPDATE sales_orders
SET order_date_utc = order_date_parsed AT TIME ZONE 'US/Pacific' AT TIME ZONE 'UTC';

-- Extract components
ALTER TABLE sales_orders
ADD year INT,
ADD month INT;
UPDATE sales_orders
SET year = EXTRACT(YEAR FROM order_date_parsed),
    month = EXTRACT(MONTH FROM order_date_parsed);

SELECT * FROM sales_orders;

Best Practices

  • Use Standard Formats: ISO 8601 (e.g., 2023-01-01T10:00:00Z) for consistency.

  • Handle Time Zones: Always convert to UTC for global datasets.

  • Validate Dates: Check for invalid formats using errors='coerce' in pandas or a dialect-specific safe cast in SQL (for example, TRY_CONVERT in SQL Server or TRY_TO_TIMESTAMP in Snowflake).

Pros, Cons, and Alternatives

  • Pros: Enables time-based analysis, ensures consistency.

  • Cons: Time zone conversions can be complex.

  • Alternatives: Libraries like arrow or pendulum for Python.


7. Data Validation and Integrity Checks

Ensuring Data Quality

Validation ensures data meets business rules (e.g., age > 0, email format).

Validation Techniques

  • Range Checks: Ensure values fall within acceptable ranges.

  • Format Checks: Validate formats like emails or phone numbers.

  • Uniqueness Checks: Ensure no duplicates in key columns.
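
The worked example below covers range and format checks; as a quick sketch of a uniqueness check in pandas, using a small hypothetical customer table:

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'b@example.com', 'c@example.com']
})

# Flag every row whose customer_id appears more than once
duplicates = df[df.duplicated(subset=['customer_id'], keep=False)]
print("Duplicate key rows:\n", duplicates)

# Keep only the first occurrence of each customer_id
df_unique = df.drop_duplicates(subset=['customer_id'], keep='first')
print("\nDeduplicated:\n", df_unique)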

Real-Life Example: Validating Customer Data

In a customer dataset, validate email and age.

import pandas as pd
import re

# Sample dataset
data = {
    'customer_id': [1, 2, 3],
    'email': ['john@example.com', 'invalid', 'jane@domain.com'],
    'age': [25, -5, 30]
}
df = pd.DataFrame(data)

# Validate email format
def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

df['email_valid'] = df['email'].apply(is_valid_email)
print("Email validation:\n", df)

# Validate age (must be a plausible human age)
df['age_valid'] = df['age'].apply(lambda x: 0 < x < 120)
print("\nAge validation:\n", df)

# Handle invalid data
df.loc[~df['email_valid'], 'email'] = 'invalid@example.com'
# Impute invalid ages with the mean of the valid ages only (the raw mean would include the bad -5)
df.loc[~df['age_valid'], 'age'] = df.loc[df['age_valid'], 'age'].mean()
print("\nAfter handling invalid data:\n", df)

Output:

Email validation:
   customer_id             email  age  email_valid
0          1   john@example.com   25         True
1          2           invalid   -5        False
2          3    jane@domain.com   30         True

Age validation:
   customer_id             email  age  email_valid  age_valid
0          1   john@example.com   25         True      True
1          2           invalid   -5        False     False
2          3    jane@domain.com   30         True      True

After handling invalid data:
   customer_id                 email        age  email_valid  age_valid
0          1     john@example.com  25.000000         True      True
1          2  invalid@example.com  27.500000        False     False
2          3      jane@domain.com  30.000000         True      True

SQL Example

-- Sample table
CREATE TABLE customers (
    customer_id INT,
    email VARCHAR(100),
    age INT
);

INSERT INTO customers VALUES
(1, 'john@example.com', 25),
(2, 'invalid', -5),
(3, 'jane@domain.com', 30);

-- Add validation columns
ALTER TABLE customers
ADD email_valid BOOLEAN,
ADD age_valid BOOLEAN;

-- Validate email
UPDATE customers
SET email_valid = email ~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$';

-- Validate age
UPDATE customers
SET age_valid = age > 0 AND age < 120;

-- Handle invalid data (two separate statements: a single UPDATE cannot carry per-column WHERE clauses)
UPDATE customers
SET email = 'invalid@example.com'
WHERE email_valid = FALSE;

-- age is an INT column, so the average of the valid ages is rounded on assignment
UPDATE customers
SET age = (SELECT AVG(age) FROM customers WHERE age_valid = TRUE)
WHERE age_valid = FALSE;

SELECT * FROM customers;

Best Practices

  • Automate Validation: Use scripts to check data regularly.

  • Log Issues: Store validation failures for auditing.

  • Handle Exceptions: Use try-except blocks in Python or TRY_ functions in SQL.
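
A minimal sketch of the exception-handling idea in Python, using a hypothetical safe_float helper:

def safe_float(value):
    """Convert a raw value to float, returning None instead of raising on bad input."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Tolerate messy input without crashing the pipeline
print(safe_float('1000.50'))  # 1000.5
print(safe_float('invalid'))  # None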

Pros, Cons, and Alternatives

  • Pros: Ensures data reliability, prevents downstream errors.

  • Cons: Time-consuming for large datasets.

  • Alternatives: Data quality tools like Great Expectations or Talend.


8. Conclusion

Recap of Key Concepts

Module 5 covered essential data cleaning and preprocessing techniques:

  • Handling missing values with imputation or removal.

  • Converting and normalizing data types for consistency.

  • Encoding categorical data using one-hot or label encoding.

  • Scaling features for machine learning compatibility.

  • Parsing and standardizing dates and timestamps.

  • Validating data to ensure quality and integrity.
