Table of Contents
Introduction to Data Cleaning & Preprocessing
Importance in Data Analysis
Real-World Relevance
Handling Missing and Null Values
Understanding Missing Data
Techniques for Handling Missing Values
Python and SQL Examples
Best Practices and Alternatives
Data Type Conversion and Normalization
Importance of Data Types
Conversion Techniques
Normalization Methods
Real-Life Examples and Code
Handling Categorical Data
One-Hot Encoding
Label Encoding
Practical Examples in Python and SQL
Pros, Cons, and Alternatives
Feature Scaling and Standardization
Why Scale Features?
Techniques: Min-Max Scaling, Standardization
Python and SQL Implementations
Real-World Applications
Working with Dates and Timestamps
Parsing and Formatting Dates
Time Zone Handling
Python and SQL Code Examples
Best Practices
Data Validation and Integrity Checks
Ensuring Data Quality
Validation Techniques
Python and SQL Implementations
Exception Handling
Conclusion
Recap of Key Concepts
Next Steps in Data Analysis
1. Introduction to Data Cleaning & Preprocessing
Importance in Data Analysis
Data cleaning and preprocessing are critical steps in the data analysis pipeline, ensuring that datasets are accurate, consistent, and ready for modeling. Raw data often contains inconsistencies, missing values, or incorrect formats that can skew results or break algorithms. In Module 5 of our Master Data Analysis series, we dive deep into preparing data using Python and SQL, two powerful tools for modern data analysts.
Real-World Relevance
Imagine you're a data analyst at an e-commerce company analyzing customer purchase data. The dataset might include missing entries for customer addresses, inconsistent date formats, or categorical variables like product categories that need encoding. Without proper cleaning and preprocessing, your analysis could lead to incorrect insights, such as misidentifying top-selling products or failing to segment customers accurately.
This guide provides detailed, real-world examples, such as cleaning sales data for a retail chain or preprocessing patient records for a hospital, to make the concepts relatable and actionable. We’ll use Python’s pandas and scikit-learn libraries alongside SQL queries to demonstrate robust, production-ready solutions.
2. Handling Missing and Null Values
Understanding Missing Data
Missing data can arise due to human error, system failures, or incomplete data collection. For example, in a healthcare dataset, patient blood pressure readings might be missing due to equipment failure. Missing values can be:
MCAR (Missing Completely at Random): No pattern to missingness.
MAR (Missing at Random): Missingness depends on observed data.
MNAR (Missing Not at Random): Missingness depends on unobserved data.
Techniques for Handling Missing Values
Removal: Drop rows or columns with missing values when they represent an insignificant share of the data.
Imputation: Replace missing values with statistical measures (mean, median, mode) or predictive models.
Flagging: Add a binary column to indicate missingness.
Python Example: Handling Missing Values in a Retail Dataset
Let’s consider a retail dataset with customer purchases, where some entries for price and customer_rating are missing.
import pandas as pd

# Sample retail dataset
data = {
    'product_id': [1, 2, 3, 4, 5],
    'product_name': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'],
    'price': [999.99, None, 299.99, 199.99, None],
    'customer_rating': [4.5, 4.0, None, 3.8, 4.2]
}
df = pd.DataFrame(data)

# Record which prices are missing BEFORE imputation (used for the flag in step 4)
price_was_missing = df['price'].isna()

# 1. Drop rows with any missing values
df_dropped = df.dropna()
print("After dropping rows with missing values:\n", df_dropped)

# 2. Impute missing price with the mean (assign back instead of chained inplace fillna)
df['price'] = df['price'].fillna(df['price'].mean())
print("\nAfter imputing price with mean:\n", df)

# 3. Impute missing rating with the median
df['customer_rating'] = df['customer_rating'].fillna(df['customer_rating'].median())
print("\nAfter imputing rating with median:\n", df)

# 4. Add missing flag for price, based on the original data
df['price_missing'] = price_was_missing.astype(int)
print("\nWith missing flag:\n", df)
Output:
After dropping rows with missing values:
    product_id product_name   price  customer_rating
0            1       Laptop  999.99              4.5
3            4      Monitor  199.99              3.8

After imputing price with mean:
    product_id product_name   price  customer_rating
0            1       Laptop  999.99              4.5
1            2        Phone  499.99              4.0
2            3       Tablet  299.99              NaN
3            4      Monitor  199.99              3.8
4            5     Keyboard  499.99              4.2

After imputing rating with median:
    product_id product_name   price  customer_rating
0            1       Laptop  999.99              4.5
1            2        Phone  499.99              4.0
2            3       Tablet  299.99              4.1
3            4      Monitor  199.99              3.8
4            5     Keyboard  499.99              4.2

With missing flag:
    product_id product_name   price  customer_rating  price_missing
0            1       Laptop  999.99              4.5              0
1            2        Phone  499.99              4.0              1
2            3       Tablet  299.99              4.1              0
3            4      Monitor  199.99              3.8              0
4            5     Keyboard  499.99              4.2              1
SQL Example: Handling Missing Values
In a SQL database, you might handle missing values using COALESCE or CASE statements.
-- Sample table
CREATE TABLE retail (
product_id INT,
product_name VARCHAR(50),
price DECIMAL(10, 2),
customer_rating DECIMAL(3, 1)
);
INSERT INTO retail VALUES
(1, 'Laptop', 999.99, 4.5),
(2, 'Phone', NULL, 4.0),
(3, 'Tablet', 299.99, NULL),
(4, 'Monitor', 199.99, 3.8),
(5, 'Keyboard', NULL, 4.2);
-- Flag missing prices BEFORE imputation (after imputation every flag would be 0)
ALTER TABLE retail ADD price_missing INT;

UPDATE retail
SET price_missing = CASE WHEN price IS NULL THEN 1 ELSE 0 END;

-- Impute missing price with the average of the non-null prices
UPDATE retail
SET price = (SELECT AVG(price) FROM retail WHERE price IS NOT NULL)
WHERE price IS NULL;

-- Impute missing rating with the median
-- (for an even row count this picks the lower middle value, a common SQL approximation)
UPDATE retail
SET customer_rating = (
    SELECT customer_rating
    FROM (
        SELECT customer_rating,
               ROW_NUMBER() OVER (ORDER BY customer_rating) AS rn,
               COUNT(*) OVER () AS cnt
        FROM retail
        WHERE customer_rating IS NOT NULL
    ) t
    WHERE rn = (cnt + 1) / 2
)
WHERE customer_rating IS NULL;
SELECT * FROM retail;
Best Practices
Assess Missingness: Use visualization (e.g., the missingno library in Python; see the sketch after this list) to understand patterns.
Choose Imputation Wisely: Use the mean for roughly symmetric numerical data, the median for skewed data, and the mode for categorical data.
Document Changes: Keep track of imputation methods for reproducibility.
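A minimal sketch of the visualization step, assuming the missingno library is installed (pip install missingno):
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
# Retail example with its original gaps
df = pd.DataFrame({
    'price': [999.99, None, 299.99, 199.99, None],
    'customer_rating': [4.5, 4.0, None, 3.8, 4.2]
})
# Matrix plot: white gaps mark missing values, making patterns easy to spot
msno.matrix(df)
plt.show()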
Pros, Cons, and Alternatives
Pros of Imputation: Preserves data size, maintains statistical power.
Cons: Can introduce bias if missingness is MNAR.
Alternatives: Advanced methods like K-Nearest Neighbors (KNN) imputation or predictive modeling.
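As a sketch of the KNN alternative, here is scikit-learn's KNNImputer applied to the retail columns from above (the n_neighbors value is illustrative):
import pandas as pd
from sklearn.impute import KNNImputer
# Retail example with its original gaps
df = pd.DataFrame({
    'price': [999.99, None, 299.99, 199.99, None],
    'customer_rating': [4.5, 4.0, None, 3.8, 4.2]
})
# Each missing value is filled with the mean of the 2 most similar rows,
# where similarity is measured on the values that are present
imputer = KNNImputer(n_neighbors=2)
df[['price', 'customer_rating']] = imputer.fit_transform(df[['price', 'customer_rating']])
print(df)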
3. Data Type Conversion and Normalization
Importance of Data Types
Correct data types ensure algorithms process data accurately. For instance, a price column stored as a string ("$999.99") needs conversion to a float for calculations.
Conversion Techniques
Python: Use pandas methods like astype() or to_numeric() (see the sketch after this list).
SQL: Use CAST or CONVERT.
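A quick sketch of the two pandas paths mentioned above; the sample values are illustrative:
import pandas as pd
df = pd.DataFrame({'quantity': ['5', '10', '8'],
                   'price': ['$999.99', '$299.99', 'N/A']})
# astype() is strict: it raises an error if any value cannot be converted
df['quantity'] = df['quantity'].astype(int)
# to_numeric() with errors='coerce' is forgiving: unconvertible values become NaN
df['price'] = pd.to_numeric(df['price'].str.replace('$', '', regex=False), errors='coerce')
print(df.dtypes)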
Normalization Methods
Normalization rescales data to a common range, typically [0, 1], using (x - min) / (max - min), so that features measured on different scales remain comparable.
Real-Life Example: Normalizing Sales Data
In a sales dataset, revenue may arrive as text in dollars, and scale-sensitive machine learning models benefit from normalized values.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample sales dataset
data = {
    'order_id': [101, 102, 103, 104],
    'revenue': ['1000.50', '2500.75', 'invalid', '3000.00'],
    'quantity': ['5', '10', '8', '15']
}
df = pd.DataFrame(data)
# Convert revenue to float, handle errors
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')
print("After type conversion:\n", df)
# Normalize revenue
scaler = MinMaxScaler()
df['revenue_normalized'] = scaler.fit_transform(df[['revenue']].fillna(df['revenue'].mean()))
print("\nAfter normalization:\n", df)
Output:
After type conversion:
   order_id  revenue quantity
0       101  1000.50        5
1       102  2500.75       10
2       103      NaN        8
3       104  3000.00       15

After normalization:
   order_id  revenue quantity  revenue_normalized
0       101  1000.50        5            0.000000
1       102  2500.75       10            0.750313
2       103      NaN        8            0.583438
3       104  3000.00       15            1.000000
SQL Example
-- Sample sales table
CREATE TABLE sales (
order_id INT,
revenue VARCHAR(10),
quantity VARCHAR(5)
);
INSERT INTO sales VALUES
(101, '1000.50', '5'),
(102, '2500.75', '10'),
(103, 'invalid', '8'),
(104, '3000.00', '15');
-- Convert revenue to DECIMAL, keeping the original text column intact
ALTER TABLE sales ADD revenue_numeric DECIMAL(10, 2);

UPDATE sales
SET revenue_numeric = CAST(CASE WHEN revenue ~ '^[0-9]+\.?[0-9]*$' THEN revenue ELSE NULL END AS DECIMAL(10, 2));

-- Min-max normalize into a new column so the converted values are preserved
ALTER TABLE sales ADD revenue_normalized DECIMAL(10, 6);

UPDATE sales
SET revenue_normalized = (
    (revenue_numeric - (SELECT MIN(revenue_numeric) FROM sales))
    / (SELECT MAX(revenue_numeric) - MIN(revenue_numeric) FROM sales)
)
WHERE revenue_numeric IS NOT NULL;
SELECT * FROM sales;
Best Practices
Validate Before Conversion: Check for invalid entries using regex or error handling.
Preserve Original Data: Create new columns for converted values.
Handle Errors: Use errors='coerce' in pandas or CASE in SQL.
Pros, Cons, and Alternatives
Pros: Ensures compatibility with algorithms, improves model performance.
Cons: Risk of data loss during coercion.
Alternatives: RobustScaler for outlier-heavy data.
4. Handling Categorical Data
One-Hot Encoding
Converts categories into binary columns (e.g., color: [red, blue] becomes is_red, is_blue).
Label Encoding
Assigns integers to categories (e.g., color: [red, blue] becomes 0, 1).
Real-Life Example: Encoding Product Categories
In an e-commerce dataset, category (e.g., Electronics, Clothing) needs encoding for machine learning.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample dataset
data = {
    'product_id': [1, 2, 3, 4],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books']
}
df = pd.DataFrame(data)
# One-hot encoding
df_onehot = pd.get_dummies(df, columns=['category'], prefix='cat')
print("One-hot encoded:\n", df_onehot)
# Label encoding
le = LabelEncoder()
df['category_label'] = le.fit_transform(df['category'])
print("\nLabel encoded:\n", df)
Output:
One-hot encoded:
    product_id  cat_Books  cat_Clothing  cat_Electronics
0            1          0             0                1
1            2          0             1                0
2            3          0             0                1
3            4          1             0                0

Label encoded:
   product_id     category  category_label
0           1  Electronics               2
1           2     Clothing               1
2           3  Electronics               2
3           4        Books               0
SQL Example
-- Sample table
CREATE TABLE products (
product_id INT,
category VARCHAR(50)
);
INSERT INTO products VALUES
(1, 'Electronics'),
(2, 'Clothing'),
(3, 'Electronics'),
(4, 'Books');
-- One-hot encoding
ALTER TABLE products
ADD cat_electronics INT,
ADD cat_clothing INT,
ADD cat_books INT;
UPDATE products
SET cat_electronics = CASE WHEN category = 'Electronics' THEN 1 ELSE 0 END,
cat_clothing = CASE WHEN category = 'Clothing' THEN 1 ELSE 0 END,
cat_books = CASE WHEN category = 'Books' THEN 1 ELSE 0 END;
-- Label encoding (alphabetical order, to match LabelEncoder in the Python example)
ALTER TABLE products ADD category_label INT;

UPDATE products
SET category_label = CASE
    WHEN category = 'Books' THEN 0
    WHEN category = 'Clothing' THEN 1
    WHEN category = 'Electronics' THEN 2
END;
SELECT * FROM products;
Pros, Cons, and Alternatives
One-Hot Encoding Pros: No ordinal assumption, works with most algorithms.
Cons: Increases dimensionality, not ideal for high-cardinality data.
Label Encoding Pros: Compact representation.
Cons: Implies ordinality, unsuitable for non-ordinal categories.
Alternatives: Target encoding, frequency encoding.
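For instance, a minimal frequency-encoding sketch on the same product data (the new column name is illustrative):
import pandas as pd
df = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Books']
})
# Frequency encoding: replace each category with its relative frequency,
# keeping a single numeric column even for high-cardinality features
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)
print(df)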
5. Feature Scaling and Standardization
Why Scale Features?
Machine learning algorithms like SVM or neural networks are sensitive to feature scales. For example, a salary column (range: 30,000–150,000) and an age column (range: 18–80) need scaling to ensure equal contribution.
Techniques
Min-Max Scaling: Rescales each value to [0, 1] via (x - min) / (max - min).
Standardization: Converts each value to a z-score via (x - mean) / std, centering the data at 0 with unit variance.
Real-Life Example: Scaling Customer Data
In a marketing dataset, income and age need scaling for a clustering model.
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample dataset
data = {
    'customer_id': [1, 2, 3, 4],
    'age': [25, 45, 35, 60],
    'income': [30000, 75000, 50000, 120000]
}
df = pd.DataFrame(data)
# Min-Max Scaling
minmax_scaler = MinMaxScaler()
df[['age_minmax', 'income_minmax']] = minmax_scaler.fit_transform(df[['age', 'income']])
print("Min-Max Scaled:\n", df)
# Standardization
std_scaler = StandardScaler()
df[['age_std', 'income_std']] = std_scaler.fit_transform(df[['age', 'income']])
print("\nStandardized:\n", df)
Output:
Min-Max Scaled:
   customer_id  age  income  age_minmax  income_minmax
0            1   25   30000    0.000000       0.000000
1            2   45   75000    0.571429       0.500000
2            3   35   50000    0.285714       0.222222
3            4   60  120000    1.000000       1.000000

Standardized:
   customer_id  age  income  age_minmax  income_minmax  age_std  income_std
0            1   25   30000    0.000000       0.000000    -1.26       -1.15
1            2   45   75000    0.571429       0.500000     0.29        0.19
2            3   35   50000    0.285714       0.222222    -0.48       -0.56
3            4   60  120000    1.000000       1.000000     1.45        1.52
SQL Example
-- Sample table
CREATE TABLE customers (
customer_id INT,
age INT,
income DECIMAL(10, 2)
);
INSERT INTO customers VALUES
(1, 25, 30000),
(2, 45, 75000),
(3, 35, 50000),
(4, 60, 120000);
-- Min-Max Scaling
ALTER TABLE customers
ADD age_minmax DECIMAL(5, 4),
ADD income_minmax DECIMAL(5, 4);
-- Cast so integer division does not truncate age_minmax (age is an INT column)
UPDATE customers
SET age_minmax = CAST(age - (SELECT MIN(age) FROM customers) AS DECIMAL)
                 / (SELECT MAX(age) - MIN(age) FROM customers),
    income_minmax = (income - (SELECT MIN(income) FROM customers))
                    / (SELECT MAX(income) - MIN(income) FROM customers);
-- Standardization
ALTER TABLE customers
ADD age_std DECIMAL(5, 2),
ADD income_std DECIMAL(5, 2);
-- STDDEV is the sample standard deviation, so results differ slightly
-- from scikit-learn's population-based StandardScaler
UPDATE customers
SET age_std = (age - (SELECT AVG(age) FROM customers)) / (SELECT STDDEV(age) FROM customers),
    income_std = (income - (SELECT AVG(income) FROM customers)) / (SELECT STDDEV(income) FROM customers);
SELECT * FROM customers;
Best Practices
Choose Appropriate Scaler: Use Min-Max for bounded ranges, standardization for unbounded data.
Fit on Training Data Only: Avoid data leakage by fitting scalers on training data and reusing them on test data (sketched after this list).
Handle Outliers: Use RobustScaler for outlier-heavy data.
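A minimal sketch of the train-only fit mentioned above; the data and split are illustrative:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
df = pd.DataFrame({'age': [25, 45, 35, 60, 30, 52],
                   'income': [30000, 75000, 50000, 120000, 42000, 90000]})
train, test = train_test_split(df, test_size=0.33, random_state=42)
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics learned from training rows only
test_scaled = scaler.transform(test)        # the same statistics reused on unseen rows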
Pros, Cons, and Alternatives
Pros: Improves model convergence, ensures fair feature contribution.
Cons: Min-Max scaling is sensitive to outliers; standardization works best when the data is roughly Gaussian.
Alternatives: RobustScaler, QuantileTransformer.
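A minimal sketch of the RobustScaler alternative; the appended outlier row is illustrative:
import pandas as pd
from sklearn.preprocessing import RobustScaler
df = pd.DataFrame({'income': [30000, 75000, 50000, 120000, 1000000]})  # last value is an outlier
# RobustScaler centers on the median and scales by the interquartile range,
# so a single extreme value does not compress the rest of the data
df['income_robust'] = RobustScaler().fit_transform(df[['income']]).ravel()
print(df)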
6. Working with Dates and Timestamps
Parsing and Formatting Dates
Dates often come in inconsistent formats (e.g., "2023-01-01" vs. "01/01/23"). Parsing standardizes them for analysis.
Time Zone Handling
For global datasets, handling time zones is crucial to align timestamps.
Real-Life Example: Analyzing Sales Timestamps
In a sales dataset, order_date needs parsing and time zone conversion.
import pandas as pd
from datetime import datetime
import pytz
# Sample dataset
data = {
    'order_id': [1, 2, 3],
    'order_date': ['2023-01-01 10:00', '01/02/2023 15:30', '2023-01-03']
}
df = pd.DataFrame(data)
# Parse dates
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce', format='mixed')
print("Parsed dates:\n", df)
# Convert to UTC
df['order_date_utc'] = df['order_date'].dt.tz_localize('US/Pacific').dt.tz_convert('UTC')
print("\nWith UTC conversion:\n", df)
# Extract components
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
print("\nWith extracted components:\n", df)
Output:
Parsed dates:
   order_id          order_date
0         1 2023-01-01 10:00:00
1         2 2023-01-02 15:30:00
2         3 2023-01-03 00:00:00

With UTC conversion:
   order_id          order_date            order_date_utc
0         1 2023-01-01 10:00:00 2023-01-01 18:00:00+00:00
1         2 2023-01-02 15:30:00 2023-01-02 23:30:00+00:00
2         3 2023-01-03 00:00:00 2023-01-03 08:00:00+00:00

With extracted components:
   order_id          order_date            order_date_utc  year  month
0         1 2023-01-01 10:00:00 2023-01-01 18:00:00+00:00  2023      1
1         2 2023-01-02 15:30:00 2023-01-02 23:30:00+00:00  2023      1
2         3 2023-01-03 00:00:00 2023-01-03 08:00:00+00:00  2023      1
SQL Example
-- Sample table
CREATE TABLE sales_orders (
order_id INT,
order_date VARCHAR(50)
);
INSERT INTO sales_orders VALUES
(1, '2023-01-01 10:00'),
(2, '01/02/2023 15:30'),
(3, '2023-01-03');
-- Parse dates (the source column mixes two formats, so branch on the pattern)
ALTER TABLE sales_orders ADD order_date_parsed TIMESTAMP;

UPDATE sales_orders
SET order_date_parsed = CASE
    WHEN order_date ~ '^\d{4}-\d{2}-\d{2}' THEN TO_TIMESTAMP(order_date, 'YYYY-MM-DD HH24:MI')
    WHEN order_date ~ '^\d{2}/\d{2}/\d{4}' THEN TO_TIMESTAMP(order_date, 'MM/DD/YYYY HH24:MI')
END;
-- Convert to UTC (assuming Pacific time)
ALTER TABLE sales_orders ADD order_date_utc TIMESTAMP;
UPDATE sales_orders
SET order_date_utc = order_date_parsed AT TIME ZONE 'US/Pacific' AT TIME ZONE 'UTC';
-- Extract components
ALTER TABLE sales_orders
ADD year INT,
ADD month INT;
UPDATE sales_orders
SET year = EXTRACT(YEAR FROM order_date_parsed),
month = EXTRACT(MONTH FROM order_date_parsed);
SELECT * FROM sales_orders;
Best Practices
Use Standard Formats: ISO 8601 (e.g., 2023-01-01T10:00:00Z) for consistency; see the one-liner after this list.
Handle Time Zones: Always convert to UTC for global datasets.
Validate Dates: Check for invalid formats using errors='coerce' or TRY_TO_TIMESTAMP.
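For instance, a pandas one-liner that applies the first two practices to the timestamp from the earlier example:
import pandas as pd
ts = pd.Timestamp('2023-01-01 10:00', tz='US/Pacific')
print(ts.tz_convert('UTC').isoformat())  # 2023-01-01T18:00:00+00:00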
Pros, Cons, and Alternatives
Pros: Enables time-based analysis, ensures consistency.
Cons: Time zone conversions can be complex.
Alternatives: Libraries like arrow or pendulum for Python.
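A minimal sketch of the pendulum option, assuming the library is installed (pip install pendulum):
import pendulum
# Build a timestamp in a named zone, then convert it to UTC
order_time = pendulum.datetime(2023, 1, 1, 10, 0, tz='US/Pacific')
print(order_time.in_timezone('UTC'))  # 2023-01-01 18:00 UTC
# ISO 8601 strings parse directly; tz supplies the zone when the string has none
parsed = pendulum.parse('2023-01-02T15:30:00', tz='US/Pacific')
print(parsed.in_timezone('UTC'))      # 2023-01-02 23:30 UTC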
7. Data Validation and Integrity Checks
Ensuring Data Quality
Validation ensures data meets business rules (e.g., age > 0, email format).
Validation Techniques
Range Checks: Ensure values fall within acceptable ranges.
Format Checks: Validate formats like emails or phone numbers.
Uniqueness Checks: Ensure no duplicates in key columns.
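The example below demonstrates range and format checks; for the uniqueness check, a minimal pandas sketch (the sample keys are illustrative):
import pandas as pd
df = pd.DataFrame({'customer_id': [1, 2, 2, 3]})
# keep=False marks every occurrence of a duplicated key, not just the repeats
duplicates = df[df.duplicated(subset=['customer_id'], keep=False)]
print(duplicates)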
Real-Life Example: Validating Customer Data
In a customer dataset, validate email and age.
import pandas as pd
import re
# Sample dataset
data = {
    'customer_id': [1, 2, 3],
    'email': ['john@example.com', 'invalid', 'jane@domain.com'],
    'age': [25, -5, 30]
}
df = pd.DataFrame(data)
# Validate email format
def is_valid_email(email):
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))
df['email_valid'] = df['email'].apply(is_valid_email)
print("Email validation:\n", df)
# Validate age
df['age_valid'] = df['age'].apply(lambda x: x > 0 and x < 120)
print("\nAge validation:\n", df)
# Handle invalid data: placeholder email for bad addresses, mean of the VALID ages for bad ages
df.loc[~df['email_valid'], 'email'] = 'invalid@example.com'
df['age'] = df['age'].astype(float)  # allow the fractional mean to be stored
df.loc[~df['age_valid'], 'age'] = df.loc[df['age_valid'], 'age'].mean()
print("\nAfter handling invalid data:\n", df)
Output:
Email validation:
   customer_id             email  age  email_valid
0            1  john@example.com   25         True
1            2           invalid   -5        False
2            3   jane@domain.com   30         True

Age validation:
   customer_id             email  age  email_valid  age_valid
0            1  john@example.com   25         True       True
1            2           invalid   -5        False      False
2            3   jane@domain.com   30         True       True

After handling invalid data:
   customer_id                email   age  email_valid  age_valid
0            1     john@example.com  25.0         True       True
1            2  invalid@example.com  27.5        False      False
2            3      jane@domain.com  30.0         True       True
SQL Example
-- Sample table
CREATE TABLE customers (
customer_id INT,
email VARCHAR(100),
age INT
);
INSERT INTO customers VALUES
(1, 'john@example.com', 25),
(2, 'invalid', -5),
(3, 'jane@domain.com', 30);
-- Add validation columns
ALTER TABLE customers
ADD email_valid BOOLEAN,
ADD age_valid BOOLEAN;
-- Validate email
UPDATE customers
SET email_valid = email ~ '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$';
-- Validate age
UPDATE customers
SET age_valid = age > 0 AND age < 120;
-- Handle invalid data (the average of the valid ages is rounded when stored in the INT column)
UPDATE customers
SET email = CASE WHEN email_valid = FALSE THEN 'invalid@example.com' ELSE email END,
    age = CASE WHEN age_valid = FALSE
               THEN (SELECT AVG(age) FROM customers WHERE age_valid = TRUE)
               ELSE age END;
SELECT * FROM customers;
Best Practices
Automate Validation: Use scripts to check data regularly.
Log Issues: Store validation failures for auditing.
Handle Exceptions: Use try-except blocks in Python or TRY_ functions in SQL.
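To illustrate the exception-handling point, a minimal sketch built around a hypothetical helper function:
def parse_age(value):
    """Return a validated integer age, or None if the value is unusable."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return None  # non-numeric input such as 'abc' or None
    return age if 0 < age < 120 else None  # enforce the business rule
for raw in ['25', '-5', 'abc', 47]:
    print(raw, '->', parse_age(raw))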
Pros, Cons, and Alternatives
Pros: Ensures data reliability, prevents downstream errors.
Cons: Time-consuming for large datasets.
Alternatives: Data quality tools like Great Expectations or Talend.
8. Conclusion
Recap of Key Concepts
Module 5 covered essential data cleaning and preprocessing techniques:
Handling missing values with imputation or removal.
Converting and normalizing data types for consistency.
Encoding categorical data using one-hot or label encoding.
Scaling features for machine learning compatibility.
Parsing and standardizing dates and timestamps.
Validating data to ensure quality and integrity.