
Thursday, September 4, 2025

Python for Data Analysis: Pandas, NumPy, and Real-World Use Cases

 

Introduction to Data Analysis with Python

Data analysis is the backbone of modern decision-making, from business intelligence to scientific research. Python, with its powerful libraries Pandas and NumPy, is a go-to tool for data analysts due to its simplicity and versatility. This blog will guide you through mastering data analysis using Pandas and NumPy, covering everything from basic operations to advanced techniques. Whether you're analyzing sales data, cleaning messy datasets, or building predictive models, this tutorial is packed with real-world examples, best practices, pros and cons, and alternatives to ensure you become a proficient data analyst.

Why Pandas and NumPy?

  • Pandas: A high-level library for data manipulation and analysis, offering DataFrames for tabular data handling.

  • NumPy: A foundational library for numerical computations, providing efficient array operations.

Together, they form a powerful duo for handling large datasets, performing statistical analysis, and preparing data for machine learning.

Pros:

  • Pandas: Intuitive DataFrame structure, excellent for tabular data, supports data cleaning and merging.

  • NumPy: Fast array operations, optimized for numerical computations, integrates well with other libraries.

Cons:

  • Pandas: Memory-intensive for very large datasets, steep learning curve for advanced features.

  • NumPy: Limited to numerical data, lacks high-level data manipulation features.

Alternatives:

  • Polars: A faster alternative to Pandas for large datasets, written in Rust (see the sketch after this list).

  • Dask: Scales Pandas to big data.

  • SciPy: Extends NumPy for scientific computations.
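
If you want a feel for Polars before committing, here is a minimal sketch (assuming the polars package is installed; group_by is the current name of the method spelled groupby in older releases):

import polars as pl

# Group and aggregate with Polars expressions
pl_df = pl.DataFrame({
    "Category": ["Books", "Electronics", "Books"],
    "Amount": [50, 150, 175],
})
totals = pl_df.group_by("Category").agg(pl.col("Amount").sum())
print(totals)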

Best Practices:

  • Use vectorized operations instead of loops for performance (see the timing sketch after this list).

  • Regularly check for memory usage with large datasets.

  • Follow PEP 8 for clean, readable Python code.
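
To make the vectorization advice concrete, here is a minimal timing sketch (absolute numbers vary by machine, but the vectorized sum is typically orders of magnitude faster):

import time

import numpy as np

# One million random values
data = np.random.rand(1_000_000)

# Pure-Python loop
start = time.perf_counter()
total = 0.0
for x in data:
    total += x
loop_time = time.perf_counter() - start

# Vectorized NumPy sum
start = time.perf_counter()
total_np = np.sum(data)
numpy_time = time.perf_counter() - start

print(f"Loop: {loop_time:.4f}s, NumPy: {numpy_time:.4f}s")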

Module 1: Getting Started with NumPy

What is NumPy?

NumPy (Numerical Python) is the foundation for numerical computing in Python. It provides support for arrays, matrices, and a wide range of mathematical functions.

Installing NumPy

pip install numpy

Basic NumPy Operations

Let’s start with a real-world example: analyzing monthly sales data for a retail store.

Example 1: Calculating Total and Average Sales

Suppose you have monthly sales data for a store in 2024.

import numpy as np

# Sales data (in thousands)
sales = np.array([120, 150, 180, 200, 170, 190, 210, 230, 250, 220, 200, 180])

# Basic operations
total_sales = np.sum(sales)
average_sales = np.mean(sales)
max_sales = np.max(sales)
min_sales = np.min(sales)

print(f"Total Sales: ${total_sales}K")
print(f"Average Monthly Sales: ${average_sales:.2f}K")
print(f"Highest Sales: ${max_sales}K")
print(f"Lowest Sales: ${min_sales}K")

Output:

Total Sales: $2300K
Average Monthly Sales: $191.67K
Highest Sales: $250K
Lowest Sales: $120K

Best Practice: Use NumPy’s vectorized operations (e.g., np.sum) instead of Python loops for faster computation.

Example 2: Sales Growth Rate

Calculate the month-over-month growth rate.

# Calculate growth rate
growth_rate = np.diff(sales) / sales[:-1] * 100

print("Month-over-Month Growth Rate (%):", growth_rate)

Output:

Month-over-Month Growth Rate (%): [ 25.          20.          11.11111111 -15.          11.76470588
  10.52631579   9.52380952   8.69565217 -12.          -9.09090909
 -10.        ]

Pros: NumPy’s array operations are fast and memory-efficient. Cons: Limited to numerical data; for categorical or mixed data, you’ll need Pandas.

Advanced NumPy: Matrix Operations

For a more advanced scenario, let’s analyze a product inventory matrix.

Example 3: Inventory Management

Suppose you have a 3x3 matrix representing stock levels for three products across three stores.

# Inventory matrix: rows = products, columns = stores
inventory = np.array([[50, 30, 20], [10, 40, 60], [25, 15, 35]])

# Total stock per product
total_per_product = np.sum(inventory, axis=1)
print("Total Stock per Product:", total_per_product)

# Total stock per store
total_per_store = np.sum(inventory, axis=0)
print("Total Stock per Store:", total_per_store)

Output:

Total Stock per Product: [100 110  75]
Total Stock per Store: [85 85 115]

Best Practice: Use axis parameters to control aggregation direction in multi-dimensional arrays.

Module 2: Introduction to Pandas

What is Pandas?

Pandas is built on top of NumPy and provides a DataFrame object for handling tabular data, similar to spreadsheets or SQL tables.

Installing Pandas

pip install pandas

Basic Pandas Operations

Let’s analyze a dataset of customer purchases for a retail business.

Example 4: Loading and Exploring a Customer Dataset

Suppose you have customer purchase data; here it is built inline, but in practice you would load it from a CSV with pd.read_csv.

import pandas as pd

# Sample data
data = {
    "CustomerID": [1, 2, 3, 4, 5],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "PurchaseAmount": [100, 200, None, 150, 300],
    "PurchaseDate": ["2024-01-01", "2024-02-01", "2024-03-01", "2024-04-01", "2024-05-01"],
    "Category": ["Electronics", "Clothing", "Electronics", None, "Books"]
}
df = pd.DataFrame(data)

# Basic exploration
print("Dataset Overview:")
print(df.head())
print("\nSummary Statistics:")
print(df.describe())
print("\nMissing Values:")
print(df.isnull().sum())

Output:

Dataset Overview:
   CustomerID     Name  PurchaseAmount PurchaseDate     Category
0           1    Alice           100.0   2024-01-01  Electronics
1           2      Bob           200.0   2024-02-01     Clothing
2           3  Charlie             NaN   2024-03-01  Electronics
3           4    David           150.0   2024-04-01         None
4           5      Eve           300.0   2024-05-01        Books

Summary Statistics:
       CustomerID  PurchaseAmount
count    5.000000        4.000000
mean     3.000000      187.500000
std      1.581139       85.391256
min      1.000000      100.000000
25%      2.000000      137.500000
50%      3.000000      175.000000
75%      4.000000      225.000000
max      5.000000      300.000000

Missing Values:
CustomerID        0
Name              0
PurchaseAmount    1
PurchaseDate      0
Category          1
dtype: int64

Best Practice: Always inspect your dataset with head(), describe(), and isnull() to understand its structure and quality.

Module 3: Data Cleaning with Pandas

Data cleaning is critical for accurate analysis. Let’s clean the dataset from Example 4.

Example 5: Handling Missing Values

# Fill missing PurchaseAmount with the mean (column assignment avoids the
# deprecated inplace fillna on a column selection)
df["PurchaseAmount"] = df["PurchaseAmount"].fillna(df["PurchaseAmount"].mean())

# Fill missing Category with 'Unknown'
df["Category"] = df["Category"].fillna("Unknown")

print("Cleaned Dataset:")
print(df)

Output:

   CustomerID     Name  PurchaseAmount PurchaseDate     Category
0           1    Alice           100.0   2024-01-01  Electronics
1           2      Bob           200.0   2024-02-01     Clothing
2           3  Charlie           187.5   2024-03-01  Electronics
3           4    David           150.0   2024-04-01      Unknown
4           5      Eve           300.0   2024-05-01        Books

Best Practice: Choose imputation strategies (mean, median, mode) based on data distribution. Avoid dropping rows unless necessary.
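
For instance, had PurchaseAmount been skewed, the median (and the mode for categorical columns) would be safer; a minimal sketch of those alternatives:

# Median is robust to outliers and skew
df["PurchaseAmount"] = df["PurchaseAmount"].fillna(df["PurchaseAmount"].median())

# mode() returns a Series, so take its first value for Category
df["Category"] = df["Category"].fillna(df["Category"].mode()[0])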

Example 6: Removing Duplicates

Suppose the dataset has duplicate entries.

# Add a duplicate row for demonstration
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)

# Remove duplicates
df = df.drop_duplicates()
print("Dataset after removing duplicates:")
print(df)

Output:

   CustomerID     Name  PurchaseAmount PurchaseDate     Category
0           1    Alice           100.0   2024-01-01  Electronics
1           2      Bob           200.0   2024-02-01     Clothing
2           3  Charlie           187.5   2024-03-01  Electronics
3           4    David           150.0   2024-04-01      Unknown
4           5      Eve           300.0   2024-05-01        Books

Pros: Pandas makes data cleaning intuitive with methods like fillna and drop_duplicates. Cons: Large datasets can slow down operations; consider Dask for big data.

Module 4: Data Manipulation with Pandas

Filtering and Sorting

Let’s filter high-value customers and sort them by purchase amount.

Example 7: Filtering and Sorting

# Filter customers with PurchaseAmount > 150
high_value_customers = df[df["PurchaseAmount"] > 150]
print("High-Value Customers:")
print(high_value_customers)

# Sort by PurchaseAmount
sorted_df = df.sort_values(by="PurchaseAmount", ascending=False)
print("\nSorted by Purchase Amount:")
print(sorted_df)

Output:

High-Value Customers:
   CustomerID     Name  PurchaseAmount PurchaseDate     Category
1           2      Bob           200.0   2024-02-01     Clothing
2           3  Charlie           187.5   2024-03-01  Electronics
4           5      Eve           300.0   2024-05-01        Books

Sorted by Purchase Amount:
   CustomerID     Name  PurchaseAmount PurchaseDate     Category
4           5      Eve           300.0   2024-05-01        Books
1           2      Bob           200.0   2024-02-01     Clothing
2           3  Charlie           187.5   2024-03-01  Electronics
3           4    David           150.0   2024-04-01      Unknown
0           1    Alice           100.0   2024-01-01  Electronics

Grouping and Aggregation

Group customers by category and calculate total purchases.

Example 8: Grouping by Category

# Group by Category and sum PurchaseAmount
category_totals = df.groupby("Category")["PurchaseAmount"].sum()
print("Total Purchases by Category:")
print(category_totals)

Output:

Total Purchases by Category:
Category
Books          300.0
Clothing       200.0
Electronics    287.5
Unknown        150.0
Name: PurchaseAmount, dtype: float64

Best Practice: Use groupby for aggregation tasks to summarize data efficiently.
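
When you need several summaries at once, agg computes them in a single groupby pass; for example:

# Sum, mean, and count of purchases per category
summary = df.groupby("Category")["PurchaseAmount"].agg(["sum", "mean", "count"])
print(summary)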

Module 5: Advanced Analysis with Pandas and NumPy

Merging Datasets

Suppose you have another dataset with customer ratings.

Example 9: Merging Datasets

# Ratings dataset
ratings_data = {
    "CustomerID": [1, 2, 3, 6],
    "Rating": [5, 4, 3, 2]
}
ratings_df = pd.DataFrame(ratings_data)

# Merge datasets
merged_df = pd.merge(df, ratings_df, on="CustomerID", how="left")
print("Merged Dataset:")
print(merged_df)

Output:

   CustomerID     Name  PurchaseAmount PurchaseDate     Category  Rating
0           1    Alice           100.0   2024-01-01  Electronics     5.0
1           2      Bob           200.0   2024-02-01     Clothing     4.0
2           3  Charlie           187.5   2024-03-01  Electronics     3.0
3           4    David           150.0   2024-04-01      Unknown     NaN
4           5      Eve           300.0   2024-05-01        Books     NaN

Best Practice: Choose the appropriate merge type (inner, left, right, outer) based on your data needs.
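
As a quick contrast with the left merge above, an inner merge keeps only the keys present in both frames:

# Inner merge: only CustomerIDs found in both df and ratings_df survive
inner_df = pd.merge(df, ratings_df, on="CustomerID", how="inner")
print(inner_df)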

Advanced NumPy: Statistical Analysis

Analyze purchase amounts for outliers using z-scores.

Example 10: Outlier Detection

# Convert PurchaseAmount to NumPy array
amounts = df["PurchaseAmount"].values

# Calculate z-scores (np.std uses the population standard deviation)
z_scores = (amounts - np.mean(amounts)) / np.std(amounts)

# With only five rows, no |z| exceeds 2, so use a lower illustrative
# threshold here; pick a cutoff (commonly 2 or 3) that fits your data
outliers = df[np.abs(z_scores) > 1.5]
print("Outliers based on Z-score:")
print(outliers)

Output:

   CustomerID Name  PurchaseAmount PurchaseDate Category
4           5  Eve           300.0   2024-05-01    Books

Pros: NumPy’s statistical functions are ideal for detecting anomalies. Cons: Requires understanding of statistical concepts for meaningful interpretation.

Module 6: Real-World Case Study

Case Study: E-Commerce Sales Analysis

Let’s combine everything to analyze a larger e-commerce dataset.

Example 11: Comprehensive Analysis

# Simulated e-commerce dataset
ecommerce_data = {
    "OrderID": range(1, 11),
    "CustomerID": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    "OrderAmount": [50, 150, 200, None, 300, 75, 125, 175, 100, 250],
    "Category": ["Books", "Electronics", "Clothing", "Books", "Electronics", "Clothing", None, "Books", "Electronics", "Clothing"],
    "OrderDate": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-02-01", "2024-02-02", "2024-02-03", "2024-02-04", "2024-02-05"]
}
ecom_df = pd.DataFrame(ecommerce_data)

# Data cleaning (column assignment avoids deprecated inplace fillna)
ecom_df["OrderAmount"] = ecom_df["OrderAmount"].fillna(ecom_df["OrderAmount"].mean())
ecom_df["Category"] = ecom_df["Category"].fillna("Unknown")

# Analysis: Total sales by category
sales_by_category = ecom_df.groupby("Category")["OrderAmount"].sum()
print("Total Sales by Category:")
print(sales_by_category)

# Analysis: Average order amount per customer
avg_order_per_customer = ecom_df.groupby("CustomerID")["OrderAmount"].mean()
print("\nAverage Order Amount per Customer:")
print(avg_order_per_customer)

# NumPy: Detect outliers (same illustrative 1.5 threshold as Example 10)
amounts = ecom_df["OrderAmount"].values
z_scores = (amounts - np.mean(amounts)) / np.std(amounts)
print("\nOutliers (|z| > 1.5):")
print(ecom_df[np.abs(z_scores) > 1.5])

Output:

Total Sales by Category:
Category
Books          383.333333
Clothing       525.000000
Electronics    550.000000
Unknown        125.000000
Name: OrderAmount, dtype: float64

Average Order Amount per Customer:
CustomerID
1     62.500000
2    137.500000
3    187.500000
4    129.166667
5    275.000000
Name: OrderAmount, dtype: float64

Outliers (|z| > 1.5):
   OrderID  CustomerID  OrderAmount     Category   OrderDate
4        5           5        300.0  Electronics  2024-01-05

Best Practice: Combine Pandas and NumPy for comprehensive analysis, using Pandas for data manipulation and NumPy for numerical computations.

Conclusion

Mastering data analysis with Pandas and NumPy opens up a world of possibilities, from cleaning messy datasets to uncovering insights in e-commerce sales. By following the examples and best practices in this tutorial, you can handle real-world data challenges with confidence. Explore further with libraries like Polars or Dask for big data, and keep experimenting with new datasets to hone your skills.
