
Friday, August 29, 2025

The Ultimate Python Pandas Data Analysis Course: Module 1: From Beginner to Advanced with Real-Life Examples

 


Module 1: What is Pandas and Its Role in Data Analysis

Understanding Pandas

Pandas is an open-source Python library built on top of NumPy, designed for data manipulation and analysis. It provides flexible data structures like Series (1D) and DataFrame (2D) to handle tabular data, such as spreadsheets or SQL tables, with ease. Pandas is widely used in data science, machine learning, and business analytics for tasks like data cleaning, transformation, and visualization.

Real-Life Relevance

Imagine you’re a retail analyst at an e-commerce company. You receive daily sales data in CSV files with thousands of rows, including customer purchases, product categories, and prices. Pandas allows you to load, clean, and analyze this data to identify trends, such as which products sell best during holidays or which regions generate the most revenue.

Pros

  • Intuitive Data Structures: Series and DataFrames make working with tabular data straightforward.

  • Versatile: Supports multiple file formats (CSV, Excel, JSON, SQL, etc.).

  • Fast and Efficient: Built on NumPy, optimized for performance.

  • Extensive Functionality: Offers operations for filtering, grouping, merging, and more.

  • Community Support: Large community and extensive documentation.

Cons

  • Memory Usage: Can be memory-intensive for very large datasets.

  • Learning Curve: Advanced features require understanding of Python and NumPy.

  • Performance: Slower than specialized tools like SQL for certain database operations.

Alternatives

  • NumPy: For numerical computations but lacks Pandas’ high-level data manipulation.

  • Dask: For big data, handling datasets larger than memory.

  • Polars: A faster, Rust-based alternative for data manipulation.

  • SQL: For database queries but less flexible for complex transformations.

  • R (dplyr): For data manipulation in the R ecosystem.

Best Practices

  • Use Pandas for datasets that fit in memory (typically <10 GB).

  • Leverage vectorized operations instead of loops for performance (see the sketch after this list).

  • Keep Pandas updated to benefit from the latest features and optimizations.
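
Vectorization in action: a minimal sketch comparing an element-by-element Python loop with the equivalent vectorized expression. The order values are made up for illustration.

import pandas as pd

# Hypothetical order totals; any numeric Series behaves the same way
orders = pd.Series([120.0, 85.5, 240.0, 60.25])

# Slow: element-by-element Python loop
taxed_loop = pd.Series([value * 1.08 for value in orders])

# Fast: one vectorized expression executed in optimized native code
taxed_vectorized = orders * 1.08

print(taxed_vectorized.equals(taxed_loop))  # True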

Example: Why Pandas Matters

Let’s say you’re analyzing customer feedback data for a restaurant chain. The dataset includes ratings, comments, and locations. With Pandas, you can quickly load the data, filter low ratings, and summarize feedback by location—tasks that would be cumbersome in raw Python or Excel.

import pandas as pd

# Example: Loading a small customer feedback dataset
data = {
    'Customer': ['Alice', 'Bob', 'Charlie'],
    'Rating': [5, 3, 4],
    'Location': ['Downtown', 'Suburb', 'Downtown']
}
df = pd.DataFrame(data)
print(df)

Output:

   Customer  Rating  Location
0    Alice       5  Downtown
1      Bob       3   Suburb
2  Charlie       4  Downtown

This simple example shows how Pandas organizes data into a readable DataFrame, setting the stage for deeper analysis.


Module 2: Installing Pandas and Compatible Python Versions

Getting Started with Installation

To use Pandas, you need Python installed (preferably 3.9 or newer, which current Pandas 2.x releases require). Pandas can be installed via pip or conda, and it’s critical to ensure your Python environment supports it.

Real-Life Scenario

As a data analyst at a healthcare startup, you need to set up Pandas to analyze patient records. Ensuring the correct Python version and dependencies avoids compatibility issues when sharing your code with colleagues.

Installation Steps

  1. Install Python: Download from python.org (3.9 or higher recommended).

  2. Install Pandas:

    • Using pip: pip install pandas

    • Using conda: conda install pandas

  3. Install Optional Dependencies:

    • For Excel: pip install openpyxl xlrd

    • For visualization: pip install matplotlib seaborn

Pros

  • Easy Installation: Simple commands via pip or conda.

  • Cross-Platform: Works on Windows, macOS, and Linux.

  • Dependency Management: Integrates well with virtual environments.

Cons

  • Dependency Conflicts: Older Python versions or conflicting packages can cause issues.

  • Large Footprint: Installing with dependencies (e.g., NumPy, openpyxl) increases disk usage.

Alternatives

  • Conda Environments: Use conda for isolated environments to avoid conflicts.

  • Docker: Run Pandas in a containerized environment for consistency.

  • Cloud Platforms: Use Google Colab or JupyterHub for pre-installed Pandas.

Best Practices

  • Use virtual environments (venv or conda) to isolate projects.

  • Regularly update Pandas: pip install --upgrade pandas.

  • Test compatibility with Python version before deployment.

Example: Setting Up Pandas

Here’s how to set up a Python environment and verify Pandas installation.

# Create and activate a virtual environment
# On Windows: python -m venv myenv && myenv\Scripts\activate
# On macOS/Linux: python3 -m venv myenv && source myenv/bin/activate

# Install Pandas
pip install pandas

# Verify the installation from the shell
python -c "import pandas as pd; print(pd.__version__)"

Output (example):

2.2.3

Advanced Example: Setting Up with Conda

For a team project, you might use conda to manage dependencies.

conda create -n pandas_env python=3.9 pandas openpyxl matplotlib
conda activate pandas_env
python -c "import pandas as pd; print(pd.__version__)"

This ensures a clean environment for collaborative data analysis.


Module 3: Overview of Recent Pandas Features (1.5+ / 2.x)

What’s New in Pandas

Pandas has evolved significantly, with the late 1.x releases introducing performance improvements and 2.x offering major enhancements like the PyArrow backend for faster operations and better memory efficiency.

Real-Life Relevance

As a financial analyst, you might use Pandas 2.x’s PyArrow backend to process large transaction datasets faster, enabling real-time fraud detection.

Key Features in 1.5+

  • Nullable Integer Data Types: Int64, UInt32, etc., for handling missing values in integers.

  • Improved CSV Reading: Faster parsing with pyarrow engine.

  • Styler Enhancements: Better formatting for DataFrame visualization.

Pandas 2.x Highlights

  • PyArrow Backend: An optional Apache Arrow-based backend, as an alternative to NumPy, for faster computations.

  • Copy-on-Write (CoW): Reduces memory usage by avoiding unnecessary copies (sketched after this list).

  • Enhanced String Operations: Improved performance for string manipulations.
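
To illustrate Copy-on-Write, here is a minimal sketch with made-up prices. With the option enabled, a derived DataFrame only copies data at the moment it is modified, so the original stays untouched:

import pandas as pd

# Opt in to Copy-on-Write (the default behavior from Pandas 3.0 onward)
pd.options.mode.copy_on_write = True

df = pd.DataFrame({'Price': [10.0, 20.0, 30.0]})
subset = df[['Price']]                 # no eager copy is made here

subset['Price'] = subset['Price'] * 2  # the copy happens only now
print(df['Price'].tolist())            # [10.0, 20.0, 30.0] -- original unchanged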

Pros

  • Performance Gains: PyArrow and CoW make Pandas faster and more memory-efficient.

  • Modern Data Types: Nullable types improve data integrity.

  • Future-Proof: 2.x aligns with big data trends.

Cons

  • Breaking Changes: 2.x may require code updates.

  • Experimental: Some 2.x features are not fully stable.

  • Learning Curve: New backends require familiarity with Arrow.

Alternatives

  • Polars: Rust-based, faster for large datasets.

  • Dask: Scales Pandas operations for big data.

  • Vaex: Optimized for out-of-core data processing.

Best Practices

  • Test 2.x features in a sandbox environment before production.

  • Use pyarrow engine for CSV/Parquet files with large datasets.

  • Monitor Pandas release notes for breaking changes.

Example: Using Nullable Data Types

Handle missing values in a dataset of patient ages.

import pandas as pd

# Dataset with missing ages
data = {'Patient': ['John', 'Jane', 'Bob'], 'Age': [25, None, 30]}
df = pd.DataFrame(data)
df['Age'] = df['Age'].astype('Int64')  # nullable integer type keeps missing values as <NA>
print(df)

Output:

  Patient   Age
0    John    25
1    Jane  <NA>
2     Bob    30

Advanced Example: PyArrow Backend

Use the PyArrow backend to read a large CSV file efficiently.

import pandas as pd

# Read CSV with PyArrow engine
df = pd.read_csv('large_dataset.csv', engine='pyarrow')
print(df.head())

This leverages PyArrow for faster parsing, especially for datasets with millions of rows.
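
Building on this, Pandas 2.x can also keep the parsed columns in Arrow memory rather than NumPy via the dtype_backend parameter; a brief sketch using the same hypothetical file as above:

import pandas as pd

# Parse with PyArrow and keep Arrow-backed dtypes end to end
df = pd.read_csv('large_dataset.csv', engine='pyarrow', dtype_backend='pyarrow')
print(df.dtypes)  # columns appear as Arrow dtypes, e.g. int64[pyarrow]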


Module 4: Importing Pandas and Checking Version

Importing Pandas

Pandas is typically imported with the alias pd for brevity. Checking the version ensures compatibility with your code.

Real-Life Scenario

You’re a data scientist collaborating on a project. Verifying the Pandas version ensures your scripts run consistently across team members’ environments.

Pros

  • Standard Alias: pd is universally recognized.

  • Version Checking: Prevents compatibility issues.

Cons

  • Dependency Mismatch: Version differences can break code.

  • Manual Verification: Requires explicit checks.

Best Practices

  • Always import as pd: import pandas as pd.

  • Check version at the start of scripts: pd.__version__.

  • Document required Pandas version in project README.

Example: Basic Import and Version Check

import pandas as pd
print(f"Pandas Version: {pd.__version__}")

Output (example):

Pandas Version: 2.2.3

Advanced Example: Conditional Version Check

Ensure your script requires Pandas 2.0 or higher.

import sys

import pandas as pd
from packaging.version import Version  # pip install packaging if missing

required_version = '2.0.0'
# Compare parsed versions; plain string comparison misorders e.g. '10.0.0' < '2.0.0'
if Version(pd.__version__) < Version(required_version):
    sys.exit(f"Error: Pandas version {pd.__version__} found, {required_version} or higher required.")
print("Pandas version is compatible!")

This prevents runtime errors due to outdated features.


Module 5: Understanding Series and DataFrame Objects

Series vs. DataFrame

  • Series: A one-dimensional labeled array, like a column in a spreadsheet.

  • DataFrame: A two-dimensional table with rows and columns, like a spreadsheet.

Real-Life Scenario

As a marketing analyst, you use a Series to store customer ratings and a DataFrame to combine ratings with demographics for segmentation.

Pros

  • Series: Simple, lightweight, and ideal for single-column operations.

  • DataFrame: Versatile for complex, multi-column datasets.

  • Label-Based Indexing: Both support intuitive indexing.

Cons

  • Series Limitations: Lacks multi-column functionality.

  • DataFrame Complexity: Can be overwhelming for simple tasks.

Alternatives

  • NumPy Arrays: For numerical data without labels.

  • Lists/Dictionaries: For small, simple datasets.

  • SQL Tables: For database-driven analysis.

Best Practices

  • Use Series for single-column data; DataFrames for multi-column.

  • Define meaningful index labels for clarity.

  • Avoid mixing data types in a Series for performance.

Example: Creating a Series

Track daily temperatures for a weather app.

import pandas as pd

temperatures = pd.Series([72, 75, 68, 70], index=['Mon', 'Tue', 'Wed', 'Thu'])
print(temperatures)

Output:

Mon    72
Tue    75
Wed    68
Thu    70
dtype: int64

Example: Creating a DataFrame

Store employee data for HR analysis.

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 55000],
    'Department': ['HR', 'IT', 'Marketing']
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Salary Department
0    Alice   50000        HR
1      Bob   60000        IT
2  Charlie   55000 Marketing

Advanced Example: Series with Custom Index

Analyze stock prices with dates as the index.

import pandas as pd

prices = pd.Series([100.5, 102.3, 99.8], index=['2025-08-01', '2025-08-02', '2025-08-03'])
print(prices)
print(f"Average price: {prices.mean()}")

Output:

2025-08-01    100.5
2025-08-02    102.3
2025-08-03     99.8
dtype: float64
Average price: 100.86666666666666

Module 6: Creating Series from Lists, Arrays, Dictionaries

Creating Series

Series can be created from lists, NumPy arrays, or dictionaries, offering flexibility for different data sources.

Real-Life Scenario

A logistics company tracks delivery times. You create a Series from a list of delivery durations or a dictionary of driver performance metrics.

Pros

  • Flexible Inputs: Supports multiple data sources.

  • Automatic Indexing: Simplifies data creation.

  • Integration: Works seamlessly with NumPy and DataFrames.

Cons

  • Data Type Issues: Mixed types can lead to object dtype, reducing performance.

  • Index Duplicates: Non-unique indices can cause errors in some operations.

Best Practices

  • Specify dtype to optimize performance (e.g., float64, Int64).

  • Use meaningful index names for clarity.

  • Validate input data to avoid mixed types.
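
A quick sketch of the mixed-type pitfall from the list above (illustrative values):

import pandas as pd

# Mixed input types silently fall back to the slow, generic object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)  # object

# Homogeneous input keeps a fast, compact numeric dtype
clean = pd.Series([1.0, 2.0, 3.0])
print(clean.dtype)  # float64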

Example: Series from a List

Track sales quantities.

import pandas as pd

sales = pd.Series([150, 200, 180], index=['Product A', 'Product B', 'Product C'])
print(sales)

Output:

Product A    150
Product B    200
Product C    180
dtype: int64

Example: Series from a NumPy Array

Use NumPy for random data generation.

import pandas as pd
import numpy as np

data = np.random.randn(5)
series = pd.Series(data, index=['A', 'B', 'C', 'D', 'E'], dtype='float64')
print(series)

Output (example):

A    0.123456
B   -0.789012
C    1.234567
D   -0.345678
E    0.901234
dtype: float64

Example: Series from a Dictionary

Store employee hours by name.

hours = pd.Series({'Alice': 40, 'Bob': 35, 'Charlie': 42})
print(hours)

Output:

Alice      40
Bob        35
Charlie    42
dtype: int64

Advanced Example: Series with Nullable Types

Handle missing data in a survey dataset.

import pandas as pd

responses = pd.Series([10, None, 8, 9], index=['Q1', 'Q2', 'Q3', 'Q4'], dtype='Int64')
print(responses)

Output:

Q1     10
Q2    <NA>
Q3      8
Q4      9
dtype: Int64

Module 7: Creating DataFrames from Dictionaries, Lists, CSV, Excel, JSON

Creating DataFrames

DataFrames can be created from dictionaries, lists, or external files like CSV, Excel, or JSON, making Pandas versatile for real-world data.

Real-Life Scenario

A retail manager receives sales data in CSV, Excel, and JSON formats from different stores. Pandas unifies these into a single DataFrame for analysis.

Pros

  • Multiple Formats: Supports CSV, Excel, JSON, SQL, and more.

  • Flexible Construction: Easily create from Python data structures.

  • Scalable: Handles large datasets with proper configuration.

Cons

  • File Format Dependencies: Requires additional libraries (e.g., openpyxl for Excel).

  • Memory Usage: Large files can consume significant memory.

  • Parsing Errors: Malformed files can cause issues.

Best Practices

  • Use engine='pyarrow' for faster CSV/Parquet reading.

  • Specify dtype to optimize memory usage.

  • Handle encoding issues (e.g., encoding='utf-8') for text files.

Example: DataFrame from a Dictionary

Create a DataFrame for student grades.

import pandas as pd

data = {
    'Student': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 90, 78],
    'Science': [92, 88, 95]
}
df = pd.DataFrame(data)
print(df)

Output:

   Student  Math  Science
0    Alice    85       92
1      Bob    90       88
2  Charlie    78       95

Example: DataFrame from a List of Lists

Track inventory levels.

data = [['Laptop', 50, 999.99], ['Phone', 100, 499.99], ['Tablet', 30, 299.99]]
df = pd.DataFrame(data, columns=['Product', 'Quantity', 'Price'])
print(df)

Output:

   Product  Quantity   Price
0  Laptop        50  999.99
1   Phone       100  499.99
2  Tablet        30  299.99

Example: DataFrame from CSV

Load sales data from a CSV file.

# Assume sales.csv contains: Date,Product,Units,Revenue
df = pd.read_csv('sales.csv', engine='pyarrow')
print(df.head())

Output (example):

         Date Product  Units  Revenue
0  2025-08-01  Laptop     10   9999.9
1  2025-08-01   Phone     20   9999.8
2  2025-08-02  Tablet     15   4499.85

Example: DataFrame from Excel

Analyze employee data from an Excel file.

df = pd.read_excel('employees.xlsx', engine='openpyxl')
print(df.head())

Example: DataFrame from JSON

Load customer data from a JSON file.

df = pd.read_json('customers.json')
print(df.head())

Advanced Example: Large CSV with Dtype Specification

Optimize memory for a large dataset.

dtypes = {'CustomerID': 'int32', 'PurchaseAmount': 'float32', 'Category': 'category'}
df = pd.read_csv('large_sales.csv', dtype=dtypes, engine='pyarrow')
print(df.info())

This reduces memory usage by specifying efficient data types.


Module 8: Indexing, Slicing, and Subsetting Data (Series & DataFrames)

Indexing and Slicing

Pandas supports label-based (loc) and position-based (iloc) indexing, along with boolean indexing for subsetting data.

Real-Life Scenario

A logistics analyst needs to filter delivery data for late shipments or specific regions. Indexing and slicing make this efficient.

Pros

  • Flexible Indexing: Supports labels, positions, and conditions.

  • Intuitive: Similar to Python slicing but with added functionality.

  • Powerful Filtering: Boolean indexing simplifies complex queries.

Cons

  • Index Confusion: Mixing loc and iloc can lead to errors.

  • Performance: Large datasets require careful indexing for speed.

Best Practices

  • Use loc for label-based indexing, iloc for position-based.

  • Avoid chained indexing (e.g., df['col'][0]) to prevent SettingWithCopyWarning (see the example after this list).

  • Use boolean indexing for conditional filtering.
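
Here is a minimal sketch of the chained-indexing pitfall and the loc-based fix (illustrative data):

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# Risky: df['Salary'][0] = 52000 chains two lookups and may write to a
# temporary copy, triggering SettingWithCopyWarning.

# Safe: one loc call addresses the row and column together
df.loc[0, 'Salary'] = 52000
print(df)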

Example: Series Indexing

Access temperatures for specific days.

import pandas as pd

temps = pd.Series([72, 75, 68, 70], index=['Mon', 'Tue', 'Wed', 'Thu'])
print(temps['Mon'])        # Label-based
print(temps.iloc[0])       # Position-based (temps[0] is deprecated for label indexes)
print(temps['Mon':'Wed'])  # Label slicing includes both endpoints

Output:

72
72
Mon    72
Tue    75
Wed    68
dtype: int64

Example: DataFrame Indexing with loc

Filter employees by department.

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 55000],
    'Department': ['HR', 'IT', 'Marketing']
}
df = pd.DataFrame(data)
print(df.loc[df['Department'] == 'IT'])

Output:

   Name  Salary Department
1   Bob   60000        IT

Example: DataFrame Slicing with iloc

Select the first two rows and columns.

print(df.iloc[0:2, 0:2])

Output:

    Name  Salary
0  Alice   50000
1    Bob   60000

Advanced Example: Boolean Indexing

Find high-salary employees.

high_salary = df[df['Salary'] > 55000]
print(high_salary)

Output:

   Name  Salary Department
1   Bob   60000        IT

Module 9: Accessing Rows and Columns (loc, iloc)

loc and iloc in Depth

  • loc: Access by labels (row/column names).

  • iloc: Access by integer positions.

Real-Life Scenario

A data analyst at a bank needs to extract specific customer transactions by account ID (labels) or row numbers for auditing.

Pros

  • loc: Intuitive for labeled data.

  • iloc: Ideal for programmatic access.

  • Flexible: Supports slicing, lists, and boolean arrays.

Cons

  • Performance: loc can be slower on large datasets.

  • Errors: Invalid labels or indices cause exceptions.

Best Practices

  • Use loc for readable, label-based queries.

  • Use iloc for loops or position-based access.

  • Validate indices before accessing to avoid errors.
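
A small sketch of defensive index validation (names and values are illustrative):

import pandas as pd

df = pd.DataFrame({'Salary': [50000, 60000]}, index=['Alice', 'Bob'])

# Guard label-based access to avoid a KeyError
name = 'Charlie'
print(df.loc[name] if name in df.index else f"{name} not found")

# Guard position-based access to avoid an IndexError
row = 5
if row < len(df):
    print(df.iloc[row])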

Example: loc for Specific Rows

Access employee data by name index.

df = pd.DataFrame({
    'Salary': [50000, 60000, 55000],
    'Department': ['HR', 'IT', 'Marketing']
}, index=['Alice', 'Bob', 'Charlie'])
print(df.loc['Alice'])

Output:

Salary       50000
Department      HR
Name: Alice, dtype: object

Example: iloc for Specific Columns

Select the Salary column for the first two rows.

print(df.iloc[0:2, 0])

Output:

Alice    50000
Bob      60000
Name: Salary, dtype: int64

Advanced Example: Combining loc and iloc

Filter high-salary IT employees.

it_employees = df.loc[df['Department'] == 'IT', ['Salary']]
print(it_employees)

Output:

     Salary
Bob   60000

Module 10: Adding, Renaming, and Deleting Columns

Modifying DataFrames

Pandas makes it easy to add, rename, or delete columns to adapt datasets for analysis.

Real-Life Scenario

A retail analyst adds a column for profit margins, renames ambiguous column names, and removes irrelevant columns from a sales dataset.

Pros

  • Simple Syntax: Intuitive methods like assign, rename, drop.

  • Flexible: Supports dynamic column creation.

  • In-Place or Copy: Options to modify DataFrame or create a new one.

Cons

  • Memory Overhead: Adding columns increases memory usage.

  • Errors: Dropping non-existent columns raises errors unless handled.

Best Practices

  • Use assign for a functional style that avoids modifying the original (see the sketch after this list).

  • Use meaningful column names for clarity.

  • Check for column existence before dropping.
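
A minimal sketch of the assign-based style from the first bullet (illustrative numbers):

import pandas as pd

df = pd.DataFrame({'Product': ['Laptop', 'Phone'], 'Revenue': [9999.9, 4999.8]})

# assign returns a new DataFrame; the original is left untouched
df_with_margin = df.assign(ProfitMargin=lambda d: d['Revenue'] * 0.2)
print(df_with_margin)
print('ProfitMargin' in df.columns)  # False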

Example: Adding a Column

Calculate profit margins.

import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet'],
    'Revenue': [9999.9, 4999.8, 4499.85]
})
df['ProfitMargin'] = df['Revenue'] * 0.2
print(df)

Output:

   Product  Revenue  ProfitMargin
0  Laptop  9999.90     1999.980
1   Phone  4999.80      999.960
2  Tablet  4499.85      899.970

Example: Renaming Columns

Make column names more descriptive.

df = df.rename(columns={'Revenue': 'SalesRevenue', 'ProfitMargin': 'Profit'})
print(df)

Output:

   Product  SalesRevenue    Profit
0  Laptop      9999.90  1999.980
1   Phone      4999.80   999.960
2  Tablet      4499.85   899.970

Example: Deleting Columns

Remove the Profit column.

df = df.drop(columns='Profit')
print(df)

Output:

   Product  SalesRevenue
0  Laptop      9999.90
1   Phone      4999.80
2  Tablet      4499.85

Advanced Example: Dynamic Column Creation

Add a category column based on revenue thresholds.

df['Category'] = df['SalesRevenue'].apply(lambda x: 'High' if x > 5000 else 'Low')
print(df)

Output:

   Product  SalesRevenue Category
0  Laptop      9999.90    High
1   Phone      4999.80     Low
2  Tablet      4499.85     Low

Module 11: Sorting and Ranking Data

Sorting and Ranking

Pandas provides sort_values, sort_index, and rank for organizing and ranking data.

Real-Life Scenario

A sales manager sorts a dataset by revenue to identify top-performing products or ranks employees by performance metrics.

Pros

  • Flexible Sorting: Sort by multiple columns, ascending/descending.

  • Ranking: Assign ranks to data points for comparison.

  • In-Place or Copy: Control whether to modify the original DataFrame.

Cons

  • Performance: Sorting large datasets can be slow.

  • Ties in Ranking: Requires handling for equal values.

Best Practices

  • Use sort_values for column-based sorting, sort_index for index-based.

  • Specify na_position to handle missing values in sorting.

  • Use rank(method='dense') for consistent ranking with ties.
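
A short sketch of how tie handling differs between the default and dense ranking (made-up scores):

import pandas as pd

scores = pd.Series([90, 85, 90, 70], index=['Ann', 'Ben', 'Cal', 'Dee'])

# Default 'average' method: the two 90s share rank 1.5
print(scores.rank(ascending=False))

# 'dense' method: ties share rank 1 and the next value gets rank 2
print(scores.rank(ascending=False, method='dense'))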

Example: Sorting by Column

Sort products by revenue.

import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet'],
    'Revenue': [9999.9, 4999.8, 4499.85]
})
df_sorted = df.sort_values('Revenue', ascending=False)
print(df_sorted)

Output:

   Product  Revenue
0  Laptop  9999.90
1   Phone  4999.80
2  Tablet  4499.85

Example: Sorting by Multiple Columns

Sort by revenue and product name.

df = df.sort_values(['Revenue', 'Product'], ascending=[False, True])
print(df)

Example: Ranking Data

Rank products by revenue.

df['Rank'] = df['Revenue'].rank(ascending=False)
print(df)

Output:

   Product  Revenue  Rank
0  Laptop  9999.90   1.0
1   Phone  4999.80   2.0
2  Tablet  4499.85   3.0

Advanced Example: Sorting with Missing Values

Handle missing data in sorting.

df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet'],
    'Revenue': [9999.9, None, 4499.85]
})
df_sorted = df.sort_values('Revenue', na_position='first')
print(df_sorted)

Output:

   Product  Revenue
1   Phone      NaN
2  Tablet  4499.85
0  Laptop  9999.90

Conclusion

This blog post has covered the foundational topics of Pandas for data analysis, from understanding its role to mastering Series, DataFrames, indexing, and data manipulation. Each section included real-life examples, pros, cons, alternatives, best practices, and multiple code snippets to ensure clarity for beginners and advanced users alike. By applying these concepts to real-world scenarios like retail, finance, and healthcare, you’re now equipped to handle diverse data challenges.
