Md Mominul Islam | Software and Data Engineering | SQL Server, .NET, Power BI, Azure Blog

while(!(succeed=try()));


Friday, August 22, 2025

Master Data Analysis: Module 4 - Python Libraries for Data Analysis (NumPy, Pandas, Data Cleaning, and Transformation)

 


Table of Contents

  1. Introduction to Module 4: Python Libraries for Data Analysis

    • 1.1 Why NumPy and Pandas?

    • 1.2 Real-World Applications of Data Analysis

    • 1.3 What to Expect in This Module

  2. NumPy: The Foundation of Numerical Computing

    • 2.1 Understanding NumPy Arrays

    • 2.2 Creating Arrays: Methods and Best Practices

    • 2.3 Indexing and Slicing in NumPy

    • 2.4 Array Operations: Arithmetic, Logical, and Statistical

    • 2.5 Real-Life Example: Analyzing Sales Data with NumPy

    • 2.6 Exception Handling in NumPy

    • 2.7 Pros, Cons, and Alternatives to NumPy

  3. Pandas: Mastering DataFrames and Series

    • 3.1 Introduction to Pandas Series and DataFrames

    • 3.2 Indexing, Selection, and Filtering in Pandas

    • 3.3 Real-Life Example: Customer Data Analysis with Pandas

    • 3.4 Exception Handling in Pandas

    • 3.5 Pros, Cons, and Alternatives to Pandas

  4. Data Cleaning: Preparing Data for Analysis

    • 4.1 Handling Missing Data

    • 4.2 Removing Duplicates

    • 4.3 Detecting and Managing Outliers

    • 4.4 Real-Life Example: Cleaning E-Commerce Transaction Data

    • 4.5 Best Practices for Data Cleaning

    • 4.6 Exception Handling in Data Cleaning

  5. Data Transformation: Shaping Data for Insights

    • 5.1 Merging DataFrames

    • 5.2 Concatenating DataFrames

    • 5.3 Pivoting and Melting Data

    • 5.4 Real-Life Example: Transforming Retail Inventory Data

    • 5.5 Best Practices for Data Transformation

    • 5.6 Exception Handling in Data Transformation

  6. Latest Pandas 2.x Features

    • 6.1 Performance Improvements with Apache Arrow

    • 6.2 Enhanced Data Types and Nullable Dtypes

    • 6.3 New Functions and Methods

    • 6.4 Real-Life Example: Using Pandas 2.x for Financial Data Analysis

    • 6.5 Best Practices for Leveraging Pandas 2.x

  7. Conclusion and Next Steps

    • 7.1 Recap of Module 4

    • 7.2 How to Apply These Skills in Real Projects

    • 7.3 Resources for Further Learning


1. Introduction to Module 4: Python Libraries for Data Analysis

1.1 Why NumPy and Pandas?

NumPy and Pandas are the backbone of data analysis in Python, offering powerful tools for numerical computations and data manipulation. NumPy excels in handling multi-dimensional arrays and performing efficient mathematical operations, while Pandas provides flexible data structures like Series and DataFrames for structured data analysis. Together, they enable data analysts to clean, transform, and analyze data with ease, making them essential for any data science workflow.

1.2 Real-World Applications of Data Analysis

From predicting customer churn in e-commerce to optimizing supply chains in retail, Python’s data analysis libraries are used across industries. Real-world applications include:

  • Finance: Analyzing stock market trends and portfolio performance.

  • Retail: Managing inventory and forecasting demand.

  • Healthcare: Cleaning patient data for predictive modeling.

  • Marketing: Segmenting customers for targeted campaigns.

1.3 What to Expect in This Module

This module dives deep into NumPy and Pandas, covering array operations, DataFrame manipulation, data cleaning, and transformation techniques. We’ll explore real-life examples, provide detailed code snippets, and discuss best practices, exception handling, and the latest Pandas 2.x features to ensure you’re equipped for practical data analysis tasks.


2. NumPy: The Foundation of Numerical Computing

2.1 Understanding NumPy Arrays

NumPy (Numerical Python) is a library for working with multi-dimensional arrays and matrices. Its core object, the ndarray, is a fast, memory-efficient container for numerical data. Arrays are homogeneous, meaning all elements must be of the same data type, enabling optimized computations.
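To make the homogeneity point concrete, here is a minimal sketch showing how mixing an integer with floats silently upcasts the whole array, and how shape and dtype describe an ndarray:

import numpy as np

# Mixing an int with floats upcasts everything to float64 (one shared dtype)
arr = np.array([1, 2.5, 3.0])
print(arr.dtype)   # float64
print(arr.shape)   # (3,)

# A 2-D array: shape is (rows, columns)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print(matrix.shape, matrix.ndim)  # (2, 3) 2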

2.2 Creating Arrays: Methods and Best Practices

NumPy offers various methods to create arrays:

  • From Lists: Convert Python lists to arrays.

  • Zeros/Ones: Create arrays filled with zeros or ones.

  • Arange/Linspace: Generate sequences of numbers.

  • Random: Create arrays with random values.

Code Example: Creating Arrays

import numpy as np

# From a list
array_from_list = np.array([1, 2, 3, 4])

# Zeros array
zeros_array = np.zeros((3, 4))

# Arange
range_array = np.arange(0, 10, 2)

# Random array
random_array = np.random.rand(2, 3)

print("List Array:\n", array_from_list)
print("Zeros Array:\n", zeros_array)
print("Range Array:\n", range_array)
print("Random Array:\n", random_array)

Best Practices:

  • Specify dtype (e.g., int32, float64) to optimize memory.

  • Use np.array() for small datasets and np.zeros() or np.ones() for larger, initialized arrays.

  • Avoid nested Python lists for large datasets to prevent performance issues.

2.3 Indexing and Slicing in NumPy

NumPy arrays support advanced indexing and slicing:

  • Basic Indexing: Access elements using indices.

  • Slicing: Extract subarrays using slice notation.

  • Boolean Indexing: Filter arrays based on conditions.

  • Fancy Indexing: Use arrays of indices to select elements.

Code Example: Indexing and Slicing

array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Basic indexing
element = array[1, 2]  # Returns 6

# Slicing
subarray = array[0:2, 1:3]  # Returns [[2, 3], [5, 6]]

# Boolean indexing
mask = array > 5
filtered = array[mask]  # Returns [6, 7, 8, 9]

# Fancy indexing
rows = np.array([0, 2])
cols = np.array([1, 2])
selected = array[rows, cols]  # Returns [2, 9]

print("Element:", element)
print("Subarray:\n", subarray)
print("Filtered:\n", filtered)
print("Fancy Indexing:\n", selected)

Best Practices:

  • Use boolean indexing for conditional filtering to avoid loops.

  • Prefer slicing over manual iteration for performance.

  • Be cautious with fancy indexing, as it creates copies, not views.
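The copy-versus-view distinction in the last point is easy to verify. In this minimal sketch, writing through a slice changes the original array, while writing through a fancy-indexed result does not:

import numpy as np

array = np.array([10, 20, 30, 40])

# Slicing returns a view: modifying it changes the original
view = array[1:3]
view[0] = 99
print(array)   # [10 99 30 40]

# Fancy indexing returns a copy: the original stays unchanged
copied = array[[1, 2]]
copied[0] = -1
print(array)   # [10 99 30 40]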

2.4 Array Operations: Arithmetic, Logical, and Statistical

NumPy supports vectorized operations, eliminating the need for explicit loops:

  • Arithmetic: Add, subtract, multiply, divide arrays element-wise.

  • Logical: Perform element-wise comparisons.

  • Statistical: Compute mean, median, standard deviation, etc.

Code Example: Array Operations

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])

# Arithmetic
sum_array = array1 + array2  # [5, 7, 9]
product = array1 * array2    # [4, 10, 18]

# Logical
greater_than = array1 > array2  # [False, False, False]

# Statistical
mean = np.mean(array1)  # 2.0
std_dev = np.std(array2)  # 0.816

print("Sum:", sum_array)
print("Product:", product)
print("Greater Than:", greater_than)
print("Mean:", mean)
print("Standard Deviation:", std_dev)

Best Practices:

  • Use vectorized operations to leverage NumPy’s speed.

  • Avoid Python loops for large arrays to prevent performance bottlenecks.

  • Check array shapes before operations to avoid broadcasting errors.
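As a minimal sketch of the shape check in the last point: broadcasting succeeds when trailing dimensions match (or are 1) and raises a ValueError otherwise, so comparing shapes up front avoids the error:

import numpy as np

a = np.ones((3, 4))
b = np.ones((4,))   # trailing dimension matches: broadcastable
c = np.ones((3,))   # does not broadcast against (3, 4)

print((a + b).shape)  # (3, 4)

if a.shape[-1] == c.shape[-1]:
    print((a + c).shape)
else:
    print("Shapes", a.shape, "and", c.shape, "do not broadcast")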

2.5 Real-Life Example: Analyzing Sales Data with NumPy

Scenario: A retail company wants to analyze monthly sales data across multiple stores to identify trends and calculate performance metrics.

Dataset: A CSV file (sales_data.csv) containing monthly sales for 12 months across 5 stores.

Code Example: Sales Data Analysis

import numpy as np

# Load sales data
sales_data = np.genfromtxt('sales_data.csv', delimiter=',', skip_header=1)

# Calculate total sales per store
total_sales = np.sum(sales_data, axis=0)
print("Total Sales per Store:", total_sales)

# Calculate average monthly sales
avg_sales = np.mean(sales_data, axis=1)
print("Average Monthly Sales:", avg_sales)

# Identify stores with sales above threshold
threshold = 5000
high_performers = np.any(sales_data > threshold, axis=0)
print("High Performing Stores:", high_performers)

# Normalize sales data
normalized_sales = (sales_data - np.min(sales_data)) / (np.max(sales_data) - np.min(sales_data))
print("Normalized Sales:\n", normalized_sales)

Explanation:

  • Loading Data: np.genfromtxt reads the CSV file into a NumPy array.

  • Total Sales: np.sum calculates total sales per store (column-wise).

  • Average Sales: np.mean computes average sales per month (row-wise).

  • Threshold Check: np.any identifies stores exceeding the sales threshold.

  • Normalization: Scales data to a [0, 1] range for comparison.

Best Practices:

  • Use genfromtxt for CSV files with missing values, specifying filling_values if needed (see the sketch after this list).

  • Validate data types after loading to ensure numerical consistency.

  • Save intermediate results for debugging large datasets.
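A minimal sketch of the filling_values option mentioned above, reusing the assumed sales_data.csv layout; blank cells are replaced with 0.0 at load time instead of becoming NaN:

import numpy as np

sales_data = np.genfromtxt(
    'sales_data.csv',     # assumed file from the example above
    delimiter=',',
    skip_header=1,
    filling_values=0.0,   # replace missing cells while loading
)
print("Any NaNs remaining:", np.isnan(sales_data).any())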

2.6 Exception Handling in NumPy

NumPy operations can raise exceptions like ValueError (shape mismatch) or IndexError (out-of-bounds indexing).

Code Example: Exception Handling

try:
    array = np.array([1, 2, 3])
    invalid_array = np.array([1, 'two', 3], dtype=float)  # Raises ValueError
except ValueError as e:
    print("ValueError:", e)

try:
    array[10]  # Raises IndexError
except IndexError as e:
    print("IndexError:", e)

try:
    array1 = np.array([1, 2])
    array2 = np.array([1, 2, 3])
    result = array1 + array2  # Raises ValueError
except ValueError as e:
    print("Shape Mismatch:", e)

Best Practices:

  • Use specific exception types (ValueError, IndexError) for precise handling.

  • Validate array shapes before operations using array.shape.

  • Log errors for debugging in production environments.

2.7 Pros, Cons, and Alternatives to NumPy

Pros:

  • High performance for numerical computations.

  • Supports multi-dimensional arrays and vectorized operations.

  • Extensive mathematical functions.

Cons:

  • Limited support for non-numerical data.

  • Steep learning curve for advanced indexing.

  • Memory-intensive for very large datasets.

Alternatives:

  • SciPy: Extends NumPy with additional scientific functions.

  • JAX: Offers GPU/TPU acceleration for numerical tasks.

  • CuPy: NumPy-compatible library for GPU computing.


3. Pandas: Mastering DataFrames and Series

3.1 Introduction to Pandas Series and DataFrames

Pandas is a powerful library for data manipulation, built on NumPy. Its primary structures are:

  • Series: A one-dimensional labeled array.

  • DataFrame: A two-dimensional table with rows and columns, similar to a spreadsheet.

Code Example: Creating Series and DataFrames

import pandas as pd

# Create a Series
series = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Cathy'], 'Age': [25, 30, 35], 'Salary': [50000, 60000, 75000]}
df = pd.DataFrame(data)

print("Series:\n", series)
print("DataFrame:\n", df)

Best Practices:

  • Use meaningful index labels for Series and DataFrames.

  • Specify dtype to optimize memory usage.

  • Avoid creating DataFrames from nested lists for large datasets.

3.2 Indexing, Selection, and Filtering in Pandas

Pandas offers flexible ways to access and manipulate data:

  • Indexing: Use .loc (label-based) or .iloc (integer-based).

  • Selection: Select columns or rows using labels or conditions.

  • Filtering: Apply boolean conditions to filter data.

Code Example: Indexing and Filtering

# Select a column
names = df['Name']

# Select rows with .loc
row = df.loc[0]

# Select with .iloc
row_iloc = df.iloc[0]

# Filter rows
high_salary = df[df['Salary'] > 55000]

print("Names:\n", names)
print("Row (loc):\n", row)
print("Row (iloc):\n", row_iloc)
print("High Salary:\n", high_salary)

Best Practices:

  • Use .loc and .iloc to avoid ambiguity with implicit indexing.

  • Chain operations sparingly to prevent performance issues.

  • Use vectorized operations for filtering instead of loops.

3.3 Real-Life Example: Customer Data Analysis with Pandas

Scenario: A marketing team wants to analyze customer data to segment high-value customers based on purchase history.

Dataset: A CSV file (customer_data.csv) with customer ID, name, age, and total purchases.

Code Example: Customer Data Analysis

import pandas as pd

# Load customer data
df = pd.read_csv('customer_data.csv')

# Select relevant columns
customer_info = df[['CustomerID', 'Name', 'TotalPurchases']]

# Filter high-value customers
high_value = df[df['TotalPurchases'] > 1000]

# Group by age group
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 25, 35, 50, 100], labels=['Young', 'Adult', 'Middle-Aged', 'Senior'])
age_summary = df.groupby('AgeGroup')['TotalPurchases'].mean()

print("Customer Info:\n", customer_info.head())
print("High-Value Customers:\n", high_value.head())
print("Average Purchases by Age Group:\n", age_summary)

Explanation:

  • Loading Data: pd.read_csv loads the dataset into a DataFrame.

  • Selection: Selects specific columns for analysis.

  • Filtering: Identifies customers with purchases above $1000.

  • Grouping: Uses pd.cut to bin ages and groupby to calculate average purchases per age group.

Best Practices:

  • Use usecols in read_csv to load only necessary columns (see the sketch after this list).

  • Validate data types after loading with df.dtypes.

  • Save filtered DataFrames for intermediate analysis.
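A minimal sketch of the first two points, assuming the customer_data.csv layout used above; only the needed columns are loaded and the numeric column is checked before any arithmetic:

import pandas as pd

df = pd.read_csv(
    'customer_data.csv',   # assumed file from the example above
    usecols=['CustomerID', 'Name', 'Age', 'TotalPurchases'],
)

print(df.dtypes)
assert pd.api.types.is_numeric_dtype(df['TotalPurchases']), "TotalPurchases must be numeric"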

3.4 Exception Handling in Pandas

Pandas operations can raise exceptions like KeyError (invalid column) or TypeError (incompatible operations).

Code Example: Exception Handling

try:
    df['NonExistentColumn']  # Raises KeyError
except KeyError as e:
    print("KeyError:", e)

try:
    df['Age'] + df['Name']  # Raises TypeError
except TypeError as e:
    print("TypeError:", e)

try:
    df.iloc[10]  # Raises IndexError
except IndexError as e:
    print("IndexError:", e)

Best Practices:

  • Check column names with df.columns before accessing.

  • Validate data types before operations using df.dtypes.

  • Use try-except blocks for robust data pipelines.

3.5 Pros, Cons, and Alternatives to Pandas

Pros:

  • Intuitive syntax for data manipulation.

  • Handles heterogeneous data (numerical, categorical, text).

  • Integrates with other libraries like Matplotlib and Scikit-learn.

Cons:

  • Memory-intensive for very large datasets.

  • Performance slower than NumPy for numerical tasks.

  • Complex for beginners due to multiple methods for similar tasks.

Alternatives:

  • Dask: Scales Pandas to large datasets with parallel computing.

  • Polars: A faster, memory-efficient alternative to Pandas.

  • Vaex: Optimized for out-of-core data processing.


4. Data Cleaning: Preparing Data for Analysis

4.1 Handling Missing Data

Missing data is common in real-world datasets and can skew analysis. Pandas provides methods like isna(), fillna(), and dropna() to handle missing values.

Code Example: Handling Missing Data

import pandas as pd
import numpy as np

# Create a DataFrame with missing values
data = {'A': [1, np.nan, 3], 'B': [4, 5, np.nan], 'C': ['x', np.nan, 'z']}
df = pd.DataFrame(data)

# Detect missing values
missing = df.isna()

# Fill missing values
df_filled = df.fillna({'A': df['A'].mean(), 'B': 0, 'C': 'unknown'})

# Drop rows with missing values
df_dropped = df.dropna()

print("Missing Values:\n", missing)
print("Filled DataFrame:\n", df_filled)
print("Dropped DataFrame:\n", df_dropped)

Best Practices:

  • Use isna() or isnull() to check for missing values.

  • Choose imputation methods (mean, median, mode) based on data context.

  • Document missing value handling for reproducibility.

4.2 Removing Duplicates

Duplicate rows can distort analysis. Pandas’ duplicated() and drop_duplicates() methods identify and remove them.

Code Example: Removing Duplicates

# Create a DataFrame with duplicates
data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}
df = pd.DataFrame(data)

# Identify duplicates
duplicates = df.duplicated()

# Remove duplicates
df_unique = df.drop_duplicates()

print("Duplicates:\n", duplicates)
print("Unique DataFrame:\n", df_unique)

Best Practices:

  • Specify subset in drop_duplicates to check specific columns (see the sketch after this list).

  • Use keep='first' or keep='last' to control which duplicates to retain.

  • Verify duplicates with duplicated() before dropping.
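A minimal sketch of subset and keep together, on slightly varied data so the difference is visible; only the Name column decides what counts as a duplicate, and the last occurrence wins:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 26]})

# Deduplicate on Name only, keeping the most recent (last) row per name
latest = df.drop_duplicates(subset=['Name'], keep='last')
print(latest)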

4.3 Detecting and Managing Outliers

Outliers can skew statistical analysis. Common methods include z-score, IQR, or domain-specific thresholds.

Code Example: Outlier Detection

# Create a DataFrame with outliers
data = {'Sales': [100, 150, 1000, 200, 250]}
df = pd.DataFrame(data)

# Detect outliers using IQR
Q1 = df['Sales'].quantile(0.25)
Q3 = df['Sales'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Sales'] < (Q1 - 1.5 * IQR)) | (df['Sales'] > (Q3 + 1.5 * IQR))]

# Remove outliers
df_clean = df[~((df['Sales'] < (Q1 - 1.5 * IQR)) | (df['Sales'] > (Q3 + 1.5 * IQR)))]

print("Outliers:\n", outliers)
print("Cleaned DataFrame:\n", df_clean)

Best Practices:

  • Use IQR for robust outlier detection in skewed distributions.

  • Visualize data (e.g., boxplots) to confirm outliers.

  • Consider domain knowledge before removing outliers.

4.4 Real-Life Example: Cleaning E-Commerce Transaction Data

Scenario: An e-commerce company needs to clean transaction data to analyze customer behavior.

Dataset: A CSV file (transactions.csv) with columns for transaction ID, customer ID, amount, and date.

Code Example: Cleaning Transaction Data

import pandas as pd
import numpy as np

# Load transaction data
df = pd.read_csv('transactions.csv')

# Handle missing values (missing or invalid dates become NaT in the datetime conversion below)
df['Amount'] = df['Amount'].fillna(df['Amount'].median())

# Remove duplicates
df = df.drop_duplicates(subset=['TransactionID'])

# Detect outliers in Amount
Q1 = df['Amount'].quantile(0.25)
Q3 = df['Amount'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['Amount'] < (Q1 - 1.5 * IQR)) | (df['Amount'] > (Q3 + 1.5 * IQR)))]

# Convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

print("Cleaned DataFrame:\n", df.head())

Explanation:

  • Missing Values: Fills missing Amount with the median; missing or invalid dates become NaT when Date is converted to datetime.

  • Duplicates: Removes duplicate transactions based on TransactionID.

  • Outliers: Removes transactions with amounts outside the IQR range.

  • Data Types: Converts Date to datetime for analysis.

Best Practices:

  • Use median for imputation in skewed numerical data.

  • Validate date formats before conversion.

  • Save cleaned data to a new file to preserve the original.

4.5 Best Practices for Data Cleaning

  • Document Changes: Log all cleaning steps for reproducibility.

  • Validate Data: Check data types and ranges after cleaning.

  • Iterative Process: Clean data incrementally, validating at each step.

  • Automate Workflows: Use functions or pipelines for repetitive tasks.
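As a minimal sketch of the last point, the cleaning steps from this section can be wrapped into one reusable function (the column names are assumptions matching the transaction example above):

import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from this section in one pass."""
    df = df.copy()
    df = df.drop_duplicates(subset=['TransactionID'])           # assumed key column
    df['Amount'] = df['Amount'].fillna(df['Amount'].median())   # median imputation
    df['Date'] = pd.to_datetime(df['Date'], errors='coerce')    # invalid dates -> NaT

    # Drop IQR outliers on Amount
    q1, q3 = df['Amount'].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df['Amount'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Usage: cleaned = clean_transactions(pd.read_csv('transactions.csv'))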

4.6 Exception Handling in Data Cleaning

Code Example: Robust Cleaning

try:
    df = pd.read_csv('transactions.csv')
except FileNotFoundError as e:
    print("FileNotFoundError:", e)

try:
    df['Amount'] = df['Amount'].fillna(df['Amount'].median())
except KeyError as e:
    print("KeyError:", e)

try:
    df['InvalidColumn'] = df['InvalidColumn'].astype(float)  # Raises KeyError
except KeyError as e:
    print("KeyError:", e)

Best Practices:

  • Handle FileNotFoundError for missing files.

  • Check column existence before operations.

  • Use errors='coerce' in to_datetime for invalid formats.


5. Data Transformation: Shaping Data for Insights

5.1 Merging DataFrames

Merging combines DataFrames based on keys, similar to SQL joins:

  • Inner: Keep only matching rows.

  • Left/Right: Keep all rows from one DataFrame.

  • Outer: Keep all rows from both DataFrames.

Code Example: Merging DataFrames

# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Cathy']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 75000]})

# Inner merge
inner_merge = pd.merge(df1, df2, on='ID', how='inner')

# Left merge
left_merge = pd.merge(df1, df2, on='ID', how='left')

print("Inner Merge:\n", inner_merge)
print("Left Merge:\n", left_merge)

Best Practices:

  • Specify on or left_on/right_on for clarity.

  • Use how='inner' for strict matching unless otherwise needed.

  • Validate merge results with row counts.

5.2 Concatenating DataFrames

Concatenation stacks DataFrames vertically or horizontally.

Code Example: Concatenation

# Create two DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Vertical concatenation
vertical = pd.concat([df1, df2], axis=0)

# Horizontal concatenation
horizontal = pd.concat([df1, df2.reset_index(drop=True)], axis=1)

print("Vertical Concat:\n", vertical)
print("Horizontal Concat:\n", horizontal)

Best Practices:

  • Ensure matching columns for vertical concatenation.

  • Reset indices before horizontal concatenation.

  • Use ignore_index=True for vertical concatenation if indices are irrelevant.

5.3 Pivoting and Melting Data

  • Pivoting: Reshapes data from long to wide format.

  • Melting: Reshapes data from wide to long format.

Code Example: Pivoting and Melting

# Create a DataFrame
data = {'Date': ['2023-01', '2023-01', '2023-02', '2023-02'], 
        'Store': ['A', 'B', 'A', 'B'], 
        'Sales': [100, 150, 200, 250]}
df = pd.DataFrame(data)

# Pivot
pivot_df = df.pivot(index='Date', columns='Store', values='Sales')

# Melt
melted_df = pd.melt(df, id_vars=['Date'], value_vars=['Sales'], var_name='Metric', value_name='Value')

print("Pivot Table:\n", pivot_df)
print("Melted DataFrame:\n", melted_df)

Best Practices:

  • Use pivoting for summary tables and melting for detailed analysis.

  • Handle missing values post-pivot with fillna().

  • Ensure unique index-column pairs in pivoting to avoid errors.
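When index-column pairs are not unique, pivot() raises a ValueError; pivot_table() with an aggregation function handles the duplicates instead. A minimal sketch:

import pandas as pd

# Two rows share the same (Date, Store) pair
df = pd.DataFrame({'Date': ['2023-01', '2023-01', '2023-01', '2023-02'],
                   'Store': ['A', 'A', 'B', 'A'],
                   'Sales': [100, 40, 150, 200]})

# df.pivot(index='Date', columns='Store', values='Sales') would raise ValueError here
summed = df.pivot_table(index='Date', columns='Store', values='Sales', aggfunc='sum')
print(summed)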

5.4 Real-Life Example: Transforming Retail Inventory Data

Scenario: A retail chain needs to transform inventory data to analyze stock levels across stores and months.

Dataset: A CSV file (inventory.csv) with columns for date, store, product, and stock.

Code Example: Transforming Inventory Data

import pandas as pd

# Load inventory data
df = pd.read_csv('inventory.csv')

# Pivot to get total stock levels by store and date; pivot_table aggregates
# across products, so duplicate (Date, Store) pairs don't raise an error
pivot_df = df.pivot_table(index='Date', columns='Store', values='Stock', aggfunc='sum')

# Merge with product details
product_details = pd.DataFrame({'Product': ['P1', 'P2'], 'Category': ['Electronics', 'Clothing']})
merged_df = pd.merge(df, product_details, on='Product', how='left')

# Melt for detailed analysis
melted_df = pd.melt(pivot_df.reset_index(), id_vars=['Date'], value_vars=pivot_df.columns, 
                    var_name='Store', value_name='Stock')

print("Pivot Table:\n", pivot_df.head())
print("Merged DataFrame:\n", merged_df.head())
print("Melted DataFrame:\n", melted_df.head())

Explanation:

  • Pivoting: pivot_table aggregates stock across products into a wide-format table of totals by store and date.

  • Merging: Adds product category information.

  • Melting: Converts the pivoted table back to a long format for detailed analysis.

Best Practices:

  • Validate merge keys to avoid missing data.

  • Use reset_index() before melting pivoted DataFrames.

  • Save transformed data for downstream analysis.

5.5 Best Practices for Data Transformation

  • Plan Transformations: Map out the desired data structure before transforming.

  • Validate Results: Check row counts and data integrity after each operation (a minimal sketch follows this list).

  • Use Descriptive Names: Rename columns post-transformation for clarity.

  • Optimize Performance: Use merge over join for explicit key control.
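A minimal sketch of the row-count check from the second point: a left join against a frame with unique keys must preserve the left frame's row count, so an assertion catches accidental row loss or duplication:

import pandas as pd

orders = pd.DataFrame({'OrderID': [1, 2, 3], 'CustomerID': [10, 11, 10]})
customers = pd.DataFrame({'CustomerID': [10, 11], 'Name': ['Alice', 'Bob']})

before = len(orders)
merged = pd.merge(orders, customers, on='CustomerID', how='left')

assert len(merged) == before, "Row count changed after merge"
print(merged)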

5.6 Exception Handling in Data Transformation

Code Example: Robust Transformation

try:
    df = pd.merge(df1, df2, on='InvalidKey')  # Raises KeyError
except KeyError as e:
    print("KeyError:", e)

try:
    pivot_df = df.pivot(index='Date', columns='Store', values='Sales')  # Raises ValueError if duplicates exist
except ValueError as e:
    print("ValueError:", e)

try:
    pd.concat([df1, df2], axis=0, verify_integrity=True)  # Raises ValueError for duplicate indices
except ValueError as e:
    print("ValueError:", e)

Best Practices:

  • Check for duplicate keys before pivoting.

  • Use validate in merge to enforce one-to-one or one-to-many relationships (see the sketch after this list).

  • Handle index conflicts in concatenation with ignore_index=True.
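A minimal sketch of both guards from the list above: validate='one_to_one' makes merge fail fast (with a MergeError) if either side has duplicate keys, and ignore_index=True rebuilds a clean index so concatenation cannot trip over duplicate labels:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Cathy']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Salary': [50000, 60000, 75000]})

# Fails fast if either side repeats an ID
merged = pd.merge(df1, df2, on='ID', how='inner', validate='one_to_one')

# Rebuild a fresh RangeIndex instead of carrying over clashing indices
stacked = pd.concat([df1, df1], axis=0, ignore_index=True)

print(merged)
print(stacked)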


6. Latest Pandas 2.x Features

6.1 Performance Improvements with Apache Arrow

Pandas 2.x integrates Apache Arrow for faster data processing and reduced memory usage. Arrow-backed DataFrames improve performance for large datasets.

Code Example: Using Arrow Backend

import pandas as pd

# Enable Arrow backend
df = pd.read_csv('large_dataset.csv', dtype_backend='pyarrow')

# Perform operations
filtered = df[df['Value'] > 100]
grouped = filtered.groupby('Category').sum()

print("Arrow-Backed DataFrame:\n", filtered.head())

Best Practices:

  • Use dtype_backend='pyarrow' for large datasets.

  • Test performance gains on your specific use case.

  • Ensure compatibility with other libraries.

6.2 Enhanced Data Types and Nullable Dtypes

Pandas 2.x extends the nullable data types introduced in earlier releases (e.g., Int64, StringDtype) so missing values can be handled without converting columns to float64.

Code Example: Nullable Dtypes

# Create a DataFrame with nullable types
data = {'A': pd.Series([1, None, 3], dtype='Int64'), 'B': pd.Series(['x', 'y', None], dtype='string')}
df = pd.DataFrame(data)

print("Nullable DataFrame:\n", df)
print("Data Types:\n", df.dtypes)

Best Practices:

  • Use nullable dtypes (Int64, StringDtype) for better type safety (see the convert_dtypes sketch after this list).

  • Avoid object dtype for strings to reduce memory usage.

  • Check dtype compatibility with downstream tools.
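An existing DataFrame can be upgraded to nullable dtypes with convert_dtypes(); a minimal sketch (the pyarrow-backed variant assumes the pyarrow package is installed):

import pandas as pd

df = pd.DataFrame({'A': [1, None, 3], 'B': ['x', 'y', None]})
print(df.dtypes)         # float64 / object

# Infer nullable dtypes instead of float64 + object
converted = df.convert_dtypes()
print(converted.dtypes)  # Int64 / string

# Or back the columns with Arrow types (requires pyarrow)
arrow_backed = df.convert_dtypes(dtype_backend='pyarrow')
print(arrow_backed.dtypes)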

6.3 New Functions and Methods

Alongside the Arrow-backed dtypes, Pandas 2.x continues to refine methods such as explode(), merge_ordered(), and to_datetime().

Code Example: New Features

# Explode a list column
data = {'ID': [1, 2], 'Items': [[1, 2, 3], [4, 5]]}
df = pd.DataFrame(data)
exploded = df.explode('Items')

print("Exploded DataFrame:\n", exploded)

Best Practices:

  • Use explode() for list-like columns to simplify analysis.

  • Leverage merge_ordered() for time-series data (see the sketch after this list).

  • Test new methods on small datasets first.
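A minimal sketch of merge_ordered() on time-ordered data, forward-filling the gaps so every price row carries the last known signal:

import pandas as pd

prices = pd.DataFrame({'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
                       'price': [100, 101, 103]})
signals = pd.DataFrame({'date': ['2023-01-02'], 'signal': ['buy']})

combined = pd.merge_ordered(prices, signals, on='date', fill_method='ffill')
print(combined)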

6.4 Real-Life Example: Using Pandas 2.x for Financial Data Analysis

Scenario: A financial analyst needs to process stock price data to calculate daily returns and handle missing values.

Dataset: A CSV file (stock_prices.csv) with columns for date, ticker, price, and a list-valued exchanges column.

Code Example: Financial Data Analysis

import pandas as pd

# Load data with Arrow backend
df = pd.read_csv('stock_prices.csv', dtype_backend='pyarrow')

# Handle missing prices with nullable dtype
df['Price'] = df['Price'].astype('float64[pyarrow]')

# Calculate daily returns
df['Return'] = df.groupby('Ticker')['Price'].pct_change()

# Expand rows for list-like data (e.g., tickers listed on multiple exchanges)
df = df.explode('Exchanges')

print("Processed DataFrame:\n", df.head())

Explanation:

  • Arrow Backend: Improves performance for large datasets.

  • Nullable Dtypes: Ensures accurate handling of missing prices.

  • Explode: Expands list-like columns for analysis.

Best Practices:

  • Use Arrow for large financial datasets to reduce memory usage.

  • Validate returns calculation with domain knowledge.

  • Save results to a database for reporting.

6.5 Best Practices for Leveraging Pandas 2.x

  • Adopt Arrow Backend: For datasets larger than 1GB.

  • Use Nullable Dtypes: For numerical and categorical data.

  • Stay Updated: Monitor Pandas release notes for new features.

  • Test Compatibility: Ensure libraries support Pandas 2.x.


7. Conclusion and Next Steps

7.1 Recap of Module 4

This module covered NumPy’s array operations, Pandas’ DataFrame and Series manipulation, data cleaning techniques (missing values, duplicates, outliers), and data transformation methods (merge, concat, pivot, melt). We explored real-life examples, best practices, and the latest Pandas 2.x features to prepare you for practical data analysis.
