
Friday, August 29, 2025

Python Pandas Data Wrangling & Transformation Tutorial: Module 2 for Beginners to Advanced

 


Table of Contents

  1. Introduction to Data Wrangling & Transformation

  2. Handling Missing Data (NaN, fillna, dropna)

    • Basics of Missing Data

    • Real-Life Scenarios

    • Methods: fillna, dropna

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  3. Handling Duplicates

    • Identifying and Removing Duplicates

    • Real-Life Use Cases

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  4. Data Type Conversion (astype)

    • Why Data Types Matter

    • Real-Life Applications

    • Using astype

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  5. String Operations and Regex with .str Accessor

    • String Manipulation in Pandas

    • Real-Life Scenarios

    • Using .str and Regex

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  6. Mapping, Replacing, and Applying Functions

    • Transforming Data with Functions

    • Real-Life Use Cases

    • Methods: map, replace, apply

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  7. Binning and Categorization of Data

    • Grouping Data into Bins

    • Real-Life Applications

    • Methods: cut, qcut

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  8. Conditional Filtering and Boolean Indexing

    • Filtering Data with Conditions

    • Real-Life Scenarios

    • Boolean Indexing Techniques

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  9. Working with Datetime Objects and Time-Based Indexing

    • Handling Dates and Times

    • Real-Life Use Cases

    • Datetime Operations

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  10. Resampling, Shifting, and Rolling Windows

    • Time Series Transformations

    • Real-Life Applications

    • Methods: resample, shift, rolling

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  11. Handling Missing Time Series Data

    • Missing Data in Time Series

    • Real-Life Scenarios

    • Interpolation and Other Methods

    • Pros, Cons, and Alternatives

    • Best Practices and Examples

  12. Conclusion and Next Steps


1. Introduction to Data Wrangling & Transformation

Data wrangling and transformation are critical steps in the data analysis pipeline. Data wrangling involves cleaning and preparing raw data for analysis, while data transformation focuses on reshaping, aggregating, or modifying data to extract meaningful insights. In real-world scenarios, datasets are often messy—containing missing values, duplicates, inconsistent types, or poorly formatted strings. Pandas provides a robust toolkit to tackle these challenges efficiently.

In this module, we’ll explore techniques like handling missing data, removing duplicates, converting data types, manipulating strings, and working with time series data. Each section includes real-life examples, such as analyzing e-commerce sales, customer feedback, or sensor data, to make the content relatable and engaging. We’ll also discuss pros, cons, alternatives, and best practices to ensure you adopt industry-standard approaches.


2. Handling Missing Data (NaN, fillna, dropna)

Basics of Missing Data

Missing data is a common issue in datasets, represented in Pandas as NaN (Not a Number) for numeric data or None for object types. Missing values can arise from data collection errors, incomplete records, or merging datasets. Handling them correctly is crucial to avoid biased analyses or errors in downstream processes.

Real-Life Scenarios

  1. E-commerce Sales Data: An online retailer’s dataset may have missing customer ratings due to unrated purchases.

  2. Healthcare Records: Patient records might lack blood pressure readings for some visits.

  3. IoT Sensor Data: Sensors may fail to record temperature at certain intervals, leading to gaps.

Methods: fillna, dropna

Pandas provides two primary methods for handling missing data:

  • fillna: Replaces missing values with a specified value or method (e.g., forward fill, mean).

  • dropna: Removes rows or columns containing missing values.

Example 1: Basic Missing Data Handling

Let’s analyze a dataset of customer orders from an e-commerce platform.

import pandas as pd
import numpy as np

# Sample dataset
data = {
    'order_id': [1, 2, 3, 4, 5],
    'customer_name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
    'order_amount': [100.50, np.nan, 200.75, 150.25, np.nan],
    'rating': [5, 4, np.nan, 3, 2]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Detecting missing values
print("\nMissing values:")
print(df.isna().sum())

# Dropping rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)

# Filling missing values with a specific value
df_filled = df.fillna({'customer_name': 'Unknown', 'order_amount': df['order_amount'].mean(), 'rating': 0})
print("\nDataFrame after filling missing values:")
print(df_filled)

Output:

Original DataFrame:
   order_id customer_name  order_amount  rating
0         1         Alice       100.50     5.0
1         2           Bob          NaN     4.0
2         3       Charlie       200.75     NaN
3         4          None       150.25     3.0
4         5           Eve          NaN     2.0

Missing values:
order_id         0
customer_name    1
order_amount     2
rating           1
dtype: int64

DataFrame after dropping rows with missing values:
   order_id customer_name  order_amount  rating
0         1         Alice       100.50     5.0

DataFrame after filling missing values:
   order_id customer_name  order_amount  rating
0         1         Alice       100.50     5.0
1         2           Bob       150.50     4.0
2         3       Charlie       200.75     0.0
3         4       Unknown       150.25     3.0
4         5           Eve       150.50     2.0

Example 2: Advanced Missing Data Handling with Interpolation

For a time series dataset, such as temperature readings from IoT sensors, interpolation can fill gaps more meaningfully.

# Sample time series dataset
dates = pd.date_range('2025-01-01', periods=5, freq='D')
data = {
    'temperature': [22.5, np.nan, 23.0, np.nan, 24.5],
    'humidity': [50, 52, np.nan, 55, 54]
}
df_ts = pd.DataFrame(data, index=dates)
print("Original Time Series DataFrame:")
print(df_ts)

# Linear interpolation for missing values
df_interpolated = df_ts.interpolate(method='linear')
print("\nDataFrame after linear interpolation:")
print(df_interpolated)

Output:

Original Time Series DataFrame:
            temperature  humidity
2025-01-01        22.5      50.0
2025-01-02         NaN      52.0
2025-01-03        23.0       NaN
2025-01-04         NaN      55.0
2025-01-05        24.5      54.0

DataFrame after linear interpolation:
            temperature  humidity
2025-01-01       22.50      50.0
2025-01-02       22.75      52.0
2025-01-03       23.00      53.5
2025-01-04       23.75      55.0
2025-01-05       24.50      54.0

Pros, Cons, and Alternatives

Pros of fillna:

  • Flexible: Supports multiple filling strategies (mean, median, forward fill, etc.).

  • Preserves data: Avoids losing rows, which is critical for small datasets.

Cons of fillna:

  • Risk of bias: Filling with mean or a constant may distort data distributions.

  • Context-specific: Requires domain knowledge to choose appropriate fill values.

Pros of dropna:

  • Simple: Removes problematic rows/columns quickly.

  • Safe: Avoids introducing artificial values.

Cons of dropna:

  • Data loss: Can significantly reduce dataset size, especially with widespread missing values.

  • Not suitable for time series: Dropping rows disrupts temporal continuity.

Alternatives:

  • Imputation Libraries: Use scikit-learn’s SimpleImputer or KNNImputer for advanced imputation (see the sketch after this list).

  • Manual Imputation: Custom logic based on domain knowledge (e.g., filling missing customer names based on order IDs).

  • External Data: Replace missing values using external sources (e.g., customer databases).
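
As a quick illustration of the scikit-learn alternative mentioned above, here is a minimal sketch (assuming scikit-learn is installed) that imputes the numeric columns from the order dataset with their median values:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'order_amount': [100.50, np.nan, 200.75, 150.25, np.nan],
    'rating': [5, 4, np.nan, 3, 2]
})

# Median imputation; other strategies include 'mean', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
df[['order_amount', 'rating']] = imputer.fit_transform(df[['order_amount', 'rating']])
print(df)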

Best Practices:

  • Understand the Data: Analyze why data is missing (random, systematic, or structural).

  • Choose Contextually: Use mean/median for numerical data, mode for categorical, or interpolation for time series.

  • Validate Impact: Compare model performance or summary statistics before and after handling missing data.

  • Document Choices: Clearly document imputation methods for reproducibility.

Standards:

  • Follow CRISP-DM (Cross-Industry Standard Process for Data Mining) for data preparation.

  • Adhere to FAIR principles (Findable, Accessible, Interoperable, Reusable) for data handling.


3. Handling Duplicates

Identifying and Removing Duplicates

Duplicates occur when identical rows or values appear in a dataset, often due to data entry errors or merging issues. Pandas provides duplicated() to identify duplicates and drop_duplicates() to remove them.

Real-Life Scenarios

  1. Customer Database: Duplicate customer records due to multiple sign-ups.

  2. Survey Data: Respondents submitting the same response multiple times.

  3. Transaction Logs: Repeated entries from system glitches.

Example: Removing Duplicates in Customer Data

# Sample customer dataset with duplicates
data = {
    'customer_id': [101, 102, 101, 103, 102],
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'purchase_amount': [50, 75, 50, 100, 75]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Identifying duplicates
print("\nDuplicated rows:")
print(df.duplicated().sum())

# Removing duplicates
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

# Keeping the last occurrence
df_last = df.drop_duplicates(keep='last')
print("\nDataFrame keeping last duplicate:")
print(df_last)

Output:

Original DataFrame:
   customer_id     name  purchase_amount
0          101    Alice              50
1          102      Bob              75
2          101    Alice              50
3          103  Charlie             100
4          102      Bob              75

Duplicated rows:
2

DataFrame after removing duplicates:
   customer_id     name  purchase_amount
0          101    Alice              50
1          102      Bob              75
3          103  Charlie             100

DataFrame keeping last duplicate:
   customer_id     name  purchase_amount
2          101    Alice              50
3          103  Charlie             100
4          102      Bob              75
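
In many real datasets, duplicates share a key column (here customer_id) even when other fields differ slightly, so checking the full row is not always enough. A minimal sketch of key-based deduplication with the subset parameter, rebuilt from the same customer data:

import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 102],
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'purchase_amount': [50, 75, 50, 100, 75]
})

# Deduplicate on the key column only, keeping the first row per customer_id
df_by_key = df.drop_duplicates(subset=['customer_id'], keep='first')
print(df_by_key)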

Pros, Cons, and Alternatives

Pros:

  • Easy to use: drop_duplicates() is straightforward and customizable.

  • Improves data quality: Removes redundant information, reducing bias in analysis.

Cons:

  • Risk of data loss: Incorrectly identifying duplicates can remove valid data.

  • Performance: Can be slow for large datasets with complex duplicate checks.

Alternatives:

  • Custom Deduplication: Use domain-specific logic (e.g., matching on partial strings).

  • Database-Level Deduplication: Use SQL DISTINCT for large datasets.

  • Fuzzy Matching: Libraries like fuzzywuzzy for near-duplicate detection.

Best Practices:

  • Check All Columns: Ensure duplicates are identified across relevant columns.

  • Validate Duplicates: Manually inspect a sample to confirm duplicates are errors.

  • Log Changes: Record which rows were removed for auditability.

Standards:

  • Align with ISO/IEC 25012 for data quality (accuracy, completeness).


4. Data Type Conversion (astype)

Why Data Types Matter

Correct data types ensure efficient storage, accurate computations, and compatibility with downstream processes. For example, treating numerical data as strings can cause errors in calculations.

Real-Life Applications

  1. Financial Analysis: Converting string prices to floats for calculations.

  2. Survey Analysis: Converting categorical responses to category type for memory efficiency.

  3. Log Analysis: Converting timestamps to datetime for time-based operations.

Example: Converting Data Types

# Sample dataset with mixed types
data = {
    'product_id': ['001', '002', '003'],
    'price': ['19.99', '29.99', '39.99'],
    'in_stock': ['True', 'False', 'True']
}
df = pd.DataFrame(data)
print("Original DataFrame with types:")
print(df.dtypes)

# Converting data types
df['product_id'] = df['product_id'].astype(int)
df['price'] = df['price'].astype(float)
# astype(bool) would treat every non-empty string (including 'False') as True,
# so map the literal strings to real booleans instead
df['in_stock'] = df['in_stock'].map({'True': True, 'False': False})
print("\nDataFrame after type conversion:")
print(df.dtypes)
print(df)

Output:

Original DataFrame with types:
product_id    object
price         object
in_stock      object
dtype: object

DataFrame after type conversion:
product_id     int64
price        float64
in_stock      bool
dtype: object
   product_id  price  in_stock
0           1  19.99      True
1           2  29.99     False
2           3  39.99      True

Pros, Cons, and Alternatives

Pros:

  • Improves performance: Correct types reduce memory usage (e.g., category vs. object).

  • Enables operations: Numeric types allow mathematical computations.

Cons:

  • Errors on invalid data: astype fails if data cannot be converted (e.g., strings with letters to int).

  • Loss of precision: Converting floats to integers may truncate decimals.

Alternatives:

  • pd.to_numeric: Converts to numeric types with error handling.

  • pd.to_datetime: For datetime conversions.

  • Custom Parsing: Use apply for complex type conversions.

Best Practices:

  • Inspect Data First: Use df.dtypes and df.info() to understand current types.

  • Handle Errors: Use pd.to_numeric(errors='coerce') for robust conversions.

  • Optimize Memory: Use category for categorical data to save memory.
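
The error-handling and memory tips above can be sketched in a few lines; the 'N/A' value below is an assumed example of dirty input, not data from this tutorial:

import pandas as pd

# Robust numeric conversion: invalid strings become NaN instead of raising
prices = pd.Series(['19.99', '29.99', 'N/A'])
clean_prices = pd.to_numeric(prices, errors='coerce')
print(clean_prices)

# Categorical conversion to cut memory for repetitive text columns
sizes = pd.Series(['S', 'M', 'M', 'L', 'S'])
print(sizes.memory_usage(deep=True), sizes.astype('category').memory_usage(deep=True))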

Standards:

  • Follow IEEE 754 for floating-point arithmetic.

  • Use ISO 8601 for datetime formats.


5. String Operations and Regex with .str Accessor

String Manipulation in Pandas

Pandas’ .str accessor allows vectorized string operations, such as extracting substrings, applying regex, or transforming text case.

Real-Life Scenarios

  1. Customer Feedback Analysis: Extracting keywords from reviews.

  2. Log Parsing: Extracting IP addresses or error codes from logs.

  3. E-commerce: Cleaning product names for consistency.

Example: String Operations and Regex

import re

# Sample dataset of customer feedback
data = {
    'review_id': [1, 2, 3],
    'feedback': ['Great product!', 'Not bad, but slow delivery', 'Amazing quality, highly recommend']
}
df = pd.DataFrame(data)

# Basic string operations
df['feedback_lower'] = df['feedback'].str.lower()
df['contains_delivery'] = df['feedback'].str.contains('delivery', case=False, na=False)
df['word_count'] = df['feedback'].str.split().str.len()

# Regex: extract the first word starting with 'h' (case-insensitive); expand=False returns a Series
df['high_words'] = df['feedback'].str.extract(r'\b(h\w+)\b', flags=re.IGNORECASE, expand=False)
print(df)

Output:

   review_id                           feedback                  feedback_lower  contains_delivery  word_count high_words
0         1                    Great product!                  great product!             False          2        NaN
1         2        Not bad, but slow delivery      not bad, but slow delivery              True          5        NaN
2         3  Amazing quality, highly recommend  amazing quality, highly recommend             False          4     highly
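
The log-parsing scenario mentioned earlier works the same way; the log format and the ERR-code pattern below are assumptions for illustration only:

import pandas as pd

logs = pd.DataFrame({'line': [
    '2025-01-01 10:00 ERR-404 page not found from 192.168.0.1',
    '2025-01-01 10:05 ERR-500 internal error from 10.0.0.7'
]})

# Pull out the error code and the IP address with capture groups
logs['error_code'] = logs['line'].str.extract(r'(ERR-\d+)', expand=False)
logs['ip_address'] = logs['line'].str.extract(r'(\d{1,3}(?:\.\d{1,3}){3})', expand=False)
print(logs[['error_code', 'ip_address']])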

Pros, Cons, and Alternatives

Pros:

  • Efficient: Vectorized operations are faster than loops.

  • Flexible: Supports complex regex patterns for pattern matching.

Cons:

  • Performance: Regex can be slow on large datasets.

  • Complexity: Requires familiarity with regex syntax.

Alternatives:

  • Python’s re Module: For complex regex not supported by .str.

  • Custom Functions: Use apply for bespoke string processing.

  • NLP Libraries: Use nltk or spacy for advanced text analysis.

Best Practices:

  • Use Vectorized Operations: Prefer .str over loops for performance.

  • Test Regex: Validate patterns on small datasets to avoid errors.

  • Handle NaN: Use na=False in .str.contains to handle missing values.

Standards:

  • Follow POSIX for regex syntax where applicable.

  • Use Unicode for text encoding to support multilingual data.


6. Mapping, Replacing, and Applying Functions

Transforming Data with Functions

Pandas allows data transformation using map, replace, and apply to apply custom logic or mappings to data.

Real-Life Use Cases

  1. Customer Segmentation: Mapping customer IDs to segments (e.g., VIP, Regular).

  2. Data Cleaning: Replacing invalid values with standard ones.

  3. Feature Engineering: Applying calculations to create new features.

Example: Mapping and Applying Functions

# Sample dataset
data = {
    'customer_id': [101, 102, 103],
    'status': ['active', 'inactive', 'active'],
    'spend': [500, 200, 1000]
}
df = pd.DataFrame(data)

# Mapping status to numeric values
status_map = {'active': 1, 'inactive': 0}
df['status_numeric'] = df['status'].map(status_map)

# Replacing specific values
df['spend'] = df['spend'].replace(200, 250)

# Applying a custom function
def discount(spend):
    return spend * 0.9 if spend > 500 else spend
df['discounted_spend'] = df['spend'].apply(discount)
print(df)

Output:

   customer_id   status  spend  status_numeric  discounted_spend
0          101   active    500             1             500.0
1          102 inactive    250             0             250.0
2          103   active   1000             1             900.0
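
As the best practices below suggest, simple element-wise rules like the discount above can usually be vectorized instead of using apply. A minimal sketch of the same rule with numpy.where:

import numpy as np
import pandas as pd

df = pd.DataFrame({'spend': [500, 250, 1000]})

# 10% discount when spend > 500, otherwise unchanged (vectorized, no Python loop)
df['discounted_spend'] = np.where(df['spend'] > 500, df['spend'] * 0.9, df['spend'])
print(df)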

Pros, Cons, and Alternatives

Pros:

  • Flexible: apply supports complex logic.

  • Intuitive: map and replace are straightforward for simple transformations.

Cons:

  • Performance: apply can be slow for large datasets.

  • Limited scope: map only works on Series, not DataFrames.

Alternatives:

  • Vectorized Operations: Use NumPy operations for faster computations.

  • Lambda Functions: For simple one-off transformations.

  • Custom Pipelines: Use scikit-learn pipelines for complex transformations.

Best Practices:

  • Prefer Vectorized Operations: Use map or NumPy over apply when possible.

  • Profile Performance: Test apply on large datasets to avoid bottlenecks.

  • Document Logic: Clearly comment custom functions for maintainability.

Standards:

  • Follow PEP 8 for Python code style.

  • Ensure transformations are idempotent (same input, same output).


7. Binning and Categorization of Data

Grouping Data into Bins

Binning involves grouping continuous data into discrete intervals (bins) or categories, useful for analysis and visualization.

Real-Life Applications

  1. Customer Analysis: Grouping customers by age or income brackets.

  2. Sales Analysis: Categorizing sales into low, medium, high.

  3. Sensor Data: Binning sensor readings into ranges for alerts.

Example: Binning Sales Data

# Sample dataset
data = {
    'customer_id': [101, 102, 103, 104],
    'sales': [150, 300, 600, 900]
}
df = pd.DataFrame(data)

# Binning sales into categories
bins = [0, 200, 500, 1000]
labels = ['Low', 'Medium', 'High']
df['sales_category'] = pd.cut(df['sales'], bins=bins, labels=labels)
print(df)

# Quantile-based binning
df['sales_quantile'] = pd.qcut(df['sales'], q=3, labels=['Low', 'Medium', 'High'])
print("\nDataFrame with quantile binning:")
print(df)

Output:

   customer_id  sales sales_category
0          101    150           Low
1          102    300        Medium
2          103    600          High
3          104    900          High

DataFrame with quantile binning:
   customer_id  sales sales_category sales_quantile
0          101    150           Low            Low
1          102    300        Medium            Low
2          103    600          High         Medium
3          104    900          High           High

Pros, Cons, and Alternatives

Pros:

  • Simplifies analysis: Reduces complexity of continuous data.

  • Enhances visualization: Bins are easier to plot (e.g., histograms).

Cons:

  • Loss of information: Binning discards precise values.

  • Arbitrary boundaries: Fixed bins may not suit all data distributions.

Alternatives:

  • Clustering: Use scikit-learn for data-driven binning.

  • Custom Binning: Define bins based on domain knowledge.

  • Discretization Libraries: Use KBinsDiscretizer from scikit-learn.

Best Practices:

  • Choose Meaningful Bins: Align bins with business or statistical logic.

  • Test Bin Sensitivity: Analyze how bin size affects results.

  • Use Quantile Binning: For skewed data, qcut ensures balanced bins.

Standards:

  • Follow ISO 80000-2 for mathematical notation in binning.

  • Ensure bins are mutually exclusive and collectively exhaustive.


8. Conditional Filtering and Boolean Indexing

Filtering Data with Conditions

Boolean indexing allows filtering rows based on conditions, using logical operations (&, |, ~).

Real-Life Scenarios

  1. Marketing Campaigns: Filter customers who spent above a threshold.

  2. Quality Control: Identify defective products based on metrics.

  3. Fraud Detection: Flag transactions with suspicious patterns.

Example: Filtering Customer Data

# Sample dataset
data = {
    'customer_id': [101, 102, 103, 104],
    'age': [25, 35, 45, 30],
    'spend': [100, 500, 200, 800]
}
df = pd.DataFrame(data)

# Filtering customers with spend > 300 and age < 40
filtered_df = df[(df['spend'] > 300) & (df['age'] < 40)]
print("Filtered DataFrame:")
print(filtered_df)

Output:

   customer_id  age  spend
1          102   35    500
3          104   30    800
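
The same filter can be written with df.query(), one of the alternatives listed below, which often reads more clearly for multi-condition filters:

import pandas as pd

df = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'age': [25, 35, 45, 30],
    'spend': [100, 500, 200, 800]
})

# Equivalent filter expressed as a query string
filtered_df = df.query('spend > 300 and age < 40')
print(filtered_df)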

Pros, Cons, and Alternatives

Pros:

  • Intuitive: Uses familiar logical operators.

  • Flexible: Supports complex conditions with multiple criteria.

Cons:

  • Readability: Complex conditions can become hard to read.

  • Performance: Large datasets may require optimized filtering.

Alternatives:

  • query Method: Use df.query() for readable filtering.

  • SQL: Use SQL for filtering in database-backed workflows.

  • NumPy Indexing: For performance-critical applications.

Best Practices:

  • Use Parentheses: Ensure correct precedence in complex conditions.

  • Optimize Filters: Apply filters early to reduce dataset size.

  • Validate Results: Check filtered data for correctness.

Standards:

  • Follow SQL-92 for logical condition syntax inspiration.

  • Ensure filters are deterministic for reproducibility.


9. Working with Datetime Objects and Time-Based Indexing

Handling Dates and Times

Pandas provides robust support for datetime operations, including parsing, formatting, and time-based indexing.

Real-Life Use Cases

  1. Sales Analysis: Analyzing sales trends over time.

  2. IoT Data: Processing timestamped sensor readings.

  3. Log Analysis: Extracting events within a time window.

Example: Datetime Operations

# Sample dataset with dates
data = {
    'order_date': ['2025-01-01', '2025-01-02', '2025-01-03'],
    'sales': [100, 200, 300]
}
df = pd.DataFrame(data)
df['order_date'] = pd.to_datetime(df['order_date'])

# Extracting components
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['day'] = df['order_date'].dt.day
print("DataFrame with extracted components:")
print(df)

# Time-based indexing
df.set_index('order_date', inplace=True)
print("\nSales for January 2025:")
print(df.loc['2025-01'])

Output:

DataFrame with extracted components:
   order_date  sales  year  month  day
0 2025-01-01    100  2025      1    1
1 2025-01-02    200  2025      1    2
2 2025-01-03    300  2025      1    3

Sales for January 2025:
            sales  year  month  day
order_date                         
2025-01-01    100  2025      1    1
2025-01-02    200  2025      1    2
2025-01-03    300  2025      1    3

Pros, Cons, and Alternatives

Pros:

  • Powerful: Supports a wide range of datetime operations.

  • Time-based indexing: Simplifies temporal filtering and grouping.

Cons:

  • Memory usage: Datetime objects can increase memory footprint.

  • Parsing errors: Incorrect formats can cause conversion issues.

Alternatives:

  • datetime Module: For low-level datetime operations.

  • Arrow Library: For faster datetime processing.

  • SQL: Use database-native datetime functions for large datasets.

Best Practices:

  • Standardize Formats: Use ISO 8601 (e.g., YYYY-MM-DD) for consistency.

  • Handle Time Zones: Use tz_localize or tz_convert for time zone-aware data.

  • Validate Parsing: Check for parsing errors with errors='coerce'.
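
A minimal sketch of the parsing and time-zone tips above, using assumed example timestamps and the Asia/Dhaka zone purely for illustration:

import pandas as pd

raw = pd.Series(['2025-01-01 09:30', 'not a date', '2025-01-02 14:00'])

# Invalid strings become NaT instead of raising an error
parsed = pd.to_datetime(raw, errors='coerce')

# Make the timestamps time zone-aware, then convert to a local zone
utc_times = parsed.dt.tz_localize('UTC')
local_times = utc_times.dt.tz_convert('Asia/Dhaka')
print(local_times)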

Standards:

  • Adhere to ISO 8601 for datetime formats.

  • Use UTC as the default time zone for consistency.


10. Resampling, Shifting, and Rolling Windows

Time Series Transformations

Pandas supports advanced time series operations like resampling (aggregating over time intervals), shifting (lagging/leading data), and rolling windows (moving calculations).

Real-Life Applications

  1. Financial Analysis: Calculating moving averages for stock prices.

  2. IoT Monitoring: Aggregating sensor data by hour or day.

  3. Sales Forecasting: Shifting sales data to create lagged features.

Example: Time Series Operations

# Sample time series data
dates = pd.date_range('2025-01-01', periods=5, freq='D')
data = {'sales': [100, 150, 200, 250, 300]}
df = pd.DataFrame(data, index=dates)

# Resampling to weekly mean
weekly_sales = df.resample('W').mean()
print("Weekly resampled sales:")
print(weekly_sales)

# Shifting sales by one day
df['lagged_sales'] = df['sales'].shift(1)
print("\nDataFrame with lagged sales:")
print(df)

# Rolling window (3-day moving average)
df['moving_avg'] = df['sales'].rolling(window=3).mean()
print("\nDataFrame with 3-day moving average:")
print(df)

Output:

Weekly resampled sales:
            sales
2025-01-05  200.0

DataFrame with lagged sales:
            sales  lagged_sales
2025-01-01    100          NaN
2025-01-02    150        100.0
2025-01-03    200        150.0
2025-01-04    250        200.0
2025-01-05    300        250.0

DataFrame with 3-day moving average:
            sales  lagged_sales  moving_avg
2025-01-01    100          NaN         NaN
2025-01-02    150        100.0         NaN
2025-01-03    200        150.0  150.000000
2025-01-04    250        200.0  200.000000
2025-01-05    300        250.0  250.000000

Pros, Cons, and Alternatives

Pros:

  • Versatile: Supports various aggregations (mean, sum, etc.).

  • Intuitive: Simplifies time series analysis.

Cons:

  • Performance: Rolling operations can be slow for large datasets.

  • Edge effects: Missing values at the start/end of rolling windows.

Alternatives:

  • NumPy: For faster rolling calculations.

  • Statsmodels: For advanced time series modeling.

  • SQL: Use window functions for database-backed time series.

Best Practices:

  • Choose Appropriate Windows: Align window size with analysis goals.

  • Handle Edges: Use min_periods in rolling to control missing values.

  • Validate Aggregations: Ensure resampling aligns with data frequency.
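
To illustrate the min_periods tip above, the sketch below recomputes the 3-day moving average so the first two windows return partial averages instead of NaN:

import pandas as pd

dates = pd.date_range('2025-01-01', periods=5, freq='D')
sales = pd.Series([100, 150, 200, 250, 300], index=dates)

# min_periods=1 allows incomplete windows at the start of the series
partial_avg = sales.rolling(window=3, min_periods=1).mean()
print(partial_avg)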

Standards:

  • Follow ISO 80000-3 for time-related calculations.

  • Ensure time series operations are time zone-aware.


11. Handling Missing Time Series Data

Missing Data in Time Series

Missing data in time series can disrupt analyses, especially when continuity is critical. Interpolation and forward/backward filling are common solutions.

Real-Life Scenarios

  1. Stock Prices: Missing prices due to market closures.

  2. Sensor Data: Gaps from sensor failures.

  3. Website Traffic: Missing data during server downtimes.

Example: Interpolating Missing Time Series Data

# Sample time series with missing data
dates = pd.date_range('2025-01-01', periods=5, freq='D')
data = {'traffic': [1000, np.nan, 1200, np.nan, 1500]}
df = pd.DataFrame(data, index=dates)

# Forward fill (fillna(method='ffill') is deprecated in recent pandas; use ffill())
df['ffill'] = df['traffic'].ffill()
# Backward fill
df['bfill'] = df['traffic'].bfill()
# Linear interpolation
df['interpolated'] = df['traffic'].interpolate(method='linear')
print(df)

Output:

            traffic   ffill   bfill  interpolated
2025-01-01   1000.0  1000.0  1000.0        1000.0
2025-01-02      NaN  1000.0  1200.0        1100.0
2025-01-03   1200.0  1200.0  1200.0        1200.0
2025-01-04      NaN  1200.0  1500.0        1350.0
2025-01-05   1500.0  1500.0  1500.0        1500.0
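
The best practices below recommend limiting interpolation to small interior gaps; interpolate supports this directly through its limit and limit_area parameters. A minimal sketch reusing the traffic series:

import numpy as np
import pandas as pd

dates = pd.date_range('2025-01-01', periods=5, freq='D')
traffic = pd.Series([1000, np.nan, 1200, np.nan, 1500], index=dates)

# Fill at most one consecutive NaN, and only gaps surrounded by valid values
limited = traffic.interpolate(method='linear', limit=1, limit_area='inside')
print(limited)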

Pros, Cons, and Alternatives

Pros:

  • Maintains continuity: Interpolation preserves trends in time series.

  • Flexible: Multiple methods (linear, polynomial, etc.) available.

Cons:

  • Bias risk: Interpolation assumes smooth trends, which may not always hold.

  • Limited applicability: Not suitable for large gaps or non-continuous data.

Alternatives:

  • Time Series Models: Use ARIMA or Prophet for predictive imputation.

  • External Data: Fill gaps using related datasets (e.g., weather data for sensors).

  • Manual Imputation: Domain-specific rules for filling gaps.

Best Practices:

  • Use Interpolation for Small Gaps: Linear or spline interpolation works well for short gaps.

  • Validate Trends: Check if imputed values align with data patterns.

  • Limit Extrapolation: Avoid interpolating beyond data boundaries.

Standards:

  • Follow ISO 8601 for time series consistency.

  • Ensure imputation methods are statistically valid.


12. Conclusion and Next Steps

This comprehensive tutorial has covered the essential techniques of data wrangling and transformation in Pandas, from handling missing data to advanced time series operations. By mastering these skills, you can clean, transform, and analyze real-world datasets with confidence. The examples provided—ranging from e-commerce sales to IoT sensor data—demonstrate how these techniques apply to practical scenarios.
