Table of Contents
Introduction to Data Wrangling & Transformation
Handling Missing Data (NaN, fillna, dropna)
Basics of Missing Data
Real-Life Scenarios
Methods: fillna, dropna
Pros, Cons, and Alternatives
Best Practices and Examples
Handling Duplicates
Identifying and Removing Duplicates
Real-Life Use Cases
Pros, Cons, and Alternatives
Best Practices and Examples
Data Type Conversion (astype)
Why Data Types Matter
Real-Life Applications
Using astype
Pros, Cons, and Alternatives
Best Practices and Examples
String Operations and Regex with .str Accessor
String Manipulation in Pandas
Real-Life Scenarios
Using .str and Regex
Pros, Cons, and Alternatives
Best Practices and Examples
Mapping, Replacing, and Applying Functions
Transforming Data with Functions
Real-Life Use Cases
Methods: map, replace, apply
Pros, Cons, and Alternatives
Best Practices and Examples
Binning and Categorization of Data
Grouping Data into Bins
Real-Life Applications
Methods: cut, qcut
Pros, Cons, and Alternatives
Best Practices and Examples
Conditional Filtering and Boolean Indexing
Filtering Data with Conditions
Real-Life Scenarios
Boolean Indexing Techniques
Pros, Cons, and Alternatives
Best Practices and Examples
Working with Datetime Objects and Time-Based Indexing
Handling Dates and Times
Real-Life Use Cases
Datetime Operations
Pros, Cons, and Alternatives
Best Practices and Examples
Resampling, Shifting, and Rolling Windows
Time Series Transformations
Real-Life Applications
Methods: resample, shift, rolling
Pros, Cons, and Alternatives
Best Practices and Examples
Handling Missing Time Series Data
Missing Data in Time Series
Real-Life Scenarios
Interpolation and Other Methods
Pros, Cons, and Alternatives
Best Practices and Examples
Conclusion and Next Steps
1. Introduction to Data Wrangling & Transformation
Data wrangling and transformation are critical steps in the data analysis pipeline. Data wrangling involves cleaning and preparing raw data for analysis, while data transformation focuses on reshaping, aggregating, or modifying data to extract meaningful insights. In real-world scenarios, datasets are often messy—containing missing values, duplicates, inconsistent types, or poorly formatted strings. Pandas provides a robust toolkit to tackle these challenges efficiently.
In this module, we’ll explore techniques like handling missing data, removing duplicates, converting data types, manipulating strings, and working with time series data. Each section includes real-life examples, such as analyzing e-commerce sales, customer feedback, or sensor data, to make the content relatable and engaging. We’ll also discuss pros, cons, alternatives, and best practices to ensure you adopt industry-standard approaches.
2. Handling Missing Data (NaN, fillna, dropna)
Basics of Missing Data
Missing data is a common issue in datasets, represented in Pandas as NaN (Not a Number) for numeric data or None for object types. Missing values can arise from data collection errors, incomplete records, or merging datasets. Handling them correctly is crucial to avoid biased analyses or errors in downstream processes.
Real-Life Scenarios
E-commerce Sales Data: An online retailer’s dataset may have missing customer ratings due to unrated purchases.
Healthcare Records: Patient records might lack blood pressure readings for some visits.
IoT Sensor Data: Sensors may fail to record temperature at certain intervals, leading to gaps.
Methods: fillna, dropna
Pandas provides two primary methods for handling missing data:
fillna: Replaces missing values with a specified value or method (e.g., forward fill, mean).
dropna: Removes rows or columns containing missing values.
Example 1: Basic Missing Data Handling
Let’s analyze a dataset of customer orders from an e-commerce platform.
import pandas as pd
import numpy as np
# Sample dataset
data = {
'order_id': [1, 2, 3, 4, 5],
'customer_name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
'order_amount': [100.50, np.nan, 200.75, 150.25, np.nan],
'rating': [5, 4, np.nan, 3, 2]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Detecting missing values
print("\nMissing values:")
print(df.isna().sum())
# Dropping rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
# Filling missing values with a specific value
df_filled = df.fillna({'customer_name': 'Unknown', 'order_amount': df['order_amount'].mean(), 'rating': 0})
print("\nDataFrame after filling missing values:")
print(df_filled)
Output:
Original DataFrame:
order_id customer_name order_amount rating
0 1 Alice 100.50 5.0
1 2 Bob NaN 4.0
2 3 Charlie 200.75 NaN
3 4 None 150.25 3.0
4 5 Eve NaN 2.0
Missing values:
order_id 0
customer_name 1
order_amount 2
rating 1
dtype: int64
DataFrame after dropping rows with missing values:
order_id customer_name order_amount rating
0 1 Alice 100.50 5.0
DataFrame after filling missing values:
order_id customer_name order_amount rating
0 1 Alice 100.50 5.0
1 2 Bob 150.50 4.0
2 3 Charlie 200.75 0.0
3 4 Unknown 150.25 3.0
4 5 Eve 150.50 2.0
Example 2: Advanced Missing Data Handling with Interpolation
For a time series dataset, such as temperature readings from IoT sensors, interpolation can fill gaps more meaningfully.
# Sample time series dataset
dates = pd.date_range('2025-01-01', periods=5, freq='D')
data = {
'temperature': [22.5, np.nan, 23.0, np.nan, 24.5],
'humidity': [50, 52, np.nan, 55, 54]
}
df_ts = pd.DataFrame(data, index=dates)
print("Original Time Series DataFrame:")
print(df_ts)
# Linear interpolation for missing values
df_interpolated = df_ts.interpolate(method='linear')
print("\nDataFrame after linear interpolation:")
print(df_interpolated)
Output:
Original Time Series DataFrame:
temperature humidity
2025-01-01 22.5 50.0
2025-01-02 NaN 52.0
2025-01-03 23.0 NaN
2025-01-04 NaN 55.0
2025-01-05 24.5 54.0
DataFrame after linear interpolation:
temperature humidity
2025-01-01 22.50 50.0
2025-01-02 22.75 52.0
2025-01-03 23.00 53.5
2025-01-04 23.75 55.0
2025-01-05 24.50 54.0
Pros, Cons, and Alternatives
Pros of fillna:
Flexible: Supports multiple filling strategies (mean, median, forward fill, etc.).
Preserves data: Avoids losing rows, which is critical for small datasets.
Cons of fillna:
Risk of bias: Filling with mean or a constant may distort data distributions.
Context-specific: Requires domain knowledge to choose appropriate fill values.
Pros of dropna:
Simple: Removes problematic rows/columns quickly.
Safe: Avoids introducing artificial values.
Cons of dropna:
Data loss: Can significantly reduce dataset size, especially with widespread missing values.
Not suitable for time series: Dropping rows disrupts temporal continuity.
Alternatives:
Imputation Libraries: Use scikit-learn’s SimpleImputer or KNNImputer for advanced imputation (see the sketch after this list).
Manual Imputation: Custom logic based on domain knowledge (e.g., filling missing customer names based on order IDs).
External Data: Replace missing values using external sources (e.g., customer databases).
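As a concrete illustration of the scikit-learn alternative mentioned above, here is a minimal sketch of mean imputation with SimpleImputer. It assumes scikit-learn is installed, and the order_amount values simply mirror the earlier example.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
orders = pd.DataFrame({'order_amount': [100.50, np.nan, 200.75, 150.25, np.nan]})
imputer = SimpleImputer(strategy='mean')  # replace NaN with the column mean learned from the data
orders['order_amount'] = imputer.fit_transform(orders[['order_amount']]).ravel()
print(orders)
The advantage over fillna here is that the fitted imputer remembers the training mean, so the same value can be reused on new data via transform.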
Best Practices:
Understand the Data: Analyze why data is missing (random, systematic, or structural).
Choose Contextually: Use mean/median for numerical data, mode for categorical, or interpolation for time series.
Validate Impact: Compare model performance or summary statistics before and after handling missing data.
Document Choices: Clearly document imputation methods for reproducibility.
Standards:
Follow CRISP-DM (Cross-Industry Standard Process for Data Mining) for data preparation.
Adhere to FAIR principles (Findable, Accessible, Interoperable, Reusable) for data handling.
3. Handling Duplicates
Identifying and Removing Duplicates
Duplicates occur when identical rows or values appear in a dataset, often due to data entry errors or merging issues. Pandas provides duplicated() to identify duplicates and drop_duplicates() to remove them.
Real-Life Scenarios
Customer Database: Duplicate customer records due to multiple sign-ups.
Survey Data: Respondents submitting the same response multiple times.
Transaction Logs: Repeated entries from system glitches.
Example: Removing Duplicates in Customer Data
# Sample customer dataset with duplicates
data = {
'customer_id': [101, 102, 101, 103, 102],
'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'purchase_amount': [50, 75, 50, 100, 75]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Identifying duplicates
print("\nDuplicated rows:")
print(df.duplicated().sum())
# Removing duplicates
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)
# Keeping the last occurrence
df_last = df.drop_duplicates(keep='last')
print("\nDataFrame keeping last duplicate:")
print(df_last)
Output:
Original DataFrame:
customer_id name purchase_amount
0 101 Alice 50
1 102 Bob 75
2 101 Alice 50
3 103 Charlie 100
4 102 Bob 75
Duplicated rows:
2
DataFrame after removing duplicates:
customer_id name purchase_amount
0 101 Alice 50
1 102 Bob 75
3 103 Charlie 100
DataFrame keeping last duplicate:
customer_id name purchase_amount
2 101 Alice 50
3 103 Charlie 100
4 102 Bob 75
Pros, Cons, and Alternatives
Pros:
Easy to use: drop_duplicates() is straightforward and customizable.
Improves data quality: Removes redundant information, reducing bias in analysis.
Cons:
Risk of data loss: Incorrectly identifying duplicates can remove valid data.
Performance: Can be slow for large datasets with complex duplicate checks.
Alternatives:
Custom Deduplication: Use domain-specific logic (e.g., matching on partial strings).
Database-Level Deduplication: Use SQL DISTINCT for large datasets.
Fuzzy Matching: Libraries like fuzzywuzzy for near-duplicate detection.
Best Practices:
Check All Columns: Ensure duplicates are identified across the relevant columns, for example with the subset argument of drop_duplicates (see the sketch after this list).
Validate Duplicates: Manually inspect a sample to confirm duplicates are errors.
Log Changes: Record which rows were removed for auditability.
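To make the "Check All Columns" advice concrete, here is a minimal sketch of deduplicating on a chosen subset of columns; the customer_id and name values are made up for illustration.
import pandas as pd
customers = pd.DataFrame({
'customer_id': [101, 102, 101],
'name': ['Alice', 'Bob', 'Alice A.']  # same customer entered twice with a name variant
})
# Rows count as duplicates when customer_id repeats, even if other columns differ
print(customers.drop_duplicates(subset=['customer_id'], keep='first'))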
Standards:
Align with ISO/IEC 25012 for data quality (accuracy, completeness).
4. Data Type Conversion (astype)
Why Data Types Matter
Correct data types ensure efficient storage, accurate computations, and compatibility with downstream processes. For example, treating numerical data as strings can cause errors in calculations.
Real-Life Applications
Financial Analysis: Converting string prices to floats for calculations.
Survey Analysis: Converting categorical responses to category type for memory efficiency.
Log Analysis: Converting timestamps to datetime for time-based operations.
Example: Converting Data Types
# Sample dataset with mixed types
data = {
'product_id': ['001', '002', '003'],
'price': ['19.99', '29.99', '39.99'],
'in_stock': ['True', 'False', 'True']
}
df = pd.DataFrame(data)
print("Original DataFrame with types:")
print(df.dtypes)
# Converting data types
df['product_id'] = df['product_id'].astype(int)
df['price'] = df['price'].astype(float)
# Note: astype(bool) treats any non-empty string (including 'False') as True, so map the strings explicitly
df['in_stock'] = df['in_stock'].map({'True': True, 'False': False})
print("\nDataFrame after type conversion:")
print(df.dtypes)
print(df)
Output:
Original DataFrame with types:
product_id object
price object
in_stock object
dtype: object
DataFrame after type conversion:
product_id int64
price float64
in_stock bool
dtype: object
product_id price in_stock
0 1 19.99 True
1 2 29.99 False
2 3 39.99 True
Pros, Cons, and Alternatives
Pros:
Improves performance: Correct types reduce memory usage (e.g., category vs. object).
Enables operations: Numeric types allow mathematical computations.
Cons:
Errors on invalid data: astype fails if data cannot be converted (e.g., strings with letters to int).
Loss of precision: Converting floats to integers may truncate decimals.
Alternatives:
pd.to_numeric: Converts to numeric types with error handling, e.g. errors='coerce' (see the sketch after this list).
pd.to_datetime: For datetime conversions.
Custom Parsing: Use apply for complex type conversions.
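Here is a minimal sketch of the pd.to_numeric alternative with errors='coerce'; the price strings are illustrative, and 'N/A' stands in for any value that cannot be parsed.
import pandas as pd
prices = pd.Series(['19.99', 'N/A', '39.99'])
clean_prices = pd.to_numeric(prices, errors='coerce')  # unparseable entries become NaN instead of raising
print(clean_prices)  # 19.99, NaN, 39.99 as float64
The resulting NaN values can then be handled with the missing-data techniques from Section 2.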
Best Practices:
Inspect Data First: Use df.dtypes and df.info() to understand current types.
Handle Errors: Use pd.to_numeric(errors='coerce') for robust conversions.
Optimize Memory: Use category for categorical data to save memory.
Standards:
Follow IEEE 754 for floating-point arithmetic.
Use ISO 8601 for datetime formats.
5. String Operations and Regex with .str Accessor
String Manipulation in Pandas
Pandas’ .str accessor allows vectorized string operations, such as extracting substrings, applying regex, or transforming text case.
Real-Life Scenarios
Customer Feedback Analysis: Extracting keywords from reviews.
Log Parsing: Extracting IP addresses or error codes from logs.
E-commerce: Cleaning product names for consistency.
Example: String Operations and Regex
import re  # needed below for the re.IGNORECASE flag
# Sample dataset of customer feedback
data = {
'review_id': [1, 2, 3],
'feedback': ['Great product!', 'Not bad, but slow delivery', 'Amazing quality, highly recommend']
}
df = pd.DataFrame(data)
# Basic string operations
df['feedback_lower'] = df['feedback'].str.lower()
df['contains_delivery'] = df['feedback'].str.contains('delivery', case=False, na=False)
df['word_count'] = df['feedback'].str.split().str.len()
# Regex: Extract the first word starting with 'h' (case-insensitive)
df['high_words'] = df['feedback'].str.extract(r'\b(h\w+)\b', flags=re.IGNORECASE)
print(df)
Output:
review_id feedback feedback_lower contains_delivery word_count high_words
0 1 Great product! great product! False 2 NaN
1 2 Not bad, but slow delivery not bad, but slow delivery True 5 NaN
2 3 Amazing quality, highly recommend amazing quality, highly recommend False 4 highly
Pros, Cons, and Alternatives
Pros:
Efficient: Vectorized operations are faster than loops.
Flexible: Supports complex regex patterns for pattern matching.
Cons:
Performance: Regex can be slow on large datasets.
Complexity: Requires familiarity with regex syntax.
Alternatives:
Python’s re Module: For complex regex not supported by .str.
Custom Functions: Use apply for bespoke string processing (a sketch combining apply with Python’s re module follows this list).
NLP Libraries: Use nltk or spacy for advanced text analysis.
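As a sketch of the apply-plus-re alternative, the snippet below pulls the first run of digits out of some made-up order codes; the codes and the extract_number helper are purely illustrative.
import re
import pandas as pd
codes = pd.Series(['ORD-123', 'ord-456', 'no code'])
def extract_number(text):
    match = re.search(r'\d+', text)  # first run of digits, if any
    return int(match.group()) if match else None
print(codes.apply(extract_number))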
Best Practices:
Use Vectorized Operations: Prefer .str over loops for performance.
Test Regex: Validate patterns on small datasets to avoid errors.
Handle NaN: Use na=False in .str.contains to handle missing values.
Standards:
Follow POSIX for regex syntax where applicable.
Use Unicode for text encoding to support multilingual data.
6. Mapping, Replacing, and Applying Functions
Transforming Data with Functions
Pandas allows data transformation using map, replace, and apply to apply custom logic or mappings to data.
Real-Life Use Cases
Customer Segmentation: Mapping customer IDs to segments (e.g., VIP, Regular).
Data Cleaning: Replacing invalid values with standard ones.
Feature Engineering: Applying calculations to create new features.
Example: Mapping and Applying Functions
# Sample dataset
data = {
'customer_id': [101, 102, 103],
'status': ['active', 'inactive', 'active'],
'spend': [500, 200, 1000]
}
df = pd.DataFrame(data)
# Mapping status to numeric values
status_map = {'active': 1, 'inactive': 0}
df['status_numeric'] = df['status'].map(status_map)
# Replacing specific values
df['spend'] = df['spend'].replace(200, 250)
# Applying a custom function
def discount(spend):
    return spend * 0.9 if spend > 500 else spend
df['discounted_spend'] = df['spend'].apply(discount)
print(df)
Output:
customer_id status spend status_numeric discounted_spend
0 101 active 500 1 500.0
1 102 inactive 250 0 250.0
2 103 active 1000 1 900.0
Pros, Cons, and Alternatives
Pros:
Flexible: apply supports complex logic.
Intuitive: map and replace are straightforward for simple transformations.
Cons:
Performance: apply can be slow for large datasets.
Limited scope: Series.map works element-wise on a single Series; element-wise mapping over a whole DataFrame needs DataFrame.map (formerly applymap).
Alternatives:
Vectorized Operations: Use NumPy operations (e.g., np.where) for faster computations; see the sketch after this list.
Lambda Functions: For simple one-off transformations.
Custom Pipelines: Use scikit-learn pipelines for complex transformations.
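For comparison with the apply-based discount example above, here is a minimal vectorized sketch using np.where on the same illustrative spend values.
import numpy as np
import pandas as pd
df = pd.DataFrame({'spend': [500, 250, 1000]})
# Same rule as the discount() function, but evaluated in one vectorized pass
df['discounted_spend'] = np.where(df['spend'] > 500, df['spend'] * 0.9, df['spend'])
print(df)  # 500.0, 250.0, 900.0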
Best Practices:
Prefer Vectorized Operations: Use map or NumPy over apply when possible.
Profile Performance: Test apply on large datasets to avoid bottlenecks.
Document Logic: Clearly comment custom functions for maintainability.
Standards:
Follow PEP 8 for Python code style.
Ensure transformations are idempotent (same input, same output).
7. Binning and Categorization of Data
Grouping Data into Bins
Binning involves grouping continuous data into discrete intervals (bins) or categories, useful for analysis and visualization.
Real-Life Applications
Customer Analysis: Grouping customers by age or income brackets.
Sales Analysis: Categorizing sales into low, medium, high.
Sensor Data: Binning sensor readings into ranges for alerts.
Example: Binning Sales Data
# Sample dataset
data = {
'customer_id': [101, 102, 103, 104],
'sales': [150, 300, 600, 900]
}
df = pd.DataFrame(data)
# Binning sales into categories
bins = [0, 200, 500, 1000]
labels = ['Low', 'Medium', 'High']
df['sales_category'] = pd.cut(df['sales'], bins=bins, labels=labels)
print(df)
# Quantile-based binning
df['sales_quantile'] = pd.qcut(df['sales'], q=3, labels=['Low', 'Medium', 'High'])
print("\nDataFrame with quantile binning:")
print(df)
Output:
customer_id sales sales_category
0 101 150 Low
1 102 300 Medium
2 103 600 High
3 104 900 High
DataFrame with quantile binning:
customer_id sales sales_category sales_quantile
0 101 150 Low Low
1 102 300 Medium Low
2 103 600 High Medium
3 104 900 High High
Pros, Cons, and Alternatives
Pros:
Simplifies analysis: Reduces complexity of continuous data.
Enhances visualization: Bins are easier to plot (e.g., histograms).
Cons:
Loss of information: Binning discards precise values.
Arbitrary boundaries: Fixed bins may not suit all data distributions.
Alternatives:
Clustering: Use scikit-learn for data-driven binning.
Custom Binning: Define bins based on domain knowledge.
Discretization Libraries: Use KBinsDiscretizer from scikit-learn (a minimal sketch follows below).
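Here is a minimal sketch of the KBinsDiscretizer alternative; it assumes scikit-learn is installed and reuses the illustrative sales values from the example above.
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
sales = pd.DataFrame({'sales': [150, 300, 600, 900]})
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
sales['sales_bin'] = discretizer.fit_transform(sales[['sales']]).ravel()  # bin indices 0, 1, 2
print(sales)
Unlike pd.qcut, the fitted discretizer can apply the same bin edges to new data later via transform.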
Best Practices:
Choose Meaningful Bins: Align bins with business or statistical logic.
Test Bin Sensitivity: Analyze how bin size affects results.
Use Quantile Binning: For skewed data, qcut ensures balanced bins.
Standards:
Follow ISO 80000-2 for mathematical notation in binning.
Ensure bins are mutually exclusive and collectively exhaustive.
8. Conditional Filtering and Boolean Indexing
Filtering Data with Conditions
Boolean indexing allows filtering rows based on conditions, using logical operations (&, |, ~).
Real-Life Scenarios
Marketing Campaigns: Filter customers who spent above a threshold.
Quality Control: Identify defective products based on metrics.
Fraud Detection: Flag transactions with suspicious patterns.
Example: Filtering Customer Data
# Sample dataset
data = {
'customer_id': [101, 102, 103, 104],
'age': [25, 35, 45, 30],
'spend': [100, 500, 200, 800]
}
df = pd.DataFrame(data)
# Filtering customers with spend > 300 and age < 40
filtered_df = df[(df['spend'] > 300) & (df['age'] < 40)]
print("Filtered DataFrame:")
print(filtered_df)
Output:
Filtered DataFrame:
customer_id age spend
1 102 35 500
3 104 30 800
Pros, Cons, and Alternatives
Pros:
Intuitive: Uses familiar logical operators.
Flexible: Supports complex conditions with multiple criteria.
Cons:
Readability: Complex conditions can become hard to read.
Performance: Large datasets may require optimized filtering.
Alternatives:
query Method: Use df.query() for readable filtering (see the sketch after this list).
SQL: Use SQL for filtering in database-backed workflows.
NumPy Indexing: For performance-critical applications.
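Here is a minimal sketch of the query alternative, rewriting the filter from the example above on the same illustrative columns.
import pandas as pd
df = pd.DataFrame({
'customer_id': [101, 102, 103, 104],
'age': [25, 35, 45, 30],
'spend': [100, 500, 200, 800]
})
# Same condition as the boolean-indexing version, expressed as a readable string
print(df.query('spend > 300 and age < 40'))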
Best Practices:
Use Parentheses: Ensure correct precedence in complex conditions.
Optimize Filters: Apply filters early to reduce dataset size.
Validate Results: Check filtered data for correctness.
Standards:
Follow SQL-92 for logical condition syntax inspiration.
Ensure filters are deterministic for reproducibility.
9. Working with Datetime Objects and Time-Based Indexing
Handling Dates and Times
Pandas provides robust support for datetime operations, including parsing, formatting, and time-based indexing.
Real-Life Use Cases
Sales Analysis: Analyzing sales trends over time.
IoT Data: Processing timestamped sensor readings.
Log Analysis: Extracting events within a time window.
Example: Datetime Operations
# Sample dataset with dates
data = {
'order_date': ['2025-01-01', '2025-01-02', '2025-01-03'],
'sales': [100, 200, 300]
}
df = pd.DataFrame(data)
df['order_date'] = pd.to_datetime(df['order_date'])
# Extracting components
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['day'] = df['order_date'].dt.day
print("DataFrame with extracted components:")
print(df)
# Time-based indexing
df.set_index('order_date', inplace=True)
print("\nSales for January 2025:")
print(df.loc['2025-01'])
Output:
DataFrame with extracted components:
order_date sales year month day
0 2025-01-01 100 2025 1 1
1 2025-01-02 200 2025 1 2
2 2025-01-03 300 2025 1 3
Sales for January 2025:
sales year month day
order_date
2025-01-01 100 2025 1 1
2025-01-02 200 2025 1 2
2025-01-03 300 2025 1 3
Pros, Cons, and Alternatives
Pros:
Powerful: Supports a wide range of datetime operations.
Time-based indexing: Simplifies temporal filtering and grouping.
Cons:
Memory usage: Datetime objects can increase memory footprint.
Parsing errors: Incorrect formats can cause conversion issues.
Alternatives:
datetime Module: For low-level datetime operations.
Arrow Library: For faster datetime processing.
SQL: Use database-native datetime functions for large datasets.
Best Practices:
Standardize Formats: Use ISO 8601 (e.g., YYYY-MM-DD) for consistency.
Handle Time Zones: Use tz_localize or tz_convert for time zone-aware data (see the sketch after this list).
Validate Parsing: Check for parsing errors with errors='coerce'.
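To illustrate the time-zone advice, here is a minimal sketch with tz_localize and tz_convert; the timestamps are made up and assumed to have been recorded in UTC, and Asia/Dhaka is just an example target zone.
import pandas as pd
ts = pd.to_datetime(pd.Series(['2025-01-01 09:00', '2025-01-01 15:00']))
ts_utc = ts.dt.tz_localize('UTC')  # attach a time zone to naive timestamps
ts_local = ts_utc.dt.tz_convert('Asia/Dhaka')  # convert to a local zone for reporting
print(ts_local)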
Standards:
Adhere to ISO 8601 for datetime formats.
Use UTC as the default time zone for consistency.
10. Resampling, Shifting, and Rolling Windows
Time Series Transformations
Pandas supports advanced time series operations like resampling (aggregating over time intervals), shifting (lagging/leading data), and rolling windows (moving calculations).
Real-Life Applications
Financial Analysis: Calculating moving averages for stock prices.
IoT Monitoring: Aggregating sensor data by hour or day.
Sales Forecasting: Shifting sales data to create lagged features.
Example: Time Series Operations
# Sample time series data
dates = pd.date_range('2025-01-01', periods=5, freq='D')
data = {'sales': [100, 150, 200, 250, 300]}
df = pd.DataFrame(data, index=dates)
# Resampling to weekly mean
weekly_sales = df.resample('W').mean()
print("Weekly resampled sales:")
print(weekly_sales)
# Shifting sales by one day
df['lagged_sales'] = df['sales'].shift(1)
print("\nDataFrame with lagged sales:")
print(df)
# Rolling window (3-day moving average)
df['moving_avg'] = df['sales'].rolling(window=3).mean()
print("\nDataFrame with 3-day moving average:")
print(df)
Output:
Weekly resampled sales:
sales
2025-01-05 200.0
DataFrame with lagged sales:
sales lagged_sales
2025-01-01 100 NaN
2025-01-02 150 100.0
2025-01-03 200 150.0
2025-01-04 250 200.0
2025-01-05 300 250.0
DataFrame with 3-day moving average:
sales lagged_sales moving_avg
2025-01-01 100 NaN NaN
2025-01-02 150 100.0 NaN
2025-01-03 200 150.0 150.000000
2025-01-04 250 200.0 200.000000
2025-01-05 300 250.0 250.000000
Pros, Cons, and Alternatives
Pros:
Versatile: Supports various aggregations (mean, sum, etc.).
Intuitive: Simplifies time series analysis.
Cons:
Performance: Rolling operations can be slow for large datasets.
Edge effects: Missing values at the start/end of rolling windows.
Alternatives:
NumPy: For faster rolling calculations.
Statsmodels: For advanced time series modeling.
SQL: Use window functions for database-backed time series.
Best Practices:
Choose Appropriate Windows: Align window size with analysis goals.
Handle Edges: Use min_periods in rolling to control missing values at the window edges (see the sketch after this list).
Validate Aggregations: Ensure resampling aligns with data frequency.
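As a quick sketch of min_periods, the snippet below reuses the illustrative daily sales from the example above; with min_periods=1 the 3-day window starts producing averages as soon as a single observation is available.
import pandas as pd
dates = pd.date_range('2025-01-01', periods=5, freq='D')
sales = pd.Series([100, 150, 200, 250, 300], index=dates)
print(sales.rolling(window=3, min_periods=1).mean())  # 100.0, 125.0, 150.0, 200.0, 250.0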
Standards:
Follow ISO 80000-3 for time-related calculations.
Ensure time series operations are time zone-aware.
11. Handling Missing Time Series Data
Missing Data in Time Series
Missing data in time series can disrupt analyses, especially when continuity is critical. Interpolation and forward/backward filling are common solutions.
Real-Life Scenarios
Stock Prices: Missing prices due to market closures.
Sensor Data: Gaps from sensor failures.
Website Traffic: Missing data during server downtimes.
Example: Interpolating Missing Time Series Data
# Sample time series with missing data
dates = pd.date_range('2025-01-01', periods=5, freq='D')
data = {'traffic': [1000, np.nan, 1200, np.nan, 1500]}
df = pd.DataFrame(data, index=dates)
# Forward fill
df['ffill'] = df['traffic'].ffill()
# Backward fill
df['bfill'] = df['traffic'].bfill()
# Linear interpolation
df['interpolated'] = df['traffic'].interpolate(method='linear')
print(df)
Output:
traffic ffill bfill interpolated
2025-01-01 1000.0 1000.0 1000.0 1000.000000
2025-01-02 NaN 1000.0 1200.0 1100.000000
2025-01-03 1200.0 1200.0 1200.0 1200.000000
2025-01-04 NaN 1200.0 1500.0 1350.000000
2025-01-05 1500.0 1500.0 1500.0 1500.000000
Pros, Cons, and Alternatives
Pros:
Maintains continuity: Interpolation preserves trends in time series.
Flexible: Multiple methods (linear, polynomial, etc.) available.
Cons:
Bias risk: Interpolation assumes smooth trends, which may not always hold.
Limited applicability: Not suitable for large gaps or non-continuous data.
Alternatives:
Time Series Models: Use ARIMA or Prophet for predictive imputation.
External Data: Fill gaps using related datasets (e.g., weather data for sensors).
Manual Imputation: Domain-specific rules for filling gaps.
Best Practices:
Use Interpolation for Small Gaps: Linear or spline interpolation works well for short gaps.
Validate Trends: Check if imputed values align with data patterns.
Limit Extrapolation: Avoid interpolating beyond the observed data range (a sketch using limit and limit_area follows this list).
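Here is a minimal sketch of constraining interpolation with the limit and limit_area parameters; the traffic values are illustrative.
import numpy as np
import pandas as pd
dates = pd.date_range('2025-01-01', periods=6, freq='D')
traffic = pd.Series([1000, np.nan, np.nan, 1300, np.nan, np.nan], index=dates)
# limit caps how many consecutive NaNs get filled; limit_area='inside' leaves the trailing gap alone
print(traffic.interpolate(method='linear', limit=1, limit_area='inside'))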
Standards:
Follow ISO 8601 for time series consistency.
Ensure imputation methods are statistically valid.
12. Conclusion and Next Steps
This comprehensive tutorial has covered the essential techniques of data wrangling and transformation in Pandas, from handling missing data to advanced time series operations. By mastering these skills, you can clean, transform, and analyze real-world datasets with confidence. The examples provided—ranging from e-commerce sales to IoT sensor data—demonstrate how these techniques apply to practical scenarios.