Skip to content
📖 Welcome to my knowledge base! Notes on AI/ML, Maths, CS, MBA, Trading, Economics, Health & Self-Help — all in one place.! 🎉 Discover what’s new

Pandas

Pandas is a powerful, open-source data manipulation and analysis library for Python. It provides high-performance, easy-to-use data structures and tools for working with structured (tabular, multidimensional, heterogeneous) and time-series data.

This tutorial covers everything from installation to advanced features, with practical examples.


Installation and Setup

Installing Pandas

pip install pandas

Or with conda:

conda install pandas

Importing Pandas

import pandas as pd
import numpy as np  # Often used alongside pandas

Core Data Structures

Pandas has two primary data structures:

Series

A one-dimensional labeled array capable of holding any data type.

# Create a Series from a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)
# 0    1
# 1    3
# 2    5
# 3    7
# 4    9
# dtype: int64

# With custom index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s['c'])  # 5

DataFrame

A two-dimensional labeled data structure with columns of potentially different types.

# Create a DataFrame from a dictionary
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['NYC', 'LA', 'Chicago', 'Miami']
})
print(df)

Creating DataFrames

From Dictionary

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z']
})

From List of Dictionaries

data = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30}
]
df = pd.DataFrame(data)

From CSV

df = pd.read_csv('filename.csv')

From Excel

df = pd.read_excel('filename.xlsx', sheet_name='Sheet1')

From NumPy Array

arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])

Reading and Writing Data

Reading Data

# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', index_col=0)  # Set first column as index
df = pd.read_csv('data.csv', encoding='utf-8')

# Excel
df = pd.read_excel('data.xlsx', sheet_name=0)

# JSON
df = pd.read_json('data.json')

# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table', conn)

Writing Data

# CSV
df.to_csv('output.csv', index=False)

# Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)

# JSON
df.to_json('output.json', orient='records')

# SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)

Viewing and Inspecting Data

# View first/last rows
df.head(5)      # First 5 rows
df.tail(3)      # Last 3 rows

# Dimensions
df.shape        # (rows, columns)
df.size         # Total number of elements

# Info about DataFrame
df.info()       # Column types, non-null counts, memory usage

# Statistical summary
df.describe()   # Only numerical columns by default
df.describe(include='all')  # All columns

# Column names
df.columns

# Data types
df.dtypes

# Unique values in a column
df['column'].unique()
df['column'].nunique()  # Count of unique values

# Value counts
df['column'].value_counts()

# Check for null values
df.isnull().sum()

Selecting Data

Selecting Columns

# Single column (returns Series)
df['Name']
df.Name  # Alternative syntax

# Multiple columns (returns DataFrame)
df[['Name', 'Age']]

# Columns with dot notation don't work if column name has spaces or special chars

Selecting Rows

# By index position using iloc (integer location)
df.iloc[0]          # First row
df.iloc[0:3]        # First 3 rows
df.iloc[0, 1]       # First row, second column
df.iloc[:, 0:2]     # All rows, first 2 columns

# By label using loc
df.loc[0]           # Row with index label 0
df.loc[0:3, 'Name'] # Rows 0-3, column 'Name'
df.loc[df['Age'] > 25, :]  # Conditional selection

# Boolean indexing
df[df['Age'] > 25]
df[(df['Age'] > 25) & (df['City'] == 'NYC')]

# isin method
df[df['City'].isin(['NYC', 'LA'])]

Using query() Method

df.query('Age > 25 and City == "NYC"')

Data Cleaning

Handling Missing Values

# Detect missing values
df.isnull()           # Boolean DataFrame
df.isnull().sum()     # Count missing per column

# Drop missing values
df.dropna()           # Drop rows with any missing values
df.dropna(axis=1)     # Drop columns with any missing values
df.dropna(subset=['column'])  # Drop rows missing in specific column

# Fill missing values
df.fillna(0)                    # Fill all with 0
df.fillna({'col1': 0, 'col2': 'Unknown'})  # Fill specific columns
df.fillna(method='ffill')       # Forward fill
df.fillna(method='bfill')       # Backward fill
df.fillna(df.mean())            # Fill with mean

# Replace specific values
df.replace('old_value', 'new_value')
df.replace({1: 'one', 2: 'two'})

# Interpolate missing values
df.interpolate()

Handling Duplicates

# Identify duplicates
df.duplicated()           # Boolean Series
df.duplicated(subset=['column'])  # Check specific columns

# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['column'], keep='first')
# keep: 'first', 'last', False

Renaming Columns

# Rename specific columns
df.rename(columns={'old_name': 'new_name', 'old': 'new'})

# Rename all columns
df.columns = ['Col1', 'Col2', 'Col3']

# Using a mapping function
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')

Changing Data Types

# Convert to specific type
df['Age'] = df['Age'].astype('int64')
df['Name'] = df['Name'].astype('string')

# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

# Convert to numeric (with errors handling)
df['Number'] = pd.to_numeric(df['Number'], errors='coerce')

# Categorical data
df['Category'] = df['Category'].astype('category')

Data Transformation

Applying Functions

# Apply function to columns
df['Age_squared'] = df['Age'].apply(lambda x: x**2)

# Apply function to rows
df.apply(lambda row: row['Age'] + row['Score'], axis=1)

# Map values
df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})

# Replace using a dictionary
df.replace({'Gender': {'M': 'Male', 'F': 'Female'}})

Adding New Columns

# Direct assignment
df['NewColumn'] = 0
df['NewColumn'] = df['Col1'] + df['Col2']

# Using assign
df.assign(NewCol=lambda x: x['Col1'] * 2)

# Using insert (at specific position)
df.insert(0, 'NewFirstCol', df['ExistingCol'])

Dropping Columns/Rows

# Drop columns
df.drop('column_name', axis=1)
df.drop(['col1', 'col2'], axis=1)

# Drop rows
df.drop([0, 1, 2], axis=0)  # Drop by index
df.drop(df[df['Age'] < 18].index)  # Drop rows meeting condition

# In-place modification
df.drop('column', axis=1, inplace=True)

Data Aggregation

Descriptive Statistics

# Basic statistics
df.mean()           # Mean of all numerical columns
df.median()         # Median
df.mode()           # Mode
df.std()            # Standard deviation
df.var()            # Variance
df.min() / df.max() # Min/Max
df.sum()            # Sum
df.corr()           # Correlation matrix
df.cov()            # Covariance matrix

# For specific columns
df['Age'].mean()
df[['Age', 'Score']].describe()

Grouping Data

# Group by a single column
grouped = df.groupby('City')
grouped['Age'].mean()
grouped['Age'].agg(['mean', 'max', 'min', 'count'])

# Group by multiple columns
df.groupby(['City', 'Gender'])['Age'].mean()

# Multiple aggregations
df.groupby('City').agg({
    'Age': ['mean', 'max'],
    'Score': ['mean', 'sum']
})

# Named aggregations
df.groupby('City').agg(
    avg_age=('Age', 'mean'),
    max_score=('Score', 'max')
)

Pivot Tables

# Create pivot table
df.pivot_table(
    values='Sales',
    index='City',
    columns='Product',
    aggfunc='sum',
    fill_value=0
)

# Multiple aggregations
pd.pivot_table(df, 
               values='Sales', 
               index=['City', 'Region'], 
               columns='Product',
               aggfunc=['sum', 'mean'])

Cross Tabulation

pd.crosstab(df['City'], df['Product'])
pd.crosstab(df['City'], df['Product'], values=df['Sales'], aggfunc='sum')

Merging and Joining

Concatenation

# Concatenate rows (vertical)
df_combined = pd.concat([df1, df2], axis=0)

# Concatenate columns (horizontal)
df_combined = pd.concat([df1, df2], axis=1)

# With keys for identification
df_combined = pd.concat([df1, df2], keys=['A', 'B'])

Merging

# Inner join (default)
merged = pd.merge(df1, df2, on='key')

# Left join
merged = pd.merge(df1, df2, on='key', how='left')

# Right join
merged = pd.merge(df1, df2, on='key', how='right')

# Outer join
merged = pd.merge(df1, df2, on='key', how='outer')

# Merge on different columns
merged = pd.merge(df1, df2, left_on='key1', right_on='key2')

# Merge with suffixes for overlapping columns
merged = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

Joining

# Using join (similar to merge but uses index)
df1.join(df2, on='key')
df1.join(df2, how='inner')

Time Series Data

Datetime Index

# Create datetime index
df.index = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Generate date range
dates = pd.date_range('2023-01-01', periods=10, freq='D')
df = pd.DataFrame(index=dates, data={'value': range(10)})

Date Operations

# Extract date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Quarter'] = df['Date'].dt.quarter

# Date differences
df['Date2'] - df['Date1']  # Returns timedelta
(df['Date2'] - df['Date1']).dt.days

# Shift and lag
df['Shifted'] = df['Value'].shift(1)  # Previous row
df['Shifted_forward'] = df['Value'].shift(-1)  # Next row

# Rolling windows
df['Rolling_Mean'] = df['Value'].rolling(window=7).mean()
df['Rolling_Sum'] = df['Value'].rolling(window=7).sum()

# Resampling
df.resample('M').mean()  # Monthly
df.resample('Q').sum()   # Quarterly
df.resample('Y').max()   # Yearly

Time Zones

# Set timezone
df = df.tz_localize('UTC')
df = df.tz_convert('US/Eastern')

Working with Text Data

# Convert to lowercase/uppercase
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.title()

# String operations
df['Name'].str.len()
df['Name'].str.contains('pattern')
df['Name'].str.startswith('A')
df['Name'].str.endswith('z')

# Extract substrings
df['Name'].str.extract(r'(\d+)')  # Extract digits
df['Name'].str.extractall(r'(\w+)')  # Extract all words

# Split and expand
df['Name'].str.split(' ', expand=True)
df['Name'].str.split(' ', n=1, expand=True)  # Split into 2 columns

# Replace and clean
df['Text'].str.replace('old', 'new')
df['Text'].str.replace('_', ' ', regex=False)
df['Text'].str.strip()  # Remove whitespace
df['Text'].str.lstrip()
df['Text'].str.rstrip()

# Pad strings
df['Name'].str.pad(width=10, side='right', fillchar='-')
df['Name'].str.zfill(5)

# Categorize by string patterns
df['Category'] = df['Text'].str.extract(r'(PatternA|PatternB|PatternC)')

Advanced Features

MultiIndex (Hierarchical Indexing)

# Create MultiIndex
df = df.set_index(['City', 'Year'])
df.loc[('NYC', 2020)]  # Access by both levels

# Create MultiIndex from tuples
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)])
df = pd.DataFrame(index=index)

# Cross-section with xs
df.xs('A', level=0)  # Get all 'A' level 0 entries

# Unstack/Stack
df.unstack()  # Pivot from MultiIndex to columns
df.stack()    # Pivot from columns to MultiIndex

Window Functions

# Expanding windows
df['Expanding_Mean'] = df['Value'].expanding().mean()
df['Expanding_Sum'] = df['Value'].expanding().sum()

# Exponential weighted
df['EWMA'] = df['Value'].ewm(span=10).mean()

# Rank and quantile
df['Rank'] = df['Value'].rank()
df['Percentile'] = df['Value'].rank(pct=True)
df['Quantile'] = df['Value'].quantile(0.75)

Handling Categorical Data

# Create categorical
df['Category'] = pd.Categorical(df['Category'], categories=['Low', 'Medium', 'High'], ordered=True)

# Get codes
df['Category'].cat.codes

# Rename categories
df['Category'].cat.rename_categories(['L', 'M', 'H'])

# Add new category
df['Category'] = df['Category'].cat.add_categories(['Very High'])

# Remove categories
df['Category'] = df['Category'].cat.remove_categories(['Low'])

Memory Optimization

# Check memory usage
df.memory_usage(deep=True)
df.info(memory_usage='deep')

# Optimize data types
df['IntCol'] = pd.to_numeric(df['IntCol'], downcast='integer')
df['FloatCol'] = pd.to_numeric(df['FloatCol'], downcast='float')
df['CatCol'] = df['CatCol'].astype('category')  # Reduce memory for strings

Visualization

Basic Plotting (with matplotlib)

import matplotlib.pyplot as plt

# Line plot
df.plot()
df.plot(kind='line', title='My Plot', figsize=(10, 6))

# Bar plot
df.plot(kind='bar')
df.plot(kind='barh')  # Horizontal

# Histogram
df['Column'].plot(kind='hist', bins=20)

# Box plot
df.plot(kind='box')

# Scatter plot
df.plot(kind='scatter', x='col1', y='col2')

# Area plot
df.plot(kind='area', stacked=True)

# Pie chart
df['Column'].value_counts().plot(kind='pie', autopct='%1.1f%%')

# Density plot
df['Column'].plot(kind='density')

# Multiple plots
df.plot(subplots=True, layout=(2, 2))

# Direct matplotlib integration
df.plot()
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.legend(['Legend'])
plt.show()

Advanced Visualization with Seaborn (Optional)

import seaborn as sns

# Pairplot
sns.pairplot(df)

# Heatmap
sns.heatmap(df.corr(), annot=True)

# Violin plot
sns.violinplot(x='Category', y='Value', data=df)

Performance Optimization

Vectorization

# Instead of loops, use vectorized operations
# Bad:
for i in range(len(df)):
    df.loc[i, 'New'] = df.loc[i, 'A'] + df.loc[i, 'B']

# Good:
df['New'] = df['A'] + df['B']

# Using numpy for complex operations
df['New'] = np.where(df['A'] > 10, df['A'] * 2, df['A'])

Using eval() for Performance

# Use eval for large DataFrames
df.eval('C = A + B')
df.eval('D = (A + B) / C', inplace=True)

Parallel Processing with Dask

# For very large datasets
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=4)
result = ddf.groupby('column').mean().compute()

Chunking for Large Files

# Process large files in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    process(chunk)

Common Recipes and Patterns

Filtering with Complex Conditions

# Multiple conditions
df[(df['Age'] > 18) & (df['Score'] >= 70) | (df['City'] == 'NYC')]

# Using query for readability
df.query('Age > 18 and Score >= 70 or City == "NYC"')

# Using between
df[df['Age'].between(18, 25)]

# Using isin with multiple values
df[df['City'].isin(['NYC', 'LA', 'Chicago'])]

# Using notna/isna
df[df['Email'].notna()]

Cumulative Operations

# Cumulative sum, product, max, min
df['CumSum'] = df['Value'].cumsum()
df['CumProd'] = df['Value'].cumprod()
df['CumMax'] = df['Value'].cummax()
df['CumMin'] = df['Value'].cummin()

Sampling and Randomization

# Random sampling
df.sample(n=10)  # Sample 10 rows
df.sample(frac=0.1)  # Sample 10% of rows

# Sampling with replacement
df.sample(n=10, replace=True)

# Random permutation
df.sample(frac=1).reset_index(drop=True)  # Shuffle entire dataframe

# Stratified sampling
df.groupby('Category', group_keys=False).apply(lambda x: x.sample(frac=0.1))

Date/Time Operations

# Date range
df = df[df['Date'].between('2023-01-01', '2023-12-31')]

# Resampling with custom aggregations
df.resample('M').agg({
    'Sales': 'sum',
    'Orders': 'count',
    'Revenue': lambda x: x.sum() / x.count()
})

# Lagged values
df['Previous_Value'] = df.groupby('ID')['Value'].shift(1)
df['Change'] = df['Value'] - df['Previous_Value']
df['Pct_Change'] = df['Value'].pct_change()

Handling Outliers

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['Value']))
df_no_outliers = df[z_scores < 3]

# IQR method
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
df_no_outliers = df[(df['Value'] >= Q1 - 1.5*IQR) & (df['Value'] <= Q3 + 1.5*IQR)]

Handling Missing Values Strategically

# Forward fill within groups
df['Value'] = df.groupby('ID')['Value'].fillna(method='ffill')

# Linear interpolation within groups
df['Value'] = df.groupby('ID')['Value'].transform(lambda x: x.interpolate(method='linear'))

# Flag missing data
df['Value_IsMissing'] = df['Value'].isna().astype(int)

Exporting for Different Use Cases

For Machine Learning

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Export to numpy
X_numpy = X.values
y_numpy = y.values

# Export to dict/JSON for APIs
df.to_dict(orient='records')
df.to_json(orient='records')

For Reporting

# Formatting floats
df.style.format({'Price': '${:,.2f}', 'Percentage': '{:.2%}'})

# Highlighting
df.style.highlight_max()
df.style.highlight_min()
df.style.background_gradient()

# Export to HTML
df.to_html('table.html', index=False)

# Export with formatting to Excel
with pd.ExcelWriter('report.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    # Add formatting
    worksheet.set_column('A:A', 20)

Error Handling and Debugging

Common Errors and Solutions

# SettingWithCopyWarning - Use .copy() or .loc
df_copy = df.copy()  # Create explicit copy

# KeyError - Check if column exists
if 'column' in df.columns:
    df['column']

# TypeError - Check data types
df.dtypes

# MemoryError - Use chunking or optimization

Debugging with info()

# Quick data inspection
print(df.head())
print(df.info())
print(df.describe())
print(df['column'].value_counts())

Integration with Other Libraries

With NumPy

# Convert to numpy array
arr = df.to_numpy()

# Use numpy functions
df['New'] = np.log(df['Value'])
df['New'] = np.sqrt(df['Value'])

With Matplotlib/Seaborn

import matplotlib.pyplot as plt
import seaborn as sns

df.plot()
plt.show()

sns.boxplot(data=df, x='Category', y='Value')

With Scikit-learn

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['col1', 'col2']])

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['categorical_column'])

With SQL Databases

import sqlalchemy

# Write to SQL
engine = sqlalchemy.create_engine('sqlite:///database.db')
df.to_sql('table', engine, if_exists='replace', index=False)

# Read from SQL
df = pd.read_sql('SELECT * FROM table', engine)

Best Practices

1. Use Vectorized Operations

# Avoid loops
# Bad:
for i, row in df.iterrows():
    df.loc[i, 'New'] = row['A'] + row['B']

# Good:
df['New'] = df['A'] + df['B']

2. Avoid Chained Assignment

# Bad:
df[df['A'] > 5]['B'] = 10

# Good:
df.loc[df['A'] > 5, 'B'] = 10

3. Use .copy() When Needed

df2 = df.copy()  # Prevents SettingWithCopyWarning

4. Optimize Data Types

# Convert to more efficient types
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
df['string_col'] = df['string_col'].astype('category')  # If limited categories

5. Use Chaining for Readability

result = (df
          .query('Age > 18')
          .groupby('City')
          .agg({'Score': 'mean'})
          .reset_index()
          .sort_values('Score', ascending=False))

6. Handle Missing Data Explicitly

# Always decide how to handle missing values
df.dropna()
df.fillna(value)
df.fillna(method='ffill')

Version Compatibility

Check Pandas Version

print(pd.__version__)

Deprecations to Watch

# Old:
df.ix[0, 'column']  # Deprecated

# New:
df.loc[0, 'column']  # Use loc for labels
df.iloc[0, 0]        # Use iloc for integer positions

Real-World Example

Here’s a complete example combining many concepts:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Load and inspect data
df = pd.read_csv('sales_data.csv')
print(df.info())

# 2. Clean the data
df['Date'] = pd.to_datetime(df['Date'])
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
df.dropna(subset=['Sales'], inplace=True)

# 3. Feature engineering
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Quarter'] = df['Date'].dt.quarter
df['Month_Name'] = df['Date'].dt.month_name()

# 4. Aggregations
monthly_sales = df.groupby(['Year', 'Month'])['Sales'].agg(['sum', 'mean', 'count']).reset_index()

# 5. Pivot for analysis
pivot = df.pivot_table(values='Sales', 
                       index='Month_Name', 
                       columns='Year', 
                       aggfunc='sum', 
                       fill_value=0)

# 6. Time series analysis
df.set_index('Date', inplace=True)
df['Rolling_7D'] = df['Sales'].rolling('7D').mean()

# 7. Detection of seasonality
seasonal_stats = df.groupby(df.index.month)['Sales'].agg(['mean', 'std'])

# 8. Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
pivot.plot(kind='bar', ax=axes[0, 0])
axes[0, 0].set_title('Monthly Sales by Year')

df['Sales'].plot(ax=axes[0, 1])
axes[0, 1].set_title('Time Series')

df['Rolling_7D'].plot(ax=axes[1, 0])
axes[1, 0].set_title('7-Day Moving Average')

seasonal_stats.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Seasonal Pattern')

plt.tight_layout()
plt.show()

# 9. Export results
monthly_sales.to_excel('analysis_results.xlsx', index=False)

Summary

Pandas is an essential library for data manipulation in Python. This tutorial covered:

  • Core data structures (Series, DataFrame)
  • Data loading and saving (CSV, Excel, JSON, SQL)
  • Data cleaning and preprocessing
  • Data transformation and aggregation
  • Advanced operations (grouping, pivoting, merging)
  • Time series analysis
  • Performance optimization
  • Integration with other libraries

Pandas provides a powerful and intuitive interface for data manipulation, making it an indispensable tool for data scientists, analysts, and anyone working with tabular data in Python.

Last updated on