Pandas

Technology

Data Science

Data Analysis

Pandas

Pandas is a powerful, open-source data manipulation and analysis library for Python. It provides high-performance, easy-to-use data structures and tools for working with structured (tabular, multidimensional, heterogeneous) and time-series data.

This tutorial covers everything from installation to advanced features, with practical examples.

Installation and Setup

Installing Pandas

pip install pandas

Or with conda:

conda install pandas

Importing Pandas

import pandas as pd
import numpy as np  # Often used alongside pandas

Core Data Structures

Pandas has two primary data structures:

Series

A one-dimensional labeled array capable of holding any data type.

# Create a Series from a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)
# 0    1
# 1    3
# 2    5
# 3    7
# 4    9
# dtype: int64

# With custom index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s['c'])  # 5

DataFrame

A two-dimensional labeled data structure with columns of potentially different types.

# Create a DataFrame from a dictionary
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['NYC', 'LA', 'Chicago', 'Miami']
})
print(df)

Creating DataFrames

From Dictionary

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['x', 'y', 'z']
})

From List of Dictionaries

data = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30}
]
df = pd.DataFrame(data)

From CSV

df = pd.read_csv('filename.csv')

From Excel

df = pd.read_excel('filename.xlsx', sheet_name='Sheet1')

From NumPy Array

arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])

Reading and Writing Data

Reading Data

# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', index_col=0)  # Set first column as index
df = pd.read_csv('data.csv', encoding='utf-8')

# Excel
df = pd.read_excel('data.xlsx', sheet_name=0)

# JSON
df = pd.read_json('data.json')

# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table', conn)

Writing Data

# CSV
df.to_csv('output.csv', index=False)

# Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)

# JSON
df.to_json('output.json', orient='records')

# SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)

Viewing and Inspecting Data

# View first/last rows
df.head(5)      # First 5 rows
df.tail(3)      # Last 3 rows

# Dimensions
df.shape        # (rows, columns)
df.size         # Total number of elements

# Info about DataFrame
df.info()       # Column types, non-null counts, memory usage

# Statistical summary
df.describe()   # Only numerical columns by default
df.describe(include='all')  # All columns

# Column names
df.columns

# Data types
df.dtypes

# Unique values in a column
df['column'].unique()
df['column'].nunique()  # Count of unique values

# Value counts
df['column'].value_counts()

# Check for null values
df.isnull().sum()

Selecting Data

Selecting Columns

# Single column (returns Series)
df['Name']
df.Name  # Alternative syntax

# Multiple columns (returns DataFrame)
df[['Name', 'Age']]

# Columns with dot notation don't work if column name has spaces or special chars

Selecting Rows

# By index position using iloc (integer location)
df.iloc[0]          # First row
df.iloc[0:3]        # First 3 rows
df.iloc[0, 1]       # First row, second column
df.iloc[:, 0:2]     # All rows, first 2 columns

# By label using loc
df.loc[0]           # Row with index label 0
df.loc[0:3, 'Name'] # Rows 0-3, column 'Name'
df.loc[df['Age'] > 25, :]  # Conditional selection

# Boolean indexing
df[df['Age'] > 25]
df[(df['Age'] > 25) & (df['City'] == 'NYC')]

# isin method
df[df['City'].isin(['NYC', 'LA'])]

Using query() Method

df.query('Age > 25 and City == "NYC"')

Data Cleaning

Handling Missing Values

# Detect missing values
df.isnull()           # Boolean DataFrame
df.isnull().sum()     # Count missing per column

# Drop missing values
df.dropna()           # Drop rows with any missing values
df.dropna(axis=1)     # Drop columns with any missing values
df.dropna(subset=['column'])  # Drop rows missing in specific column

# Fill missing values
df.fillna(0)                    # Fill all with 0
df.fillna({'col1': 0, 'col2': 'Unknown'})  # Fill specific columns
df.fillna(method='ffill')       # Forward fill
df.fillna(method='bfill')       # Backward fill
df.fillna(df.mean())            # Fill with mean

# Replace specific values
df.replace('old_value', 'new_value')
df.replace({1: 'one', 2: 'two'})

# Interpolate missing values
df.interpolate()

Handling Duplicates

# Identify duplicates
df.duplicated()           # Boolean Series
df.duplicated(subset=['column'])  # Check specific columns

# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['column'], keep='first')
# keep: 'first', 'last', False

Renaming Columns

# Rename specific columns
df.rename(columns={'old_name': 'new_name', 'old': 'new'})

# Rename all columns
df.columns = ['Col1', 'Col2', 'Col3']

# Using a mapping function
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')

Changing Data Types

# Convert to specific type
df['Age'] = df['Age'].astype('int64')
df['Name'] = df['Name'].astype('string')

# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

# Convert to numeric (with errors handling)
df['Number'] = pd.to_numeric(df['Number'], errors='coerce')

# Categorical data
df['Category'] = df['Category'].astype('category')

Data Transformation

Applying Functions

# Apply function to columns
df['Age_squared'] = df['Age'].apply(lambda x: x**2)

# Apply function to rows
df.apply(lambda row: row['Age'] + row['Score'], axis=1)

# Map values
df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})

# Replace using a dictionary
df.replace({'Gender': {'M': 'Male', 'F': 'Female'}})

Adding New Columns

# Direct assignment
df['NewColumn'] = 0
df['NewColumn'] = df['Col1'] + df['Col2']

# Using assign
df.assign(NewCol=lambda x: x['Col1'] * 2)

# Using insert (at specific position)
df.insert(0, 'NewFirstCol', df['ExistingCol'])

Dropping Columns/Rows

# Drop columns
df.drop('column_name', axis=1)
df.drop(['col1', 'col2'], axis=1)

# Drop rows
df.drop([0, 1, 2], axis=0)  # Drop by index
df.drop(df[df['Age'] < 18].index)  # Drop rows meeting condition

# In-place modification
df.drop('column', axis=1, inplace=True)

Data Aggregation

Descriptive Statistics

# Basic statistics
df.mean()           # Mean of all numerical columns
df.median()         # Median
df.mode()           # Mode
df.std()            # Standard deviation
df.var()            # Variance
df.min() / df.max() # Min/Max
df.sum()            # Sum
df.corr()           # Correlation matrix
df.cov()            # Covariance matrix

# For specific columns
df['Age'].mean()
df[['Age', 'Score']].describe()

Grouping Data

# Group by a single column
grouped = df.groupby('City')
grouped['Age'].mean()
grouped['Age'].agg(['mean', 'max', 'min', 'count'])

# Group by multiple columns
df.groupby(['City', 'Gender'])['Age'].mean()

# Multiple aggregations
df.groupby('City').agg({
    'Age': ['mean', 'max'],
    'Score': ['mean', 'sum']
})

# Named aggregations
df.groupby('City').agg(
    avg_age=('Age', 'mean'),
    max_score=('Score', 'max')
)

Pivot Tables

# Create pivot table
df.pivot_table(
    values='Sales',
    index='City',
    columns='Product',
    aggfunc='sum',
    fill_value=0
)

# Multiple aggregations
pd.pivot_table(df, 
               values='Sales', 
               index=['City', 'Region'], 
               columns='Product',
               aggfunc=['sum', 'mean'])

Cross Tabulation

pd.crosstab(df['City'], df['Product'])
pd.crosstab(df['City'], df['Product'], values=df['Sales'], aggfunc='sum')

Merging and Joining

Concatenation

# Concatenate rows (vertical)
df_combined = pd.concat([df1, df2], axis=0)

# Concatenate columns (horizontal)
df_combined = pd.concat([df1, df2], axis=1)

# With keys for identification
df_combined = pd.concat([df1, df2], keys=['A', 'B'])

Merging

# Inner join (default)
merged = pd.merge(df1, df2, on='key')

# Left join
merged = pd.merge(df1, df2, on='key', how='left')

# Right join
merged = pd.merge(df1, df2, on='key', how='right')

# Outer join
merged = pd.merge(df1, df2, on='key', how='outer')

# Merge on different columns
merged = pd.merge(df1, df2, left_on='key1', right_on='key2')

# Merge with suffixes for overlapping columns
merged = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))

Joining

# Using join (similar to merge but uses index)
df1.join(df2, on='key')
df1.join(df2, how='inner')

Time Series Data

Datetime Index

# Create datetime index
df.index = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Generate date range
dates = pd.date_range('2023-01-01', periods=10, freq='D')
df = pd.DataFrame(index=dates, data={'value': range(10)})

Date Operations

# Extract date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Quarter'] = df['Date'].dt.quarter

# Date differences
df['Date2'] - df['Date1']  # Returns timedelta
(df['Date2'] - df['Date1']).dt.days

# Shift and lag
df['Shifted'] = df['Value'].shift(1)  # Previous row
df['Shifted_forward'] = df['Value'].shift(-1)  # Next row

# Rolling windows
df['Rolling_Mean'] = df['Value'].rolling(window=7).mean()
df['Rolling_Sum'] = df['Value'].rolling(window=7).sum()

# Resampling
df.resample('M').mean()  # Monthly
df.resample('Q').sum()   # Quarterly
df.resample('Y').max()   # Yearly

Time Zones

# Set timezone
df = df.tz_localize('UTC')
df = df.tz_convert('US/Eastern')

Working with Text Data

# Convert to lowercase/uppercase
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.title()

# String operations
df['Name'].str.len()
df['Name'].str.contains('pattern')
df['Name'].str.startswith('A')
df['Name'].str.endswith('z')

# Extract substrings
df['Name'].str.extract(r'(\d+)')  # Extract digits
df['Name'].str.extractall(r'(\w+)')  # Extract all words

# Split and expand
df['Name'].str.split(' ', expand=True)
df['Name'].str.split(' ', n=1, expand=True)  # Split into 2 columns

# Replace and clean
df['Text'].str.replace('old', 'new')
df['Text'].str.replace('_', ' ', regex=False)
df['Text'].str.strip()  # Remove whitespace
df['Text'].str.lstrip()
df['Text'].str.rstrip()

# Pad strings
df['Name'].str.pad(width=10, side='right', fillchar='-')
df['Name'].str.zfill(5)

# Categorize by string patterns
df['Category'] = df['Text'].str.extract(r'(PatternA|PatternB|PatternC)')

Advanced Features

MultiIndex (Hierarchical Indexing)

# Create MultiIndex
df = df.set_index(['City', 'Year'])
df.loc[('NYC', 2020)]  # Access by both levels

# Create MultiIndex from tuples
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)])
df = pd.DataFrame(index=index)

# Cross-section with xs
df.xs('A', level=0)  # Get all 'A' level 0 entries

# Unstack/Stack
df.unstack()  # Pivot from MultiIndex to columns
df.stack()    # Pivot from columns to MultiIndex

Window Functions

# Expanding windows
df['Expanding_Mean'] = df['Value'].expanding().mean()
df['Expanding_Sum'] = df['Value'].expanding().sum()

# Exponential weighted
df['EWMA'] = df['Value'].ewm(span=10).mean()

# Rank and quantile
df['Rank'] = df['Value'].rank()
df['Percentile'] = df['Value'].rank(pct=True)
df['Quantile'] = df['Value'].quantile(0.75)

Handling Categorical Data

# Create categorical
df['Category'] = pd.Categorical(df['Category'], categories=['Low', 'Medium', 'High'], ordered=True)

# Get codes
df['Category'].cat.codes

# Rename categories
df['Category'].cat.rename_categories(['L', 'M', 'H'])

# Add new category
df['Category'] = df['Category'].cat.add_categories(['Very High'])

# Remove categories
df['Category'] = df['Category'].cat.remove_categories(['Low'])

Memory Optimization

# Check memory usage
df.memory_usage(deep=True)
df.info(memory_usage='deep')

# Optimize data types
df['IntCol'] = pd.to_numeric(df['IntCol'], downcast='integer')
df['FloatCol'] = pd.to_numeric(df['FloatCol'], downcast='float')
df['CatCol'] = df['CatCol'].astype('category')  # Reduce memory for strings

Visualization

Basic Plotting (with matplotlib)

import matplotlib.pyplot as plt

# Line plot
df.plot()
df.plot(kind='line', title='My Plot', figsize=(10, 6))

# Bar plot
df.plot(kind='bar')
df.plot(kind='barh')  # Horizontal

# Histogram
df['Column'].plot(kind='hist', bins=20)

# Box plot
df.plot(kind='box')

# Scatter plot
df.plot(kind='scatter', x='col1', y='col2')

# Area plot
df.plot(kind='area', stacked=True)

# Pie chart
df['Column'].value_counts().plot(kind='pie', autopct='%1.1f%%')

# Density plot
df['Column'].plot(kind='density')

# Multiple plots
df.plot(subplots=True, layout=(2, 2))

# Direct matplotlib integration
df.plot()
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.legend(['Legend'])
plt.show()

Advanced Visualization with Seaborn (Optional)

import seaborn as sns

# Pairplot
sns.pairplot(df)

# Heatmap
sns.heatmap(df.corr(), annot=True)

# Violin plot
sns.violinplot(x='Category', y='Value', data=df)

Performance Optimization

Vectorization

# Instead of loops, use vectorized operations
# Bad:
for i in range(len(df)):
    df.loc[i, 'New'] = df.loc[i, 'A'] + df.loc[i, 'B']

# Good:
df['New'] = df['A'] + df['B']

# Using numpy for complex operations
df['New'] = np.where(df['A'] > 10, df['A'] * 2, df['A'])

Using eval() for Performance

# Use eval for large DataFrames
df.eval('C = A + B')
df.eval('D = (A + B) / C', inplace=True)

Parallel Processing with Dask

# For very large datasets
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=4)
result = ddf.groupby('column').mean().compute()

Chunking for Large Files

# Process large files in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    process(chunk)

Common Recipes and Patterns

Filtering with Complex Conditions

# Multiple conditions
df[(df['Age'] > 18) & (df['Score'] >= 70) | (df['City'] == 'NYC')]

# Using query for readability
df.query('Age > 18 and Score >= 70 or City == "NYC"')

# Using between
df[df['Age'].between(18, 25)]

# Using isin with multiple values
df[df['City'].isin(['NYC', 'LA', 'Chicago'])]

# Using notna/isna
df[df['Email'].notna()]

Cumulative Operations

# Cumulative sum, product, max, min
df['CumSum'] = df['Value'].cumsum()
df['CumProd'] = df['Value'].cumprod()
df['CumMax'] = df['Value'].cummax()
df['CumMin'] = df['Value'].cummin()

Sampling and Randomization

# Random sampling
df.sample(n=10)  # Sample 10 rows
df.sample(frac=0.1)  # Sample 10% of rows

# Sampling with replacement
df.sample(n=10, replace=True)

# Random permutation
df.sample(frac=1).reset_index(drop=True)  # Shuffle entire dataframe

# Stratified sampling
df.groupby('Category', group_keys=False).apply(lambda x: x.sample(frac=0.1))

Date/Time Operations

# Date range
df = df[df['Date'].between('2023-01-01', '2023-12-31')]

# Resampling with custom aggregations
df.resample('M').agg({
    'Sales': 'sum',
    'Orders': 'count',
    'Revenue': lambda x: x.sum() / x.count()
})

# Lagged values
df['Previous_Value'] = df.groupby('ID')['Value'].shift(1)
df['Change'] = df['Value'] - df['Previous_Value']
df['Pct_Change'] = df['Value'].pct_change()

Handling Outliers

# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['Value']))
df_no_outliers = df[z_scores < 3]

# IQR method
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
df_no_outliers = df[(df['Value'] >= Q1 - 1.5*IQR) & (df['Value'] <= Q3 + 1.5*IQR)]

Handling Missing Values Strategically

# Forward fill within groups
df['Value'] = df.groupby('ID')['Value'].fillna(method='ffill')

# Linear interpolation within groups
df['Value'] = df.groupby('ID')['Value'].transform(lambda x: x.interpolate(method='linear'))

# Flag missing data
df['Value_IsMissing'] = df['Value'].isna().astype(int)

Exporting for Different Use Cases

For Machine Learning

# Split features and target
X = df.drop('target', axis=1)
y = df['target']

# Export to numpy
X_numpy = X.values
y_numpy = y.values

# Export to dict/JSON for APIs
df.to_dict(orient='records')
df.to_json(orient='records')

For Reporting

# Formatting floats
df.style.format({'Price': '${:,.2f}', 'Percentage': '{:.2%}'})

# Highlighting
df.style.highlight_max()
df.style.highlight_min()
df.style.background_gradient()

# Export to HTML
df.to_html('table.html', index=False)

# Export with formatting to Excel
with pd.ExcelWriter('report.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    # Add formatting
    worksheet.set_column('A:A', 20)

Error Handling and Debugging

Common Errors and Solutions

# SettingWithCopyWarning - Use .copy() or .loc
df_copy = df.copy()  # Create explicit copy

# KeyError - Check if column exists
if 'column' in df.columns:
    df['column']

# TypeError - Check data types
df.dtypes

# MemoryError - Use chunking or optimization

Debugging with info()

# Quick data inspection
print(df.head())
print(df.info())
print(df.describe())
print(df['column'].value_counts())

Integration with Other Libraries

With NumPy

# Convert to numpy array
arr = df.to_numpy()

# Use numpy functions
df['New'] = np.log(df['Value'])
df['New'] = np.sqrt(df['Value'])

With Matplotlib/Seaborn

import matplotlib.pyplot as plt
import seaborn as sns

df.plot()
plt.show()

sns.boxplot(data=df, x='Category', y='Value')

With Scikit-learn

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['col1', 'col2']])

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['categorical_column'])

With SQL Databases

import sqlalchemy

# Write to SQL
engine = sqlalchemy.create_engine('sqlite:///database.db')
df.to_sql('table', engine, if_exists='replace', index=False)

# Read from SQL
df = pd.read_sql('SELECT * FROM table', engine)

Best Practices

1. Use Vectorized Operations

# Avoid loops
# Bad:
for i, row in df.iterrows():
    df.loc[i, 'New'] = row['A'] + row['B']

# Good:
df['New'] = df['A'] + df['B']

2. Avoid Chained Assignment

# Bad:
df[df['A'] > 5]['B'] = 10

# Good:
df.loc[df['A'] > 5, 'B'] = 10

3. Use .copy() When Needed

df2 = df.copy()  # Prevents SettingWithCopyWarning

4. Optimize Data Types

# Convert to more efficient types
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
df['string_col'] = df['string_col'].astype('category')  # If limited categories

5. Use Chaining for Readability

result = (df
          .query('Age > 18')
          .groupby('City')
          .agg({'Score': 'mean'})
          .reset_index()
          .sort_values('Score', ascending=False))

6. Handle Missing Data Explicitly

# Always decide how to handle missing values
df.dropna()
df.fillna(value)
df.fillna(method='ffill')

Version Compatibility

Check Pandas Version

print(pd.__version__)

Deprecations to Watch

# Old:
df.ix[0, 'column']  # Deprecated

# New:
df.loc[0, 'column']  # Use loc for labels
df.iloc[0, 0]        # Use iloc for integer positions

Real-World Example

Here’s a complete example combining many concepts:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Load and inspect data
df = pd.read_csv('sales_data.csv')
print(df.info())

# 2. Clean the data
df['Date'] = pd.to_datetime(df['Date'])
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
df.dropna(subset=['Sales'], inplace=True)

# 3. Feature engineering
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Quarter'] = df['Date'].dt.quarter
df['Month_Name'] = df['Date'].dt.month_name()

# 4. Aggregations
monthly_sales = df.groupby(['Year', 'Month'])['Sales'].agg(['sum', 'mean', 'count']).reset_index()

# 5. Pivot for analysis
pivot = df.pivot_table(values='Sales', 
                       index='Month_Name', 
                       columns='Year', 
                       aggfunc='sum', 
                       fill_value=0)

# 6. Time series analysis
df.set_index('Date', inplace=True)
df['Rolling_7D'] = df['Sales'].rolling('7D').mean()

# 7. Detection of seasonality
seasonal_stats = df.groupby(df.index.month)['Sales'].agg(['mean', 'std'])

# 8. Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
pivot.plot(kind='bar', ax=axes[0, 0])
axes[0, 0].set_title('Monthly Sales by Year')

df['Sales'].plot(ax=axes[0, 1])
axes[0, 1].set_title('Time Series')

df['Rolling_7D'].plot(ax=axes[1, 0])
axes[1, 0].set_title('7-Day Moving Average')

seasonal_stats.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Seasonal Pattern')

plt.tight_layout()
plt.show()

# 9. Export results
monthly_sales.to_excel('analysis_results.xlsx', index=False)

Summary

Pandas is an essential library for data manipulation in Python. This tutorial covered:

Core data structures (Series, DataFrame)
Data loading and saving (CSV, Excel, JSON, SQL)
Data cleaning and preprocessing
Data transformation and aggregation
Advanced operations (grouping, pivoting, merging)
Time series analysis
Performance optimization
Integration with other libraries

Pandas provides a powerful and intuitive interface for data manipulation, making it an indispensable tool for data scientists, analysts, and anyone working with tabular data in Python.

Last updated on July 3, 2026

NumPy Matplotlib