Pandas
Pandas is a powerful, open-source data manipulation and analysis library for Python. It provides high-performance, easy-to-use data structures and tools for working with structured (tabular, multidimensional, heterogeneous) and time-series data.
This tutorial covers everything from installation to advanced features, with practical examples.
Installation and Setup
Installing Pandas
pip install pandasOr with conda:
conda install pandasImporting Pandas
import pandas as pd
import numpy as np # Often used alongside pandasCore Data Structures
Pandas has two primary data structures:
Series
A one-dimensional labeled array capable of holding any data type.
# Create a Series from a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)
# 0 1
# 1 3
# 2 5
# 3 7
# 4 9
# dtype: int64
# With custom index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s['c']) # 5DataFrame
A two-dimensional labeled data structure with columns of potentially different types.
# Create a DataFrame from a dictionary
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 28],
'City': ['NYC', 'LA', 'Chicago', 'Miami']
})
print(df)Creating DataFrames
From Dictionary
df = pd.DataFrame({
'A': [1, 2, 3],
'B': ['x', 'y', 'z']
})From List of Dictionaries
data = [
{'name': 'Alice', 'age': 25},
{'name': 'Bob', 'age': 30}
]
df = pd.DataFrame(data)From CSV
df = pd.read_csv('filename.csv')From Excel
df = pd.read_excel('filename.xlsx', sheet_name='Sheet1')From NumPy Array
arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])Reading and Writing Data
Reading Data
# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', index_col=0) # Set first column as index
df = pd.read_csv('data.csv', encoding='utf-8')
# Excel
df = pd.read_excel('data.xlsx', sheet_name=0)
# JSON
df = pd.read_json('data.json')
# SQL
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM table', conn)Writing Data
# CSV
df.to_csv('output.csv', index=False)
# Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
# JSON
df.to_json('output.json', orient='records')
# SQL
df.to_sql('table_name', conn, if_exists='replace', index=False)Viewing and Inspecting Data
# View first/last rows
df.head(5) # First 5 rows
df.tail(3) # Last 3 rows
# Dimensions
df.shape # (rows, columns)
df.size # Total number of elements
# Info about DataFrame
df.info() # Column types, non-null counts, memory usage
# Statistical summary
df.describe() # Only numerical columns by default
df.describe(include='all') # All columns
# Column names
df.columns
# Data types
df.dtypes
# Unique values in a column
df['column'].unique()
df['column'].nunique() # Count of unique values
# Value counts
df['column'].value_counts()
# Check for null values
df.isnull().sum()Selecting Data
Selecting Columns
# Single column (returns Series)
df['Name']
df.Name # Alternative syntax
# Multiple columns (returns DataFrame)
df[['Name', 'Age']]
# Columns with dot notation don't work if column name has spaces or special charsSelecting Rows
# By index position using iloc (integer location)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[0, 1] # First row, second column
df.iloc[:, 0:2] # All rows, first 2 columns
# By label using loc
df.loc[0] # Row with index label 0
df.loc[0:3, 'Name'] # Rows 0-3, column 'Name'
df.loc[df['Age'] > 25, :] # Conditional selection
# Boolean indexing
df[df['Age'] > 25]
df[(df['Age'] > 25) & (df['City'] == 'NYC')]
# isin method
df[df['City'].isin(['NYC', 'LA'])]Using query() Method
df.query('Age > 25 and City == "NYC"')Data Cleaning
Handling Missing Values
# Detect missing values
df.isnull() # Boolean DataFrame
df.isnull().sum() # Count missing per column
# Drop missing values
df.dropna() # Drop rows with any missing values
df.dropna(axis=1) # Drop columns with any missing values
df.dropna(subset=['column']) # Drop rows missing in specific column
# Fill missing values
df.fillna(0) # Fill all with 0
df.fillna({'col1': 0, 'col2': 'Unknown'}) # Fill specific columns
df.fillna(method='ffill') # Forward fill
df.fillna(method='bfill') # Backward fill
df.fillna(df.mean()) # Fill with mean
# Replace specific values
df.replace('old_value', 'new_value')
df.replace({1: 'one', 2: 'two'})
# Interpolate missing values
df.interpolate()Handling Duplicates
# Identify duplicates
df.duplicated() # Boolean Series
df.duplicated(subset=['column']) # Check specific columns
# Remove duplicates
df.drop_duplicates()
df.drop_duplicates(subset=['column'], keep='first')
# keep: 'first', 'last', FalseRenaming Columns
# Rename specific columns
df.rename(columns={'old_name': 'new_name', 'old': 'new'})
# Rename all columns
df.columns = ['Col1', 'Col2', 'Col3']
# Using a mapping function
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')Changing Data Types
# Convert to specific type
df['Age'] = df['Age'].astype('int64')
df['Name'] = df['Name'].astype('string')
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# Convert to numeric (with errors handling)
df['Number'] = pd.to_numeric(df['Number'], errors='coerce')
# Categorical data
df['Category'] = df['Category'].astype('category')Data Transformation
Applying Functions
# Apply function to columns
df['Age_squared'] = df['Age'].apply(lambda x: x**2)
# Apply function to rows
df.apply(lambda row: row['Age'] + row['Score'], axis=1)
# Map values
df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})
# Replace using a dictionary
df.replace({'Gender': {'M': 'Male', 'F': 'Female'}})Adding New Columns
# Direct assignment
df['NewColumn'] = 0
df['NewColumn'] = df['Col1'] + df['Col2']
# Using assign
df.assign(NewCol=lambda x: x['Col1'] * 2)
# Using insert (at specific position)
df.insert(0, 'NewFirstCol', df['ExistingCol'])Dropping Columns/Rows
# Drop columns
df.drop('column_name', axis=1)
df.drop(['col1', 'col2'], axis=1)
# Drop rows
df.drop([0, 1, 2], axis=0) # Drop by index
df.drop(df[df['Age'] < 18].index) # Drop rows meeting condition
# In-place modification
df.drop('column', axis=1, inplace=True)Data Aggregation
Descriptive Statistics
# Basic statistics
df.mean() # Mean of all numerical columns
df.median() # Median
df.mode() # Mode
df.std() # Standard deviation
df.var() # Variance
df.min() / df.max() # Min/Max
df.sum() # Sum
df.corr() # Correlation matrix
df.cov() # Covariance matrix
# For specific columns
df['Age'].mean()
df[['Age', 'Score']].describe()Grouping Data
# Group by a single column
grouped = df.groupby('City')
grouped['Age'].mean()
grouped['Age'].agg(['mean', 'max', 'min', 'count'])
# Group by multiple columns
df.groupby(['City', 'Gender'])['Age'].mean()
# Multiple aggregations
df.groupby('City').agg({
'Age': ['mean', 'max'],
'Score': ['mean', 'sum']
})
# Named aggregations
df.groupby('City').agg(
avg_age=('Age', 'mean'),
max_score=('Score', 'max')
)Pivot Tables
# Create pivot table
df.pivot_table(
values='Sales',
index='City',
columns='Product',
aggfunc='sum',
fill_value=0
)
# Multiple aggregations
pd.pivot_table(df,
values='Sales',
index=['City', 'Region'],
columns='Product',
aggfunc=['sum', 'mean'])Cross Tabulation
pd.crosstab(df['City'], df['Product'])
pd.crosstab(df['City'], df['Product'], values=df['Sales'], aggfunc='sum')Merging and Joining
Concatenation
# Concatenate rows (vertical)
df_combined = pd.concat([df1, df2], axis=0)
# Concatenate columns (horizontal)
df_combined = pd.concat([df1, df2], axis=1)
# With keys for identification
df_combined = pd.concat([df1, df2], keys=['A', 'B'])Merging
# Inner join (default)
merged = pd.merge(df1, df2, on='key')
# Left join
merged = pd.merge(df1, df2, on='key', how='left')
# Right join
merged = pd.merge(df1, df2, on='key', how='right')
# Outer join
merged = pd.merge(df1, df2, on='key', how='outer')
# Merge on different columns
merged = pd.merge(df1, df2, left_on='key1', right_on='key2')
# Merge with suffixes for overlapping columns
merged = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))Joining
# Using join (similar to merge but uses index)
df1.join(df2, on='key')
df1.join(df2, how='inner')Time Series Data
Datetime Index
# Create datetime index
df.index = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Generate date range
dates = pd.date_range('2023-01-01', periods=10, freq='D')
df = pd.DataFrame(index=dates, data={'value': range(10)})Date Operations
# Extract date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['Quarter'] = df['Date'].dt.quarter
# Date differences
df['Date2'] - df['Date1'] # Returns timedelta
(df['Date2'] - df['Date1']).dt.days
# Shift and lag
df['Shifted'] = df['Value'].shift(1) # Previous row
df['Shifted_forward'] = df['Value'].shift(-1) # Next row
# Rolling windows
df['Rolling_Mean'] = df['Value'].rolling(window=7).mean()
df['Rolling_Sum'] = df['Value'].rolling(window=7).sum()
# Resampling
df.resample('M').mean() # Monthly
df.resample('Q').sum() # Quarterly
df.resample('Y').max() # YearlyTime Zones
# Set timezone
df = df.tz_localize('UTC')
df = df.tz_convert('US/Eastern')Working with Text Data
# Convert to lowercase/uppercase
df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.title()
# String operations
df['Name'].str.len()
df['Name'].str.contains('pattern')
df['Name'].str.startswith('A')
df['Name'].str.endswith('z')
# Extract substrings
df['Name'].str.extract(r'(\d+)') # Extract digits
df['Name'].str.extractall(r'(\w+)') # Extract all words
# Split and expand
df['Name'].str.split(' ', expand=True)
df['Name'].str.split(' ', n=1, expand=True) # Split into 2 columns
# Replace and clean
df['Text'].str.replace('old', 'new')
df['Text'].str.replace('_', ' ', regex=False)
df['Text'].str.strip() # Remove whitespace
df['Text'].str.lstrip()
df['Text'].str.rstrip()
# Pad strings
df['Name'].str.pad(width=10, side='right', fillchar='-')
df['Name'].str.zfill(5)
# Categorize by string patterns
df['Category'] = df['Text'].str.extract(r'(PatternA|PatternB|PatternC)')Advanced Features
MultiIndex (Hierarchical Indexing)
# Create MultiIndex
df = df.set_index(['City', 'Year'])
df.loc[('NYC', 2020)] # Access by both levels
# Create MultiIndex from tuples
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1)])
df = pd.DataFrame(index=index)
# Cross-section with xs
df.xs('A', level=0) # Get all 'A' level 0 entries
# Unstack/Stack
df.unstack() # Pivot from MultiIndex to columns
df.stack() # Pivot from columns to MultiIndexWindow Functions
# Expanding windows
df['Expanding_Mean'] = df['Value'].expanding().mean()
df['Expanding_Sum'] = df['Value'].expanding().sum()
# Exponential weighted
df['EWMA'] = df['Value'].ewm(span=10).mean()
# Rank and quantile
df['Rank'] = df['Value'].rank()
df['Percentile'] = df['Value'].rank(pct=True)
df['Quantile'] = df['Value'].quantile(0.75)Handling Categorical Data
# Create categorical
df['Category'] = pd.Categorical(df['Category'], categories=['Low', 'Medium', 'High'], ordered=True)
# Get codes
df['Category'].cat.codes
# Rename categories
df['Category'].cat.rename_categories(['L', 'M', 'H'])
# Add new category
df['Category'] = df['Category'].cat.add_categories(['Very High'])
# Remove categories
df['Category'] = df['Category'].cat.remove_categories(['Low'])Memory Optimization
# Check memory usage
df.memory_usage(deep=True)
df.info(memory_usage='deep')
# Optimize data types
df['IntCol'] = pd.to_numeric(df['IntCol'], downcast='integer')
df['FloatCol'] = pd.to_numeric(df['FloatCol'], downcast='float')
df['CatCol'] = df['CatCol'].astype('category') # Reduce memory for stringsVisualization
Basic Plotting (with matplotlib)
import matplotlib.pyplot as plt
# Line plot
df.plot()
df.plot(kind='line', title='My Plot', figsize=(10, 6))
# Bar plot
df.plot(kind='bar')
df.plot(kind='barh') # Horizontal
# Histogram
df['Column'].plot(kind='hist', bins=20)
# Box plot
df.plot(kind='box')
# Scatter plot
df.plot(kind='scatter', x='col1', y='col2')
# Area plot
df.plot(kind='area', stacked=True)
# Pie chart
df['Column'].value_counts().plot(kind='pie', autopct='%1.1f%%')
# Density plot
df['Column'].plot(kind='density')
# Multiple plots
df.plot(subplots=True, layout=(2, 2))
# Direct matplotlib integration
df.plot()
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Y Label')
plt.legend(['Legend'])
plt.show()Advanced Visualization with Seaborn (Optional)
import seaborn as sns
# Pairplot
sns.pairplot(df)
# Heatmap
sns.heatmap(df.corr(), annot=True)
# Violin plot
sns.violinplot(x='Category', y='Value', data=df)Performance Optimization
Vectorization
# Instead of loops, use vectorized operations
# Bad:
for i in range(len(df)):
df.loc[i, 'New'] = df.loc[i, 'A'] + df.loc[i, 'B']
# Good:
df['New'] = df['A'] + df['B']
# Using numpy for complex operations
df['New'] = np.where(df['A'] > 10, df['A'] * 2, df['A'])Using eval() for Performance
# Use eval for large DataFrames
df.eval('C = A + B')
df.eval('D = (A + B) / C', inplace=True)Parallel Processing with Dask
# For very large datasets
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=4)
result = ddf.groupby('column').mean().compute()Chunking for Large Files
# Process large files in chunks
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
# Process each chunk
process(chunk)Common Recipes and Patterns
Filtering with Complex Conditions
# Multiple conditions
df[(df['Age'] > 18) & (df['Score'] >= 70) | (df['City'] == 'NYC')]
# Using query for readability
df.query('Age > 18 and Score >= 70 or City == "NYC"')
# Using between
df[df['Age'].between(18, 25)]
# Using isin with multiple values
df[df['City'].isin(['NYC', 'LA', 'Chicago'])]
# Using notna/isna
df[df['Email'].notna()]Cumulative Operations
# Cumulative sum, product, max, min
df['CumSum'] = df['Value'].cumsum()
df['CumProd'] = df['Value'].cumprod()
df['CumMax'] = df['Value'].cummax()
df['CumMin'] = df['Value'].cummin()Sampling and Randomization
# Random sampling
df.sample(n=10) # Sample 10 rows
df.sample(frac=0.1) # Sample 10% of rows
# Sampling with replacement
df.sample(n=10, replace=True)
# Random permutation
df.sample(frac=1).reset_index(drop=True) # Shuffle entire dataframe
# Stratified sampling
df.groupby('Category', group_keys=False).apply(lambda x: x.sample(frac=0.1))Date/Time Operations
# Date range
df = df[df['Date'].between('2023-01-01', '2023-12-31')]
# Resampling with custom aggregations
df.resample('M').agg({
'Sales': 'sum',
'Orders': 'count',
'Revenue': lambda x: x.sum() / x.count()
})
# Lagged values
df['Previous_Value'] = df.groupby('ID')['Value'].shift(1)
df['Change'] = df['Value'] - df['Previous_Value']
df['Pct_Change'] = df['Value'].pct_change()Handling Outliers
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['Value']))
df_no_outliers = df[z_scores < 3]
# IQR method
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
df_no_outliers = df[(df['Value'] >= Q1 - 1.5*IQR) & (df['Value'] <= Q3 + 1.5*IQR)]Handling Missing Values Strategically
# Forward fill within groups
df['Value'] = df.groupby('ID')['Value'].fillna(method='ffill')
# Linear interpolation within groups
df['Value'] = df.groupby('ID')['Value'].transform(lambda x: x.interpolate(method='linear'))
# Flag missing data
df['Value_IsMissing'] = df['Value'].isna().astype(int)Exporting for Different Use Cases
For Machine Learning
# Split features and target
X = df.drop('target', axis=1)
y = df['target']
# Export to numpy
X_numpy = X.values
y_numpy = y.values
# Export to dict/JSON for APIs
df.to_dict(orient='records')
df.to_json(orient='records')For Reporting
# Formatting floats
df.style.format({'Price': '${:,.2f}', 'Percentage': '{:.2%}'})
# Highlighting
df.style.highlight_max()
df.style.highlight_min()
df.style.background_gradient()
# Export to HTML
df.to_html('table.html', index=False)
# Export with formatting to Excel
with pd.ExcelWriter('report.xlsx', engine='xlsxwriter') as writer:
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Add formatting
worksheet.set_column('A:A', 20)Error Handling and Debugging
Common Errors and Solutions
# SettingWithCopyWarning - Use .copy() or .loc
df_copy = df.copy() # Create explicit copy
# KeyError - Check if column exists
if 'column' in df.columns:
df['column']
# TypeError - Check data types
df.dtypes
# MemoryError - Use chunking or optimizationDebugging with info()
# Quick data inspection
print(df.head())
print(df.info())
print(df.describe())
print(df['column'].value_counts())Integration with Other Libraries
With NumPy
# Convert to numpy array
arr = df.to_numpy()
# Use numpy functions
df['New'] = np.log(df['Value'])
df['New'] = np.sqrt(df['Value'])With Matplotlib/Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
df.plot()
plt.show()
sns.boxplot(data=df, x='Category', y='Value')With Scikit-learn
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[['col1', 'col2']])
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['categorical_column'])With SQL Databases
import sqlalchemy
# Write to SQL
engine = sqlalchemy.create_engine('sqlite:///database.db')
df.to_sql('table', engine, if_exists='replace', index=False)
# Read from SQL
df = pd.read_sql('SELECT * FROM table', engine)Best Practices
1. Use Vectorized Operations
# Avoid loops
# Bad:
for i, row in df.iterrows():
df.loc[i, 'New'] = row['A'] + row['B']
# Good:
df['New'] = df['A'] + df['B']2. Avoid Chained Assignment
# Bad:
df[df['A'] > 5]['B'] = 10
# Good:
df.loc[df['A'] > 5, 'B'] = 103. Use .copy() When Needed
df2 = df.copy() # Prevents SettingWithCopyWarning4. Optimize Data Types
# Convert to more efficient types
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
df['string_col'] = df['string_col'].astype('category') # If limited categories5. Use Chaining for Readability
result = (df
.query('Age > 18')
.groupby('City')
.agg({'Score': 'mean'})
.reset_index()
.sort_values('Score', ascending=False))6. Handle Missing Data Explicitly
# Always decide how to handle missing values
df.dropna()
df.fillna(value)
df.fillna(method='ffill')Version Compatibility
Check Pandas Version
print(pd.__version__)Deprecations to Watch
# Old:
df.ix[0, 'column'] # Deprecated
# New:
df.loc[0, 'column'] # Use loc for labels
df.iloc[0, 0] # Use iloc for integer positionsReal-World Example
Here’s a complete example combining many concepts:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# 1. Load and inspect data
df = pd.read_csv('sales_data.csv')
print(df.info())
# 2. Clean the data
df['Date'] = pd.to_datetime(df['Date'])
df['Sales'] = pd.to_numeric(df['Sales'], errors='coerce')
df.dropna(subset=['Sales'], inplace=True)
# 3. Feature engineering
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Quarter'] = df['Date'].dt.quarter
df['Month_Name'] = df['Date'].dt.month_name()
# 4. Aggregations
monthly_sales = df.groupby(['Year', 'Month'])['Sales'].agg(['sum', 'mean', 'count']).reset_index()
# 5. Pivot for analysis
pivot = df.pivot_table(values='Sales',
index='Month_Name',
columns='Year',
aggfunc='sum',
fill_value=0)
# 6. Time series analysis
df.set_index('Date', inplace=True)
df['Rolling_7D'] = df['Sales'].rolling('7D').mean()
# 7. Detection of seasonality
seasonal_stats = df.groupby(df.index.month)['Sales'].agg(['mean', 'std'])
# 8. Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
pivot.plot(kind='bar', ax=axes[0, 0])
axes[0, 0].set_title('Monthly Sales by Year')
df['Sales'].plot(ax=axes[0, 1])
axes[0, 1].set_title('Time Series')
df['Rolling_7D'].plot(ax=axes[1, 0])
axes[1, 0].set_title('7-Day Moving Average')
seasonal_stats.plot(kind='bar', ax=axes[1, 1])
axes[1, 1].set_title('Seasonal Pattern')
plt.tight_layout()
plt.show()
# 9. Export results
monthly_sales.to_excel('analysis_results.xlsx', index=False)Summary
Pandas is an essential library for data manipulation in Python. This tutorial covered:
- Core data structures (Series, DataFrame)
- Data loading and saving (CSV, Excel, JSON, SQL)
- Data cleaning and preprocessing
- Data transformation and aggregation
- Advanced operations (grouping, pivoting, merging)
- Time series analysis
- Performance optimization
- Integration with other libraries
Pandas provides a powerful and intuitive interface for data manipulation, making it an indispensable tool for data scientists, analysts, and anyone working with tabular data in Python.