Learning Pandas - When SQL Isn't Enough

1/6/2026 · Duhon Young · Programming · 7 min read

When I started working with data more seriously at my job, I realized SQL could only take me so far. I was writing queries in Databricks and Snowflake, getting results back, and then thinking "okay now what?" That's when I had to learn pandas.

Why Pandas?

I already knew SQL pretty well at this point. I could write complex queries, use CTEs, all that stuff. But sometimes you need to do transformations or analysis that are just awkward in SQL, or you've already pulled the data and need to mess with it locally. That's where pandas comes in.

Pandas is a Python library for working with data. Think of it like Excel but way more powerful and you control it with code instead of clicking around.

The Learning Curve

I'm not gonna lie - pandas was confusing at first. Coming from SQL, the syntax felt weird. But once I got the basics down, it became super useful.

DataFrames - The Core Concept

Everything in pandas revolves around DataFrames. It's basically a table - rows and columns, just like a SQL result set.

import pandas as pd

# Reading a CSV file
df = pd.read_csv('data.csv')

# Or loading SQL results from Databricks/Snowflake - Spark DataFrames have a
# .toPandas() method and the Snowflake connector has fetch_pandas_all(), or
# you can build one by hand from a list of rows
df = pd.DataFrame(sql_results)

# Basic stuff
print(df.head())  # First 5 rows
print(df.info())  # Column types and non-null counts
print(df.describe())  # Quick statistics

Data Cleaning and Transformation

This is where pandas really shines. Stuff that would be annoying in SQL becomes way easier.

# Handling missing data (two alternatives - drop the rows, or fill the gaps)
df = df.dropna()  # Drop rows with missing values
df['column'] = df['column'].fillna(0)  # Or fill missing values with 0

# String operations
df['name'] = df['name'].str.upper()
df['email_domain'] = df['email'].str.split('@').str[1]

# Date handling
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# Filtering (like WHERE in SQL)
df_filtered = df[df['amount'] > 100]
df_filtered = df[(df['status'] == 'active') & (df['amount'] > 50)]

The date handling especially: once a column is a proper datetime, everything hangs off .dt, which beats fighting with nested date functions in SQL.
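
For instance, date arithmetic that takes nested function calls in SQL is a line or two here. A quick sketch (the signup_date column is made up for illustration):

# Days between two datetime columns, plus truncating dates to month start
df['days_since_signup'] = (df['date'] - df['signup_date']).dt.days
df['month_start'] = df['date'].dt.to_period('M').dt.to_timestamp()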

Groupby and Aggregations

If you know GROUP BY in SQL, this will make sense. But pandas gives you more flexibility.

# Basic groupby
df.groupby('category')['amount'].sum()

# Multiple aggregations
df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'user_id': 'nunique'
})

# Pivot tables (like Excel pivot tables)
pivot = df.pivot_table(
    values='amount',
    index='category',
    columns='month',
    aggfunc='sum'
)
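
One gotcha with the dict-style agg above: it gives you multi-level column names, which are annoying to work with downstream. If you're on pandas 0.25 or newer, named aggregation produces the same numbers with flat columns:

# Named aggregation - same results as the dict version, but flat column names
summary = df.groupby('category').agg(
    total_amount=('amount', 'sum'),
    avg_amount=('amount', 'mean'),
    num_orders=('amount', 'count'),
    unique_users=('user_id', 'nunique')
)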

Working with SQL Results

Here's how I actually use pandas day-to-day. I'll run a query in Databricks or Snowflake to get the bulk of the data, then use pandas for the finishing touches.

For example:

  • SQL does the heavy lifting: joins, filters, big aggregations
  • Pandas does the cleanup: formatting, final calculations, reshaping data

# After getting results from a SQL query
df = pd.DataFrame(query_results)

# Now do transformations that would be annoying in SQL
df['percent_change'] = ((df['current'] - df['previous']) / df['previous'] * 100).round(2)
df['formatted_name'] = df['first_name'] + ' ' + df['last_name']

# Or reshape data
df_wide = df.pivot(index='date', columns='product', values='sales')
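
And when you need to go the other way - wide back to long - melt() is the inverse of pivot():

# Unpivot the wide table from above back into long format
df_long = df_wide.reset_index().melt(
    id_vars='date',
    var_name='product',
    value_name='sales'
)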

Real-World Example: Exploding Comma-Separated Data

Today I ran into a situation that would've been a nightmare in SQL but pandas handled it perfectly. I got a dataset where one column had multiple apps listed in a single cell, separated by commas.

Like this:

user_id | name  | apps
--------|-------|------------------
1       | John  | App1, App2, App3
2       | Jane  | App4, App5

But I needed each app in its own row, keeping all the other data the same:

user_id | name  | app
--------|-------|------
1       | John  | App1
1       | John  | App2
1       | John  | App3
2       | Jane  | App4
2       | Jane  | App5

Here's what I did:

# First, split the comma-separated apps into a list
df['apps'] = df['apps'].str.split(',')

# Then explode - creates a new row for each item in the list
df_exploded = df.explode('apps')

# Clean up the app names (remove extra spaces)
df_exploded['apps'] = df_exploded['apps'].str.strip()

# Reset the index since we now have more rows
df_exploded = df_exploded.reset_index(drop=True)

Could I have done this in SQL? Probably - most warehouses have a split-to-rows function (Snowflake has SPLIT_TO_TABLE, Spark SQL has explode()) - but I'd already pulled the data down, and the pandas version is four readable lines.

This kind of stuff is where pandas really saves you. Data doesn't always come in the format you need it, and reshaping it in pandas is way easier than trying to force SQL to do it.

The Big Realization: When to Use What

This took me a while to figure out. At first I'd try to do everything in SQL, or everything in pandas. But the right answer is using both.

Use SQL when:

  • Working with large datasets that live in a database
  • You need to join multiple tables
  • Doing aggregations on millions of rows
  • The database is optimized for what you're doing

Use pandas when:

  • You've already pulled data and need to transform it
  • Doing complex string manipulations or date operations
  • Creating pivot tables or reshaping data
  • Need to combine data from multiple sources (CSV, API, SQL results)
  • Working with smaller datasets locally

Basically: SQL for the heavy lifting, pandas for the final mile.
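
To make that concrete, here's the shape of the workflow. This is a minimal sketch - the table, columns, and connection object are made up, and you'd swap in whatever your Databricks/Snowflake setup gives you:

import pandas as pd

# SQL does the heavy lifting: join, filter, aggregate in the warehouse
query = """
    SELECT category, order_month, SUM(amount) AS total_amount
    FROM orders
    WHERE status = 'active'
    GROUP BY category, order_month
"""
df = pd.read_sql(query, connection)  # connection: your database connection

# Pandas does the final mile: reshape for the report
report = df.pivot_table(
    values='total_amount',
    index='category',
    columns='order_month',
    aggfunc='sum'
)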

Common Mistakes I Made

Trying to load too much data - Don't pull a million rows from the database into pandas. Filter in SQL first, then load what you need.
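
If you genuinely can't filter everything server-side, pandas can at least stream the results in chunks instead of loading them all at once. A sketch, assuming the same kind of connection object as above:

# Read query results in chunks and keep only the rows you need
chunks = pd.read_sql(query, connection, chunksize=50_000)
df = pd.concat(chunk[chunk['amount'] > 100] for chunk in chunks)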

Not using vectorized operations - In pandas, don't loop through rows. Use pandas operations that work on the whole column at once. Way faster.

# Slow - don't do this
for index, row in df.iterrows():
    df.at[index, 'new_col'] = row['col1'] + row['col2']

# Fast - do this
df['new_col'] = df['col1'] + df['col2']

Forgetting about chaining - You can chain operations together instead of making a bunch of intermediate variables.

# Instead of this
df_temp = df[df['status'] == 'active']
df_temp2 = df_temp.groupby('category')['amount'].sum()
result = df_temp2.reset_index()

# Do this
result = (df[df['status'] == 'active']
    .groupby('category')['amount']
    .sum()
    .reset_index()
)

What I Use Pandas For

Without getting into work specifics, here's the kind of stuff I do with pandas:

  • Pull query results from Databricks/Snowflake and clean them up
  • Combine data from different sources
  • Transform data into formats needed for reports or dashboards
  • Quick exploratory analysis on datasets
  • Preparing data before loading it somewhere else

Learning Resources That Helped

The official pandas documentation - Actually pretty good once you know what you're looking for

Stack Overflow - Seriously, so many pandas questions have been answered there

Just trying stuff - Load a CSV and start messing with it. That's how most of this stuck for me

Moving Forward

Pandas is one of those tools that once you learn it, you find uses for it everywhere. It's not replacing SQL - they work together. SQL gets you the data, pandas helps you shape it into what you actually need.

If you're already comfortable with SQL and work with data, learning pandas is worth it. Start small - read a CSV, filter some rows, do a groupby. Then build from there.
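
Something like this is all a first session needs (sales.csv and its columns are made up - point it at any CSV you have):

import pandas as pd

df = pd.read_csv('sales.csv')                       # read a CSV
active = df[df['status'] == 'active']               # filter some rows
totals = active.groupby('region')['amount'].sum()   # do a groupby
print(totals)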

And remember: you don't need to memorize all the functions. I look stuff up constantly. The important part is knowing what's possible, then you can Google the exact syntax when you need it.
