Getting Started With Pandas
For many this is old news - but I just need to give a shout out to Pandas. What an amazing module for analyzing and manipulating table-like data!
As much as I like SQL and “proper” programming, much of the business world revolves around Excel and CSV. I interact with spreadsheets every day - sometimes multiple times a day. Excel is a workhorse, and it’s not going away anytime soon. But being able to interact with spreadsheets programatically (and without digging into VBA or Excel syntax) has always been a pipe dream of mine.
Enter Pandas
. It’s the power tool (or perhaps the superpower) that you need.
I don’t have much any experience with Python… And I was able to get productive with Pandas in just a couple hours. Here are the steps I took.
Installation
- Install Anaconda.
- That’s it! You’re all set. Start writing python, and then run your scripts through the Anaconda CLI.
Pandas
and a bunch of other packages are included by default.
Getting The Basics
There are two basic types of data that you need to understand in Pandas
- Series, and Dataframes.
Series
A Pandas Series is essentially an indexed array/list. Think of it as one column in a table of data. The data can be any type.
Example:
# Import pandas
import pandas as pd
# Create a series
cities = ["Houston", "Denver", "San Francisco", "Orlando"]
series = pd.Series(cities)
# Print result
print(series)
Output:
0 Houston
1 Denver
2 San Francisco
3 Orlando
dtype: object
Straightforward, right? Let’s move on to Dataframes - that’s where the fun is.
DataFrames
A Pandas DataFrame is a complete table of data - multiple columns, not just one column like a Series. Once again, the data in the columns can be multiple types.
Example:
# Import pandas
import pandas as pd
# Create a dataframe
cities = {
"Cities": ["Houston", "Denver", "San Franscisco", "Orlando"],
"Population": [2300000, 2900000, 883000, 285000]
}
df = pd.DataFrame(cities)
# Print result
print(series)
Output:
Cities Population
0 Houston 2300000
1 Denver 2900000
2 San Franscisco 883000
3 Orlando 285000
Importing and Exporting Files
The last step to being productive is importing and exporting files.
To import a CSV file, use pandas.read_csv()
. To export to external file use Series.to_csv()
or DataFrame.to_csv()
.
Example CSV File - cities.csv:
Cities | Population |
---|---|
Houston | 2300000 |
Denver | 2900000 |
San Francisco | 883000 |
Orlando | 285000 |
Example Code:
# Imports
# Numpy is another common python package that works with python and pandas - it comes with Anaconda
import pandas as pd
import numpy as np
# Import CSV as DataFrame
df = pd.read_csv("cities.csv")
# Add additional column
df["Population_Over_Million"] = np.where(df["Population"] > 1000000, True, False)
# print, then export result
print(df)
df.to_csv("export.csv")
Output:
Cities Population Population_Over_Million
0 Houston 2300000 True
1 Denver 2900000 True
2 San Franscisco 883000 False
3 Orlando 285000 False
Unleash the Magic
If you’re like me, your imagination is already running wild with all the operations that are possible. We’ve only barley scratched the surface of what Pandas is capable of. You can:
- Clean data by removing rows that have empty cells, duplicates, errors, etc.
- Format data in an scriptable manner
- Find correlations between columns via the
.corr()
function - Create diagrams using
.plot()
- etc.
If you can do it in Excel, chances are Pandas can do it as well - faster, in plain text, and all from the comfort of your CLI or code editor. You can also manipulate massive files that would bring Excel to its knees.
Further Reading
I found the W3Schools tutorial to be simple and helpful. The Pandas documentation is also robust.