3 minutes
Pandas - Quick review
Introduction
Pandas, open-source library for data analysis and manipulation in python. The following is like a cheatsheet for myself to go through before interviews. Examples can be found in github.
Importing, Exporting data
# read/write csv from/to file
pd.read_csv(filename) | df.to_csv(filename)
# read from a delimited text file (like TSV)
pd.read_table(filename)
pd.read_excel(filename) | df.to_excel(filename)
# read/write from/to a SQL table/db
pd.read_sql(query, connection_object) | df.to_sql(tablename, connection_object)
# read/write json from/to string, url or file
pd.read_json(json_string) | df.to_json(filename)
# parse an html URL, string or file and extracts tables to a list of dataframes
pd.read_html(url) | df.to_html(filename)
# takes the contents of your clipboard and passes it to read_table(), viceversa
pd.read_clipboard() | df.to_clipboard()
# From a dict, key for columns, values for data as lists
pd.DataFrame(dict)
Viewing/Inspecting data
df.head(n) - First n rows of df
df.tail(n) - Last n rows of df
df.shape() - num of rows & columns
df.info() - index, datatype and memory info
df.describe() - summary statistic for numerical columns
Selection
df[col] - returns column with label col as Series
df[[col1, col2]] - returns columns as a new df
s.iloc[0] - selection by position
s.loc[0] - selection by index
df.iloc[0, :] - first row
df.iloc[0, 0] - first element of first column
Data Cleaning
df.columns = ['a', 'b', 'c'] - renames columns
pd.isnull - checks for null values, return boolean array
pd.notnull()
# options:
# axis = 1 (drops all columns containing null values)
# thresh = n (drops all rows having less than n non-null values)
df.dropna() - drops all rows containing null values
df.fillna(x) - replaces all null values with x
s.fillna(s.mean()) - replaces all null values with mean
s.astype(float) - converts datatype of series to float
s.replace(1, 'one') - replaces all values equal to 1 with 'one'
s.replace([1,3], ['one', 'three']) - replaces all 1 with 'one', 3 with 'three'
df.rename(columns = lambda x : x + 1) - rename columns
df.rename(index = lambda x : x + 1) - rename index
df.rename(columns = {'old_name' : 'new_name}) - selective renaming
df.set_index('column_one') - changes the index
Filter, sort & groupby
df[df[col] > 0.5] - rows where col column is > 0.5
df.sort_values(col1) - sorts values by col1 in ascending order
df.sort_values([col1, col2], ascending=[True, False]) - col1 by ascending order, col2 by descending order
df.groupby(col) - returns a groupby object for values from one column
df.groupby([col1, col2]) - returns a groupby object values from multiple columns
df.groupby(col1)['col2'] - returns mean of values in col2, grouped by values in col1
df.pivot_table(index = col1, values = [col2, col3], aggfunc = mean) - creates a pivot table that groups by col1 and calcultes the mean of col2, col3
df.apply(np.mean) - applies a function across each column
df.apply(np.max, axis = 1) - applies a function across each row
join/combine
df1.append(df2) - adds rows in df1 to end of df2 (columns should be identical)
pd.concat([df1, df2], axis = 1) - adds columns in df1 to end of df2 (rows should be identical)
df1.join(df2, on=col1, how = 'inner') - SQL-style joins the columns in df1 with columns on df2 where the rows for col have identical value. 'how' can be left, right, outer, inner
References : Python Pandas for your grandpa, Dataquest pandas cheatsheet
Read other posts