A cheatsheet with examples for the common Pandas DataFrame operations.
CreateProperties
Number of rows and columns
Number of columns
Number of rows
Column names/labels
Row names/labels/index Column data type
Query/Select/Slice Data
Indexing operator
[]
.loc
.iloc
Modify Data
Add column(s)
Remove column(s)
Add row(s)
Remove row(s)
Modify column(s)
Modify row(s)
Modify Index
Change column label
Change row label
Change column index name
Change row index name
Descriptive Info
.describe
.info
Create
Import |
In [1]: import pandas as pd |
Data as dictionary |
In [2]: pd.DataFrame(data = {'A': ['0A', '1A', '2A', '3A', '4A'], 'B': ['00.0 ', '10.0', '20.0', '30.0', '40.0'], 'C': [True, False, True, False, True], 'D': [2000, 2010, 2020, 2030, 2040] } ) Out[2]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
Data as a list Creates default numeric row and column labels |
In [3]: pd.DataFrame( [['0A', '00.0', True, 2000], ['1A', '10.0', False, 2010], ['2A', '20.0', True, 2020], ['3A', '30.0', False, 2030], ['4A', '40.0', True, 2040]] ) Out[3]: 0 1 2 3 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
The following
| |
Data as list Using optional row and column labels |
In [4] df = pd.DataFrame(data = [['0A', '00.0 ', True, 2000], ['1A', '10.0 ', False, 2010], ['2A', '20.0 ', True, 2020], ['3A', '30.0 ', False, 2030], ['4A', '40.0 ', True, 2040]], index = ['0', '1', '2', '3', '4'], columns = ['A', 'B', 'C', 'D']) In [5]: df Out[5]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
Properties
Number of rows and columns |
In [6]: nrow,ncol = df.shape In [7]: nrow, ncol Out[7]: (5, 4) |
Number of columns |
In [8]: df.shape[1] Out[8]: 4 In [9]: len(df.columns) Out[9]: 4 |
Number of rows |
In [10]: df.shape[0] Out[10]: 5 In [11]: len(df) Out[11]: 5 In [12]: len(df.index) Out[12]: 5 |
Column names/labels |
In [13]: df.columns Out[13]: Index(['A', 'B', 'C', 'D'], dtype='object') In [14]: df.columns.to_list() Out[14]: ['A', 'B', 'C', 'D'] |
Row names/labels/index |
In [15]: df.index Out[15]: Index(['0', '1', '2', '3', '4'], dtype='object') In [16]: df.index.to_list() Out[16]: ['0', '1', '2', '3', '4'] |
Column data type(s) |
In [17]: df.dtypes Out[17]: A object B object C bool D int64 dtype: object |
Query/Select/Slice Data
Common ways to specify a selection
- A single label or position
- A list or array of labels or positions
- A slice object with labels or positions
- A boolean array (literals or generated dynamically) of the same length as the axis being sliced
Note: Row and column index can have integer labels. It very important to distinguish between integer labels and integer positions.
Indexing operator
| |
Column data by label
Note: Difference between
|
In [18]: df['A'] Out[18]: 0 0A 1 1A 2 2A 3 3A 4 4A Name: A, dtype: object In [19]: df[['A']] Out[19]: A 0 0A 1 1A 2 2A 3 3A 4 4A In [20]: df[['A', 'B']] Out[20]: A B 0 0A 00.0 1 1A 10.0 2 2A 20.0 3 3A 30.0 4 4A 40.0 |
Row data by label slice
|
In [21]: df['0':'2'] Out[21]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 |
Row data by position slice
|
In [22]: df[0:2] Out[22]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 |
Row data by boolean indexing |
In [23]: df[[True, False, True, False, True]] Out[23]: A B C D 0 0A 00.0 True 2000 2 2A 20.0 True 2020 4 4A 40.0 True 2040 In [24]: df[ pd.Series({'0':True, '1':False, '2':True, '3':False, '4':True}) ] Out[24]: A B C D 0 0A 00.0 True 2000 2 2A 20.0 True 2020 4 4A 40.0 True 2040 |
.locAccess rows and columns by label(s) or a boolean array. Note: Unlike a normal function, In [25]: type(df.loc) Out[25]: pandas.core.indexing._LocIndexer | |
DataFrame.loc[] Note: Shortcuts to specify all rows or columns
|
In [26]: df.loc['1':'3', ['A','C']] Out[26]: A C 1 1A False 2 2A True 3 3A False In [27]: df.loc['1':'3', 'B':'C'] Out[27]: B C 1 10.0 False 2 20.0 True 3 30.0 False In [28]: df.loc['1':'3', [False, True, False, True]] Out[28]: B D 1 10.0 2010 2 20.0 2020 3 30.0 2030 In [29]: df.loc['1':'3'] # all columns Out[29]: A B C D 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 In [30]: df.loc[:, [False, True, False, True]] # all rows Out[30]: B D 0 00.0 2000 1 10.0 2010 2 20.0 2020 3 30.0 2030 4 40.0 2040 |
.ilocAccess rows and columns by purely integer positions or a boolean array. Note: Similar to In [10]: type(df.iloc) Out[10]: pandas.core.indexing._iLocIndexer |
|
DataFrame.iloc[]
Note: Shortcuts to specify all rows or columns
|
In [31]: df.iloc[1:3, [0,2]] Out[31]: A C 1 1A False 2 2A True In [32]: df.iloc[1:3, 1:3] Out[32]: B C 1 10.0 False 2 20.0 True In [33]: df.iloc[1:3, [False, True, False, True]] Out[33]: B D 1 10.0 2010 2 20.0 2020 In [34]: df.iloc[1:3] # all columns Out[34]: A B C D 1 1A 10.0 False 2010 2 2A 20.0 True 2020 In [35]: df.iloc[:, 1:3] # all rows Out[35]: B C 0 00.0 True 1 10.0 False 2 20.0 True 3 30.0 False |
Advanced boolean indexingThe earlier examples used |
|
Conditional selections with an elegant syntax |
In [36]: df[df['D']>2020] Out[36]: A B C D 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
It consists of 2 steps
|
In [37]: s = df['D']>2020 In [38]: s Out[38]: 0 False 1 False 2 False 3 True 4 True Name: D, dtype: bool In [39]: df[s] Out[39]: A B C D 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
It can extend to complex conditions with logical operators |
In [40]: df[(df['D'] > 2000) & (df['D'] < 2040)] Out[40]: A B C D 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 In [41]: df.loc[(df['D'] > 2000) & (df['D'] < 2040)] Out[41]: A B C D 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 |
Modify Data
Note the subtle behaviors of the APIs- whether it modifies the dataframe inplace or not
- whether it returns a dataframe or series
- generally, what can be done with one column, can be done with multiple columns
Add column(s) | |
Indexing operator [] |
In [42]: df['E'] = ['0E', '1E', '2E', '3E', '4E'] In [43]: df Out[43]: A B C D E 0 0A 00.0 True 2000 0E 1 1A 10.0 False 2010 1E 2 2A 20.0 True 2020 2E 3 3A 30.0 False 2030 3E 4 4A 40.0 True 2040 4E |
DataFrame.loc |
In [44]: df.loc[:, 'F'] = ['0F', '1F', '2F', '3F', '4F'] In [45]: df Out[45]: A B C D E F 0 0A 00.0 True 2000 0E 0F 1 1A 10.0 False 2010 1E 1F 2 2A 20.0 True 2020 2E 2F 3 3A 30.0 False 2030 3E 3F 4 4A 40.0 True 2040 4E 4F |
DataFrame.assign returns a copy |
In [46]: df = df.assign(G=['0G', '1G', '2G', '3G', '4G']) In [47]: df Out[47]: A B C D E F G 0 0A 00.0 True 2000 0E 0F 0G 1 1A 10.0 False 2010 1E 1F 1G 2 2A 20.0 True 2020 2E 2F 2G 3 3A 30.0 False 2030 3E 3F 3G 4 4A 40.0 True 2040 4E 4F 4G |
DataFrame.insert |
In [48]: df.insert(loc=1, column='H', value=['0H', '1H', '2H', '3H', '4H']) In [49]: df Out[49]: A H B C D E F G 0 0A 0H 00.0 True 2000 0E 0F 0G 1 1A 1H 10.0 False 2010 1E 1F 1G 2 2A 2H 20.0 True 2020 2E 2F 2G 3 3A 3H 30.0 False 2030 3E 3F 3G 4 4A 4H 40.0 True 2040 4E 4F 4G |
Remove column(s) | |
del |
In [50]: del df['H'] In [51]: df Out[51]: A B C D E F G 0 0A 00.0 True 2000 0E 0F 0G 1 1A 10.0 False 2010 1E 1F 1G 2 2A 20.0 True 2020 2E 2F 2G 3 3A 30.0 False 2030 3E 3F 3G 4 4A 40.0 True 2040 4E 4F 4G |
DataFrame.pop |
In [52]: df.pop('G') Out[52]: 0 0G 1 1G 2 2G 3 3G 4 4G Name: G, dtype: object In [53]: df Out[53]: A B C D E F 0 0A 00.0 True 2000 0E 0F 1 1A 10.0 False 2010 1E 1F 2 2A 20.0 True 2020 2E 2F 3 3A 30.0 False 2030 3E 3F 4 4A 40.0 True 2040 4E 4F |
DataFrame.drop |
In [54]: df = df.drop(['F'], axis=1) In [55]: df Out[55]]: A B C D E 0 0A 00.0 True 2000 0E 1 1A 10.0 False 2010 1E 2 2A 20.0 True 2020 2E 3 3A 30.0 False 2030 3E 4 4A 40.0 True 2040 4E In [56]: df.drop(['E'], axis=1, inplace=True) In [57]: df Out[57]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
Add row(s) | |
DataFrame.append
Note: Pay attention to the behavior of |
In [58]: df.append({'A':'5A', 'B':'50.0 ', 'C':False, 'D':2050}, ignore_index=True) Out[58]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 5 5A 50.0 False 2050 In [59]: df.append(pd.Series(data=['5A', '50.0 ', False, 2050], index=df.columns), ignore_index=True) Out[59]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 5 5A 50.0 False 2050 In [60]: df.append(pd.Series(data=['5A', '50.0 ', False, 2050], index=df.columns, name='5'), ignore_index=False) Out[60]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 5 5A 50.0 False 2050 |
DataFrame.loc |
In [61]: df.loc['5'] = ['5A', '50.0 ', False, 2050] In [62]: df Out[62]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 5 5A 50.0 False 2050 |
Remove row(s) | |
DataFrame.drop |
In [63]: df.drop(['4','5'], axis=0) Out[63]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 |
Select only what is required |
In [64]: df['0':'3'] Out[64]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 In [65]: df[0:4] Out[65]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 |
Change column data | |
Indexing operator [] |
In [66]: df['A'] = ['0a0', '1a1', '2a2', '3a3', '4a4', '5a5'] In [67]: df Out[67]: A B C D 0 0a0 00.0 True 2000 1 1a1 10.0 False 2010 2 2a2 20.0 True 2020 3 3a3 30.0 False 2030 4 4a4 40.0 True 2040 5 5a5 50.0 False 2050 |
Arithmetic operator(s) |
In [68]: df['D']+1 Out[68]: 0 2001 1 2011 2 2021 3 2031 4 2041 5 2051 Name: D, dtype: int64 |
DataFrame.apply |
In [69]: def f(x): x = x + 1 return x In [70]: df['D'].apply(f) Out[70]: 0 2001 1 2011 2 2021 3 2031 4 2041 5 2051 Name: D, dtype: int64 In [71]: df['D'].apply(lambda x: x+1) Out[71]: 0 2001 1 2011 2 2021 3 2031 4 2041 5 2051 Name: D, dtype: int64 |
Change row data | |
DataFrame.loc |
In [72]: df.loc['4'] = ['4A','40.01', False, 2041] In [73]: df Out[73]: A B C D 0 0a0 00.0 True 2000 1 1a1 10.0 False 2010 2 2a2 20.0 True 2020 3 3a3 30.0 False 2030 4 4A 40.01 False 2041 5 5a5 50.0 False 2050 |
DataFrame.iloc |
In [74]: df.iloc[5] = ['5A','50.01', True, 2051] In [75]: df Out[75]: A B C D 0 0a0 00.0 True 2000 1 1a1 10.0 False 2010 2 2a2 20.0 True 2020 3 3a3 30.0 False 2030 4 4A 40.01 False 2041 5 5A 50.01 True 2051 |
Change column type | |
DataFrame.astype |
In [76]: df['C'].astype('int8') Out[76]: 0 1 1 0 2 1 3 0 4 1 Name: C, dtype: int8 In [77]: df['D'].astype('float') Out[77]: 0 2000.0 1 2010.0 2 2020.0 3 2030.0 4 2040.0 Name: D, dtype: float64 |
pandas.to_numeric |
In [78]: pd.to_numeric(df['B']) Out[78]: 0 0.0 1 10.0 2 20.0 3 30.0 4 40.0 Name: B, dtype: float64 |
pandas.to_datetime |
In [79]: pd.to_datetime(df['D'], format="%Y") Out[79]: 0 2000-01-01 1 2010-01-01 2 2020-01-01 3 2030-01-01 4 2040-01-01 Name: D, dtype: datetime64[ns] |
Modify Index
Change column label |
In [80]: df.rename({'A':'a', 'B':'b'}, axis=1) Out[80]: a b C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
Change row label |
In [81]: df.rename({'2':'0.2', '4':'0.4'}, axis=0) Out[81]: A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 0.2 2A 20.0 True 2020 3 3A 30.0 False 2030 0.4 4A 40.0 True 2040 |
Change column index name |
In [82]: df.columns.name = "HEADER" In [82]: df Out[19]: HEADER A B C D 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
Change row index name Note: When exporting to a csv including the row labels, this name will be used as the column name in the exported file |
In [83]: df.index.name = "ID" In [83]: df Out[83]: HEADER A B C D ID 0 0A 00.0 True 2000 1 1A 10.0 False 2010 2 2A 20.0 True 2020 3 3A 30.0 False 2030 4 4A 40.0 True 2040 |
Descriptive Info
DataFrame.describe |
In [84]: df.describe() Out[84]: HEADER D count 5.000000 mean 2020.000000 std 15.811388 min 2000.000000 25% 2010.000000 50% 2020.000000 75% 2030.000000 max 2040.000000 |
DataFrame.info |
In [85]: df.info() Index: 5 entries, 0 to 4 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 A 5 non-null object 1 B 5 non-null object 2 C 5 non-null bool 3 D 5 non-null int64 dtypes: bool(1), int64(1), object(2) memory usage: 165.0+ bytes |