Pandas DataFrame Operations

A cheatsheet with examples for the common Pandas DataFrame operations.

Create
Properties
 Number of rows and columns
 Number of columns
 Number of rows
 Column names/labels
 Row names/labels/index  Column data type
Query/Select/Slice Data
 Indexing operator []
.loc
.iloc
Modify Data
 Add column(s)
 Remove column(s)
 Add row(s)
 Remove row(s)
 Modify column(s)
 Modify row(s)
Modify Index
 Change column label
 Change row label
 Change column index name
 Change row index name
Descriptive Info
.describe
.info

Create

Import
In [1]: import pandas as pd
Data as dictionary
In [2]: pd.DataFrame(data = {'A': ['0A', '1A', '2A', '3A', '4A'],
                             'B': ['00.0 ', '10.0', '20.0', '30.0', '40.0'],
                             'C': [True, False, True, False, True],
                             'D': [2000, 2010, 2020, 2030, 2040] } )
Out[2]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A   10.0  False  2010
2  2A   20.0   True  2020
3  3A   30.0  False  2030
4  4A   40.0   True  2040

Data as a list

Creates default numeric row and column labels

In [3]: pd.DataFrame( [['0A', '00.0', True, 2000],
                      ['1A', '10.0', False, 2010],
                      ['2A', '20.0', True,  2020],
                      ['3A', '30.0', False, 2030],
                      ['4A', '40.0', True,  2040]] )

Out[3]:
    0     1      2     3
0  0A  00.0   True  2000
1  1A  10.0  False  2010
2  2A  20.0   True  2020
3  3A  30.0  False  2030
4  4A  40.0   True  2040

The following DataFrame will be used to demonstrate the various operations.

  • Typically, the row labels are numeric integers. But it can get confusing when trying to understand label based indexing and numeric/location based indexing. So the row labels here are intentionally string integers to be able to distinguish between them.
  • Numbers in column B has an intentional trailing whitespace so as to force it to be represented as a string.

Data as list

Using optional row and column labels

In [4] df = pd.DataFrame(data = [['0A', '00.0 ', True,  2000],
                                 ['1A', '10.0 ', False, 2010],
                                 ['2A', '20.0 ', True,  2020],
                                 ['3A', '30.0 ', False, 2030],
                                 ['4A', '40.0 ', True,  2040]],
                         index = ['0', '1', '2', '3', '4'],
                         columns = ['A', 'B', 'C', 'D'])

In [5]: df
Out[5]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
4  4A  40.0    True  2040

Properties

Number of rows and columns
In [6]: nrow,ncol = df.shape

In [7]: nrow, ncol
Out[7]: (5, 4)
Number of columns
In [8]: df.shape[1]
Out[8]: 4

In [9]: len(df.columns)
Out[9]: 4
Number of rows
In [10]: df.shape[0]
Out[10]: 5

In [11]: len(df)
Out[11]: 5

In [12]: len(df.index)
Out[12]: 5
Column names/labels
In [13]: df.columns
Out[13]: Index(['A', 'B', 'C', 'D'], dtype='object')

In [14]: df.columns.to_list()
Out[14]: ['A', 'B', 'C', 'D']
Row names/labels/index
In [15]: df.index
Out[15]: Index(['0', '1', '2', '3', '4'], dtype='object')

In [16]: df.index.to_list()
Out[16]: ['0', '1', '2', '3', '4']
Column data type(s)
In [17]: df.dtypes
Out[17]:
A    object
B    object
C      bool
D     int64
dtype: object

Query/Select/Slice Data

Common ways to specify a selection

  • A single label or position
  • A list or array of labels or positions
  • A slice object with labels or positions
  • A boolean array (literals or generated dynamically) of the same length as the axis being sliced

Note: Row and column index can have integer labels. It very important to distinguish between integer labels and integer positions.

Indexing operator []

  • It is provided more for convenience. Prefer loc and iloc for production code.
  • It sometimes selects rows, and sometimes columns, which can be a bit confusing.
  • Whether it selects rows or columns seems to be based on what is considered as a common operation.

Column data by label

  • A single label (an integer is interpreted as a label, and never as an integer position) returns a Series object
  • A list or array of labels returns a DataFrame

Note: Difference between df['B'] and the weird double square bracket df[['B']]

  • df['B'] returns a Series object
  • df[['B']] is passing an array with a single element, returns a DataFrame object (with a single column)

In [18]: df['A']
Out[18]:
0  0A
1  1A
2  2A
3  3A
4  4A
Name: A, dtype: object

In [19]: df[['A']]
Out[19]:
    A
0  0A
1  1A
2  2A
3  3A
4  4A

In [20]: df[['A', 'B']]
Out[20]:
    A      B
0  0A  00.0
1  1A  10.0
2  2A  20.0
3  3A  30.0
4  4A  40.0

Row data by label slice

  • If both the start and the stop labels are present in the index, then elements located between the two (including them) are selected
In [21]: df['0':'2']
Out[21]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
Row data by position slice
  • Purely integer position
  • Zero based indexing
  • Lower bound is included, while the upper bound is excluded
In [22]: df[0:2]
Out[22]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
Row data by boolean indexing
In [23]: df[[True, False, True, False, True]]
Out[23]:
    A      B     C     D
0  0A  00.0   True  2000
2  2A  20.0   True  2020
4  4A  40.0   True  2040


In [24]: df[ pd.Series({'0':True,
                        '1':False,
                        '2':True,
                        '3':False,
                        '4':True}) ]
Out[24]:
    A      B     C     D
0  0A  00.0   True  2000
2  2A  20.0   True  2020
4  4A  40.0   True  2040

.loc

Access rows and columns by label(s) or a boolean array.

Note: Unlike a normal function, .loc uses square brackets [] like indexing. This is because loc is not a function. DataFrame.loc is an attribute which holds an instance of a _LocIndexer class. This internal class has an implementation for the [] indexing operator which gets invoked.

In [25]: type(df.loc)
Out[25]: pandas.core.indexing._LocIndexer
DataFrame.loc[]

Note: Shortcuts to specify all rows or columns

  • Use : without a range to indicate all rows or columns
  • Leave out the column specification to indicate all columns
In [26]: df.loc['1':'3', ['A','C']]
Out[26]:
    A      C
1  1A  False
2  2A   True
3  3A  False

In [27]: df.loc['1':'3', 'B':'C']
Out[27]:
       B      C
1  10.0   False
2  20.0    True
3  30.0   False

In [28]: df.loc['1':'3', [False, True, False, True]]
Out[28]:
       B     D
1  10.0   2010
2  20.0   2020
3  30.0   2030

In [29]: df.loc['1':'3'] # all columns
Out[29]:
    A      B      C     D
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030

In [30]: df.loc[:, [False, True, False, True]] # all rows
Out[30]:
       B     D
0  00.0   2000
1  10.0   2010
2  20.0   2020
3  30.0   2030
4  40.0   2040

.iloc

Access rows and columns by purely integer positions or a boolean array.

Note: Similar to .loc, this is not a function but an attribute to an instance of an internal class _iLocIndexer which implements the indexing operator [].

In [10]: type(df.iloc)
Out[10]: pandas.core.indexing._iLocIndexer
DataFrame.iloc[]

Note: Shortcuts to specify all rows or columns

  • Use : without a range to indicate all rows or columns
  • Leave out the column specification to indicate all columns
In [31]: df.iloc[1:3, [0,2]]
Out[31]:
    A      C
1  1A  False
2  2A   True

In [32]: df.iloc[1:3, 1:3]
Out[32]:
       B      C
1  10.0   False
2  20.0    True

In [33]: df.iloc[1:3, [False, True, False, True]]
Out[33]:
       B     D
1  10.0   2010
2  20.0   2020

In [34]: df.iloc[1:3] # all columns
Out[34]:
    A      B      C     D
1  1A  10.0   False  2010
2  2A  20.0    True  2020

In [35]: df.iloc[:, 1:3] # all rows
Out[35]:
       B      C
0  00.0    True
1  10.0   False
2  20.0    True
3  30.0   False

Advanced boolean indexing

The earlier examples used boolean literals for the selection. The real power is when the boolean list is generated by a combination of conditions.

Conditional selections with an elegant syntax
In [36]: df[df['D']>2020]
Out[36]:
    A      B      C     D
3  3A  30.0   False  2030
4  4A  40.0    True  2040
It consists of 2 steps
  • Condition generates a boolean index
  • The boolean index is applied to the DataFrame
In [37]: s = df['D']>2020

In [38]: s
Out[38]:
0    False
1    False
2    False
3     True
4     True
Name: D, dtype: bool

In [39]: df[s]
Out[39]:
    A      B      C     D
3  3A  30.0   False  2030
4  4A  40.0    True  2040
It can extend to complex conditions with logical operators
In [40]: df[(df['D'] > 2000) & (df['D'] < 2040)]
Out[40]:
    A      B      C     D
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030

In [41]: df.loc[(df['D'] > 2000) & (df['D'] < 2040)]
Out[41]:
    A      B      C     D
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030

Modify Data

Note the subtle behaviors of the APIs
  • whether it modifies the dataframe inplace or not
  • whether it returns a dataframe or series
  • generally, what can be done with one column, can be done with multiple columns

Add column(s)

Indexing operator []
In [42]: df['E'] = ['0E', '1E', '2E', '3E', '4E']

In [43]: df
Out[43]:
    A      B      C     D   E
0  0A  00.0    True  2000  0E
1  1A  10.0   False  2010  1E
2  2A  20.0    True  2020  2E
3  3A  30.0   False  2030  3E
4  4A  40.0    True  2040  4E
DataFrame.loc
In [44]: df.loc[:, 'F'] = ['0F', '1F', '2F', '3F', '4F']

In [45]: df
Out[45]:
    A      B      C     D   E   F
0  0A  00.0    True  2000  0E  0F
1  1A  10.0   False  2010  1E  1F
2  2A  20.0    True  2020  2E  2F
3  3A  30.0   False  2030  3E  3F
4  4A  40.0    True  2040  4E  4F
DataFrame.assign

returns a copy
In [46]: df = df.assign(G=['0G', '1G', '2G', '3G', '4G'])

In [47]: df
Out[47]:
    A      B      C     D   E   F   G
0  0A  00.0    True  2000  0E  0F  0G
1  1A  10.0   False  2010  1E  1F  1G
2  2A  20.0    True  2020  2E  2F  2G
3  3A  30.0   False  2030  3E  3F  3G
4  4A  40.0    True  2040  4E  4F  4G
DataFrame.insert
In [48]: df.insert(loc=1,
                   column='H',
                   value=['0H', '1H', '2H', '3H', '4H'])

In [49]: df
Out[49]:
    A   H      B      C     D   E   F   G
0  0A  0H  00.0    True  2000  0E  0F  0G
1  1A  1H  10.0   False  2010  1E  1F  1G
2  2A  2H  20.0    True  2020  2E  2F  2G
3  3A  3H  30.0   False  2030  3E  3F  3G
4  4A  4H  40.0    True  2040  4E  4F  4G

Remove column(s)

del
In [50]: del df['H']

In [51]: df
Out[51]:
    A      B      C     D   E   F   G
0  0A  00.0    True  2000  0E  0F  0G
1  1A  10.0   False  2010  1E  1F  1G
2  2A  20.0    True  2020  2E  2F  2G
3  3A  30.0   False  2030  3E  3F  3G
4  4A  40.0    True  2040  4E  4F  4G
DataFrame.pop
In [52]: df.pop('G')
Out[52]:
0    0G
1    1G
2    2G
3    3G
4    4G
Name: G, dtype: object

In [53]: df
Out[53]:
    A      B      C     D   E   F
0  0A  00.0    True  2000  0E  0F
1  1A  10.0   False  2010  1E  1F
2  2A  20.0    True  2020  2E  2F
3  3A  30.0   False  2030  3E  3F
4  4A  40.0    True  2040  4E  4F
DataFrame.drop
In [54]: df = df.drop(['F'], axis=1)

In [55]: df
Out[55]]:
    A      B      C     D   E
0  0A  00.0    True  2000  0E
1  1A  10.0   False  2010  1E
2  2A  20.0    True  2020  2E
3  3A  30.0   False  2030  3E
4  4A  40.0    True  2040  4E

In [56]: df.drop(['E'], axis=1, inplace=True)

In [57]: df
Out[57]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
4  4A  40.0    True  2040

Add row(s)

DataFrame.append

Note: Pay attention to the behavior of ignore_index flag

In [58]: df.append({'A':'5A', 'B':'50.0 ', 'C':False, 'D':2050},
                   ignore_index=True)
Out[58]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
4  4A  40.0    True  2040
5  5A  50.0   False  2050

In [59]: df.append(pd.Series(data=['5A', '50.0 ', False, 2050], index=df.columns),
                   ignore_index=True)
Out[59]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
4  4A  40.0    True  2040
5  5A  50.0   False  2050

In [60]: df.append(pd.Series(data=['5A', '50.0 ', False, 2050], index=df.columns, name='5'),
                   ignore_index=False)
Out[60]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
4  4A  40.0    True  2040
5  5A  50.0   False  2050
DataFrame.loc
In [61]: df.loc['5'] = ['5A', '50.0 ', False, 2050]

In [62]: df
Out[62]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
4  4A  40.0    True  2040
5  5A  50.0   False  2050

Remove row(s)

DataFrame.drop
In [63]: df.drop(['4','5'], axis=0)
Out[63]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
Select only what is required
In [64]: df['0':'3']
Out[64]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030

In [65]: df[0:4]
Out[65]:
    A      B      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030

Change column data

Indexing operator []
In [66]: df['A'] = ['0a0', '1a1', '2a2', '3a3', '4a4', '5a5']

In [67]: df
Out[67]:
     A      B      C     D
0  0a0  00.0    True  2000
1  1a1  10.0   False  2010
2  2a2  20.0    True  2020
3  3a3  30.0   False  2030
4  4a4  40.0    True  2040
5  5a5  50.0   False  2050
Arithmetic operator(s)
In [68]: df['D']+1
Out[68]:
0    2001
1    2011
2    2021
3    2031
4    2041
5    2051
Name: D, dtype: int64
DataFrame.apply
In [69]: def f(x):
             x = x + 1
             return x

In [70]: df['D'].apply(f)
Out[70]:
0    2001
1    2011
2    2021
3    2031
4    2041
5    2051
Name: D, dtype: int64

In [71]: df['D'].apply(lambda x: x+1)
Out[71]:
0    2001
1    2011
2    2021
3    2031
4    2041
5    2051
Name: D, dtype: int64

Change row data

DataFrame.loc
In [72]: df.loc['4'] = ['4A','40.01', False, 2041]

In [73]: df
Out[73]:
     A      B      C     D
0  0a0  00.0    True  2000
1  1a1  10.0   False  2010
2  2a2  20.0    True  2020
3  3a3  30.0   False  2030
4   4A  40.01  False  2041
5  5a5  50.0   False  2050
DataFrame.iloc
In [74]: df.iloc[5] = ['5A','50.01', True, 2051]

In [75]: df
Out[75]:
     A      B      C     D
0  0a0  00.0    True  2000
1  1a1  10.0   False  2010
2  2a2  20.0    True  2020
3  3a3  30.0   False  2030
4   4A  40.01  False  2041
5   5A  50.01   True  2051

Change column type

DataFrame.astype
In [76]: df['C'].astype('int8')
Out[76]:
0    1
1    0
2    1
3    0
4    1
Name: C, dtype: int8

In [77]: df['D'].astype('float')
Out[77]:
0    2000.0
1    2010.0
2    2020.0
3    2030.0
4    2040.0
Name: D, dtype: float64
pandas.to_numeric
In [78]: pd.to_numeric(df['B'])
Out[78]:
0     0.0
1    10.0
2    20.0
3    30.0
4    40.0
Name: B, dtype: float64
pandas.to_datetime
In [79]: pd.to_datetime(df['D'], format="%Y")
Out[79]:
0   2000-01-01
1   2010-01-01
2   2020-01-01
3   2030-01-01
4   2040-01-01
Name: D, dtype: datetime64[ns]

Modify Index

Change column label
In [80]: df.rename({'A':'a', 'B':'b'}, axis=1)
Out[80]:
    a      b      C     D
0  0A  00.0    True  2000
1  1A  10.0   False  2010
2  2A  20.0    True  2020
3  3A  30.0   False  2030
4  4A  40.0    True  2040
Change row label
In [81]: df.rename({'2':'0.2', '4':'0.4'}, axis=0)
Out[81]:
      A      B      C     D
0    0A  00.0    True  2000
1    1A  10.0   False  2010
0.2  2A  20.0    True  2020
3    3A  30.0   False  2030
0.4  4A  40.0    True  2040
Change column index name
In [82]: df.columns.name = "HEADER"

In [82]: df
Out[19]:
HEADER   A      B      C     D
0       0A  00.0    True  2000
1       1A  10.0   False  2010
2       2A  20.0    True  2020
3       3A  30.0   False  2030
4       4A  40.0    True  2040

Change row index name

Note: When exporting to a csv including the row labels, this name will be used as the column name in the exported file

In [83]: df.index.name = "ID"

In [83]: df
Out[83]:
HEADER   A      B      C     D
ID
0       0A  00.0    True  2000
1       1A  10.0   False  2010
2       2A  20.0    True  2020
3       3A  30.0   False  2030
4       4A  40.0    True  2040

Descriptive Info

DataFrame.describe
In [84]: df.describe()
Out[84]:
HEADER            D
count      5.000000
mean    2020.000000
std       15.811388
min     2000.000000
25%     2010.000000
50%     2020.000000
75%     2030.000000
max     2040.000000
DataFrame.info
In [85]: df.info()

Index: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      object
 1   B       5 non-null      object
 2   C       5 non-null      bool
 3   D       5 non-null      int64
dtypes: bool(1), int64(1), object(2)
memory usage: 165.0+ bytes

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s