Pandas DataFrame

Introduction
Data Structure
pandas.DataFrame
pandas.Index
pandas.Series
numpy.ndarray
list (built-in)
Conversion
Conclusion

Introduction

The Pandas DataFrame is a basic building block for working with popular Python packages such as Scikit-learn. Python is known for making conversion between data types easy, often implicitly. Yet, in the case of the DataFrame, some of the APIs and terminology may not feel intuitive. That’s because the underlying data structure is a bit more complex than we tend to perceive.

We may look at it as cells, like in an Excel spreadsheet, with rows identified by row numbers and columns identified by column headers. Because of this simplistic view, we may be thrown off by the results of some of the DataFrame APIs. For example, you may expect a simple Python list when querying for the column headers, row numbers, or column data, but that is not the case.

Tabular Data

Data structure

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
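
The "dictionary of Series objects" view can be made concrete with a small sketch. The column names and values below are made up for illustration; each Series becomes one column, and each column keeps its own data type.

```python
import pandas as pd

# Hypothetical data: each Series becomes one column,
# and the dict keys become the column labels.
columns = {
    "name": pd.Series(["Ann", "Bob", "Cy"]),
    "age": pd.Series([31, 45, 27]),
    "score": pd.Series([88.5, 92.0, 79.5]),
}
people = pd.DataFrame(columns)

# Every column keeps its own dtype.
print(people.dtypes)

# Selecting a column by its label hands back a Series.
age = people["age"]
print(type(age).__name__)
```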

Some introductory terminology would be helpful.

  • Axes are the two dimensions of the table: axis=0 refers to the row labels and axis=1 refers to the column labels.
  • Columns are the column labels (or headers), not the column data.
  • Index refers to the row labels. Though we generally refer to rows by their implicit row numbers, rows can also have explicit labels.

Note: Index is a bit overused in Pandas. Not only are the row labels referred to as the index, there are also Index objects and the indexing operator [].
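
A minimal sketch of these terms, using a made-up `sales` table with explicit row labels. It also touches the different meanings of "index": the row labels, the Index objects, and the indexing operator [].

```python
import pandas as pd

# Hypothetical data with explicit row labels instead of the default 0..n-1.
sales = pd.DataFrame(
    {"q1": [10, 20, 30], "q2": [1, 2, 3]},
    index=["north", "south", "west"],
)

row_labels = sales.index        # the "index": an Index object of row labels
column_labels = sales.columns   # the "columns": also an Index object

# axis=0 works down the row labels, leaving one value per column.
per_column = sales.sum(axis=0)
# axis=1 works across the column labels, leaving one value per row.
per_row = sales.sum(axis=1)

# The indexing operator [] selects a column by its label;
# .loc selects a row by its index label.
q1 = sales["q1"]
south = sales.loc["south"]
```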

It’s important to understand the following Pandas objects to be able to manipulate a DataFrame.

pandas.DataFrame

Two-dimensional, size-mutable tabular data in which each column can be of a different data type. It contains `Index` objects for the axis labels (row labels and column labels). The data can be thought of as a dictionary-like container for `Series` objects. It supports arithmetic operations on both row data and column data.

Pandas DataFrame

In [1]: import pandas

In [2]:  df = pandas.DataFrame(data = [['0A', '0B', '0C'],
   ...:                                ['1A', '1B', '1C'],
   ...:                                ['2A', '2B', '2C'],
   ...:                                ['3A', '3B', '3C'],
   ...:                                ['4A', '4B', '4C']],
   ...:                               columns=['A', 'B', 'C'])

In [3]: df
Out[3]:
    A   B   C
0  0A  0B  0C
1  1A  1B  1C
2  2A  2B  2C
3  3A  3B  3C
4  4A  4B  4C

In [4]: type(df)
Out[4]: pandas.core.frame.DataFrame
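
The dictionary-like behaviour and the arithmetic support described above can be sketched as follows. This example uses a small numeric table, named `grid` here only for illustration, because the string cells in the session above do not lend themselves to arithmetic.

```python
import pandas as pd

grid = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Dictionary-like behaviour: keys are column labels, values are Series.
has_x = "x" in grid
labels = list(grid.keys())
kinds = {label: type(col).__name__ for label, col in grid.items()}

# Arithmetic between columns is element-wise and returns a Series.
total = grid["x"] + grid["y"]

# Arithmetic between rows works the same way once the rows are selected.
row_sum = grid.loc[0] + grid.loc[1]

# Size-mutable: assigning to a new label adds a column.
grid["z"] = total * 2
```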

pandas.Index

An immutable sequence used for indexing and alignment. The row labels and the column labels (the axes) are each encapsulated in an Index object.

Pandas Index

In [5]: df.index
Out[5]: RangeIndex(start=0, stop=5, step=1)

In [6]: type(df.index)
Out[6]: pandas.core.indexes.range.RangeIndex

In [7]: df.index.values
Out[7]: array([0, 1, 2, 3, 4], dtype=int64)

In [8]: type(df.index.values)
Out[8]: numpy.ndarray


In [9]: df.columns
Out[9]: Index(['A', 'B', 'C'], dtype='object')

In [10]: type(df.columns)
Out[10]: pandas.core.indexes.base.Index

In [11]: df.columns.values
Out[11]: array(['A', 'B', 'C'], dtype=object)

In [12]: type(df.columns.values)
Out[12]: numpy.ndarray
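
What "immutable" and "alignment" mean in practice can be sketched with a small made-up example. Editing a single label in place is rejected, while replacing the whole set of labels is allowed; arithmetic between two Series matches values by label rather than by position.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])

# An Index cannot be edited element by element.
try:
    frame.columns[0] = "X"
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole set of labels at once is allowed.
frame.columns = ["X", "B"]

# Alignment: values are matched by label, not by position.
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "a"])
aligned_sum = left + right
```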

pandas.Series

A one-dimensional labeled array. It typically encapsulates a numpy.ndarray holding the column data, together with the row labels (its index) and the column label (its name).

Pandas Series

In [13]: srs = df['A']

In [14]: srs
Out[14]:
0    0A
1    1A
2    2A
3    3A
4    4A
Name: A, dtype: object

In [15]: type(srs)
Out[15]: pandas.core.series.Series

In [16]: srs.index
Out[16]: RangeIndex(start=0, stop=5, step=1)

In [17]: type(srs.index)
Out[17]: pandas.core.indexes.range.RangeIndex

In [18]: srs.name
Out[18]: 'A'

In [19]: type(srs.name)
Out[19]: str

In [20]: srs.values
Out[20]: array(['0A', '1A', '2A', '3A', '4A'], dtype=object)

In [21]: type(srs.values)
Out[21]: numpy.ndarray
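
Going the other way, a Series can be assembled from the same three parts. This is a minimal sketch with made-up values and labels; `to_numpy()` is used here as the currently recommended counterpart of the `.values` attribute shown above.

```python
import numpy as np
import pandas as pd

values = np.array([1.5, 2.5, 3.5])
labels = pd.Index(["r0", "r1", "r2"])

# Assemble a Series from its three parts: data, row labels, column label.
col = pd.Series(values, index=labels, name="A")

# The same parts can be read back from the Series.
data_back = col.to_numpy()
index_back = col.index
name_back = col.name

# A named Series converts to a one-column DataFrame
# that uses the name as its column header.
as_frame = col.to_frame()
```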

numpy.ndarray

When you see a reference to *array*, it almost always means numpy.ndarray. Python's built-in sequence type is the list; the standard library does ship an array module, but that is rarely what is meant in this context.
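
A short sketch of why the distinction matters: the same operator behaves differently on an ndarray and on a list. The variable names are illustrative only.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array(numbers_list)

# An ndarray has a shape and a single dtype; a list has neither.
shape = numbers_array.shape
kind = numbers_array.dtype.kind   # "i" for a signed integer dtype

# Operators are element-wise on arrays ...
doubled_array = numbers_array * 2
# ... but mean repetition on lists.
doubled_list = numbers_list * 2
```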

list (built-in)

list is a native Python mutable sequence, typically used to store collections of homogeneous items.

Conversion

Translating data between the types above can look intimidating, but fortunately there are some convenient methods for conversion.

Pandas Numpy Python List Conversion
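
The conversions can be sketched as follows. This is not an exhaustive list, only the calls most commonly needed, and the small table is made up for the example. `to_numpy()` is used in place of `.values`, which returns the same kind of array for this data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["0A", "1A"], "B": ["0B", "1B"]})

# pandas -> NumPy
frame_array = df.to_numpy()           # 2-D ndarray of the cell values
column_array = df["A"].to_numpy()     # 1-D ndarray of one column
header_array = df.columns.to_numpy()  # 1-D ndarray of the column labels

# pandas / NumPy -> Python list
header_list = df.columns.tolist()
row_label_list = df.index.tolist()
column_list = df["A"].tolist()
rows_list = df.to_numpy().tolist()    # list of rows, each row a list

# Python list / NumPy -> pandas
series_from_list = pd.Series(["x", "y"], name="C")
index_from_list = pd.Index(["r0", "r1"])
frame_from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["p", "q"])
```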

Conclusion

Understanding the data structure will allow you to navigate the Pandas APIs beyond the basic operations.