The Pandas DataFrame is a common building block for work with the popular Python package Scikit-learn. Python is known for making conversion between data types easy. Yet, in the case of the DataFrame, some of the APIs and terminology may not feel intuitive. That's because the underlying data structure is a bit more complex than we tend to perceive.
We may look at it as cells, like in an Excel spreadsheet, which has rows with row numbers and columns with column headers. Because of this simplistic view, we may be thrown off by the results of some of the DataFrame APIs. For example, you may expect a simple Python list when querying for the column headers, the row numbers, or the column data, but that is not the case.
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
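The "dictionary of Series objects" view can be made concrete with a small sketch. The column names and values below are made up purely for illustration; the point is only that the dictionary keys become column labels and that each column keeps its own data type.

```python
import pandas as pd

# Two Series sharing the same default row labels (illustrative data).
names = pd.Series(["Ann", "Bob", "Cy"])
ages = pd.Series([31, 45, 27])

# Building a DataFrame from a dict of Series: the keys become column labels.
people = pd.DataFrame({"name": names, "age": ages})

# Each column can have its own dtype.
print(people.dtypes)

# Looking up a column by its label returns a Series, much like a dict lookup.
age_column = people["age"]
print(type(age_column).__name__)  # Series
print(list(people.keys()))        # ['name', 'age']
```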
Some introductory terminology would be helpful.
- Axes are the 2 dimensions of the table: axis=0 refers to row labels and axis=1 refers to column labels.
- Columns are column labels (or header), not the column data.
- Index refers to the row labels. Though we generally refer to it with implicit row numbers, they can also have explicit labels.
Note: Index is a bit overused in Pandas. Not only are the row labels referred to as the index, there are also `Index` objects and the indexing operator `[]`.
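As a rough sketch of this terminology, the snippet below builds a small DataFrame with explicit row labels and touches each of the terms in turn. The data and the names `scores`, `per_column` and `per_row` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Illustrative data with explicit row labels instead of the default 0..n-1.
scores = pd.DataFrame(
    {"math": [90, 80, 70], "art": [60, 75, 85]},
    index=["ann", "bob", "cy"],
)

row_labels = scores.index       # the "index": row labels (axis=0)
column_labels = scores.columns  # the "columns": column labels (axis=1)

# df.axes is a list holding both Index objects: [row labels, column labels].
print(scores.axes)

# axis=0 aggregates down the rows, giving one result per column.
per_column = scores.sum(axis=0)
# axis=1 aggregates across the columns, giving one result per row.
per_row = scores.sum(axis=1)

# The indexing operator [] selects a column by its label ...
math_column = scores["math"]
# ... while .loc selects a row by its label.
bob_row = scores.loc["bob"]
```

Here all three senses of "index" appear: the row labels (`scores.index`), the `Index` objects holding both sets of labels, and the `[]` operator used to select a column.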
It’s important to understand the following Pandas objects to be able to manipulate a DataFrame.
A `DataFrame` is two-dimensional, size-mutable, tabular data where each column can potentially be of a different data type. It contains `Index` objects for the axis labels (row labels and column labels). The data can be thought of as a dictionary-like container for `Series` objects. It supports arithmetic operations on both row data and column data.

    In : import pandas

    In : df = pandas.DataFrame(data = [['0A', '0B', '0C'],
    ...:                               ['1A', '1B', '1C'],
    ...:                               ['2A', '2B', '2C'],
    ...:                               ['3A', '3B', '3C'],
    ...:                               ['4A', '4B', '4C']],
    ...:                       columns=['A', 'B', 'C'])

    In : df
    Out:
        A   B   C
    0  0A  0B  0C
    1  1A  1B  1C
    2  2A  2B  2C
    3  3A  3B  3C
    4  4A  4B  4C

    In : type(df)
    Out: pandas.core.frame.DataFrame

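The arithmetic support mentioned above is easier to see with numbers than with the string cells used so far. The following is a minimal sketch on a separate, hypothetical numeric DataFrame called `prices`; it is not part of the running example.

```python
import pandas as pd

# Illustrative numeric data (hypothetical, separate from the running example).
prices = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Arithmetic with a scalar applies to every cell.
doubled = prices * 2

# Arithmetic between two columns works element-wise, row by row.
prices["total"] = prices["A"] + prices["B"]

# Subtracting one row from the whole frame: the row (a Series) is matched
# against the column labels and subtracted from every row.
row_diff = prices.sub(prices.loc[0], axis="columns")
print(row_diff)
```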
An `Index` encapsulates the row and column labels (or axes). It is an immutable sequence used for indexing and alignment. The row and column labels are encapsulated in `Index` objects.

    In : df.index
    Out: RangeIndex(start=0, stop=5, step=1)

    In : type(df.index)
    Out: pandas.core.indexes.range.RangeIndex

    In : df.index.values
    Out: array([0, 1, 2, 3, 4], dtype=int64)

    In : type(df.index.values)
    Out: numpy.ndarray

    In : df.columns
    Out: Index(['A', 'B', 'C'], dtype='object')

    In : type(df.columns)
    Out: pandas.core.indexes.base.Index

    In : df.columns.values
    Out: array(['A', 'B', 'C'], dtype=object)

    In : type(df.columns.values)
    Out: numpy.ndarray

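Because an `Index` is immutable, changing a single label in place is not allowed, which can be surprising if you expect it to behave like a list. The sketch below, on a cut-down two-column version of the example data, shows the failure and two common ways to rename labels instead; the new label names are arbitrary.

```python
import pandas as pd

df = pd.DataFrame([["0A", "0B"], ["1A", "1B"]], columns=["A", "B"])

# An Index does not support item assignment; this raises a TypeError.
try:
    df.columns[0] = "X"
    mutated = True
except TypeError:
    mutated = False

# To change labels, assign a whole new sequence of labels ...
df.columns = ["X", "Y"]
# ... or use rename, which returns a new DataFrame.
renamed = df.rename(columns={"X": "first"})

# tolist() converts an Index into a plain Python list.
labels = df.columns.tolist()
print(labels)  # ['X', 'Y']
```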
A `Series` encapsulates a one-dimensional `numpy.ndarray` containing the column data, together with the row labels and the column label.

    In : srs = df['A']

    In : srs
    Out:
    0    0A
    1    1A
    2    2A
    3    3A
    4    4A
    Name: A, dtype: object

    In : type(srs)
    Out: pandas.core.series.Series

    In : srs.index
    Out: RangeIndex(start=0, stop=5, step=1)

    In : type(srs.index)
    Out: pandas.core.indexes.range.RangeIndex

    In : srs.name
    Out: 'A'

    In : type(srs.name)
    Out: str

    In : srs.values
    Out: array(['0A', '1A', '2A', '3A', '4A'], dtype=object)

    In : type(srs.values)
    Out: numpy.ndarray

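A `Series` is not limited to columns. As a small sketch on a two-row version of the example data, selecting a single row with `.loc` also returns a `Series`, only now the roles are swapped: its index holds the column labels and its name is the row label. The `standalone` Series at the end is a hypothetical example of building one directly.

```python
import pandas as pd

df = pd.DataFrame(
    [["0A", "0B", "0C"], ["1A", "1B", "1C"]],
    columns=["A", "B", "C"],
)

# Selecting a row by its label also yields a Series.
row = df.loc[1]

# Its index holds the column labels, and its name is the row label.
print(row.index.tolist())  # ['A', 'B', 'C']
print(row.name)            # 1
print(row.tolist())        # ['1A', '1B', '1C']

# A Series can also be built directly, with its own labels and name.
standalone = pd.Series(["x", "y"], index=["first", "second"], name="letters")
```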
When you see a reference to *array* in Pandas, it usually refers to the `numpy.ndarray`, as Python does not have a built-in array type.
A `list` is a native Python mutable sequence, typically used to store collections of homogeneous items.
Translating data between the types above can look intimidating, but fortunately there are some convenient methods for conversion.
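A few of these conversions are sketched below on a two-row version of the example data. This is not an exhaustive list, just the methods that are likely to come up most often: `tolist()`, `to_numpy()` and `to_dict()`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["0A", "0B", "0C"], ["1A", "1B", "1C"]],
    columns=["A", "B", "C"],
)

# Index -> list
column_list = df.columns.tolist()
row_label_list = df.index.tolist()

# Series -> ndarray and Series -> list
col_array = df["A"].to_numpy()
col_list = df["A"].tolist()

# DataFrame -> 2-D ndarray -> nested list (one inner list per row)
table_array = df.to_numpy()
table_list = table_array.tolist()

# DataFrame -> dict of lists keyed by column label
table_dict = df.to_dict(orient="list")

# list -> ndarray -> Series
back_to_series = pd.Series(np.array(col_list), name="A")
```

Going the other way, `pandas.DataFrame(...)` and `pandas.Series(...)` accept lists, arrays and dictionaries directly, as the constructor calls in the earlier examples show.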
Understanding the data structure will allow you to navigate the Pandas APIs beyond the basic operations.