Introduction

Data structure

`pandas.DataFrame`

`pandas.Index`

`pandas.Series`

`numpy.ndarray`

`list` (built-in)

Conversion

Conclusion

# Introduction

The Pandas DataFrame is a common starting point for data work in Python, including work with the popular Scikit-learn package. Python is known for making conversion between data types easy. Yet, in the case of the DataFrame, some of the APIs and terminology may not feel intuitive. That’s because the underlying data structure is a bit more complex than we tend to perceive.

We may look at it as cells, like in an Excel spreadsheet, which has rows with a row number and columns with a column header. Because of this simplistic view, we may be thrown off by the results of some of the DataFrame APIs. For example, you may expect a simple Python list when querying for the column headers, row numbers, or column data, but that is not the case.

# Data structure

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
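The "dictionary of Series objects" view can be sketched as follows. This is a minimal illustration; the names `names`, `ages` and `people` and the sample values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data: two Series that share the default row labels 0, 1, 2.
names = pd.Series(["Ann", "Bob", "Cy"])
ages = pd.Series([31, 25, 47])

# Each dictionary key becomes a column label; each Series becomes that column's data.
people = pd.DataFrame({"name": names, "age": ages})

# Every column keeps its own dtype, much like the columns of a SQL table.
print(people.dtypes)

# Selecting a column with [] hands back a Series again.
age_column = people["age"]
print(type(age_column))
```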

Some introductory terminology would be helpful.

**Axes** are the two dimensions of the table: `axis=0` refers to the row labels and `axis=1` refers to the column labels.

**Columns** are the column labels (or headers), not the column data.

**Index** refers to the row labels. We generally think of them as implicit row numbers, but rows can also have explicit labels.
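To make the axis convention concrete, here is a small sketch. The frame `scores` and its labels are invented for illustration; `sum` and `drop` are standard DataFrame methods that accept an `axis` argument.

```python
import pandas as pd

# Hypothetical numeric frame with explicit row labels instead of 0, 1.
scores = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    columns=["A", "B", "C"],
    index=["row0", "row1"],
)

# axis=0 aggregates down the rows, giving one value per column label.
column_totals = scores.sum(axis=0)

# axis=1 aggregates across the columns, giving one value per row label.
row_totals = scores.sum(axis=1)

# drop() uses the same convention to decide which set of labels to search.
without_b = scores.drop("B", axis=1)
without_row0 = scores.drop("row0", axis=0)
```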

Note: The term *index* is a bit overused in Pandas. Not only are the row labels referred to as the *index*, there are also `Index` objects and the indexing operator `[]`.
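The three meanings of "index" can be seen side by side in a short sketch. The frame and its labels `x` and `y` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with explicit row labels "x" and "y".
df = pd.DataFrame({"A": ["0A", "1A"], "B": ["0B", "1B"]}, index=["x", "y"])

# 1. "index" as the row labels of a DataFrame.
row_labels = df.index

# 2. `Index` as the class that stores the labels of either axis.
rows_are_index = isinstance(df.index, pd.Index)
columns_are_index = isinstance(df.columns, pd.Index)

# 3. The indexing operator [], which selects a column by its label.
column_a = df["A"]

# Label-based row lookup goes through .loc rather than plain [].
row_x = df.loc["x"]
```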

It’s important to understand the following Pandas objects to be able to manipulate a DataFrame.

### pandas.DataFrame

Two-dimensional, size-mutable tabular data in which each column can potentially have a different data type. It contains `Index` objects for the axis labels (row labels and column labels). The data can be thought of as a dictionary-like container for `Series` objects. It supports arithmetic operations on both row data and column data.

    In [1]: import pandas

    In [2]: df = pandas.DataFrame(data = [['0A', '0B', '0C'],
       ...:                               ['1A', '1B', '1C'],
       ...:                               ['2A', '2B', '2C'],
       ...:                               ['3A', '3B', '3C'],
       ...:                               ['4A', '4B', '4C']],
       ...:                       columns=['A', 'B', 'C'])

    In [3]: df
    Out[3]:
        A   B   C
    0  0A  0B  0C
    1  1A  1B  1C
    2  2A  2B  2C
    3  3A  3B  3C
    4  4A  4B  4C

    In [4]: type(df)
    Out[4]: pandas.core.frame.DataFrame
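Arithmetic on column data and row data is easier to see with numbers than with the string cells above. The following is a rough sketch using a made-up `prices` frame; the column names are assumptions for the example only.

```python
import pandas as pd

# Hypothetical numeric data.
prices = pd.DataFrame({"qty": [2, 3], "unit_price": [1.5, 4.0]})

# Column arithmetic: element-wise product of two columns, stored as a new column.
prices["total"] = prices["qty"] * prices["unit_price"]

# Row arithmetic: subtract the first row from every row.
# The row is a Series whose labels are matched against the column labels.
diff_from_first = prices - prices.iloc[0]
```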

### pandas.Index

An immutable sequence used for indexing and alignment. The row and column labels (the axes) are encapsulated in `Index` objects.

    In [5]: df.index
    Out[5]: RangeIndex(start=0, stop=5, step=1)

    In [6]: type(df.index)
    Out[6]: pandas.core.indexes.range.RangeIndex

    In [7]: df.index.values
    Out[7]: array([0, 1, 2, 3, 4], dtype=int64)

    In [8]: type(df.index.values)
    Out[8]: numpy.ndarray

    In [9]: df.columns
    Out[9]: Index(['A', 'B', 'C'], dtype='object')

    In [10]: type(df.columns)
    Out[10]: pandas.core.indexes.base.Index

    In [11]: df.columns.values
    Out[11]: array(['A', 'B', 'C'], dtype=object)

    In [12]: type(df.columns.values)
    Out[12]: numpy.ndarray
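The "immutable" and "alignment" parts of that description can be illustrated with a small sketch. The Series `left` and `right` and their labels are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([["0A", "0B"], ["1A", "1B"]], columns=["A", "B"])

# Index objects do not support item assignment, so this raises TypeError.
try:
    df.columns[0] = "X"
    mutated = True
except TypeError:
    mutated = False

# To relabel, build a new set of labels instead, for example with rename().
renamed = df.rename(columns={"A": "X"})

# Alignment: values are matched by label, not by position.
left = pd.Series([1, 2], index=["a", "b"])
right = pd.Series([10, 20], index=["b", "a"])
total = left + right
```

Here `total["a"]` combines the two values labeled `a` (1 and 20), even though they sit at different positions in the two Series.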

### pandas.Series

Encapsulation of a one-dimensional `numpy.ndarray` containing the column data, together with the row labels and the column label.

    In [13]: srs = df['A']

    In [14]: srs
    Out[14]:
    0    0A
    1    1A
    2    2A
    3    3A
    4    4A
    Name: A, dtype: object

    In [15]: type(srs)
    Out[15]: pandas.core.series.Series

    In [16]: srs.index
    Out[16]: RangeIndex(start=0, stop=5, step=1)

    In [17]: type(srs.index)
    Out[17]: pandas.core.indexes.range.RangeIndex

    In [18]: srs.name
    Out[18]: 'A'

    In [19]: type(srs.name)
    Out[19]: str

    In [20]: srs.values
    Out[20]: array(['0A', '1A', '2A', '3A', '4A'], dtype=object)

    In [21]: type(srs.values)
    Out[21]: numpy.ndarray
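Going the other way, a Series can be assembled from the same three parts: the data, the row labels and the name. A minimal sketch, with the variable names chosen only for illustration:

```python
import numpy as np
import pandas as pd

# The three ingredients of a Series: data, row labels, and a name.
values = np.array(["0A", "1A", "2A"], dtype=object)
srs = pd.Series(values, index=[0, 1, 2], name="A")

# to_frame() wraps the Series in a one-column DataFrame;
# the Series name becomes the column label.
frame = srs.to_frame()

# to_numpy() exposes the underlying data as a numpy.ndarray again.
raw = srs.to_numpy()
```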

### numpy.ndarray

When you see a reference to *array* in the Pandas context, it almost always means `numpy.ndarray`, as Python does not have a comparable built-in array data structure.
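The difference between an `ndarray` and a built-in `list` shows up as soon as you apply an operator. A short sketch with made-up values:

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array(numbers_list)

# On a list, * repeats the sequence.
doubled_list = numbers_list * 2

# On an ndarray, * is applied element by element.
doubled_array = numbers_array * 2

# An ndarray also carries a shape and a single dtype for all elements.
print(numbers_array.shape, numbers_array.dtype)
```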

### list (built-in)

`list` is a native Python mutable sequence, typically used to store collections of homogeneous items.

# Conversion

Translating data between the types above can look intimidating, but fortunately there are some convenient methods for conversion.
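As a hedged sketch of what such conversions can look like, the example below moves between `DataFrame`, `Index`, `Series`, `numpy.ndarray` and `list` using `tolist()`, `to_numpy()` and the `Series` and `DataFrame` constructors. The variable names are illustrative, and this is one possible set of conversions rather than an exhaustive list.

```python
import pandas as pd

df = pd.DataFrame([["0A", "0B"], ["1A", "1B"]], columns=["A", "B"])

# Index -> list
column_labels = df.columns.tolist()
row_labels = df.index.tolist()

# DataFrame -> Series -> ndarray or list
series_a = df["A"]
array_a = series_a.to_numpy()
list_a = series_a.tolist()

# DataFrame -> 2-D ndarray -> list of rows
array_2d = df.to_numpy()
rows = array_2d.tolist()

# list -> Series, and list of rows -> DataFrame
series_from_list = pd.Series(list_a, name="A")
df_from_rows = pd.DataFrame(rows, columns=column_labels)
```

Round-tripping through `rows` and `column_labels` rebuilds a frame with the same data and labels as `df`.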

# Conclusion

Understanding the data structure will allow you to navigate the Pandas APIs beyond the basic operations.