Data types in Python, Numpy and Pandas

Overview

Python defines only one type of a particular data class (there is only one integer type, one floating-point type, etc.). This can be convenient in applications that don’t need to be concerned with all the ways data can be represented in a computer. For scientific computing, however, more control is often needed.

NumPy supports a much greater variety of numerical types than Python. The primitive types supported are tied closely to those in the C language.
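
As a rough illustration of that variety, the sketch below prints the storage size of a handful of NumPy numeric types. The selection of types is arbitrary and chosen only for illustration.

```python
import numpy as np

# Python has a single int type and a single float type;
# NumPy offers several fixed-width variants of each.
numeric_types = (np.int8, np.int16, np.int32, np.int64, np.float32, np.float64)

# Map each type name to the number of bytes one element occupies.
sizes = {t.__name__: np.dtype(t).itemsize for t in numeric_types}

for name, nbytes in sizes.items():
    print(f"{name:<8} {nbytes} byte(s)")
```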

Pandas for the most part uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns]. Pandas adds a few of its own data types but the discussion here will be limited to the numpy datatypes as they are most common.

Since Pandas uses the Numpy data types, querying a column's data type actually returns a numpy.dtype. Because numpy.dtype has to support nuanced types for scientific computing, ordinary Pandas users may find it overwhelming and, at times, confusing.
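
A minimal sketch of that behaviour, assuming a small numeric-only frame (the name df_demo is made up for this illustration):

```python
import numpy as np
import pandas as pd

# A throwaway frame with NumPy-backed numeric columns.
df_demo = pd.DataFrame({'Int': [1, 2, 3], 'Float': [10.0, 20.0, 30.0]})

# Querying the column's data type hands back a NumPy dtype object.
float_dtype = df_demo['Float'].dtype

print(type(float_dtype))
print(isinstance(float_dtype, np.dtype))
print(float_dtype == np.float64)
```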

Example

import pandas as pd
import numpy as np

data = {'Bool':[True, False, True],
        'Int':[1, 2, 3],
        'Float':[10.0, 20.0, 30.0],
        'String':['A', 'B', 'C'],
        'Date':['2018-01-21', '2019-02-22', '2020-03-23']}
df = pd.DataFrame(data)
    Bool  Int  Float String        Date
0   True    1   10.0      A  2018-01-21
1  False    2   20.0      B  2019-02-22
2   True    3   30.0      C  2020-03-23
df.dtypes
Bool        bool
Int         int64
Float       float64
String      object
Date        object
dtype: object

Numpy data type objects (dtype)

A data type object is an instance of the numpy.dtype class. It describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted.
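
To make "how the bytes should be interpreted" concrete, here is a small sketch that reads the same eight bytes under two different dtypes. The byte values are arbitrary and picked only for illustration.

```python
import numpy as np

# Eight raw bytes; nothing in them says what they mean.
raw = bytes([1, 0, 0, 0, 2, 0, 0, 0])

# Interpreted as eight 1-byte unsigned integers.
as_bytes = np.frombuffer(raw, dtype=np.uint8)

# Interpreted as two little-endian 4-byte signed integers.
as_ints = np.frombuffer(raw, dtype='<i4')

print(as_bytes)
print(as_ints)
```

The memory is identical in both cases; only the dtype changes what each item looks like.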

Following is the hierarchy of type objects representing the array data types. Though it is quite an elaborate hierarchy, it’s manageable as long as you are aware of it. However, it gets more confusing due to the numerous aliases available for the concrete types.

Figure: Numpy data types hierarchy

The following attributes are useful in identifying a specific datatype.

numpy.dtype.char
A unique character code for each of the 21 different built-in types.

numpy.dtype.kind
A character code (one of biufcmMOSUV) identifying the general kind of data.

numpy.dtype.name
A bit-width name for this data-type.

numpy.dtype.str
The array-protocol typestring of this data-type object. The basic string format consists of 3 parts.
 • character describing the byteorder of the data (< for little-endian, > for big-endian, | for not-relevant)
 • character code giving the basic type of the array
 • integer providing the number of bytes the type uses
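
The three parts can be pulled apart with plain string slicing. The helper below, split_typestring, is a hypothetical name introduced only for this sketch and is not part of Numpy.

```python
import numpy as np

def split_typestring(dtype_like):
    """Split dtype.str into (byte order, type code, remainder).

    Hypothetical helper for illustration. The remainder is the byte count,
    possibly followed by a unit suffix such as '[ns]' for datetime types.
    """
    typestr = np.dtype(dtype_like).str
    return typestr[0], typestr[1], typestr[2:]

print(split_typestring('<f8'))     # explicit little-endian 8-byte float
print(split_typestring(np.bool_))  # single-byte type, byte order not relevant
```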

numpy.dtype.type
The underlying numpy class or one of its aliases.

The various attributes can be examined by creating a Numpy scalar and inspecting its dtype.

In [1]: import numpy
In [2]: x = numpy.float_(1)

In [3]: print(x)
1.0

In [4]: print(x.dtype.char)
d

In [5]: print(x.dtype.kind)
f

In [6]: print(x.dtype.name)
float64

In [7]: print(x.dtype.str)
<f8

In [8]: print(x.dtype.type)
<class 'numpy.float64'>

The table below maps a few Python data types to their Numpy counterparts (and their attributes).

Python     numpy class        char  kind  name             str      type
--------------------------------------------------------------------------------------------
bool       numpy.bool_        ?     b     bool             |b1      <class 'numpy.bool_'>
           numpy.byte         b     i     int8             |i1      <class 'numpy.int8'>
           numpy.short        h     i     int16            <i2      <class 'numpy.int16'>
int        numpy.int_         l     i     int32            <i4      <class 'numpy.int32'>
float      numpy.float_       d     f     float64          <f8      <class 'numpy.float64'>
datetime   numpy.datetime64   M     M     datetime64[ns]   <M8[ns]  <class 'numpy.datetime64'>
timedelta  numpy.timedelta64  m     m     timedelta64[ns]  <m8[ns]  <class 'numpy.timedelta64'>
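
A table like this can be regenerated on any machine, which is useful because some entries (the platform integer in particular) vary by platform and Numpy version. The sketch below does so for a few fixed-width types; the list of types is an arbitrary choice for illustration.

```python
import numpy as np

# Types whose attributes do not depend on the platform's default integer size.
examples = [np.bool_, np.int8, np.int16, np.float64,
            'datetime64[ns]', 'timedelta64[ns]']

rows = []
for spec in examples:
    dt = np.dtype(spec)
    # Collect the identifying attributes described above.
    rows.append((dt.char, dt.kind, dt.name, dt.str))
    print(f"{dt.char:<5} {dt.kind:<5} {dt.name:<16} {dt.str}")
```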

repr and str

repr returns a string containing a printable representation of an object.
str returns the string version of the object.
For a dtype, the two are implemented differently, and each builds its string from a different combination of the attributes above.

The shell, console and debugger may invoke one of them for the output, so the output may look different for the same object. This can be quite confusing if you are not aware of it. For example:

In [1]: dt = np.datetime64("1980")

In [2]: repr(dt.dtype)
Out[2]: "dtype('<M8[Y]')"

In [3]: str(dt.dtype)
Out[3]: 'datetime64[Y]'

Numpy data type aliases

As you have seen, there are numerous Numpy data types. There are many more aliases for these types, which can be a source of confusion.

  • The concrete types are shown in the class hierarchy
  • These concrete types are mostly based on the types available in the C language that CPython is written in, with several additional types compatible with Python’s types.
  • There are aliases (without the trailing underscore) that are exactly the Python built-in types. These are not numpy.dtype and hence do not have the characteristic attributes. (NumPy 1.20 deprecated these aliases and NumPy 1.24 removed them.)
  • Some of the types are essentially equivalent to fundamental Python types and many inherit from them, as well as from the generic numpy type.
  • Along with their (mostly) C-derived names, there are aliases that use bit width convention.
Python      Numpy exact alias         Numpy equivalent          Numpy fixed
built-in    to Python built-in        to Python built-in        width alias
----------------------------------------------------------------------------
bool        numpy.bool                numpy.bool_               
int         numpy.int                 numpy.int_                numpy.int32
float       numpy.float               numpy.float_              numpy.float64
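
The relationship between the built-in types and their Numpy equivalents can be checked directly. The following is a minimal sketch that sticks to names available across Numpy releases (float, bool, numpy.float64, numpy.bool_):

```python
import numpy as np

# A Python built-in passed to np.dtype resolves to the equivalent Numpy dtype.
float_matches = np.dtype(float) == np.dtype(np.float64)
bool_matches = np.dtype(bool) == np.dtype(np.bool_)

# numpy.float64 inherits from the Python float type.
inherits_from_float = issubclass(np.float64, float)

# A plain Python float carries no dtype; a Numpy scalar does.
python_has_dtype = hasattr(1.0, 'dtype')
numpy_has_dtype = hasattr(np.float64(1.0), 'dtype')

print(float_matches, bool_matches, inherits_from_float)
print(python_has_dtype, numpy_has_dtype)
```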

All the available aliases can be listed with numpy.sctypeDict, a dictionary that maps every alias to its scalar type.

In [1]: len(numpy.sctypeDict)
Out[1]: 152

In [2]: numpy.sctypeDict
Out[2]:
{'?': numpy.bool_,
 0: numpy.bool_,
 'byte': numpy.int8,
 'b': numpy.int8,
 1: numpy.int8,
 'ubyte': numpy.uint8,
 'B': numpy.uint8,
 2: numpy.uint8,
 'short': numpy.int16,
 'h': numpy.int16,
 3: numpy.int16,
 'ushort': numpy.uint16,
 'H': numpy.uint16,
 4: numpy.uint16,
        :
        :
        :
 'bool_': numpy.bool_,
 'bytes_': numpy.bytes_,
 'string_': numpy.bytes_,
 'unicode_': numpy.str_,
 'object_': numpy.object_,
 'str_': numpy.str_,
 'int': numpy.int32,
 'float': numpy.float64,
 'complex': numpy.complex128,
 'bool': numpy.bool_,
 'object': numpy.object_,
 'str': numpy.str_,
 'bytes': numpy.bytes_,
 'a': numpy.bytes_}
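
Rather than scanning the whole dictionary, it can be filtered for the aliases of one type. This is a sketch assuming numpy.sctypeDict is available in the installed Numpy; the exact set of keys differs between versions.

```python
import numpy as np

# Collect every key in sctypeDict that maps to numpy.float64.
# Keys may be strings or integers depending on the version, hence str().
float64_aliases = sorted(
    str(key) for key, value in np.sctypeDict.items() if value is np.float64
)

print(float64_aliases)
```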

A specific type can be confirmed using the isinstance() function.

In [1]: x = np.float64(1.0)

In [2]: isinstance(x, np.float64)
Out[2]: True

In [3]: isinstance(x, np.float32)
Out[3]: False

In [4]: isinstance(x, np.float_)
Out[4]: True

In [5]: isinstance(x, np.float)
Out[5]: True

The types, aliases and hierarchy can be inspected with the MRO (Method Resolution Order).

In [1]: import inspect

In [2]: inspect.getmro(numpy.float)
Out[2]: (float, object)

In [3]: inspect.getmro(numpy.float_)
Out[3]:
(numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object)

The types, aliases and hierarchy can also be accessed more directly.

In [4]: numpy.float_.__mro__
Out[4]:
(numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object)
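
When only a yes/no answer about a type's place in the hierarchy is needed, numpy.issubdtype can be used instead of reading the MRO. A minimal sketch:

```python
import numpy as np

# float64 sits under the abstract 'floating' type in the hierarchy...
is_floating = np.issubdtype(np.float64, np.floating)

# ...but not under 'integer'.
is_integer = np.issubdtype(np.float64, np.integer)

# Both integer and floating types sit under 'number'.
is_number = np.issubdtype(np.int32, np.number)

print(is_floating, is_integer, is_number)
```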

Example continued

Now that we are aware of all the “coded” attributes of a dtype, examining the dtypes in a dataframe should look more meaningful. These dtypes are coming from the underlying numpy.ndarray in the pandas.Series columns of the pandas.DataFrame.

def print_dtypes_attributes(df):
    print('{:<7} {:<5} {:<5} {:<15} {:<8} {}'.format('column', 'char', 'kind', 'name', 'str', 'type'))
    print('-------------------------------------------------------------------------')
    for col, dt in zip(df.columns, df.dtypes):
        print('{:<7} {:<5} {:<5} {:<15} {:<8} {}'.format(col, dt.char, dt.kind, dt.name, dt.str, str(dt.type)))
print_dtypes_attributes(df)
column  char  kind  name            str      type
-------------------------------------------------------------------------
Bool    ?     b     bool            |b1      <class 'numpy.bool_'>
Int     q     i     int64           <i8      <class 'numpy.int64'>
Float   d     f     float64         <f8      <class 'numpy.float64'>
String  O     O     object          |O       <class 'numpy.object_'>
Date    O     O     object          |O       <class 'numpy.object_'>
df['Int'] = df['Int'].astype(np.int32)
df['Date'] = pd.to_datetime(df['Date'])

print_dtypes_attributes(df)
column  char  kind  name            str      type
-------------------------------------------------------------------------
Bool    ?     b     bool            |b1      <class 'numpy.bool_'>
Int     l     i     int32           <i4      <class 'numpy.int32'>
Float   d     f     float64         <f8      <class 'numpy.float64'>
String  O     O     object          |O       <class 'numpy.object_'>
Date    M     M     datetime64[ns]  <M8[ns]  <class 'numpy.datetime64'>
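
One practical use of these attributes is grouping columns by their kind code. The sketch below rebuilds the example frame under the name frame (chosen here to keep the block self-contained) and collects column names per kind.

```python
import pandas as pd

data = {'Bool': [True, False, True],
        'Int': [1, 2, 3],
        'Float': [10.0, 20.0, 30.0],
        'String': ['A', 'B', 'C'],
        'Date': ['2018-01-21', '2019-02-22', '2020-03-23']}
frame = pd.DataFrame(data)
frame['Date'] = pd.to_datetime(frame['Date'])

# Group column names by the one-character kind code of their dtype.
columns_by_kind = {}
for col, dt in frame.dtypes.items():
    columns_by_kind.setdefault(dt.kind, []).append(col)

print(columns_by_kind)
```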

Array in Python

Unlike many programming languages, Python does not have a built-in array data structure; the built-in list behaves much like a typical array.
In this article, any reference to an array means a Numpy array: an N-dimensional array implemented by the numpy.ndarray class. So array, ndarray, Numpy array and numpy.ndarray are all synonymous.
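
One difference worth seeing is that a list keeps each element's own Python type, while an ndarray stores every element under a single dtype. A small sketch with arbitrary values:

```python
import numpy as np

# A list can hold elements of different Python types.
mixed = [1, 2.5, True]
element_types = [type(v).__name__ for v in mixed]

# Converting to an ndarray promotes everything to one common dtype.
arr = np.array(mixed)

print(element_types)
print(arr.dtype)
print(arr)
```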

numpy.ndarray is always associated with a numpy.dtype and never with a Python built-in, though the two may be equivalent. Even if you specify a Python built-in type, the array will be of a compatible numpy.dtype.

In [1]: x = np.array([[1, 2, 3], [4, 5, 6]], int)

In [2]: x.dtype
Out[2]: dtype('int32')

Since a pandas.DataFrame is a set of pandas.Series, each of which has an underlying numpy.ndarray, pandas.DataFrame.dtypes will always contain Numpy-specific dtypes and never Python types.
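
This can be checked with a frame whose columns are requested with Python built-in types. The sketch assumes Numpy-backed columns only (no Pandas-specific data types), and the name typed is made up for the illustration.

```python
import numpy as np
import pandas as pd

# Columns requested with Python built-in types.
typed = pd.DataFrame({'x': pd.Series([1.5, 2.5], dtype=float),
                      'flag': pd.Series([True, False], dtype=bool)})

# Every reported dtype is a numpy.dtype instance, not the built-in itself.
all_numpy = all(isinstance(dt, np.dtype) for dt in typed.dtypes)

# The column's underlying array carries the same dtype.
x_values = typed['x'].to_numpy()

print(typed.dtypes)
print(all_numpy, x_values.dtype)
```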

References

Numpy: Array scalars
Numpy: Types
Numpy: dtype
Numpy: Array interface
Numpy: Date and Time
Pandas: dtypes