Overview
Example
Numpy data type objects (dtype)
repr and str
Numpy data type aliases
Example Continued
Array in Python
References
Overview
Python defines only one type of a particular data class (there is only one integer type, one floating-point type, etc.). This can be convenient in applications that don’t need to be concerned with all the ways data can be represented in a computer. For scientific computing, however, more control is often needed.
NumPy supports a much greater variety of numerical types than Python. The primitive types supported are tied closely to those in the C language.
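To make the contrast concrete, the sketch below lists a few of NumPy's fixed-width integer types next to their storage size and value range. The helper name describe_int_type is made up for this illustration and is not part of NumPy.

```python
import numpy as np

# Python has a single int type; NumPy offers several fixed-width ones.
int_types = [np.int8, np.int16, np.int32, np.int64]


def describe_int_type(int_type):
    """Return (name, bytes per item, min, max) for a NumPy integer type.

    Hypothetical helper for this article, built on np.dtype and np.iinfo.
    """
    dt = np.dtype(int_type)
    info = np.iinfo(int_type)
    return dt.name, dt.itemsize, int(info.min), int(info.max)


for int_type in int_types:
    print(describe_int_type(int_type))
```

Each type trades range for memory, which is the kind of control the built-in int does not expose.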
Pandas, for the most part, uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns]. Pandas adds a few data types of its own, but the discussion here is limited to the NumPy data types, as they are the most common.
Since Pandas uses the NumPy data types, querying a column's data type actually returns a numpy.dtype. Because numpy.dtype has to support nuanced types for scientific computing, it can overwhelm, and at times confuse, everyday Pandas users.
Example
import pandas as pd
import numpy as np

data = {'Bool': [True, False, True],
        'Int': [1, 2, 3],
        'Float': [10.0, 20.0, 30.0],
        'String': ['A', 'B', 'C'],
        'Date': ['2018-01-21', '2019-02-22', '2020-03-23']}
df = pd.DataFrame(data)
    Bool  Int  Float String        Date
0   True    1   10.0      A  2018-01-21
1  False    2   20.0      B  2019-02-22
2   True    3   30.0      C  2020-03-23
df.dtypes
Bool         bool
Int         int64
Float     float64
String     object
Date       object
dtype: object
Numpy data type objects (dtype)
A data type object is an instance of the numpy.dtype class. It describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted.
The type objects representing the array data types form a hierarchy. Though it is quite elaborate, it’s manageable as long as you are aware of it. It gets more confusing, however, due to the numerous aliases available for the concrete types.
The following attributes are useful in identifying a specific datatype.
numpy.dtype.char
A unique character code for each of the 21 different built-in types.
numpy.dtype.kind
A character code (one of biufcmMOSUV) identifying the general kind of data.
numpy.dtype.name
A bit-width name for this data-type.
numpy.dtype.str
The array-protocol typestring of this data-type object. The basic string format consists of 3 parts.
• character describing the byte order of the data (< for little-endian, > for big-endian, | for not-relevant)
• character code giving the basic type of the array
• integer providing the number of bytes the type uses
numpy.dtype.type
The underlying NumPy class or one of its aliases.
The various attributes can be examined by simply creating an instance of the dtype.
In [1]: import numpy

In [2]: x = numpy.float_(1)

In [3]: print(x)
1.0

In [4]: print(x.dtype.char)
d

In [5]: print(x.dtype.kind)
f

In [6]: print(x.dtype.name)
float64

In [7]: print(x.dtype.str)
<f8

In [8]: print(x.dtype.type)
<class 'numpy.float64'>
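The three parts of the typestring can also be pulled apart programmatically. The following is a minimal sketch, assuming the simple built-in types discussed here; split_typestring is a hypothetical helper, not a NumPy function, and the regular expression is only meant to cover typestrings such as <f8, |b1 or <M8[ns].

```python
import re

import numpy as np


def split_typestring(dtype_like):
    """Split dtype.str into (byte order, type code, size, unit suffix).

    Hypothetical helper: size is None when the typestring carries no
    number (as for the object dtype); unit is '' unless present.
    """
    typestring = np.dtype(dtype_like).str
    match = re.fullmatch(r"([<>|=])([A-Za-z?])(\d*)(\[.*\])?", typestring)
    byteorder, code, size, unit = match.groups()
    return byteorder, code, int(size) if size else None, unit or ""


# Explicit byte-order prefixes are used so the result does not depend
# on the machine running the example.
print(split_typestring(">i4"))
print(split_typestring(np.bool_))
print(split_typestring("<M8[ns]"))
```

Single-byte types such as bool report | because byte order has no meaning for them.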
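The table below maps a few Python data types to their NumPy counterparts (and their attributes). The int row reflects a platform where numpy.int_ is 32 bits wide; on most 64-bit Linux and macOS systems it is int64.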
| Python | numpy class | char | kind | name | str | type |
|---|---|---|---|---|---|---|
| bool | numpy.bool_ | ? | b | bool | \|b1 | class ‘numpy.bool_’ |
| | numpy.byte | b | i | int8 | \|i1 | class ‘numpy.int8’ |
| | numpy.short | h | i | int16 | <i2 | class ‘numpy.int16’ |
| int | numpy.int_ | l | i | int32 | <i4 | class ‘numpy.int32’ |
| float | numpy.float_ | d | f | float64 | <f8 | class ‘numpy.float64’ |
| datetime | numpy.datetime64 | M | M | datetime64[ns] | <M8[ns] | class ‘numpy.datetime64’ |
| timedelta | numpy.timedelta64 | m | m | timedelta64[ns] | <m8[ns] | class ‘numpy.timedelta64’ |
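Rows like these can be regenerated on any machine, which is useful because some values (the width of the default integer, the byte-order character) depend on the platform. The sketch below assumes a hypothetical helper named dtype_attributes that simply gathers the five attributes into a dictionary.

```python
import numpy as np


def dtype_attributes(type_or_name):
    """Collect the identifying attributes of a dtype in one dict.

    Hypothetical helper; accepts anything np.dtype() accepts.
    """
    dt = np.dtype(type_or_name)
    return {
        "char": dt.char,
        "kind": dt.kind,
        "name": dt.name,
        "str": dt.str,
        "type": dt.type,
    }


# Python built-ins, NumPy classes and dtype strings all resolve to a dtype.
candidates = (bool, np.int8, np.int16, int, float,
              "datetime64[ns]", "timedelta64[ns]")
for candidate in candidates:
    print(dtype_attributes(candidate))
```

Running it locally shows which of the values in the table carry over to your own platform.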
repr and str
repr returns a string containing a printable representation of an object.
str returns the string version of the object.
The dtype implementations of the two build their output strings from different combinations of the attributes.
The shell, console and debugger may invoke one of them for the output, so the output may look different for the same object. This can be quite confusing if you are not aware of it. For example:
In [1]: dt = np.datetime64("1980")

In [2]: repr(dt.dtype)
Out[2]: "dtype('<M8[Y]')"

In [3]: str(dt.dtype)
Out[3]: 'datetime64[Y]'
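To see where the two forms agree and where they diverge, it helps to print them side by side for a few dtypes. This is a small sketch; both_forms is a hypothetical helper name.

```python
import numpy as np


def both_forms(name):
    """Return (str form, repr form) of the dtype with the given name."""
    dt = np.dtype(name)
    return str(dt), repr(dt)


for name in ("float64", "int16", "bool", "datetime64[Y]"):
    str_form, repr_form = both_forms(name)
    print(f"str: {str_form:<16} repr: {repr_form}")
```

For the plain numeric types the repr just wraps the name, while for datetime64 it falls back to the typestring, as in the session above.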
Numpy data type aliases
As you have seen, there are numerous NumPy data types. But there are many more aliases to the data types, which can be a source of confusion.
- The concrete types are shown in the class hierarchy
- These concrete types are mostly based on the types available in the C language that CPython is written in, with several additional types compatible with Python’s types.
- There are aliases (without the trailing underscore) that are exactly the Python built-in types. These are not numpy.dtype and hence do not have the characteristic attributes. Aliases such as numpy.int and numpy.float were deprecated in NumPy 1.20 and removed in 1.24, so the examples that use them need an older release.
- Some of the types are essentially equivalent to fundamental Python types and many inherit from them, as well as from the generic NumPy type.
- Along with their (mostly) C-derived names, there are aliases that use a bit-width convention.
| Python built-in | Numpy exact alias to Python built-in | Numpy equivalent to Python built-in | Numpy fixed width alias |
|---|---|---|---|
| bool | numpy.bool | numpy.bool_ | |
| int | numpy.int | numpy.int_ | numpy.int32 |
| float | numpy.float | numpy.float_ | numpy.float64 |
All available aliases can be listed with numpy.sctypeDict, a dictionary that maps every alias to its scalar type.
In [1]: len(numpy.sctypeDict)
Out[1]: 152

In [2]: numpy.sctypeDict
Out[2]:
{'?': numpy.bool_, 0: numpy.bool_,
 'byte': numpy.int8, 'b': numpy.int8, 1: numpy.int8,
 'ubyte': numpy.uint8, 'B': numpy.uint8, 2: numpy.uint8,
 'short': numpy.int16, 'h': numpy.int16, 3: numpy.int16,
 'ushort': numpy.uint16, 'H': numpy.uint16, 4: numpy.uint16,
 : : :
 'bool_': numpy.bool_, 'bytes_': numpy.bytes_, 'string_': numpy.bytes_,
 'unicode_': numpy.str_, 'object_': numpy.object_, 'str_': numpy.str_,
 'int': numpy.int32, 'float': numpy.float64, 'complex': numpy.complex128,
 'bool': numpy.bool_, 'object': numpy.object_, 'str': numpy.str_,
 'bytes': numpy.bytes_, 'a': numpy.bytes_}
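Because the dictionary maps alias to type, it is often more useful read the other way round: given a type, which aliases point to it? A minimal sketch, assuming a hypothetical helper named aliases_of; the exact list it prints varies with the NumPy version and platform.

```python
import numpy as np


def aliases_of(scalar_type):
    """List every key in np.sctypeDict that maps to scalar_type.

    Hypothetical helper. Keys are converted to str before sorting
    because some NumPy versions mix string and integer keys.
    """
    return sorted(
        str(key)
        for key, value in np.sctypeDict.items()
        if value is scalar_type
    )


print(aliases_of(np.float64))
print(aliases_of(np.int16))
```

The output makes it easy to spot that a character code, a C-style name and a bit-width name can all refer to the same class.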
A specific type can be confirmed using the isinstance() function.
In [1]: x = np.float64(1.0)

In [2]: isinstance(x, np.float64)
Out[2]: True

In [3]: isinstance(x, np.float32)
Out[3]: False

In [4]: isinstance(x, np.float_)
Out[4]: True

In [5]: isinstance(x, np.float)
Out[5]: True
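The same check works against the abstract classes in the hierarchy, and np.issubdtype offers the equivalent test for dtypes rather than scalar values. The sketch below uses only names that exist in current NumPy releases; the dictionary labels are just descriptive strings.

```python
import numpy as np

x = np.float64(1.0)
y = np.float32(1.0)

checks = {
    # float64 is a concrete NumPy class and also subclasses Python's float.
    "x is np.float64": isinstance(x, np.float64),
    "x is Python float": isinstance(x, float),
    # float32 does not subclass Python's float ...
    "y is Python float": isinstance(y, float),
    # ... but both share the abstract np.floating parent.
    "y is np.floating": isinstance(y, np.floating),
    "y.dtype is a floating dtype": bool(np.issubdtype(y.dtype, np.floating)),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

Testing against np.floating or np.integer is usually more robust than testing against one concrete width.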
The types, aliases and hierarchy can be inspected with the MRO (Method Resolution Order).
In [1]: import inspect

In [2]: inspect.getmro(numpy.float)
Out[2]: (float, object)

In [3]: inspect.getmro(numpy.float_)
Out[3]:
(numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object)
The types, aliases and hierarchy can also be accessed more directly.
In [4]: numpy.float_.__mro__
Out[4]:
(numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object)
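The MRO also shows which NumPy scalar types inherit from a Python built-in and which do not. A minimal sketch, assuming Python 3 and a hypothetical helper named python_builtin_bases:

```python
import numpy as np

PYTHON_BUILTINS = (bool, int, float, complex, str, bytes)


def python_builtin_bases(scalar_type):
    """Return the Python built-in types found in scalar_type's MRO.

    Hypothetical helper for this article.
    """
    return [base for base in scalar_type.__mro__ if base in PYTHON_BUILTINS]


for scalar_type in (np.float64, np.float32, np.int64, np.complex128, np.str_):
    print(scalar_type.__name__, python_builtin_bases(scalar_type))
```

Only some types, such as float64 and complex128, carry a Python built-in in their ancestry, which is why isinstance checks against float succeed for float64 but not for float32.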
Example continued
Now that we are aware of all the “coded” attributes of a dtype, examining the dtypes in a dataframe should be more meaningful. These dtypes come from the numpy.ndarray underlying each pandas.Series column of the pandas.DataFrame.
def print_dtypes_attributes(df):
    # One row per column: the identifying attributes of its dtype.
    row = '{:<7} {:<5} {:<5} {:<15} {:<8} {}'
    print(row.format('column', 'char', 'kind', 'name', 'str', 'type'))
    print('-------------------------------------------------------------------------')
    for col, dt in zip(df.columns, df.dtypes):
        print(row.format(col, dt.char, dt.kind, dt.name, dt.str, str(dt.type)))
print_dtypes_attributes(df)
column  char  kind  name            str      type
-------------------------------------------------------------------------
Bool    ?     b     bool            |b1      <class 'numpy.bool_'>
Int     q     i     int64           <i8      <class 'numpy.int64'>
Float   d     f     float64         <f8      <class 'numpy.float64'>
String  O     O     object          |O       <class 'numpy.object_'>
Date    O     O     object          |O       <class 'numpy.object_'>
df['Int'] = df['Int'].astype(np.int32)
df['Date'] = pd.to_datetime(df['Date'])
print_dtypes_attributes(df)
column  char  kind  name            str      type
-------------------------------------------------------------------------
Bool    ?     b     bool            |b1      <class 'numpy.bool_'>
Int     l     i     int32           <i4      <class 'numpy.int32'>
Float   d     f     float64         <f8      <class 'numpy.float64'>
String  O     O     object          |O       <class 'numpy.object_'>
Date    M     M     datetime64[ns]  <M8[ns]  <class 'numpy.datetime64'>
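Of these attributes, kind is often the most practical one, because it stays the same regardless of bit width or platform. The sketch below groups the columns of a small frame by kind; columns_by_kind is a hypothetical helper, and the frame is a reduced variant of the example data with the dates already converted.

```python
import pandas as pd

df = pd.DataFrame({
    "Bool": [True, False, True],
    "Int": [1, 2, 3],
    "Float": [10.0, 20.0, 30.0],
    "Date": pd.to_datetime(["2018-01-21", "2019-02-22", "2020-03-23"]),
})


def columns_by_kind(frame):
    """Group column names by the kind character of their dtype.

    Hypothetical helper for this article.
    """
    grouped = {}
    for column, dt in frame.dtypes.items():
        grouped.setdefault(dt.kind, []).append(column)
    return grouped


print(columns_by_kind(df))
```

A check such as "kind in 'iuf'" is then enough to pick out the numeric columns without listing every concrete width.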
Array in Python
Unlike many programming languages, Python does not have a built-in array data structure. The built-in list has behavior very similar to a typical array.
Here, any reference to an array implies a NumPy array: an N-dimensional array implemented in the numpy.ndarray class. So array, ndarray, NumPy array and numpy.ndarray are all synonymous.
numpy.ndarray is always associated with a numpy.dtype and never with a Python built-in, though they may be equivalent. Even if you specify a Python built-in type, the array will be of a compatible numpy.dtype.
In [1]: x = np.array([[1, 2, 3], [4, 5, 6]], int)

In [2]: x.dtype
Out[2]: dtype('int32')
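The same substitution happens for the other Python built-ins. A small sketch, assuming a hypothetical helper named resolved_dtype; the width NumPy picks for int depends on the platform and NumPy version, so only its kind is relied on here.

```python
import numpy as np


def resolved_dtype(python_type):
    """Return the NumPy dtype an array gets when asked for python_type.

    Hypothetical helper for this article.
    """
    return np.array([], dtype=python_type).dtype


for python_type in (bool, int, float, complex):
    dt = resolved_dtype(python_type)
    # The result is always a numpy.dtype instance, never the built-in itself.
    print(python_type.__name__, "->", dt.name, isinstance(dt, np.dtype))
```

Whatever built-in is requested, the array ends up described by a numpy.dtype.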
Since a pandas.DataFrame is a set of pandas.Series, each with an underlying numpy.ndarray, pandas.DataFrame.dtypes reports NumPy-specific dtypes for those columns and never Python types.
References
Numpy: Array scalars
Numpy: Types
Numpy: dtype
Numpy: Array interface
Numpy: Date and Time
Pandas: dtypes