
Advanced uses of Pandas for data analysis
In this section we will consider some advanced Pandas use cases.
Hierarchical indexing
Hierarchical indexing lets us work with higher-dimensional data in a lower-dimensional structure by placing multiple index levels on a single axis:
>>> s8 = pd.Series(np.random.rand(8),
...                index=[['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
...                       [0, 1, 0, 1, 0, 1, 0, 1]])
>>> s8
a  0    0.721652
   1    0.297784
b  0    0.271995
   1    0.125342
c  0    0.444074
   1    0.948363
d  0    0.197565
   1    0.883776
dtype: float64
In the preceding example, we have a Series object that has two index levels. The object can be rearranged into a DataFrame using the unstack function, which pivots the inner index level into columns. The stack function performs the inverse operation:
>>> s8.unstack()
          0         1
a  0.721652  0.297784
b  0.271995  0.125342
c  0.444074  0.948363
d  0.197565  0.883776
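To make the relationship between the two functions concrete, the following minimal sketch round-trips a small Series through unstack and stack. It uses fixed values rather than random ones so that the result is reproducible. The variable names wide and tall are ours and do not appear in the example above.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones, so the result is reproducible.
s = pd.Series(
    np.arange(8, dtype=float),
    index=[['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [0, 1, 0, 1, 0, 1, 0, 1]],
)

wide = s.unstack()   # the inner index level becomes the columns
tall = wide.stack()  # the columns return to an inner index level

print(wide.shape)      # (4, 2)
print(tall.equals(s))  # True: stack undoes unstack for this Series
```

The round trip is lossless here because every outer label has the same set of inner labels. With ragged inner labels, unstack would introduce missing values.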
We can also create a DataFrame with a hierarchical index on both axes:
>>> df = pd.DataFrame(np.random.rand(12).reshape(4,3),
...                   index=[['a', 'a', 'b', 'b'], [0, 1, 0, 1]],
...                   columns=[['x', 'x', 'y'], [0, 1, 0]])
>>> df
            x                   y
            0         1         0
a 0  0.636893  0.729521  0.747230
  1  0.749002  0.323388  0.259496
b 0  0.214046  0.926961  0.679686
  1  0.013258  0.416101  0.626927
>>> df.index
MultiIndex(levels=[['a', 'b'], [0, 1]], labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> df.columns
MultiIndex(levels=[['x', 'y'], [0, 1]], labels=[[0, 0, 1], [0, 1, 0]])
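The nested lists passed as index and columns are converted to MultiIndex objects behind the scenes. As a sketch of a more explicit alternative, the indexes can be built first with the MultiIndex constructors, which also lets us name each level. The level names used here (group, trial, var, rep) are hypothetical labels chosen for illustration, and fixed values replace the random ones.

```python
import numpy as np
import pandas as pd

# Level names are hypothetical; they are not part of the original example.
rows = pd.MultiIndex.from_product([['a', 'b'], [0, 1]],
                                  names=['group', 'trial'])
cols = pd.MultiIndex.from_arrays([['x', 'x', 'y'], [0, 1, 0]],
                                 names=['var', 'rep'])

df = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)

print(df.index.nlevels, df.columns.nlevels)         # 2 2
print(list(df.index.names))                         # ['group', 'trial']
print(df.columns.get_level_values('var').tolist())  # ['x', 'x', 'y']
```

Named levels can then be referred to by name instead of by position in calls that take a level argument.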
The methods for getting or setting values or subsets of the data objects with multiple index levels are similar to those of the nonhierarchical case:
>>> df['x']
            0         1
a 0  0.636893  0.729521
  1  0.749002  0.323388
b 0  0.214046  0.926961
  1  0.013258  0.416101
>>> df[[0]]
            x
            0
a 0  0.636893
  1  0.749002
b 0  0.214046
  1  0.013258
>>> df.ix['a', 'x']
          0         1
0  0.636893  0.729521
1  0.749002  0.323388
>>> df.ix['a','x'].ix[1]
0    0.749002
1    0.323388
Name: 1, dtype: float64
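The ix indexer used above was deprecated in pandas 0.20 and removed in 1.0. On newer releases, the same kinds of selection can be written with loc and xs. The following is a sketch under that assumption, using a frame with the same shape as df but filled with fixed values. The variable names are ours.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[['a', 'a', 'b', 'b'], [0, 1, 0, 1]],
    columns=[['x', 'x', 'y'], [0, 1, 0]],
)

block = df.loc['a', 'x']           # rows under 'a', columns under 'x'
row = df.loc[('a', 1)]             # one row, addressed by its full key
cell = df.loc[('a', 1), ('x', 0)]  # a single value
inner0 = df.xs(0, level=1)         # every row whose inner level equals 0

print(block.values.tolist())  # [[0, 1], [3, 4]]
print(row.tolist())           # [3, 4, 5]
print(cell)                   # 3
print(inner0.index.tolist())  # ['a', 'b']
```

Passing full keys as tuples, as in the row and cell lookups, avoids ambiguity about whether a value refers to a second index level or to a column.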
After grouping data into multiple index levels, we can also use most of the descriptive statistics functions, which accept a level option that specifies the index level to aggregate over:
>>> df.std(level=1)
          x                   y
          0         1         0
0  0.298998  0.139611  0.047761
1  0.520250  0.065558  0.259813
>>> df.std(level=0)
          x                   y
          0         1         0
a  0.079273  0.287180  0.344880
b  0.141979  0.361232  0.037306
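The level argument to reductions such as std was deprecated in pandas 1.3 and is no longer accepted from 2.0 onward. A sketch of the equivalent computation groups by the index level first and then aggregates. It assumes a frame shaped like df above, filled with fixed values so that the results can be checked by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12, dtype=float).reshape(4, 3),
    index=[['a', 'a', 'b', 'b'], [0, 1, 0, 1]],
    columns=[['x', 'x', 'y'], [0, 1, 0]],
)

# Equivalent of df.std(level=0): one row per outer label.
by_outer = df.groupby(level=0).std()
# Equivalent of df.std(level=1): one row per inner label.
by_inner = df.groupby(level=1).std()

print(by_outer.index.tolist())  # ['a', 'b']
print(by_inner.index.tolist())  # [0, 1]
```

Within each outer group the two values in a column differ by 3, so every entry of by_outer is the sample standard deviation sqrt(4.5). Within each inner group they differ by 6, so every entry of by_inner is sqrt(18).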
The Panel data
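The Panel is another data structure in Pandas, designed for three-dimensional data. However, it is less frequently used than the Series or the DataFrame, and it was deprecated in pandas 0.20 and removed in 0.25, so the examples in this section require an older release. You can think of a Panel as a container of DataFrame objects that share the same axes. We can create a Panel object from a 3D ndarray or a dictionary of DataFrame objects: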
# create a Panel from 3D ndarray
>>> panel = pd.Panel(np.random.rand(2, 4, 5), items = ['item1', 'item2'])
>>> panel
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4
>>> df1 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a','b','c'])
>>> df1
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a','b','c'])
>>> df2
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
# create another Panel from a dict of DataFrame objects
>>> panel2 = pd.Panel({'item1': df1, 'item2': df2})
>>> panel2
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 0 to 3
Minor_axis axis: a to c
Each item in a Panel is a DataFrame. We can select an item by its name:
>>> panel2['item1']
   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
Alternatively, if we want to select data by axis label or position, we can use the ix indexer, just as with a Series or DataFrame:
>>> panel2.ix[:, 1:3, ['b', 'c']]
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 1 to 3
Minor_axis axis: b to c
>>> panel2.ix[:, 2, :]
   item1  item2
a      6      6
b      7      7
c      8      8
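For readers on a pandas release that no longer ships Panel, a common substitute is a single DataFrame with a hierarchical row index, which ties this section back to the previous one. The following sketch assumes the same df1 and df2 as above and uses pd.concat with a dictionary, so that the dictionary keys become the outer index level. The names stacked, item1 and row2 are ours.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# Stack the frames vertically; the dict keys form the outer index level.
stacked = pd.concat({'item1': df1, 'item2': df2})

# Counterpart of panel2['item1']: all rows under the outer label.
item1 = stacked.loc['item1']

# Counterpart of panel2.ix[:, 2, :]: row 2 of every item, one column per item.
row2 = stacked.xs(2, level=1).T

print(stacked.shape)          # (7, 3)
print(row2.columns.tolist())  # ['item1', 'item2']
print(row2['item1'].tolist()) # [6, 7, 8]
```

One difference from the Panel version is that the frames are not aligned to a common row index: item2 keeps its three rows, whereas the Panel reported four major_axis labels for both items.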