Python:Data Analytics and Visualization

Advanced uses of Pandas for data analysis

In this section we will consider some advanced Pandas use cases.

Hierarchical indexing

Hierarchical indexing provides us with a way to work with higher dimensional data in a lower dimension by structuring the data object into multiple index levels on an axis:

>>> import numpy as np
>>> import pandas as pd
>>> s8 = pd.Series(np.random.rand(8), index=[['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'], [0, 1, 0, 1, 0, 1, 0, 1]])
>>> s8
a 0 0.721652
 1 0.297784
b 0 0.271995
 1 0.125342
c 0 0.444074
 1 0.948363
d 0 0.197565
 1 0.883776
dtype: float64

In the preceding example, we have a Series object with two index levels. The object can be rearranged into a DataFrame using the unstack method, which moves the inner index level into the columns. The inverse operation is stack, which moves the columns back into an inner index level:

>>> s8.unstack()
 0 1
a 0.721652 0.297784
b 0.271995 0.125342
c 0.444074 0.948363
d 0.197565 0.883776
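
The round trip between the two layouts can be sketched as follows. This is a minimal illustration, not part of the original example: it uses fixed values instead of np.random.rand so that the result is reproducible, and the names s, wide, and tall are chosen only for this sketch.

```python
import numpy as np
import pandas as pd

# Fixed values (an assumption for reproducibility) with the same
# two-level index layout as s8.
s = pd.Series(
    np.arange(8, dtype=float),
    index=[list("aabbccdd"), [0, 1] * 4],
)

wide = s.unstack()   # the inner index level becomes the columns
tall = wide.stack()  # the columns move back into an inner index level

print(wide.shape)    # (4, 2)
print(tall)
```

Printing tall should show the same eight labelled values as s, since every combination of outer and inner label is present and nothing has to be filled in.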

We can also create a DataFrame with a hierarchical index on both axes:

>>> df = pd.DataFrame(np.random.rand(12).reshape(4,3),
 index=[['a', 'a', 'b', 'b'],
 [0, 1, 0, 1]],
 columns=[['x', 'x', 'y'], [0, 1, 0]])
>>> df
 x y
 0 1 0
a 0 0.636893 0.729521 0.747230
 1 0.749002 0.323388 0.259496
b 0 0.214046 0.926961 0.679686
 1 0.013258 0.416101 0.626927
>>> df.index
MultiIndex(levels=[['a', 'b'], [0, 1]],
 labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> df.columns
MultiIndex(levels=[['x', 'y'], [0, 1]],
 labels=[[0, 0, 1], [0, 1, 0]])
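
Instead of passing nested lists, the two indexes can also be built explicitly, which makes it easy to give each level a name. The sketch below assumes the constructors pd.MultiIndex.from_product and pd.MultiIndex.from_tuples; the level names group, trial, var, and rep are hypothetical labels invented for this illustration. Note that the printed form of a MultiIndex depends on the pandas version: the labels attribute shown above was renamed to codes in later releases.

```python
import numpy as np
import pandas as pd

# Row index: every combination of the outer and inner labels.
rows = pd.MultiIndex.from_product([["a", "b"], [0, 1]],
                                  names=["group", "trial"])
# Column index: listed pair by pair, because ('y', 1) is not present.
cols = pd.MultiIndex.from_tuples([("x", 0), ("x", 1), ("y", 0)],
                                 names=["var", "rep"])

# Fixed values instead of random ones, so the frame is reproducible.
frame = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                     index=rows, columns=cols)

print(frame.index.nlevels, frame.columns.nlevels)       # 2 2
print(list(frame.index.names))                          # ['group', 'trial']
print(frame.index.get_level_values("group").tolist())   # ['a', 'a', 'b', 'b']
```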

The methods for getting or setting values or subsets of the data objects with multiple index levels are similar to those of the nonhierarchical case:

>>> df['x']
 0 1
a 0 0.636893 0.729521
 1 0.749002 0.323388
b 0 0.214046 0.926961
 1 0.013258 0.416101
>>> df.iloc[:, [0]]
 x
 0
a 0 0.636893
 1 0.749002
b 0 0.214046
 1 0.013258
>>> df.loc['a', 'x']
 0 1
0 0.636893 0.729521
1 0.749002 0.323388
>>> df.loc['a', 'x'].loc[1]
0 0.749002
1 0.323388
Name: 1, dtype: float64
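
When a selection should be unambiguous about which level it addresses, full tuple keys and the xs method are useful. The following is a sketch under stated assumptions: the frame has the same index layout as df but holds fixed values, and the variable names cell, inner, outer_cols, and updated are chosen only for this example.

```python
import numpy as np
import pandas as pd

# Same index layout as df above, with fixed values for reproducibility.
df = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                  index=[["a", "a", "b", "b"], [0, 1, 0, 1]],
                  columns=[["x", "x", "y"], [0, 1, 0]])

# A complete key on both axes returns a single value.
cell = df.loc[("a", 1), ("x", 0)]
print(cell)    # 3.0

# xs takes a cross-section at a given level and drops that level.
inner = df.xs(1, level=1)                 # rows whose second level is 1
outer_cols = df.xs(0, axis=1, level=1)    # columns whose second level is 0
print(list(inner.index))                  # ['a', 'b']
print(list(outer_cols.columns))           # ['x', 'y']

# Setting works with the same keys; a copy keeps df itself unchanged.
updated = df.copy()
updated.loc[("a", 1), ("x", 0)] = -1.0
```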

Once the data is organized into multiple index levels, we can use the descriptive statistics methods that accept a level option, which specifies the index level to aggregate over:

>>> df.std(level=1)
 x y
 0 1 0
0 0.298998 0.139611 0.047761
1 0.520250 0.065558 0.259813
>>> df.std(level=0)
 x y
 0 1 0
a 0.079273 0.287180 0.344880
b 0.141979 0.361232 0.037306
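
In recent pandas releases the level argument of these reduction methods is no longer available, and the same per-level statistics are obtained by grouping on the index level first. The sketch below assumes a frame with the same index layout as df but fixed values; by_outer and by_inner are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Same index layout as df above, with fixed values for reproducibility.
df = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                  index=[["a", "a", "b", "b"], [0, 1, 0, 1]],
                  columns=[["x", "x", "y"], [0, 1, 0]])

# Group the rows by one index level, then reduce each group.
by_outer = df.groupby(level=0).std()   # one row per outer label: a, b
by_inner = df.groupby(level=1).std()   # one row per inner label: 0, 1

print(by_outer)
print(by_inner)
```

Each group here contains two rows, so every entry is the sample standard deviation of two values; the column hierarchy is left untouched.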

The Panel data

The Panel is another data structure for three-dimensional data in Pandas. However, it is less frequently used than the Series or the DataFrame. You can think of a Panel as a table of DataFrame objects. We can create a Panel object from a 3D ndarray or a dictionary of DataFrame objects:

# create a Panel from 3D ndarray
>>> panel = pd.Panel(np.random.rand(2, 4, 5),
 items = ['item1', 'item2'])
>>> panel
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 5 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 4

>>> df1 = pd.DataFrame(np.arange(12).reshape(4, 3), 
 columns=['a','b','c'])
>>> df1
 a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3), 
 columns=['a','b','c'])
>>> df2
 a b c
0 0 1 2
1 3 4 5
2 6 7 8
# create another Panel from a dict of DataFrame objects
>>> panel2 = pd.Panel({'item1': df1, 'item2': df2})
>>> panel2
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 0 to 3
Minor_axis axis: a to c

Each item in a Panel is a DataFrame. We can select an item by its name:

>>> panel2['item1']
 a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
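
The Panel class was removed from later pandas releases. Where it is not available, the same data can be held in a single DataFrame whose row index has one level for the item and one for the original row label, which ties back to the hierarchical indexing shown earlier. The following is only a sketch of that alternative, not the approach used in the examples above; the level names item and row and the variable names stacked and item1 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])

# The dict keys become the outer index level, one block of rows per item.
stacked = pd.concat({"item1": df1, "item2": df2}, names=["item", "row"])
print(stacked.shape)    # (7, 3)

# Selecting an item by name gives back the rows of the original frame.
item1 = stacked.loc["item1"]
print(item1)
```

Unlike the Panel, this layout does not pad the shorter frame: item2 simply contributes three rows instead of four.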

Alternatively, if we want to select data along an axis or by position, we can use the ix indexer, just as with a Series or DataFrame:

>>> panel2.ix[:, 1:3, ['b', 'c']]
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 2 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 1 to 3
Minor_axis axis: b to c
>>> panel2.ix[:, 2, :]
 item1 item2
a 6 6
b 7 7
c 8 8
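
For comparison, the two selections above can be approximated on a DataFrame with a two-level row index, using pd.IndexSlice and xs. This is a sketch under the assumption that the items are combined with pd.concat; the level names item and row and the variable names stacked, sub, and cross are hypothetical.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])
stacked = pd.concat({"item1": df1, "item2": df2}, names=["item", "row"])

# All items, row labels 1 through 3 (label slices include the end point),
# columns b and c.
sub = stacked.loc[pd.IndexSlice[:, 1:3], ["b", "c"]]
print(sub.shape)    # (5, 2)

# Row label 2 of every item, transposed so that the items are the columns.
cross = stacked.xs(2, level="row").T
print(cross)
```

The first selection has five rows rather than six because item2 has no row labelled 3.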