
Data munging
The arsenal of tools for data munging is huge, and while we will focus on Python we want to mention some useful tools as well. If they are available on your system and you expect to work a lot with data, they are worth learning.
One group of tools belongs to the UNIX tradition, which emphasizes text processing and as a consequence has, over the last four decades, developed many high-performance and battle-tested tools for dealing with text. Some common tools are: sed
, grep
, awk
, sort
, uniq
, tr
, cut
, tail
, and head
. They do very elementary things, such as filtering out lines (grep
) or columns (cut
) from files, replacing text (sed
, tr
) or displaying only parts of files (head
, tail
).
We want to demonstrate the power of these tools with a single example only.
Imagine you are handed the log files of a web server and you are interested in the distribution of the IP addresses.
Each line of the log file contains an entry in the common log server format (you can download this data set from http://ita.ee.lbl.gov/html/contrib/EPA-HTTP.html):
$ cat epa-html.txt wpbfl2-45.gate.net [29:23:56:12] "GET /Access/ HTTP/1.0" 200 2376ebaca.icsi.net [30:00:22:20] "GET /Info.html HTTP/1.0" 200 884
For instance, we want to know how often certain users have visited our site.
We are interested in the first column only, since this is where the IP address or hostname can be found. After that, we need to count the number of occurrences of each host and finally display the results in a friendly way.
The sort
| uniq -c
stanza is our workhorse here: it sorts the data first and uniq -c
will save the number of occurrences along with the value. The sort -nr | head -15
is our formatting part; we sort numerically (-n
) and in reverse (-r
), and keep only the top 15 entries.
Putting it all together with pipes:
$ cut -d ' ' -f 1 epa-http.txt | sort | uniq -c | sort -nr | head -15 294 sandy.rtptok1.epa.gov 292 e659229.boeing.com 266 wicdgserv.wic.epa.gov 263 keyhole.es.dupont.com 248 dwilson.pr.mcs.net 176 oea4.r8stw56.epa.gov 174 macip26.nacion.co.cr 172 dcimsd23.dcimsd.epa.gov 167 www-b1.proxy.aol.com 158 piweba3y.prodigy.com 152 wictrn13.dcwictrn.epa.gov 151 nntp1.reach.com 151 inetg1.arco.com 149 canto04.nmsu.edu 146 weisman.metrokc.gov
With one command, we get to convert a sequential server log into an ordered list of the most common hosts that visited our site. We also see that we do not seem to have large differences in the number of visits among our top users.
There are more little helpful tools of which the following are just a tiny selection:
csvkit
: This is the suite of utilities for working with CSV, the king of tabular file formatsjq
: This is a lightweight and flexible command-line JSON processorxmlstarlet
: This is a tool that supports XML queries with XPath, among other thingsq
: This runs SQL on text files
Where the UNIX command line ends, lightweight languages take over. You might be able to get an impression from text only, but your colleagues might appreciate visual representations, such as charts or pretty graphs, generated by matplotlib, much more.
Python and its data tools ecosystem are much more versatile than the command line, but for first explorations and simple operations the effectiveness of the command line is often unbeatable.
Cleaning data
Most real-world data will have some defects and therefore will need to go through a cleaning step first. We start with a small file. Although this file contains only four rows, it will allow us to demonstrate the process up to a cleaned data set:
$ cat small.csv 22,6.1 41,5.7 18,5.3* 29,NA
Note that this file has a few issues. The lines that contain values are all comma-separated, but we have missing (NA) and probably unclean (5.3*) values. We can load this file into a data frame, nevertheless:
>>> import pandas as pd >>> df = pd.read_csv("small.csv") >>> df 22 6.1 0 41 5.7 1 18 5.3* 2 29 NaN
Pandas used the first row as header
, but this is not what we want:
>>> df = pd.read_csv("small.csv", header=None) >>> df 0 1 0 22 6.1 1 41 5.7 2 18 5.3* 3 29 NaN
This is better, but instead of numeric values, we would like to supply our own column names:
>>> df = pd.read_csv("small.csv", names=["age", "height"]) >>> df age height 0 22 6.1 1 41 5.7 2 18 5.3* 3 29 NaN
The age
column looks good, since Pandas already inferred the intended type, but the height
cannot be parsed into numeric values yet:
>>> df.age.dtype dtype('int64') >>> df.height.dtype dtype('O')
If we try to coerce the height
column into float values, Pandas will report an exception:
>>> df.height.astype('float') ValueError: invalid literal for float(): 5.3*
We could use whatever value is parseable as a float and throw away the rest with the convert_objects
method:
>>> df.height.convert_objects(convert_numeric=True) 0 6.1 1 5.7 2 NaN 3 NaN Name: height, dtype: float64
If we know in advance the undesirable characters in our data set, we can augment the read_csv
method with a custom converter function:
>>> remove_stars = lambda s: s.replace("*", "") >>> df = pd.read_csv("small.csv", names=["age", "height"], converters={"height": remove_stars}) >>> df age height 0 22 6.1 1 41 5.7 2 18 5.3 3 29 NA
Now we can finally make the height column a bit more useful. We can assign it the updated version, which has the favored type:
>>> df.height = df.height.convert_objects(convert_numeric=True) >>> df age height 0 22 6.1 1 41 5.7 2 18 5.3 3 29 NaN
If we wanted to only keep the complete entries, we could drop any row that contains undefined values:
>>> df.dropna() age height 0 22 6.1 1 41 5.7 2 18 5.3
We could use a default height, maybe a fixed value:
>>> df.fillna(5.0) age height 0 22 6.1 1 41 5.7 2 18 5.3 3 29 5.0
On the other hand, we could also use the average of the existing values:
>>> df.fillna(df.height.mean()) age height 0 22 6.1 1 41 5.7 2 18 5.3 3 29 5.7
The last three data frames are complete and correct, depending on your definition of correct when dealing with missing values. Especially, the columns have the requested types and are ready for further analysis. Which of the data frames is best suited will depend on the task at hand.
Filtering
Even if we have clean and probably correct data, we might want to use only parts of it or we might want to check for outliers. An outlier is an observation point that is distant from other observations because of variability or measurement errors. In both cases, we want to reduce the number of elements in our data set to make it more relevant for further processing.
In this example, we will try to find potential outliers. We will use the Europe Brent Crude Oil Spot Price as recorded by the U.S. Energy Information Administration. The raw Excel data is available from http://www.eia.gov/dnav/pet/hist_xls/rbrted.xls (it can be found in the second worksheet). We cleaned the data slightly (the cleaning process is part of an exercise at the end of this chapter) and will work with the following data frame, containing 7160 entries, ranging from 1987 to 2015:
>>> df.head() date price 0 1987-05-20 18.63 1 1987-05-21 18.45 2 1987-05-22 18.55 3 1987-05-25 18.60 4 1987-05-26 18.63 >>> df.tail() date price 7155 2015-08-04 49.08 7156 2015-08-05 49.04 7157 2015-08-06 47.80 7158 2015-08-07 47.54 7159 2015-08-10 48.30
While many people know about oil prices – be it from the news or the filling station – let us forget anything we know about it for a minute. We could first ask for the extremes:
>>> df[df.price==df.price.min()] date price 2937 1998-12-10 9.1 >>> df[df.price==df.price.max()] date price 5373 2008-07-03 143.95
Another way to find potential outliers would be to ask for values that deviate most from the mean. We can use the np.abs
function to calculate the deviation from the mean first:
>>> np.abs(df.price - df.price.mean()) 0 26.17137 1 26.35137 ... 7157 2.99863 7158 2.73863 7159 3.49863
We can now compare this deviation from a multiple – we choose 2.5 – of the standard deviation:
>>> import numpy as np >>> df[np.abs(df.price - df.price.mean()) > 2.5 * df.price.std()] date price 5354 2008-06-06 132.81 5355 2008-06-09 134.43 5356 2008-06-10 135.24 5357 2008-06-11 134.52 5358 2008-06-12 132.11 5359 2008-06-13 134.29 5360 2008-06-16 133.90 5361 2008-06-17 131.27 5363 2008-06-19 131.84 5364 2008-06-20 134.28 5365 2008-06-23 134.54 5366 2008-06-24 135.37 5367 2008-06-25 131.59 5368 2008-06-26 136.82 5369 2008-06-27 139.38 5370 2008-06-30 138.40 5371 2008-07-01 140.67 5372 2008-07-02 141.24 5373 2008-07-03 143.95 5374 2008-07-07 139.62 5375 2008-07-08 134.15 5376 2008-07-09 133.91 5377 2008-07-10 135.81 5378 2008-07-11 143.68 5379 2008-07-14 142.43 5380 2008-07-15 136.02 5381 2008-07-16 133.31 5382 2008-07-17 134.16
We see that those few days in summer 2008 must have been special. Sure enough, it is not difficult to find articles and essays with titles like Causes and Consequences of the Oil Shock of 2007–08. We have discovered a trace to these events solely by looking at the data.
We could ask the above question for each decade separately. We first make our data frame look more like a time series:
>>> df.index = df.date >>> del df["date"] >>> df.head() price date 1987-05-20 18.63 1987-05-21 18.45 1987-05-22 18.55 1987-05-25 18.60 1987-05-26 18.63
We could filter out the eighties:
>>> decade = df["1980":"1989"] >>> decade[np.abs(decade.price - decade.price.mean()) > 2.5 * decade.price.std()] price date 1988-10-03 11.60 1988-10-04 11.65 1988-10-05 11.20 1988-10-06 11.30 1988-10-07 11.35
We observe that within the data available (1987–1989), the fall of 1988 exhibits a slight spike in the oil prices. Similarly, during the nineties, we see that we have a larger deviation, in the fall of 1990:
>>> decade = df["1990":"1999"] >>> decade[np.abs(decade.price - decade.price.mean()) > 5 * decade.price.std()] price date 1990-09-24 40.75 1990-09-26 40.85 1990-09-27 41.45 1990-09-28 41.00 1990-10-09 40.90 1990-10-10 40.20 1990-10-11 41.15
There are many more use cases for filtering data. Space and time are typical units: you might want to filter census data by state or city, or economical data by quarter. The possibilities are endless and will be driven by your project.
Merging data
The situation is common: you have multiple data sources, but in order to make statements about the content, you would rather combine them. Fortunately, Pandas' concatenation and merge functions abstract away most of the pain, when combining, joining, or aligning data. It does so in a highly optimized manner as well.
In a case where two data frames have a similar shape, it might be useful to just append one after the other. Maybe A
and B
are products and one data frame contains the number of items sold per product in a store:
>>> df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) >>> df1 A B 0 1 4 1 2 5 2 3 6 >>> df2 = pd.DataFrame({'A': [4, 5, 6], 'B': [7, 8, 9]}) >>> df2 A B 0 4 7 1 5 8 2 6 9 >>> df1.append(df2) A B 0 1 4 1 2 5 2 3 6 0 4 7 1 5 8 2 6 9
Sometimes, we won't care about the indices of the originating data frames:
>>> df1.append(df2, ignore_index=True) A B 0 1 4 1 2 5 2 3 6 3 4 7 4 5 8 5 6 9
A more flexible way to combine objects is offered by the pd.concat
function, which takes an arbitrary number of series, data frames, or panels as input. The default behavior resembles an append:
>>> pd.concat([df1, df2]) A B 0 1 4 1 2 5 2 3 6 0 4 7 1 5 8 2 6 9
The default concat
operation appends both frames along the rows – or index, which corresponds to axis 0. To concatenate along the columns, we can pass in the axis keyword argument:
>>> pd.concat([df1, df2], axis=1) A B A B 0 1 4 4 7 1 2 5 5 8 2 3 6 6 9
We can add keys to create a hierarchical index.
>>> pd.concat([df1, df2], keys=['UK', 'DE']) A B UK 0 1 4 1 2 5 2 3 6 DE 0 4 7 1 5 8 2 6 9
This can be useful if you want to refer back to parts of the data frame later. We use the ix
indexer:
>>> df3 = pd.concat([df1, df2], keys=['UK', 'DE']) >>> df3.ix["UK"] A B 0 1 4 1 2 5 2 3 6
Data frames resemble database tables. It is therefore not surprising that Pandas implements SQL-like join operations on them. What is positively surprising is that these operations are highly optimized and extremely fast:
>>> import numpy as np >>> df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': range(4)}) >>> df1 key value 0 A 0 1 B 1 2 C 2 3 D 3 >>> df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'], 'value': range(10, 14)}) >>> df2 key value 0 B 10 1 D 11 2 D 12 3 E 13
If we merge on key
, we get an inner join. This creates a new data frame by combining the column values of the original data frames based upon the join predicate, here the key
attribute is used:
>>> df1.merge(df2, on='key') key value_x value_y 0 B 1 10 1 D 3 11 2 D 3 12
A left, right and full join can be specified by the how
parameter:
>>> df1.merge(df2, on='key', how='left') key value_x value_y 0 A 0 NaN 1 B 1 10 2 C 2 NaN 3 D 3 11 4 D 3 12 >>> df1.merge(df2, on='key', how='right') key value_x value_y 0 B 1 10 1 D 3 11 2 D 3 12 3 E NaN 13 >>> df1.merge(df2, on='key', how='outer') key value_x value_y 0 A 0 NaN 1 B 1 10 2 C 2 NaN 3 D 3 11 4 D 3 12 5 E NaN 13
The merge methods can be specified with the how parameter. The following table shows the methods in comparison with SQL:

Reshaping data
We saw how to combine data frames but sometimes we have all the right data in a single data structure, but the format is impractical for certain tasks. We start again with some artificial weather data:
>>> df date city value 0 2000-01-03 London 6 1 2000-01-04 London 3 2 2000-01-05 London 4 3 2000-01-03 Mexico 3 4 2000-01-04 Mexico 9 5 2000-01-05 Mexico 8 6 2000-01-03 Mumbai 12 7 2000-01-04 Mumbai 9 8 2000-01-05 Mumbai 8 9 2000-01-03 Tokyo 5 10 2000-01-04 Tokyo 5 11 2000-01-05 Tokyo 6
If we want to calculate the maximum temperature per city, we could just group the data by city and then take the max
function:
>>> df.groupby('city').max() date value city London 2000-01-05 6 Mexico 2000-01-05 9 Mumbai 2000-01-05 12 Tokyo 2000-01-05 6
However, if we have to bring our data into form every time, we could be a little more effective, by creating a reshaped data frame first, having the dates as an index and the cities as columns.
We can create such a data frame with the pivot
function. The arguments are the index (we use date), the columns (we use the cities), and the values (which are stored in the value column of the original data frame):
>>> pv = df.pivot("date", "city", "value") >>> pv city date London Mexico Mumbai Tokyo 2000-01-03 6 3 12 5 2000-01-04 3 9 9 5 2000-01-05 4 8 8 6
We can use max
function on this new data frame directly:
>>> pv.max() city London 6 Mexico 9 Mumbai 12 Tokyo 6 dtype: int64
With a more suitable shape, other operations become easier as well. For example, to find the maximum temperature per day, we can simply provide an additional axis argument:
>>> pv.max(axis=1) date 2000-01-03 12 2000-01-04 9 2000-01-05 8 dtype: int64