Analyzing and Visualizing Data with F#
by Tomas Petricek
Copyright © 2016 O’Reilly Media, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA
95472.
O’Reilly books may be purchased for educational, business, or sales promotional use.
Online editions are also available for most titles (http://safaribooksonline.com). For
more information, contact our corporate/institutional sales department:
800-998-9938 or corporate@oreilly.com .
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the author disclaim all responsibility for errors or omissions, including without
limitation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If
any code samples or other technology this work contains or describes is subject to
open source licenses or the intellectual property rights of others, it is your
responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-93953-6
[LSI]
Acknowledgements
This report would never exist without the amazing F# open source
community that creates and maintains many of the libraries used in
the report. It is impossible to list all the contributors, but let me say
thanks to Gustavo Guerra, Howard Mansell, and Taha Hachana for
their work on F# Data, R type provider, and XPlot, and to Steffen
Forkmann for his work on the projects that power much of the F#
open source infrastructure. Many thanks to companies that support
the F# projects, including Microsoft and BlueMountain Capital.
I would also like to thank Mathias Brandewinder, who wrote many
great examples using F# for machine learning and whose blog post
about clustering with F# inspired the example in Chapter 3. Last but
not least, I’m thankful to Brian MacDonald, Heather Scherer from
O’Reilly, and the technical reviewers for useful feedback on early
drafts of the report.
CHAPTER 1
Accessing Data with Type Providers
Working with data was not always as easy as it is today. For example,
processing the data from the decennial 1880 US Census took
eight years. For the 1890 census, the United States Census Bureau
hired Herman Hollerith, who invented a number of devices to automate
the process. A pantograph punch was used to punch the data
on punch cards, which were then fed to the tabulator that counted
cards with certain properties, or to the sorter for filtering. The census
still required a large amount of clerical work, but Hollerith’s
machines made the process eight times faster, cutting it to just one year.1
These days, filtering and calculating sums over hundreds of millions
of rows (the number of forms received in the 2010 US Census) can
take seconds. Much of the data from the US Census, various Open
Government Data initiatives, and from international organizations
like the World Bank is available online and can be analyzed by anyone.
Hollerith’s tabulator and sorter have become standard library
functions in many programming languages and data analytics libraries.
1 Hollerith’s company later merged with three other companies to form a company that
was renamed International Business Machines Corporation (IBM) in 1924. You can
find more about Hollerith’s machines in Mark Priestley’s excellent book, A Science of
Operations (Springer).
Making data analytics easier no longer involves building new physical
devices, but instead involves creating better software tools and
programming languages. So, let’s see how the F# language and its
unique features like type providers make the task of modern data
analysis even easier!
If you ask any data scientist, she’ll tell you that accessing data is the
most frustrating part of the workflow. You need to download CSV
files, figure out what columns contain what values, then determine
how missing values are represented and parse them. When calling
REST-based services, you need to understand the structure of the
returned JSON and extract the values you care about. As you’ll see
in this chapter, the data access part is largely simplified in F# thanks
to type providers that integrate external data sources directly into the
language.
Enough talking, let’s look at some code! To set the theme for this
chapter, let’s look at the forecasted temperatures around the world.
To do this, we combine data from two sources. We use the World
Bank2 to access information about countries, and we use the Open
Weather Map3 to get the forecasted temperature in all the capitals of
all the countries in the world.
3 See http://openweathermap.org/.
4 See http://fslab.org/FSharp.Data.
Once you have all the packages, you can replace the sample script
file with the following simple code snippet:
#load "packages/FsLab/FsLab.fsx"
open FSharp.Data
let wb = WorldBankData.GetDataContext()
The first line loads the FsLab.fsx file, which comes from the FsLab
package, and loads all the libraries that are a part of FsLab, so you do
not have to reference them one by one. The last line uses
GetDataContext to create an instance that we’ll need in the next step to
fetch some data.
The next step is to use the World Bank type provider to get some
data. Assuming everything is set up in your editor, you should be
able to type wb.Countries followed by . (a period) and get
auto-completion on the country names as shown in Figure 1-1. This is
not magic! The country names are just ordinary properties. The
trick is that they are generated on the fly by the type provider based
on the schema retrieved from the World Bank.
The first line calls the GetSample method to obtain the forecast
using the sample URL—in our case, the temperature in Prague in
getTomorrowTemp "Prague"
getTomorrowTemp "Cambridge,UK"
The Open Weather Map returns the JSON document with the same
structure for all cities. This means that we can use the Load method
to load data from a different URL, because it will still have the same
properties. Once we have the document, we call Seq.head to get the
forecast for the first day in the list.
As mentioned before, F# is statically typed, but we did not have to
write any type annotations for the getTomorrowTemp function. That’s
because the F# compiler is smart enough to infer that place has to
be a string (because we are appending it to another string) and that
To better understand the code, you can look at the type of the
worldTemps value that we are defining. This is printed in F# Interactive
when you run the code, and most F# editors also show a tooltip
when you place the mouse pointer over the identifier. The type of
the value is (string * float) list, which means that we get a list
of pairs with two elements: the first is a string (country name) and
the second is a floating-point number (temperature).5
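To make the shape of this type concrete, here is a minimal sketch that builds a value of the same type by hand. The name sampleTemps and the numbers in it are made-up placeholders for illustration, not data downloaded from the World Bank or Open Weather Map.

```fsharp
// Hypothetical sample values; the real code downloads these.
let sampleTemps : (string * float) list =
    [ ("Czech Republic", 21.5)
      ("United Kingdom", 17.0)
      ("Iceland", 9.5) ]

// fst and snd pick the name and the temperature out of each pair.
let names = sampleTemps |> List.map fst
let warmest = sampleTemps |> List.maxBy snd
```

Because every element is a pair, ordinary list functions such as List.map and List.maxBy work on it without any extra type annotations.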
After you run the code and download the temperatures, you’re ready
to plot the temperatures on a map. To do this, we use the XPlot
library, which is a lightweight F# wrapper for Google Charts:
open XPlot.GoogleCharts
Chart.Geo(worldTemps)
5 If you are coming from a C# background, you can also read this as
List<Tuple<string, double>> (the F# float type corresponds to the C# double).
worldTemps
|> Chart.Geo
|> Chart.WithOptions(Options(colorAxis=axis))
|> Chart.WithLabel "Temp"
Figure 1-2. Forecasted temperatures for tomorrow with label and
custom color scale
The resulting chart should look like the one in Figure 1-2. Just be
careful: if you are running the code in the winter, you might need to
tweak the scale!
CHAPTER 2
Analyzing Data Using F# and Deedle
1 See http://fslab.org/Deedle/.
open Deedle
open FSharp.Data
open XPlot.GoogleCharts
open XPlot.GoogleCharts.Deedle
There are two new things here. First, we need to reference the
System.Xml.Linq library, which is required by the XML type provider.
Next, we open the Deedle namespace together with extensions
that let us pass data from the Deedle series directly to XPlot for
visualization.
To call the service, we need to provide the per_page and date query
parameters. Those are specified as a list of pairs. The first parameter
has a constant value of "1000". The second parameter needs to be a
date range written as "2015:2015", so we use sprintf to format the
string.
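As a minimal sketch, the parameter list described here could be built as follows. The helper name queryFor is hypothetical and only serves the illustration; the parameter names and the "1000" value come from the text above.

```fsharp
// Build the query parameters for a given year as a list of (name, value) pairs.
// 'queryFor' is a hypothetical helper name used only in this sketch.
let queryFor (year: int) =
    [ "per_page", "1000"
      "date", sprintf "%d:%d" year year ]

let query2015 = queryFor 2015
```

Using sprintf with two %d placeholders keeps the year typed as an integer until the last moment, so passing a string by mistake is a compile-time error.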
The function then downloads the data using the Http.RequestString
helper, which takes the URL and a list of query parameters.
Then we use WorldData.Parse to read the data using our provided
type. We could also use WorldData.Load, but by using the Http
helper we do not have to concatenate the URL by hand (the helper is
also useful if you need to specify an HTTP method or provide
HTTP headers).
Next we define a helper function orNaN. This deserves some explanation.
The type provider correctly infers that data for some countries
may be missing and gives us option<decimal> as the value. This is a
high-precision decimal number wrapped in an option to indicate
that it may be missing. For convenience, we want to treat missing
values as nan. To do this, we first convert the value into float (if it is
available) using Option.map float value. Then we use defaultArg
to return either the value (if it is available) or nan (if it is not
available).
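The description above can be sketched in a couple of lines. This is a reconstruction from the prose, so the exact form of the original helper may differ; the two sample calls are made up for illustration.

```fsharp
// Convert an optional decimal into a float, using nan for missing values.
let orNaN (value: decimal option) =
    defaultArg (Option.map float value) nan

// Sample calls with made-up inputs.
let present = orNaN (Some 12.5M)
let missing = orNaN None
```

Note that nan is never equal to itself, so a check for a missing value has to use System.Double.IsNaN rather than the = operator.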
Finally, the last line creates a series with country names as keys and
the World Bank data as values. This is similar to what we did in the
As you can see in Figure 2-1, we got data for most countries of the
world, but not for all of them. The values range from -70% to
+1200%, but emissions in most countries are growing more
slowly. To see this, we specify a green color for -10%, yellow for 0%,
orange for +100%, red for +200%, and very dark red for +1200%.
In this example, we used Deedle to align two series with country
names as indices. This kind of operation comes up constantly when
combining data from multiple sources, no matter whether your keys
are product IDs, email addresses, or stock tickers. If you’re working
with a time series, Deedle offers even more. For example, for every
key from one time series, you can find a value from another series
whose key is the closest to the time of the value in the first series.
You can find a detailed overview in the Deedle page about working
with time series.
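To illustrate what aligning two series by key means, here is a rough sketch that uses the standard F# Map type instead of Deedle. The country keys and numbers are invented, and dropping keys that are missing from either side is a simplification of this sketch, not a description of how Deedle treats them.

```fsharp
// Two hypothetical "series" keyed by country name (made-up values).
let emissions2000 = Map.ofList [ "A", 10.0; "B", 20.0; "C", 5.0 ]
let emissions2010 = Map.ofList [ "A", 15.0; "B", 10.0; "D", 7.0 ]

// Keep only keys present in both maps and compute the change in percent.
let change =
    emissions2000
    |> Map.toList
    |> List.choose (fun (country, before) ->
        match Map.tryFind country emissions2010 with
        | Some after -> Some (country, (after - before) / before * 100.0)
        | None -> None)
```

With Deedle, the same calculation is written as plain arithmetic on two series, and the library performs the key lookup for you.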
let world =
frame [ for name, ind in codes ->
name, getData 2010 ind.IndicatorCode ]
The code snippet defines a list with pairs consisting of a short
indicator name and the code from the World Bank. You can run it and
see what the codes look like. Choosing an indicator from an
auto-complete list is much easier than finding it in the API
documentation!
The last line does all the actual work. It creates a list of key-value
pairs using a sequence expression [ ... ], but this time, the value
is a series with data for all countries. So, we create a list with an
indicator name and data series. This is then passed to the frame
function, which creates a data frame.
A data frame is a Deedle data structure that stores multiple series.
You can think of it as a table with multiple columns and rows (similar
to a data table or spreadsheet). When creating a data frame, Deedle
again makes sure that the values are correctly aligned based on
their keys.
Data frames are useful for interactive data exploration. When you
create a data frame, F# Interactive formats it nicely so you can get a
quick idea about the data. For example, in Table 2-1 you can see the
ranges of the values and which values are frequently missing.
Data frames are also useful for interoperability. You can easily save
data frames to CSV files. If you want to use F# for data access and
cleanup, but then load the data in another language or tool such as
R, Mathematica, or Python, data frames give you an easy way to do
that. However, if you are interested in calling R, this is even easier
with the F# R type provider.
R.plot(world)
If you are typing the code in your editor, you can use
auto-completion in two places. First, after typing RProvider and . (dot),
you can see a list with all available packages. Second, after typing R
and . (dot), you can see functions in all the packages you opened.
Also note that we are calling the R function with a Deedle data
frame as an argument. This is possible because the R provider
2 See http://fslab.org/RProvider.
let filled =
world
|> Frame.transpose
|> Frame.fillMissingUsing (fun _ ind -> avg.[ind])
let norm =
(filled - lo) / (hi - lo)
|> Frame.transpose
The normalization is done in three steps:
1. First, we use functions from the Stats module to get the smallest,
largest, and average values. When applied on a frame, the
functions return series with one number for each column, so we
get aggregates for all indicators.
2. Second, we fill the missing values. The fillMissingUsing operation
iterates over all columns and then fills the missing value
for each item in the column by calling the function we provide.
To use it, we first transpose the frame (to switch rows and columns).
Then fillMissingUsing iterates over all countries, gives
us the indicator name ind, and we look up the average value for
the indicator using avg.[ind]. We do not need the value of the
The fact that the explanation here is much longer than the code
shows just how much you can do with just a couple of lines of code
with F# and Deedle. The library provides functions for joining
frames, grouping, and aggregation, as well as windowing and sampling
(which are especially useful for time-indexed data). For more
information about the available functions, check out the documentation
for the Stats module and the documentation for the Frame
module on the Deedle website.
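The arithmetic behind the normalization can also be sketched without Deedle, on a single column held in a plain F# list. The values are made up, nan stands in for a missing value, and the names mirror the lo, hi, and avg series used in the Deedle snippet; this is an illustration of the calculation, not the report's implementation.

```fsharp
// One hypothetical indicator column; nan marks a missing value.
let column = [ 2.0; nan; 4.0; 6.0 ]

// Aggregates are computed over the known values only.
let known = column |> List.filter (fun v -> not (System.Double.IsNaN v))
let lo = List.min known
let hi = List.max known
let avg = List.average known

// Fill missing values with the average, then rescale to the range 0 to 1.
let normalized =
    column
    |> List.map (fun v -> if System.Double.IsNaN v then avg else v)
    |> List.map (fun v -> (v - lo) / (hi - lo))
```

Deedle applies the same idea to every column of the frame at once, which is why the frame version needs no explicit loop over indicators.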
To finish the chapter with an interesting visualization, let’s use the
normalized data to build a scatter plot that shows the correlation
between GDP and life expectancy. As suggested earlier, the growth is
not linear so we take the logarithm of GDP:
let gdp = log norm.["GDP"] |> Series.values
let life = norm.["Life"] |> Series.values
Conclusions
In this chapter, we looked at a more realistic case study of doing data
science with F#. We still used World Bank as our data source, but
this time we called it using the XML provider directly. This
demonstrates a general approach that would work with any REST-based
service.
Next, we looked at the data in two different ways. We used Deedle to
print a data frame showing the numerical values. This showed us
that some values are missing and that different indicators have very
different ranges, and we later normalized the values for further
processing. Next, we used the R type provider to get a quick overview of
correlations. Here, we really just scratched the surface of what is
CHAPTER 3
Implementing Machine Learning Algorithms
All of the analysis that we discussed so far in this report was manual.
We looked at some data, we had some idea what we wanted to find
or highlight, we transformed the data, and we built a visualization.
Machine learning aims to make the process more automated. In
general, machine learning is the process of building models
automatically from data. There are two basic kinds of algorithms. Supervised
algorithms learn to generalize from data with known answers, while
unsupervised algorithms automatically learn to model data without
known structure.
In this chapter, we implement a basic, unsupervised machine learning
algorithm called k-means clustering that automatically splits
inputs into a specified number of groups. We’ll use it to group
countries based on the indicators obtained in the previous chapter.
This chapter also shows the F# language from a different perspective.
So far, we did not need to implement any complicated logic and
mostly relied on existing libraries. In contrast, this chapter uses just
the standard F# library, and you’ll see a number of ways in which F#
makes it very easy to implement new algorithms. The primary way
is type inference, which lets us write efficient and correct code while
keeping it very short and readable.
How k-Means Clustering Works
The k-means clustering algorithm takes input data, together with
the number k that specifies how many clusters we want to obtain,
and automatically assigns the individual inputs to one of the clusters.
It is iterative, meaning that it runs in a loop until it reaches the
final result or a maximal number of steps.
The idea of the algorithm is that it creates a number of centroids that
represent the centers of the clusters. As it runs, it keeps adjusting the
centroids so that they better cluster the input data. It is an unsupervised
algorithm, which means that we do not need to know any
information about the clusters (say, sample inputs that belong
there).
To demonstrate how the algorithm works, we look at an example
that can be easily drawn in a diagram. Let’s say that we have a number
of points with X and Y coordinates and we want to group them
in clusters. Figure 3-1 shows the points (as circles) and current
centroids (as stars). Colors illustrate the current clustering that we are
trying to improve. This is very simple, but it is sufficient to get
started.
The example in Figure 3-1 shows the state before and after one
iteration of the loop. In “Before,” we randomly generated the location of
the centroids (shown as stars) and assigned all of the inputs to the
correct cluster (shown as different colors). In “After,” we see the new
state after running steps 3 and 2. In step 3, we move the green
centroid to the right (the leftmost green circle becomes blue), and we
move the orange centroid to the bottom and a bit to the left (the
rightmost blue circle becomes orange).
To run the algorithm, we do not need any classified samples, but we
do need two things. We need to be able to measure the distance (to
find the nearest centroid), and we need to be able to aggregate the
inputs (to calculate a new centroid). As we’ll see in “Writing a
Reusable Clustering Function,” this information will be nicely
reflected in the F# type information at the end of the chapter, so it’s
worth remembering.
Clustering 2D Points
Rather than getting directly to the full problem and clustering
countries, we start with a simpler example. Once we know that the code
works on the basic sample, we’ll turn it into a reusable F# function
and use it on the full data set.
Our sample data set consists of just six points. Assuming 0.0, 0.0
is the bottom left corner, we have two points in the bottom left, two
in the bottom right, and two in the top left corner:
let data =
[ (0.0, 1.0); (1.0, 1.0);
(10.0, 1.0); (13.0, 3.0);
(4.0, 10.0); (5.0, 8.0) ]
The notation [ ... ] is the list expression (which we’ve seen in
previous chapters), but this time we’re creating a list of explicitly given
tuples.
If you run the code in F# Interactive, you’ll see that the type of the
data value is list<float * float>,1 so the tuple float * float is
the type of individual input. As discussed before, we need the
distance and aggregate functions for the inputs:
let distance (x1, y1) (x2, y2) : float =
sqrt ((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2))
The distance function takes two points and produces a single number.
Note that in F#, function parameters are separated by spaces,
and so (x1, y1) is the first parameter and (x2, y2) is the second.
However, both parameters are bound to patterns that decompose the
tuple into individual components, and we get access to the X and Y
coordinates for both points. We also included the type annotation
specifying that the result is float. This is needed here because the
F# compiler would not know what numerical type we intend to use.
The body then simply calculates the distance between the two
points.
The aggregate function takes a list of inputs and calculates their
centers. This is done using the List.averageBy function, which
takes two arguments. The second argument is the input list, and the
first argument is a projection function that specifies what value
(from the input) should be averaged. The fst and snd functions
return the first and second element of a tuple, respectively, and this
averages the X and Y coordinates.
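The two functions can be tried out together on a couple of points. The aggregate definition below is a reconstruction from the description above (List.averageBy with fst and snd), so treat it as a sketch under that assumption; the sample inputs are made up.

```fsharp
// Euclidean distance between two 2D points (as defined earlier).
let distance (x1, y1) (x2, y2) : float =
    sqrt ((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2))

// Average the X and Y coordinates separately to get the center.
// The return annotation tells the compiler to average floats.
let aggregate points : float * float =
    (List.averageBy fst points, List.averageBy snd points)

// Sample calls with made-up inputs.
let center = aggregate [ (0.0, 1.0); (1.0, 1.0) ]
let d = distance (0.0, 0.0) (3.0, 4.0)
```

Keep in mind that List.averageBy throws an exception on an empty list, so aggregate must only be called on a cluster that contains at least one point.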
1 The F# compiler also reports this as (float * float) list, which is just a different
way of writing the same type.
let centroids =
let random = System.Random()
[ for i in 1 .. clusterCount ->
List.nth data (random.Next(data.Length)) ]
The code snippet uses the List.nth function to access the element
at the random offset (in F# 4.0, List.nth is deprecated, and you can
use the new List.item instead). We also define the random value as
part of the definition of centroids. This makes it accessible only
inside the definition of centroids, and we keep it local to the
initialization code.
Our logic here is not perfect, because we could accidentally pick the
same input twice and two clusters would fully overlap. This is
something we should improve in a proper implementation, but it works
well enough for our demo.
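One possible way to avoid picking the same input twice is to shuffle the distinct inputs and take the first few. This is only a sketch of one option, not the approach used in the report; it assumes F# 4.0 or later for List.distinct and List.truncate.

```fsharp
let data =
    [ (0.0, 1.0); (1.0, 1.0);
      (10.0, 1.0); (13.0, 3.0);
      (4.0, 10.0); (5.0, 8.0) ]

let clusterCount = 3

// Attach a random key to each distinct input, sort by the key,
// and keep the first 'clusterCount' inputs as starting centroids.
let centroids =
    let random = System.Random()
    data
    |> List.distinct
    |> List.map (fun point -> random.Next(), point)
    |> List.sortBy fst
    |> List.map snd
    |> List.truncate clusterCount
```

If the data contains fewer distinct inputs than clusterCount, List.truncate simply returns fewer centroids, which a fuller implementation would need to detect.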
The next step is to find the closest centroid for each input. To do
this, we write a function closest that takes all centroids and the
input we want to classify:
let closest centroids input =
centroids
|> List.mapi (fun i v -> i, v)
|> List.minBy (fun (_, cent) -> distance cent input)
|> fst
The function works in three steps that are composed in a sequence
using the pipeline |> operator that we’ve seen in the first chapter.
Here, we start with centroids, which is a list, and apply a number of
transformations on the list:
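The individual steps of the pipeline can be inspected one at a time, as in the following sketch. The centroid positions and the input point are made-up values, and the intermediate names indexed, nearest, and index are introduced only for this illustration.

```fsharp
let distance (x1, y1) (x2, y2) : float =
    sqrt ((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2))

// Hypothetical centroids and one input to classify.
let centroids = [ (0.0, 1.0); (10.0, 1.0); (4.0, 10.0) ]
let input = (5.0, 8.0)

// Step 1: pair every centroid with its index.
let indexed = centroids |> List.mapi (fun i v -> i, v)

// Step 2: pick the (index, centroid) pair nearest to the input.
let nearest = indexed |> List.minBy (fun (_, cent) -> distance cent input)

// Step 3: keep only the index of that centroid.
let index = fst nearest
```

Returning the index rather than the centroid itself is what later lets the cluster assignment be stored as a plain list of integers.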
2 If you are familiar with LINQ, then this is the Select extension method.
let assignment =
update (List.map (closest centroids) data)
The function first calculates new centroids. To do this, it iterates
over the centroid indices. For each centroid, it finds all items from
data that are currently assigned to the centroid. Here, we use
List.zip to create a list containing items from data together with
their assignments. We then use the aggregate function (defined
earlier) to calculate the center of the items.
Once we have new centroids, we calculate new assignments based
on the updated clusters (using List.map (closest centroids)
data, as in the previous section).
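Putting the pieces together, the whole loop might look like the following sketch. The body of update is reconstructed from the description above and may differ from the original code; the starting centroids are fixed (one input per corner) instead of random so that the run is repeatable.

```fsharp
let distance (x1, y1) (x2, y2) : float =
    sqrt ((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2))

let aggregate points : float * float =
    (List.averageBy fst points, List.averageBy snd points)

let closest centroids input =
    centroids
    |> List.mapi (fun i v -> i, v)
    |> List.minBy (fun (_, cent) -> distance cent input)
    |> fst

let data =
    [ (0.0, 1.0); (1.0, 1.0);
      (10.0, 1.0); (13.0, 3.0);
      (4.0, 10.0); (5.0, 8.0) ]

let clusterCount = 3

// Assumption of this sketch: fixed starting centroids, not random ones.
let centroids = [ data.[0]; data.[2]; data.[4] ]

// Recompute centroids and assignments until the assignments stop changing.
// Note: 'aggregate' fails on an empty cluster; this sketch does not handle that.
let rec update assignment =
    let centroids =
        [ for i in 0 .. clusterCount - 1 ->
            List.zip assignment data
            |> List.filter (fun (c, _) -> c = i)
            |> List.map snd
            |> aggregate ]
    let next = List.map (closest centroids) data
    if next = assignment then assignment else update next

let assignment =
    update (List.map (closest centroids) data)
```

The recursion stops as soon as one pass leaves every input in the same cluster; a fuller version would also cap the number of iterations.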
Even though we did all the work using an extremely simple special
case, we now have everything in place to turn the code into a reusable
function. This nicely shows the typical F# development process.
let centroids =
let rnd = System.Random()
[ for i in 1 .. clusterCount ->
List.nth data (rnd.Next(data.Length)) ]
Learning to read the type signatures takes some time, but it quickly
becomes an invaluable tool for every F# programmer. You can look at
the inferred type and verify whether it matches your intuition. In
the case of k-means clustering, the type signature matches the
introduction discussed earlier in “How k-Means Clustering Works.”
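To see the inference at work, here is a sketch of what a generic kmeans function could look like. It is not the report's exact code: in particular, it takes the first clusterCount inputs as starting centroids (an assumption made so that the result is repeatable), and it does not guard against empty clusters.

```fsharp
// Sketch of a reusable k-means function; 'distance' and 'aggregate'
// are supplied by the caller, so nothing here is specific to 2D points.
let kmeans distance aggregate clusterCount (data: 'a list) =
    // Assumption of this sketch: start from the first few inputs.
    let centroids = List.truncate clusterCount data

    let closest centroids input =
        centroids
        |> List.mapi (fun i v -> i, v)
        |> List.minBy (fun (_, cent) -> distance cent input)
        |> fst

    let rec update assignment =
        let centroids =
            [ for i in 0 .. clusterCount - 1 ->
                List.zip assignment data
                |> List.filter (fun (c, _) -> c = i)
                |> List.map snd
                |> aggregate ]
        let next = List.map (closest centroids) data
        if next = assignment then assignment else update next

    update (List.map (closest centroids) data)

// Usage with the 2D helpers from earlier in the chapter.
let distance (x1, y1) (x2, y2) : float =
    sqrt ((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2))

let aggregate points : float * float =
    (List.averageBy fst points, List.averageBy snd points)

let clusters =
    kmeans distance aggregate 3
        [ (0.0, 1.0); (1.0, 1.0);
          (10.0, 1.0); (13.0, 3.0);
          (4.0, 10.0); (5.0, 8.0) ]
```

For a definition like this, F# Interactive should report a signature along the lines of ('a -> 'a -> 'b) -> ('a list -> 'a) -> int -> 'a list -> int list, with a comparison constraint on 'b. In words: the function needs a way to measure the distance between two inputs and a way to aggregate a list of inputs into one, which are the two requirements identified at the start of the chapter.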
To experiment with the type inference, try removing one of the
parameters from the signature of the kmeans function. When you
Clustering Countries
Now that we have a reusable kmeans function, there is one step left:
run it on the information about the countries that we downloaded at
the end of the previous chapter. Recall that we previously defined
norm, which is a data frame of type Frame<string, string> that has
countries as rows and a number of indicators as columns. For calling
kmeans, we need a list of values, so we get the rows of the frame
(representing individual countries) and turn them into a list using
List.ofSeq:
let data =
norm.GetRows<float>().Values |> List.ofSeq
The distance function takes two series and uses the point-wise *
and - operators to calculate the squares of differences for each column,
then sums them to get a single distance metric. We need to
provide type annotations, written as (s1:Series<string,float>),
to tell the F# compiler that the parameter is a series and that it
should use the overloaded numerical operators provided by Deedle
(rather than treating them as operators on integers).
The aggregate takes a list of series (countries in a cluster) of type
list<Series<string,float>>. It should return the averaged value
that represents the center of the cluster. To do this, we use a simple
trick: we turn the series into a frame and then use Stats.mean from
Deedle to calculate averages over all columns of the frame. This
gives us a series where each indicator is the average of all input
indicators. Deedle also conveniently skips over missing values.
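For readers without Deedle at hand, the same two calculations can be sketched on plain float arrays, where each array holds the normalized indicators of one hypothetical country. This is an analogy only: unlike the Deedle version, it does not skip missing values and it assumes all arrays have the same length.

```fsharp
// Sum of squared differences across all indicators.
let distance (s1: float[]) (s2: float[]) =
    Array.map2 (fun a b -> (a - b) * (a - b)) s1 s2 |> Array.sum

// Column-wise mean over all countries in a cluster.
let aggregate (items: float[] list) =
    let count = float (List.length items)
    items
    |> List.reduce (fun a b -> Array.map2 (+) a b)
    |> Array.map (fun total -> total / count)

// Two made-up countries with two indicators each.
let d = distance [| 0.0; 1.0 |] [| 1.0; 0.5 |]
let center = aggregate [ [| 0.0; 1.0 |]; [| 1.0; 0.5 |] ]
```

Because both functions have the shapes that kmeans expects, either this pair or the Deedle pair can be passed in without changing the clustering code.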
Now we just need to call the kmeans function and draw a chart
showing the clusters:
let clrs = ColorAxis(colors=[|"red";"blue";"orange"|])
let countryClusters =
kmeans distance aggregate 3 data
The snippet is not showing anything new. We call kmeans with our
new data and the distance and aggregate functions. Then we
combine the country names (norm.RowKeys) with their cluster
assignments and draw a geo chart that uses red, blue, and orange for
the three clusters. The result is the map in Figure 3-2.
3 Available at http://www.m-brace.net/.
Conclusions
In this chapter, we completed our brief tour by using the F# language
to implement the k-means clustering algorithm. This illustrated
two aspects of F# that make it nice for writing algorithms:
4 Available at http://www.m-brace.net/programming-model.html.
5 Available at http://bit.ly/decisiontreeblog.
6 Available at http://evelinag.com/Ariadne/.
CHAPTER 4
Conclusions and Next Steps
This brief report shows just a few examples of what you can do with
F#, but we used it to demonstrate many of the key features of the
language that make it a great tool for data science and machine
learning. With type providers, you can elegantly access data. We
used the XPlot library for visualization, but F# also gives you access
to the ggplot2 package from R and numerous other tools. As for
analysis, we used the Deedle library and R type provider, but we also
implemented our own clustering algorithm.
for wrapping F# code as a web application or a web service;1 and so
you can expose the functionality as a simple REST service and host
it on Heroku, AWS, or Azure.