<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.ml.dataset Getting Started</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">7.009</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 current"><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span 
class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a 
href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.neanderthal.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>neanderthal</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -672px;"><span class="top" style="height: 681px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span 
class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>smile</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.smile.data.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>data</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="document" id="content"><div class="doc"><div class="markdown"><h1>tech.ml.dataset Getting Started</h1>
<h2>What kind of data?</h2>
<p>TMD processes <em>tabular</em> data, that is, data logically arranged in rows and columns. Similar to a spreadsheet (but handling much larger datasets) or a database (but much more convenient), TMD accelerates exploring, cleaning, and processing data tables. TMD inherits Clojure's data orientation and flexible dynamic typing without compromising on being <em>functional</em>, thereby extending the language's reach to new problems and domains.</p>
<pre><code class="language-clojure">> (require '[tech.v3.dataset :as ds])
nil
> (ds/->dataset "lucy.csv")
lucy.csv [3 3]:
| name  | age | likes |
|-------|----:|-------|
| fred  |  42 | pizza |
| ethel |  42 | sushi |
| sally |  21 | opera |
</code></pre>
<h2>Reading and writing datasets</h2>
<p>TMD can read datasets from many common formats (e.g., csv, tsv, xls, xlsx, json, parquet, arrow, ...). When given a file path, the <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var--.3Edataset">->dataset</a> function can often detect the format automatically by the file extension and obtain the dataset. The same function can make datasets from other sources, such as sequences of Clojure maps in memory, or (again with broad format support) data downloaded from the internet.</p>
<p>For output, the <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-rows">rows</a> function gives the dataset as a sequence of maps, and the <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21">write!</a> function can be used to serialize any dataset into any supported format.</p>
<pre><code class="language-clojure">> (ds/->dataset [{:name "fred"
                  :age 42
                  :likes "pizza"}
                 {:name "ethel"
                  :age 42
                  :likes "sushi"}
                 {:name "sally"
                  :age 21
                  :likes "opera"}])
_unnamed [3 3]:
| :name | :age | :likes |
|-------|-----:|--------|
| fred  |   42 | pizza  |
| ethel |   42 | sushi  |
| sally |   21 | opera  |
</code></pre>
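<p>As a minimal sketch of the output side, the snippet below round-trips a small dataset through <code>rows</code> and <code>write!</code>. The <code>people</code> and <code>reloaded</code> vars and the <code>people.csv</code> path are illustrative assumptions, not names defined by TMD; <code>:key-fn keyword</code> is passed on re-reading so that the column names come back as keywords rather than strings.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data, mirroring the example above.
(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; rows yields one map per row, keyed by column name.
(:name (first (ds/rows people)))
;; => "fred"

;; write! chooses the output format from the file extension;
;; "people.csv" is a hypothetical path in the working directory.
(ds/write! people "people.csv")

;; Read the file back, converting the string headers to keywords.
(def reloaded (ds/->dataset "people.csv" {:key-fn keyword}))

(ds/row-count reloaded)
;; => 3
</code></pre>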
<h2>Filtering data</h2>
<p>TMD datasets are logically <em>maps</em> of column name to column data; this means that (for example) Clojure's <code>dissoc</code> can be used to remove a column. Datasets can also be filtered row-wise, by predicates of a single column or of entire rows. This is similar to Clojure's <code>filter</code> function, but can operate much more efficiently by exploiting tabular structure.</p>
<pre><code class="language-clojure">> (-> (ds/->dataset "lucy.csv")
      (dissoc "likes"))
lucy.csv [3 2]:
| name  | age |
|-------|----:|
| fred  |  42 |
| ethel |  42 |
| sally |  21 |
</code></pre>
<pre><code class="language-clojure">> (-> (ds/->dataset "lucy.csv")
      (ds/filter-column "age" #(> % 30)))
lucy.csv [2 3]:
| name  | age | likes |
|-------|----:|-------|
| fred  |  42 | pizza |
| ethel |  42 | sushi |
</code></pre>
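<p>Filtering by a predicate of entire rows can be sketched with <code>ds/filter</code>, which passes each row to the predicate as a map. The sample data and the <code>pizza-over-30</code> name below are assumptions made for illustration, with keyword column names instead of the string names read from the CSV file.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data with keyword column names.
(def people
  (ds/->dataset [{:name "fred"  :age 42 :likes "pizza"}
                 {:name "ethel" :age 42 :likes "sushi"}
                 {:name "sally" :age 21 :likes "opera"}]))

;; The predicate sees a whole row, so it can combine several columns.
(def pizza-over-30
  (ds/filter people
             (fn [row]
               (and (> (:age row) 30)
                    (= "pizza" (:likes row))))))

(ds/row-count pizza-over-30)
;; => 1 (only fred is over 30 and likes pizza)
</code></pre>
<p>When the test involves a single column, <code>filter-column</code> as shown above is the more direct choice, since it does not need to build a map for every row.</p>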
<h2>Adding data</h2>
<p>The powerful <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-row-map">row-map</a> function can be used to create or update columns that derive from data already in the dataset. Adding rows is typically accomplished by concatenating two (or more) datasets. The <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.functional.html">functional</a> namespace provides convenient functions for operating on scalar, element-wise, or columnar data.</p>
<pre><code class="language-clojure">> (-> (ds/->dataset "lucy.csv")
      (ds/row-map (fn [{:strs [age]}]
                    {"half-age" (/ age 2.0)})))
lucy.csv [3 4]:
| name  | age | likes | half-age |
|-------|----:|-------|---------:|
| fred  |  42 | pizza |     21.0 |
| ethel |  42 | sushi |     21.0 |
| sally |  21 | opera |     10.5 |
</code></pre>
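<p>The same derived column can also be computed column-wise. The sketch below assumes the <code>dfn</code> alias for <code>tech.v3.datatype.functional</code> and a hypothetical <code>people</code> dataset; it divides the whole <code>:age</code> column at once and attaches the result with Clojure's <code>assoc</code>, relying on the fact that a dataset behaves as a map of column name to column data.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; Hypothetical sample data with keyword column names.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

;; dfn// divides every element of the :age column by 2.0;
;; assoc adds the result to the dataset as a new column.
(def with-half
  (assoc people :half-age (dfn// (:age people) 2.0)))

(vec (:half-age with-half))
;; => [21.0 21.0 10.5]
</code></pre>
<p>Compared with <code>row-map</code>, this form never materializes per-row maps, which tends to matter more as the row count grows.</p>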
<h2>Statistics</h2>
<p>TMD has tools for calculating summary statistics on datasets. The <a href="https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-descriptive-stats">descriptive-stats</a> function produces a dataset of summary statistics for each column in the input dataset. This is well suited to initial exploration and, because the result is itself a dataset, to further analysis or processing. Broad support for further columnar statistical analysis is provided by the <a href="https://cnuernber.github.io/dtype-next/tech.v3.datatype.statistics.html">statistics</a> namespace.</p>
<pre><code class="language-clojure">> (-> (ds/->dataset "lucy.csv")
      (ds/row-map (fn [{:strs [age]}]
                    {"half-age" (/ age 2.0)}))
      (ds/descriptive-stats {:stat-names [:col-name :datatype :min :mean :max :standard-deviation]}))
lucy.csv: descriptive-stats [4 6]:
| :col-name | :datatype | :min | :mean | :max | :standard-deviation |
|-----------|-----------|-----:|------:|-----:|--------------------:|
| name      | :string   |      |       |      |                     |
| age       | :int16    | 21.0 |  35.0 | 42.0 |         12.12435565 |
| likes     | :string   |      |       |      |                     |
| half-age  | :float64  | 10.5 |  17.5 | 21.0 |          6.06217783 |
</code></pre>
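<p>For a single statistic on a single column, the statistics namespace can be called directly. This is a small sketch assuming the <code>stats</code> alias and a hypothetical <code>people</code> dataset; the values correspond to the <code>age</code> row of the table above.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.statistics :as stats])

;; Hypothetical sample data with keyword column names.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

;; A column can be passed directly to the statistics functions.
(stats/mean (:age people))
;; => 35.0

;; The standard deviation of the same column (about 12.12).
(stats/standard-deviation (:age people))
</code></pre>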
<h2>Grouping</h2>
<p>Like a Clojure sequence, a dataset can be grouped. The <a href="https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-group-by">group-by</a> function returns a <em>map</em> from each grouping value to a dataset containing the rows that produced that value. The related <a href="https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-group-by-.3Eindexes">group-by->indexes</a> function returns a map from each value to the row indexes of the input dataset; working with indexes can be more efficient than constructing concrete grouped datasets.</p>
<pre><code class="language-clojure">> (-> (ds/->dataset "lucy.csv")
      (ds/group-by #(if (> (get % "age") 30) :old :not-old)))
{:old lucy.csv [2 3]:
| name  | age | likes |
|-------|----:|-------|
| fred  |  42 | pizza |
| ethel |  42 | sushi |
, :not-old lucy.csv [1 3]:
| name  | age | likes |
|-------|----:|-------|
| sally |  21 | opera |
}
</code></pre>
<pre><code class="language-clojure">> (-> (ds/->dataset "lucy.csv")
      (ds/group-by->indexes #(if (> (get % "age") 30) :old :not-old)))
{:old [0 1], :not-old [2]}
</code></pre>
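<p>Because the grouped result is an ordinary Clojure map, it can be reduced with core functions. The sketch below uses <code>ds/group-by-column</code>, which groups by the values of one column, to count rows per age; the <code>people</code> dataset and the <code>counts</code> name are assumptions made for illustration.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data with keyword column names.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

;; Group by the :age column, then replace each grouped dataset
;; with its row count.
(def counts
  (->> (ds/group-by-column people :age)
       (map (fn [[age group]] [age (ds/row-count group)]))
       (into {})))

counts
;; => {42 2, 21 1}
</code></pre>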
<h2>Combining datasets</h2>
<p>Because datasets are <em>maps</em> of column name to column data, they can be combined column-wise using Clojure's <code>merge</code> function. The <a href="https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.html#var-concat">concat</a> function can be used for row-wise combination of two or more datasets. The <a href="https://techascent.github.io/tech.ml.dataset/docs/tech.v3.dataset.join.html">join</a> namespace provides database-like joins for aligning data from multiple datasets.</p>
<pre><code class="language-clojure">> (merge (ds/->dataset (for [i (range 3)] {"index" i}))
         (ds/->dataset "lucy.csv"))
_unnamed [3 4]:
| index | name  | age | likes |
|------:|-------|----:|-------|
|     0 | fred  |  42 | pizza |
|     1 | ethel |  42 | sushi |
|     2 | sally |  21 | opera |
</code></pre>
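<p>Row-wise combination and joins can be sketched in the same spirit. The datasets below (<code>people</code>, <code>newcomers</code>, <code>cities</code>) and the <code>ds-join</code> alias are assumptions made for illustration; only row counts are shown, since the naming of the joined columns is best checked against the join namespace documentation.</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical sample datasets sharing the same columns.
(def people
  (ds/->dataset [{:name "fred"  :age 42}
                 {:name "ethel" :age 42}
                 {:name "sally" :age 21}]))

(def newcomers
  (ds/->dataset [{:name "ricky" :age 35}]))

;; concat stacks the rows of datasets that have matching columns.
(def everyone (ds/concat people newcomers))

(ds/row-count everyone)
;; => 4

;; A second, hypothetical table keyed by the same :name column.
(def cities
  (ds/->dataset [{:name "fred"  :city "New York"}
                 {:name "sally" :city "Chicago"}]))

;; inner-join keeps only rows whose :name appears in both datasets.
(def joined (ds-join/inner-join :name people cities))

(ds/row-count joined)
;; => 2
</code></pre>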
<h2>Date, time, and other datatypes</h2>
<p>TMD knows about dates, times, instants, and many other types from the comprehensive <code>java.time</code> library. Working with these types can be much more convenient than dealing with them as strings, and datatypes are preserved throughout operations, so downstream tooling can avoid dealing with these data as strings as well.</p>
<p>In addition to <code>java.time</code> types, all Clojure types (e.g., keywords), UUIDs, as well as a comprehensive set of signed and unsigned numeric types of different widths are also transparently supported.</p>
<pre><code class="language-clojure">> (def ds (ds/->dataset [{:date "1981-03-10"}
                         {:date "1999-12-31"}]
                        {:parser-fn {:date :local-date}}))
#'ds
> (.until (first (:date ds))
          (last (:date ds)))
#object[java.time.Period 0x2d9a2c24 "P18Y9M21D"]
</code></pre>
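<p>Since parsed dates are real <code>java.time.LocalDate</code> values, derived columns can call <code>java.time</code> methods directly. A minimal sketch, assuming a hypothetical <code>events</code> dataset built the same way as above:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds])

;; Hypothetical sample data; :parser-fn turns the strings into local dates.
(def events
  (ds/->dataset [{:date "1981-03-10"}
                 {:date "1999-12-31"}]
                {:parser-fn {:date :local-date}}))

;; Each :date value reaches the function as a java.time.LocalDate,
;; so its accessor methods can be used without any string parsing.
(def with-year
  (ds/row-map events
              (fn [{:keys [date]}]
                {:year (.getYear ^java.time.LocalDate date)})))

(vec (:year with-year))
;; => [1981 1999]
</code></pre>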
<hr />
<h2>Further reading</h2>
<ul>
<li>
<p>The <a href="https://github.com/techascent/tech.ml.dataset#techmldataset">README</a> on GitHub covers installation and first steps with TMD.</p>
</li>
<li>
<p>The <a href="https://techascent.github.io/tech.ml.dataset/100-walkthrough.html">walkthrough</a> topic has long-form examples of processing real data with TMD.</p>
</li>
<li>
<p>The <a href="https://techascent.github.io/tech.ml.dataset/200-quick-reference.html">quick reference</a> summarizes many of the most frequently used functions with hints about their use.</p>
</li>
<li>
<p>The <a href="https://techascent.github.io/tech.ml.dataset/index.html">API docs</a> list every function available in TMD.</p>
</li>
</ul>
</div></div></div></body></html>