<!DOCTYPE html PUBLIC ""
"">
<html><head><meta charset="UTF-8" /><title>tech.v3.libs.arrow documentation</title><script async="true" src="https://www.googletagmanager.com/gtag/js?id=G-RGTB4J7LGP"></script><script>window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-95TVFC1FEB');</script><link rel="stylesheet" type="text/css" href="css/default.css" /><link rel="stylesheet" type="text/css" href="highlight/solarized-light.css" /><script type="text/javascript" src="highlight/highlight.min.js"></script><script type="text/javascript" src="js/jquery.min.js"></script><script type="text/javascript" src="js/page_effects.js"></script><script>hljs.initHighlightingOnLoad();</script></head><body><div id="header"><h2>Generated by <a href="https://github.com/weavejester/codox">Codox</a> with <a href="https://github.com/xsc/codox-theme-rdash">RDash UI</a> theme</h2><h1><a href="index.html"><span class="project-title"><span class="project-name">TMD</span> <span class="project-version">7.012</span></span></a></h1></div><div class="sidebar primary"><h3 class="no-link"><span class="inner">Project</span></h3><ul class="index-link"><li class="depth-1 "><a href="index.html"><div class="inner">Index</div></a></li></ul><h3 class="no-link"><span class="inner">Topics</span></h3><ul><li class="depth-1 "><a href="000-getting-started.html"><div class="inner"><span>tech.ml.dataset Getting Started</span></div></a></li><li class="depth-1 "><a href="100-walkthrough.html"><div class="inner"><span>tech.ml.dataset Walkthrough</span></div></a></li><li class="depth-1 "><a href="200-quick-reference.html"><div class="inner"><span>tech.ml.dataset Quick Reference</span></div></a></li><li class="depth-1 "><a href="columns-readers-and-datatypes.html"><div class="inner"><span>tech.ml.dataset Columns, Readers, and Datatypes</span></div></a></li><li class="depth-1 "><a href="nippy-serialization-rocks.html"><div class="inner"><span>tech.ml.dataset And nippy</span></div></a></li><li class="depth-1 "><a href="supported-datatypes.html"><div class="inner"><span>tech.ml.dataset Supported Datatypes</span></div></a></li></ul><h3 class="no-link"><span 
class="inner">Namespaces</span></h3><ul><li class="depth-1"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tech</span></div></div></li><li class="depth-2"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>v3</span></div></div></li><li class="depth-3"><a href="tech.v3.dataset.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>dataset</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.categorical.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>categorical</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.clipboard.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>clipboard</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.column-filters.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>column-filters</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>io</span></div></div></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.csv.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>csv</span></div></a></li><li class="depth-5 branch"><a href="tech.v3.dataset.io.datetime.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>datetime</span></div></a></li><li class="depth-5 branch"><a 
href="tech.v3.dataset.io.string-row-parser.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>string-row-parser</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.io.univocity.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>univocity</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.join.html"><div class="inner"><span class="tree" style="top: -145px;"><span class="top" style="height: 154px;"></span><span class="bottom"></span></span><span>join</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.math.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>math</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.metamorph.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>metamorph</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.modelling.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>modelling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.neanderthal.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>neanderthal</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.print.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>print</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.reductions.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>reductions</span></div></a></li><li class="depth-5"><a href="tech.v3.dataset.reductions.apache-data-sketch.html"><div class="inner"><span class="tree"><span class="top"></span><span 
class="bottom"></span></span><span>apache-data-sketch</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.rolling.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>rolling</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.set.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>set</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.dataset.tensor.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>tensor</span></div></a></li><li class="depth-4"><a href="tech.v3.dataset.zip.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>zip</span></div></a></li><li class="depth-3"><div class="no-link"><div class="inner"><span class="tree" style="top: -672px;"><span class="top" style="height: 681px;"></span><span class="bottom"></span></span><span>libs</span></div></div></li><li class="depth-4 branch current"><a href="tech.v3.libs.arrow.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>arrow</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.fastexcel.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>fastexcel</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>guava</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.guava.cache.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>cache</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.parquet.html"><div class="inner"><span class="tree" style="top: -52px;"><span 
class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>parquet</span></div></a></li><li class="depth-4 branch"><a href="tech.v3.libs.poi.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>poi</span></div></a></li><li class="depth-4"><div class="no-link"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>smile</span></div></div></li><li class="depth-5"><a href="tech.v3.libs.smile.data.html"><div class="inner"><span class="tree"><span class="top"></span><span class="bottom"></span></span><span>data</span></div></a></li><li class="depth-4"><a href="tech.v3.libs.tribuo.html"><div class="inner"><span class="tree" style="top: -52px;"><span class="top" style="height: 61px;"></span><span class="bottom"></span></span><span>tribuo</span></div></a></li></ul></div><div class="sidebar secondary"><h3><a href="#top"><span class="inner">Public Vars</span></a></h3><ul><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-dataset-.3Estream.21"><div class="inner"><span>dataset->stream!</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-dataset-seq-.3Estream.21"><div class="inner"><span>dataset-seq->stream!</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-stream-.3Edataset"><div class="inner"><span>stream->dataset</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-stream-.3Edataset-seq"><div class="inner"><span>stream->dataset-seq</span></div></a></li><li class="depth-1"><a href="tech.v3.libs.arrow.html#var-type-atom"><div class="inner"><span>type-atom</span></div></a></li></ul></div><div class="namespace-docs" id="content"><h1 class="anchor" id="top">tech.v3.libs.arrow</h1><div class="doc"><div class="markdown"><p>Support for reading/writing apache arrow datasets. Datasets may be memory mapped
but default to being read via an input stream.</p>
<p>Supported datatypes:</p>
<ul>
<li>All numeric types - <code>:uint8</code>, <code>:int8</code>, <code>:uint16</code>, <code>:int16</code>, <code>:uint32</code>, <code>:int32</code>,
<code>:uint64</code>, <code>:int64</code>, <code>:float32</code>, <code>:float64</code>, <code>:boolean</code>.</li>
<li>String types - <code>:string</code>, <code>:text</code>. During write you have the option to always write
data as text, which can be more efficient in the memory-mapped read case as it doesn't
require the creation of string tables at load time.</li>
<li>Datetime types - <code>:local-date</code>, <code>:local-time</code>, <code>:instant</code>. During read you have the
option to keep these types in their source numeric format, e.g. 32-bit <code>:epoch-days</code>
for <code>:local-date</code> datatypes. This format can make some types of processing, such as
set creation, more efficient.</li>
</ul>
<p>Writing a single dataset creates an arrow file with a single record set. When
writing a sequence of datasets, the schema of each later dataset must be compatible with the schema
of the first dataset: for instance, a conversion of int32 to double is fine, but
double to int32 is not.</p>
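<p>A minimal sketch of the widening rule, assuming <code>tech.v3.dataset</code> is required as <code>ds</code> and that the hypothetical path <code>widening.arrow</code> is writable. The first dataset fixes the schema with a <code>:float64</code> column, and the later dataset supplies <code>:int32</code> data that should be widened to it:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

;; The first dataset sets the schema: column :x is :float64.
(def first-batch (ds/->dataset {:x (double-array [1.5 2.5])}))
;; A later dataset with an :int32 column can be widened to :float64.
(def later-batch (ds/->dataset {:x (int-array [1 2 3])}))

;; "widening.arrow" is a hypothetical output path.
(arrow/dataset-seq->stream! "widening.arrow" [first-batch later-batch])
;; Reversing the order (int32 first, double second) is the unsupported direction.
</code></pre>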
<p>mmap support on systems running JDK-17 requires the foreign or memory module to be
loaded. Appropriate JVM arguments can be found
<a href="https://github.com/techascent/tech.ml.dataset/blob/0524ddd5bbcb9421a0f11290ec8a01b7795dcff9/project.clj#L69">here</a>.</p>
<h2>Required Dependencies</h2>
<p>In order to support both memory mapping and JDK-17, we only rely on the Arrow SDK's
flatbuffer and schema definitions:</p>
<pre><code class="language-clojure"> [org.apache.arrow/arrow-vector "6.0.0"]
[com.cnuernber/jarrow "1.000"]
[org.apache.commons/commons-compress "1.21"]
;;Compression codecs
[org.lz4/lz4-java "1.8.0"]
;;Required for decompressing lz4 streams with dependent blocks.
[net.java.dev.jna/jna "5.10.0"]
[com.github.luben/zstd-jni "1.5.4-1"]
</code></pre>
<p>The lz4 decompression system will fall back to lz4-java if liblz4 isn't installed or if
jna isn't loaded. The lz4-java library will fail on arrow files that use dependent
block compression, which the python and R arrow implementations sometimes write.
On current Ubuntu, install the lz4 library with:</p>
<pre><code class="language-console"> sudo apt install liblz4-1
</code></pre>
</div></div><div class="public anchor" id="var-dataset-.3Estream.21"><h3>dataset->stream!</h3><div class="usage"><code>(dataset->stream! ds path options)</code><code>(dataset->stream! ds path)</code></div><div class="doc"><div class="markdown"><p>Write a dataset as an arrow file. File will contain one record set.
See documentation for <a href="tech.v3.libs.arrow.html#var-dataset-seq-.3Estream.21">dataset-seq->stream!</a>.</p>
<ul>
<li><code>:strings-as-text?</code> defaults to false.</li>
</ul>
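<p>A short round-trip sketch, assuming <code>tech.v3.dataset</code> is required as <code>ds</code>. The dataset contents and the path <code>example.arrow</code> are hypothetical and chosen only for illustration:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

(def example-ds (ds/->dataset {:id [1 2 3]
                               :label ["a" "b" "c"]}))

;; "example.arrow" is a hypothetical path; the options map is optional.
(arrow/dataset->stream! example-ds "example.arrow" {:strings-as-text? true})

;; Read the single record set back and check that no rows were lost.
(def loaded (arrow/stream->dataset "example.arrow"))
(= (ds/row-count example-ds) (ds/row-count loaded))
</code></pre>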
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L2149">view source</a></div></div><div class="public anchor" id="var-dataset-seq-.3Estream.21"><h3>dataset-seq->stream!</h3><div class="usage"><code>(dataset-seq->stream! path options ds-seq)</code><code>(dataset-seq->stream! path ds-seq)</code></div><div class="doc"><div class="markdown"><p>Write a sequence of datasets as an arrow stream file. File will contain one record set
per dataset. Datasets in the sequence must have matching schemas, or each downstream schema
must be safely widenable to the schema of the first dataset.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:strings-as-text?</code> - defaults to true - Save out strings into arrow files without
dictionaries. This works well if you want to load an arrow file in-place or if
you know the strings in your dataset are either really large or should not be in
string tables. <strong>Saving multiple datasets with <code>{:strings-as-text? false}</code> requires arrow
7.0.0+ support from your python or R code due to
<a href="https://issues.apache.org/jira/browse/ARROW-13467">Arrow issue 13467</a>. The conservative
pathway for now is to set <code>:strings-as-text?</code> to true and only save text.</strong></p>
</li>
<li>
<p><code>:format</code> - one of <code>[:file :ipc]</code>, defaults to <code>:file</code>.</p>
<ul>
<li><code>:file</code> - arrow file format, compatible with pyarrow's <a href="https://arrow.apache.org/docs/python/generated/pyarrow.ipc.open_file.html#pyarrow.ipc.open_file">open_file</a>. The suggested
suffix is <code>.arrow</code>.</li>
<li><code>:ipc</code> - arrow streaming format, compatible with pyarrow's <a href="https://arrow.apache.org/docs/python/generated/pyarrow.ipc.open_stream.html#pyarrow.ipc.open_stream">open_stream</a> pathway. The
suggested suffix is <code>.arrows</code>.</li>
</ul>
</li>
<li>
<p><code>:compression</code> - Either <code>:zstd</code> or <code>:lz4</code>, defaults to no compression (nil).
Per-column compression of the data can result in significant size savings
(2x+) and thus significant time savings when loading over the network.
Using compression makes loading via mmap non-lazy, so if you are going to use
compression, mmap probably doesn't make sense and will most likely result in
slower loading times.</p>
<ul>
<li><code>:lz4</code> - Decent and very fast compression.</li>
<li><code>:zstd</code> - Good compression, somewhat slower than <code>:lz4</code>. Can also take a
level parameter that ranges from 1-12, in which case compression is specified
in map form: <code>{:compression-type :zstd :level 5}</code>.</li>
</ul>
</li>
</ul>
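<p>The options above might be combined as in the following sketch. It assumes <code>tech.v3.dataset</code> is required as <code>ds</code> and that zstd-jni is on the classpath. The batch contents and the path <code>batches.arrows</code> are hypothetical:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

;; Two datasets with matching schemas; each becomes one record set.
(def batches
  [(ds/->dataset {:id [1 2 3] :label ["a" "b" "c"]})
   (ds/->dataset {:id [4 5 6] :label ["d" "e" "f"]})])

;; Streaming format with zstd at level 5, strings saved as text.
;; "batches.arrows" is a hypothetical path using the suggested suffix.
(arrow/dataset-seq->stream! "batches.arrows"
                            {:format :ipc
                             :compression {:compression-type :zstd :level 5}
                             :strings-as-text? true}
                            batches)
</code></pre>
<p>Passing <code>:compression :lz4</code> instead selects the keyword form described above, which carries no level parameter.</p>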
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L2072">view source</a></div></div><div class="public anchor" id="var-stream-.3Edataset"><h3>stream->dataset</h3><div class="usage"><code>(stream->dataset fname options)</code><code>(stream->dataset fname)</code></div><div class="doc"><div class="markdown"><p>Reads data non-lazily in arrow streaming format expecting to find a single dataset.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:open-type</code> - Either <code>:mmap</code> or <code>:input-stream</code>, defaulting to the slower but more robust
<code>:input-stream</code> pathway. When using <code>:mmap</code>, resources will be released when the resource
system dictates - see documentation for <a href="https://techascent.github.io/tech.resource/tech.v3.resource.html">tech.v3.resource</a>.
When using <code>:input-stream</code>, the stream will be closed when the lazy sequence is either fully realized or an
exception is thrown. Memory mapping is not supported on M1 Macs unless you are using JDK-17.</p>
</li>
<li>
<p><code>:close-input-stream?</code> - When using the <code>:input-stream</code> <code>:open-type</code>, close the input stream upon
exception or when the stream is fully realized. Defaults to true.</p>
</li>
<li>
<p><code>:integer-datetime-types?</code> - when true, arrow columns in the appropriate packed
datatypes will be represented as their integer types as opposed to their respective
packed types. For example, columns of type <code>:epoch-days</code> will be returned to the user
as datatype <code>:epoch-days</code> as opposed to <code>:packed-local-date</code>. This means reading values
will return integers as opposed to <code>java.time.LocalDate</code>s.</p>
</li>
</ul>
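<p>A sketch of how the read options might be used, assuming <code>tech.v3.dataset</code> is required as <code>ds</code>. The path <code>dates.arrow</code> and the column contents are hypothetical. The same file is read twice so that the column datatypes reported by each read can be compared:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

;; Write a small dataset with a date column; "dates.arrow" is a hypothetical path.
(def dates-ds
  (ds/->dataset {:day [(java.time.LocalDate/of 2024 1 1)
                       (java.time.LocalDate/of 2024 1 2)]}))
(arrow/dataset->stream! dates-ds "dates.arrow")

;; Default read through the :input-stream pathway.
(def as-dates (arrow/stream->dataset "dates.arrow"))
;; Same file, keeping datetime columns in their integer representation.
(def as-ints (arrow/stream->dataset "dates.arrow" {:integer-datetime-types? true}))

;; Compare the column datatypes reported by the two reads.
(map (comp :datatype meta) (ds/columns as-dates))
(map (comp :datatype meta) (ds/columns as-ints))

;; Memory-mapped read; on JDK-17 this needs the JVM arguments noted in the
;; namespace documentation, so it is left commented out here.
;; (arrow/stream->dataset "dates.arrow" {:open-type :mmap})
</code></pre>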
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L1859">view source</a></div></div><div class="public anchor" id="var-stream-.3Edataset-seq"><h3>stream->dataset-seq</h3><div class="usage"><code>(stream->dataset-seq fname & [options])</code></div><div class="doc"><div class="markdown"><p>Loads data up to and including the first data record. Returns a lazy
sequence of datasets. Datasets can be loaded using mmapped data, and when that is true
realizing the entire sequence is usually safe, even for datasets that are larger than
available RAM.
The default resource management pathway for this is <code>:auto</code>, but you can override this
by explicitly setting the option <code>:resource-type</code>. See documentation for
<code>tech.v3.datatype.mmap/mmap-file</code>.</p>
<p>Options:</p>
<ul>
<li>
<p><code>:open-type</code> - Either <code>:mmap</code> or <code>:input-stream</code>, defaulting to the slower but more robust
<code>:input-stream</code> pathway. When using <code>:mmap</code>, resources will be released when the resource
system dictates - see documentation for <a href="https://techascent.github.io/tech.resource/tech.v3.resource.html">tech.v3.resource</a>.
When using <code>:input-stream</code>, the stream will be closed when the lazy sequence is either
fully realized or an exception is thrown.</p>
</li>
<li>
<p><code>:close-input-stream?</code> - When using the <code>:input-stream</code> <code>:open-type</code>, close the input stream upon
exception or when the stream is fully realized. Defaults to true.</p>
</li>
<li>
<p><code>:integer-datetime-types?</code> - when true, arrow columns in the appropriate packed
datatypes will be represented as their integer types as opposed to their respective
packed types. For example, columns of type <code>:epoch-days</code> will be returned to the user
as datatype <code>:epoch-days</code> as opposed to <code>:packed-local-date</code>. This means reading values
will return integers as opposed to <code>java.time.LocalDate</code>s.</p>
</li>
</ul>
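<p>A minimal sketch of consuming the lazy sequence one dataset at a time, assuming <code>tech.v3.dataset</code> is required as <code>ds</code>. The path <code>seq-example.arrow</code> and the data are hypothetical:</p>
<pre><code class="language-clojure">(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.arrow :as arrow])

;; Write two record sets; "seq-example.arrow" is a hypothetical path.
(arrow/dataset-seq->stream! "seq-example.arrow"
                            [(ds/->dataset {:id [1 2 3]})
                             (ds/->dataset {:id [4 5]})])

;; Each element of the returned sequence is one dataset (one record set).
;; Reducing over the sequence realizes it fully, which also lets the
;; :input-stream pathway close its stream.
(def total-rows
  (reduce + 0 (map ds/row-count
                   (arrow/stream->dataset-seq "seq-example.arrow"))))
</code></pre>
<p>Reducing over the sequence rather than holding on to its head keeps only one dataset reachable at a time, which suits the larger-than-RAM case described above.</p>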
</div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L1810">view source</a></div></div><div class="public anchor" id="var-type-atom"><h3>type-atom</h3><div class="usage"></div><div class="doc"><div class="markdown"></div></div><div class="src-link"><a href="https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/libs/arrow.clj#L335">view source</a></div></div></div></body></html>