tech.ml.dataset Getting Started
+ gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Getting Started
What kind of data?
TMD processes tabular data, that is, data logically arranged in rows and columns. Similar to a spreadsheet (but handling much larger datasets) or a database (but much more convenient), TMD accelerates exploring, cleaning, and processing data tables. TMD inherits Clojure's data-orientation and flexible dynamic typing, without compromising on being functional; thereby extending the language's reach to new problems and domains.
> (ds/->dataset "lucy.csv")
diff --git a/docs/100-walkthrough.html b/docs/100-walkthrough.html
index 8151344f..4bf43e1c 100644
--- a/docs/100-walkthrough.html
+++ b/docs/100-walkthrough.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Walkthrough
+ gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Walkthrough
tech.ml.dataset
(TMD) is a Clojure library designed to ease working with tabular data, similar to data.table
in R or Python's Pandas. TMD takes inspiration from the design of those tools, but does not aim to copy their functionality. Instead, TMD is a building block that increases Clojure's already considerable data processing power.
High Level Design
In TMD, a dataset is logically a map of column name to column data. Column data is typed (e.g., a column of 16 bit integers, or a column of 64 bit floating point numbers), similar to a database. Column names may be any Java object - keywords and strings are typical - and column values may be any Java primitive type, or type supported by tech.datatype
, datetimes, or arbitrary objects. Column data is stored contiguously in JVM arrays, and missing values are indicated with bitsets.
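The map-of-columns model described above can be sketched at the REPL as follows. This is a minimal illustration, not taken from the walkthrough itself: it assumes `tech.ml.dataset` is on the classpath, and the name `example-ds` and its column values are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical example data: a dataset built from a map of
;; column name -> column data. The nil in :b is treated as a
;; missing value rather than stored as data.
(def example-ds
  (ds/->dataset {:a [1 2 3]
                 :b [10.5 nil 30.5]}))

(ds/column-names example-ds) ; names of the columns
(ds/row-count example-ds)    ; number of rows
(example-ds :a)              ; look up a column by name, as with a map
(ds/missing example-ds)      ; row indexes that contain a missing value
```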
diff --git a/docs/200-quick-reference.html b/docs/200-quick-reference.html
index 92ee04a0..89b6af59 100644
--- a/docs/200-quick-reference.html
+++ b/docs/200-quick-reference.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Quick Reference
+ gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Quick Reference
This topic summarizes many of the most frequently used TMD functions, together with some quick notes about their use. Functions here are linked to further documentation, or their source. Note, unless a namespace is specified, each function is accessible via the tech.ml.dataset
namespace.
For a more thorough treatment, the API docs list every available function.
Table of Contents
diff --git a/docs/columns-readers-and-datatypes.html b/docs/columns-readers-and-datatypes.html
index f345f3be..08cffe6c 100644
--- a/docs/columns-readers-and-datatypes.html
+++ b/docs/columns-readers-and-datatypes.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Columns, Readers, and Datatypes
+ gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Columns, Readers, and Datatypes
In tech.ml.dataset
, columns are composed of three things:
data, metadata, and the missing set.
The column's datatype is the datatype of the data
member. The data member can
diff --git a/docs/index.html b/docs/index.html
index f535bc2f..f7f6d837 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -1,10 +1,10 @@
-
TMD 7.033
A Clojure high performance data processing system.
Topics
- tech.ml.dataset Getting Started
- tech.ml.dataset Walkthrough
- tech.ml.dataset Quick Reference
- tech.ml.dataset Columns, Readers, and Datatypes
- tech.ml.dataset And nippy
- tech.ml.dataset Supported Datatypes
Namespaces
tech.v3.dataset
Column major dataset abstraction for efficiently manipulating
+ gtag('config', 'G-95TVFC1FEB');
TMD 7.034
A Clojure high performance data processing system.
Topics
- tech.ml.dataset Getting Started
- tech.ml.dataset Walkthrough
- tech.ml.dataset Quick Reference
- tech.ml.dataset Columns, Readers, and Datatypes
- tech.ml.dataset And nippy
- tech.ml.dataset Supported Datatypes
Namespaces
tech.v3.dataset
Column major dataset abstraction for efficiently manipulating
in memory datasets.
Public variables and functions:
- ->>dataset
- ->dataset
- add-column
- add-or-update-column
- all-descriptive-stats-names
- append-columns
- assoc-ds
- assoc-metadata
- bind->
- brief
- categorical->number
- categorical->one-hot
- column
- column->dataset
- column-cast
- column-count
- column-labeled-mapseq
- column-map
- column-map-m
- column-names
- columns
- columns-with-missing-seq
- columnwise-concat
- concat
- concat-copying
- concat-inplace
- data->dataset
- dataset->data
- dataset-name
- dataset-parser
- dataset?
- descriptive-stats
- drop-columns
- drop-missing
- drop-rows
- empty-dataset
- ensure-array-backed
- filter
- filter-column
- filter-dataset
- group-by
- group-by->indexes
- group-by-column
- group-by-column->indexes
- group-by-column-consumer
- has-column?
- head
- induction
- major-version
- mapseq-parser
- mapseq-reader
- mapseq-rf
- min-n-by-column
- missing
- new-column
- new-dataset
- order-column-names
- pmap-ds
- print-all
- rand-nth
- remove-column
- remove-columns
- remove-rows
- rename-columns
- replace-missing
- replace-missing-value
- reverse-rows
- row-at
- row-count
- row-map
- row-mapcat
- rows
- rowvec-at
- rowvecs
- sample
- select
- select-by-index
- select-columns
- select-columns-by-index
- select-missing
- select-rows
- set-dataset-name
- shape
- shuffle
- sort-by
- sort-by-column
- tail
- take-nth
- unique-by
- unique-by-column
- unordered-select
- unroll-column
- update
- update-column
- update-columns
- update-columnwise
- update-elemwise
- value-reader
- write!
tech.v3.dataset.categorical
Conversions of categorical values into numbers and back. Two forms of conversions
are supported, a straight value->integer map and one-hot encoding.
diff --git a/docs/nippy-serialization-rocks.html b/docs/nippy-serialization-rocks.html
index f6abefd0..c3158312 100644
--- a/docs/nippy-serialization-rocks.html
+++ b/docs/nippy-serialization-rocks.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.ml.dataset And nippy
+ gtag('config', 'G-95TVFC1FEB');tech.ml.dataset And nippy
We are big fans of the nippy system for
freezing/thawing data. So we were pleasantly surprised with how well it performs
with dataset and how easy it was to extend the dataset object to support nippy
diff --git a/docs/supported-datatypes.html b/docs/supported-datatypes.html
index 299a5f35..afbac180 100644
--- a/docs/supported-datatypes.html
+++ b/docs/supported-datatypes.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.ml.dataset Supported Datatypes
+ gtag('config', 'G-95TVFC1FEB');tech.ml.dataset Supported Datatypes
tech.ml.dataset
supports a wide range of datatypes and has a system for expanding
the supported datatype set, aliasing new names to existing datatypes, and packing
object datatypes into primitive containers. Let's walk through each of these topics
diff --git a/docs/tech.v3.dataset.categorical.html b/docs/tech.v3.dataset.categorical.html
index a313a279..6fa166d7 100644
--- a/docs/tech.v3.dataset.categorical.html
+++ b/docs/tech.v3.dataset.categorical.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.categorical
Conversions of categorical values into numbers and back. Two forms of conversions
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.categorical
Conversions of categorical values into numbers and back. Two forms of conversions
are supported, a straight value->integer map and one-hot encoding.
The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via clojure.core/meta
dataset->categorical-maps
(dataset->categorical-maps dataset)
Given a dataset, return a sequence of categorical map entries.
diff --git a/docs/tech.v3.dataset.clipboard.html b/docs/tech.v3.dataset.clipboard.html
index 5a52266d..4aab0b7a 100644
--- a/docs/tech.v3.dataset.clipboard.html
+++ b/docs/tech.v3.dataset.clipboard.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.clipboard
Optional namespace that copies a dataset to the clipboard for pasting into
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.clipboard
Optional namespace that copies a dataset to the clipboard for pasting into
applications such as excel or google sheets.
Reading defaults to 'csv' format while writing defaults to 'tsv' format.
clipboard
(clipboard)
Get the system clipboard.
diff --git a/docs/tech.v3.dataset.column-filters.html b/docs/tech.v3.dataset.column-filters.html
index 9a777349..7f8a9084 100644
--- a/docs/tech.v3.dataset.column-filters.html
+++ b/docs/tech.v3.dataset.column-filters.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.column-filters
Queries to select column subsets that have various properties such as all numeric
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.column-filters
Queries to select column subsets that have various properties such as all numeric
columns, all feature columns, or columns that have a specific datatype.
Further a few set operations (union, intersection, difference) are provided
to further manipulate subsets of columns.
diff --git a/docs/tech.v3.dataset.column.html b/docs/tech.v3.dataset.column.html
index 5780e84e..49f5a8a8 100644
--- a/docs/tech.v3.dataset.column.html
+++ b/docs/tech.v3.dataset.column.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.column
clone
(clone col)
Clone this column not changing anything.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.column
column-map
(column-map map-fn res-dtype & args)
Map a scalar function across one or more columns.
This is the semi-missing-set aware version of tech.v3.datatype/emap. This function
is never lazy.
diff --git a/docs/tech.v3.dataset.html b/docs/tech.v3.dataset.html
index 061b2717..424ec14c 100644
--- a/docs/tech.v3.dataset.html
+++ b/docs/tech.v3.dataset.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset
Column major dataset abstraction for efficiently manipulating
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset
Column major dataset abstraction for efficiently manipulating
in memory datasets.
->>dataset
(->>dataset options dataset)
(->>dataset dataset)
Please see documentation of ->dataset. Options are the same.
->dataset
(->dataset dataset options)
(->dataset dataset)
Create a dataset from either csv/tsv or a sequence of maps.
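As a rough sketch of the two argument orders listed above, the snippet below builds a dataset from a sequence of maps with `->dataset` and with the thread-last variant `->>dataset`. The data and var names are hypothetical, and `:dataset-name` is used here on the assumption that it is among the accepted options.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical example rows: a sequence of maps, one map per row.
(def score-rows [{:name "a" :score 1.5}
                 {:name "b" :score 2.5}])

;; Data first, options second.
(def named (ds/->dataset score-rows {:dataset-name "scores"}))

;; Options first, data last, which suits ->> pipelines.
(def named-threaded
  (->> score-rows
       (ds/->>dataset {:dataset-name "scores"})))

(ds/dataset-name named)
(ds/row-count named-threaded)
```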
diff --git a/docs/tech.v3.dataset.io.csv.html b/docs/tech.v3.dataset.io.csv.html
index 5538269e..17924733 100644
--- a/docs/tech.v3.dataset.io.csv.html
+++ b/docs/tech.v3.dataset.io.csv.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.io.csv
CSV parsing based on charred.api/read-csv.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.io.csv
CSV parsing based on charred.api/read-csv.
csv->dataset
(csv->dataset input & [options])
Read a csv into a dataset. Same options as tech.v3.dataset/->dataset.
csv->dataset-seq
(csv->dataset-seq input & [options])
Read a csv into a lazy sequence of datasets. All options of tech.v3.dataset/->dataset
are supported aside from :n-initial-skip-rows
with an additional option of
diff --git a/docs/tech.v3.dataset.io.datetime.html b/docs/tech.v3.dataset.io.datetime.html
index 11f36473..f8e418ba 100644
--- a/docs/tech.v3.dataset.io.datetime.html
+++ b/docs/tech.v3.dataset.io.datetime.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.io.datetime
Helpful and well tested string->datetime pathways.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.io.datetime
Helpful and well tested string->datetime pathways.
datetime-formatter-or-str->parser-fn
(datetime-formatter-or-str->parser-fn datatype format-string-or-formatter)
Given a datatype and one of fn? string? DateTimeFormatter,
return a function that takes strings and returns datetime objects
diff --git a/docs/tech.v3.dataset.io.string-row-parser.html b/docs/tech.v3.dataset.io.string-row-parser.html
index b4c540bf..d2c97cb4 100644
--- a/docs/tech.v3.dataset.io.string-row-parser.html
+++ b/docs/tech.v3.dataset.io.string-row-parser.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.io.string-row-parser
Parsing functions based on raw data that is represented by a sequence
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.io.string-row-parser
Parsing functions based on raw data that is represented by a sequence
of string arrays.
partition-all-rows
(partition-all-rows {:keys [header-row?], :or {header-row? true}} n row-seq)
Given a sequence of rows, partition into an undefined number of partitions of at most
N rows but keep the header row as the first for all sequences.
diff --git a/docs/tech.v3.dataset.io.univocity.html b/docs/tech.v3.dataset.io.univocity.html
index a77f8a93..357e0bdb 100644
--- a/docs/tech.v3.dataset.io.univocity.html
+++ b/docs/tech.v3.dataset.io.univocity.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.io.univocity
Bindings to univocity. Transforms csv's, tsv's into sequences
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.io.univocity
Bindings to univocity. Transforms csv's, tsv's into sequences
of string arrays that are then passed into tech.v3.dataset.io.string-row-parser
methods.
create-csv-parser
(create-csv-parser {:keys [header-row? num-rows column-whitelist column-blacklist column-allowlist column-blocklist separator n-initial-skip-rows], :or {header-row? true}, :as options})
Create an implementation of univocity csv parser.
diff --git a/docs/tech.v3.dataset.join.html b/docs/tech.v3.dataset.join.html
index b691dd58..7df69cfc 100644
--- a/docs/tech.v3.dataset.join.html
+++ b/docs/tech.v3.dataset.join.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.join
implementation of join algorithms, both exact (hash-join) and near.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.join
implementation of join algorithms, both exact (hash-join) and near.
hash-join
(hash-join colname lhs rhs)
(hash-join colname lhs rhs {:keys [operation-space], :or {operation-space :int32}, :as options})
Join by column. For efficiency, lhs should be smaller than rhs.
colname - may be a single item or a tuple, in which case it is destructured as:
(let [[lhs-colname rhs-colname] colname] ...)
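A small sketch of both `colname` forms, assuming the default options produce an inner join (rows whose key appears on both sides). The datasets and column names below are hypothetical.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

;; Hypothetical example datasets.
(def lhs (ds/->dataset {:id [1 2 3] :left-val ["a" "b" "c"]}))
(def rhs (ds/->dataset {:id [2 3 4] :right-val [20 30 40]}))

;; Single column name: both sides are joined on :id.
(def joined (ds-join/hash-join :id lhs rhs))

;; Tuple form: the join column is :id on the left and :key on the right.
(def rhs-keyed (ds/->dataset {:key [2 3 4] :right-val [20 30 40]}))
(def joined-tuple (ds-join/hash-join [:id :key] lhs rhs-keyed))

(ds/row-count joined)       ; ids 2 and 3 appear on both sides
(ds/row-count joined-tuple)
```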
diff --git a/docs/tech.v3.dataset.math.html b/docs/tech.v3.dataset.math.html
index 09102d6c..7dc502e3 100644
--- a/docs/tech.v3.dataset.math.html
+++ b/docs/tech.v3.dataset.math.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.math
Various mathematic transformations of datasets such as (inefficiently)
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.math
Various mathematic transformations of datasets such as (inefficiently)
building simple tables, pca, and normalizing columns to have mean of 0 and variance of 1.
More in-depth transformations are found at tech.v3.dataset.neanderthal
.
correlation-table
(correlation-table dataset & {:keys [correlation-type colname-seq]})
Return a map of colname->list of sorted tuple of colname, coefficient.
diff --git a/docs/tech.v3.dataset.metamorph.html b/docs/tech.v3.dataset.metamorph.html
index b7927e66..7860ac34 100644
--- a/docs/tech.v3.dataset.metamorph.html
+++ b/docs/tech.v3.dataset.metamorph.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.metamorph
This is an auto-generated api system - it scans the namespaces and changes the first
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.metamorph
This is an auto-generated api system - it scans the namespaces and changes the first
to be metamorph-compliant which means transforming an argument that is just a dataset into
an argument that is a metamorph context - a map of {:metamorph/data ds}
. They also return
their result as a metamorph context.
diff --git a/docs/tech.v3.dataset.modelling.html b/docs/tech.v3.dataset.modelling.html
index 7d37b353..909f3107 100644
--- a/docs/tech.v3.dataset.modelling.html
+++ b/docs/tech.v3.dataset.modelling.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.modelling
Methods related specifically to machine learning such as setting the inference
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.modelling
Methods related specifically to machine learning such as setting the inference
target. This file integrates tightly with tech.v3.dataset.categorical which provides
categorical -> number and one-hot transformation pathways.
The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via clojure.core/meta
diff --git a/docs/tech.v3.dataset.print.html b/docs/tech.v3.dataset.print.html
index 07c6919c..c3212f7a 100644
--- a/docs/tech.v3.dataset.print.html
+++ b/docs/tech.v3.dataset.print.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.print
dataset->str
(dataset->str ds options)
(dataset->str ds)
Convert a dataset to a string. Prints a single line header and then calls
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.print
dataset->str
(dataset->str ds options)
(dataset->str ds)
Convert a dataset to a string. Prints a single line header and then calls
dataset-data->str.
For options documentation see dataset-data->str.
dataset-data->str
(dataset-data->str dataset)
(dataset-data->str dataset options)
Convert the dataset values to a string.
diff --git a/docs/tech.v3.dataset.reductions.apache-data-sketch.html b/docs/tech.v3.dataset.reductions.apache-data-sketch.html
index 925754b2..a80c280d 100644
--- a/docs/tech.v3.dataset.reductions.apache-data-sketch.html
+++ b/docs/tech.v3.dataset.reductions.apache-data-sketch.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.reductions.apache-data-sketch
Reduction reducers based on the apache data sketch family of algorithms.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.reductions.apache-data-sketch
Reduction reducers based on the apache data sketch family of algorithms.
diff --git a/docs/tech.v3.dataset.reductions.html b/docs/tech.v3.dataset.reductions.html
index f789c303..ac2e8c55 100644
--- a/docs/tech.v3.dataset.reductions.html
+++ b/docs/tech.v3.dataset.reductions.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.reductions
Specific high performance reductions intended to be performed over a sequence
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.reductions
Specific high performance reductions intended to be performed over a sequence
of datasets. This allows aggregations to be done in situations where the dataset is
larger than what will fit in memory on a normal machine. Due to this fact, summation
is implemented using Kahan algorithm and various statistical methods are done in using
diff --git a/docs/tech.v3.dataset.rolling.html b/docs/tech.v3.dataset.rolling.html
index 79bc1023..1840bafa 100644
--- a/docs/tech.v3.dataset.rolling.html
+++ b/docs/tech.v3.dataset.rolling.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.rolling
Implement a generalized rolling window including support for time-based variable
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.rolling
Implement a generalized rolling window including support for time-based variable
width windows.
expanding
(expanding ds reducer-map)
Run a set of reducers across a dataset with an expanding set of windows. These
will produce a cumsum-type operation.
diff --git a/docs/tech.v3.dataset.set.html b/docs/tech.v3.dataset.set.html
index 90459216..f97b5d3b 100644
--- a/docs/tech.v3.dataset.set.html
+++ b/docs/tech.v3.dataset.set.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.set
Extensions to datasets to do per-row bag-semantics set/union and intersection.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.set
Extensions to datasets to do per-row bag-semantics set/union and intersection.
intersection
(intersection a)
(intersection a b)
(intersection a b & args)
Intersect two datasets producing a new dataset with the intersection of tuples.
Tuples repeated across all datasets repeated in final dataset at their minimum
diff --git a/docs/tech.v3.dataset.tensor.html b/docs/tech.v3.dataset.tensor.html
index 799aeb21..031d7ea1 100644
--- a/docs/tech.v3.dataset.tensor.html
+++ b/docs/tech.v3.dataset.tensor.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.tensor
Conversion mechanisms from dataset to tensor and back.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.dataset.tensor
Conversion mechanisms from dataset to tensor and back.
dataset->tensor
(dataset->tensor dataset datatype)
(dataset->tensor dataset)
Convert a dataset to a tensor. Columns of the dataset will be converted
to columns of the tensor. Default datatype is :float64.
mean-center-columns!
(mean-center-columns! tens {:keys [nan-strategy means], :or {nan-strategy :remove}})
(mean-center-columns! tens)
in-place nan-aware mean-center the rows of the tensor. If tensor is writeable then this
diff --git a/docs/tech.v3.dataset.zip.html b/docs/tech.v3.dataset.zip.html
index ed656541..434506bd 100644
--- a/docs/tech.v3.dataset.zip.html
+++ b/docs/tech.v3.dataset.zip.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.zip
Load zip data. Zip files with a single file entry can be loaded with ->dataset. When
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.dataset.zip
Load zip data. Zip files with a single file entry can be loaded with ->dataset. When
a zip file has multiple entries you have to call zipfile->dataset-seq.
dataset-seq->zipfile!
(dataset-seq->zipfile! output options ds-seq)
(dataset-seq->zipfile! output ds-seq)
Write a sequence of datasets to zipfiles. You can control the inner type with the
:file-type option which defaults to .tsv
diff --git a/docs/tech.v3.libs.arrow.html b/docs/tech.v3.libs.arrow.html
index 79c1edfb..aabe6bed 100644
--- a/docs/tech.v3.libs.arrow.html
+++ b/docs/tech.v3.libs.arrow.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.libs.arrow
Support for reading/writing apache arrow datasets. Datasets may be memory mapped
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.arrow
Support for reading/writing apache arrow datasets. Datasets may be memory mapped
but default to being read via an input stream.
Supported datatypes:
diff --git a/docs/tech.v3.libs.clj-transit.html b/docs/tech.v3.libs.clj-transit.html
index 4d946d4d..42cf98ac 100644
--- a/docs/tech.v3.libs.clj-transit.html
+++ b/docs/tech.v3.libs.clj-transit.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.libs.clj-transit
Transit bindings for the jvm version of tech.v3.dataset.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.libs.clj-transit
Transit bindings for the jvm version of tech.v3.dataset.
dataset->transit
(dataset->transit ds out & [format handlers])
Convert a dataset into a transit encoded writer. See source for details.
dataset->transit-str
(dataset->transit-str ds & [format handlers])
Convert a dataset to a transit-encoded json string. See dataset->transit.
java-time-read-handlers
Transit read handlers for java.time.LocalDate and java.time.Instant
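A round trip through the string encoding might look like the sketch below. It mirrors the calls used in the project's own transit test (`dataset->transit-str` and `transit-str->dataset`); the var names are hypothetical, and it assumes the transit dependency is on the classpath.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.clj-transit :as ds-transit])

;; Hypothetical example dataset.
(def original (ds/->dataset {:a [1 2 3]
                             :b [:one :two :three]}))

;; Encode to a transit JSON string, then decode back to a dataset.
(def encoded (ds-transit/dataset->transit-str original))
(def decoded (ds-transit/transit-str->dataset encoded))

;; Column data should survive the round trip.
(= (original :a) (decoded :a))
```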
diff --git a/docs/tech.v3.libs.fastexcel.html b/docs/tech.v3.libs.fastexcel.html
index ace7e26b..b6d70618 100644
--- a/docs/tech.v3.libs.fastexcel.html
+++ b/docs/tech.v3.libs.fastexcel.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.libs.fastexcel
Parse a dataset in xlsx format. This namespace auto-registers a handler for
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.fastexcel
Parse a dataset in xlsx format. This namespace auto-registers a handler for
the 'xlsx' file type so that when using ->dataset, xlsx
will automatically map to
(first (workbook->datasets))
.
Note that this namespace does not auto-register a handler for the xls
file type.
diff --git a/docs/tech.v3.libs.guava.cache.html b/docs/tech.v3.libs.guava.cache.html
index 0903b06c..e097670a 100644
--- a/docs/tech.v3.libs.guava.cache.html
+++ b/docs/tech.v3.libs.guava.cache.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.guava.cache
Use a google guava cache to memoize function results. Function must not return
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.guava.cache
Use a google guava cache to memoize function results. Function must not return
nil values. Exceptions propagate to caller.
memoize
(memoize f & {:keys [write-ttl-ms access-ttl-ms soft-values? weak-values? max-size record-stats?]})
Create a threadsafe, efficient memoized function using a guavacache backing store.
diff --git a/docs/tech.v3.libs.parquet.html b/docs/tech.v3.libs.parquet.html
index dfc97861..d9c8622f 100644
--- a/docs/tech.v3.libs.parquet.html
+++ b/docs/tech.v3.libs.parquet.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.libs.parquet
Support for reading Parquet files. You must require this namespace to
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.parquet
Support for reading Parquet files. You must require this namespace to
enable parquet read/write support.
Supported datatypes:
diff --git a/docs/tech.v3.libs.poi.html b/docs/tech.v3.libs.poi.html
index 4bc6d53b..f7202ac4 100644
--- a/docs/tech.v3.libs.poi.html
+++ b/docs/tech.v3.libs.poi.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');tech.v3.libs.poi
Parse a dataset in xls or xlsx format. This namespace auto-registers a handler for
+ gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.poi
Parse a dataset in xls or xlsx format. This namespace auto-registers a handler for
the xls
file type so that when using ->dataset, xls
will automatically map to
(first (workbook->datasets))
.
Note that this namespace does not auto-register a handler for the xlsx
file
diff --git a/docs/tech.v3.libs.smile.data.html b/docs/tech.v3.libs.smile.data.html
index f7926e4a..a975d1f7 100644
--- a/docs/tech.v3.libs.smile.data.html
+++ b/docs/tech.v3.libs.smile.data.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.smile.data
Bindings to the smile DataFrame system.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.libs.smile.data
Bindings to the smile DataFrame system.
column->smile-column
(column->smile-column col)
Convert a dataset column to a smile vector.
dataset->smile-dataframe
(dataset->smile-dataframe ds)
Convert a dataset to a smile dataframe.
This operation may clone columns if they aren't backed by java heap arrays.
diff --git a/docs/tech.v3.libs.tribuo.html b/docs/tech.v3.libs.tribuo.html
index e8bfcbba..53ad272c 100644
--- a/docs/tech.v3.libs.tribuo.html
+++ b/docs/tech.v3.libs.tribuo.html
@@ -4,7 +4,7 @@
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
- gtag('config', 'G-95TVFC1FEB');
tech.v3.libs.tribuo
Bindings to make working with tribuo more straightforward when using datasets.
+ gtag('config', 'G-95TVFC1FEB');tech.v3.libs.tribuo
Bindings to make working with tribuo more straightforward when using datasets.
;; Classification
tech.v3.dataset.tribuo-test> (def ds (classification-example-ds 10000))
diff --git a/scripts/enable-jdk21-m1 b/scripts/enable-jdk21-m1
new file mode 100755
index 00000000..dec96d89
--- /dev/null
+++ b/scripts/enable-jdk21-m1
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+
+VERSION="21.0.2"
+
+if [ ! -e jdk-$VERSION ]; then
+ echo "Downloading JDK $VERSION"
+ wget https://download.java.net/java/GA/jdk21.0.2/f2283984656d49d69e91c558476027ac/13/GPL/openjdk-21.0.2_macos-aarch64_bin.tar.gz
+ tar -xvzf openjdk-21.0.2_macos-aarch64_bin.tar.gz
+ rm openjdk-21.0.2_macos-aarch64_bin.tar.gz
+fi
+
+export PATH=$(pwd)/jdk-$VERSION/bin:$PATH
+export JAVA_HOME=$(pwd)/jdk-$VERSION/
diff --git a/src/tech/v3/libs/clj_transit.clj b/src/tech/v3/libs/clj_transit.clj
index a5fd6b19..e58e4541 100644
--- a/src/tech/v3/libs/clj_transit.clj
+++ b/src/tech/v3/libs/clj_transit.clj
@@ -140,8 +140,8 @@
(text-col->data col)
(#{:packed-local-date :local-date} col-dt)
(obj-col->numeric-b64 col :int32 dtype-dt/local-date->days-since-epoch)
- (#{:packed-instant :instant} col-dt)
- (obj-col->numeric-b64 col :int64 dtype-dt/instant->microseconds-since-epoch)
+ (#{:packed-instant :instant :packed-milli-instant} col-dt)
+ (obj-col->numeric-b64 col :int64 dtype-dt/instant->milliseconds-since-epoch)
:else ;;Punt!!
(vec col))}))
@@ -238,7 +238,7 @@
(= :instant dtype)
(-> (b64->numeric-data data :int64)
(dtype/->array-buffer)
- (abuf/set-datatype :packed-instant))
+ (abuf/set-datatype :packed-milli-instant))
:else
(dtype/make-container dtype data))
:name (:name metadata)})))
@@ -260,12 +260,12 @@
(def ^{:doc "Transit write handlers for java.time.LocalDate and java.time.Instant"}
java-time-write-handlers
{LocalDate (t/write-handler "java.time.LocalDate" dtype-dt/local-date->days-since-epoch)
- Instant (t/write-handler "java.time.Instant" dtype-dt/instant->microseconds-since-epoch)})
+ Instant (t/write-handler "java.time.Instant" dtype-dt/instant->milliseconds-since-epoch)})
(def ^{:doc "Transit read handlers for java.time.LocalDate and java.time.Instant"}
java-time-read-handlers
{"java.time.LocalDate" (t/read-handler dtype-dt/days-since-epoch->local-date)
- "java.time.Instant" (t/read-handler dtype-dt/microseconds-since-epoch->instant)})
+ "java.time.Instant" (t/read-handler dtype-dt/milliseconds-since-epoch->instant)})
(defn dataset->transit
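One consequence of encoding instants as milliseconds since the epoch is that any sub-millisecond part of an `Instant` is dropped on the way through. The sketch below shows this with plain `java.time` interop as a stand-in for the `dtype-dt` conversion functions used in the change; the var names are hypothetical.

```clojure
(import 'java.time.Instant)

;; An instant with sub-millisecond precision: 1 s + 123,456,789 ns.
(def precise (Instant/ofEpochSecond 1 123456789))

;; Encode as milliseconds since the epoch, then decode again.
(def millis (.toEpochMilli precise))              ; 1123
(def round-tripped (Instant/ofEpochMilli millis))

(.getNano precise)        ; 123456789
(.getNano round-tripped)  ; 123000000
(= precise round-tripped) ; false: the microsecond and nanosecond digits are gone
```

This is presumably why the updated test builds its instant column as `:packed-milli-instant`, so the values are already at millisecond precision before the round trip.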
diff --git a/test/tech/v3/dataset/parse_test.clj b/test/tech/v3/dataset/parse_test.clj
index 3cc6add9..b7c21bd7 100644
--- a/test/tech/v3/dataset/parse_test.clj
+++ b/test/tech/v3/dataset/parse_test.clj
@@ -505,7 +505,8 @@
(deftest issue-434-transit-support
(let [ds (ds/->dataset {:a [1 2 3]
:b [:one :two :three]
- :c [(java.time.Instant/now) (java.time.Instant/now)]})
+ ;;transit encoding is milli instants
+ :c (dtype/make-container :packed-milli-instant [(java.time.Instant/now) (java.time.Instant/now)])})
str-data (ds-transit/dataset->transit-str ds)
nds (ds-transit/transit-str->dataset str-data)]
(is (= (ds :a) (nds :a)))