tech.ml.dataset Getting Started
What kind of data?
TMD processes tabular data, that is, data logically arranged in rows and columns. Similar to a spreadsheet (but handling much larger datasets) or a database (but much more convenient), TMD accelerates exploring, cleaning, and processing data tables. TMD inherits Clojure's data orientation and flexible dynamic typing without compromising on being functional, thereby extending the language's reach to new problems and domains.
> (ds/->dataset "lucy.csv")
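The call above can be fleshed out into a minimal session. `lucy.csv` is a placeholder file, so this sketch builds a small dataset from inline data instead; it assumes tech.ml.dataset is on the classpath:

```clojure
;; A minimal session, assuming tech.ml.dataset is on the classpath.
(require '[tech.v3.dataset :as ds])

;; Inline data stands in for the hypothetical lucy.csv;
;; (ds/->dataset "lucy.csv") would work the same way for a real file.
(def stocks
  (ds/->dataset [{:symbol "MSFT" :price 23.10}
                 {:symbol "AAPL" :price 28.92}
                 {:symbol "MSFT" :price 24.05}]))

(ds/row-count stocks)      ;; => 3
(ds/column-names stocks)   ;; the column names, here :symbol and :price
(ds/head stocks 2)         ;; first two rows, printed as a table
```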
tech.ml.dataset Walkthrough
tech.ml.dataset (TMD) is a Clojure library designed to ease working with tabular data, similar to data.table in R or Python's Pandas. TMD takes inspiration from the design of those tools, but does not aim to copy their functionality. Instead, TMD is a building block that increases Clojure's already considerable data processing power.
High Level Design
In TMD, a dataset is logically a map of column name to column data. Column data is typed (e.g., a column of 16 bit integers, or a column of 64 bit floating point numbers), similar to a database. Column names may be any Java object - keywords and strings are typical - and column values may be any Java primitive type, any type supported by tech.datatype, datetimes, or arbitrary objects. Column data is stored contiguously in JVM arrays, and missing values are indicated with bitsets.
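A short sketch of that model (column names here are arbitrary examples; exact return types, such as bitmap vs. set for the missing indexes, may vary by version):

```clojure
(require '[tech.v3.dataset :as ds])

;; A dataset behaves like a map of column name -> typed column.
(def d (ds/->dataset {:age  [31 nil 47]            ;; nil marks a missing value
                      :name ["ann" "bob" "cat"]}))

;; Each column carries a datatype in its metadata.
(map (comp :datatype meta) (ds/columns d))

;; Missing values are tracked per column as a set of row indexes
;; (index 1 of :age in this example).
(ds/missing (ds/column d :age))
```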
tech.ml.dataset Quick Reference
This topic summarizes many of the most frequently used TMD functions, together with some quick notes about their use. Functions here are linked to further documentation, or their source. Note, unless a namespace is specified, each function is accessible via the tech.ml.dataset namespace.
For a more thorough treatment, the API docs list every available function.
tech.ml.dataset Columns, Readers, and Datatypes
In tech.ml.dataset, columns are composed of three things: data, metadata, and the missing set. The column's datatype is the datatype of the data member.
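Those three pieces can be inspected directly. A hedged sketch, assuming tech.ml.dataset is on the classpath:

```clojure
(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:a [1 nil 3]}))
(def col (ds/column d :a))

(meta col)         ;; the metadata map - includes the column :name and :datatype
(ds/missing col)   ;; the missing set - row indexes whose values are missing
(vec col)          ;; the data, realized as a vector (missing shows as nil)
```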
TMD 7.004
A Clojure high performance data processing system.
Topics
- tech.ml.dataset Getting Started
- tech.ml.dataset Walkthrough
- tech.ml.dataset Quick Reference
- tech.ml.dataset Columns, Readers, and Datatypes
- tech.ml.dataset And nippy
- tech.ml.dataset Supported Datatypes
Namespaces
tech.v3.dataset
Column major dataset abstraction for efficiently manipulating
in memory datasets.
Public variables and functions:
- ->>dataset
- ->dataset
- add-column
- add-or-update-column
- all-descriptive-stats-names
- append-columns
- assoc-ds
- assoc-metadata
- bind->
- brief
- categorical->number
- categorical->one-hot
- column
- column->dataset
- column-cast
- column-count
- column-labeled-mapseq
- column-map
- column-map-m
- column-names
- columns
- columns-with-missing-seq
- columnwise-concat
- concat
- concat-copying
- concat-inplace
- data->dataset
- dataset->data
- dataset-name
- dataset-parser
- dataset?
- descriptive-stats
- drop-columns
- drop-missing
- drop-rows
- empty-dataset
- ensure-array-backed
- filter
- filter-column
- filter-dataset
- group-by
- group-by->indexes
- group-by-column
- group-by-column->indexes
- group-by-column-consumer
- has-column?
- head
- induction
- major-version
- mapseq-parser
- mapseq-reader
- mapseq-rf
- min-n-by-column
- missing
- new-column
- new-dataset
- order-column-names
- pmap-ds
- print-all
- rand-nth
- remove-column
- remove-columns
- remove-rows
- rename-columns
- replace-missing
- replace-missing-value
- reverse-rows
- row-at
- row-count
- row-map
- row-mapcat
- rows
- rowvec-at
- rowvecs
- sample
- select
- select-by-index
- select-columns
- select-columns-by-index
- select-missing
- select-rows
- set-dataset-name
- shape
- shuffle
- sort-by
- sort-by-column
- tail
- take-nth
- unique-by
- unique-by-column
- unordered-select
- unroll-column
- update
- update-column
- update-columns
- update-columnwise
- update-elemwise
- value-reader
- write!
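To give a flavor of the list above, a few of the most common calls, sketched on toy data (not exhaustive; see the linked API docs for full signatures):

```clojure
(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:a [3 1 2] :b ["x" "y" "z"]}))

(ds/head d 2)                      ;; first rows
(ds/select-columns d [:a])         ;; keep only column :a
(ds/sort-by-column d :a)           ;; rows ordered by :a
(ds/filter-column d :a #(> % 1))   ;; rows where :a > 1
(ds/descriptive-stats d)           ;; summary statistics per column
```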
tech.v3.dataset.categorical
Conversions of categorical values into numbers and back. Two forms of conversions
are supported, a straight value->integer map and one-hot encoding.
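A hedged sketch of both conversion pathways; the column name is illustrative, and passing a vector of column names as the selector is an assumption to check against the namespace docs:

```clojure
(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:fruit ["apple" "pear" "apple"]}))

;; value->integer mapping: each distinct string becomes a number.
(ds/categorical->number d [:fruit])

;; one-hot encoding: one 0/1 column per distinct value.
(ds/categorical->one-hot d [:fruit])
```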
Public variables and functions:
tech.v3.dataset.io.datetime
Helpful and well tested string->datetime pathways.
tech.v3.dataset.io.string-row-parser
Parsing functions based on raw data that is represented by a sequence
of string arrays.
Public variables and functions:
tech.v3.dataset.io.univocity
Bindings to univocity. Transforms csv's, tsv's into sequences of string arrays that are then passed into tech.v3.dataset.io.string-row-parser methods.
Public variables and functions:
tech.v3.dataset.join
Implementation of join algorithms, both exact (hash-join) and near.
Public variables and functions:
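For example, a hash-based left join might be sketched as follows (the join column comes first, then the left and right datasets; check the API docs for the options map):

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.join :as ds-join])

(def people (ds/->dataset {:id [1 2 3] :name ["ann" "bob" "cat"]}))
(def orders (ds/->dataset {:id [1 1 3] :total [10 20 30]}))

;; Left join on :id - every row of people, matched against rows of orders.
(ds-join/left-join :id people orders)
```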
tech.v3.dataset.math
Various mathematical transformations of datasets such as (inefficiently) building simple tables, pca, and normalizing columns to have mean of 0 and variance of 1. More in-depth transformations are found at tech.v3.dataset.neanderthal.
Public variables and functions:
tech.v3.dataset.modelling
Methods related specifically to machine learning such as setting the inference
target. This file integrates tightly with tech.v3.dataset.categorical which provides
categorical -> number and one-hot transformation pathways.
Public variables and functions:
- column-values->categorical
- dataset->categorical-xforms
- feature-ecount
- inference-column?
- inference-target-column-names
- inference-target-ds
- inference-target-label-inverse-map
- inference-target-label-map
- k-fold-datasets
- labels
- model-type
- num-inference-classes
- probability-distributions->label-column
- set-inference-target
- train-test-split
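A hedged sketch of a typical flow using function names from the list above; the exact shape of the split result (e.g. a map of train/test sub-datasets) should be checked against the API docs for your version:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.modelling :as ds-mod])

(def d (ds/->dataset {:x [1 2 3 4] :y [0 1 0 1]}))

;; Mark :y as the inference target for downstream ML tooling.
(def labeled (ds-mod/set-inference-target d :y))

;; Split into train/test datasets.
(ds-mod/train-test-split labeled)
```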
tech.v3.dataset.neanderthal
Conversion of a dataset to/from a neanderthal dense matrix as well as various dataset transformations such as pca, covariance and correlation matrices.
Public variables and functions:
tech.v3.dataset.reductions
Specific high performance reductions intended to be performed over a sequence of datasets. This allows aggregations to be done in situations where the dataset is larger than what will fit in memory on a normal machine. Due to this fact, summation is implemented using the Kahan algorithm and various statistical methods are done using
Public variables and functions:
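For instance, aggregating across a (possibly lazy) sequence of dataset chunks might be sketched as below; the aggregator names come from this namespace, but their exact signatures are assumptions to verify against the API docs:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Two small chunks standing in for a larger-than-memory dataset sequence.
(def chunks [(ds/->dataset {:g ["a" "b"] :v [1 2]})
             (ds/->dataset {:g ["a" "b"] :v [3 4]})])

;; Group by :g across all chunks, summing :v with Kahan-compensated sums.
(ds-reduce/group-by-column-agg
 :g {:v-sum (ds-reduce/sum :v)}
 chunks)
```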
tech.v3.dataset.tensor
Conversion mechanisms from dataset to tensor and back.
Public variables and functions:
tech.v3.dataset.zip
Load zip data. Zip files with a single file entry can be loaded with ->dataset. When
a zip file has multiple entries you have to call zipfile->dataset-seq.
Public variables and functions:
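A sketch of both paths; the file names are placeholders, so the calls are left commented:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.zip :as ds-zip])

;; Single-entry zip: ->dataset handles it directly.
;; (ds/->dataset "single-entry.zip")

;; Multi-entry zip: get a sequence of datasets, one per entry.
;; (ds-zip/zipfile->dataset-seq "multi-entry.zip")
```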
tech.v3.libs.arrow
Support for reading/writing apache arrow datasets. Datasets may be memory mapped but default to being read via an input stream.
Public variables and functions:
tech.v3.libs.fastexcel
Parse a dataset in xlsx format. This namespace auto-registers a handler for the 'xlsx' file type so that when using ->dataset, xlsx will automatically map to (first (workbook->datasets)).
Public variables and functions:
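Loading a spreadsheet therefore reduces to a single call once the namespace is required; the path below is a placeholder, so the call is left commented:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.libs.fastexcel])   ;; registers the xlsx handler

;; With the handler registered, ->dataset reads xlsx directly:
;; (ds/->dataset "spreadsheet.xlsx")
```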
tech.v3.libs.guava.cache
Use a google guava cache to memoize function results. Function must not return nil values. Exceptions propagate to caller.
Public variables and functions:
tech.v3.libs.parquet
Support for reading Parquet files. You must require this namespace to enable parquet read/write support.
Public variables and functions:
tech.v3.libs.poi
Parse a dataset in xls or xlsx format. This namespace auto-registers a handler for the xls file type so that when using ->dataset, xls will automatically map to (first (workbook->datasets)).
Public variables and functions:
tech.v3.libs.tribuo
Bindings to make working with tribuo more straightforward when using datasets.