diff --git a/docs/000-getting-started.html b/docs/000-getting-started.html
index 45ffe06f..cb1fa9eb 100644
--- a/docs/000-getting-started.html
+++ b/docs/000-getting-started.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Getting Started

+ gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Getting Started

What kind of data?

TMD processes tabular data, that is, data logically arranged in rows and columns. Similar to a spreadsheet (but handling much larger datasets) or a database (but much more convenient), TMD accelerates exploring, cleaning, and processing data tables. TMD inherits Clojure's data orientation and flexible dynamic typing without compromising functional programming, thereby extending the language's reach to new problems and domains.

> (ds/->dataset "lucy.csv")
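A minimal sketch of that first step (a hypothetical local `lucy.csv`, with TMD on the classpath; the inspection calls are standard tech.v3.dataset functions):

```clojure
;; Sketch, not from the original document: load a CSV and look around.
(require '[tech.v3.dataset :as ds])

(def lucy (ds/->dataset "lucy.csv"))

(ds/row-count lucy)     ;; how many rows were parsed
(ds/column-names lucy)  ;; the inferred column names
(ds/head lucy)          ;; a new dataset holding the first rows
```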
diff --git a/docs/100-walkthrough.html b/docs/100-walkthrough.html
index 7f49451d..7e4e68d8 100644
--- a/docs/100-walkthrough.html
+++ b/docs/100-walkthrough.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Walkthrough

+ gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Walkthrough

tech.ml.dataset (TMD) is a Clojure library designed to ease working with tabular data, similar to data.table in R or Python's Pandas. TMD takes inspiration from the design of those tools, but does not aim to copy their functionality. Instead, TMD is a building block that increases Clojure's already considerable data processing power.

High Level Design

In TMD, a dataset is logically a map of column name to column data. Column data is typed (e.g., a column of 16 bit integers, or a column of 64 bit floating point numbers), similar to a database. Column names may be any Java object - keywords and strings are typical - and column values may be any Java primitive type, or type supported by tech.datatype, datetimes, or arbitrary objects. Column data is stored contiguously in JVM arrays, and missing values are indicated with bitsets.
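That design can be sketched at the REPL (the column names and values below are illustrative assumptions, not from the document):

```clojure
(require '[tech.v3.dataset :as ds])

;; A dataset is logically a map of column name -> typed column data.
(def d (ds/->dataset {:age   [21 32 43]
                      :score [1.5 2.5 3.5]}))

;; Columns are looked up by name, like map entries.
(d :age)

;; Each column's metadata records its parsed datatype.
(meta (d :score))
```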

diff --git a/docs/200-quick-reference.html b/docs/200-quick-reference.html
index e2ce40bf..00ea4035 100644
--- a/docs/200-quick-reference.html
+++ b/docs/200-quick-reference.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Quick Reference

+ gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Quick Reference

This topic summarizes many of the most frequently used TMD functions, together with some quick notes about their use. Functions here are linked to further documentation or their source. Note that, unless a namespace is specified, each function is accessible via the tech.ml.dataset namespace.

For a more thorough treatment, the API docs list every available function.

Table of Contents

diff --git a/docs/columns-readers-and-datatypes.html b/docs/columns-readers-and-datatypes.html
index 6aae2c2c..cc9a8ebc 100644
--- a/docs/columns-readers-and-datatypes.html
+++ b/docs/columns-readers-and-datatypes.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Columns, Readers, and Datatypes

+ gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Columns, Readers, and Datatypes

In tech.ml.dataset, columns are composed of three things: data, metadata, and the missing set. The column's datatype is the datatype of the data member. The data member can

diff --git a/docs/index.html b/docs/index.html
index 52a7b314..034587e8 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

TMD 7.004

A Clojure high performance data processing system.

Topics

Namespaces

tech.v3.dataset

Column major dataset abstraction for efficiently manipulating

+ gtag('config', 'G-95TVFC1FEB');

TMD 7.004

A Clojure high performance data processing system.

Topics

Namespaces

tech.v3.dataset.categorical

Conversions of categorical values into numbers and back. Two forms of conversion are supported: a straight value->integer map and one-hot encoding.

@@ -16,10 +16,7 @@

Public variables and functions:

tech.v3.dataset.io.string-row-parser

Parsing functions based on raw data that is represented by a sequence of string arrays.

-

Public variables and functions:

tech.v3.dataset.io.univocity

Bindings to univocity. Transforms csv's, tsv's into sequences
-of string arrays that are then passed into tech.v3.dataset.io.string-row-parser
-methods.

-

tech.v3.dataset.join

Implementation of join algorithms, both exact (hash-join) and near.

+

Public variables and functions:

tech.v3.dataset.join

Implementation of join algorithms, both exact (hash-join) and near.

tech.v3.dataset.math

Various mathematical transformations of datasets such as (inefficiently) building simple tables, pca, and normalizing columns to have a mean of 0 and a variance of 1. More in-depth transformations are found at tech.v3.dataset.neanderthal.

@@ -30,9 +27,7 @@

Public variables and functions:

tech.v3.dataset.modelling

Methods related specifically to machine learning such as setting the inference target. This file integrates tightly with tech.v3.dataset.categorical which provides categorical -> number and one-hot transformation pathways.

-

tech.v3.dataset.neanderthal

Conversion of a dataset to/from a neanderthal dense matrix as well as various
-dataset transformations such as pca, covariance and correlation matrixes.

-

tech.v3.dataset.reductions

Specific high performance reductions intended to be performed over a sequence of datasets. This allows aggregations to be done in situations where the dataset is larger than what will fit in memory on a normal machine. Due to this fact, summation is implemented using the Kahan algorithm and various statistical methods are done in using

@@ -45,18 +40,4 @@

tech.v3.dataset.tensor

Conversion mechanisms from dataset to tensor and back.

Public variables and functions:

tech.v3.dataset.zip

Load zip data. Zip files with a single file entry can be loaded with ->dataset. When a zip file has multiple entries you have to call zipfile->dataset-seq.

-

Public variables and functions:

tech.v3.libs.arrow

Support for reading/writing apache arrow datasets. Datasets may be memory mapped
-but default to being read via an input stream.

-

tech.v3.libs.fastexcel

Parse a dataset in xlsx format. This namespace auto-registers a handler for
-the 'xlsx' file type so that when using ->dataset, xlsx will automatically map to
-(first (workbook->datasets)).

-

Public variables and functions:

tech.v3.libs.guava.cache

Use a google guava cache to memoize function results. Function must not return
-nil values. Exceptions propagate to caller.

-

Public variables and functions:

tech.v3.libs.parquet

Support for reading Parquet files. You must require this namespace to
-enable parquet read/write support.

-

tech.v3.libs.poi

Parse a dataset in xls or xlsx format. This namespace auto-registers a handler for
-the xls file type so that when using ->dataset, xls will automatically map to
-(first (workbook->datasets)).

-

Public variables and functions:

tech.v3.libs.smile.data

Bindings to the smile DataFrame system.

-
\ No newline at end of file
+

Public variables and functions:

\ No newline at end of file
diff --git a/docs/nippy-serialization-rocks.html b/docs/nippy-serialization-rocks.html
index d2a73fcf..4d851710 100644
--- a/docs/nippy-serialization-rocks.html
+++ b/docs/nippy-serialization-rocks.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset And nippy

+ gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset And nippy

We are big fans of the nippy system for freezing/thawing data. So we were pleasantly surprised with how well it performs with dataset and how easy it was to extend the dataset object to support nippy

diff --git a/docs/supported-datatypes.html b/docs/supported-datatypes.html
index 9c382adb..5c35d614 100644
--- a/docs/supported-datatypes.html
+++ b/docs/supported-datatypes.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Supported Datatypes

+ gtag('config', 'G-95TVFC1FEB');

tech.ml.dataset Supported Datatypes

tech.ml.dataset supports a wide range of datatypes and has a system for expanding the supported datatype set, aliasing new names to existing datatypes, and packing object datatypes into primitive containers. Let's walk through each of these topics

diff --git a/docs/tech.v3.dataset.categorical.html b/docs/tech.v3.dataset.categorical.html
index 008caecc..06e72ff3 100644
--- a/docs/tech.v3.dataset.categorical.html
+++ b/docs/tech.v3.dataset.categorical.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.categorical

Conversions of categorical values into numbers and back. Two forms of conversions

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.categorical

Conversions of categorical values into numbers and back. Two forms of conversion are supported: a straight value->integer map and one-hot encoding.

The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via clojure.core/meta.
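A sketch of that metadata inspection (the dataset and its :color column are assumed examples):

```clojure
(require '[tech.v3.dataset :as ds])

(def d (ds/->dataset {:color ["red" "green" "red" "blue"]}))

;; Column metadata is plain Clojure metadata, visible via clojure.core/meta.
(meta (d :color))
```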

fit-categorical-map

(fit-categorical-map dataset colname & [table-args res-dtype])

Given a column, map it into a numeric space via a discrete map of values

diff --git a/docs/tech.v3.dataset.clipboard.html b/docs/tech.v3.dataset.clipboard.html
index ea99e395..01ee0f2c 100644
--- a/docs/tech.v3.dataset.clipboard.html
+++ b/docs/tech.v3.dataset.clipboard.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.clipboard

Optional namespace that copies a dataset to the clipboard for pasting into

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.clipboard

Optional namespace that copies a dataset to the clipboard for pasting into applications such as Excel or Google Sheets.

Reading defaults to 'csv' format while writing defaults to 'tsv' format.

clipboard

(clipboard)

Get the system clipboard.

diff --git a/docs/tech.v3.dataset.column-filters.html b/docs/tech.v3.dataset.column-filters.html
index e99ff7a6..c007bff4 100644
--- a/docs/tech.v3.dataset.column-filters.html
+++ b/docs/tech.v3.dataset.column-filters.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.column-filters

Queries to select column subsets that have various properties such as all numeric

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.column-filters

Queries to select column subsets that have various properties such as all numeric columns, all feature columns, or columns that have a specific datatype.

Further, a few set operations (union, intersection, difference) are provided to manipulate subsets of columns.

diff --git a/docs/tech.v3.dataset.column.html b/docs/tech.v3.dataset.column.html
index f2008a71..be5b1a65 100644
--- a/docs/tech.v3.dataset.column.html
+++ b/docs/tech.v3.dataset.column.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.column

clone

(clone col)

Clone this column, changing nothing.

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.column

clone

(clone col)

Clone this column, changing nothing.

column-map

(column-map map-fn res-dtype & args)

Map a scalar function across one or more columns. This is the semi-missing-set aware version of tech.v3.datatype/emap. This function is never lazy.
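A sketch of column-map use (the dataset, column name, and doubling function are assumptions for illustration):

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.column :as ds-col])

(def d (ds/->dataset {:a [1.0 2.0 3.0]}))

;; Map a scalar fn over a column, declaring the :float64 result datatype.
(ds-col/column-map (fn [v] (* 2.0 v)) :float64 (d :a))
```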

diff --git a/docs/tech.v3.dataset.html b/docs/tech.v3.dataset.html
index 26fdfc68..46469b1a 100644
--- a/docs/tech.v3.dataset.html
+++ b/docs/tech.v3.dataset.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset

Column major dataset abstraction for efficiently manipulating

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset

Column major dataset abstraction for efficiently manipulating in memory datasets.

->>dataset

(->>dataset options dataset)(->>dataset dataset)

Please see documentation of ->dataset. Options are the same.

->dataset

(->dataset dataset options)(->dataset dataset)

Create a dataset from either csv/tsv or a sequence of maps.
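Both input shapes can be sketched as follows (the file name and map data are illustrative assumptions):

```clojure
(require '[tech.v3.dataset :as ds])

;; From a sequence of maps -- map keys become column names.
(ds/->dataset [{:name "a" :n 1}
               {:name "b" :n 2}])

;; From a csv/tsv path (or URL), optionally keywordizing column names.
;; (ds/->dataset "data.csv" {:key-fn keyword})
```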

diff --git a/docs/tech.v3.dataset.io.csv.html b/docs/tech.v3.dataset.io.csv.html
index fad24b15..a11f8e18 100644
--- a/docs/tech.v3.dataset.io.csv.html
+++ b/docs/tech.v3.dataset.io.csv.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.io.csv

CSV parsing based on charred.api/read-csv.

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.io.csv

CSV parsing based on charred.api/read-csv.

csv->dataset

(csv->dataset input & [options])

Read a csv into a dataset. Same options as tech.v3.dataset/->dataset.

csv->dataset-seq

(csv->dataset-seq input & [options])

Read a csv into a lazy sequence of datasets. All options of tech.v3.dataset/->dataset are supported aside from :n-initial-skip-rows with an additional option of

diff --git a/docs/tech.v3.dataset.io.datetime.html b/docs/tech.v3.dataset.io.datetime.html
index 46bd8314..e3b6fa9e 100644
--- a/docs/tech.v3.dataset.io.datetime.html
+++ b/docs/tech.v3.dataset.io.datetime.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.io.datetime

Helpful and well tested string->datetime pathways.

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.io.datetime

Helpful and well tested string->datetime pathways.

datatype->general-parse-fn-map

Map of datetime datatype to generalized parse fn.

datetime-formatter-or-str->parser-fn

(datetime-formatter-or-str->parser-fn datatype format-string-or-formatter)

Given a datatype and one of fn? string? DateTimeFormatter, return a function that takes strings and returns datetime objects

diff --git a/docs/tech.v3.dataset.io.string-row-parser.html b/docs/tech.v3.dataset.io.string-row-parser.html
index a5843166..48a7085a 100644
--- a/docs/tech.v3.dataset.io.string-row-parser.html
+++ b/docs/tech.v3.dataset.io.string-row-parser.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.io.string-row-parser

Parsing functions based on raw data that is represented by a sequence

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.io.string-row-parser

Parsing functions based on raw data that is represented by a sequence of string arrays.

partition-all-rows

(partition-all-rows {:keys [header-row?], :or {header-row? true}} n row-seq)

Given a sequence of rows, partition into an undefined number of partitions of at most N rows each, keeping the header row as the first row of every partition.
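A sketch of partition-all-rows on a tiny row sequence (the rows are an assumed example):

```clojure
(require '[tech.v3.dataset.io.string-row-parser :as row-parser])

;; Rows as the parser sees them: a header row followed by data rows.
(def rows [["name" "n"] ["a" "1"] ["b" "2"] ["c" "3"]])

;; Partition into chunks of at most 2 rows; every partition begins with
;; the header row because :header-row? is true.
(row-parser/partition-all-rows {:header-row? true} 2 rows)
```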

diff --git a/docs/tech.v3.dataset.join.html b/docs/tech.v3.dataset.join.html
index aefef489..8381b70b 100644
--- a/docs/tech.v3.dataset.join.html
+++ b/docs/tech.v3.dataset.join.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.join

Implementation of join algorithms, both exact (hash-join) and near.

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.join

Implementation of join algorithms, both exact (hash-join) and near.

hash-join

(hash-join colname lhs rhs)(hash-join colname lhs rhs {:keys [operation-space], :or {operation-space :int32}, :as options})

Join by column. For efficiency, lhs should be smaller than rhs. colname may be a single item or a tuple, in which case it destructures as: (let [[lhs-colname rhs-colname] colname] ...)

diff --git a/docs/tech.v3.dataset.math.html b/docs/tech.v3.dataset.math.html
index 5c31d4a1..22da6bc4 100644
--- a/docs/tech.v3.dataset.math.html
+++ b/docs/tech.v3.dataset.math.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.math

Various mathematic transformations of datasets such as (inefficiently) + gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.math

Various mathematical transformations of datasets such as (inefficiently)

+ gtag('config', 'G-95TVFC1FEB');

correlation-table

(correlation-table dataset & {:keys [correlation-type colname-seq]})

Return a map of colname -> sorted list of [colname coefficient] tuples.

diff --git a/docs/tech.v3.dataset.metamorph.html b/docs/tech.v3.dataset.metamorph.html
index 7b19f943..59919569 100644
--- a/docs/tech.v3.dataset.metamorph.html
+++ b/docs/tech.v3.dataset.metamorph.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.metamorph

This is an auto-generated api system - it scans the namespaces and changes the first

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.metamorph

This is an auto-generated api system - it scans the namespaces and changes the first to be metamorph-compliant which means transforming an argument that is just a dataset into an argument that is a metamorph context - a map of {:metamorph/data ds}. They also return their result as a metamorph context.

diff --git a/docs/tech.v3.dataset.modelling.html b/docs/tech.v3.dataset.modelling.html
index 401df51c..24b3af08 100644
--- a/docs/tech.v3.dataset.modelling.html
+++ b/docs/tech.v3.dataset.modelling.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.modelling

Methods related specifically to machine learning such as setting the inference + gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.modelling

Methods related specifically to machine learning such as setting the inference target. This file integrates tightly with tech.v3.dataset.categorical which provides categorical -> number and one-hot transformation pathways.

The functions in this namespace manipulate the metadata on the columns of the dataset, which can be inspected via clojure.core/meta.

diff --git a/docs/tech.v3.dataset.print.html b/docs/tech.v3.dataset.print.html
index d57330d3..320f6f38 100644
--- a/docs/tech.v3.dataset.print.html
+++ b/docs/tech.v3.dataset.print.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.print

dataset->str

(dataset->str ds options)(dataset->str ds)

Convert a dataset to a string. Prints a single line header and then calls

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.print

dataset->str

(dataset->str ds options)(dataset->str ds)

Convert a dataset to a string. Prints a single line header and then calls dataset-data->str.

For options documentation see dataset-data->str.

dataset-data->str

(dataset-data->str dataset)(dataset-data->str dataset options)

Convert the dataset values to a string.

diff --git a/docs/tech.v3.dataset.reductions.apache-data-sketch.html b/docs/tech.v3.dataset.reductions.apache-data-sketch.html
index 171ac81d..bb0366d4 100644
--- a/docs/tech.v3.dataset.reductions.apache-data-sketch.html
+++ b/docs/tech.v3.dataset.reductions.apache-data-sketch.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.reductions.apache-data-sketch

Reduction reducers based on the apache data sketch family of algorithms.

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.reductions.apache-data-sketch

Reduction reducers based on the apache data sketch family of algorithms.

diff --git a/docs/tech.v3.dataset.reductions.html b/docs/tech.v3.dataset.reductions.html
index 7aa6e195..871845d8 100644
--- a/docs/tech.v3.dataset.reductions.html
+++ b/docs/tech.v3.dataset.reductions.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.reductions

Specific high performance reductions intended to be performed over a sequence

+ gtag('config', 'G-95TVFC1FEB');

tech.v3.dataset.reductions

Specific high performance reductions intended to be performed over a sequence of datasets. This allows aggregations to be done in situations where the dataset is larger than what will fit in memory on a normal machine. Due to this fact, summation is implemented using the Kahan algorithm and various statistical methods are done in using

@@ -57,7 +57,7 @@
     :n-dates (ds-reduce/count-distinct :date :int32)}
    [ds-seq])

-

count-distinct

(count-distinct colname op-space)(count-distinct colname)

distinct

(distinct colname finalizer)(distinct colname)

Create a reducer that will return a set of values.

+

count-distinct

(count-distinct colname op-space)(count-distinct colname)

distinct

(distinct colname finalizer)(distinct colname)

Create a reducer that will return a set of values.

distinct-int32

(distinct-int32 colname finalizer)(distinct-int32 colname)

Get the set of distinct items when you know the value space is no larger than int32 space. The optional finalizer allows you to post-process the data.

first-value

(first-value colname)

group-by-column-agg

(group-by-column-agg colname agg-map options ds-seq)(group-by-column-agg colname agg-map ds-seq)

Group a sequence of datasets by a column and aggregate down into a new dataset.

@@ -70,7 +70,11 @@
 functions from dataset to hamf (non-parallel) reducers. Note that transducer-compatible
 rf's - such as kixi.mean, are valid hamf reducers.

+
+  • ds-seq - Either a single dataset or sequence of datasets.
+
+  See also group-by-column-agg-rf.

    Options:

• :map-initial-capacity - initial hashmap capacity. Resizing hash-maps is expensive so we

@@ -81,50 +85,47 @@
 datasets, this is a bit faster than using filter before the aggregation.

    Example:

    -
    user> (require '[tech.v3.dataset :as ds])
    +
    
    +user> (require '[tech.v3.dataset :as ds])
     nil
     user> (require '[tech.v3.dataset.reductions :as ds-reduce])
     nil
    -user> (def stocks (ds/->dataset "test/data/stocks.csv" {:key-fn keyword}))
    -#'user/stocks
    +user> (def ds (ds/->dataset "https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv"
    +                            {:key-fn keyword}))
    +
    +#'user/ds
     user> (ds-reduce/group-by-column-agg
            :symbol
    -       {:symbol (ds-reduce/first-value :symbol)
    -        :price-avg (ds-reduce/mean :price)
    +       {:price-avg (ds-reduce/mean :price)
             :price-sum (ds-reduce/sum :price)}
    -       [stocks stocks stocks])
    -:symbol-aggregation [5 3]:
    +       ds)
    +_unnamed [5 3]:
     
     | :symbol |   :price-avg | :price-sum |
    -|---------|--------------|------------|
    -|    MSFT |  24.73674797 |    9127.86 |
    -|     IBM |  91.26121951 |   33675.39 |
    -|    AAPL |  64.73048780 |   23885.55 |
    -|    GOOG | 415.87044118 |   84837.57 |
    -|    AMZN |  47.98707317 |   17707.23 |
    -
    +|---------|-------------:|-----------:|
    +|    MSFT |  24.73674797 |    3042.62 |
    +|    AAPL |  64.73048780 |    7961.85 |
    +|     IBM |  91.26121951 |   11225.13 |
    +|    AMZN |  47.98707317 |    5902.41 |
    +|    GOOG | 415.87044118 |   28279.19 |
     
    -
    -tech.v3.dataset.reductions-test> (def tstds
    -                                   (ds/->dataset {:a ["a" "a" "a" "b" "b" "b" "c" "d" "e"]
    -                                                  :b [22   21  22 44  42  44   77 88 99]}))
    -#'tech.v3.dataset.reductions-test/tstds
    -tech.v3.dataset.reductions-test>  (ds-reduce/group-by-column-agg
    -                                   [:a :b] {:a (ds-reduce/first-value :a)
    -                                            :b (ds-reduce/first-value :b)
    -                                            :c (ds-reduce/row-count)}
    -                                   [tstds tstds tstds])
    -:tech.v3.dataset.reductions/_temp_col-aggregation [7 3]:
    +user> (def testds (ds/->dataset {:a ["a" "a" "a" "b" "b" "b" "c" "d" "e"]
    +                                 :b [22   21  22 44  42  44   77 88 99]}))
    +#'user/testds
    +user> (ds-reduce/group-by-column-agg
    +       [:a :b] {:c (ds-reduce/row-count)}
    +       testds)
    +_unnamed [7 3]:
     
     | :a | :b | :c |
     |----|---:|---:|
    -|  a | 21 |  3 |
    -|  a | 22 |  6 |
    -|  b | 42 |  3 |
    -|  b | 44 |  6 |
    -|  c | 77 |  3 |
    -|  d | 88 |  3 |
    -|  e | 99 |  3 |
    +|  e | 99 |  1 |
    +|  a | 21 |  1 |
    +|  c | 77 |  1 |
    +|  d | 88 |  1 |
    +|  b | 44 |  2 |
    +|  b | 42 |  1 |
    +|  a | 22 |  2 |
     

    group-by-column-agg-rf

    (group-by-column-agg-rf colname agg-map)(group-by-column-agg-rf colname agg-map options)

    Produce a transduce-compatible rf that will perform the group-by-column-agg pathway. See documentation for group-by-column-agg.

diff --git a/docs/tech.v3.dataset.rolling.html b/docs/tech.v3.dataset.rolling.html
index b32048ab..e746cd2a 100644
--- a/docs/tech.v3.dataset.rolling.html
+++ b/docs/tech.v3.dataset.rolling.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.rolling

Implement a generalized rolling window including support for time-based variable

+ gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.rolling

    Implement a generalized rolling window including support for time-based variable width windows.

    expanding

    (expanding ds reducer-map)

    Run a set of reducers across a dataset with an expanding set of windows. These will produce a cumsum-type operation.

diff --git a/docs/tech.v3.dataset.set.html b/docs/tech.v3.dataset.set.html
index c52e9665..6a701172 100644
--- a/docs/tech.v3.dataset.set.html
+++ b/docs/tech.v3.dataset.set.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.set

    Extensions to datasets to do per-row bag-semantics set/union and intersection.

    + gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.set

    Extensions to datasets to do per-row bag-semantics set/union and intersection.

    difference

    (difference a)(difference a b)

    Remove tuples from a that also appear in b.
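A sketch of difference with bag semantics (the two datasets are assumed examples):

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.set :as ds-set])

(def a (ds/->dataset {:x [1 1 2 3]}))
(def b (ds/->dataset {:x [1 3]}))

;; Rows (tuples) of b are removed from a, respecting repeat counts.
(ds-set/difference a b)
```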

    intersection

    (intersection a)(intersection a b)(intersection a b & args)

Intersect two datasets producing a new dataset with the union of tuples. Tuples repeated across all datasets are repeated in the final dataset at their minimum

diff --git a/docs/tech.v3.dataset.tensor.html b/docs/tech.v3.dataset.tensor.html
index 409764a3..f2c54cb1 100644
--- a/docs/tech.v3.dataset.tensor.html
+++ b/docs/tech.v3.dataset.tensor.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.tensor

    Conversion mechanisms from dataset to tensor and back.

    + gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.tensor

    Conversion mechanisms from dataset to tensor and back.

    dataset->tensor

    (dataset->tensor dataset datatype)(dataset->tensor dataset)

    Convert a dataset to a tensor. Columns of the dataset will be converted to columns of the tensor. Default datatype is :float64.
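A sketch of the conversion (the dataset is an assumed example):

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.tensor :as ds-tensor])

(def d (ds/->dataset {:a [1 2] :b [3 4]}))

;; Dataset columns become tensor columns; the default datatype is :float64.
(ds-tensor/dataset->tensor d)

;; An explicit datatype may be passed instead.
(ds-tensor/dataset->tensor d :float32)
```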

    mean-center-columns!

    (mean-center-columns! tens {:keys [nan-strategy means], :or {nan-strategy :remove}})(mean-center-columns! tens)

In-place, nan-aware mean-center the rows of the tensor. If tensor is writeable then this

diff --git a/docs/tech.v3.dataset.zip.html b/docs/tech.v3.dataset.zip.html
index 1a679a71..184aa2cd 100644
--- a/docs/tech.v3.dataset.zip.html
+++ b/docs/tech.v3.dataset.zip.html
@@ -4,7 +4,7 @@
   function gtag(){dataLayer.push(arguments);}
   gtag('js', new Date());
 
-  gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.zip

Load zip data. Zip files with a single file entry can be loaded with ->dataset. When

+ gtag('config', 'G-95TVFC1FEB');

    tech.v3.dataset.zip

    Load zip data. Zip files with a single file entry can be loaded with ->dataset. When a zip file has multiple entries you have to call zipfile->dataset-seq.

    dataset-seq->zipfile!

    (dataset-seq->zipfile! output options ds-seq)(dataset-seq->zipfile! output ds-seq)

Write a sequence of datasets to zipfiles. You can control the inner file type with the :file-type option, which defaults to .tsv.

diff --git a/src/tech/v3/dataset/reductions.clj b/src/tech/v3/dataset/reductions.clj
index e218c6ba..19d10090 100644
--- a/src/tech/v3/dataset/reductions.clj
+++ b/src/tech/v3/dataset/reductions.clj
@@ -472,6 +472,10 @@ _unnamed [4 5]:
   * agg-map - map of result column name to reducer. All values in the agg map must be
     functions from dataset to hamf (non-parallel) reducers. Note that transducer-compatible
     rf's - such as kixi.mean, are valid hamf reducers.
+  * ds-seq - Either a single dataset or sequence of datasets.
+
+
+  See also [[group-by-column-agg-rf]].
 
   Options:
 
@@ -485,50 +489,47 @@ _unnamed [4 5]:
   Example:
 
   ```clojure
+
 user> (require '[tech.v3.dataset :as ds])
 nil
 user> (require '[tech.v3.dataset.reductions :as ds-reduce])
 nil
-user> (def stocks (ds/->dataset \"test/data/stocks.csv\" {:key-fn keyword}))
-#'user/stocks
+user> (def ds (ds/->dataset \"https://github.com/techascent/tech.ml.dataset/raw/master/test/data/stocks.csv\"
+                            {:key-fn keyword}))
+
+#'user/ds
 user> (ds-reduce/group-by-column-agg
        :symbol
-       {:symbol (ds-reduce/first-value :symbol)
-        :price-avg (ds-reduce/mean :price)
+       {:price-avg (ds-reduce/mean :price)
         :price-sum (ds-reduce/sum :price)}
-       [stocks stocks stocks])
-:symbol-aggregation [5 3]:
+       ds)
+_unnamed [5 3]:
 
 | :symbol |   :price-avg | :price-sum |
-|---------|--------------|------------|
-|    MSFT |  24.73674797 |    9127.86 |
-|     IBM |  91.26121951 |   33675.39 |
-|    AAPL |  64.73048780 |   23885.55 |
-|    GOOG | 415.87044118 |   84837.57 |
-|    AMZN |  47.98707317 |   17707.23 |
-
-
-
-tech.v3.dataset.reductions-test> (def tstds
-                                   (ds/->dataset {:a [\"a\" \"a\" \"a\" \"b\" \"b\" \"b\" \"c\" \"d\" \"e\"]
-                                                  :b [22   21  22 44  42  44   77 88 99]}))
-#'tech.v3.dataset.reductions-test/tstds
-tech.v3.dataset.reductions-test> (ds-reduce/group-by-column-agg
-                                   [:a :b] {:a (ds-reduce/first-value :a)
-                                            :b (ds-reduce/first-value :b)
-                                            :c (ds-reduce/row-count)}
-                                   [tstds tstds tstds])
-:tech.v3.dataset.reductions/_temp_col-aggregation [7 3]:
+|---------|-------------:|-----------:|
+|    MSFT |  24.73674797 |    3042.62 |
+|    AAPL |  64.73048780 |    7961.85 |
+|     IBM |  91.26121951 |   11225.13 |
+|    AMZN |  47.98707317 |    5902.41 |
+|    GOOG | 415.87044118 |   28279.19 |
+
+user> (def testds (ds/->dataset {:a [\"a\" \"a\" \"a\" \"b\" \"b\" \"b\" \"c\" \"d\" \"e\"]
+                                 :b [22   21  22 44  42  44   77 88 99]}))
+#'user/testds
+user> (ds-reduce/group-by-column-agg
+       [:a :b] {:c (ds-reduce/row-count)}
+       testds)
+_unnamed [7 3]:
 
 | :a | :b | :c |
 |----|---:|---:|
-|  a | 21 |  3 |
-|  a | 22 |  6 |
-|  b | 42 |  3 |
-|  b | 44 |  6 |
-|  c | 77 |  3 |
-|  d | 88 |  3 |
-|  e | 99 |  3 |
+|  e | 99 |  1 |
+|  a | 21 |  1 |
+|  c | 77 |  1 |
+|  d | 88 |  1 |
+|  b | 44 |  2 |
+|  b | 42 |  1 |
+|  a | 22 |  2 |
 ```"
  ([colname agg-map options ds-seq]
   (hamf-rf/reduce-reducer (group-by-column-agg-rf colname agg-map options)