Skip to content

Summary of functions

genmeblog edited this page May 2, 2020 · 16 revisions

Functionality of tech.ml.dataset version 2.0-beta33

Based on article: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/#addupdatedelete-columns

Full source code with results and R (through clojisr): https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj

Some helper functions are created to perform certain operations, they are placed at the beginning of the code:

fn name desctiption
aggregate aggregate dataset and add result to the given (or empty) map
aggregate->dataset convert result of aggregate to a dataset
group-by-columns-or-fn-and-aggregate group dataset by column(s) or fn and aggregate, returns dataset
sort-by-columns-with-orders sort-by columns with given order (:asc or :desc)
filter-by-external-values->indices filter sequence and return selected indices
apply-to-columns apply function to all columns, returns dataset
map-v apply fn to values of map, returns map
my-min, my-max dfn versions don't work on string type

All functions are not optimized and should be rewritten to use tech.ml.dataset internal functions. Issues are filled already.

Namespaces used

Alias Namespace
ds tech.ml.dataset
col tech.ml.dataset.column
dfn tech.v2.datatype.functional
dtype tech.v2.datatype
m fastmath.core
str clojure.string

Dataset

Dataset used in all snippets

(def DS (ds/name-values-seq->dataset {:V1 (take 9 (cycle [1 2]))
                                      :V2 (range 1 10)
                                      :V3 (take 9 (cycle [0.5 1.0 1.5]))
                                      :V4 (take 9 (cycle [\A \B \C]))}))

(class DS)
;; => tech.ml.dataset.impl.dataset.Dataset
DS
;; => _unnamed [9 4]:
;;    | :V1 | :V2 |    :V3 | :V4 |
;;    |-----+-----+--------+-----|
;;    |   1 |   1 | 0.5000 |   A |
;;    |   2 |   2 |  1.000 |   B |
;;    |   1 |   3 |  1.500 |   C |
;;    |   2 |   4 | 0.5000 |   A |
;;    |   1 |   5 |  1.000 |   B |
;;    |   2 |   6 |  1.500 |   C |
;;    |   1 |   7 | 0.5000 |   A |
;;    |   2 |   8 |  1.000 |   B |
;;    |   1 |   9 |  1.500 |   C |

Basic operations

Filter rows

Filter rows using indices

(ds/select-rows DS [2 3])

Discard rows using indices (also remove-rows)

(ds/drop-rows DS (range 2 7)) 

Filter rows using a logical expression (single column)

(ds/filter-column #(> ^long % 5) :V2 DS)
(ds/filter-column #{\A \C} :V4 DS)

Filter rows using multiple conditions

(ds/filter #(and (= (:V1 %) 1) (= (:V4 %) \A)) DS)

Filter unique rows

(ds/unique-by identity {:column-name-seq (ds/column-names DS)} DS)
(ds/unique-by identity {:column-name-seq [:V1 :V4]} DS)

Discard rows with missing values (missing works on whole dataset, to select columns first create dataset with selected columns)

(ds/drop-rows DS (ds/missing DS))

3 random rows

(ds/sample 3 DS)

N/2 random rows

(ds/sample (/ (ds/row-count DS) 2) DS)

Top N entries (rank calculation is defined in fastmath library)

(->> (m/rank (map - (DS :V1)) :dense)
     (filter-by-external-values->indices #(< ^long % 1))
     (ds/select-rows DS))

Select by regex

(ds/filter #(re-matches #"^B" (str (:V4 %))) DS)

Selection by range (between? defined in fastmath)

(ds/filter (comp (partial m/between? 3 5) :V2) DS)

Range selection (pure clj)

(ds/filter (comp #(< 3 % 5) :V2) DS)

Sort rows

Sort rows by column

(ds/sort-by-column :V3 DS)

Sort rows by column using indices (order is defined in fastmath)

(ds/select-rows DS (m/order (DS :V3)))

Sort rows in decreasing order

(ds/sort-by-column :V3 (comp - compare) DS)
(ds/select-rows DS (m/order (DS :V3) true))

Sort rows based on several columns

(sort-by-columns-with-orders [:V1 :V2] [:asc :desc] DS)

Select columns

Select one column using an index (DS is seqable)

(nth (seq DS) 2)
(seq (nth (seq DS) 2))

Select one column using column name

(ds/select-columns DS [:V2]) ;; returns dataset
(ds/select DS [:V2] :all)
(DS :V2) ;; returns column type
(seq (DS :V2)) ;; returns sequence
(dtype/->reader (DS :V2)) ;; datatype reader (lazy sequence generator)
(ds/column DS :V2) ;; returns column

Select several columns

(ds/select-columns DS [:V2 :V3 :V4])
(ds/select DS [:V2 :V3 :V4] :all)

Exclude columns

(ds/drop-columns DS [:V2 :V3])

Reorder columns

(->> (distinct (conj (ds/column-names DS) :V4))
     (ds/select-columns DS))

Filter column names

(->> (ds/column-names DS)
     (filter #(str/starts-with? (name %) "V"))
     (ds/select-columns DS))
(->> (ds/column-names DS)
     (filter #(str/ends-with? (name %) "3"))
     (ds/select-columns DS))
(->> (ds/column-names DS)
     (filter #(re-matches #".2" (name %)))
     (ds/select-columns DS))
(->> (ds/column-names DS)
     (filter #{:V1 :X})
     (ds/select-columns DS))

Summarise data

Summarise one column

(dfn/sum (DS :V1)) ;; value
(reduce + (DS :V1)) ;; value
(aggregate->dataset [#(dfn/sum (% :V1))] DS) ;; dataset
(aggregate->dataset {:sumV1 #(dfn/sum (% :V1))} DS) ;; dataset

Summarise several columns

(aggregate->dataset [#(dfn/sum (% :V1))
                     #(dfn/standard-deviation (% :V3))] DS)

Summarise several columns and assign column names

(aggregate->dataset {:sumv1 #(dfn/sum (% :V1))
                     :sdv3 #(dfn/standard-deviation (% :V3))} DS)

Summarise a subset of rows

(-> DS
    (ds/select-rows (range 4))
    (->> (aggregate->dataset [#(dfn/sum (% :V1))])))

First/last/nth element from column

(first (DS :V3))
(last (DS :V3))
(nth (DS :V3) 5)

Number of distinct rows

(count (col/unique (DS :V3)))
(ds/row-count (ds/unique-by identity DS))

Add/update/delete columns

Modify a column

(ds/update-column DS :V1 #(map (fn [v] (m/pow v 2)) %))

Modify a column using reader

(ds/update-column DS :V1 #(-> (dtype/->reader % :float64)
                              (dfn/pow 2)))

Add one column

(ds/add-or-update-column DS :v5 (dfn/log (DS :V1)))

Add several columns

(-> DS
    (ds/add-or-update-column :v6 (dfn/sqrt (DS :V1)))
    (ds/add-or-update-column :v7 (take (ds/row-count DS) (repeat "X"))))

Create one column and remove the others

(ds/new-dataset [(col/new-column :v8 (dfn/+ (DS :V3) 1))])

Remove one column

(ds/remove-column DS :v5)

Remove several columns

(ds/drop-columns DS [:v6 :v7])
(ds/drop-columns DS cols)

Replace values for rows matching a condition

(ds/update-column DS :V2 #(map (fn [^long v]
                                   (if (< v 4) 0 v)) %))

by

By group

(group-by-columns-or-fn-and-aggregate [:V4] {:sumV2 #(dfn/sum (% :V2))} DS)

By several groups

(->> (group-by-columns-or-fn-and-aggregate [:V4 :V1] {:sumV2 #(dfn/sum (% :V2))} DS)
     (sort-by-columns-with-orders [:V4 :V1]))

Calling function in by

(->> (ds/update-column DS :V4 #(map str/lower-case %))
     (group-by-columns-or-fn-and-aggregate [:V4] {:sumV1 #(dfn/sum (% :V1))})
     (ds/sort-by-column :V4))

Assigning column name in by

(-> (ds/update-column DS :V4 #(map str/lower-case %))
    (ds/rename-columns {:V4 :abc})
    (->> (group-by-columns-or-fn-and-aggregate [:abc] {:sumV1 #(dfn/sum (% :V1))})
         (ds/sort-by-column :abc)))

Using a condition in by

(group-by-columns-or-fn-and-aggregate #(= (% :V4) \A)
                                      {:sumV1 #(dfn/sum (% :V1))}
                                      DS)

By on a subset of rows

(->> (ds/select-rows DS (range 5))
     (group-by-columns-or-fn-and-aggregate [:V4] {:sumV1 #(dfn/sum (% :V1))})
     (ds/sort-by-column :V4))

Count number of observations for each group

(group-by-columns-or-fn-and-aggregate [:V4] {:N ds/row-count} DS)
(map-v ds/row-count (ds/group-by-column :V4 DS))
(->> (vals (ds/group-by-column :V4 DS))
     (map ds/row-count))

Add a column with number of observations for each group

(->> (ds/group-by identity [:V1] DS)
     (vals)
     (map (fn [ds]
            (let [rcnt (ds/row-count ds)]
              (ds/new-column ds :n (repeat rcnt rcnt)))))
     (apply ds/concat)
     (sort-by-columns-with-orders [:V2 :V4]))

Retrieve the first/last/nth observation for each group

(->> (group-by-columns-or-fn-and-aggregate [:V4] {:v #(first (% :V2))} DS)
     (ds/sort-by-column :V4))
(->> (group-by-columns-or-fn-and-aggregate [:V4] {:v #(last (% :V2))} DS)
     (ds/sort-by-column :V4))
(->> (group-by-columns-or-fn-and-aggregate [:V4] {:v #(nth (% :V2) 1)} DS)
     (ds/sort-by-column :V4))

Going further

Advanced columns manipulation

Summarise all the columns

(apply-to-columns my-max :all DS)

Summarise several columns

(apply-to-columns dfn/mean [:V1 :V2] DS)

Summarise several columns by group

(->> DS
     (group-by-columns-or-fn-and-aggregate [:V4] {:V1 #(dfn/mean (% :V1))
                                                  :V2 #(dfn/mean (% :V2))})
     (ds/sort-by-column :V4))

Summarise with more than one function by group

(->> DS
     (group-by-columns-or-fn-and-aggregate [:V4] {:V1-mean #(dfn/mean (% :V1))
                                                  :V2-mean #(dfn/mean (% :V2))
                                                  :V1-sum #(dfn/sum (% :V1))
                                                  :V2-sum #(dfn/sum (% :V2))})
     (ds/sort-by-column :V4))

Summarise using a condition

(def cols (->> DS
               (ds/columns)
               (map meta)
               (filter (comp #{:float64} :datatype))
               (map :name)))

(apply-to-columns dfn/mean cols DS)

Modify all the columns

(ds/update-columns DS (ds/column-names DS) reverse)

Modify several columns (dropping the others)

(->> (ds/select-columns DS [:V1 :V2])
     (apply-to-columns dfn/sqrt :all))
(->> (ds/drop-columns DS [:V4])
     (apply-to-columns dfn/exp :all))

Modify several columns (keeping the others)

(def DS (apply-to-columns dfn/sqrt [:V1 :V2] DS))
(def DS (apply-to-columns #(dfn/pow % 2.0) [:V1 :V2] DS))

Modify columns using a condition (dropping the others)

(def cols (->> DS
               (ds/columns)
               (map meta)
               (filter (comp #(= :float64 %) :datatype))
               (map :name)))

(->> (ds/select-columns DS cols)
     (apply-to-columns #(dfn/- % 1) cols))

Modify columns using a condition (keeping the others)

(def DS (apply-to-columns #(dtype/->reader % :int64) [:V1 :V2] DS))

Use a complex expression

(->> (ds/group-by-column :V4 DS)
     (vals)
     (map #(ds/head 2 %))
     (map #(ds/add-or-update-column % :V2 (repeat (ds/row-count %) "X")))
     (apply ds/concat))

Use multiple expressions (with DT[,{j}])

(let [x (dfn/+ (DS :V1) (dfn/sum (DS :V2)))]
  (println (DS :V1))
  (println (dfn/descriptive-stats (DS :V2)))
  (ds/name-values-seq->dataset {:A (map inc (range (ds/row-count DS)))
                                :B x}))