-
Notifications
You must be signed in to change notification settings - Fork 0
Summary of functions
Based on article: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/#addupdatedelete-columns
Full source code with results and R (through clojisr
): https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj
Some helper functions are created to perform certain operations, they are placed at the beginning of the code:
fn name | desctiption |
---|---|
aggregate |
aggregate dataset and add result to the given (or empty) map |
aggregate->dataset |
convert result of aggregate to a dataset |
group-by-columns-or-fn-and-aggregate |
group dataset by column(s) or fn and aggregate, returns dataset |
sort-by-columns-with-orders |
sort-by columns with given order (:asc or :desc ) |
filter-by-external-values->indices |
filter sequence and return selected indices |
apply-to-columns |
apply function to all columns, returns dataset |
map-v |
apply fn to values of map, returns map |
my-min , my-max
|
dfn versions don't work on string type |
All functions are not optimized and should be rewritten to use tech.ml.dataset internal functions. Issues are filled already.
Namespaces used
Alias | Namespace |
---|---|
ds |
tech.ml.dataset |
col |
tech.ml.dataset.column |
dfn |
tech.v2.datatype.functional |
dtype |
tech.v2.datatype |
m |
fastmath.core |
str |
clojure.string |
Dataset used in all snippets
(def DS (ds/name-values-seq->dataset {:V1 (take 9 (cycle [1 2]))
:V2 (range 1 10)
:V3 (take 9 (cycle [0.5 1.0 1.5]))
:V4 (take 9 (cycle [\A \B \C]))}))
(class DS)
;; => tech.ml.dataset.impl.dataset.Dataset
DS
;; => _unnamed [9 4]:
;; | :V1 | :V2 | :V3 | :V4 |
;; |-----+-----+--------+-----|
;; | 1 | 1 | 0.5000 | A |
;; | 2 | 2 | 1.000 | B |
;; | 1 | 3 | 1.500 | C |
;; | 2 | 4 | 0.5000 | A |
;; | 1 | 5 | 1.000 | B |
;; | 2 | 6 | 1.500 | C |
;; | 1 | 7 | 0.5000 | A |
;; | 2 | 8 | 1.000 | B |
;; | 1 | 9 | 1.500 | C |
Filter rows using indices
(ds/select-rows DS [2 3])
Discard rows using indices (also remove-rows
)
(ds/drop-rows DS (range 2 7))
Filter rows using a logical expression (single column)
(ds/filter-column #(> ^long % 5) :V2 DS)
(ds/filter-column #{\A \C} :V4 DS)
Filter rows using multiple conditions
(ds/filter #(and (= (:V1 %) 1) (= (:V4 %) \A)) DS)
Filter unique rows
(ds/unique-by identity {:column-name-seq (ds/column-names DS)} DS)
(ds/unique-by identity {:column-name-seq [:V1 :V4]} DS)
Discard rows with missing values (missing
works on whole dataset, to select columns first create dataset with selected columns)
(ds/drop-rows DS (ds/missing DS))
3 random rows
(ds/sample 3 DS)
N/2 random rows
(ds/sample (/ (ds/row-count DS) 2) DS)
Top N entries (rank
calculation is defined in fastmath
library)
(->> (m/rank (map - (DS :V1)) :dense)
(filter-by-external-values->indices #(< ^long % 1))
(ds/select-rows DS))
Select by regex
(ds/filter #(re-matches #"^B" (str (:V4 %))) DS)
Selection by range (between?
defined in fastmath
)
(ds/filter (comp (partial m/between? 3 5) :V2) DS)
Range selection (pure clj)
(ds/filter (comp #(< 3 % 5) :V2) DS)
Sort rows by column
(ds/sort-by-column :V3 DS)
Sort rows by column using indices (order
is defined in fastmath
)
(ds/select-rows DS (m/order (DS :V3)))
Sort rows in decreasing order
(ds/sort-by-column :V3 (comp - compare) DS)
(ds/select-rows DS (m/order (DS :V3) true))
Sort rows based on several columns
(sort-by-columns-with-orders [:V1 :V2] [:asc :desc] DS)
Select one column using an index (DS is seqable
)
(nth (seq DS) 2)
(seq (nth (seq DS) 2))
Select one column using column name
(ds/select-columns DS [:V2]) ;; returns dataset
(ds/select DS [:V2] :all)
(DS :V2) ;; returns column type
(seq (DS :V2)) ;; returns sequence
(dtype/->reader (DS :V2)) ;; datatype reader (lazy sequence generator)
(ds/column DS :V2) ;; returns column
Select several columns
(ds/select-columns DS [:V2 :V3 :V4])
(ds/select DS [:V2 :V3 :V4] :all)
Exclude columns
(ds/drop-columns DS [:V2 :V3])
Reorder columns
(->> (distinct (conj (ds/column-names DS) :V4))
(ds/select-columns DS))
Filter column names
(->> (ds/column-names DS)
(filter #(str/starts-with? (name %) "V"))
(ds/select-columns DS))
(->> (ds/column-names DS)
(filter #(str/ends-with? (name %) "3"))
(ds/select-columns DS))
(->> (ds/column-names DS)
(filter #(re-matches #".2" (name %)))
(ds/select-columns DS))
(->> (ds/column-names DS)
(filter #{:V1 :X})
(ds/select-columns DS))
Summarise one column
(dfn/sum (DS :V1)) ;; value
(reduce + (DS :V1)) ;; value
(aggregate->dataset [#(dfn/sum (% :V1))] DS) ;; dataset
(aggregate->dataset {:sumV1 #(dfn/sum (% :V1))} DS) ;; dataset
Summarise several columns
(aggregate->dataset [#(dfn/sum (% :V1))
#(dfn/standard-deviation (% :V3))] DS)
Summarise several columns and assign column names
(aggregate->dataset {:sumv1 #(dfn/sum (% :V1))
:sdv3 #(dfn/standard-deviation (% :V3))} DS)
Summarise a subset of rows
(-> DS
(ds/select-rows (range 4))
(->> (aggregate->dataset [#(dfn/sum (% :V1))])))
First/last/nth element from column
(first (DS :V3))
(last (DS :V3))
(nth (DS :V3) 5)
Number of distinct rows
(count (col/unique (DS :V3)))
(ds/row-count (ds/unique-by identity DS))
Modify a column
(ds/update-column DS :V1 #(map (fn [v] (m/pow v 2)) %))
Modify a column using reader
(ds/update-column DS :V1 #(-> (dtype/->reader % :float64)
(dfn/pow 2)))
Add one column
(ds/add-or-update-column DS :v5 (dfn/log (DS :V1)))
Add several columns
(-> DS
(ds/add-or-update-column :v6 (dfn/sqrt (DS :V1)))
(ds/add-or-update-column :v7 (take (ds/row-count DS) (repeat "X"))))
Create one column and remove the others
(ds/new-dataset [(col/new-column :v8 (dfn/+ (DS :V3) 1))])
Remove one column
(ds/remove-column DS :v5)
Remove several columns
(ds/drop-columns DS [:v6 :v7])
(ds/drop-columns DS cols)
Replace values for rows matching a condition
(ds/update-column DS :V2 #(map (fn [^long v]
(if (< v 4) 0 v)) %))
By group
(group-by-columns-or-fn-and-aggregate [:V4] {:sumV2 #(dfn/sum (% :V2))} DS)
By several groups
(->> (group-by-columns-or-fn-and-aggregate [:V4 :V1] {:sumV2 #(dfn/sum (% :V2))} DS)
(sort-by-columns-with-orders [:V4 :V1]))
Calling function in by
(->> (ds/update-column DS :V4 #(map str/lower-case %))
(group-by-columns-or-fn-and-aggregate [:V4] {:sumV1 #(dfn/sum (% :V1))})
(ds/sort-by-column :V4))
Assigning column name in by
(-> (ds/update-column DS :V4 #(map str/lower-case %))
(ds/rename-columns {:V4 :abc})
(->> (group-by-columns-or-fn-and-aggregate [:abc] {:sumV1 #(dfn/sum (% :V1))})
(ds/sort-by-column :abc)))
Using a condition in by
(group-by-columns-or-fn-and-aggregate #(= (% :V4) \A)
{:sumV1 #(dfn/sum (% :V1))}
DS)
By on a subset of rows
(->> (ds/select-rows DS (range 5))
(group-by-columns-or-fn-and-aggregate [:V4] {:sumV1 #(dfn/sum (% :V1))})
(ds/sort-by-column :V4))
Count number of observations for each group
(group-by-columns-or-fn-and-aggregate [:V4] {:N ds/row-count} DS)
(map-v ds/row-count (ds/group-by-column :V4 DS))
(->> (vals (ds/group-by-column :V4 DS))
(map ds/row-count))
Add a column with number of observations for each group
(->> (ds/group-by identity [:V1] DS)
(vals)
(map (fn [ds]
(let [rcnt (ds/row-count ds)]
(ds/new-column ds :n (repeat rcnt rcnt)))))
(apply ds/concat)
(sort-by-columns-with-orders [:V2 :V4]))
Retrieve the first/last/nth observation for each group
(->> (group-by-columns-or-fn-and-aggregate [:V4] {:v #(first (% :V2))} DS)
(ds/sort-by-column :V4))
(->> (group-by-columns-or-fn-and-aggregate [:V4] {:v #(last (% :V2))} DS)
(ds/sort-by-column :V4))
(->> (group-by-columns-or-fn-and-aggregate [:V4] {:v #(nth (% :V2) 1)} DS)
(ds/sort-by-column :V4))
Summarise all the columns
(apply-to-columns my-max :all DS)
Summarise several columns
(apply-to-columns dfn/mean [:V1 :V2] DS)
Summarise several columns by group
(->> DS
(group-by-columns-or-fn-and-aggregate [:V4] {:V1 #(dfn/mean (% :V1))
:V2 #(dfn/mean (% :V2))})
(ds/sort-by-column :V4))
Summarise with more than one function by group
(->> DS
(group-by-columns-or-fn-and-aggregate [:V4] {:V1-mean #(dfn/mean (% :V1))
:V2-mean #(dfn/mean (% :V2))
:V1-sum #(dfn/sum (% :V1))
:V2-sum #(dfn/sum (% :V2))})
(ds/sort-by-column :V4))
Summarise using a condition
(def cols (->> DS
(ds/columns)
(map meta)
(filter (comp #{:float64} :datatype))
(map :name)))
(apply-to-columns dfn/mean cols DS)
Modify all the columns
(ds/update-columns DS (ds/column-names DS) reverse)
Modify several columns (dropping the others)
(->> (ds/select-columns DS [:V1 :V2])
(apply-to-columns dfn/sqrt :all))
(->> (ds/drop-columns DS [:V4])
(apply-to-columns dfn/exp :all))
Modify several columns (keeping the others)
(def DS (apply-to-columns dfn/sqrt [:V1 :V2] DS))
(def DS (apply-to-columns #(dfn/pow % 2.0) [:V1 :V2] DS))
Modify columns using a condition (dropping the others)
(def cols (->> DS
(ds/columns)
(map meta)
(filter (comp #(= :float64 %) :datatype))
(map :name)))
(->> (ds/select-columns DS cols)
(apply-to-columns #(dfn/- % 1) cols))
Modify columns using a condition (keeping the others)
(def DS (apply-to-columns #(dtype/->reader % :int64) [:V1 :V2] DS))
Use a complex expression
(->> (ds/group-by-column :V4 DS)
(vals)
(map #(ds/head 2 %))
(map #(ds/add-or-update-column % :V2 (repeat (ds/row-count %) "X")))
(apply ds/concat))
Use multiple expressions (with DT[,{j}])
(let [x (dfn/+ (DS :V1) (dfn/sum (DS :V2)))]
(println (DS :V1))
(println (dfn/descriptive-stats (DS :V2)))
(ds/name-values-seq->dataset {:A (map inc (range (ds/row-count DS)))
:B x}))