Parquet write simplification (#228)
* Major refactor/simplification for writing parquet files.

* new parquet io scheme now passes tests.

* Updating changelog.

* Updating changelog.
cnuernber authored Apr 8, 2021
1 parent 7a9e432 commit a6e087d
Showing 7 changed files with 217 additions and 251 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -23,4 +23,5 @@ graalvm*
resources
graal-test
tc
-__pycache__
+__pycache__
+*.crc
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,13 @@
# Changelog

## 5.18
* Graal-native friendly mmap pathways (no requiring resolve; you have to explicitly set the implementation in your main.clj file).
* Parquet write pathway updated to be more standard and more likely to work with future versions of parquet. This means, however, that there is
no longer a direct correspondence between the number of datasets and the number of record batches in a parquet file, as the standard pathway
writes out a record batch whenever a memory constraint is triggered. So if you save a single dataset you may get back a parquet file that contains
a sequence of datasets. There are many parquet options; see the documentation for
[ds-seq->parquet](https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html#var-ds-seq-.3Eparquet).

## 5.17
* [Issue 225](https://github.com/techascent/tech.ml.dataset/issues/224) - column/row selection should return empty datasets when no columns are selected.
* nil headers now print fine - thanks to DavidVujic.
13 changes: 13 additions & 0 deletions dev-resources/logback.xml
@@ -0,0 +1,13 @@
<configuration debug="false">
<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<!-- encoders are assigned the type
ch.qos.logback.classic.encoder.PatternLayoutEncoder by default -->
<encoder>
<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>

<root level="info">
<appender-ref ref="STDOUT" />
</root>
</configuration>
25 changes: 0 additions & 25 deletions java/org/apache/parquet/hadoop/FilePageWriteStore.java

This file was deleted.

41 changes: 41 additions & 0 deletions java/tech/v3/dataset/ParquetRowWriter.java
@@ -0,0 +1,41 @@
package tech.v3.dataset;

import org.apache.parquet.schema.MessageType;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.hadoop.conf.Configuration;
import clojure.lang.IFn;
import java.util.Map;

public class ParquetRowWriter extends WriteSupport<Long>
{
public final IFn rowWriter;
public final MessageType schema;
public final Map<String,String> metadata;
public RecordConsumer consumer;
public Object dataset;
public ParquetRowWriter(IFn _writer, MessageType _schema, Map<String,String> _meta) {
rowWriter = _writer;
schema = _schema;
metadata = _meta;
consumer = null;
//Outside forces must set dataset
dataset = null;
}

@Override
public WriteContext init(Configuration configuration) {
return new WriteContext( schema, metadata );
}

@Override
public void prepareForWrite(RecordConsumer recordConsumer) {
consumer = recordConsumer;
}

@Override
public void write(Long record) {
rowWriter.invoke(dataset,record,consumer);
}

}
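
For readers unfamiliar with the pattern in `ParquetRowWriter`: the record type is a row index (`Long`), not a row object, and the supplied function pulls the values for that row out of the dataset and pushes them into the consumer. Below is a minimal, dependency-free sketch of that control flow. It assumes nothing about the actual Clojure-side row writer; `Consumer`, `RowFn`, `RowWriter`, and `demo` are hypothetical stand-ins for `RecordConsumer`, `IFn`, the class above, and the code that drives it, and none of them are Parquet or Clojure APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mirrors the shape of ParquetRowWriter without Parquet or Clojure.
class RowIndexWriterSketch {
    // Hypothetical stand-in for org.apache.parquet.io.api.RecordConsumer.
    interface Consumer {
        void addValue(String column, Object value);
    }

    // Hypothetical stand-in for the three-argument IFn.invoke(dataset, row, consumer).
    interface RowFn {
        void invoke(Object dataset, long row, Consumer consumer);
    }

    // Same fields and call order as ParquetRowWriter: the consumer is handed in by the
    // framework, the dataset is set from outside, and write receives only a row index.
    static class RowWriter {
        final RowFn rowWriter;
        Consumer consumer;
        Object dataset;

        RowWriter(RowFn fn) {
            rowWriter = fn;
        }

        void prepareForWrite(Consumer c) {
            consumer = c;
        }

        void write(Long record) {
            rowWriter.invoke(dataset, record, consumer);
        }
    }

    // Drives the writer over a three-row "dataset" and records what the consumer saw.
    static List<String> demo() {
        List<String> out = new ArrayList<>();
        String[] names = {"a", "b", "c"};
        RowWriter writer = new RowWriter(
            (ds, row, c) -> c.addValue("name", ((String[]) ds)[(int) row]));
        writer.prepareForWrite((col, v) -> out.add(col + "=" + v));
        writer.dataset = names;
        for (long i = 0; i < names.length; i++) {
            writer.write(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Passing indices keeps the writer stateless with respect to row contents: swapping the `dataset` field is enough to write the next dataset through the same writer, which appears to be why the constructor leaves it `null`.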
4 changes: 3 additions & 1 deletion project.clj
@@ -5,7 +5,7 @@
:url "http://www.eclipse.org/legal/epl-v10.html"}
:dependencies [[org.clojure/clojure "1.10.2" :scope "provided"]
[camel-snake-kebab "0.4.2"]
-[cnuernber/dtype-next "7.04"]
+[cnuernber/dtype-next "7.07"]
[techascent/tech.io "4.04"
:exclusions [org.apache.commons/commons-compress]]
[com.univocity/univocity-parsers "2.9.0"]
@@ -66,6 +66,8 @@
{:dependencies [[criterium "0.4.5"]
[http-kit "2.3.0"]
[com.clojure-goes-fast/clj-memory-meter "0.1.0"]]
:source-paths ["src"]
:resource-paths ["dev-resources"]
:test-paths ["test" "neanderthal"]}
:codox
{:dependencies [[codox-theme-rdash "0.1.2"]