mbtaylor/jarrow

JARROW

Overview

Jarrow is a lightweight Java implementation for I/O of data stored in formats related to Apache Arrow. Currently it only supports the Arrow-related Feather format, but in future it may grow support for the Arrow IPC File format or other evolutions of Feather. Or it may not.

Comparison with the Apache Java Arrow Implementation

Why write this when there's already a Java implementation of Feather I/O provided by Apache? I wanted something without all those dependencies, and for which I had full control over the data access. I'm using it to provide Feather table I/O handlers in STIL/TOPCAT.

This library probably does less clever stuff than the Apache one but it's much more compact and has no external dependencies.

Building

If you want the library, the best thing is just to pick up the pre-built jarrow.jar file from the release.

However, if you want to build it from source, there's a makefile. It may need editing, since some targets contain references to directories you don't have. But basically, to build the library you just need to run javac on all the Java files.

The source code is Java 1.6 compatible, and the distributed jarrow.jar file contains Java 1.6-compatible classes.

Implementation Status

Only Feather files are currently supported. All Feather files can be read, but the following column types are not currently fully supported on input:

  • CATEGORY: I haven't come across any Feather files with category column types, and it's not clear to me how to interpret the Feather format documentation for this type, so it's not supported.
  • UINT64: There's no Java primitive or primitive-wrapper type that can represent unsigned 64-bit integers, so it's not supported.
  • TIMESTAMP, DATE, TIME: These values can be read, but the type-specific metadata/unit information is not currently available.
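To illustrate the UINT64 limitation noted above, here is a minimal sketch, not taken from the Jarrow source, of what happens when 64 unsigned bits are read into a Java long. The class name Uint64Demo and the helper toUnsigned are hypothetical, and the BigInteger workaround is only one possible approach, not something the library does.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class; not part of Jarrow.
public class Uint64Demo {

    /** Reinterprets the 64 bits of a signed long as an unsigned value. */
    static BigInteger toUnsigned(long bits) {
        BigInteger value = BigInteger.valueOf(bits);
        // Negative longs correspond to unsigned values of 2^63 and above,
        // so adding 2^64 recovers the unsigned interpretation.
        return bits >= 0 ? value : value.add(BigInteger.ONE.shiftLeft(64));
    }

    public static void main(String[] args) {
        // Eight 0xFF bytes: the largest unsigned 64-bit value.
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(0, -1L);
        long raw = buf.getLong(0);
        System.out.println(raw);             // signed reading: -1
        System.out.println(toUnsigned(raw)); // unsigned reading: 18446744073709551615
    }
}
```

A long holds the bits without loss, but any value of 2^63 or more reads back as negative, which is why no primitive or wrapper type is a faithful fit for this column type.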

The reading is implemented using memory mapping (MappedByteBuffers).
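For readers unfamiliar with memory mapping, the following is a general sketch of the technique using only standard java.nio calls. It is not Jarrow's actual reader code: the class MappedReadDemo and its methods are hypothetical, the file it reads is a scratch file of three integers rather than a Feather file, and the little-endian byte order is an assumption made for the example.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical demo class; not part of Jarrow.
public class MappedReadDemo {

    /** Maps the whole file read-only and sets little-endian byte order. */
    static MappedByteBuffer mapFile(File file) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(file, "r");
        try {
            FileChannel chan = raf.getChannel();
            MappedByteBuffer buf =
                chan.map(FileChannel.MapMode.READ_ONLY, 0, chan.size());
            buf.order(ByteOrder.LITTLE_ENDIAN); // assumed byte order
            return buf;
        } finally {
            raf.close(); // the mapping stays valid after the channel is closed
        }
    }

    /** Writes the given ints to a temp file as little-endian 32-bit values. */
    static File writeInts(int[] values) throws IOException {
        File file = File.createTempFile("mapped-demo", ".bin");
        file.deleteOnExit();
        ByteBuffer bytes =
            ByteBuffer.allocate(4 * values.length).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : values) {
            bytes.putInt(v);
        }
        FileOutputStream out = new FileOutputStream(file);
        try {
            out.write(bytes.array());
        } finally {
            out.close();
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        File file = writeInts(new int[] {10, 20, 30});
        MappedByteBuffer buf = mapFile(file);
        // Absolute get: random access by byte offset, with no explicit read() calls.
        System.out.println(buf.getInt(4 * 2)); // third value: 30
    }
}
```

The appeal of this approach for column-oriented data is that any cell can be fetched by computing its byte offset, leaving the operating system to page in only the parts of the file that are actually touched.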

The LARGE_UTF8 and LARGE_BINARY types, which are defined in the Arrow version of the flatbuffers metadata file but not in the Feather version, are supported.

Implementation notes

The flatbuffers Java source files are generated by running the flatc compiler from Google Flatbuffers version 1.11.0 on the Arrow version of feather.fbs. I subsequently moved the generated source files into a different Java package to avoid possible namespace clashes with external code that may use a different version of flatbuffers.

Usage

Comprehensive documentation is provided in the javadocs.

The classes in the package uk.ac.bristol.star.feather form the usable parts of the I/O library. The classes in the uk.ac.bristol.star.fbs.* packages are flatbuffer support files that you shouldn't need to use. To read a table, use the FeatherTable.fromFile(File) method; there are examples in FeatherTable.main.

To write a table, use FeatherWriter.write(OutputStream); this requires you to implement some FeatherColumnWriter objects in some way appropriate to the data structures in which your table data resides; there are examples in FeatherWriter.main.

Support and future development

I don't know whether anybody else will want to use this package. If you do, and if you are interested in features that are not currently present, please contact me (@mbtaylor).

Licence

This library includes Google flatbuffers code, which is licenced under the Apache 2.0 licence. I'm prepared to offer any licence to the original parts of this project that suits you and that's legally possible. For now, I assert that it's licenced under the LGPL. Unless somebody tells me I'm not allowed to do that.

History

  • Version 1.0 (27 Feb 2020): Initial release