LZ4 Streaming Format
Notices
Copyright (c) 2013 Yann Collet
Permission is granted to copy and distribute this document for any purpose and
without charge, including translations into other languages and incorporation
into compilations, provided that the copyright notice and this notice are
preserved, and that any substantive changes or deletions from the original are
clearly marked.
Version
1.4
Introduction
The purpose of this document is to define a lossless compressed data format,
that is independent of CPU type, operating system, file system and character set,
suitable for file compression, pipe and streaming compression using the LZ4
algorithm : http://code.google.com/p/lz4/
The data can be produced or consumed, even for an arbitrarily long sequentially
presented input data stream, using only an a priori bounded amount of
intermediate storage, and hence can be used in data communications. The
format uses the LZ4 compression method, and the xxHash-32 checksum method, for
detection of data corruption.
The data format defined by this specification does not attempt to allow random
access to compressed data.
This specification is intended for use by implementers of software to compress
data into LZ4 format and/or decompress data from LZ4 format. The text of the
specification assumes a basic background in programming at the level of bits and
other primitive data representations.
Unless otherwise indicated below, a compliant compressor must produce data
sets that conform to all the specifications presented here.
A compliant decompressor must be able to accept and decompress at least one
data set that conforms to the specifications presented here; whenever it does not
support any parameter, it must produce a non-ambiguous error code and
associated error message explaining which parameter value is unsupported (a
typical example being an unsupported buffer size).
Distribution of this document is unlimited.
Summary :
Introduction
Description of an LZ4 stream
Stream Descriptor
Data Blocks
Skippable Chunks
Legacy format
Appendix
Description of an LZ4 stream
Magic Number
4 Bytes, Little endian format.
Value : 0x184D2204
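As a minimal sketch (not part of the specification), checking the Magic Number
amounts to reading the first 4 bytes as a little-endian unsigned integer. The
names LZ4_STREAM_MAGIC and has_stream_magic are illustrative only.

```python
import struct

LZ4_STREAM_MAGIC = 0x184D2204  # value stated above


def has_stream_magic(buf: bytes) -> bool:
    """Return True if buf starts with the LZ4 stream Magic Number."""
    if len(buf) < 4:
        return False
    # "<I" reads a little-endian unsigned 32-bit integer.
    (magic,) = struct.unpack_from("<I", buf, 0)
    return magic == LZ4_STREAM_MAGIC
```

Because the field is little endian, the bytes appear in the stream in the
order 04 22 4D 18.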
Stream descriptor
3 to 15 Bytes, to be detailed in the next part.
Most significant part of the spec.
Data Blocks
To be detailed later on.
That's where compressed data is stored.
EoS (End of Stream) mark
The stream ends when the last data block has a size of "0".
The size is expressed as a 32-bit value.
Stream Checksum
Stream Checksum checks that the full stream has been decoded correctly.
The stream checksum is the result of the xxh32() hash function, using the
original (or decoded) data as input, and a seed of zero.
Stream checksum is only present when its associated flag is set in the stream
descriptor. Stream Checksum validates the result, and therefore not only that all
blocks were fully transmitted without error and in the correct order, but also that
the encoding/decoding process itself generated no distortion. Its usage is
recommended.
Concatenation
It shall be possible to append streams. In this case, the decoder shall be able to
decode them, by appending the results in the same order. There is no limit to the
number of appended streams.
Stream Descriptor
The stream descriptor uses a minimum of 3 bytes, and up to 15 bytes depending
on optional parameters.
In the picture, bit 7 is highest bit, while bit 0 is lowest.
Version Number :
2-bit field, must be set to "01".
Any other value cannot be decoded by this version of the specification.
Other version numbers will use different flag layouts.
Block Independence flag :
By default, blocks are independent, and can therefore be decoded independently.
If this flag is set to "0", it means each block depends on previous ones for
decoding (up to LZ4 window size, which is 64 KB). In this case, it's necessary to
decode all blocks in sequence.
Block dependency improves compression ratio, especially for small blocks. On the
other hand, it makes jumps or multi-threaded decoding impossible.
Default value : "1" (blocks are independent)
Block checksum flag :
If this flag is set, each data block will be followed by a 4-byte checksum,
calculated by using the xxHash-32 algorithm on the raw (compressed) data block.
The intention is to detect data corruption (storage or transmission errors)
immediately, before decoding.
Default value : "0" (disabled)
Stream Size flag :
If this flag is set, the original (uncompressed) size of the full stream will be
present as an 8-byte unsigned value, right after the flags.
Default value : "0" (not present)
Stream checksum flag :
If this flag is set, a stream checksum will be appended after the EoS mark.
Default value : "1" (stream checksum is present)
Preset Dictionary flag :
If this flag is set, a Dict-ID field will be present, right after the descriptor flags and
the stream size.
Default value : "0" (not present)
Block Maximum Size :
This information is intended to help the decoder allocate the right amount of
memory.
Size here refers to the original (uncompressed) data size.
Block Maximum Size is one value among the following table :
The decoder may refuse to allocate block sizes above a (system-specific) size.
Unused values may be used in a future revision of the spec. A decoder respecting
the current version of the spec shall be unable to decode such a stream.
Reserved bits :
Value of reserved bits must be 0 (zero).
Reserved bits might be used in a future version of the specification, to enable any
(yet-to-be-decided) optional feature.
If this happens, a decoder respecting the current version of the specification shall
not be able to decode such a stream.
Stream Size
This is the original (uncompressed) size.
This information is optional, and only present if the associated flag is set.
Stream size is provided using unsigned 8 Bytes, for a maximum of 16 Exabytes.
Format is Little endian.
This field has no impact on decoding; it just informs the decoder how much data
the stream contains (for example, to display it during the decoding process).
Dictionary ID
Dict-ID is only present if the associated flag is set.
A dictionary is especially useful to compress short input sequences. The
compressor can take advantage of the dictionary context to encode the input in a
more compact manner. It works as a kind of "known prefix" which is used by both
the compressor and the decompressor to "warm-up" reference tables and help
compress small data blocks.
Dict-ID is the xxHash-32 checksum of this "known prefix". Format is Little endian.
The decompressor uses this identifier to determine which dictionary has been
used by the compressor. The compressor and the decompressor must use exactly
the same dictionary. This document does not specify the contents of predefined
dictionaries, since the optimal dictionaries are application specific. Any data
format using this feature must precisely define the allowed dictionaries.
Within a single stream, a single dictionary is possible.
When the stream consists of multiple independent blocks, each block will be
initialised with the same dictionary.
If the stream consists of interdependent blocks, the dictionary will only be used
once, at the beginning of the stream.
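The optional fields described above can be read with a short sketch like the
following. It is an illustration under stated assumptions, not a reference
decoder: the function and type names are hypothetical, the two booleans are
assumed to have been decoded from the flags already, and the 4-byte width of
Dict-ID is inferred from it being an xxHash-32 value.

```python
import struct
from typing import NamedTuple, Optional


class OptionalFields(NamedTuple):
    stream_size: Optional[int]  # original (uncompressed) size, if present
    dict_id: Optional[int]      # xxHash-32 of the dictionary, if present
    length: int                 # bytes consumed by the optional fields


def parse_optional_fields(buf: bytes, has_stream_size: bool,
                          has_dict_id: bool) -> OptionalFields:
    """Read the optional fields; buf must start right after the flags."""
    pos = 0
    stream_size = None
    dict_id = None
    if has_stream_size:
        # 8 bytes, unsigned, little endian
        (stream_size,) = struct.unpack_from("<Q", buf, pos)
        pos += 8
    if has_dict_id:
        # Assumed 4 bytes (xxHash-32 value), little endian,
        # placed after the stream size when both are present.
        (dict_id,) = struct.unpack_from("<I", buf, pos)
        pos += 4
    return OptionalFields(stream_size, dict_id, pos)
```

With both fields present the optional part takes 12 bytes, which is consistent
with the 3-to-15-byte range given for the descriptor (3 + 8 + 4 = 15).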
Header Checksum :
One-byte checksum of all descriptor fields, including optional ones when present.
The byte is the second byte of xxh32() : { (xxh32()>>8) & 0xFF },
using zero as a seed,
and the full Stream Descriptor as an input (including optional fields when they are
present).
A different checksum indicates an error in the descriptor.
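The formula above can be sketched as follows. The xxh32 routine is a plain
Python transcription of the public xxHash-32 algorithm, included only so the
example is self-contained; a real implementation would normally rely on an
existing library. The name header_checksum is illustrative, and it is an
assumption here that the input spans the flag bytes plus any optional fields,
without the checksum byte itself.

```python
MASK32 = 0xFFFFFFFF
P1, P2, P3, P4, P5 = (0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D,
                      0x27D4EB2F, 0x165667B1)


def _rotl(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _round(acc: int, lane: int) -> int:
    """One accumulator step of the 16-byte stripe loop."""
    acc = (acc + lane * P2) & MASK32
    return (_rotl(acc, 13) * P1) & MASK32


def xxh32(data: bytes, seed: int = 0) -> int:
    """Pure-Python xxHash-32 (sketch; slow, for illustration only)."""
    n, i = len(data), 0
    if n >= 16:
        v = [(seed + P1 + P2) & MASK32, (seed + P2) & MASK32,
             seed & MASK32, (seed - P1) & MASK32]
        while i + 16 <= n:
            for k in range(4):
                lane = int.from_bytes(data[i:i + 4], "little")
                v[k] = _round(v[k], lane)
                i += 4
        h = (_rotl(v[0], 1) + _rotl(v[1], 7)
             + _rotl(v[2], 12) + _rotl(v[3], 18)) & MASK32
    else:
        h = (seed + P5) & MASK32
    h = (h + n) & MASK32
    while i + 4 <= n:  # remaining 4-byte words
        word = int.from_bytes(data[i:i + 4], "little")
        h = (h + word * P3) & MASK32
        h = (_rotl(h, 17) * P4) & MASK32
        i += 4
    while i < n:       # remaining single bytes
        h = (h + data[i] * P5) & MASK32
        h = (_rotl(h, 11) * P1) & MASK32
        i += 1
    # Final avalanche
    h ^= h >> 15
    h = (h * P2) & MASK32
    h ^= h >> 13
    h = (h * P3) & MASK32
    h ^= h >> 16
    return h


def header_checksum(descriptor_fields: bytes) -> int:
    """Second byte of xxh32(descriptor_fields, seed=0)."""
    return (xxh32(descriptor_fields, 0) >> 8) & 0xFF
```

A decoder would compute header_checksum over the descriptor bytes it has just
read and compare the result with the stored byte before trusting any field.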
Data Blocks
Block Size
This field uses 4 bytes; format is little-endian.
The highest bit is "1" if data in the block is uncompressed.
The highest bit is "0" if data in the block is compressed by LZ4.
All other bits give the size, in bytes, of the following data block (the size does not
include the checksum if present).
Block Size shall never be larger than Block Maximum Size. Such a thing could
happen when the original data is incompressible. In this case, such a data block
shall be passed in uncompressed format.
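Decoding the Block Size field can be sketched as below. The names are
hypothetical, block_max_size is assumed to come from the stream descriptor,
and treating only an all-zero 32-bit field as the EoS mark is an
interpretation of the text rather than something the specification spells out.

```python
import struct
from typing import NamedTuple


class BlockHeader(NamedTuple):
    end_of_stream: bool  # True when the field is the EoS mark
    uncompressed: bool   # True when the highest bit is set
    size: int            # size in bytes of the data that follows


def parse_block_size(field: bytes, block_max_size: int) -> BlockHeader:
    """Decode the 4-byte little-endian Block Size field."""
    (raw,) = struct.unpack("<I", field)
    if raw == 0:
        # Assumption: a field of exactly zero marks the end of the stream.
        return BlockHeader(True, False, 0)
    uncompressed = bool(raw & 0x80000000)  # highest bit
    size = raw & 0x7FFFFFFF                # remaining 31 bits
    if size > block_max_size:
        raise ValueError("Block Size exceeds Block Maximum Size")
    return BlockHeader(False, uncompressed, size)
```

Rejecting oversized blocks up front lets a decoder allocate a single buffer of
Block Maximum Size bytes and reuse it for every block.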
Data
Where the actual data to decode stands. It might be compressed or not,
depending on previous field indications.
Uncompressed size of Data can be any size, up to "block maximum size".
Note that the data block is not necessarily filled : an arbitrary "flush" may
happen anytime. Any block can be "partially filled".
Block checksum :
Only present if the associated flag is set.
This is a 4-byte checksum value, in little endian format,
calculated by using the xxHash-32 algorithm on the raw (undecoded) data block,
and a seed of zero.
The intention is to detect data corruption (storage or transmission errors) before
decoding.
Block checksum is cumulative with Stream checksum.
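A block checksum test could look like the sketch below. To keep it short, the
xxHash-32 function is passed in by the caller as a hypothetical xxh32(data,
seed) callable; any conforming implementation would do. The point illustrated
is that the hash is taken over the raw block bytes, before any decompression.

```python
import struct
from typing import Callable


def verify_block_checksum(raw_block: bytes, stored: bytes,
                          xxh32: Callable[[bytes, int], int]) -> bool:
    """Compare the stored 4-byte checksum with xxh32 of the raw block.

    raw_block is the data exactly as read from the stream (still
    compressed if the block is compressed); stored is the 4-byte
    little-endian field that follows it.
    """
    (expected,) = struct.unpack("<I", stored)
    return xxh32(raw_block, 0) == expected  # seed of zero
```

A decoder would typically run this check first and skip decompression
entirely when it fails.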
Skippable Chunks
Skippable chunks allow the integration of user-defined data into a flow of
concatenated streams.
Their design is pretty straightforward, with the sole objective to allow the decoder to
quickly skip over user-defined data and continue decoding.
For the purpose of facilitating stream identification, it is discouraged to start a
flow of concatenated streams with a skippable chunk. If there is a need to start
such a flow with some user data encapsulated into a skippable chunk, it's
recommended to start with a zero-byte LZ4 stream followed by a skippable chunk.
This will make it easier for stream/file type identifiers.
Magic Number
4 Bytes, Little endian format.
Value : 0x184D2A5X, which means any value from 0x184D2A50 to 0x184D2A5F. All
16 values are valid to identify a skippable stream.
Stream Size
This is the size, in bytes, of the following User Data (without including the magic
number nor the size field itself).
4 Bytes, Little endian format, unsigned 32-bits.
This means User Data can't be bigger than (2^32-1) Bytes.
User Data
User Data can be anything. Data will just be skipped by the decoder.
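Skipping a chunk can be sketched as follows, assuming the whole input is
available as a bytes object. The function names are illustrative only.

```python
import struct


def is_skippable_magic(magic: int) -> bool:
    """True for any of the 16 values 0x184D2A50 .. 0x184D2A5F."""
    return (magic & 0xFFFFFFF0) == 0x184D2A50


def skip_chunk(buf: bytes, pos: int) -> int:
    """Return the offset just past the skippable chunk starting at pos."""
    # Magic Number and size field: two little-endian unsigned 32-bit values.
    magic, size = struct.unpack_from("<II", buf, pos)
    if not is_skippable_magic(magic):
        raise ValueError("not a skippable chunk")
    end = pos + 8 + size  # the 8 header bytes are not counted in size
    if end > len(buf):
        raise ValueError("truncated skippable chunk")
    return end
```

The decoder never inspects the User Data; it only advances by the stated size
and resumes reading the next Magic Number.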
Legacy format
The Legacy format was defined in the initial versions of "LZ4Demo".
Newer compressors should not use this format anymore, since it is too restrictive.
It is recommended that decompressors be able to decode this format during
the transition period.
Main properties of legacy format :
- Fixed block size : 8 MB.
- All blocks must be completely filled, except the last one.
- The last block is detected either because it is followed by the "EOF" (End of File)
mark, or because it is followed by a known Magic Number.
- No checksum
- Convention is Little endian
Magic Number
4 Bytes, Little endian format.
Value : 0x184C2102
Block Compressed Size
This is the size, in bytes, of the following compressed data block.
4 Bytes, Little endian format.
Data
Where the actual data stands.
Data is always compressed, even when compression is detrimental (i.e. larger
than original size).
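Walking the blocks of a legacy stream can be sketched as below, assuming the
input is fully in memory. The names are hypothetical, the set of "known Magic
Numbers" is taken to be the three families listed in this document, and the
generator yields the still-compressed payloads (decompression is out of scope
here).

```python
import struct
from typing import Iterator

LEGACY_MAGIC = 0x184C2102
# Magic Numbers listed in this document: LZ4 stream, legacy, skippable.
KNOWN_MAGICS = {0x184D2204, LEGACY_MAGIC} | set(range(0x184D2A50, 0x184D2A60))


def iter_legacy_blocks(buf: bytes, pos: int = 0) -> Iterator[bytes]:
    """Yield each compressed block of the legacy stream starting at pos."""
    (magic,) = struct.unpack_from("<I", buf, pos)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy stream")
    pos += 4
    while pos + 4 <= len(buf):           # stop at end of input ("EOF")
        (size,) = struct.unpack_from("<I", buf, pos)
        if size in KNOWN_MAGICS:         # next stream begins here
            break
        pos += 4
        if pos + size > len(buf):
            raise ValueError("truncated block")
        yield buf[pos:pos + size]
        pos += size
```

Since the legacy format carries no checksum, a decoder built on this loop has
no way to detect corruption other than a failure of the LZ4 decoder itself.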
Appendix
Version changes
1.4 : added skippable streams, re-added stream checksum
1.3 : modified header checksum
1.2 : reduced choice of "block size", to postpone decision on "dynamic size of
BlockSize Field".
1.1 : optional fields are now part of the descriptor
1.0 : changed "block size" specification, adding a compressed/uncompressed flag
0.9 : reduced scale of "block maximum size" table
0.8 : removed : high compression flag
0.7 : removed : stream checksum
0.6 : settled : stream size uses 8 bytes, endian convention is little endian
0.5 : added copyright notice
0.4 : changed format to Google Doc compatible OpenDocument