Misplaced Pages

Lempel–Ziv–Markov chain algorithm

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.

The Lempel–Ziv–Markov chain algorithm ( LZMA ) is an algorithm used to perform lossless data compression . It has been under development since either 1996 or 1998 by Igor Pavlov and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio (generally higher than bzip2 ) and a variable compression-dictionary size (up to 4  GB ), while still maintaining decompression speed similar to other commonly used compression algorithms.

#227772

34-467: LZMA2 is a simple container format that can include both uncompressed data and LZMA data, possibly with multiple different LZMA encoding parameters. LZMA2 supports arbitrarily scalable multithreaded compression and decompression and efficient compression of data which is partially incompressible. LZMA uses a dictionary compression algorithm (a variant of LZ77 with huge dictionary sizes and special support for repeatedly used match distances), whose output

68-430: A data structure (called the 'dictionary') maintained by the encoder. When the encoder finds such a match, it substitutes a reference to the string's position in the data structure. Some dictionary coders use a 'static dictionary', one whose full set of strings is determined before coding begins and does not change during the coding process. This approach is most often used when the message or set of messages to be encoded

102-498: A 16-bit data type) representing the predicted probability of the bit being 0, which is read and updated by the range decoder (and should be initialized to ⁠ 2 10 {\displaystyle 2^{10}} ⁠ , representing 0.5 probability). Fixed probability range decoding instead assumes a 0.5 probability, but operates slightly differently from context-based range decoding. The range decoder state consists of two unsigned 32-bit variables, range (representing

136-435: A SHORTREP packet had been used (in other words, the byte found at the dictionary at the last used distance); it is only used just after a *MATCH packet. literal_bit_mode is an array of 8 values in the 0–2 range, one for each bit position in a byte, which are 1 or 2 if the previous packet was a *MATCH and it is either the most significant bit position or all the more significant bits in the literal to encode/decode are equal to

170-419: A bit proceeds in this way: The Linux kernel implementation of fixed-probability decoding in rc_direct() , for performance reasons, does not include a conditional branch, but instead subtracts range from code unconditionally. The resulting sign bit is used to both decide the bit to return and to generate a mask that is combined with code and added to range . Note that: The range decoder also provides

204-424: A byte indicating the dictionary size: LZMA2 data consists of packets starting with a control byte, with the following values: Bits 5–6 for LZMA chunks can be: Dictionary coder A dictionary coder , also sometimes known as a substitution coder , is a class of lossless data compression algorithms which operate by searching for matches between the text to be compressed and a set of strings contained in

238-407: A cascade of contexts to represent the dependencies on previous bits from the same byte). The main innovation of LZMA is that instead of a generic byte-based model, LZMA's model uses contexts specific to the bitfields in each representation of a literal or phrase: this is nearly as simple as a generic byte-based model, but gives much better compression because it avoids mixing unrelated bits together in

272-449: A dictionary size. The value of the lclppb byte is lc + lp × 9 + pb × 9 × 5 , where: In non-LZMA2 streams, lc must not be greater than 8, and lp and pb must not be greater than 4. In LZMA2 streams, ( lc + lp ) and pb must not be greater than 4. In the 7-zip LZMA file format, configuration is performed by a header containing the "properties" byte followed by the 32-bit little-endian dictionary size in bytes. In LZMA2,

306-468: A different LZMA configuration and dictionary. This improves the compression of partially or completely incompressible files and allows multithreaded compression and multithreaded decompression by breaking the file into runs that can be compressed or decompressed independently in parallel. Criticism of LZMA2's changes over LZMA include header fields not being covered by CRCs, and parallel decompression not being possible in practice. The LZMA2 header consists of

340-400: A few values that are used as indices in these multidimensional arrays are defined. The state value is conceptually based on which of the patterns in the following table match the latest 2–4 packet types seen, and is implemented as a state machine state updated according to the transition table listed in the table every time a packet is output. The initial state is 0, and thus packets before

374-483: A pointer to the tree of variables, which starts at the root. As long as the pointer does not point to a leaf, a bit is decoded using the variable indicated by the pointer, and the pointer is moved to either the left or right children depending on whether the bit is 0 or 1; when the pointer points to a leaf, the number associated with the leaf is returned. Non-reverse bit-tree decoding thus happens from most significant to least significant bit, stopping when only one value in

SECTION 10

#1733084591228

408-528: A power of two limit , and reversing the last log 2 ( limit ) bits of the result. In the rc_bittree function in the Linux kernel, integers are actually returned in the [ limit , 2 × limit ) range (with limit added to the conceptual value), and the variable at index 0 in the array is unused, while the one at index 1 is the root, and the left and right children indices are computed as 2 i and 2 i + 1. The rc_bittree_reverse function instead adds integers in

442-640: A single byte, or an LZ77 sequence with its length and distance implicitly or explicitly encoded. Each part of each packet is modeled with independent contexts, so the probability predictions for each bit are correlated with the values of that bit (and related bits from the same field) in previous packets of the same type. Both the lzip and the LZMA SDK documentation describes this stream format. There are 7 types of packets: LONGREP[*] refers to LONGREP[0–3] packets, *REP refers to both LONGREP and SHORTREP, and *MATCH refers to both MATCH and *REP. LONGREP[n] packets remove

476-440: Is built from redundancy extracted from a data environment (various input streams) which dictionary is then used statically to compress a further input stream. For example, a dictionary is built from old English texts then is used to compress a book. More common are methods where the dictionary starts in some predetermined state but the contents change during the encoding process, based on the data that has already been encoded. Both

510-409: Is fixed and large; for instance, an application that stores the contents of a book in the limited storage space of a PDA generally builds a static dictionary from a concordance of the text and then uses that dictionary to compress the verses. This scheme of using Huffman coding to represent indices into a concordance has been called "Huffword". In a related and more general method, a dictionary

544-451: Is not ideal, any programmer should be able to check the claims below with a few hours of work. LZMA data is at the lowest level decoded one bit at a time by the range decoder, at the direction of the LZMA decoder. Context-based range decoding is invoked by the LZMA algorithm passing it a reference to the "context", which consists of the unsigned 11-bit variable prob (typically implemented using

578-435: Is one. At each step of the encoding process, if there is no match, then the last matching index (or zero) and character are both added to the dictionary and output to the compressed stream. If there is a match, then the working index is updated to the matching index, and nothing is output. LZW is similar to LZ78, but, the dictionary is initialized to all possible symbols. The typical implementation works with 8 bit symbols, so

612-544: Is then encoded with a range encoder , using a complex model to make a probability prediction of each bit. The dictionary compressor finds matches using sophisticated dictionary data structures, and produces a stream of literal symbols and phrase references, which is encoded one bit at a time by the range encoder: many encodings are possible, and a dynamic programming algorithm is used to select an optimal one under certain approximations. Prior to LZMA, most encoder models were purely byte-based (i.e. they coded each bit using only

646-449: The LZ77 and LZ78 algorithms work on this principle. In LZ77, a circular buffer called the "sliding window" holds the last N bytes of data processed. This window serves as the dictionary, effectively storing every substring that has appeared in the past N bytes as dictionary entries. Instead of a single index identifying a dictionary entry, two values are needed: the length , indicating

680-403: The [0, limit ) range to a caller-provided variable, where limit is implicitly represented by its logarithm, and has its own independent implementation for efficiency reasons. Fixed probability integer decoding simply performs fixed probability bit decoding repeatedly, reading bits from the most to the least significant. The LZMA decoder is configured by an lclppb "properties" byte and

714-456: The beginning are assumed to be LIT packets. The pos_state and literal_pos_state values consist of respectively the pb and lp (up to 4, from the LZMA header or LZMA2 properties packet) least significant bits of the dictionary position (the number of bytes coded since the last dictionary reset modulo the dictionary size). Note that the dictionary size is normally the multiple of a large power of 2, so these values are equivalently described as

SECTION 20

#1733084591228

748-439: The bit-tree, reverse bit-tree and fixed probability integer decoding facilities, which are used to decode integers, and generalize the single-bit decoding described above. To decode unsigned integers less than limit , an array of ( limit − 1) 11-bit probability variables is provided, which are conceptually arranged as the internal nodes of a complete binary tree with limit leaves. Non-reverse bit-tree decoding works by keeping

782-405: The bits in the corresponding positions in match_byte , while otherwise it is 0; the choice between the 1 or 2 values depends on the value of the bit at the same position in match_byte . The literal/Literal set of variables can be seen as a "pseudo-bit-tree" similar to a bit-tree but with 3 variables instead of 1 in every node, chosen depending on the literal_bit_mode value at the bit position of

816-500: The copy was performed byte by byte, keeping the distance constant. Distances are logically 32-bit and distance 0 points to the most recently added byte in the dictionary. The distance encoding starts with a 6-bit "distance slot", which determines how many further bits are needed. Distances are decoded as a binary concatenation of, from most to least significant, two bits depending on the distance slot, some bits encoded with fixed 0.5 probability, and some context encoded bits, according to

850-478: The dictionary "codes" for hex 00 to hex FF (decimal 255) are pre-defined. Dictionary entries would be added starting with code value hex 100. Unlike LZ78, if a match is not found (or if the end of data), then only the dictionary code is output. This creates a potential issue since the decoder output is one step behind the dictionary. Refer to LZW for how this is handled. Enhancements to LZW include handing symbol sizes other than 8 bits and having reserved codes to reset

884-450: The distance used from the list of the most recent distances and reinsert it at the front, to avoid useless repeated entry, while MATCH just adds the distance to the front even if already present in the list and SHORTREP and LONGREP[0] don't alter the list. The length is encoded as follows: As in LZ77, the length is not limited by the distance, because copying from the dictionary is defined as if

918-457: The following table (distance slots 0−3 directly encode distances 0−3). No complete natural language specification of the compressed format seems to exist, other than the one attempted in the following text. The description below is based on the compact XZ Embedded decoder by Lasse Collin included in the Linux kernel source from which the LZMA and LZMA2 algorithm details can be relatively easily deduced: thus, while citing source code as reference

952-439: The least significant bits of the number of uncompressed bytes seen since the last dictionary reset. The prev_byte_lc_msbs value is set to the lc (up to 4, from the LZMA header or LZMA2 properties packet) most significant bits of the previous uncompressed byte. The is_REP value denotes whether a packet that includes a length is a LONGREP rather than a MATCH. The match_byte value is the byte that would have been decoded if

986-403: The length of the matched text, and the offset (also called the distance ), indicating that the match is found in the sliding window starting offset bytes before the current text. LZ78 uses a more explicit dictionary structure; at the beginning of the encoding process, the dictionary is empty. An index value of zero is used to represent the end of a string, so the first index of the dictionary

1020-599: The next bit to decode after the bit-tree context denoted by the node. The claim, found in some sources, that literals after a *MATCH are coded as the XOR of the byte value with match_byte is incorrect; they are instead coded simply as their byte value, but using the pseudo-bit-tree just described and the additional context listed in the table below. The probability variable groups used in LZMA are those: LONGREP[0] The LZMA2 container supports multiple runs of compressed LZMA data and uncompressed data. Each LZMA compressed run can have

1054-593: The properties byte can optionally be changed at the start of LZMA2 LZMA packets, while the dictionary size is specified in the LZMA2 header as later described. The LZMA packet format has already been described, and this section specifies how LZMA statistically models the LZ-encoded streams, or in other words which probability variables are passed to the range decoder to decode each bit. Those probability variables are implemented as multi-dimensional arrays; before introducing them,

Lempel–Ziv–Markov chain algorithm - Misplaced Pages Continue

1088-484: The range size), and code (representing the encoded point within the range). Initialization of the range decoder consists of setting range to 2 − 1 , and code to the 32-bit value starting at the second byte in the stream interpreted as big-endian; the first byte in the stream is completely ignored. Normalization proceeds in this way: Context-based range decoding of a bit using the prob probability variable proceeds in this way: Fixed-probability range decoding of

1122-436: The same context. Furthermore, compared to classic dictionary compression (such as the one used in zip and gzip formats), the dictionary sizes can be and usually are much larger, taking advantage of the large amount of memory available on modern systems. In LZMA compression, the compressed stream is a stream of bits, encoded using an adaptive binary range coder. The stream is divided into packets, each packet describing either

1156-406: The valid range is possible (this conceptually allows to have range sizes that are not powers of two, even though LZMA does not make use of this). Reverse bit-tree decoding instead decodes from least significant bit to most significant bits, and thus only supports ranges that are powers of two, and always decodes the same number of bits. It is equivalent to performing non-reverse bittree decoding with

#227772