The LHA archive format, also known as LZH, is a compressed archive file format that was most widely used on MS-DOS and Microsoft Windows systems and was especially popular in Japan. It was developed by Haruyasu Yoshizaki in the late 1980s as an improvement over earlier archivers such as ARC and LArc. LHA archives provide good compression ratios and fast decompression, making them well-suited for storing and distributing software, documents, and other types of files.
The LHA format combines LZSS, a sliding-window dictionary coder in the Lempel-Ziv 77 family, with Huffman coding to achieve high compression ratios. LZSS replaces repeated byte sequences with short back-references into a window of recently seen data, built up as the input is scanned. Huffman coding, on the other hand, is a variable-length coding scheme that assigns shorter bit sequences to more frequent symbols, thereby reducing the overall size of the compressed data.
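To make the Huffman half of that pairing concrete, here is a minimal sketch in Python that builds code lengths from byte frequencies. It illustrates the general technique only; it is not LHA's actual table construction.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Build Huffman code lengths (in bits) for each byte value."""
    freq = Counter(data)
    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far})
    heap = [(count, i, {sym: 0}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate case: one distinct byte
        return {sym: 1 for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in left.items()}
        merged.update({sym: depth + 1 for sym, depth in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_code_lengths(b"abracadabra"))
# The most frequent byte ('a') ends up with the shortest code.
```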
An LHA archive is simply a sequence of file entries, each consisting of a header followed by that file's compressed data; there is no separate archive-wide header, and the archive is terminated by a single zero byte where the next header would begin. Each file header records the compression method identifier (such as -lh5-), the compressed and original sizes, the modification timestamp, the filename, and a CRC-16 checksum of the uncompressed data. Several header levels (0, 1, and 2) exist, differing mainly in how they store timestamps, path names, and extended attributes.
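As a rough illustration of that per-file header layout, the sketch below parses a level-0 header with Python's struct module. The field order and sizes follow the commonly published description of level-0 headers, but treat them as assumptions to verify against a real archive; level-1 and level-2 headers differ.

```python
import struct

def parse_level0_header(buf: bytes, offset: int = 0) -> dict:
    """Parse one level-0 LHA file header starting at `offset` (sketch only)."""
    (header_size, header_checksum, method, packed_size, original_size,
     dos_timestamp, attribute, level, name_len) = struct.unpack_from(
        "<BB5sIIIBBB", buf, offset)            # 22 fixed bytes, little-endian
    name_start = offset + 22
    filename = buf[name_start:name_start + name_len]
    (crc16,) = struct.unpack_from("<H", buf, name_start + name_len)
    return {
        "method": method.decode("ascii"),                 # e.g. "-lh5-"
        "packed_size": packed_size,                       # compressed bytes that follow
        "original_size": original_size,
        "filename": filename.decode("ascii", "replace"),  # often Shift-JIS in practice
        "crc16": crc16,                                   # CRC-16 of the uncompressed file
        "level": level,
    }
```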
After each file header, the compressed data for that file follows; for the static-Huffman methods the stream is divided into blocks, each carrying its own Huffman code tables. LHA supports several compression methods, identified by the five-character code in the header, including -lh0- (stored, no compression), -lh1- (LZSS with a 4 KiB window and adaptive Huffman coding), -lh4- and -lh5- (LZSS with static Huffman coding and 4 KiB and 8 KiB windows, respectively), and -lh6-/-lh7- (the same scheme with 32 KiB and 64 KiB windows). The choice of compression method affects both the compression ratio and the decompression speed of the archive.
One point worth noting is that LHA does not use solid compression: each file in the archive is compressed independently. This keeps single-file extraction fast, because only the relevant entry has to be decompressed, but it also means the format cannot exploit redundancy across file boundaries the way solid formats such as RAR or 7z can, which limits the achievable ratio on collections of many small, similar files.
To create an LHA archive, a compression utility such as LHA or LHarc is used. These utilities take one or more input files and compress them into a single archive file with the extension .lha or .lzh. The compression process scans the input through a sliding window, finds repeated byte sequences, and replaces them with shorter back-references in the compressed output. The compressed data is then written to the archive file along with the necessary headers and metadata.
Extracting files from an LHA archive involves reading the archive headers to locate the desired file(s) and then decompressing the corresponding data. The decompression process reverses the compression algorithm, rebuilding the original data from the back-references and Huffman-coded symbols. Most LHA utilities support various extraction options, such as extracting specific files, overwriting existing files, or preserving the original directory structure.
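The core of that rebuilding step is easy to show in isolation. The toy decoder below consumes a stream of literal bytes and (distance, length) back-references, the same idea real -lh5- data uses once the Huffman layer has been undone; it is a sketch of the technique, not LHA's actual bitstream format.

```python
def lzss_decode(tokens) -> bytes:
    """Decode a toy LZSS token stream of literals and (distance, length) pairs."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, int):            # literal byte
            out.append(token)
        else:                                 # back-reference into earlier output
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])    # byte-by-byte copy allows overlap
    return bytes(out)

# Three literals plus one overlapping back-reference reconstruct "abcabcabc".
print(lzss_decode([97, 98, 99, (3, 6)]))      # b'abcabcabc'
```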
One advantage of the LHA format is its compatibility with a wide range of operating systems and platforms. In addition to MS-DOS and Microsoft Windows, LHA archives can be created and extracted on Unix-like systems, macOS, and other platforms using appropriate software tools, and the format was historically a standard distribution format on the Amiga. This cross-platform compatibility makes LHA a convenient choice for distributing software and data across different environments.
However, the LHA format also has some limitations compared to more modern compression formats. One issue is its lack of built-in encryption support, which means that LHA archives do not provide any inherent security for sensitive data. Another limitation is maximum file size: the classic headers store sizes in 32-bit fields, capping individual files at 4 GiB (or 2 GiB in implementations that treat the fields as signed). Additionally, the LHA format has largely been superseded by newer formats like ZIP, RAR, and 7z, which offer additional features and, in many cases, better compression.
Despite these limitations, the LHA format remains in use today, particularly for archiving and distributing older software and data. Many classic MS-DOS games, applications, and document archives are still distributed in LHA format, and there are numerous tools available for working with LHA archives on modern systems. Utilities such as LHA, LHarc, and UNLHA handle the format directly, while many modern file archivers like 7-Zip and WinRAR can also extract (though generally not create) LHA archives.
In terms of performance, the LHA format offers a good balance between compression ratio and decompression speed. The exact characteristics depend on the compression method and settings used, as well as the nature of the input data. In general, -lh5- (8 KiB window) is the most widely used method and decompresses quickly with respectable ratios, while -lh6- and -lh7- trade slightly higher memory requirements for somewhat better compression thanks to their larger windows.
When working with LHA archives, it is important to ensure that the software tools used are compatible with the specific version and features of the archive format. Older LHA compression utilities may not support newer compression methods or archive features, while modern tools may handle older archives differently than the original software. It is also recommended to verify the integrity of LHA archives using CRC-16 checksums or other verification methods to ensure that the compressed data has not been corrupted during storage or transmission.
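When a tool verifies an entry, it recomputes the CRC-16 of the decompressed bytes and compares it against the value stored in the header. The sketch below implements the reflected-0x8005 (0xA001) CRC variant commonly described for LHA headers; treat the choice of variant as an assumption to confirm against your own archives.

```python
def crc16(data: bytes) -> int:
    """CRC-16 with the reflected 0x8005 polynomial (0xA001), initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

# 0xBB3D is the published check value for this CRC variant over "123456789".
assert crc16(b"123456789") == 0xBB3D

# Verification is then just a comparison against the header field, e.g.:
# if crc16(decompressed_bytes) != header["crc16"]: raise ValueError("corrupt entry")
```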
In conclusion, the LHA archive format is a legacy compression format that provides efficient compression and fast decompression for storing and distributing files on MS-DOS and Microsoft Windows systems. While it has largely been superseded by newer formats like ZIP and RAR, LHA remains relevant for archiving and distributing older software and data. Its cross-platform compatibility and good performance characteristics make it a useful tool in certain scenarios, and there are still many software utilities and tools available for working with LHA archives on modern systems. Understanding the structure and features of the LHA format is valuable for anyone working with legacy data or software archives.
File compression reduces redundancy so the same information takes fewer bits. The upper bound on how far you can go is governed by information theory: for lossless compression, the limit is the entropy of the source (see Shannon’s source coding theorem and his original 1948 paper “A Mathematical Theory of Communication”). For lossy compression, the trade-off between rate and quality is captured by rate–distortion theory.
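As a quick, concrete illustration of that limit, the snippet below computes the empirical entropy of a byte string under a simple model that treats bytes as independent draws from their observed distribution. That figure is a floor only for coders that ignore ordering; LZ-style compressors can beat it by exploiting repeated substrings.

```python
import math
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    """Empirical entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

text = b"abracadabra " * 1000
print(bits_per_byte(text))                     # well under 8: this text is far from random
print(bits_per_byte(bytes(range(256)) * 16))   # exactly 8: uniform bytes are incompressible
```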
Most compressors have two stages. First, a model predicts or exposes structure in the data. Second, a coder turns those predictions into near-optimal bit patterns. A classic modeling family is Lempel–Ziv: LZ77 (1977) and LZ78 (1978) detect repeated substrings and emit references instead of raw bytes. On the coding side, Huffman coding (from Huffman's 1952 paper) assigns shorter codes to more likely symbols. Arithmetic coding and range coding are finer-grained alternatives that squeeze closer to the entropy limit, while the more recent Asymmetric Numeral Systems (ANS) family achieves similar compression with fast table-driven implementations.
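Here is a deliberately naive sketch of the LZ77 modeling stage: a greedy matcher that emits literal bytes and (distance, length) back-references. Real encoders such as zlib's use hash chains and lazy matching rather than this quadratic search, so treat it purely as an illustration of the idea.

```python
def lz77_tokens(data: bytes, window: int = 32 * 1024, min_match: int = 3):
    """Greedy LZ77 tokenizer: emits literal bytes and (distance, length) pairs.
    Matches may overlap the current position, just as DEFLATE and LZSS allow."""
    i, tokens = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens

print(lz77_tokens(b"abcabcabcabcX"))   # [97, 98, 99, (3, 9), 88]
```

The tokens above can be fed straight into the toy lzss_decode function shown earlier, since both use the same literal/back-reference convention.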
DEFLATE (used by gzip, zlib, and ZIP) combines LZ77 with Huffman coding. Its specs are public: DEFLATE RFC 1951, zlib wrapper RFC 1950, and gzip file format RFC 1952. Gzip is framed for streaming and explicitly does not attempt to provide random access. PNG images standardize DEFLATE as their only compression method (with a max 32 KiB window), per the PNG spec “Compression method 0… deflate/inflate… at most 32768 bytes” and W3C/ISO PNG 2nd Edition.
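Python's standard library exposes all three of those framings through zlib and gzip, which makes the relationship easy to see: the DEFLATE payload is the same, only the wrapper differs.

```python
import gzip
import zlib

data = b"the same payload, framed three different ways " * 200

def deflate(payload: bytes, wbits: int) -> bytes:
    co = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return co.compress(payload) + co.flush()

raw_stream  = deflate(data, -15)   # RFC 1951: raw DEFLATE, no header or checksum
zlib_stream = deflate(data, 15)    # RFC 1950: 2-byte header + Adler-32 trailer
gzip_stream = deflate(data, 31)    # RFC 1952: 10-byte header + CRC-32/length trailer

assert zlib.decompress(raw_stream, -15) == data
assert zlib.decompress(zlib_stream) == data
assert gzip.decompress(gzip_stream) == data
print(len(raw_stream), len(zlib_stream), len(gzip_stream))   # differ only by framing overhead
```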
Zstandard (zstd): a newer general-purpose compressor designed for high ratios with very fast decompression. The format is documented in RFC 8878 and in the reference specification in the zstd GitHub repository. Like gzip, the basic frame doesn’t aim for random access. One of zstd’s superpowers is dictionaries: small samples from your corpus that dramatically improve compression on many tiny or similar files (see the python-zstandard dictionary docs and Nigel Tao’s worked example). Implementations accept both “unstructured” and “structured” dictionaries.
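A sketch of that workflow using the third-party python-zstandard package (import name zstandard): train a dictionary from samples, then compress tiny records with and without it. The names used below (train_dictionary, ZstdCompressor, dict_data) follow that package's documented API, but verify them against the version you install.

```python
import zstandard as zstd

# Many small, similar records -- exactly the case dictionaries help with.
samples = [f'{{"user": {i}, "status": "ok", "latency_ms": {i % 40}}}'.encode()
           for i in range(1000)]

dictionary = zstd.train_dictionary(16 * 1024, samples)
cctx_plain = zstd.ZstdCompressor()
cctx_dict  = zstd.ZstdCompressor(dict_data=dictionary)
dctx_dict  = zstd.ZstdDecompressor(dict_data=dictionary)

record = samples[0]
plain = cctx_plain.compress(record)
with_dict = cctx_dict.compress(record)
assert dctx_dict.decompress(with_dict) == record
print(len(record), len(plain), len(with_dict))   # expect the dictionary version to be smallest
```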
Brotli: optimized for web content (e.g., WOFF2 fonts, HTTP). It mixes a static dictionary with a DEFLATE-like LZ+entropy core. The spec is RFC 7932, which also notes a sliding window of 2^WBITS − 16 bytes with WBITS in [10, 24] (1 KiB − 16 B up to 16 MiB − 16 B) and that it does not attempt random access. Brotli often beats gzip on web text while decoding quickly.
ZIP container: ZIP is an archive container whose entries can each use a different compression method (store, deflate, zstd, etc.). The de facto standard is PKWARE’s APPNOTE (see the APPNOTE portal and the Library of Congress format overviews, ZIP File Format (PKWARE) and ZIP 6.3.3).
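Python's built-in zipfile module makes that per-entry choice visible; the snippet below stores one entry with DEFLATE and one uncompressed in the same archive.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("notes.txt", "plain text compresses well " * 100,
                compress_type=zipfile.ZIP_DEFLATED)
    zf.writestr("blob.bin", bytes(256),            # stored as-is, no compression
                compress_type=zipfile.ZIP_STORED)

with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        print(info.filename, info.compress_type, info.file_size, info.compress_size)
```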
LZ4 targets raw speed with modest ratios. See its project page (“extremely fast compression”) and frame format. It’s ideal for in-memory caches, telemetry, or hot paths where decompression must be near RAM speed.
XZ / LZMA push for density (great ratios) with relatively slow compression. XZ is a container; the heavy lifting is typically LZMA/LZMA2 (LZ77-like modeling + range coding). See .xz file format, the LZMA spec (Pavlov), and Linux kernel notes on XZ Embedded. XZ usually out-compresses gzip and often competes with high-ratio modern codecs, but with slower encode times.
bzip2 applies the Burrows–Wheeler Transform (BWT), move-to-front, RLE, and Huffman coding. It’s typically smaller than gzip but slower; see the official manual and man pages (Linux).
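The trade-offs in the last two entries are easy to measure with the standard library alone, since Python ships zlib, bz2, and lzma bindings. A rough timing sketch (the numbers will vary with hardware and input):

```python
import bz2
import lzma
import time
import zlib

payload = b"GET /api/v1/items HTTP/1.1 200 OK\n" * 50000

for name, compress in [("zlib/DEFLATE", lambda d: zlib.compress(d, 9)),
                       ("bzip2/BWT",    lambda d: bz2.compress(d, 9)),
                       ("xz/LZMA",      lambda d: lzma.compress(d))]:
    start = time.perf_counter()
    out = compress(payload)
    print(f"{name:12s} {len(out):9d} bytes in {time.perf_counter() - start:.3f}s")
```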
“Window size” matters. DEFLATE references can only look back 32 KiB (RFC 1951; PNG inherits the same 32 KiB cap). Brotli’s window ranges from about 1 KiB to 16 MiB (RFC 7932). Zstd tunes window and search depth by level (RFC 8878). Basic gzip/zstd/brotli streams are designed for sequential decoding; the base formats don’t promise random access, though containers (e.g., tar indexes, chunked framing, or format-specific indexes) can layer it on.
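The window limit is easy to demonstrate: repeat a block of incompressible data at a distance greater than 32 KiB, and DEFLATE cannot reference the first copy, while LZMA's multi-megabyte default dictionary can.

```python
import lzma
import os
import zlib

block = os.urandom(64 * 1024)                   # incompressible on its own
data = block + os.urandom(40 * 1024) + block    # second copy sits ~104 KiB back

print("input:  ", len(data))
print("deflate:", len(zlib.compress(data, 9)))  # ~ full size: repeat is outside the 32 KiB window
print("xz/lzma:", len(lzma.compress(data)))     # ~ one block smaller: the repeat is found
```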
The formats above are lossless: you can reconstruct exact bytes. Media codecs are often lossy: they discard imperceptible detail to hit lower bitrates. In images, classic JPEG (DCT, quantization, entropy coding) is standardized in ITU-T T.81 / ISO/IEC 10918-1. In audio, MP3 (MPEG-1 Layer III) and AAC (MPEG-2/4) rely on perceptual models and MDCT transforms (see ISO/IEC 11172-3 and ISO/IEC 13818-7). Lossy and lossless can coexist (e.g., PNG for UI assets; web codecs for images/video/audio).
Theory: Shannon 1948 · Rate–distortion · Coding: Huffman 1952 · Arithmetic coding · Range coding · ANS. Formats: DEFLATE · zlib · gzip · Zstandard · Brotli · LZ4 frame · XZ format. BWT stack: Burrows–Wheeler (1994) · bzip2 manual. Media: JPEG T.81 · MP3 ISO/IEC 11172-3 · AAC ISO/IEC 13818-7 · MDCT.
Bottom line: choose a compressor that matches your data and constraints, measure on real inputs, and don’t forget the gains from dictionaries and smart framing. With the right pairing, you can get smaller files, faster transfers, and snappier apps — without sacrificing correctness or portability.
File compression is a process that reduces the size of a file or files, typically to save storage space or speed up transmission over a network.
File compression works by identifying and removing redundancy in the data. It uses algorithms to encode the original data in a smaller space.
The two primary types of file compression are lossless and lossy compression. Lossless compression allows the original file to be perfectly restored, while lossy compression enables more significant size reduction at the cost of some loss in data quality.
A popular example of a file compression tool is WinZip, which supports multiple compression formats including ZIP and RAR.
With lossless compression, the quality remains unchanged. However, with lossy compression, there can be a noticeable decrease in quality since it eliminates less-important data to reduce file size more significantly.
File compression is safe in terms of data integrity, especially with lossless compression. However, like any files, compressed files can be targeted by malware or viruses, so it's always important to have reputable security software in place.
Almost all types of files can be compressed, including text files, images, audio, video, and software files. However, the level of compression achievable can significantly vary between file types.
A ZIP file is a type of file format that uses lossless compression to reduce the size of one or more files. Multiple files in a ZIP file are effectively bundled together into a single file, which also makes sharing easier.
An already-compressed file can technically be compressed again, although the additional size reduction is usually minimal or even counterproductive. Compressing an already compressed file can actually increase its size slightly, because the second pass adds its own headers and metadata while finding little redundancy left to remove.
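A quick check with Python's zlib module shows the effect: the first pass removes the redundancy, and the second pass has almost nothing left to work with.

```python
import zlib

original = b"squeeze me please " * 1000
once  = zlib.compress(original, 9)
twice = zlib.compress(once, 9)
print(len(original), len(once), len(twice))   # the second pass typically gains nothing
```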
To decompress a file, you typically need a decompression or unzipping tool, like WinZip or 7-Zip. These tools can extract the original files from the compressed format.