The ISO archive format, also known as ISO 9660, is a file system standard published by the International Organization for Standardization (ISO) in 1988. It was designed as a cross-platform file system for optical disc media, such as CD-ROMs. The goal was to provide a unified method for different operating systems to read data from optical discs, ensuring interoperability and compatibility.
ISO 9660 defines a hierarchical file system structure, similar to the file systems used by most operating systems. It organizes data into directories and files, with each directory able to contain subdirectories and files. The standard specifies the format of the volume and directory descriptors, as well as the path table, which is used for quick access to directories.
One of the key features of the ISO 9660 format is its simplicity and compatibility. The standard imposes restrictions on file names, directory structures, and metadata so that discs can be read by a wide range of systems. At the strictest interchange level (Level 1), file names are limited to 8 characters plus a 3-character extension (the 8.3 format) and may contain only uppercase letters, digits, and underscores. Directory names are similarly restricted, and the directory hierarchy may be at most 8 levels deep.
To accommodate longer file names and additional metadata, the ISO 9660 standard has been extended through various specifications. One such extension is Joliet, introduced by Microsoft in 1995. Joliet allows longer file names (up to 64 Unicode characters) and preserves mixed-case names. It achieves this with a Supplementary Volume Descriptor and a parallel set of directory records encoded in UCS-2, which are read by systems that support the Joliet extension.
Another notable extension to ISO 9660 is Rock Ridge, which was developed for UNIX systems. Rock Ridge adds POSIX file system semantics, such as file permissions, ownership, and symbolic links, to the ISO 9660 format. This extension allows for the preservation of UNIX-specific file attributes when creating ISO images from UNIX file systems.
The ISO 9660 format divides the disc into logical blocks, each typically 2,048 bytes in size. The first 16 blocks form the System Area, which is reserved for system use (for example, boot code). The Volume Descriptors begin at block 16 and provide information about the disc's structure and content. The Primary Volume Descriptor is mandatory and includes details such as the disc's volume identifier, the size of the logical blocks, and the root directory record.
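To make this concrete, here is a minimal Python sketch of reading the Primary Volume Descriptor from an image file. The field offsets follow ECMA-119 (the standard behind ISO 9660); the file name is a placeholder.

```python
import struct

SECTOR = 2048  # standard ISO 9660 logical block size

def read_pvd(path):
    """Read the Primary Volume Descriptor of an ISO 9660 image."""
    with open(path, "rb") as f:
        # Volume Descriptors start at block 16, after the System Area.
        f.seek(16 * SECTOR)
        pvd = f.read(SECTOR)

    # Byte 0 is the descriptor type (1 = primary); bytes 1-5 are "CD001".
    if pvd[0] != 1 or pvd[1:6] != b"CD001":
        raise ValueError("not a Primary Volume Descriptor")

    volume_id = pvd[40:72].decode("ascii").rstrip()
    # Numeric fields are stored in both byte orders ("both-endian");
    # here we read the little-endian half of each pair.
    volume_space = struct.unpack_from("<I", pvd, 80)[0]  # size in blocks
    block_size = struct.unpack_from("<H", pvd, 128)[0]   # usually 2048
    return volume_id, volume_space, block_size

if __name__ == "__main__":
    vid, blocks, bs = read_pvd("example.iso")  # hypothetical file name
    print(f"volume={vid!r} blocks={blocks} block_size={bs}")
```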
Following the Volume Descriptors, the Path Table is stored on the disc. The Path Table contains information about the location of each directory on the disc, allowing for quick traversal of the directory hierarchy. It consists of an L-Path Table (Little-Endian) and an M-Path Table (Big-Endian) to support different byte orderings used by various systems.
Directories and files are stored in the subsequent blocks of the disc. Each entry is described by a Directory Record, which contains its identifier (name), flags distinguishing files from directories, and the location and size of its data. Files are stored as contiguous sequences of logical blocks (extents), with the starting block and length specified in the corresponding Directory Record.
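Continuing the sketch above, here is a hedged example of walking the root directory's records. It assumes 2,048-byte blocks and ignores extensions such as Joliet and Rock Ridge.

```python
import struct

SECTOR = 2048

def list_root(path):
    """List the entries of an ISO 9660 image's root directory (a sketch)."""
    with open(path, "rb") as f:
        f.seek(16 * SECTOR)
        pvd = f.read(SECTOR)
        # The 34-byte root Directory Record is embedded at PVD offset 156.
        root = pvd[156:190]
        extent = struct.unpack_from("<I", root, 2)[0]   # first data block
        length = struct.unpack_from("<I", root, 10)[0]  # data length, bytes

        f.seek(extent * SECTOR)
        data = f.read(length)

    offset = 0
    while offset < len(data):
        rec_len = data[offset]
        if rec_len == 0:
            # Records never span sectors; skip padding to the next sector.
            offset = (offset // SECTOR + 1) * SECTOR
            continue
        flags = data[offset + 25]
        id_len = data[offset + 32]
        name = data[offset + 33:offset + 33 + id_len]
        # Identifiers 0x00 and 0x01 are "." and ".."; plain files
        # typically carry a ";1" version suffix.
        kind = "dir " if flags & 0x02 else "file"
        print(kind, name)
        offset += rec_len
```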
When creating an ISO image, the file system is first organized according to the ISO 9660 standard's requirements. This includes ensuring that file and directory names comply with the 8.3 format, limiting the directory depth, and converting file names to uppercase. Once the file system is prepared, it is written to an image file with the `.iso` extension, which can then be burned onto an optical disc or used as a virtual disc image.
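As a simplified illustration of the kind of name mangling involved, here is a hypothetical Python helper, not any particular tool's algorithm; real mastering tools also de-duplicate colliding names (for example, with numeric suffixes).

```python
import re

def to_level1(name):
    """Mangle a file name into ISO 9660 Level 1 (8.3) form (simplified)."""
    stem, _, ext = name.rpartition(".") if "." in name else (name, "", "")
    clean = lambda s: re.sub(r"[^A-Z0-9_]", "_", s.upper())
    stem = clean(stem)[:8] or "_"
    ext = clean(ext)[:3]
    # Level 1 identifiers end with a ";1" version number.
    return f"{stem}.{ext};1" if ext else f"{stem};1"

print(to_level1("My Photo.jpeg"))  # -> MY_PHOTO.JPE;1
```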
To read an ISO 9660 formatted disc, the operating system or a dedicated software application starts by examining the Volume Descriptors to determine the disc's structure and characteristics. It then uses the Path Table and Directory Records to navigate the file system hierarchy and locate specific files or directories. When a file is accessed, the system reads the appropriate logical blocks from the disc based on the extent location and length recorded in the file's Directory Record.
The ISO 9660 format has been widely adopted and is still commonly used for distributing software, multimedia content, and archival data on optical discs. Its simplicity, compatibility, and robustness have contributed to its longevity, even as newer optical disc formats and file systems have emerged.
Despite its age, the ISO 9660 standard remains relevant in modern computing. Many software applications and operating systems, including Windows, macOS, and Linux, continue to support the format natively. Additionally, ISO images are frequently used for distributing operating system installation files, software packages, and virtual machine disk images, as they provide a convenient and platform-independent method for storing and transferring data.
In conclusion, the ISO 9660 format has played a crucial role in standardizing the file system structure for optical discs, enabling cross-platform compatibility and facilitating the distribution of digital content. Its extensions, such as Joliet and Rock Ridge, have added support for longer file names, additional metadata, and UNIX-specific attributes. Although optical discs have largely been superseded by other storage media and network-based distribution methods, the ISO 9660 format remains a reliable and widely supported standard for archiving and exchanging data.
As technology continues to evolve, the ISO 9660 format may eventually be replaced by newer, more advanced file systems designed for high-capacity optical discs or other storage media. However, its impact on the history of computing and its role in establishing a standardized approach to cross-platform data exchange will not be forgotten. The ISO 9660 format serves as a testament to the importance of interoperability and the benefits of industry-wide collaboration in developing and adopting standards.
File compression reduces redundancy so the same information takes fewer bits. The upper bound on how far you can go is governed by information theory: for lossless compression, the limit is the entropy of the source (see Shannon’s source coding theorem and his original 1948 paper “A Mathematical Theory of Communication”). For lossy compression, the trade-off between rate and quality is captured by rate–distortion theory.
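For a feel of what that limit means, here is a short Python sketch computing the order-0 entropy of a byte string, i.e., the bound for a model that ignores context:

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy: the bound for a memoryless model.

    Real compressors exploit context, so they can beat this figure
    on structured data, but never (on average) the true source entropy.
    """
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(entropy_bits_per_byte(b"aaaaabbbc"))  # ~1.35 bits/byte vs. 8 raw
```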
Most compressors have two stages. First, a model predicts or exposes structure in the data. Second, a coder turns those predictions into near-optimal bit patterns. A classic modeling family is Lempel–Ziv: LZ77 (1977) and LZ78 (1978) detect repeated substrings and emit references instead of raw bytes. On the coding side, Huffman coding (see Huffman's original 1952 paper) assigns shorter codes to more likely symbols. Arithmetic coding and range coding are finer-grained alternatives that squeeze closer to the entropy limit, while modern Asymmetric Numeral Systems (ANS) achieves similar compression with fast table-driven implementations.
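As a toy illustration of the coding stage, the following sketch builds a Huffman code table with Python's heapq. Real coders such as DEFLATE's use canonical, length-limited codes, which are omitted here.

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict:
    """Build a Huffman code table (symbol -> bit string) for `data`."""
    # Heap entries: (weight, tiebreak, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""})
            for i, (sym, w) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Prefix the two subtrees' codes with 0 and 1, then merge.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

table = huffman_code(b"abracadabra")
print(sorted(table.items()))  # frequent 'a' gets the shortest code
```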
DEFLATE (used by gzip, zlib, and ZIP) combines LZ77 with Huffman coding. Its specs are public: DEFLATE RFC 1951, zlib wrapper RFC 1950, and gzip file format RFC 1952. Gzip is framed for streaming and explicitly does not attempt to provide random access. PNG images standardize DEFLATE as their only compression method (with a max 32 KiB window), per the PNG spec “Compression method 0… deflate/inflate… at most 32768 bytes” and W3C/ISO PNG 2nd Edition.
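Both framings are in Python's standard library, so a quick demonstration (sizes are approximate and input-dependent):

```python
import gzip, zlib

text = b"the quick brown fox " * 200   # highly redundant input

raw = zlib.compress(text, level=9)     # DEFLATE in the zlib wrapper (RFC 1950)
gz = gzip.compress(text)               # DEFLATE in gzip framing (RFC 1952)

print(len(text), len(raw), len(gz))    # e.g., 4000 -> a few dozen bytes
assert zlib.decompress(raw) == text
assert gzip.decompress(gz) == text
```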
Zstandard (zstd): a newer general-purpose compressor designed for high ratios with very fast decompression. The format is documented in RFC 8878 (also HTML mirror) and the reference spec on GitHub. Like gzip, the basic frame doesn’t aim for random access. One of zstd’s superpowers is dictionaries: shared context trained from samples of your corpus that dramatically improves compression of many tiny or similar files (see python-zstandard dictionary docs and Nigel Tao’s worked example). Implementations accept both “unstructured” and “structured” dictionaries (discussion).
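A hedged sketch using the third-party python-zstandard package (pip install zstandard); the sample corpus below is made up for illustration:

```python
import zstandard  # third-party: pip install zstandard

# Hypothetical corpus of many small, similar records.
samples = [b'{"user": %d, "status": "ok"}' % i for i in range(1000)]

# Train a small shared dictionary from the samples...
dictionary = zstandard.train_dictionary(16 * 1024, samples)

# ...then compress a record with and without it.
plain = zstandard.ZstdCompressor()
with_dict = zstandard.ZstdCompressor(dict_data=dictionary)

record = samples[0]
print(len(plain.compress(record)), len(with_dict.compress(record)))
# The dictionary variant is typically much smaller on tiny inputs.
```

Note that the same dictionary must be supplied at decompression time (e.g., zstandard.ZstdDecompressor(dict_data=dictionary)).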
Brotli: optimized for web content (e.g., WOFF2 fonts, HTTP). It mixes a static dictionary with a DEFLATE-like LZ+entropy core. The spec is RFC 7932, which also notes a sliding window of 2^WBITS − 16 bytes, with WBITS in [10, 24] (1 KiB − 16 B up to 16 MiB − 16 B), and that it does not attempt random access. Brotli often beats gzip on web text while decoding quickly.
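A small example with the third-party brotli bindings (pip install brotli); the input is a stand-in for typical web text:

```python
import brotli  # third-party: pip install brotli

html = b"<p>hello world</p>" * 500
# quality is in [0, 11]; lgwin is WBITS in [10, 24] (window = 2**lgwin - 16).
small = brotli.compress(html, quality=11, lgwin=24)
print(len(html), len(small))
assert brotli.decompress(small) == html
```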
ZIP container: ZIP is a file archive that can store entries with various compression methods (deflate, store, zstd, etc.). The de facto standard is PKWARE’s APPNOTE (see APPNOTE portal, a hosted copy, and LC overviews ZIP File Format (PKWARE) / ZIP 6.3.3).
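Python's zipfile module makes the per-entry method choice visible; the archive name and contents below are placeholders:

```python
import zipfile

with zipfile.ZipFile("bundle.zip", "w") as zf:  # hypothetical archive name
    # Per-entry methods: already-compressed data is best stored as-is.
    zf.writestr("notes.txt", "plain text " * 1000,
                compress_type=zipfile.ZIP_DEFLATED)
    zf.writestr("photo.jpg", b"\xff\xd8...already compressed...",
                compress_type=zipfile.ZIP_STORED)
    zf.writestr("logs.txt", "log line\n" * 1000,
                compress_type=zipfile.ZIP_LZMA)

with zipfile.ZipFile("bundle.zip") as zf:
    for info in zf.infolist():
        print(info.filename, info.compress_type, info.compress_size)
```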
LZ4 targets raw speed with modest ratios. See its project page (“extremely fast compression”) and frame format. It’s ideal for in-memory caches, telemetry, or hot paths where decompression must be near RAM speed.
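A minimal sketch with the third-party lz4 bindings (pip install lz4):

```python
import lz4.frame  # third-party: pip install lz4

payload = b"metric=42 host=db1\n" * 10_000  # made-up telemetry lines
packed = lz4.frame.compress(payload)
print(len(payload), len(packed))
assert lz4.frame.decompress(packed) == payload
```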
XZ / LZMA push for density (great ratios) with relatively slow compression. XZ is a container; the heavy lifting is typically LZMA/LZMA2 (LZ77-like modeling + range coding). See .xz file format, the LZMA spec (Pavlov), and Linux kernel notes on XZ Embedded. XZ usually out-compresses gzip and often competes with high-ratio modern codecs, but with slower encode times.
bzip2 applies the Burrows–Wheeler Transform (BWT), move-to-front, RLE, and Huffman coding. It’s typically smaller than gzip but slower; see the official manual and man pages (Linux).
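To compare the three codecs available in Python's standard library on your own data (the input path is a placeholder; results vary by corpus):

```python
import bz2, lzma, time, zlib
from pathlib import Path

data = Path("/usr/share/dict/words").read_bytes()  # any sizable text file

for name, compress in [("zlib", zlib.compress),
                       ("bz2", bz2.compress),
                       ("xz/lzma", lzma.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:8} {len(out):>9} bytes  {elapsed:.2f}s")
# Typical pattern: xz smallest but slowest, zlib fastest but largest.
```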
“Window size” matters. DEFLATE references can only look back 32 KiB (RFC 1951 and PNG’s 32 KiB cap noted in its spec). Brotli’s window ranges from about 1 KiB to 16 MiB (RFC 7932). Zstd tunes window and search depth by level (RFC 8878). Basic gzip/zstd/brotli streams are designed for sequential decoding; the base formats don’t promise random access, though containers (e.g., tar indexes, chunked framing, or format-specific indexes) can layer it on.
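The window limit is easy to demonstrate: repeat a block at a distance beyond 32 KiB and DEFLATE cannot reference it, while LZMA's much larger dictionary can. A sketch using only the standard library:

```python
import lzma, os, zlib

# Two copies of 40 KiB of random bytes: the repeat lies 40 KiB back,
# beyond DEFLATE's 32 KiB window but well within LZMA's dictionary.
block = os.urandom(40 * 1024)
data = block * 2

print("raw: ", len(data))
print("zlib:", len(zlib.compress(data, 9)))  # ~no gain: match out of reach
print("lzma:", len(lzma.compress(data)))     # ~half: the repeat is found
```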
The formats above are lossless: you can reconstruct exact bytes. Media codecs are often lossy: they discard imperceptible detail to hit lower bitrates. In images, classic JPEG (DCT, quantization, entropy coding) is standardized in ITU-T T.81 / ISO/IEC 10918-1. In audio, MP3 (MPEG-1 Layer III) and AAC (MPEG-2/4) rely on perceptual models and MDCT transforms (see ISO/IEC 11172-3, ISO/IEC 13818-7, and an MDCT overview here). Lossy and lossless can coexist (e.g., PNG for UI assets; Web codecs for images/video/audio).
Theory: Shannon 1948 · Rate–distortion · Coding: Huffman 1952 · Arithmetic coding · Range coding · ANS. Formats: DEFLATE · zlib · gzip · Zstandard · Brotli · LZ4 frame · XZ format. BWT stack: Burrows–Wheeler (1994) · bzip2 manual. Media: JPEG T.81 · MP3 ISO/IEC 11172-3 · AAC ISO/IEC 13818-7 · MDCT.
Bottom line: choose a compressor that matches your data and constraints, measure on real inputs, and don’t forget the gains from dictionaries and smart framing. With the right pairing, you can get smaller files, faster transfers, and snappier apps — without sacrificing correctness or portability.
File compression is a process that reduces the size of a file or files, typically to save storage space or speed up transmission over a network.
File compression works by identifying and removing redundancy in the data. It uses algorithms to encode the original data in a smaller space.
The two primary types of file compression are lossless and lossy compression. Lossless compression allows the original file to be perfectly restored, while lossy compression enables more significant size reduction at the cost of some loss in data quality.
A popular example of a file compression tool is WinZip, which can create ZIP archives and extract a variety of other formats, including RAR.
With lossless compression, the quality remains unchanged. However, with lossy compression, there can be a noticeable decrease in quality since it eliminates less-important data to reduce file size more significantly.
In terms of data integrity, file compression is safe, especially with lossless compression. However, like any files, compressed files can be targeted by malware or viruses, so it's always important to have reputable security software in place.
Almost all types of files can be compressed, including text files, images, audio, video, and software files. However, the level of compression achievable can significantly vary between file types.
A ZIP file is a type of file format that uses lossless compression to reduce the size of one or more files. Multiple files in a ZIP file are effectively bundled together into a single file, which also makes sharing easier.
An already-compressed file can technically be compressed again, although the additional size reduction is usually minimal or even counterproductive: the second pass may increase the size because of the metadata (headers) each compression format adds.
To decompress a file, you typically need a decompression or unzipping tool, like WinZip or 7-Zip. These tools can extract the original files from the compressed format.