How does file compression work?

  • Thread starter Sharky.
  • 16 comments
  • 3,908 views

Sharky.

Simple question, no doubt a very not-simple answer :lol:

How does compression of a file work, and why are some compressed formats wildly better than others? 7-Zip compressing to its .7z format is stupidly better than Windows compressing to a .zip archive, for a start.

An example provided by 7-Zip itself is compressing a post-install GIMP 1.2.4 directory: 127 subfolders, 1,304 files, ~27MB total. Going to .7z produces the smallest archive (~5.5MB); WinRAR 3.10's archive is 10% larger, and 7-Zip in .zip mode is 74% larger :scared:

Also, which types of files benefit the most from being compressed? 7-Zip on ultra only managed a 94% ratio (the archive was still 94% of the original size) when compressing 16 mp3* files (total size 77MB) =/

(*mp3 v wma [lossless mode]... which is *best*? But that's for another time :lol: )

Also, you probably figured I like 7-Zip, haha
 
Files like jpgs and mp3s are already compressed, so you can't compress the compressed much. On the other hand, an iso that I downloaded (legally, don't worry) unzipped to 450 megs but compressed down to 12.

As for why one works better than another, the same can be said for jpg, png, gif, bmp, etc. (using image files as an example), for which I have no answer. :lol:

I'd imagine it's all in the coding and how efficient it is.
 
TB
Files like jpgs and mp3s are already compressed, so you can't compress the compressed much. On the other hand, an iso that I downloaded (legally, don't worry) unzipped to 450 megs but compressed down to 12.
Ah, right. I get what you mean about .iso etc - I just 7z'd an iso of GT2 GT mode (don't worry also, it's my own copy) from 677MB to 425MB. I expected better compression, mind. :lol:
 
It works by recognizing patterns in the binary composition of a file, and how often those patterns repeat. For example, if a file contains 20 copies of the string 011010, the archiver saves only one copy of the string and replaces the rest with some unique identifying marker. Decompression, then, simply involves replacing each marker with its associated string. Compression standards and programs vary in effectiveness largely according to how well their pattern-matching algorithms can recognize longer strings.
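To make that concrete, here's a toy Python sketch (my own illustration, nothing like a real archiver's on-disk format): the "dictionary" is a single known pattern, and the marker is a byte we assume never appears in the data.

# Toy "pattern -> marker" substitution. Real archivers discover the
# patterns automatically and pack markers into a few bits, but the idea
# is the same: store the repeated string once, reference it cheaply.
PATTERN = "011010"
MARKER = "\x01"  # assumption: this byte never occurs in the input

def compress(data: str) -> str:
    assert MARKER not in data
    return data.replace(PATTERN, MARKER)

def decompress(blob: str) -> str:
    return blob.replace(MARKER, PATTERN)

original = PATTERN * 20
packed = compress(original)
assert decompress(packed) == original
# 120 characters in; 20 one-byte markers plus one stored copy of the pattern out.
print(len(original), "->", len(packed) + len(PATTERN))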
 
That was very enlightening, thanks Jordan! I was expecting the answer to be a lot more complex than that 👍
 
Jordan's explanation shows why my iso was so small - it was essentially 32 copies of the same file. Thanks from me, too, Jordan! 👍
 
And then there's lossless vs. lossy. Data files must have lossless compression: you can't change the data, it has to be identical when reconstructed. ZIPs and RARs are like that. Music, video, and images, though, often use lossy compression, throwing away perceptually unimportant information rather than keeping every last bit describing every last pixel. An image might be compressed by saying something like, "all of these pixels are like that one," instead of keeping exact color levels for every pixel. In motion compression like mpg, it might say, "this area of the screen has not changed," so it doesn't actually have to store that data; it just instructs the decoder to keep what it already knows. That's why TIFFs (typically uncompressed) are so much larger than JPGs. Uncompressed video is even worse, running into gigabytes for five minutes or so.
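Here's a toy Python sketch (mine, and grossly simplified) of that "this area of the screen has not changed" idea: store only the pixels that differ from the previous frame. Strictly speaking this particular trick is lossless; real codecs pair it with quantization, which is where the loss actually comes in.

# Toy inter-frame coding: record only (index, value) pairs for pixels
# that changed since the previous frame.
def encode_delta(prev: list[int], cur: list[int]) -> list[tuple[int, int]]:
    return [(i, v) for i, (p, v) in enumerate(zip(prev, cur)) if p != v]

def apply_delta(prev: list[int], delta: list[tuple[int, int]]) -> list[int]:
    out = prev[:]
    for i, v in delta:
        out[i] = v
    return out

frame1 = [7] * 100        # a flat 10x10 "image", flattened to one list
frame2 = frame1[:]
frame2[42] = 9            # one pixel changes between frames
delta = encode_delta(frame1, frame2)
print(delta)              # [(42, 9)] -- one entry instead of 100 pixels
assert apply_delta(frame1, delta) == frame2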
 
And that's why it's a good idea to keep a RAW copy of a file you're photoshopping... after a few rounds of editing and re-saving as JPG, it can start to look pretty lousy...
 
Being the crazy guy that I am, I decided to see just how extreme compression can get. :lol:

Attached 1KB zip contains a 149KB 7z archive, which contains... a 1GB text file. :scared:
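For anyone wondering how a 1GB file squeezes into a 1KB zip: the file is almost pure repetition, which is exactly what pattern-matching compressors eat for breakfast. A quick demo with Python's zlib (exact numbers will vary with the data and library, but the scale of the effect won't):

import zlib

# ~3 MB of one repeated line; the same principle scales to the 1GB file.
text = b"42\n" * 1_000_000
packed = zlib.compress(text, 9)
print(len(text), "->", len(packed))  # a ratio of several hundred to one
assert zlib.decompress(packed) == text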
 

Attachments

  • file.zip
    1.1 KB · Views: 41
Being the crazy guy that I am, I decided to see just how extreme compression can get. :lol:

Attached 1KB zip contains a 149KB 7z archive, which contains... a 1GB text file. :scared:

He speaks the truth! :crazy:

(waits for meaning of life.txt to open)

EDIT:... Guess you have to have a good computer to open that file 👎 :)
 
It works by recognizing patterns in the binary composition of a file, and how often those patterns repeat. For example, if a file contains 20 copies of the string 011010, the archiver saves only one copy of the string and replaces the rest with some unique identifying marker. Decompression, then, simply involves replacing each marker with its associated string. Compression standards and programs vary in effectiveness largely according to how well their pattern-matching algorithms can recognize longer strings.

Which is why anything by Microsoft probably fails to recognise any string more complex than 010. Most of their programs can't cope with that kind of stuff :D

So many times I get a file zipped with WinRAR and it's just screwed up. 7-Zip is the king.
 
EDIT:... Guess you have to have a good computer to open that file 👎 :)
LOL I wouldn't try to open it, it'll kill whatever program you open it with :P

Took a good 5-10 minutes to create the file (by way of a VB.NET app I wrote in five minutes), even on this fairly quick machine :lol:
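For the curious, a rough Python equivalent of what that little app does (the original was VB.NET, and the filename and line content here are placeholders, not what was actually in the archive):

# Careful: this really does write ~1 GB to disk.
line = b"0123456789\n"
chunk = line * 100_000               # ~1.1 MB per write keeps it fast
with open("big.txt", "wb") as f:     # placeholder filename
    written = 0
    while written < 1_000_000_000:
        f.write(chunk)
        written += len(chunk)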
 
LOL I wouldn't try to open it, it'll kill whatever program you open it with :P

Took a good 5-10 minutes to create the file (by way of a VB.NET app I wrote in five minutes), even on this fairly quick machine :lol:

Ahh, didn't think my MSI Wind had the "muscle" to open it :(
 
Which is why anything by Microsoft probably fails to recognise any string more complex than 010. Most of their programs can't cope with that kind of stuff :D
Matching longer strings requires considerably more processing power. While those of us with faster machines may not mind waiting a few more seconds for a smaller zipped file, it will take slower computers much longer. Microsoft's engineers have to consider the wide variety of machines that will be running their software, forcing them to find a reasonable compromise that will be acceptable for the majority of users.
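You can watch that speed-versus-size trade-off directly with the compression levels in Python's zlib (a rough sketch; exact figures depend entirely on the data and the machine):

import random
import time
import zlib

random.seed(0)
# Test data with plenty of repeated substrings for the matcher to find.
words = [b"pattern", b"marker", b"string", b"archive", b"binary"]
data = b" ".join(random.choice(words) for _ in range(300_000))

# Higher levels search harder for longer matches: smaller output, more CPU.
for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    print(f"level {level}: {len(data)} -> {len(out)} bytes "
          f"in {time.perf_counter() - t0:.3f}s")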
 
It works by recognizing patterns in the binary composition of a file, and how often those patterns repeat. For example, if a file contains 20 copies of the string 011010, the archiver saves only one copy of the string and replaces the rest with some unique identifying marker. Decompression, then, simply involves replacing each marker with its associated string. Compression standards and programs vary in effectiveness largely according to how well their pattern-matching algorithms can recognize longer strings.

Is it then possible to code markers recognizing patterns of markers? :dopey:
 
Is it then possible to code markers recognizing patterns of markers? :dopey:

Yes, which is where the differences in compression algorithms come in. Check out this Wikipedia article and its links for some good technical detail.

With lossy compression, the encoder is much more complex, and usually runs through the source data multiple times, stripping out a lot of the unique (and thus incompressible through pattern marking) data. Back to Wikipedia for this description of the JPG compression algorithm. When you read through it you'll understand why it gets such good compression ratios, and why JPG files can barely be compressed any further.
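That last point is easy to verify: run already-compressed output through a compressor again and it saves essentially nothing (a quick zlib sketch; a JPG or mp3 would behave the same way):

import zlib

text = b"the quick brown fox jumps over the lazy dog " * 50_000
once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)
# The first pass removes the repetition; what's left looks random,
# so a second pass gains nothing (it can even grow slightly).
print(len(text), "->", len(once), "->", len(twice))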

Sorry to be linking to Wikipedia, but these sorts of articles are usually well-enough peer-reviewed and free from editorial opinion to be factually correct.
 