How does file compression work?

  • Thread starter Sharky.
  • 16 comments
  • 3,908 views

Sharky.

Simple question, no doubt a very not-simple answer :lol:

How does compression of a file work, and why are some compressed formats wildly better than others? 7-Zip compressing to its .7z format is stupidly better than Windows compressing to a .zip archive, for a start.

An example provided by 7-Zip itself is compressing a post-install GIMP 1.2.4 directory: 127 subfolders, 1,304 files, ~27MB total. Going to .7z produces the smallest archive (~5.5MB); WinRAR 3.10's archive is 10% larger, and 7-Zip in .zip mode is 74% larger :scared:

Also, which types of files benefit the most from being compressed? 7-Zip on ultra only managed a 94% ratio (the archive was still 94% of the original size) when compressing 16 mp3* files (total size 77MB) =/

(*mp3 v wma [lossless mode]... which is *best*? But that's for another time :lol: )

Also, you probably figured I like 7-Zip, haha
 
Files like jpgs and mp3s are already compressed, so you can't compress the compressed much. On the other hand, an iso that I downloaded (legally, don't worry) unzipped to 450 megs but compressed down to 12.

As for why one works better than another, the same can be said for jpg, png, gif, bmp, etc. (using image files as an example), for which I have no answer. :lol:

I'd imagine it's all in the coding and how efficient it is.
 
TB
Files like jpgs and mp3s are already compressed, so you can't compress the compressed much. On the other hand, an iso that I downloaded (legally, don't worry) unzipped to 450 megs but compressed down to 12.
Ah, right. I get what you mean about .iso etc - I just 7z'd an iso of GT2 GT mode (don't worry also, it's my own copy) from 677MB to 425MB. I expected better compression, mind. :lol:
 
It works by recognizing patterns in the binary composition of a file, and how often those patterns repeat. For example, if a file contains 20 copies of the string 011010, the archiver saves only one copy of the string and replaces the rest with some unique identifying marker. Decompression, then, simply involves replacing each marker with its associated string. Compression standards and programs vary in effectiveness largely according to how well their pattern-matching algorithms can recognize longer strings.
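To make that concrete, here's a toy Python sketch (my own illustration, nothing like a real archiver's on-disk format): the "dictionary" is a single known pattern, and the marker is a byte we assume never appears in the data.

# Toy "pattern -> marker" substitution. Real archivers discover the
# patterns automatically and pack markers into a few bits, but the idea
# is the same: store the repeated string once, reference it cheaply.
PATTERN = "011010"
MARKER = "\x01"  # assumption: this byte never occurs in the input

def compress(data: str) -> str:
    assert MARKER not in data
    return data.replace(PATTERN, MARKER)

def decompress(blob: str) -> str:
    return blob.replace(MARKER, PATTERN)

original = PATTERN * 20
packed = compress(original)
assert decompress(packed) == original
# 120 characters in; 20 one-byte markers plus one stored copy of the pattern out.
print(len(original), "->", len(packed) + len(PATTERN))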
 
That was very enlightening, thanks Jordan! I was expecting the answer to be a lot more complex than that 👍
 
Jordan's explanation shows why my iso was so small - it was essentially 32 copies of the same file. Thanks from me, too, Jordan! 👍
 
And then there's lossless vs. lossy. Data files must have lossless compression: you can't change the data, it has to be identical when reconstructed. ZIPs and RARs are like that. Music, video, and images, though, often use lossy compression, throwing away perceptually unimportant information rather than keeping every last bit describing every last pixel. An image might be compressed by saying something like, "all of these pixels are like that one," instead of keeping exact color levels for every pixel. In motion compression like mpg, it might say, "this area of the screen has not changed," so it doesn't actually have to store that data; it just instructs the decoder to keep what it already knows. That's why TIFFs (typically uncompressed) are so much larger than JPGs. Uncompressed video is even worse, running into gigabytes for five minutes or so.
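Here's a toy Python sketch (mine, and grossly simplified) of that "this area of the screen has not changed" idea: store only the pixels that differ from the previous frame. Strictly speaking this particular trick is lossless; real codecs pair it with quantization, which is where the loss actually comes in.

# Toy inter-frame coding: record only (index, value) pairs for pixels
# that changed since the previous frame.
def encode_delta(prev: list[int], cur: list[int]) -> list[tuple[int, int]]:
    return [(i, v) for i, (p, v) in enumerate(zip(prev, cur)) if p != v]

def apply_delta(prev: list[int], delta: list[tuple[int, int]]) -> list[int]:
    out = prev[:]
    for i, v in delta:
        out[i] = v
    return out

frame1 = [7] * 100        # a flat 10x10 "image", flattened to one list
frame2 = frame1[:]
frame2[42] = 9            # one pixel changes between frames
delta = encode_delta(frame1, frame2)
print(delta)              # [(42, 9)] -- one entry instead of 100 pixels
assert apply_delta(frame1, delta) == frame2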
 
And that's why it's a good idea to keep a RAW copy of a file you're photoshopping... after a few rounds of editing and re-saving as JPG, it can start to look pretty lousy...
 
Being the crazy guy that I am, I decided to see just how extreme compression can get. :lol:

Attached 1KB zip contains a 149KB 7z archive, which contains... a 1GB text file. :scared:
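For anyone wondering how a 1GB file squeezes into a 1KB zip: the file is almost pure repetition, which is exactly what pattern-matching compressors eat for breakfast. A quick demo with Python's zlib (exact numbers will vary with the data and library, but the scale of the effect won't):

import zlib

# ~3 MB of one repeated line; the same principle scales to the 1GB file.
text = b"42\n" * 1_000_000
packed = zlib.compress(text, 9)
print(len(text), "->", len(packed))  # a ratio of several hundred to one
assert zlib.decompress(packed) == text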
 

Attachments

  • file.zip
    1.1 KB · Views: 41
Being the crazy guy that I am, I decided to see just how extreme compression can get. :lol:

Attached 1KB zip contains a 149KB 7z archive, which contains... a 1GB text file. :scared:

He speaks the truth! :crazy:

(waits for meaning of life.txt to open)

EDIT:... Guess you have to have a good computer to open that file 👎 :)
 
It works by recognizing patterns in the binary composition of a file, and how often those patterns repeat. For example, if a file contains 20 copies of the string 011010, the archiver saves only one copy of the string and replaces the rest with some unique identifying marker. Decompression, then, simply involves replacing each marker with its associated string. Compression standards and programs vary in effectiveness largely according to how well their pattern-matching algorithms can recognize longer strings.

Which is why anything by Microsoft probably fails to recognise any string more complex than 010. Most of their programs can't cope with that kind of stuff :D

So many times I get a file zipped with WinRAR and it's just screwed up. 7-Zip is the king.
 
EDIT:... Guess you have to have a good computer to open that file 👎 :)
LOL I wouldn't try to open it, it'll kill whatever program you open it with :P

Took a good 5-10 minutes to create the file (by way of a VB.NET app I wrote in five minutes), even on this fairly quick machine :lol:
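For the curious, a rough Python equivalent of what that little app does (the original was VB.NET, and the filename and line content here are placeholders, not what was actually in the archive):

# Careful: this really does write ~1 GB to disk.
line = b"0123456789\n"
chunk = line * 100_000               # ~1.1 MB per write keeps it fast
with open("big.txt", "wb") as f:     # placeholder filename
    written = 0
    while written < 1_000_000_000:
        f.write(chunk)
        written += len(chunk)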
 
LOL I wouldn't try to open it, it'll kill whatever program you open it with :P

Took a good 5-10 minutes to create the file (by way of a VB.NET app I wrote in five minutes), even on this fairly quick machine :lol:

Ahh, didn't think my MSI Wind had the "muscle" to open it :(
 
Which is why anything by Microsoft probably fails to recognise any string more complex than 010. Most of their programs can't cope with that kind of stuff :D
Matching longer strings requires considerably more processing power. While those of us with faster machines may not mind waiting a few more seconds for a smaller zipped file, it will take slower computers much longer. Microsoft's engineers have to consider the wide variety of machines that will be running their software, forcing them to find a reasonable compromise that will be acceptable for the majority of users.
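You can watch that speed-versus-size trade-off directly with the compression levels in Python's zlib (a rough sketch; exact figures depend entirely on the data and the machine):

import random
import time
import zlib

random.seed(0)
# Test data with plenty of repeated substrings for the matcher to find.
words = [b"pattern", b"marker", b"string", b"archive", b"binary"]
data = b" ".join(random.choice(words) for _ in range(300_000))

# Higher levels search harder for longer matches: smaller output, more CPU.
for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    print(f"level {level}: {len(data)} -> {len(out)} bytes "
          f"in {time.perf_counter() - t0:.3f}s")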
 
It works by recognizing patterns in the binary composition of a file, and how often those patterns repeat. For example, if a file contains 20 copies of the string 011010, the archiver saves only one copy of the string and replaces the rest with some unique identifying marker. Decompression, then, simply involves replacing each marker with its associated string. Compression standards and programs vary in effectiveness largely according to how well their pattern-matching algorithms can recognize longer strings.

Is it then possible to code markers recognizing patterns of markers? :dopey:
 
Is it then possible to code markers recognizing patterns of markers? :dopey:

Yes, which is where the differences in compression algorithms come in. Check out this Wikipedia article and its links for some good technical detail.

With lossy compression, the encoder is much more complex, and usually runs through the source data multiple times, stripping out a lot of the unique (and thus incompressible through pattern marking) data. Back to Wikipedia for this description of the JPG compression algorithm. When you read through it you'll understand why it gets such good compression ratios, and why JPG files can barely be compressed any further.
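That last point is easy to verify: run already-compressed output through a compressor again and it saves essentially nothing (a quick zlib sketch; a JPG or mp3 would behave the same way):

import zlib

text = b"the quick brown fox jumps over the lazy dog " * 50_000
once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)
# The first pass removes the repetition; what's left looks random,
# so a second pass gains nothing (it can even grow slightly).
print(len(text), "->", len(once), "->", len(twice))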

Sorry to be linking to Wikipedia, but these sorts of articles are usually well-enough peer-reviewed and free from editorial opinion to be factually correct.
 