Small strings compression library

SMAZ - compression for very small strings
-----------------------------------------

Smaz is a simple compression library suitable for compressing very short
strings. General purpose compression libraries build the state needed
for compressing data dynamically, in order to be able to compress every kind
of data. This is a very good idea, but it fails for one specific problem:
a very small string gives the compressor too little input to build that
state, so it will not compress.

Smaz, on the other hand, is not good at compressing general purpose data, but
it can compress text by 40-50% in the average case (it works better with
English), and it is able to perform a bit of compression for HTML and URLs as
well. The important point is that Smaz is able to compress even strings of two
or three bytes!

For example the string "the" is compressed into a single byte.

To compare this with other libraries, consider that a library like zlib will usually not be able to compress text shorter than 100 bytes.

COMPRESSION EXAMPLES
--------------------

'This is a small string' compressed by 50%
'foobar' compressed by 34%
'the end' compressed by 58%
'not-a-g00d-Exampl333' enlarged by 15%
'Smaz is a simple compression library' compressed by 39%
'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49%
'this is an example of what works very well with smaz' compressed by 49%
'1000 numbers 2000 will 10 20 30 compress very little' compressed by 10%
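
The percentages above appear to be the share of bytes saved relative to the input length. A minimal sketch of that arithmetic in C, assuming truncating integer division; `saved_percent` is a hypothetical helper, not part of the Smaz API:

```c
#include <assert.h>

/* Percentage of bytes saved by compression, using truncating integer
 * arithmetic (an assumption about how the figures above were computed).
 * A negative result means the output is larger than the input. */
static int saved_percent(int inlen, int outlen)
{
    assert(inlen > 0 && outlen >= 0);
    return 100 - (100 * outlen) / inlen;
}
```

Under that assumption, a 6-byte input such as 'foobar' reported as compressed by 34% would correspond to 4 output bytes, and a negative result corresponds to an "enlarged by" line. These byte counts are inferred from the percentages, not stated in the README.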

In general, lowercase English will work very well. It compresses poorly when
the strings contain a lot of numbers. Other languages are compressed pretty
well too. The following is Italian, which is not very similar to English but
still compressible by smaz:

'Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura' compressed by 33%
'Mi illumino di immenso' compressed by 37%
'L'autore di questa libreria vive in Sicilia' compressed by 28%

It can compress URLs pretty well:

'http://google.com' compressed by 59%
'http://programming.reddit.com' compressed by 52%
'http://github.com/antirez/smaz/tree/master' compressed by 46%

USAGE
-----

The lib consists of just two functions:

    int smaz_compress(char *in, int inlen, char *out, int outlen);

Compress the buffer 'in' of length 'inlen' and put the compressed data into
'out' of max length 'outlen' bytes. If the output buffer is too short to hold
the whole compressed string, outlen+1 is returned. Otherwise the length of the
compressed string (less than or equal to outlen) is returned.

    int smaz_decompress(char *in, int inlen, char *out, int outlen);

Decompress the buffer 'in' of length 'inlen' and put the decompressed data into
'out' of max length 'outlen' bytes. If the output buffer is too short to hold
the whole decompressed string, outlen+1 is returned. Otherwise the length of the
decompressed string (less than or equal to outlen) is returned. This function
will not automatically put a NUL terminator at the end of the string if the
string that was originally compressed didn't include one.
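
Since both functions report a too-small buffer by returning outlen+1, a caller that does not know the output size in advance can retry with a larger buffer. A minimal sketch of that pattern, assuming only the signature and return convention described above; `codec_fn`, `run_codec`, and the stand-in `copy_codec` are hypothetical names introduced so the example compiles without smaz.c:

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as smaz_compress()/smaz_decompress() as declared above. */
typedef int (*codec_fn)(char *in, int inlen, char *out, int outlen);

/* Hypothetical stand-in for a real codec: copies the input unchanged and
 * follows the same "return outlen+1 when the buffer is too small" rule. */
static int copy_codec(char *in, int inlen, char *out, int outlen)
{
    if (inlen > outlen) return outlen + 1;
    memcpy(out, in, (size_t)inlen);
    return inlen;
}

/* Run the codec with a growing heap buffer until the result fits.
 * Returns a malloc'd buffer that this helper NUL-terminates itself and
 * stores the result length in *len. Returns NULL on allocation failure
 * or if the buffer cannot grow any further. The caller frees the result. */
static char *run_codec(codec_fn codec, char *in, int inlen, int *len)
{
    int cap = 16;
    int n;
    char *buf;

    assert(codec != NULL && in != NULL && len != NULL && inlen >= 0);
    for (;;) {
        buf = malloc((size_t)cap + 1);      /* +1 for our own terminator */
        if (buf == NULL) return NULL;
        n = codec(in, inlen, buf, cap);
        if (n <= cap) {                     /* result fit in the buffer */
            buf[n] = '\0';
            *len = n;
            return buf;
        }
        free(buf);                          /* got outlen+1: retry bigger */
        if (cap > INT_MAX / 2) return NULL;
        cap *= 2;
    }
}
```

With the real library one would pass `smaz_decompress` in place of `copy_codec`. The helper adds its own terminator because, as noted above, the library does not add one.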


CREDITS
-------

Smaz was written by Salvatore Sanfilippo and is released under the BSD license. Check the COPYING file for more information.
Comments
  • GitHub repository description claims this is an “encryption library”

    The GitHub repository description is…

    Small strings encryption library

    …when, going by the README, smaz is a compression library, not an encryption library.

    opened by 8573 1
  • Aiding Smaz in further compressing repeating characters

    Ciao Salvatore,

    I'm crossposting this here as I think it's better suited because you're the creator of this project.

    Smaz is wonderful as it's able to compress a short string (< 100 bytes) where other compression tools fail. But there is a problem with it: runs of repeated characters, which it doesn't optimize by itself.

    For example the string "this is a short string" compresses fine

    \x9b8\xac>\xbb\xf2>\xc3F
    

    It is 9 bytes long. But if you have a short string with repeating characters, you have a problem. For example, the string "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's" compresses into this:

    \x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe'\n
    

    It is still smaller, but the many "\x04" bytes look like a waste of space.

    I've been thinking about counting how often a letter repeats and replacing the run with a sort of "bookmark". For example, "aaaaaaaaaa" with ten "a" occurrences becomes "a//10".

    This is a test Python snippet I wrote off the top of my head; it is very ugly as of now:

    import re

    b = "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"

    # Replace every run of 3 or more identical characters with
    # "<char>//<run length>", e.g. "aaaa" becomes "a//4".
    c = re.sub(r"(.)\1{2,}", lambda m: m.group(1) + "//" + str(len(m.group(0))), b)

    print(c)
    

    It then becomes

    this is a string with many a//22's 
    

    Smaz compressed

    \x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\xc5\xc5\xff\x0222'\n
    

    My worry is: what if the string contains a URL? Is it safe to escape it with something like "//"? And then there are regex strings. How can it be escaped in that case?

    Finally, my clear and concise question is: how do you safely shorten repeating characters that Smaz doesn't compress by itself?

    opened by ghost 0
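
One common way to address the question above is a byte-level run-length pass with a reserved escape byte, applied to the Smaz output and undone before decompression, so that URLs or regex text in the original string never collide with a textual marker like "//". A sketch under those assumptions; this is not part of Smaz, and `RLE_ESC`, `RLE_MIN_RUN`, `rle_encode`, and `rle_decode` are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

#define RLE_ESC 0xF0     /* arbitrary marker byte chosen for this sketch */
#define RLE_MIN_RUN 4    /* shorter runs are cheaper to copy as-is */

/* Encode in[0..inlen) into out. Runs of RLE_MIN_RUN or more equal bytes
 * become the triple ESC, count, byte. A literal ESC byte is always written
 * as a triple too, so the decoder never has to guess.
 * Returns the bytes written, or outcap+1 if out is too small. */
static size_t rle_encode(const unsigned char *in, size_t inlen,
                         unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL);
    while (i < inlen) {
        unsigned char c = in[i];
        size_t run = 1;
        while (i + run < inlen && in[i + run] == c && run < 255) run++;
        if (run >= RLE_MIN_RUN || c == RLE_ESC) {
            if (o + 3 > outcap) return outcap + 1;
            out[o++] = RLE_ESC;
            out[o++] = (unsigned char)run;
            out[o++] = c;
            i += run;
        } else {
            if (o + 1 > outcap) return outcap + 1;
            out[o++] = c;
            i += 1;
        }
    }
    return o;
}

/* Inverse of rle_encode. Returns the bytes written, or outcap+1 if out is
 * too small or the input ends in the middle of a triple. */
static size_t rle_decode(const unsigned char *in, size_t inlen,
                         unsigned char *out, size_t outcap)
{
    size_t i = 0, o = 0, run, k;
    assert(in != NULL && out != NULL);
    while (i < inlen) {
        if (in[i] == RLE_ESC) {
            if (i + 2 >= inlen) return outcap + 1;   /* truncated triple */
            run = in[i + 1];
            if (o + run > outcap) return outcap + 1;
            for (k = 0; k < run; k++) out[o++] = in[i + 2];
            i += 3;
        } else {
            if (o + 1 > outcap) return outcap + 1;
            out[o++] = in[i++];
        }
    }
    return o;
}
```

Because the escape is handled on raw bytes rather than on text, the content of the original string does not matter. The cost is two extra bytes whenever the data happens to contain the escape byte itself.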
  • "warning: implicit declaration of function ‘random’" message when compiling

    When I compile, I get the following warning:

    $ make clean ; make
    rm -rf smaz_test
    gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
    smaz_test.c: In function ‘main’:
    smaz_test.c:58:18: warning: implicit declaration of function ‘random’ [-Wimplicit-function-declaration]
             ranlen = random() % 512;
    

    The program compiles and the tests pass, so it looks like this warning is non-critical.

    From looking around, I found an SO question/answer that offers a solution.

    Adding the -D_XOPEN_SOURCE=600 option to gcc in the Makefile fixes the issue for me:

    $ git diff
    diff --git a/Makefile b/Makefile
    index 62e8ccb..eecbac7 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -1,7 +1,7 @@
     all: smaz_test
     
     smaz_test: smaz_test.c smaz.c
    -       gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
    +       gcc -D_XOPEN_SOURCE=600 -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
     
     clean:
            rm -rf smaz_test
    

    My system information:

    $ gcc --version
    gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
    Copyright (C) 2015 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    
    opened by abetusk 0
  • codebook with the most frequent ngrams in language/s

    I know this guy ;) (from Redis). Did you hand-pick the codebook dictionary? How? Have you thought about using the most frequent n-grams in the language(s), e.g. the top (say 32) n-grams from Norvig's ngrams2,3,4,5,6,7,8,9.csv? How do you optimally pick them for minimum overlap and better compression rates? For example, "ation" and "tion" are the most common 5- and 4-letter n-grams respectively, and "tio" is the 6th most common 3-letter n-gram. I think you'd get much better compression rates.

    I want to test it, but couldn't find any docs. So what are these characters?

    static char *Smaz_cb[241] = {
    "\002s,\266", "\003had\232\002leW", "\003on \216", "", "\001yS",
    "\002ma\255\002li\227", "\003or \260", "", "\002ll\230\003s t\277",
    
    opened by wis 0
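
One plausible reading of the excerpt above, stated here as an assumption rather than documented behavior: each string in `Smaz_cb` is one hash slot holding zero or more entries, each laid out as a length byte, the fragment itself, then the output code byte. Under that reading `"\002s,\266"` would be the 2-byte fragment `s,` mapped to code 0266 octal. A sketch that walks one slot; `cb_entry` and `parse_slot` are hypothetical names, not part of smaz.c:

```c
#include <assert.h>
#include <stddef.h>

/* One decoded codebook entry (layout assumed, see the text above). */
struct cb_entry {
    const char *text;     /* points into the slot, not NUL-terminated */
    int len;              /* fragment length in bytes */
    unsigned char code;   /* byte emitted in place of the fragment */
};

/* Walk one slot string and fill up to max entries.
 * Assumed layout per entry: <len> <len fragment bytes> <code>.
 * The slot ends at the string's terminating NUL, which reads as len 0.
 * Returns the number of entries found. */
static int parse_slot(const char *slot, struct cb_entry *out, int max)
{
    const unsigned char *p = (const unsigned char *)slot;
    int n = 0;
    assert(slot != NULL && out != NULL);
    while (p[0] != 0 && n < max) {
        int len = p[0];
        out[n].text = (const char *)(p + 1);
        out[n].len = len;
        out[n].code = p[1 + len];
        n++;
        p += len + 2;     /* skip length byte, fragment, and code byte */
    }
    return n;
}
```

Read this way, the slot `"\003had\232\002leW"` holds two entries, `had` and `le`, and an empty string is a slot with no entries.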
  • Text is cut off when compressing

    Hello, when compressing a tweet with the algorithm, the compression stops at certain points. For example, the combination " M" is detected as the end of the string and nothing after it is compressed. If the "M" is changed to "m", the algorithm keeps working. This also happens with the sequence " C".

    Tweet string: "@ChidubemLatest More to the point. Revenge porn is illegal in California, (senate bill 255). By posting the indecen… https://t.co/LTtwbW75Po"

    Regards

    opened by jeremyisai93 0
Owner
Salvatore Sanfilippo
Computer programmer based in Sicily, Italy. I mostly write OSS software. Born 1977. Not a puritan.