SMAZ - compression for very small strings
-----------------------------------------

Smaz is a simple compression library suitable for compressing very short
strings. General purpose compression libraries build the state needed for
compressing data dynamically, in order to be able to compress every kind of
data. This is a very good idea, but not for one specific problem: compressing
small strings, where there is too little input to build that state and so
little or no compression is achieved.

Smaz, instead, is not good for compressing general purpose data, but can
compress text by 40-50% in the average case (it works better with English),
and is able to perform a bit of compression for HTML and URLs as well. The
important point is that Smaz is able to compress even strings of two or three
bytes!

For example the string "the" is compressed into a single byte.

To compare this with other libraries, consider that a library like zlib will
usually not be able to compress text shorter than 100 bytes.

COMPRESSION EXAMPLES
--------------------

'This is a small string' compressed by 50%
'foobar' compressed by 34%
'the end' compressed by 58%
'not-a-g00d-Exampl333' enlarged by 15%
'Smaz is a simple compression library' compressed by 39%
'Nothing is more difficult, and therefore more precious, than to be able to decide' compressed by 49%
'this is an example of what works very well with smaz' compressed by 49%
'1000 numbers 2000 will 10 20 30 compress very little' compressed by 10%

In general, lowercase English will work very well. It works poorly when the
strings contain a lot of numbers.

Other languages are compressed pretty well too. The following is Italian,
not very similar to English but still compressible by Smaz:

'Nel mezzo del cammin di nostra vita, mi ritrovai in una selva oscura' compressed by 33%
'Mi illumino di immenso' compressed by 37%
'L'autore di questa libreria vive in Sicilia' compressed by 28%

It can compress URLs pretty well:

'http://google.com' compressed by 59%
'http://programming.reddit.com' compressed by 52%
'http://github.com/antirez/smaz/tree/master' compressed by 46%

USAGE
-----

The lib consists of just two functions:

    int smaz_compress(char *in, int inlen, char *out, int outlen);

Compress the buffer 'in' of length 'inlen' and put the compressed data into
'out' of max length 'outlen' bytes. If the output buffer is too short to hold
the whole compressed string, outlen+1 is returned. Otherwise the length of the
compressed string (less than or equal to outlen) is returned.

    int smaz_decompress(char *in, int inlen, char *out, int outlen);

Decompress the buffer 'in' of length 'inlen' and put the decompressed data into
'out' of max length 'outlen' bytes. If the output buffer is too short to hold
the whole decompressed string, outlen+1 is returned. Otherwise the length of the
decompressed string (less than or equal to outlen) is returned. This function
will not automatically put a nul-term at the end of the string if the original
compressed string didn't include a nul-term.

CREDITS
-------

Smaz was written by Salvatore Sanfilippo and is released under the BSD license.
Check the COPYING file for more information.
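The return convention described under USAGE (outlen+1 when the output buffer is too short, the real length otherwise) lends itself to a simple retry loop. The following is a minimal sketch under stated assumptions, not part of smaz: `run_codec` and `dup_codec` are hypothetical names, and `dup_codec` is only a stand-in that follows the same convention so the helper can be exercised without the library.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Same shape as smaz_compress() / smaz_decompress() in the README. */
typedef int (*codec_fn)(char *in, int inlen, char *out, int outlen);

/* Hypothetical helper (not part of smaz): call `fn`, doubling the output
 * buffer whenever it reports "too short" by returning outlen+1.
 * On success returns a malloc'ed buffer and stores its length in *reslen.
 * Returns NULL if memory runs out or the buffer cannot grow further. */
char *run_codec(codec_fn fn, char *in, int inlen, int *reslen) {
    int outlen = inlen > 0 ? inlen : 1;   /* first guess */
    assert(fn != NULL && reslen != NULL);
    for (;;) {
        char *out = malloc((size_t)outlen);
        int n;
        if (out == NULL) return NULL;
        n = fn(in, inlen, out, outlen);
        if (n <= outlen) {                /* it fit: n is the real length */
            *reslen = n;
            return out;
        }
        free(out);                        /* n == outlen+1: retry bigger */
        if (outlen > INT_MAX / 2) return NULL;
        outlen *= 2;
    }
}

/* Stand-in codec (hypothetical): writes every input byte twice, and
 * returns outlen+1 when `out` is too small, like the smaz functions. */
int dup_codec(char *in, int inlen, char *out, int outlen) {
    int i;
    if (2 * inlen > outlen) return outlen + 1;
    for (i = 0; i < inlen; i++) {
        out[2 * i] = in[i];
        out[2 * i + 1] = in[i];
    }
    return 2 * inlen;
}
```

With the real library one would presumably pass `smaz_compress` or `smaz_decompress` as `fn`. In both cases the stored length, not `strlen`, should be used afterwards, since compressed data is binary and the decompressed data is not nul-terminated automatically.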
Small strings compression library
Comments
-
GitHub repository description claims this is an “encryption library”
The GitHub repository description is "Small strings encryption library" when, going by the README, smaz is a compression library, not an encryption library.
-
Aiding Smaz in further compressing repeating characters
Ciao Salvatore,
I'm crossposting this here as I think it's better suited because you're the creator of this project.
Smaz is wonderful as it's able to compress a short string (< 100 bytes) where other compression tools fail. But it has one problem: it doesn't optimize repeating characters by itself.
For example the string "this is a short string" compresses fine
\x9b8\xac>\xbb\xf2>\xc3F
It is 9 bytes long. A short string with repeating characters is a problem, though. For example, the string "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's" compresses into this
\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe'\n
It is still smaller, but the many "\x04" bytes look like a waste of space.
I've been thinking about counting how often a letter repeats and replacing the run with a sort of "bookmark". For example, "aaaaaaaaaa" with ten "a" occurrences becomes "a//10".
This is a test Python snippet I wrote off the top of my head; it is very ugly as of now

    import re

    text = "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"

    # Replace each run of 3 or more identical characters
    # with "<char>//<run length>".
    shortened = re.sub(
        r"(.)\1{2,}",
        lambda m: "%s//%d" % (m.group(1), len(m.group(0))),
        text,
    )
    print(shortened)

It then becomes
this is a string with many a//22's
Smaz compressed
\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\xc5\xc5\xff\x0222'\n
My worry is: what if the string contains a URL? Is it safe to use "//" as the marker? And then there are regex strings. How can the marker be escaped in those cases?
Finally my clear and concise question is: How do you safely shorten repeating characters that Smaz doesn't compress by itself?
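One common answer to the escaping question is to stop using a printable marker such as "//" and reserve a single escape byte instead, always writing that byte itself in escaped form so it can never be confused with real data. The sketch below is a generic escape-byte run-length pass, not something smaz does; `rle_encode`, `rle_decode`, `RLE_ESC` and `RLE_MIN` are hypothetical names, and the choice of 0x01 as the marker is an assumption. Whether it actually saves space once combined with Smaz is not measured here.

```c
#include <assert.h>
#include <stddef.h>

#define RLE_ESC 0x01   /* hypothetical marker byte, rare in plain text */
#define RLE_MIN 4      /* shorter runs are cheaper left as they are */

/* Encode `in` into `out`. A run is written as ESC, count, byte when it is
 * at least RLE_MIN long, or when the byte is ESC itself, so a literal ESC
 * is never mistaken for a marker. Returns the number of bytes written, or
 * outlen+1 if `out` is too small (same convention as the smaz functions). */
int rle_encode(const unsigned char *in, int inlen,
               unsigned char *out, int outlen) {
    int i = 0, o = 0;
    assert(in != NULL && out != NULL);
    while (i < inlen) {
        int run = 1;
        while (i + run < inlen && in[i + run] == in[i] && run < 255) run++;
        if (run >= RLE_MIN || in[i] == RLE_ESC) {
            if (o + 3 > outlen) return outlen + 1;
            out[o++] = RLE_ESC;
            out[o++] = (unsigned char)run;
            out[o++] = in[i];
        } else {
            int k;
            if (o + run > outlen) return outlen + 1;
            for (k = 0; k < run; k++) out[o++] = in[i];
        }
        i += run;
    }
    return o;
}

/* Reverse of rle_encode. Returns the decoded length, outlen+1 if `out`
 * is too small, or -1 if the input ends in the middle of a record. */
int rle_decode(const unsigned char *in, int inlen,
               unsigned char *out, int outlen) {
    int i = 0, o = 0;
    assert(in != NULL && out != NULL);
    while (i < inlen) {
        if (in[i] == RLE_ESC) {
            int k, run;
            if (i + 2 >= inlen) return -1;
            run = in[i + 1];
            if (o + run > outlen) return outlen + 1;
            for (k = 0; k < run; k++) out[o++] = in[i + 2];
            i += 3;
        } else {
            if (o + 1 > outlen) return outlen + 1;
            out[o++] = in[i++];
        }
    }
    return o;
}
```

Because every occurrence of the marker byte in the input is itself escaped, URLs, regex strings and any other content round-trip unchanged; the cost is three bytes for each literal marker byte.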
-
"warning: implicit declaration of function ‘random’" message when compiling
When I compile, I get the following warning:
    $ make clean ; make
    rm -rf smaz_test
    gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
    smaz_test.c: In function ‘main’:
    smaz_test.c:58:18: warning: implicit declaration of function ‘random’ [-Wimplicit-function-declaration]
             ranlen = random() % 512;
The program compiles and the tests pass, so it looks like this warning is non-critical.
From looking around, I found an SO question/answer that offers a solution.
Adding the -D_XOPEN_SOURCE=600 option to gcc in the Makefile fixes the issue for me:

    $ git diff
    diff --git a/Makefile b/Makefile
    index 62e8ccb..eecbac7 100644
    --- a/Makefile
    +++ b/Makefile
    @@ -1,7 +1,7 @@
     all: smaz_test

     smaz_test: smaz_test.c smaz.c
    -	gcc -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c
    +	gcc -D_XOPEN_SOURCE=600 -o smaz_test -O2 -Wall -W -ansi -pedantic smaz.c smaz_test.c

     clean:
     	rm -rf smaz_test

My system information:
    $ gcc --version
    gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
    Copyright (C) 2015 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
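The warning appears because random() is a POSIX/X-Open function rather than an ISO C one, and with -ansi the system headers hide such declarations unless a feature-test macro asks for them. As an alternative to the Makefile change, the same macro can be defined at the top of the source file. This is a minimal sketch, not the project's actual fix; `random_len` is a hypothetical helper that mirrors the `random() % 512` line from the warning.

```c
/* Must come before any #include so the system headers see it. */
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600
#endif

#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: a pseudo-random length in the range [0, limit). */
int random_len(int limit) {
    assert(limit > 0);
    return (int)(random() % limit);
}
```

Defining the macro in the file keeps the fix next to the code that needs it, while the -D flag in the Makefile applies it to every translation unit; either way the declaration of random() becomes visible.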
-
codebook with the most frequent ngrams in language/s
I know this guy.. ;) (from Redis). Did you hand-pick the codebook dictionary? How? Have you thought about using the most frequent n-grams in the language(s), e.g. the top 32 or so n-grams from Norvig's ngrams2,3,4,5,6,7,8,9.csv? How do you pick them optimally for minimum overlap and better compression rates? For example, "tion" and "ation" are the most common 4- and 5-letter n-grams respectively, and "tio" is the 6th most common 3-letter n-gram. I think you'd get much better compression rates.
I want to test it, but couldn't find any docs. So what are these characters?
    static char *Smaz_cb[241] = {
    "\002s,\266", "\003had\232\002leW", "\003on \216", "", "\001yS",
    "\002ma\255\002li\227", "\003or \260", "", "\002ll\230\003s t\277",
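One possible reading of the excerpt above: each string looks like a hash bucket made of records, where a small leading byte gives the length of the text that follows and one trailing byte gives the code emitted for that text. For instance, "\003had\232" would be length 3, text "had", code 0232 octal. This layout is an assumption inferred from the visible entries, not something the README documents. The sketch below walks a bucket under that assumption; `cb_entry` and `parse_bucket` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One decoded record from a codebook bucket (hypothetical type). */
struct cb_entry {
    int len;           /* length of the substring */
    const char *str;   /* points into the bucket, NOT nul-terminated */
    int code;          /* byte value assumed to be emitted for it */
};

/* Walk a bucket assumed to be laid out as repeated
 *   [length byte][`length` substring bytes][code byte]
 * and ended by a zero length byte. Stores up to `max` records in
 * `entries` and returns how many were stored. */
int parse_bucket(const char *bucket, struct cb_entry *entries, int max) {
    const unsigned char *p = (const unsigned char *)bucket;
    int n = 0;
    assert(bucket != NULL && entries != NULL);
    while (p[0] != 0 && n < max) {
        entries[n].len = p[0];
        entries[n].str = (const char *)(p + 1);
        entries[n].code = p[p[0] + 1];
        n++;
        p += p[0] + 2;   /* skip length byte, substring and code byte */
    }
    return n;
}
```

Under this reading, the second bucket in the excerpt holds two records, "had" and "le", and the empty strings are simply buckets with no entries.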
-
Text is cut off when compressing
Hello, when compressing a tweet with the algorithm, the compression is cut off at certain points. For example, the combination " M" is detected as the end of the string and nothing after it is compressed. If the "M" is changed to "m", the algorithm keeps working. This also happens with the symbol sequence " C".
Tweet string: "@ChidubemLatest More to the point. Revenge porn is illegal in California, (senate bill 255). By posting the indecen… https://t.co/LTtwbW75Po"
Regards
Owner
Salvatore Sanfilippo
Superfast compression library
DENSITY Superfast compression library DENSITY is a free C99, open-source, BSD licensed compression library. It is focused on high-speed compression, a
data compression library for embedded/real-time systems
heatshrink A data compression/decompression library for embedded/real-time systems. Key Features: Low memory usage (as low as 50 bytes) It is useful f
Compression abstraction library and utilities
Squash - Compression Abstraction Library
Multi-format archive and compression library
Welcome to libarchive! The libarchive project develops a portable, efficient C library that can read and write streaming archives in a variety of form
Brotli compression format
SECURITY NOTE Please consider updating brotli to version 1.0.9 (latest). Version 1.0.9 contains a fix to "integer overflow" problem. This happens when
Heavily optimized zlib compression algorithm
Optimized version of longest_match for zlib Summary Fast zlib longest_match function. Produces slightly smaller compressed files for significantly fas
Fastest Integer Compression
TurboPFor: Fastest Integer Compression TurboPFor: The new synonym for "integer compression" (2019.11) ALL functions now available for 64 bits ARMv8
A Small C Compiler
8cc C Compiler Note: 8cc is no longer an active project. The successor is chibicc. 8cc is a compiler for the C programming language. It's intended to
Smaller C is a simple and small single-pass C compiler
Smaller C is a simple and small single-pass C compiler, currently supporting most of the C language common between C89/ANSI C and C99 (minus some C89 and plus some C99 features).
A simple C library for compressing lists of integers using binary packing
The SIMDComp library A simple C library for compressing lists of integers using binary packing and SIMD instructions. The assumption is either that yo
A portable, simple zip library written in C
A portable (OSX/Linux/Windows), simple zip library written in C This is done by hacking awesome miniz library and layering functions on top of the min
is a c++20 compile and runtime Struct Reflections header only library.
is a c++20 compile and runtime Struct Reflections header only library. It allows you to iterate over aggregate type's member variables.
Chad Strings - The Chad way to handle strings in C.
chadstr.h Chad Strings - The Chad way to handle strings in C. One str(...) macro to handle them all. Examples Usage: int table = 13; int id = 37; str
Lizard (formerly LZ5) is an efficient compressor with very fast decompression. It achieves compression ratio that is comparable to zip/zlib and zstd/brotli (at low and medium compression levels) at decompression speed of 1000 MB/s and faster.
Lizard - efficient compression with very fast decompression Lizard (formerly LZ5) is a lossless compression algorithm which contains 4 compression met
This is a header-only C++ string implementation which specializes in dealing with small strings. 🧵
string-impl This is a header-only C++ string implementation which specializes in dealing with small strings. Usage ⌨ Instantiation A string can be in
Simple Dynamic Strings library for C
Simple Dynamic Strings Notes about version 2: this is an updated version of SDS in an attempt to finally unify Redis, Disque, Hiredis, and the stand a
Experimental managed C-strings library
Stricks Managed C strings library. API Why? Because handling C strings is tedious and error-prone. Appending while keeping track of length, null-t
A fast Python Common substrings of multiple strings library with C++ implementation
A fast Python Common substrings of multiple strings library with C++ implementation Having a bunch of strings, can I print some substrings which appea