Word Replacing Transform for eXtended Markup Language
Version 3.2
29.10.2007
(C) 2007 Przemyslaw Skibinski (inikep@gmail.com). Portions:
- zlib (C) 1995-2005 Jean-loup Gailly and Mark Adler
- LZMA (C) 1999-2006 Igor Pavlov
- PPMVC (C) 1997-2006 Dmitry Shkarin and Przemyslaw Skibinski
- lpaq6 (C) 2007 Matt Mahoney and Alexander Ratushnyak
Introduction
XWRT (XML-WRT) is a high-performance XML compressor (in fact, it works
with all textual files). It transforms XML into a more compressible form
and uses zlib (default), LZMA, PPMVC, or lpaq6 as the back-end
compressor. The idea is based on XMill, a well-known XML compressor.
Moreover, XWRT creates a semi-dynamic dictionary and replaces frequently
used words with shorter codes. There are additional techniques to improve
compression ratio:
- word alphabet can consist of start tags (like '<tag>'), urls, e-mails
- special model for encoding numbers
- input XML file is split into containers
- there are special containers for dates, time, pages and fractional numbers
- end tags (like '</tag>') are replaced with a single char
- end tags + EOL symbols can also be replaced with a single char
- spaceless words model
- very effective methods for preserving white space
- quotes modeling ('="' and '">' replaced with a single char)
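The semi-dynamic dictionary idea can be illustrated with a minimal Python sketch. This is not XWRT's actual implementation: the function names, the tokenizer regex, the min_freq threshold and the use of integer indices as codes are all assumptions made for illustration. A first pass collects frequent words (start tags count as words), and a second pass replaces them with codes.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not XWRT's real one):
# a start tag like '<tag>', a run of letters, or any single character.
TOKEN = re.compile(r"<[A-Za-z][\w.-]*>|[A-Za-z]+|.", re.DOTALL)

def build_dictionary(text, min_freq=2):
    """First pass: collect words that occur at least min_freq times."""
    counts = Counter(t for t in TOKEN.findall(text) if len(t) > 1)
    # Most frequent words come first, so they would get the shortest codes.
    return [w for w, c in counts.most_common() if c >= min_freq]

def encode(text, dictionary):
    """Second pass: replace dictionary words with their integer index."""
    index = {w: i for i, w in enumerate(dictionary)}
    return [index.get(t, t) for t in TOKEN.findall(text)]

def decode(tokens, dictionary):
    """Inverse transform: indices become words again, literals are kept."""
    return "".join(dictionary[t] if isinstance(t, int) else t for t in tokens)
```

The dictionary has to be stored together with the encoded stream, since the decoder cannot rebuild it from the codes alone.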
Comparison to other XML compressors
All files used for the comparison can be downloaded from the Wratislavia XML Corpus.
Results are given in bpc (bits per character). Tested with XWRT 3.1:
file | gzip | XMill 0.9 zip | XWRT -l2 (gzip) | LZMA -a1 | XWRT -l6 (LZMA) | PPMdJ -o8 -m64 | XMill 0.9 PPMd | XMLPPM -l 9 | SCMPPM -l 9 | XWRT -l9 (PPM) | FastPAQ8 74 MB | XWRT -l11 (FastPAQ8) |
dblp | 1,463 | 1,250 | 0,865 | 0,943 | 0,747 | 0,724 | 0,940 | 0,802 | 0,693 | 0,690 | 0,659 | 0,597 |
enwikibooks | 2,339 | 2,295 | 1,742 | 1,686 | 1,504 | 1,565 | 1,838 | 1,621 | 1,621 | 1,481 | 1,357 | 1,269 |
enwikinews | 2,248 | 2,198 | 1,597 | 1,462 | 1,301 | 1,291 | 1,746 | 1,379 | 1,398 | 1,202 | 1,172 | 1,090 |
lineitem | 0,721 | 0,380 | 0,276 | 0,421 | 0,243 | 0,359 | 0,270 | 0,261 | 0,242 | 0,243 | 0,236 | 0,226 |
Shakespeare | 2,182 | 2,044 | 1,481 | 1,646 | 1,349 | 1,245 | 1,584 | 1,295 | 1,293 | 1,204 | 1,220 | 1,185 |
SwissProt | 0,985 | 0,619 | 0,475 | 0,478 | 0,388 | 0,490 | 0,477 | 0,416 | 0,417 | 0,363 | 0,395 | 0,313 |
uwm | 0,553 | 0,382 | 0,315 | 0,368 | 0,278 | 0,426 | 0,310 | 0,259 | 0,274 | 0,240 | 0,254 | 0,228 |
average | 1,499 | 1,310 | 0,964 | 1,001 | 0,830 | 0,871 | 1,024 | 0,862 | 0,848 | 0,775 | 0,756 | 0,701 |
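For readers unfamiliar with the unit, the bpc figures in the table can be related to file sizes as in the sketch below. It assumes one character equals one byte of the original file, which is the usual convention but is not stated in the text; the helper names are hypothetical.

```python
def bits_per_character(original_size, compressed_size):
    """Compressed size expressed in bits per original character (byte)."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 8.0 * compressed_size / original_size

def relative_gain(bpc_baseline, bpc_new):
    """Fraction by which bpc_new output is smaller than bpc_baseline output."""
    return (bpc_baseline - bpc_new) / bpc_baseline
```

With the averages from the table (decimal commas read as decimal points), relative_gain(1.499, 0.964) gives about 0.357, so on this corpus XWRT -l2 output is roughly 36% smaller than plain gzip output.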
Download
Please see the XWRT project download page.
XWRT usage
Usage: XWRT.exe [options] <file1> [file2] [file3] ...
where <file> is an XML file or an XWRT compressed file (the type is auto-detected)
you can also use wildcards (e.g., "*.xml")
GENERAL OPTIONS (which also set default additional options):
-i = Delete input files
-l0 = no compression (memory usage up to 16 MB)
-l1 = zlib fast (memory usage 16+1 MB)
-l2 = zlib normal (default, memory usage 16+1 MB)
-l3 = zlib best (memory usage 16+1 MB)
-l4 = LZMA dict size 64 KB (memory usage 16+9 MB for compression and 16+3 MB for decompression)
-l5 = LZMA dict size 1 MB (memory usage 16+18 MB for compression and 16+3 MB for decompression)
-l6 = LZMA dict size 8 MB (memory usage 16+84 MB for compression and 16+10 MB for decompression)
-l7 = PPMVC model size 16 MB (memory usage 16+20 MB)
-l8 = PPMVC model size 32 MB (memory usage 16+36 MB)
-l9 = PPMVC model size 64 MB (memory usage 16+70 MB)
-l10 = lpaq6 model size 120 MB (memory usage 16+104 MB)
-l11 = lpaq6 model size 214 MB (memory usage 16+198 MB)
-l12 = lpaq6 model size 406 MB (memory usage 16+390 MB)
-l13 = lpaq6 model size 790 MB (memory usage 16+774 MB)
-l14 = lpaq6 model size 1560 MB (memory usage 16+1542 MB)
-o = Force overwrite of output files
ADDITIONAL OPTIONS:
-bX = Set maximum buffer size while creating dynamic dictionary to X MB
-c = Turn off containers (without number and word containers)
+d = Turn on usage of the static dictionary (requires wrt-eng.dic,
which is available at http://www.ii.uni.wroc.pl/~inikep/research)
-eX = Set maximum dictionary size to X words
-fX = Set minimal word frequency to X
-mX = Set maximum memory buffer size to X MB (default=8)
-n = Turn off number containers
-pX = Preprocess only (file_size/X) bytes in a first pass
-s = Turn off spaces modeling option
-t = Turn off "try shorter word" option
-w = Turn off word containers
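When XWRT is driven from a script, the command line can be assembled as in the following sketch. The helper build_xwrt_command is hypothetical and only uses options listed above (-lN, -o, -i); it builds the argument list but does not run the program, and the executable name is taken from the usage line.

```python
def build_xwrt_command(files, level=2, overwrite=False, delete_input=False,
                       executable="XWRT.exe"):
    """Build an argument list following 'XWRT.exe [options] <file1> ...'."""
    if not 0 <= level <= 14:
        raise ValueError("level must be between 0 and 14")
    if not files:
        raise ValueError("at least one input file is required")
    cmd = [executable, f"-l{level}"]   # -l2 (zlib normal) is the default
    if overwrite:
        cmd.append("-o")               # force overwrite of output files
    if delete_input:
        cmd.append("-i")               # delete input files
    cmd.extend(files)
    return cmd
```

For example, build_xwrt_command(["dblp.xml"], level=6, overwrite=True) yields the arguments for LZMA compression with an 8 MB dictionary; the resulting list could be passed to subprocess.run.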
History
XWRT 3.2 (25.10.2007)
-FastPAQ8 replaced with lpaq6 (compression level 10-14)
XWRT 3.1 (05.06.2007)
-improved support for XML files encoded in UTF-8
-dictionary is compressed using front compression
-added little-endian/big-endian Unicode (UCS-2) support
-non-textual files are compressed/stored without using a filter
-64-bit compiler support
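Front compression, mentioned above for the dictionary, is a generic technique: each word in a sorted list is stored as the length of the prefix it shares with the previous word plus the remaining suffix. The sketch below shows the general idea only; XWRT's actual dictionary format is not described here, and the function names are assumptions.

```python
def front_encode(sorted_words):
    """Encode a sorted word list as (shared_prefix_length, suffix) pairs."""
    out = []
    prev = ""
    for word in sorted_words:
        n = 0
        limit = min(len(prev), len(word))
        # Count how many leading characters match the previous word.
        while n < limit and prev[n] == word[n]:
            n += 1
        out.append((n, word[n:]))
        prev = word
    return out

def front_decode(pairs):
    """Rebuild the word list from (shared_prefix_length, suffix) pairs."""
    words = []
    prev = ""
    for n, suffix in pairs:
        word = prev[:n] + suffix
        words.append(word)
        prev = word
    return words
```

The technique pays off on sorted dictionaries because neighbouring words tend to share long prefixes.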
XML-WRT 3.0 (14.09.2006)
-internal PPMVC and FastPAQ8 compression
XML-WRT 2.0 (14.06.2006)
-internal zlib and LZMA compression
-input XML file is split into containers depending on start tags and end tags; content under the same tag is sent to the same container
-container for dates in format 1980-02-31 and 01-MAR-1920
-container for times in format 11:30pm
-container for numbers from 1900 to 2155 (years)
-container for pages in format "x-y", where y-x<256, e.g. "120-148", "1480-1600"
-container for numbers in format "x-y", e.g. "1234-0", "87-623"
-container for two digits after period, e.g. "102.00", "12.01"
-container for numbers from 0.0 to 24.9 (one digit after period), e.g. "12.0", "9.9"
-urls (starting from "http:"), e-mails (x@y.z), "ü" added to dynamic dictionary
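The pages container listed above relies on the condition y-x<256, which means the second number can be stored as a one-byte difference from the first. A minimal sketch of that idea follows; the function names, the rejection of leading zeros and the (x, delta) layout are assumptions, since the text only states the condition.

```python
import re

PAGE_RANGE = re.compile(r"(\d+)-(\d+)")

def encode_page_range(text):
    """Return (x, delta) for 'x-y' with 0 <= y-x < 256, otherwise None."""
    m = PAGE_RANGE.fullmatch(text)
    if m is None:
        return None
    x, y = int(m.group(1)), int(m.group(2))
    # Leading zeros would be lost by int(); skip such strings (assumption).
    if str(x) != m.group(1) or str(y) != m.group(2):
        return None
    delta = y - x
    if not 0 <= delta < 256:
        return None  # would go to the general "x-y" number container instead
    return x, delta

def decode_page_range(x, delta):
    """Rebuild the original 'x-y' string."""
    return f"{x}-{x + delta}"
```

Strings such as "1234-0" or "87-623" fail the condition and return None, which matches their listing under the separate "x-y" number container.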
XML-WRT 1.0 (27.03.2006)
-first public release
How to compile
- LINUX 32-bit: a static executable is included with the sources; to compile it yourself you need g++. Be aware that the Linux version does not yet support LZMA and PPMVC back-end compression.
- LINUX 64-bit: uncomment "PROC=-m64" in Makefile
Licence
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2 as
published by the Free Software Foundation at
http://www.gnu.org/licenses/gpl.txt or (at your option) any later version.
This program is distributed without any warranty.