XWRT

Word Replacing Transform for eXtended Markup Language


Version 3.2

29.10.2007

(C) 2007 Przemyslaw Skibinski (inikep@gmail.com). Portions:
[Introduction] [Comparison to other XML compressors] [Download] [XWRT usage] [History] [How to compile] [Licence]


Introduction

XWRT (XML-WRT) is a high-performance XML compressor (in fact, it works with any textual file). It transforms XML into a more compressible form and uses zlib (default), LZMA, PPMVC, or lpaq6 as the back-end compressor. The idea is based on the well-known XML compressor XMill. In addition, XWRT creates a semi-dynamic dictionary and replaces frequently used words with shorter codes. There are additional techniques to improve the compression ratio:
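The word-replacing idea can be illustrated with a small sketch. It is not XWRT's actual algorithm or file format: it assumes pure ASCII input, at most 128 dictionary words, and one-character codes taken from the range U+0080 to U+00FF. The names build_dictionary, encode, and decode are hypothetical; min_freq and max_words only loosely mirror the -fX and -eX options.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")

def build_dictionary(text, min_freq=2, max_words=128):
    """First pass: collect words seen at least min_freq times (assumed policy)."""
    counts = Counter(WORD.findall(text))
    frequent = [w for w, c in counts.most_common() if c >= min_freq and len(w) > 1]
    return frequent[:max_words]

def encode(text, dictionary):
    """Second pass: replace each dictionary word with a one-character code."""
    codes = {w: chr(0x80 + i) for i, w in enumerate(dictionary)}
    return WORD.sub(lambda m: codes.get(m.group(0), m.group(0)), text)

def decode(encoded, dictionary):
    """Inverse transform: expand every code back into its word."""
    return "".join(
        dictionary[ord(ch) - 0x80] if ord(ch) >= 0x80 else ch for ch in encoded
    )

sample = "<book><title>XML</title></book><book><title>WRT</title></book>"
dictionary = build_dictionary(sample)
encoded = encode(sample, dictionary)
print(len(sample), len(encoded), decode(encoded, dictionary) == sample)
```

The dictionary has to travel with the encoded text so that the decoder can rebuild the input; the transformed text is then handed to a general-purpose back-end compressor.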


Comparison to other XML compressors

All files used for the comparison can be downloaded from the Wratislavia XML Corpus. Results are given in bpc (bits per character); lower is better. Tested with XWRT 3.1:

file        | gzip       | XMill 0.9  | XWRT -l2   | LZMA -a1   | XWRT -l6   | PPMdJ      | XMill 0.9  | XMLPPM     | SCMPPM     | XWRT -l9   | FastPAQ8   | XWRT -l11
            |            | zip        | (gzip)     |            | (LZMA)     | -o8 -m64   | PPMd       | -l 9       | -l 9       | (PPM)      | 74 MB      | (FastPAQ8)
dblp        | 1.463      | 1.250      | 0.865      | 0.943      | 0.747      | 0.724      | 0.940      | 0.802      | 0.693      | 0.690      | 0.659      | 0.597
enwikibooks | 2.339      | 2.295      | 1.742      | 1.686      | 1.504      | 1.565      | 1.838      | 1.621      | 1.621      | 1.481      | 1.357      | 1.269
enwikinews  | 2.248      | 2.198      | 1.597      | 1.462      | 1.301      | 1.291      | 1.746      | 1.379      | 1.398      | 1.202      | 1.172      | 1.090
lineitem    | 0.721      | 0.380      | 0.276      | 0.421      | 0.243      | 0.359      | 0.270      | 0.261      | 0.242      | 0.243      | 0.236      | 0.226
Shakespeare | 2.182      | 2.044      | 1.481      | 1.646      | 1.349      | 1.245      | 1.584      | 1.295      | 1.293      | 1.204      | 1.220      | 1.185
SwissProt   | 0.985      | 0.619      | 0.475      | 0.478      | 0.388      | 0.490      | 0.477      | 0.416      | 0.417      | 0.363      | 0.395      | 0.313
uwm         | 0.553      | 0.382      | 0.315      | 0.368      | 0.278      | 0.426      | 0.310      | 0.259      | 0.274      | 0.240      | 0.254      | 0.228
average     | 1.499      | 1.310      | 0.964      | 1.001      | 0.830      | 0.871      | 1.024      | 0.862      | 0.848      | 0.775      | 0.756      | 0.701
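For readers who want to reproduce figures of this kind, the sketch below shows how a bpc value can be computed. It assumes one byte per character and uses Python's zlib module as a stand-in compressor, so it will not reproduce the numbers in the table; the function name bits_per_character and the sample data are illustrative only.

```python
import zlib

def bits_per_character(original: bytes, compressed: bytes) -> float:
    """Compressed size in bits divided by the number of input characters."""
    if not original:
        raise ValueError("original data must not be empty")
    return 8 * len(compressed) / len(original)

# Made-up, highly repetitive XML-like data, for illustration only.
data = b"<item><id>42</id><name>example</name></item>\n" * 1000
packed = zlib.compress(data, 9)
print(f"{bits_per_character(data, packed):.3f} bpc")
```

Uncompressed 8-bit text sits at 8 bpc, so a result of 0.8 bpc corresponds to output one tenth the size of the input.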

Download

Please see the XWRT project download page.


XWRT usage

Usage: XWRT.exe [options] <file1> [file2] [file3] ...
 where <file> is an XML file or an XWRT compressed file (it's auto-detected)
             you can also use wildcards (e.g., "*.xml")
 GENERAL OPTIONS (which also set default additional options):
  -i = Delete input files
  -l0 = no compression (memory usage up to 16 MB)
  -l1 = zlib fast (memory usage 16+1 MB)
  -l2 = zlib normal (default, memory usage 16+1 MB)
  -l3 = zlib best (memory usage 16+1 MB)
  -l4 = LZMA dict size 64 KB (memory usage 16+9 MB for compression and 16+3 MB for decompression)
  -l5 = LZMA dict size 1 MB (memory usage 16+18 MB for compression and 16+3 MB for decompression)
  -l6 = LZMA dict size 8 MB (memory usage 16+84 MB for compression and 16+10 MB for decompression)
  -l7 = PPMVC model size 16 MB (memory usage 16+20 MB)
  -l8 = PPMVC model size 32 MB (memory usage 16+36 MB)
  -l9 = PPMVC model size 64 MB (memory usage 16+70 MB)
  -l10 = lpaq6 model size 120 MB (memory usage 16+104 MB)
  -l11 = lpaq6 model size 214 MB (memory usage 16+198 MB)
  -l12 = lpaq6 model size 406 MB (memory usage 16+390 MB)
  -l13 = lpaq6 model size 790 MB (memory usage 16+774 MB)
  -l14 = lpaq6 model size 1560 MB (memory usage 16+1542 MB)
  -o = Force overwrite of output files
 ADDITIONAL OPTIONS:
  -bX = Set maximum buffer size while creating dynamic dictionary to X MB
  -c = Turn off containers (without number and word containers)
  +d = Turn on usage of the static dictionary (requires wrt-eng.dic,
       which is available at http://www.ii.uni.wroc.pl/~inikep/research)
  -eX = Set maximum dictionary size to X words
  -fX = Set minimal word frequency to X
  -mX = Set maximum memory buffer size to X MB (default=8)
  -n = Turn off number containers
  -pX = Preprocess only (file_size/X) bytes in a first pass
  -s = Turn off spaces modeling option
  -t = Turn off "try shorter word" option
  -w = Turn off word containers
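When XWRT is driven from a script, the command line can be assembled programmatically. The sketch below uses only options listed in the usage text (-lN, -o, -i); the helper name xwrt_command and the assumption that XWRT.exe is on the search path are illustrative, not part of XWRT.

```python
def xwrt_command(files, level=2, overwrite=False, delete_input=False, exe="XWRT.exe"):
    """Build an argument list for XWRT from the documented general options."""
    if not files:
        raise ValueError("at least one input file is required")
    if not 0 <= level <= 14:
        raise ValueError("compression level must be between 0 and 14")
    args = [exe, f"-l{level}"]
    if overwrite:
        args.append("-o")   # force overwrite of output files
    if delete_input:
        args.append("-i")   # delete input files
    return args + list(files)

cmd = xwrt_command(["dblp.xml"], level=6, overwrite=True)
print(" ".join(cmd))
# To actually run it (requires XWRT to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Because compressed files are auto-detected, the same call applied to an XWRT-compressed file should decompress it.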

History

XWRT 3.2 (25.10.2007)
-FastPAQ8 replaced with lpaq6 (compression level 10-14)

XWRT 3.1 (05.06.2007)
-improved support for XML files encoded in UTF-8
-dictionary is compressed using front compression
-added little-endian/big-endian Unicode (UCS-2) support
-non-textual files are compressed/stored without using a filter
-64-bit compiler support
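Front compression (also called front coding) stores each word of a sorted list as the length of the prefix it shares with the previous word plus the remaining suffix. The sketch below shows the general technique only; how XWRT orders and serializes its dictionary is not described here, and the names front_encode and front_decode are hypothetical.

```python
def front_encode(words):
    """Return (shared_prefix_length, suffix) pairs for the sorted word list."""
    pairs = []
    prev = ""
    for word in sorted(words):
        n = 0
        limit = min(len(prev), len(word))
        while n < limit and prev[n] == word[n]:
            n += 1
        pairs.append((n, word[n:]))
        prev = word
    return pairs

def front_decode(pairs):
    """Rebuild the sorted word list from (shared_prefix_length, suffix) pairs."""
    words = []
    prev = ""
    for n, suffix in pairs:
        prev = prev[:n] + suffix
        words.append(prev)
    return words

words = ["compressor", "container", "compress", "compression"]
pairs = front_encode(words)
print(pairs)
print(front_decode(pairs))
```

The saving grows with the number of words that share long prefixes, which is common in dictionaries built from natural-language text and tag names.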

XML-WRT 3.0 (14.09.2006)
-internal PPMVC and FastPAQ8 compression

XML-WRT 2.0 (14.06.2006)
-internal zlib and LZMA compression
-input XML file is split into containers depending on start-tags and end-tags; content under the same tag is sent to the same container
-container for dates in format 1980-02-31 and 01-MAR-1920
-container for times in format 11:30pm
-container for numbers from 1900 to 2155 (years)
-container for pages in format "x-y", where y-x<256, e.g. "120-148", "1480-1600"
-container for numbers in format "x-y", e.g. "1234-0", "87-623"
-container for two digits after period, e.g. "102.00", "12.01"
-container for numbers from 0.0 to 24.9 (one digit after period), e.g. "12.0", "9.9"
-urls (starting from "http:"), e-mails (x@y.z), "ü" added to dynamic dictionary
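The container types listed for XML-WRT 2.0 can be pictured as a routing step that sends each token to a stream of similar values. The sketch below is an assumption-laden illustration, not XWRT's detection code: the regular expressions, the container names, the check order, and the rule that a page range needs 0 <= y-x < 256 are all guesses based on the examples in the list.

```python
import re

DATE_ISO = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")      # 1980-02-31
DATE_MON = re.compile(r"[0-9]{2}-[A-Z]{3}-[0-9]{4}")      # 01-MAR-1920
TIME = re.compile(r"[0-9]{1,2}:[0-9]{2}[ap]m")            # 11:30pm
YEAR = re.compile(r"[0-9]{4}")
RANGE = re.compile(r"([0-9]+)-([0-9]+)")                  # x-y
TWO_DEC = re.compile(r"[0-9]+\.[0-9]{2}")                 # 102.00
ONE_DEC = re.compile(r"[0-9]{1,2}\.[0-9]")                # 12.0

def classify(token):
    """Return the (hypothetical) container name for a token."""
    if DATE_ISO.fullmatch(token) or DATE_MON.fullmatch(token):
        return "date"
    if TIME.fullmatch(token):
        return "time"
    if YEAR.fullmatch(token) and 1900 <= int(token) <= 2155:
        return "year"
    m = RANGE.fullmatch(token)
    if m:
        x, y = int(m.group(1)), int(m.group(2))
        return "pages" if 0 <= y - x < 256 else "number-range"
    if TWO_DEC.fullmatch(token):
        return "two-decimals"
    if ONE_DEC.fullmatch(token) and float(token) <= 24.9:
        return "one-decimal"
    return "other"

for token in ["1980-02-31", "11:30pm", "1999", "120-148", "87-623", "102.00", "9.9"]:
    print(token, "->", classify(token))
```

Grouping values of the same shape lets each container use a compact, type-specific encoding before the back-end compressor runs.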

XML-WRT 1.0 (27.03.2006)
-first public release

How to compile


Licence

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation at http://www.gnu.org/licenses/gpl.txt or (at your option) any later version. This program is distributed without any warranty.  
