Since edition 2012-04-27 of this compilation, a new custom-written Java program has been used to parse the English Wiktionary dump with full article texts, instead of the old one-line format definitions. This will make it possible to include more data from Wiktionary in the future.
Here are archives containing the Java source code, precompiled class files, Windows and Linux shell scripts, and other files needed for building the dictionary (links fixed 20181104).
Here's a list of the steps currently taken in order to build the dictionary.
Tools required: Windows/Linux, JRE 1.7 (1.5+ should work), Linux sort (running in a real or virtual Linux image), kindlegen, google-api-client-1.4.1-beta.jar
For editing the source code, you can use e.g. Eclipse Indigo.
The sort order was revised in edition 2011-08-28 so that UTF-8 sorting with the English locale is used. Special characters, mostly Greek letters, are now sorted last (after Z).
GnuWin32 does not seem to honor the LC* environment variables, so the Linux sort command was used instead, with LC_ALL left empty (LC_ALL=) and LC_COLLATE=en_US.UTF-8.
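The locale settings above could be applied from Java roughly as follows. This is a minimal sketch, not the build's actual invocation: the class name `LocaleSort`, the method `sortCommand`, and the file arguments are hypothetical, and it assumes a `sort` binary on the PATH of a Linux system where the en_US.UTF-8 locale is installed.

```java
import java.io.File;
import java.util.Map;

final class LocaleSort {
    // Builds (but does not start) a process that sorts `input` into `output`
    // using the Linux sort command with the collation settings named in the text.
    static ProcessBuilder sortCommand(File input, File output) {
        ProcessBuilder pb = new ProcessBuilder(
                "sort", "-o", output.getPath(), input.getPath());
        Map<String, String> env = pb.environment();
        env.put("LC_ALL", "");                 // empty, so it does not override LC_COLLATE
        env.put("LC_COLLATE", "en_US.UTF-8");  // English collation on UTF-8 input
        pb.redirectErrorStream(true);          // merge stderr into stdout for logging
        return pb;
    }
}
```

Calling `start()` on the returned `ProcessBuilder` and then `waitFor()` on the process would run the sort. Note that `--dictionary-order` is deliberately not passed.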
The sort is performed without the --dictionary-order parameter, because that parameter doesn't handle various UTF-8 characters correctly; e.g. it sorts Œdipus at D instead of OE and Šiauliai at i instead of Sh. The drawback is that without the parameter, “À l'ordinaire” is sorted after A, whereas with the parameter it would be sorted at L; the former is perhaps not desirable in a dictionary. Also, Greek letters are sorted after Z instead of according to their transliteration.
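One plausible explanation for the misplacements is that --dictionary-order compares only ASCII letters, digits and blanks, and silently drops every other byte, including the bytes of multi-byte UTF-8 characters. That is an assumption, not something verified against the sort source. The hypothetical helper `DictionaryOrderKey.key` below simulates this to show which text would be left to compare.

```java
final class DictionaryOrderKey {
    // Returns the part of `s` that a byte-oriented dictionary-order comparison
    // would look at, ASSUMING it keeps only ASCII letters, ASCII digits and blanks.
    static String key(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            boolean asciiDigit = c >= '0' && c <= '9';
            boolean blank = c == ' ' || c == '\t';
            if (asciiLetter || asciiDigit || blank) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Under this assumption, `key("Œdipus")` yields `"dipus"` and `key("Šiauliai")` yields `"iauliai"`, which matches the observed placement at D and at i. `key("À l'ordinaire")` yields `" lordinaire"`, which is consistent with the entry landing at L.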
One further quirk fixed in edition 2011-08-28 is the handling of the UTF-8 BOM character. It is now written only once, at the beginning of each file, when generating the final XML files (Def_a.xml etc.). Earlier, each intermediary step wrote the BOM as well. Because of the way the intermediary files were joined, the BOM characters were copied along with the content and ended up at the end of each letter file, or of the new single sort file, in the corresponding step.
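The BOM fix can be illustrated with a small sketch. This is not the actual build code: the class `BomJoin` and its methods are hypothetical names, and the sketch works on in-memory byte arrays rather than files. It strips a leading UTF-8 BOM from every intermediary part and writes a single BOM at the start of the joined output.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.List;

final class BomJoin {
    // The UTF-8 encoding of U+FEFF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns `data` without a leading UTF-8 BOM, if one is present.
    static byte[] stripBom(byte[] data) {
        if (data.length >= 3
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2]) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    // Joins the parts in order, writing exactly one BOM at the very beginning.
    static byte[] join(List<byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(UTF8_BOM, 0, UTF8_BOM.length);
        for (byte[] part : parts) {
            byte[] clean = stripBom(part);
            out.write(clean, 0, clean.length);
        }
        return out.toByteArray();
    }
}
```

Stripping on read, rather than relying on every intermediary step to omit the BOM, keeps the joined output clean even if an earlier step still writes one.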
The following is a detailed example of the sort logic.
a
A
Æquo animo - this begins with capitalized AE ligature
Affaire d'amour
À l'ordinaire
b
B
c
C
m
M
mg/kg
o
O
Œdipus - begins with capitalized OE ligature
œillade - begins with lowercase oe ligature
Oersted
oestrian
s
S
Shickley village
Šiauliai - cf. Sh
Signal Hill city
To prepon
v
V
vaccination
Vaccination
vaccinationlike
vaccination marks
vaccinations
Vaccinator
vagally
Vagancy
Vagantes
vagaries
w
W
wad
Wad
Wada test
z
Z
zinc
Zinc
μg/kg - begins with the Greek letter lowercase mu, cf. English m
σ-algebra - begins with the Greek letter lowercase sigma, cf. English s
Τό πρέπόν - this begins with the Greek letter capitalized Tau, not English T
Last updated 4th of November, 2018.