Building Unicode Tools

This file provides instructions for building and running the UnicodeTools, which can be used to:
  • build the Derived Unicode files in the UCD (Unicode Character Database),
  • build the transformed UCA (Unicode Collation Algorithm) files needed by Unicode,
  • run consistency checks on beta releases of the UCD and the UCA,
  • build the four chart folders on the Unicode site,
  • build files for ICU (collation, NFSkippable).

WARNING!!

  • This is NOT production-level code, and should never be used in programs.
  • The API is subject to change without notice, and will not be maintained.
  • The source is uncommented, and has many warts; since it is not production code, it has not been worth the time to clean it up.
  • It will probably need some adjustments on Unix or Windows, such as changing the file separator.
  • Currently it uses hard-coded directory names.
  • The contents of multiple versions of the UCD must be copied to a local directory, as described below.
  • It will be useful to look at the history of the files in SVN to see the kinds of rule changes that are made!

Instructions

Set up Eclipse, ICU, and CLDR according to the instructions at http://site.icu-project.org/setup/java/eclipse-setup-for-java-developers and http://cldr.unicode.org/development/eclipse-setup

Get an SVN account on unicode.org, and create a /unicodetools/ Java project in your Eclipse workspace.

In Eclipse, use File/Import... > General/Existing Projects into Workspace, select the folder with your unicodetools file tree, and click Finish.

Also create the project Generated. (Having it as a project makes it easier to view the files there.)
  • New... -> Project... -> General/Project
    • Project Name=Generated
    • Uncheck "Use default location" (so that it's not inside your Eclipse workspace)
    • Browse or type a folder path like <svn.unitools>/Generated
      • Create this folder
      • Create a subfolder BIN

Input data files

The input data files for the Unicode Tools have been checked into SVN since 2012-dec-21.
They are inside the unicodetools file tree, and the Java code has been updated to assume that. Any old Eclipse setup needs its path variables checked.

For details see Input data setup.

Versions

All of the following take "version 5.0.0" (or similar) in the options you give to Java (either on the command line, or in the Eclipse 'run' options). If you want a specific version like 3.1.0, then you would write "version 3.1.0". If you want the latest version, you can omit the "version X" option.
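
For example, the Program arguments to build the UCD files for a specific version would be (the version number here is only an illustration; see "Building Files" below for the full run setup):

  version 9.0.0 build MakeUnicodeFiles

and for the latest version simply:

  build MakeUnicodeFiles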

  1. If you are doing this for a new version, do the following:

Example changes for adding properties: http://www.unicode.org/utility/trac/changeset/509
  • Note, however, that that change had a mismatch between the "bpbt" alias in the code and the "bpt" alias in PropertyAliases.txt, and thus didn't work.

MakeUnicodeFiles.txt (find in Eclipse via Navigate/Resource or Ctrl+Shift+R)

  Generate: .*
  DeltaVersion: 1
  CopyrightYear: 2010 (or whatever)
   - new value under DerivedAge
Update in UCD_Names.java:
  String[] LONG_AGE and
  String[] SHORT_AGE
to add the newest version.
Update in org.unicode.text.utility.Settings.java to fix:
  public static final String latestVersion = "5.2.0";
  public static final String lastVersion = "5.1.0"; // last released version
Update in UCD_Types.java:
  LIMIT_AGE
  AGE_VERSIONS
Update in org.unicode.text.utility.Utility.java:
  searchPath
If there are new CJK characters, search for "Extension C" in UCD.java and make the corresponding changes:
  mapToRepresentative(...): add the range.
  hasComputableName(...) and get(...): add the representative (first) character.

Also add a new CJK range to UCD_Types.java, near CJK_C_BASE, and add the new range anywhere the others are used.
Also search (case-insensitively) unicodetools for 2A700 (start of Extension C) and add the new range accordingly.
When CJK_LIMIT moves, search for 9FCC and update near there as necessary.
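
For illustration, a new range constant pair in UCD_Types.java follows the same pattern as the existing CJK constants (see also the diff near the end of this page). The name and code points below are hypothetical; take the real first/last code points from the new UCD:

  // hypothetical new range, added next to the existing CJK_*_BASE/LIMIT constants
  CJK_F_BASE = 0x2CEB0,       // first ideograph of the new extension (check the real value)
  CJK_F_LIMIT = 0x2EBE0 + 1,  // one past the last assigned ideograph (check the real value)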

New blocks:

  • Add long & short names to UcdPropertyValues.java enum Block_Values.
  • Add to ShortBlockNames.txt.
    • Maybe only where the short name is different from the long name??

New scripts:

  • Add long & short names to UcdPropertyValues.java enum Script_Values.
  • Add to UCD_Types.java "SCRIPT CODE"
  • Add to UCD_Names.java SCRIPT and LONG_SCRIPT.
  • After the first run of UCD Main, take the DerivedAge.txt lines for the new version, copy them into the input Scripts.txt file, and change the new version number to the appropriate script (which can be new or old or Common etc.). Then run UCD Main again and check the generated Scripts.txt.
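
For example (illustration only, using code points from Unicode 7.0), a DerivedAge.txt line such as

  118A0..118F2    ; 7.0 # ...

would be copied into the input Scripts.txt and changed to

  118A0..118F2    ; Warang_Citi # ...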

New enum property values

When you run the tools, they will break if there are new enum property values.

Note: For more information and newer code, see the subpages.

To fix that:
Go into org.unicode.text.UCD/
    • UCD_Names.java and 
    • UCD_Types.java
(These contain ugly items that should be enums nowadays.)

Find the property (easiest is to search for some other properties in the enum). Add the new value at the end in UCD_Types. Be sure to update the limit, like
LIMIT_SCRIPT = Mandaic + 1;

Then in UCD_Names, change the corresponding name entries, both the full and abbreviated names. Follow the format of the existing values.

For example:

In UCD_Names.java, in BIDI_CLASS, add "LRI", "RLI", "FSI", "PDI",

In UCD_Names.java, in LONG_BIDI_CLASS, add "LeftToRightIsolate", "RightToLeftIsolate", "FirstStrongIsolate", "PopDirectionalIsolate",

In UCD_Types.java add & adjust

  BIDI_LRI = 20,

  BIDI_RLI = 21,

  BIDI_FSI = 22,

  BIDI_PDI = 23,

  LIMIT_BIDI_CLASS = 24;


Some changes may cause collisions in the UnicodeMaps used for derived properties. You'll find that out with an exception like:
Exception in thread "main" java.lang.IllegalArgumentException: Attempt to reset value for 17B4 when that is disallowed. Old: Control; New: Extend
    at org.unicode.text.UCD.ToolUnicodePropertySource$28.<init>(ToolUnicodePropertySource.java:578)

New scripts

Add new scripts like other new property values. In addition, make sure there are ISO 15924 script codes, and collect CLDR script metadata. See

http://cldr.unicode.org/development/updating-codes/updating-script-metadata

http://www.unicode.org/iso15924/codechanges.html


Break Rules

If there are new break rules (or changes), see Segmentation-Rules.

Building Files

  1. Setup
    1. In Eclipse, open the Package Explorer (Use Window>Show View if you don't see it)
    2. Open UnicodeTools
      • package org.unicode.text.UCD
        • MakeUnicodeFiles.txt

          This file drives the production of the derived Unicode files. The first three lines contain parameters that you may sometimes want to modify:

          Generate: .*script.* // this is a regular expression. Use .* for all files
          DeltaVersion: 10     // This gets appended to the file name. Pick 1+ the highest value in Public
          CopyrightYear: 2010  // Pick the current year
    3. Open in Package Explorer
      • package org.unicode.text.UCD
        • Main.java
    4. Run>Run As...
      1. Choose Java Application
        • It will fail; don't worry, you still need to set some parameters.
    5. Run>Run...
      • Select the Arguments tab, and fill in the following
        • Program arguments:
          build MakeUnicodeFiles
          • For a specific version, prepend "version 6.3.0 " or similar.
        • VM arguments:
          -DSVN_WORKSPACE=/home/mscherer/svn.unitools/trunk
          -DOTHER_WORKSPACE=/home/mscherer/svn.unitools
          -DUCD_DIR=/home/mscherer/svn.unitools/trunk/data
          -DCLDR_DIR=/home/mscherer/svn.cldr/trunk
      • Close and Save
  2. Run
    1. You'll see it build the 5.0 files, with something like the following results:
      Writing UCD_Data
      Data Size: 109,802
      Wrote Data 109802
      For each version, the tools build a set of binary data in BIN that contains the information for that release. This is done automatically, or you can do it manually with the Program Arguments shown in the next step.
    2. As options, use: version 5.0.0 build

      This builds a compressed format of all the UCD data (except blocks and Unihan) into the BIN directory. Don't worry about the voluminous console messages, unless one says "FAIL".

      You have to do this manually if you change any of the data files in that version! This ought to be handled by build files, but I haven't gotten around to it.

      Note: if for any reason you modify the binary format of the BIN files, you also have to bump the value in that file:

      static final byte BINARY_FORMAT = 8; // bumped if binary format of UCD changes
  3. Results in Generated
    1. The files will be in this directory.
    2. There are also DIFF folders, which contain BAT files that you can run on Windows with CompareIt. (You can modify the code to build BATs for another diff program if you want.)
      1. For any file with a significant difference, it will build two BAT files, such as the first two below.
        Diff_PropList-5.0.0d10.txt.bat
        OLDER-Diff_PropList-5.0.0d10.txt.bat
        
        UNCHANGED-Diff_PropertyValueAliases-5.0.0d10.txt.bat
    3. Any files without significant changes will have "UNCHANGED" as a prefix: ignore them.  The OLDER prefix is the comparison to the last version of Unicode.
    4. On Windows you can run these BATs to compare files: TODO??
  4. Upload for Ken & edcom
    1. Check diffs for problems
    2. First drop for a version: Upload all files
    3. Subsequent drop for a version: Upload only modified files

Invariant Checking

Note: Also build and run the New Unicode Properties programs, since they have some additional checks.
  1. Setup
    1. Open in Package Explorer
      • org.unicode.text.UCD
        • TestUnicodeInvariants.java
    2. Run>Run As... Java Application
      This will create the following file of results:
      ...workspace/Generated/UnicodeTestResults.html

      The console output will list whether any problems were found. For example, in the following case there was one failure:

      ParseErrorCount=0
      TestFailureCount=1
    3. The header of the result file explains the syntax of the tests.
    4. Open that file and search for "**** START Test Failure". 
    5. Each such point provides a dump of comparison information.
      1. Failures print a list of differences between two sets being compared. So if A and B are being compared, it prints all the items in A-B, then in B-A, then in A&B.
      2. For example, here is a listing of a problem that must be corrected. Note that usually there is a comment that explains what the following line or lines are supposed to test. Then will come FALSE (indicating that the test failed), then the detailed error report.
        # Canonical decompositions (minus exclusions) must be identical across releases
        [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] = [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion]
        
        FALSE
        **** START Error Info ****
        
        In [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but not in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
        
        # Total code points: 0
        
        Not in [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], but in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
        1B06           # Lo       BALINESE LETTER AKARA TEDUNG
        1B08           # Lo       BALINESE LETTER IKARA TEDUNG
        1B0A           # Lo       BALINESE LETTER UKARA TEDUNG
        1B0C           # Lo       BALINESE LETTER RA REPA TEDUNG
        1B0E           # Lo       BALINESE LETTER LA LENGA TEDUNG
        1B12           # Lo       BALINESE LETTER OKARA TEDUNG
        1B3B           # Mc       BALINESE VOWEL SIGN RA REPA TEDUNG
        1B3D           # Mc       BALINESE VOWEL SIGN LA LENGA TEDUNG
        1B40..1B41     # Mc   [2] BALINESE VOWEL SIGN TALING TEDUNG..BALINESE VOWEL SIGN TALING REPA TEDUNG
        1B43           # Mc       BALINESE VOWEL SIGN PEPET TEDUNG
        
        # Total code points: 11
        
        In both [$×Decomposition_Type:Canonical - $×Full_Composition_Exclusion], and in [$Decomposition_Type:Canonical - $Full_Composition_Exclusion] :
        00C0..00C5     # L&   [6] LATIN CAPITAL LETTER A WITH GRAVE..LATIN CAPITAL LETTER A WITH RING ABOVE
        00C7..00CF     # L&   [9] LATIN CAPITAL LETTER C WITH CEDILLA..LATIN CAPITAL LETTER I WITH DIAERESIS
        00D1..00D6     # L&   [6] LATIN CAPITAL LETTER N WITH TILDE..LATIN CAPITAL LETTER O WITH DIAERESIS
        ...
        30F7..30FA     # Lo   [4] KATAKANA LETTER VA..KATAKANA LETTER VO
        30FE           # Lm       KATAKANA VOICED ITERATION MARK
        AC00..D7A3     # Lo [11172] HANGUL SYLLABLE GA..HANGUL SYLLABLE HIH
        
        # Total code points: 12089
        **** END Error Info ****
    6. Options:
      1. -r    Print the failures as a range list.
      2. -fxxx    Use a different input file, such as -fInvariantTest.txt
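
For example, the Program arguments to use a different input file and print the failures as range lists might be:

  -fInvariantTest.txt -r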

Options

  1. If you want to see files that are opened while processing, do the following:
    1. Run>Run
    2. Select the Arguments tab, and add the following
      1. VM arguments:
        -DSHOW_FILES

UCA

  1. Note: This will only work after building the UCD files for this version.
  2. Download UCA files (mostly allkeys.txt) from http://www.unicode.org/Public/UCA/<beta version>/
  3. Run desuffixucd.py (see the inputdata subpage)
  4. Update the input files for the UCA tools, at ~/svn.unitools/trunk/data/uca/8.0.0/ = http://www.unicode.org/utility/trac/browser/trunk/unicodetools/data/uca/8.0.0
  5. You will use org.unicode.text.UCA.Main as your main class, creating a run configuration along the same lines as above (see the example after this list).
    1. Options (VM arguments):
      • -DNODATE  (suppresses date output, to avoid gratuitous diffs during development)
      • -DAUTHOR  (suppresses only the author suffix from the date)
      • -DAUTHOR=XYZ  (sets the author suffix to " [XYZ]")
  6. Only for UCA 6.2 and before: If you change any of the CJK constants, you also need to modify the same constants in ICU's ImplicitCEGenerator.
    1. If you don't, you'll see a message like:

      Exception in thread "main" java.lang.IllegalArgumentException: FA0E: overlap: 9FCC (E2FA6A90) > FA0E(E0AB8800)

  7. To test whether the UCA files are valid, use the option (note: you must also build the ICU files below, since they test other aspects):
    writeCollationValidityLog

    It will create a file:

    ...\5.0.0\CheckCollationValidity.html
    1. Review this file. It will list errors. Some of those are actually warnings, and indicate possible problems (this is indicated in the text, such as by: "These are not necessarily errors, but should be examined for possible errors"). In those cases, the items should be reviewed to make sure that there are no inadvertent problems.
    2. If it is not so marked, it is a true error, and must be fixed.
    3. At the end, there is section 11, Coverage, with two subsections:
      1. In UCDxxx, but not in allkeys. Check this over to make sure that these are all the characters that should get implicit weights.
      2. In allkeys, but not in UCD. These should be only contractions. Check them over to make sure they look right also.
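
For reference, a UCA run configuration might look roughly like the following; the version number and paths are illustrative, and the VM arguments mirror the UCD setup above:

  Main class:        org.unicode.text.UCA.Main
  Program arguments: version 8.0.0 writeCollationValidityLog
  VM arguments:      -DSVN_WORKSPACE=... -DOTHER_WORKSPACE=... -DUCD_DIR=... -DCLDR_DIR=... -DNODATE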

To build all the UCA files used by ICU, use the option:

ICU

They will be built into:

../Generated/uca/8.0.0/
  1. NFSkippable
    1. ICU needs a file that is generated with the same tool. Just use the input parameter "NFSkippable" to generate the file NFSafeSets.txt. This is also generated by default if you build the ICU files.
  1. You should then build a set of the ICU files for the previous version, if you don't have them. Use the options:
    version 4.2.0 ICU

    Or whatever the last version was.

  2. Now, you will want to compare versions. The key file is UCA_Rules_NoCE.txt. It contains the rules expressed in ICU format, which allows for comparison across versions of UCA without spurious variations of the numbers getting in the way.
    1. Do a Diff between the last and current versions of these files, and verify that all the differences are either new characters, or were authorized to be changed by the UTC.
Review the generated data; compare files, use blankweights.sed or similar:
  • ~/svn.unitools/Generated$ sed -r -f ~/svn.cldr/trunk/tools/scripts/uca/blankweights.sed ~/svn.cldr/trunk/common/uca/FractionalUCA.txt > ../frac-9.txt
  • ~/svn.unitools/Generated$ sed -r -f ~/svn.cldr/trunk/tools/scripts/uca/blankweights.sed uca/10.0.0/CollationAuxiliary/FractionalUCA.txt > ../frac-10.txt && meld ../frac-9.txt ../frac-10.txt
Copy all generated files to unicode.org for review & staging by Ken & editors.

Once the files look good:
  • Make sure there is a CLDR ticket for the new UCA version.
  • Create a branch for it.
  • Copy the generated CollationAuxiliary/* files to the CLDR branch at common/uca/ and commit for review.
    • ~/svn.unitools$ cp Generated/uca/8.0.0/CollationAuxiliary/* ~/svn.cldr/trunk/common/uca/
    • Ignore files that were copied but are not version-controlled, that is, "svn status" shows a question mark status for them.
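
For example, such files show up in "svn status" output with a question mark (the file name here is hypothetical):

  ?       common/uca/SomeExtraFile.txt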

UCA for previous version

Some of the tools code only works with the latest UCD/UCA versions. When I (Markus) worked on UCA 7 files while UCD/UCA 8 were under way, I set version 7.0.0 on the command line and made the following temporary (not committed to the repository) code changes:

Index: org/unicode/text/UCA/UCA.java
===================================================================
--- org/unicode/text/UCA/UCA.java (revision 742)
+++ org/unicode/text/UCA/UCA.java (working copy)
@@ -1354,7 +1354,7 @@
         {0x10FFFE},
         {0x10FFFF},
         {UCD_Types.CJK_A_BASE, UCD_Types.CJK_A_LIMIT},
-        {UCD_Types.CJK_BASE, UCD_Types.CJK_LIMIT},
+        {UCD_Types.CJK_BASE, 0x9FCC+1},  // TODO: restore for UCA 8.0!  {UCD_Types.CJK_BASE, UCD_Types.CJK_LIMIT},
         {0xAC00, 0xD7A3},
         {0xA000, 0xA48C},
         {0xE000, 0xF8FF},
@@ -1361,7 +1361,7 @@
         {UCD_Types.CJK_B_BASE, UCD_Types.CJK_B_LIMIT},
         {UCD_Types.CJK_C_BASE, UCD_Types.CJK_C_LIMIT},
         {UCD_Types.CJK_D_BASE, UCD_Types.CJK_D_LIMIT},
-        {UCD_Types.CJK_E_BASE, UCD_Types.CJK_E_LIMIT},
+        // TODO: restore for UCA 8.0!  {UCD_Types.CJK_E_BASE, UCD_Types.CJK_E_LIMIT},
         {0xE0000, 0xE007E},
         {0xF0000, 0xF00FD},
         {0xFFF00, 0xFFFFD},

Index: org/unicode/text/UCD/UCD.java
===================================================================
--- org/unicode/text/UCD/UCD.java (revision 743)
+++ org/unicode/text/UCD/UCD.java (working copy)
@@ -1345,7 +1345,7 @@
             if (ch <= 0x9FCC && rCompositeVersion >= 0x60100) {
                 return CJK_BASE;
             }
-            if (ch <= 0x9FD5 && rCompositeVersion >= 0x80000) {
+            if (ch <= 0x9FD5 && rCompositeVersion > 0x80000) {  // TODO: restore ">=" when really going to 8.0!
                 return CJK_BASE;
             }
             if (ch <= 0xAC00)

Index: org/unicode/text/UCD/UCD_Types.java
===================================================================
--- org/unicode/text/UCD/UCD_Types.java (revision 742)
+++ org/unicode/text/UCD/UCD_Types.java (working copy)
@@ -24,7 +24,7 @@
     // 4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
     // 9FD5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
     CJK_BASE = 0x4E00,
-    CJK_LIMIT = 0x9FD5+1,
+    CJK_LIMIT = 0x9FCC+1,  // TODO: restore for UCD 8.0!  0x9FD5+1,
 
     CJK_COMPAT_USED_BASE = 0xFA0E,
     CJK_COMPAT_USED_LIMIT = 0xFA2F+1,


Charts

To build all the charts, use org.unicode.text.UCA.Main, with the option:

charts

They will be built into:

http://unicode.org/draft/charts/

Once UCA is released, copy those files up to the right spots on the Unicode site: