Importing SFM to FLEx: A roadmap and resources

5E. Normalize POS Values

Because there are differences in how POS labels are handled in SFM versus FLEx, some normalization is usually needed. There are two main things to be done:

If the compiler did not use a "range set" for the \ps values in Toolbox, there will likely be inconsistencies in how the same POS is represented. For instance, a file might include "vi", "v.i.", "v.i", and "v. i.", all indicating "intransitive verb". This step is about (a) determining which values need to be collapsed, (b) deciding what to collapse them into, and (c) making that change in the SFM file.
The meaning of the POS field for affixes is different between Toolbox and FLEx. In FLEx, that field is for "what this affix attaches to", whereas in Toolbox users are often attempting to say what POS an affix IS. For importing to FLEx, it is important that the \ps field be blank for all affixes. Yet we do want to preserve what had been in that field. The script below changes the marker on the \ps field in affixes (so it can be imported into a custom field) and makes an empty \ps field.
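To make the affix change concrete, here is a minimal Python sketch of the idea. It is not the FixPOS-NNN.pl script. It assumes that each entry starts with \lx and that affixes can be recognized by a leading or trailing hyphen on the headword. The custom marker name \psaffix is hypothetical. Your data may mark affixes differently, so treat this only as an illustration.

```python
import re

def blank_affix_pos(sfm_text, custom_marker="psaffix"):
    # Move the \ps value of affix entries to a custom marker and leave \ps empty.
    # ASSUMPTION: an entry is an affix if its \lx value starts or ends with "-".
    # ASSUMPTION: "psaffix" is a made-up marker name; use whatever you map
    # to a custom field in FLEx.
    out = []
    in_affix = False
    for line in sfm_text.splitlines():
        if line.startswith("\\lx "):
            headword = line[4:].strip()
            in_affix = headword.startswith("-") or headword.endswith("-")
        if in_affix and re.match(r"\\ps(\s|$)", line):
            value = line[3:].strip()
            if value:
                # Preserve the original value under the custom marker.
                out.append("\\" + custom_marker + " " + value)
            # Keep an empty \ps field for FLEx.
            out.append("\\ps")
        else:
            out.append(line)
    return "\n".join(out)

sample = "\\lx -ing\n\\ps vsuf\n\\ge PROG\n\n\\lx run\n\\ps v\n\\ge run"
print(blank_affix_pos(sample))
```

Non-affix entries pass through unchanged, so the sketch can be run over the whole file.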

Below are the steps for doing this process. The final application of this step may not happen until later, but it is still important to work through it early.

See if the \ps values need to be normalized.
- Determine the set of values used in \ps by using a recipe like this on a Linux command line (note the space after "ps", which keeps other markers that merely begin with \ps out of the list):
  - grep -E '^\\ps ' MyDatabase.db | sort | uniq -c > pos-list.txt
  - (It is easy to modify this "recipe" to find the range of values for any fixed-value field, such as \so or \ue or \bw. It is helpful for determining whether the field really does use "fixed values" or if it is "free-form".)
- See if there are some values that are not appropriate for importing into FLEx (and therefore need to be "normalized").
  - Look for obvious typos (e.g., same abbreviation with an initial capital letter or all lowercase, some with periods at the end and others without, one abbreviation spelled more than one way)--you will want to pick one way to make them all consistent. (It is helpful to use a spreadsheet to do this planning.)
  - Look for overspecification that may need to be simplified for the Category list in FLEx. For instance, if they have been very detailed about the kinds of nouns (e.g., N-masc-inan for "masculine inanimate noun"), there may be useful information that should be expressed somewhere in the entry, but all of that detail may not in fact be needed for the Category. Think in terms of "What should the list of abbreviations in the front matter look like?"
  - If there are multiple POS in a single field (e.g., "n, adj" or "vt; vi"), there are a few possible ways to handle it.
    - If the linguist is active, it's possible to mark these so the linguist can investigate them further after the import is done.
    - In some cases, the import specialist might ask the linguist to split them into separate senses before importing, or provide enough information that the import specialist can split them. (But this usually slows down the import process and reduces momentum.)
    - If the linguist is not available for consultation, the import specialist can create categories in FLEx for these "compound categories". For instance, create a category "Adjective/Adverb" with an abbreviation "adj;adv". Create these in FLEx to match what is in the data.
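The checks above can be partly automated. The following Python sketch counts the \ps values (like the command-line recipe), groups values that differ only in capitalization, periods, or spacing, and lists values that appear to hold more than one POS. The function names are illustrative, and the grouping rule is a rough heuristic of my own. It will miss variant spellings such as "v" versus "vb", so the output is a starting point for the spreadsheet, not a decision.

```python
import re
from collections import Counter, defaultdict

def ps_values(sfm_text):
    # Count each distinct \ps value (lines with an empty \ps are skipped).
    counts = Counter()
    for line in sfm_text.splitlines():
        m = re.match(r"\\ps (.*)", line)
        if m:
            counts[m.group(1).strip()] += 1
    return counts

def likely_variants(counts):
    # Group values that are identical once case, periods, and spaces are ignored.
    groups = defaultdict(list)
    for value in counts:
        key = re.sub(r"[.\s]", "", value).lower()
        groups[key].append(value)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

def compound_values(counts):
    # Values containing a comma, semicolon, or slash probably hold several POS.
    return sorted(v for v in counts if re.search(r"[,;/]", v))

sample = "\\ps vi\n\\ps v.i.\n\\ps v. i.\n\\ps n\n\\ps n, adj\n\\ps vi"
counts = ps_values(sample)
print(counts)
print(likely_variants(counts))
print(compound_values(counts))
```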
Download and then edit FixPOS-NNN.pl so it can normalize the SFM file as decided. (Details of how to edit the script are not provided here. If you have enough understanding of Perl and regular expressions to edit this script, you can probably get enough pointers from comments that are in the script itself. If not, don't hesitate to get help for this part.)
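For readers who want to see the shape of the change before opening the Perl script, here is a small Python sketch of the same kind of normalization. It is not FixPOS-NNN.pl and does not reproduce its behavior. The mapping table is an invented example, and your own table should come from the list agreed with the linguist.

```python
import re

# ASSUMPTION: example mapping only; replace with the values decided for your data.
POS_MAP = {
    "v.i.": "vi",
    "v.i": "vi",
    "v. i.": "vi",
    "N": "n",
}

def normalize_ps(sfm_text, pos_map):
    # Rewrite each \ps value through pos_map; values not in the map are kept as-is.
    out = []
    for line in sfm_text.splitlines():
        m = re.match(r"\\ps (.*)", line)
        if m:
            value = m.group(1).strip()
            out.append("\\ps " + pos_map.get(value, value))
        else:
            out.append(line)
    return "\n".join(out)

sample = "\\lx go\n\\ps v.i.\n\\lx dog\n\\ps N\n\\lx eat\n\\ps vt"
print(normalize_ps(sample, POS_MAP))
```

Keeping the mapping in one table makes it easy to update after the linguist reviews the proposed list.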
If you are finished doing import steps that involve moving fields around (e.g., during work with Solid), and you don't have subentries of senses, then go ahead and apply that script. Otherwise, save it for a later step. (That is, it will get applied either after you have finished moving fields around, or as part of the "subentries of senses" process.) It is still important to think about it this early, so you can set up the FLEx project based on the plans you make here, and communicate with the linguist about it.
Communicate the list to the linguist. Confirm whether what you have guessed matches with what they desire, both in terms of what should or shouldn't be collapsed, and the full names and abbreviations. Edit your proposed list if they give you any corrections.
- It is recommended to add and define each POS category in the grammatical category list in FLEx before importing, even though FLEx will create a category for each POS referenced in the SFM file. Some advantages of preparing the list in advance:
  - FLEx creates all imported POS at the root level of the category list. If a hierarchical arrangement is desired, it is best to create the categories in advance. For example, if vi (intransitive verb) and vt (transitive verb) should be subcategories of v (verb), set that up in advance.
  - It is important to identify undesired duplicates (e.g., v, vb) in time to make corrections in the SFM file prior to import.
  - Additional information, such as the full name and description of each POS, can be documented in advance. This information is also useful later for inclusion in the front matter of the dictionary (for example, on Webonary.org).
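One way to keep the planned list and the SFM file in step is to record the plan in a small table and compare it against the file. The Python sketch below is only an illustration: the PLANNED table, its layout (abbreviation mapped to full name and parent), and the function names are assumptions of mine, not anything FLEx reads or produces.

```python
import re

# ASSUMPTION: hypothetical plan; abbreviation -> (full name, parent abbreviation or None).
PLANNED = {
    "v": ("Verb", None),
    "vi": ("Intransitive verb", "v"),
    "vt": ("Transitive verb", "v"),
    "n": ("Noun", None),
}

def unplanned_values(sfm_text, planned):
    # Return \ps values found in the SFM text that are not in the plan.
    found = {m.group(1).strip()
             for m in re.finditer(r"^\\ps (.+)$", sfm_text, re.MULTILINE)}
    return sorted(found - set(planned))

def outline(planned):
    # Render the plan as an indented hierarchy, for review with the linguist.
    lines = []
    def walk(parent, depth):
        for abbr, (name, par) in planned.items():
            if par == parent:
                lines.append("  " * depth + abbr + " (" + name + ")")
                walk(abbr, depth + 1)
    walk(None, 0)
    return lines

sample = "\\ps v\n\\ps vb\n\\ps vi\n\\ps n"
print(unplanned_values(sample, PLANNED))
print("\n".join(outline(PLANNED)))
```

Any value reported by unplanned_values is either a duplicate to fix in the SFM file (such as vb) or a category still missing from the plan.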
Communicate the normalized list to those working on the front matter, to be included in the list of abbreviations.
