LIBRARY FOR THE EFFICIENT HANDLING OF LARGE DICTIONARIES (LIBNADFA)

Compile

To compile an automaton we need:

  1. The file with the words and their associated information.

    It should have the following format:

    word_1[\tinformation_1][\tinformation_2]...\n

    word_2[\tinformation_1][\tinformation_2]...\n

    ...

    Where \t represents a tab, \n a line feed, and square brackets mark optional fields.

    For example, the following dictionary has four words and associates each word (form) with its lemma (canonical form):

    rejects\treject\n

    three\tthree\n

    violet\tviolet\n

    violets\tviolet\n

    In this case the information associated with the words (the lemmas) is not numeric, and we want the lemmas to be stored in a second automaton. The first dictionary will therefore reference the second.

    So we define another dictionary for the lemmas, which does not include any associated information:

    reject\n

    three\n

    violet\n
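
    The lemma dictionary above can be derived from the forms dictionary. The following Python sketch is only an illustration and is not part of libnadfa: the function names parse_dictionary and lemma_dictionary are hypothetical, and writing the lemmas in sorted order is an assumption taken from the example, not a documented requirement.

```python
def parse_dictionary(text):
    """Parse 'word[\\tinfo_1][\\tinfo_2]...' lines into (word, [info, ...]) pairs."""
    entries = []
    for line in text.splitlines():
        if not line:
            continue  # skip empty lines
        word, *infos = line.split("\t")
        entries.append((word, infos))
    return entries


def lemma_dictionary(entries, column=0):
    """Collect the distinct values of one information column, one per line.

    Sorting is an assumption that mirrors the example dictionary.
    """
    lemmas = {infos[column] for _, infos in entries if len(infos) > column}
    return "".join(lemma + "\n" for lemma in sorted(lemmas))


# The four-word forms dictionary from the example above.
forms_text = "rejects\treject\nthree\tthree\nviolet\tviolet\nviolets\tviolet\n"
entries = parse_dictionary(forms_text)
lemmas_text = lemma_dictionary(entries)
print(lemmas_text, end="")
```

    The printed text is what would be saved as the lemma dictionary (lemmas.txt in the settings shown in the next step).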

  2. The settings file for the compiler.

    In this case, one to compile the lemmas:

    <?xml version="1.0" encoding="ISO-8859-1" ?>

    <!DOCTYPE setup SYSTEM "setup_nadfa.dtd">

    <setup path="./" output_binary_file="automaton_lemmas.bin">

    <key file="lemmas.txt" mappings_number="0"/>

    </setup>

    and another to compile the forms:

    <?xml version="1.0" encoding="ISO-8859-1" ?>

    <!DOCTYPE setup SYSTEM "setup_nadfa.dtd">

    <setup path="./" output_binary_file="automaton_forms.bin">

    <key file="forms.txt" mappings_number="2"/>

    <info name="lemma" function="word_to_index" automaton_id="2" type="guint32"/>

    <input_automaton automaton_id="2" file="automaton_lemmas.bin"/>

    </setup>

    The setup_nadfa.dtd DTD can be found in the src directory of the downloadable package. The attributes of its elements are described below:

    • setup:
      • path: directory path to the files necessary for compilation.
      • output_binary_file: binary file which will store the compiled version of the dictionary.
    • key:
      • file: file containing the original dictionary, that is, the list of words and their information columns.
      • mappings_number: number of conversion tables used to index the information associated with the words. Set it to 2 if the words have any associated information, or to 0 if they have none.
    • info:
      • name: name of the information column.
      • function: function applied to the data of the information column when the field is not numeric. Currently the only supported value is word_to_index.
      • automaton_id: identifier of the previously compiled automaton referenced by the function given in the function attribute.
      • type: data type used to store the information column in the compiled automaton. If a function is used, this is the type of the data it returns.
    • input_automaton:
      • automaton_id: identifier assigned to a previously compiled automaton, by which info elements refer to it.
      • file: binary file which contains the compressed version of the automaton.
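
    To illustrate the idea behind the info element in the forms settings, the following Python sketch replaces each lemma with a number that refers to the lemma dictionary. It is only a conceptual model: the word_to_index function defined here is a hypothetical stand-in, not the library's implementation, and numbering the lemmas by their position in the sorted list is an assumption, since the numbering used by the compiled automaton is not described in this document.

```python
import bisect

# Contents of the lemma dictionary, kept sorted so binary search works.
lemmas = ["reject", "three", "violet"]


def word_to_index(word, words=lemmas):
    """Hypothetical stand-in: return the position of `word` in the sorted list."""
    i = bisect.bisect_left(words, word)
    if i == len(words) or words[i] != word:
        raise KeyError(word)  # the lemma is missing from the lemma dictionary
    return i


# (form, lemma) pairs from the forms dictionary.
forms = [
    ("rejects", "reject"),
    ("three", "three"),
    ("violet", "violet"),
    ("violets", "violet"),
]

# Each form stores a number instead of the lemma string itself.
stored = {form: word_to_index(lemma) for form, lemma in forms}

# guint32 is GLib's unsigned 32-bit integer, so every value must fit in 32 bits.
assert all(0 <= index < 2**32 for index in stored.values())
print(stored)
```

    Under this model, forms that share a lemma (violet and violets) store the same number, and the lemma string is kept only once, in the second automaton.
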
  3. Run:

    $ nadfac conf_lemas.xml

    $ nadfac conf_formas.xml

    Where conf_lemas.xml and conf_formas.xml are the settings files described above for lemmas and forms, respectively. The lemmas are compiled first because the settings file for the forms reads automaton_lemmas.bin.
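
With more than two dictionaries, the order in which to run nadfac can be worked out from the settings files themselves. The following Python sketch is an illustration and not a tool shipped with the library: the function compile_order is hypothetical. It assumes that a settings file must be compiled after every settings file whose output_binary_file it names in an input_automaton element. It ignores the path attribute, and the XML declaration and DOCTYPE lines are left out of the sample strings for brevity.

```python
import xml.etree.ElementTree as ET


def compile_order(settings):
    """Order settings files so each automaton is built before it is referenced.

    `settings` maps a settings file name to its XML text.
    """
    produces = {}  # output binary file -> settings file that builds it
    requires = {}  # settings file -> binary files it reads
    for name, text in settings.items():
        root = ET.fromstring(text)
        produces[root.get("output_binary_file")] = name
        requires[name] = [e.get("file") for e in root.iter("input_automaton")]

    ordered, visiting = [], set()

    def visit(name):
        if name in ordered:
            return
        if name in visiting:
            raise ValueError("circular reference involving " + name)
        visiting.add(name)
        for binary in requires[name]:
            if binary in produces:  # built by one of the given settings files
                visit(produces[binary])
        visiting.discard(name)
        ordered.append(name)

    for name in settings:
        visit(name)
    return ordered


forms_xml = """<setup path="./" output_binary_file="automaton_forms.bin">
<key file="forms.txt" mappings_number="2"/>
<info name="lemma" function="word_to_index" automaton_id="2" type="guint32"/>
<input_automaton automaton_id="2" file="automaton_lemmas.bin"/>
</setup>"""

lemmas_xml = """<setup path="./" output_binary_file="automaton_lemmas.bin">
<key file="lemmas.txt" mappings_number="0"/>
</setup>"""

order = compile_order({"conf_formas.xml": forms_xml, "conf_lemas.xml": lemmas_xml})
commands = [["nadfac", name] for name in order]
print(commands)
```

Each entry of commands is an argument list that could be passed to subprocess.run to invoke the compiler in that order.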

Access

To see how to load an automaton into memory and use it, we recommend looking at the files libnadfa_bin_print.c and libndfa_bin.h. The latter declares the functions exported by the library that manages the compiled automaton.

A complete description of the API for accessing compiled automata has not been written yet. We will provide it as soon as possible.
