Miscellaneous tools

indextool

indextool is a utility that extracts various information about a physical table. It does not work with template or distributed tables. The general syntax is:

indextool <command> [options]

Options

These options are applicable to all commands:

Commands

Here are the available commands:

spelldump

The spelldump command extracts the contents of a dictionary file in ispell or MySpell format. This is handy when you need to compile word lists for wordforms, because it generates all possible forms for you.

Here's the general syntax:

spelldump [options] <dictionary> <affix> [result] [locale-name]

The two main parameters are the dictionary's main file and its affix file, usually named [language-prefix].dict and [language-prefix].aff, respectively. These files ship with most standard Linux distributions and are available from numerous online sources.

The [result] parameter is the file the extracted dictionary data is written to, and [locale-name] specifies the locale to use.

The optional -c [file] flag specifies a file with case conversion details.

Here are some usage examples:

spelldump en.dict en.aff
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
spelldump ru.dict ru.aff ru.txt .1251

The resulting file lists all the words from the dictionary in alphabetical order, formatted like a wordforms file. You can then edit this file as needed. Here's a sample of what the output file might look like:

zone > zone
zoned > zoned
zoning > zoning
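
Because each line of the output follows the "source > destination" layout shown above, it is easy to post-process before using it as a wordforms file. The following Python sketch is illustrative only and not part of spelldump; the function name parse_wordforms is hypothetical, and it assumes one mapping per line exactly as in the sample.

```python
def parse_wordforms(lines):
    """Map each source form to its destination form from 'src > dst' lines."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or ">" not in line:
            continue  # skip blank or malformed lines
        src, dst = (part.strip() for part in line.split(">", 1))
        if src and dst:
            mapping[src] = dst
    return mapping


# Sample lines copied from the output shown above.
sample = ["zone > zone", "zoned > zoned", "zoning > zoning"]
forms = parse_wordforms(sample)
print(sorted(forms))
```

With the mappings in a dictionary, entries can be filtered or remapped and then written back out in the same "source > destination" layout.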

wordbreaker

The wordbreaker tool splits compound words, which are common in URLs, into their component words. For instance, it can split "lordoftherings" into four separate words or break down http://manofsteel.warnerbros.com into "man of steel warner bros". This makes such words searchable without resorting to prefixes or infixes. For example, a search for "sphinx" would not normally match "sphinxsearch". If you run wordbreaker on the compound word and index the resulting parts, the search succeeds without the increase in file size that comes with using prefixes or infixes in full-text indexing.

Here is an example of how to use wordbreaker:

echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel

The -dict option names the dictionary file used to split the input stream into individual words. If no dictionary file is specified, wordbreaker looks for a file named wordbreaker-dict.txt in the current working directory. (Make sure the dictionary file matches the language of the compound words you're working with.) The split command breaks words read from standard input and writes the results to standard output. The test and bench commands are also available: test assesses the splitting quality, and bench measures the performance of the splitting function.

wordbreaker uses the dictionary to identify individual substrings within a given string. To choose between multiple possible splits, it considers the relative frequency of each word in the dictionary: the higher the frequency, the more likely the split. To generate such a dictionary file, you can use the indexer tool:

indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/manticore.conf

This produces a text file named dict.txt that contains the 100,000 most frequent words from myindex, along with their counts. Since the output is a plain text file, you can edit it by hand at any time, adding or removing words as required.
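
To make the idea of frequency-based splitting concrete, here is a minimal Python sketch of one way to pick a split using word counts. It is not wordbreaker's actual implementation. It assumes the dictionary file holds one "word count" pair per line, scores a split by the sum of the log relative frequencies of its words, and uses made-up counts. The names load_freqs and split_compound are hypothetical.

```python
import math


def load_freqs(lines):
    """Parse 'word count' lines (assumed dictionary layout) into a dict."""
    freqs = {}
    for raw in lines:
        parts = raw.split()
        if len(parts) == 2 and parts[1].isdigit():
            freqs[parts[0].lower()] = int(parts[1])
    return freqs


def split_compound(text, freqs, max_len=20):
    """Return the best-scoring split of text into dictionary words.

    A split's score is the sum of log relative frequencies of its words.
    Returns None when no complete split exists.
    """
    total = sum(freqs.values())
    n = len(text)
    best = [None] * (n + 1)  # best[i] = (score, words) for text[:i]
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is None or word not in freqs:
                continue
            score = best[start][0] + math.log(freqs[word] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n][1] if best[n] else None


# Hypothetical counts, for illustration only.
freqs = load_freqs(["man 500", "of 9000", "steel 200", "mano 1", "fs 1", "teel 1"])
print(" ".join(split_compound("manofsteel", freqs)))  # prints "man of steel"
```

In this sketch, rare fragments such as "mano" lose to frequent words such as "man" and "of". This shows why adding or removing words in the dictionary, or changing their counts, changes which split is chosen.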