The index properties dialog has five sections, accessed by the tabs across the top:
  • Root Paths
  • Include Files
  • Exclude Paths
  • Options
  • Language

It also contains the following buttons accessible from all the tabs:
  • Cancel - close the dialog and ignore any changes made on any of the tabs.
  • Save - close the dialog without rebuilding the index, but save the changes made. The index will have to be rebuilt before those changes take effect.
  • Build - close the dialog and launch the indexer to rebuild the index in the background. When the build is finished, you will be prompted to reload the changed index.

Root Paths

This tab consists of a list of top level folders where Wilma should start looking for included files. All the folders inside these root folders, the folders inside those, and so on, will also be searched.

For example, to index all the files in your Documents folder on a Windows machine, you might have the entry:

C:\Documents and Settings\yourname\Documents

while on a Mac it might be:

/Users/yourname/Documents

and on Linux:

/home/yourname/Documents

You can either type the root folders directly into the list or use the browse button to pick them from a directory tree.
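
The recursive search of root folders described above can be pictured with a short Python sketch. This is only an illustration of the behavior, not Wilma's actual code; the function name walk_root_paths is hypothetical.

```python
import os

def walk_root_paths(root_paths):
    """Yield the full path of every file found under each root folder."""
    for root in root_paths:
        # os.walk visits the root folder and every folder nested below it.
        for folder, _subfolders, filenames in os.walk(root):
            for name in filenames:
                yield os.path.join(folder, name)
```

Each path produced this way is a full path name, which is the form that the include and exclude patterns are matched against.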


Include Files

While the Root Paths list tells Wilma where to look for files to index, this list tells Wilma which files in those folders are to be indexed and which file analyzer is to be used in reading them.

When a new index is created, the index properties dialog opens with the include files list filled with a range of common file extensions and their associated analyzers. In many cases the simplest thing is to use these default entries, but in other cases it will be more appropriate to clear the list and add just specific entries. Note that file patterns are typically something like *.txt, where the * character is a wild card representing any number of any characters. Also note that a pattern is matched against the full path name of the file, so something like */saved/*.doc (*\saved\*.doc on Windows) would only include doc files that have a folder named "saved" somewhere in their path.

The Add button adds a new row to the list, the Delete button removes the currently selected row, and the Clear button removes all the rows. The Up and Down buttons move the selected item within the list. The analyzer used for a file is determined by the first match in the list, which can be important if, for instance, you want to use a different analyzer for files in a certain folder. The Default button adds the default list of analyzers to the end of the current list.
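
The first-match rule can be sketched in Python as follows. This is an illustration under stated assumptions, not Wilma's implementation: the function pick_analyzer and the analyzer names are hypothetical, and Python's fnmatchcase stands in for Wilma's pattern matcher. Like the wild card described above, its * matches any number of any characters, including path separators, but it also gives special meaning to ? and [...], which Wilma may not.

```python
from fnmatch import fnmatchcase

def pick_analyzer(full_path, include_list):
    """Return the analyzer of the first pattern that matches, or None."""
    for pattern, analyzer in include_list:
        # Patterns are tested in list order against the full path name.
        if fnmatchcase(full_path, pattern):
            return analyzer
    return None  # No pattern matched, so the file is not indexed.

# Hypothetical include list: the folder-specific entry comes first,
# so it wins over the more general *.doc entry below it.
include_list = [
    ("*/saved/*.doc", "special"),
    ("*.doc", "generic"),
    ("*.txt", "text"),
]
```

With this list, /home/yourname/Documents/saved/report.doc is given the "special" analyzer, while a doc file elsewhere falls through to "generic". Swapping the first two rows would make the folder-specific entry unreachable, which is why the Up and Down buttons matter.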

To select an analyzer, click on the analyzer for that row to open a selection dialog, which lists the available analyzers on the left with a brief description on the right. The generic analyzer will be used for most files and is quite proficient at extracting text even from binary file formats.

Some analyzers, such as the zip, tar and gzip analyzers, deal with archives and will recursively index all of the files in the archive using the appropriate analyzers, including archive analyzers.
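
The recursive handling of archives can be pictured with the following Python sketch, which handles only zip files. It is an assumption-laden illustration rather than Wilma's code: the function name list_archive_members and the "!/" separator used to show nesting are both hypothetical.

```python
import io
import zipfile

def list_archive_members(data, prefix=""):
    """List the files inside a zip archive (given as bytes), descending into nested zips."""
    names = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # Folder entries hold no text of their own.
            path = prefix + info.filename
            if info.filename.lower().endswith(".zip"):
                # A nested archive is opened with the archive handler again.
                names.extend(list_archive_members(archive.read(info), path + "!/"))
            else:
                # An ordinary file would be passed to its matching analyzer here.
                names.append(path)
    return names
```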

Wilma can also use external command line programs to process unknown formats into temporary files that Wilma can handle. One good application of this is to use the open source pdftotext program for pdf files. While Wilma does have a simple pdf analyzer built in, it can have problems with complex pdf files that pdftotext handles well. See external analyzers for information on how to set these up.


Exclude Paths

This list is used to indicate any files that you don't want indexed, even though they fit the criteria set by the root paths and include files lists. For example, you might have a temporary directory somewhere in the directory tree included by a root path. This could be skipped with something like:

*/temporary/*

Like the include files, these patterns are matched against the full path name of the file.
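
As a rough Python sketch of how an exclude pattern filters full path names (the function is_excluded is hypothetical, and fnmatchcase is assumed to behave like Wilma's wild card matching):

```python
from fnmatch import fnmatchcase

def is_excluded(full_path, exclude_patterns):
    """Return True if any exclude pattern matches the full path name."""
    return any(fnmatchcase(full_path, pattern) for pattern in exclude_patterns)

exclude_patterns = ["*/temporary/*"]
```

Here /home/yourname/Documents/temporary/scratch.txt is excluded, while /home/yourname/Documents/notes.txt is not. A folder named temporary_files is also left alone, because the pattern requires a path component that is exactly "temporary".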


Options

This tab is used to specify various options pertaining to how the index is built:

Track All Files - If this option is selected, Wilma will include in its index basic information (name, size, modification date, etc.) for all the files it encounters in the root path directory trees, even those not specified for indexing in the include files tab.
Hide Tags - When displaying contents, hide the tags used in things like HTML files.
Include Word Counts - Keep track of how many times (up to 256) each word appears in a file.
Minimum Word Length - Words shorter than this will not be indexed. Few significant words are shorter than three characters, and limiting the index to words of that length or longer greatly reduces the amount of noise indexed when dealing with binary files.
Maximum Word Length - Sets the longest string of characters that will be considered a word.
Max Indexed Bytes/File - Limits the number of bytes read from any one file. Normally this will be set quite large, but it can be reduced where you know all the readable text will be near the beginning of the file. For example, this applies to email messages stored in separate files, since the attachments that appear after the clear text will just be noise to Wilma.
Display Line Length - How many characters are displayed in a line in the contents view before the line is wrapped.
Number Handling - There are three choices for how number characters are handled:
  • No Numbers - Numbers are treated as non-word, word-separating characters.
  • Numbers not starting word - Numbers are treated as word characters unless they are the first character in the word. Thus fred123 would be considered a word, but 2001 would not.
  • Numbers anywhere - Numbers are treated the same as alphabetic characters, so 2001 would be considered a word and indexed.
Others Not Starting Word - Normally only alphabetic characters, and numbers as specified above, are considered parts of a word, but you can specify other characters to be included. Characters specified here will be considered word characters unless they are at the beginning of the word. Be careful to avoid unintended consequences when adding characters. For instance, including the dot character, perhaps to index full domain names, would probably result in words at the end of sentences being indexed with the period included and thus missed when searched for without the period.
Others Characters Anywhere - As above, but these characters may also start words.
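
The way the word options interact can be sketched in Python. This is a simplified model, not Wilma's tokenizer: the function split_words, its parameter names and its default values are hypothetical, and it assumes that words outside the length limits are dropped rather than truncated.

```python
def split_words(text, min_len=3, max_len=32, numbers="none",
                others_not_starting="", others_anywhere=""):
    """Split text into words; numbers is "none", "not_starting" or "anywhere"."""
    def can_start(ch):
        if ch.isalpha():
            return True
        if ch.isdigit():
            return numbers == "anywhere"
        return ch in others_anywhere

    def can_continue(ch):
        if ch.isalpha():
            return True
        if ch.isdigit():
            return numbers in ("not_starting", "anywhere")
        return ch in others_anywhere or ch in others_not_starting

    words = []
    current = ""
    for ch in text:
        if current:
            if can_continue(ch):
                current += ch
                continue
            words.append(current)  # ch ends the current word.
            current = ""
        if can_start(ch):
            current = ch
    if current:
        words.append(current)
    # Assumption: words outside the length limits are simply discarded.
    return [w for w in words if min_len <= len(w) <= max_len]
```

In this model, the text "fred123 and 2001" yields fred and and with numbers="none", fred123 and and with "not_starting", and fred123, and and 2001 with "anywhere". Passing others_not_starting="." on "Visit example.com. It works." produces Visit, example.com. and works. (trailing periods included), which is the unintended consequence warned about above.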

Language

On this tab you can select which accented characters should be used when indexing.