Daylight v4.9
Release Date: 1 February 2008

Name

ares - approximate regular-expression search

Description

1. INTRODUCTION
2. USER INTERFACE
3. ARES REGULAR EXPRESSION LANGUAGE
4. LIMITATIONS
5. EXAMPLES, APPROXIMATE SEARCHES
6. EXAMPLES, REGULAR EXPRESSIONS
7. BUGS
8. REFERENCES and CREDITS

1. INTRODUCTION

This file documents the Daylight ARES utility, a regular expression matcher with the capability of doing approximate matching. Although ARES was primarily designed for the Merlin server, ARES is likely to be used in other components of the Daylight software. This document is intended to be an authoritative ARES reference for all such applications. End users are likely to find sections 1-6 most useful.

ARES implements a very general form of string searching: approximate regular expression matching. Since a regular expression can represent a simple string, ARES provides the ability to do approximate matching of simple strings. One can also use ARES for exact matching of strings or regular expressions by setting the number of allowable errors to zero. (However, faster methods are available for exact string matching.)

Approximate string searching differs from conventional string searching several important ways. Most importantly, strings are defined to match a pattern even if they don't match the pattern exactly, i.e., the string matches the pattern with a certain number of "errors". With ARES, each character insertion, deletion, or substitution counts as one error. For instance, if you use ARES to look for names containing "PARACETIMAL", names containing "PARACETAMOL" will be found with 2 (substitution) errors.

ARES allows you to specify parts of the pattern which must be matched exactly by enclosing them in angle brackets, e.g., the pattern "HYDRONOL" matches HYDROKON with two errors, but the pattern "HYDRON<OL>" will not match HYDROKON at any error level, since "OL" is required exactly.

With approximate string searching, one tends to look for complete strings (as opposed to substrings) more often than one does with exact matching. For instance, when looking for the name "ALUMINUM", you might wish to also find the name "ALUMINIUM" (British spelling?) but not "ALUMINIUM CHLORIDE". Although one can do this with a simple regular expression (^ALUMINUM$), ARES provides an simple option to search for "whole" strings.

ARES accepts patterns written as regular expressions. Although regular expressions allow complex patterns to be specified, the concept of insertion, deletion, and substitution errors is just as valid as with simple string patterns. More information on ARES' regular expression language may be found below.

2. USER INTERFACE

The user interface is implementation-dependent, but generally will consist of following elements:

- User input area: enter string or regular expression to
search for here.  The search string is limited to about
30 characters, excluding "magic characters" (regular
expression meta-characters).

- "Whole line" option: if selected, search for entire lines which match the input pattern; if not, look for lines containing the input pattern as a substring.

- "No magic" option: if selected, all characters in the input pattern are treated as themselves with no special "regular expression" meaning. If not selected, the characters ^ $ * [ ] . | ( ) < > # and \\ have special meanings unless preceded by \\ (see below).

- "Ignore case" option: if selected, upper and lower case letters are considered to be identical; if not selected, search is case-sensitive.

- "Number of errors": maximum number of errors (mismatches) which are allowed in a match. 0 requires an exact match, positive values allow approximate matching. Each insertion, deletion, or substitution of a character is counted as one error. Some interfaces may provide a "Best match" choice, which looks for 0-error (exact) matches and if none are found, looks for 1-error matches, then 2-error matches, etc, until one or more matches are found.

3. ARES REGULAR EXPRESSION LANGUAGE

^ ....... beginning of line
$ ....... end of line
* ....... 0 or more of previous expression
[..] .... character list, [^..] is complement,
          [a-z] is range
. ....... any character
| ....... OR (union)
( ) ..... delimit a regular expression,
          e.g., (pattern1)|(pattern2)
# ....... wildcard (equivalent to .*)
\\ ....... unmagic next character

All other characters match only themselves.

4. LIMITATIONS

max pattern ..... 30 char (excluding meta-chars)
max errors ...... 4
max line ........ no limit
max record ...... no limit

5. EXAMPLES, APPROXIMATE SEARCHES

The following examples were done with all TRADE NAMEs in wdi93demo. options: whole=FALSE, nomagic=FALSE, nocase=FALSE

METOTREXATE
  Matches METHOTREXATE and METOTREXATO with 1 error (insertion).

PARACETIMAL Matches PARACETAMOL with 2 errors (substitutions).

PENNICILIN Matches "PENICILLIN" with 2 errors (insertion and deletion).

MOREFINE Matches MORPHINE and MORFIN with 2 errors. Also matches CEPOREXINE with 2 errors (as substring).

Matches MORPHINE and MORFIN with 2 errors. Won't match CEPOREXINE at any error level (angle brackets mean "MOR" is required).

6. EXAMPLES, REGULAR EXPRESSIONS

options: whole=FALSE, nomagic=FALSE, nocase=FALSE, okerrs=0

COLD Matches "COLD" anywhere, e.g. COLDRIN, CALMANTICOLD, and NICOLDA.

^COLD Matches "COLD" at beginning of line only, e.g., matches COLDRIN but not CALMANTICOLD or NICOLDA.

COLD$ Matches "COLD" at end of line only,

COLD$ Matches "COLD" at end of line only, e.g., matches CALMANTICOLD but not COLDRIN or NICOLDA.

[A-Z]COLD[A-Z] Matches "COLD" embedded between letters A-Z, e.g., matches NICOLDA but not COLDRIN or CALMANTICOLD.

^[AEIOU][^AEIOU]*[AEIOU]$ Matches lines starting with a vowel [AEIOU] followed by zero or more non-vowels, and ending with a vowel. The first ^ means "beginning of line" and the second ^ (inside brackets) causes the set of characters [AEIOU] to be complemented. Matches ALCYDE, A.S.A, ASPRO, EVE, etc.

A.S.A Matches "A" followed by any character followed by "S" followed by any character followed by "A", e.g., A.S.A, A-S-A, and HEADSTART.

A\\.S\\.A Matches "A" followed by a period followed by "S" followed a period followed by "A", e.g., matches A.S.A. but not A-S-A or HEADSTART.

Y.*Y.*Y or Y#Y#Y (mean the same thing) Matches line containing three Y's (a "Y" followed by any number of characters followed by a "Y" followed by any number of characters followed by a "Y"), e.g., matches ACETYLSALISYLSYRE.

(BASPO|STAND).*N Matches "BASPO" or "STAND" followed zero or more other characters followed by "N", e.g., BASPORIN, STANDACILLIN, and BASPORIDINA.

(BASPO|STAND).*N$ Matches "BASPO" or "STAND" in a line ending with "N", e.g., BASPORIN and STANDACILLIN, but not BASPORIDINA.

^S(AL)*ATE Matches "S" at beginning of line, followed by zero or more "AL"s, followed by "ATE", e.g., SALALATE, SATELLITE, and SATELOL.

^[AEIOU]*.([AEIOU].)*[AEIOU]*$ Matches lines where every other letter is a vowel, e.g., matches DOPAMINE, FUROSEMIDE, PARACETAMOL, PIXIE, AXIS, NALIDIXATE-ACID, etc.

7. BUGS

The magic character '+' (one or more of) is not supported. A workaround is to use "(pattern)(pattern)*" instead of "(pattern)+".

The magic character '.' (any character) is not effective at the beginning (or end) of a line, e.g., the pattern ".COLD" matches COLDRIN, but shouldn't. A workaround is to use a pattern like "^..*COLD" which works correctly.

The magic character '#' is exactly equivalent to ".*" and is redundant. It is provided for forward compatibility with approximate string (not regular expression) searches. However, the same thing is not done for other such characters, e.g., ',' (or) and ';' (and).

There is no way to specify a NULL character as part of a pattern.

ARES uses algorithms which are general enough to do approximate regular expression matching. Faster algorithms are available for less general cases, e.g., approximate matching of simple strings and exact regular expression matching. Such methods should be used as appropriate (but aren't, yet).

An internal buffer which holds strings to be searched is part of each search context. This buffer is re-allocated as needed, so there is no fixed limit on the size of strings which may be searched. However, on a single CPU machine, only a single dy_ares_test() runs at once, so it would be more efficient to use a single shared buffer.

In the case of a large number of strings which are much smaller than the available buffer size (typical for a server), most of the processing time is spent setting things up. This can be improved substantially by encoding multiple lines per string (i.e., with newlines). For instance, wdi93demo contains 16239 TRADE NAMEs in 1235 datatrees. Doing the search as 16239 dt_ares_test() calls takes 0.69 sec on a Sun IPX. Doing the same searches with 1235 multi-line calls takes 0.34 sec (200% faster). Unfortunately (this is the bug) the dy_ares_test() interface doesn't tell you which lines hit.

For substring searches, it would be nice if ARES would return information about which characters matched (perhaps the first and last in the match?)

The code is optimized for machines with 32-bit words.

8. REFERENCES and CREDITS

The algorithm used here is pretty much as described in the following papers:

Boyer, R.S. and Moore, J. S., "A fast string searching algorithm", Proc. ACM-SIGIR Conf on Info. Retriv., Cambridge MA, June 1989, pp 168-175.

Myers E.W. and Miller, W., "Approximate matching of regular expressions", Bull. of Mathematical Biology, 51, 1989, pp 5-37.

Tarhio J. and Ukkonen, E., "Approximate Boyer-Moore string matching", Technical Report A-1990-3, Dept. Comp. Sci., U. Helsinki, (March 1990).

Wu, S. and Mamber, U., "Fast text searching with errors", Technical Report TR-91-11, Dept. Comp. Sci., U. Arizona (June 1991).

The algorithm is structured in the manner of Wu and Mamber's use of the Boyer-Moore algorithm in agrep (re and re1), but reimplemented to provide reentrant code. This implementation doesn't use the agrep, mgrep, or monkey algorithms, and does not operate in sub-linear time. agrep contains all of these and is available by ftp from cs.arizona.edu (IP 192.12.69.5).

Related Topics

xvmerlin(1)