Daylight v4.9 Release Date: 1 February 2008 Nameares - approximate regular-expression searchDescription1. INTRODUCTION2. USER INTERFACE 3. ARES REGULAR EXPRESSION LANGUAGE 4. LIMITATIONS 5. EXAMPLES, APPROXIMATE SEARCHES 6. EXAMPLES, REGULAR EXPRESSIONS 7. BUGS 8. REFERENCES and CREDITS 1. INTRODUCTION This file documents the Daylight ARES utility, a regular expression matcher with the capability of doing approximate matching. Although ARES was primarily designed for the Merlin server, ARES is likely to be used in other components of the Daylight software. This document is intended to be an authoritative ARES reference for all such applications. End users are likely to find sections 1-6 most useful. ARES implements a very general form of string searching: approximate regular expression matching. Since a regular expression can represent a simple string, ARES provides the ability to do approximate matching of simple strings. One can also use ARES for exact matching of strings or regular expressions by setting the number of allowable errors to zero. (However, faster methods are available for exact string matching.) Approximate string searching differs from conventional string searching several important ways. Most importantly, strings are defined to match a pattern even if they don't match the pattern exactly, i.e., the string matches the pattern with a certain number of "errors". With ARES, each character insertion, deletion, or substitution counts as one error. For instance, if you use ARES to look for names containing "PARACETIMAL", names containing "PARACETAMOL" will be found with 2 (substitution) errors. ARES allows you to specify parts of the pattern which must be matched exactly by enclosing them in angle brackets, e.g., the pattern "HYDRONOL" matches HYDROKON with two errors, but the pattern "HYDRON<OL>" will not match HYDROKON at any error level, since "OL" is required exactly. With approximate string searching, one tends to look for complete strings (as opposed to substrings) more often than one does with exact matching. For instance, when looking for the name "ALUMINUM", you might wish to also find the name "ALUMINIUM" (British spelling?) but not "ALUMINIUM CHLORIDE". Although one can do this with a simple regular expression (^ALUMINUM$), ARES provides an simple option to search for "whole" strings. ARES accepts patterns written as regular expressions. Although regular expressions allow complex patterns to be specified, the concept of insertion, deletion, and substitution errors is just as valid as with simple string patterns. More information on ARES' regular expression language may be found below.
2. USER INTERFACE The user interface is implementation-dependent, but generally will consist of following elements:
- User input area: enter string or regular expression to search for here. The search string is limited to about 30 characters, excluding "magic characters" (regular expression meta-characters). 3. ARES REGULAR EXPRESSION LANGUAGE
^ ....... beginning of line $ ....... end of line * ....... 0 or more of previous expression [..] .... character list, [^..] is complement, [a-z] is range . ....... any character | ....... OR (union) ( ) ..... delimit a regular expression, e.g., (pattern1)|(pattern2) # ....... wildcard (equivalent to .*) \\ ....... unmagic next character All other characters match only themselves.
4. LIMITATIONS
max pattern ..... 30 char (excluding meta-chars) max errors ...... 4 max line ........ no limit max record ...... no limit
5. EXAMPLES, APPROXIMATE SEARCHES The following examples were done with all TRADE NAMEs in wdi93demo. options: whole=FALSE, nomagic=FALSE, nocase=FALSE
METOTREXATE Matches METHOTREXATE and METOTREXATO with 1 error (insertion).
6. EXAMPLES, REGULAR EXPRESSIONS
options: whole=FALSE, nomagic=FALSE, nocase=FALSE, okerrs=0
7. BUGS The magic character '+' (one or more of) is not supported. A workaround is to use "(pattern)(pattern)*" instead of "(pattern)+". The magic character '.' (any character) is not effective at the beginning (or end) of a line, e.g., the pattern ".COLD" matches COLDRIN, but shouldn't. A workaround is to use a pattern like "^..*COLD" which works correctly. The magic character '#' is exactly equivalent to ".*" and is redundant. It is provided for forward compatibility with approximate string (not regular expression) searches. However, the same thing is not done for other such characters, e.g., ',' (or) and ';' (and). There is no way to specify a NULL character as part of a pattern. ARES uses algorithms which are general enough to do approximate regular expression matching. Faster algorithms are available for less general cases, e.g., approximate matching of simple strings and exact regular expression matching. Such methods should be used as appropriate (but aren't, yet). An internal buffer which holds strings to be searched is part of each search context. This buffer is re-allocated as needed, so there is no fixed limit on the size of strings which may be searched. However, on a single CPU machine, only a single dy_ares_test() runs at once, so it would be more efficient to use a single shared buffer. In the case of a large number of strings which are much smaller than the available buffer size (typical for a server), most of the processing time is spent setting things up. This can be improved substantially by encoding multiple lines per string (i.e., with newlines). For instance, wdi93demo contains 16239 TRADE NAMEs in 1235 datatrees. Doing the search as 16239 dt_ares_test() calls takes 0.69 sec on a Sun IPX. Doing the same searches with 1235 multi-line calls takes 0.34 sec (200% faster). Unfortunately (this is the bug) the dy_ares_test() interface doesn't tell you which lines hit. For substring searches, it would be nice if ARES would return information about which characters matched (perhaps the first and last in the match?) The code is optimized for machines with 32-bit words.
8. REFERENCES and CREDITS The algorithm used here is pretty much as described in the following papers: Boyer, R.S. and Moore, J. S., "A fast string searching algorithm", Proc. ACM-SIGIR Conf on Info. Retriv., Cambridge MA, June 1989, pp 168-175. Myers E.W. and Miller, W., "Approximate matching of regular expressions", Bull. of Mathematical Biology, 51, 1989, pp 5-37. Tarhio J. and Ukkonen, E., "Approximate Boyer-Moore string matching", Technical Report A-1990-3, Dept. Comp. Sci., U. Helsinki, (March 1990). Wu, S. and Mamber, U., "Fast text searching with errors", Technical Report TR-91-11, Dept. Comp. Sci., U. Arizona (June 1991). The algorithm is structured in the manner of Wu and Mamber's use of the Boyer-Moore algorithm in agrep (re and re1), but reimplemented to provide reentrant code. This implementation doesn't use the agrep, mgrep, or monkey algorithms, and does not operate in sub-linear time. agrep contains all of these and is available by ftp from cs.arizona.edu (IP 192.12.69.5). Related Topicsxvmerlin(1) |