Daylight Programmer's Guide: Merlin and Thor Servers

18. Merlin Toolkit

18.1 Introduction

Previous chapters discussed those aspects of the Merlin Toolkit that are common with the THOR Toolkit: servers and security, databases and datatypes. In this chapter, we will cover the Merlin-specific capabilities of the Merlin Toolkit.

Merlin uses a "spreadsheet" model to represent a database. This is discussed in greater detail in the Daylight THOR-Merlin Administration Guide. Merlin has two objects, the hitlist and the column, that represent this view of the database. Two other concepts, the row and the cell, are also important, but there are no row or cell objects in Merlin.

A column object represents a "column" of data from a database, i.e. one specific field from each TDT in the database. A column is the "y axis" of the "spreadsheet" view of the database.
A "row" is the "x axis" of the spreadsheet view of the database; it is data from a single TDT. There is no row object; it is just an idea we use to convey the workings of Merlin.
A hitlist object represents an ordered set of rows from a database. That is, the object holds a set of "hits" (rows) and a particular ordering of those rows. Various search operations affect which rows belong in the hitlist, and various sort operations affect the order of the rows in the hitlist.
A "cell" is the data at the intersection of a row and a column. There is no cell object.

18.2 Tasks -- "Time Slicing"

Although Merlin is quite fast at searching and sorting, certain tasks can take a significant amount of time. Since the server has to serve many clients, tasks that take a long time have to be "sliced" into smaller time segments so that requests from various clients can be interleaved. This prevents any one client from "hogging" the server for a long time. In addition, a client can abort a time-sliced task part way through.

All sorting and searching Toolkit functions are time-sliced. These functions, and the function dt_continue() (described below), have a "status" parameter indicating how the task is progressing:

`DX_STATUS_IN_PROGRESS`	not finished, task in progress
`DX_STATUS_DONE`	finished search, target found
`DX_STATUS_NOT_FOUND`	finished search, target not found
`DX_STATUS_ERROR`	error, operation not completed

The following three functions are used in conjunction with the searching and sorting functions (which are described in detail below) to carry out time-sliced functions:

dt_continue(Handle server, RETURN integer status) ==> integer progress: Continues the current task in progress. Returns the progress on the task; dividing this value by the value returned by dt_done_when() will yield the fraction of the task that is completed. You can only call this function when a task is in progress, e.g. after a search or sort function has returned a status of DX_STATUS_IN_PROGRESS.
dt_abort(Handle server) ==> integer ok: Aborts the current task. A server can only have one task in progress for any particular client, so starting a second task (another search or sort) also has the effect of aborting the current task.
dt_done_when(Handle serverh) ==> integer done_when: Indicates the "final progress" for dt_continue(); that is, the value of "progress" that will mean the task is complete (where "progress" is the value dt_continue() returns).

A general algorithm for starting and completing a sort or search task is:

     Start the task; check the task's return status
     If return status is "in progress" then 
	 done_when = dt_done_when(server)
	 while (status is still "in progress")
	     progress = dt_continue(server, status)
	     report progress to the user
	 endwhile
     endif

The following C code fragment illustrates this in more concrete terms:

     /*** Do the search ***/
     progress = dt_mer_similarselect(hitlist, col, searchtype, action, -1,
			       &ret_status, strlen(smiles), smiles, limit,0.0,0.0);
     if (ret_status == DX_STATUS_IN_PROGRESS) {
       done_when = dt_done_when(server);
       while (ret_status == DX_STATUS_IN_PROGRESS) {
	 printf("Similarity: (%d%%)\n", (100 * progress)/done_when);
	 progress = dt_continue(server, &ret_status);
       }
     }

     /*** Let user know how it came out ***/
     if (ret_status == DX_STATUS_NOT_FOUND)
       printf("Target not found - hitlist unchanged\n");
     else if (ret_status != DX_STATUS_DONE) {
       printf("Error with similarity search:\n");
       printerrors(stdout, 0);
     }
     else
       printf("Done: %d hits in list\n", dt_mer_length(hitlist));
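
The fragment above needs a live Merlin server. As a rough, self-contained illustration of the same control flow, the sketch below replaces the server with a mock task. Everything in it (MockTask, mock_continue(), run_task(), and the ST_* codes) is hypothetical and is not part of the Toolkit; it only mimics the roles of dt_continue(), dt_done_when() and the DX_STATUS_* values.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes standing in for the DX_STATUS_* constants. */
enum { ST_IN_PROGRESS, ST_DONE, ST_NOT_FOUND, ST_ERROR };

/* Mock task: examines `total` rows, at most `slice` rows per call. */
typedef struct {
    int total;     /* plays the role of the value from dt_done_when() */
    int position;  /* rows examined so far */
    int slice;     /* rows handled in one time slice */
} MockTask;

/* Stand-in for dt_continue(): runs one slice and returns the progress. */
static int mock_continue(MockTask *t, int *status)
{
    t->position += t->slice;
    if (t->position >= t->total) {
        t->position = t->total;
        *status = ST_DONE;
    } else {
        *status = ST_IN_PROGRESS;
    }
    return t->position;
}

/* Drives a task to completion; returns the number of slices it took. */
static int run_task(MockTask *t)
{
    int status, progress, slices = 1;

    assert(t->total > 0 && t->slice > 0);
    progress = mock_continue(t, &status);      /* "start the task" */
    while (status == ST_IN_PROGRESS) {
        printf("progress: %d%%\n", (100 * progress) / t->total);
        progress = mock_continue(t, &status);
        slices++;
    }
    return slices;
}
```

With a task of 1000 rows and a slice of 300 rows, run_task() needs four slices; a real client would replace mock_continue() with the Toolkit calls shown earlier.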

18.3 Querying for Capabilities

Merlin's set of capabilities includes several ways to sort data, search data, and otherwise examine and modify hitlists. All of Merlin's capabilities are enumerated in the "include" files that come with the Toolkit; however, it is sometimes desirable to design a user interface without "hard coding" this information. That is, it might be desirable to ask the Toolkit at "run time" for its capabilities, and build the user interface (menus, etc.) using the reported capabilities.

The Merlin Toolkit provides functions that allow you to ask the Merlin system for its capabilities. For example, the sorting function takes a "sort type" parameter that indicates how the data are to be sorted (e.g. ASCII, numeric, etc.); using the "capabilities functions" described below, you can ask the server how many types of sorts are available, ask for the "name" of each one, and ask which of these are appropriate for the particular column being sorted. Using the information returned, the program can present a menu from which the user can select the sort type desired.

The following "capability querying" functions are available:

dt_mer_action2name(Handle server, integer action)
dt_mer_function2name(Handle server, integer func)
dt_mer_search2name(Handle server, integer search)
dt_mer_similar2name(Handle server, integer similar)
dt_mer_sort2name(Handle server, integer sort)
dt_mer_subselect2name(Handle server, integer subselect)
dt_mer_superselect2name(Handle server, integer superselect)
Each of the above functions returns a string containing an English-language name for the specified capability. If the capability is unknown (the parameter is out of range) or server is not a server object, returns the invalid string.

dt_mer_nactions(Handle serverh)
dt_mer_nfunctions(Handle serverh)
dt_mer_nsearches(Handle serverh)
dt_mer_nsimilars(Handle serverh)
dt_mer_nsorts(Handle serverh)
dt_mer_nsubselects(Handle serverh)
dt_mer_nsuperselects(Handle serverh)
Each of the above functions returns an integer equal to the number of valid capabilities. If the capability is unknown, or server is not a server object, returns -1.

The following C code fragment illustrates how one might use these functions to print a list of all legal sorts:

     nsorts = dt_mer_nsorts(server);
     for (sort = 0; sort < nsorts; sort = sort + 1) {
	sort_name = dt_mer_sort2name(&alen,server,sort);
	fprintf(stdout, "%d. %.*s\n", sort, alen, sort_name);
     }

Two other capability-querying functions are used to help users select capabilities that are appropriate for particular data:

dt_mer_sortapplies(Handle column, integer sort) ==> boolean applies: Returns TRUE if the specified sort can be applied to the specified column. Sorting methods and the function dt_mer_sort() are discussed below.
dt_mer_funcapplies(Handle fieldtype, int func) ==> boolean applies: Returns TRUE if the specified function can be applied to create a column of the specified field type. For example, you can't use DX_FUNC_STDDEV (standard deviation) on a fieldtype that is not numeric. Column creation is discussed below.

18.4 Column Objects

A column object is defined by three properties:

Property	Description
database	The database (Merlin pool) which is the column's parent object.
fieldtype	Defines which datatype and which field within that datatype is to be used to create the column
function	Describes how to extract the particular field from each TDT.

18.4.1 Column "Functions"

A single THOR datatree (TDT) can contain many occurrences of a particular datatype. For example, a TDT might have dozens or hundreds of names for a compound, such as brand names for a drug. Likewise, it could have many measurements of a particular physical property.

Merlin's "spreadsheet" model (rows and columns) requires that we somehow select from among the multiple occurrences of a particular type of data to create "columns" of data. To do this, we introduce the idea of a column-creation function. These functions provide various methods for choosing among the various occurrences of a particular type of data in a TDT:

DX_FUNC_FIRST: Select the first occurrence of the specified field in the TDT. This is the most commonly used function in column creation.
DX_FUNC_LAST: Select the last occurrence of the specified field in the TDT.
DX_FUNC_MIN: Select the lowest-valued occurrence of the specified field in the TDT. For numbers, this is the lowest numerical value, using a simple "<" less-than test. For ASCII data, it is the string that is the lowest lexically. Recall that in ASCII, lowercase letters [a-z] are higher than all uppercase letters [A-Z]. Thus "Baker" comes before "able".
DX_FUNC_MAX: Select the highest valued occurrence of the specified field in the TDT, as with DX_FUNC_MIN, above, except using ">".
DX_FUNC_LONGEST: Selects the longest (where length is the number of characters in the string).
DX_FUNC_SHORTEST: Selects the shortest.
DX_FUNC_AVG: Creates a column of "derived data" containing the average of all occurrences of the specified fieldtype in the row. If a particular row has no occurrences of the specified fieldtype, the cell in the column will be "not available" (usually indicated by "~"). The fieldtype must be numeric.
DX_FUNC_STDDEV: Creates a column of "derived data" containing the standard deviation of all occurrences of the specified fieldtype in the row. If a particular row has one or zero occurrences of the specified fieldtype, the cell in the column will be "not available". The fieldtype must be numeric.
DX_FUNC_COUNT: Creates a column of "derived data" containing the count of the specified fieldtype in each row. That is, goes through the rows and sums the number of occurrences of the specified fieldtype; the resulting sums become the column's contents.
DX_FUNC_ALL: Creates a column of "pseudo data" which in effect has all occurrences of the specified field type in it. The column initially appears empty; when a search is performed, the entire row is searched; if a field of the specified type is found that matches the search parameters, that field becomes the cell's value. Note that this makes these columns behave somewhat strangely, since their data changes with each search.
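
To make the idea of a column-creation function concrete, the sketch below reduces the occurrences of a numeric field in one row to a single cell value. It is an illustration only: the FN_* selectors and reduce_row() are hypothetical names, not the Toolkit's DX_FUNC_* constants or any Toolkit call, and it assumes that a count of zero occurrences is reported as 0 rather than "not available".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selectors; these are not the Toolkit's DX_FUNC_* values. */
typedef enum { FN_FIRST, FN_LAST, FN_MIN, FN_MAX, FN_AVG, FN_COUNT } ColFunc;

/*
 * Reduces the n occurrences of a numeric field in one row to one cell.
 * Returns 1 and stores the cell in *out, or returns 0 when the cell is
 * "not available" (no occurrences, for every function except FN_COUNT).
 */
static int reduce_row(const double *v, size_t n, ColFunc f, double *out)
{
    size_t i;
    double acc;

    assert(out != NULL);
    if (f == FN_COUNT) { *out = (double)n; return 1; }  /* assumed: 0 is valid */
    if (n == 0) return 0;

    acc = v[0];
    switch (f) {
    case FN_FIRST: break;
    case FN_LAST:  acc = v[n - 1]; break;
    case FN_MIN:   for (i = 1; i < n; i++) if (v[i] < acc) acc = v[i]; break;
    case FN_MAX:   for (i = 1; i < n; i++) if (v[i] > acc) acc = v[i]; break;
    case FN_AVG:   for (i = 1; i < n; i++) acc += v[i];
                   acc /= (double)n; break;
    default:       return 0;
    }
    *out = acc;
    return 1;
}
```

Applied to every row in turn, such a reduction yields exactly one value (or "not available") per row, which is what lets a multi-valued field appear as a single spreadsheet column.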

18.4.2 Creating Columns

dt_mer_alloc_column(Handle pool, Handle ftype, integer func) ==> Handle col: Creates a column of data from the database pool, using the datafield specified by ftype and the function func.
dt_mer_getnitems(Handle pool, Handle type) ==> integer nitems: If type is a datatype object, returns the number of dataitems in the pool that have the specified datatype. If type is a fieldtype object, returns the number of datafields in the pool that have the specified field type. (Particular implementations of the Merlin server may not be able to report these two numbers separately. In such cases, the server may report the number of dataitems when you request the number of datafields.)

18.4.3 Information about Columns

dt_mer_defaultsort(Handle column) ==> integer sort: Returns the index of the "most likely" sort type for the specified column. For numeric columns, returns DX_SORT_NUM; for CAS numbers returns DX_SORT_CAS; for all other sortable columns returns DX_SORT_ASC.
dt_mer_sortapplies(Handle column, integer sort) ==> boolean applies: Returns TRUE if the specified sort can be applied to the column. For example, you can't sort a numeric column by length, since length only applies to strings.
dt_mer_function(Handle column) ==> integer func: Returns the function that was used to create the column, or -1 if an error is detected.

18.4.4 Polymorphic Functions on Columns

Most (but not all) columns are a "shared resource" on the server: All clients that use a particular column refer to the same actual data. In installations where particular data are frequently used, it is possible to create "permanent" columns, using dt_hold(), that remain in the server's memory, thereby improving startup performance for client programs. For example, you might want to create a permanent SMILES column, a column of your company's ID numbers, and a column of a particular physical property that all of your users need.

Note that columns of "derived" data, such as similarity columns and columns with the function DX_FUNC_ALL, can't be shared among clients as their contents change with each search. It is not useful to use dt_hold() on such columns.

dt_hold(Handle column, string thorpassword) ==> boolean ok: Marks the specified column "held", so that it will be retained in the Merlin server's memory even when no clients are using it.
dt_isheld(Handle column) ==> boolean isheld: Returns TRUE if column is marked "hold".
dt_release(Handle column, string execpassword) ==> boolean ok: Marks the specified column "released" (not held), so that it will be removed from the Merlin server's memory when the last client deallocates it.

As mentioned in the chapter on datatype objects, column objects respond to requests about datatype and datafield properties. The following functions work when used with column objects; they are described in more detail in the chapter on Datatype objects:

dt_datatype(Handle column) ==> Handle datatype
dt_fieldtype(Handle column) ==> Handle fieldtype
dt_dfnorm(Handle obj, int norm) ==> boolean isnorm
dt_dfnormdata(Handle obj, int norm) ==> string normdata
dt_name(Handle obj) ==> string name
dt_briefname(Handle obj) ==> string briefname
dt_summary(Handle obj) ==> string summary
dt_tag(Handle obj) ==> string tag
dt_description(Handle obj) ==> string description

18.5 Hitlist Objects

A hitlist object represents an ordered set of rows from a Merlin database. Client programs typically have one primary hitlist that is used for search and sort operations, and often have auxiliary hitlists for "save/restore" and "undo" operations.

While it is possible to create as many hitlists as you like, you should remember that each one uses memory in the server (4 bytes per row in the pool). In general you should use as few as will suffice for the task at hand.

Rows in a hitlist are identified by their index in the hitlist, typically referred to as "ihit" (index of the hit).

18.5.1 Creating Hitlists

dt_mer_alloc_hitlist(Handle database) ==> Handle hitlist: Creates a hitlist. The hitlist is initially "reset" -- it contains all rows in the pool in "native" order.

18.5.2 Retrieving Data: Cells

dt_mer_cellvalue(Handle column, Handle hitlist, integer ihit) ==> string cell: Returns the value of the "cell" -- the value from the ihit position of the hitlist in the specified column. The string returned should be used or copied immediately; the Toolkit may reuse the buffer that this function returns on the next call to the Merlin Toolkit.
dt_mer_getdata(Handle hitlist, int ihit, Handle ftype, int n) ==> string data: Allows you to retrieve data without first creating a column: returns the nth occurrence of a specific fieldtype in row ihit of hitlist. The parameter ftype is a fieldtype object, and indicates what type of data is desired. It allows you to retrieve data (e.g. SMILES, conformation) whether or not you have a column of that type.
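
Because the returned buffer may be reused by the next Toolkit call, a client that needs to keep a cell value should copy it first. The following is a minimal sketch of that pattern. mock_cellvalue() is a hypothetical stand-in that imitates a call returning a pointer into a reused buffer plus a length; it is not dt_mer_cellvalue() itself, and copy_cell() is likewise an illustrative helper.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mock: returns a pointer into a buffer that the next
 * call overwrites, and stores the string length in *len. */
static const char *mock_cellvalue(int ihit, int *len)
{
    static char buf[32];
    *len = snprintf(buf, sizeof buf, "cell-%d", ihit);
    assert(*len >= 0 && (size_t)*len < sizeof buf);   /* no truncation */
    return buf;
}

/* Copies len bytes into freshly allocated, NUL-terminated storage.
 * The caller owns the result and must free() it. */
static char *copy_cell(const char *s, int len)
{
    char *copy;

    if (s == NULL || len < 0) return NULL;
    copy = malloc((size_t)len + 1);
    if (copy == NULL) return NULL;
    memcpy(copy, s, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

After a second call to mock_cellvalue(), the pointer from the first call refers to the new contents, while the copy made with copy_cell() still holds the first value.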

18.6 Sorting

To sort data in Merlin, one specifies a hitlist/column pair, thus defining the cells whose data are to be sorted, along with a sort method. The sort method specifies how the cells are to be compared to one another to determine which is "lowest" and which is "highest". There is a variety of sort methods available, as follows:

DX_SORT_ASC
Sort the data using straight ASCII comparison. Note that in the ASCII character set, all uppercase letters [A-Z] are less than all lowercase letters [a-z], so "Baker" will come before "able". If one string is a prefix of another, the longer string is considered to be greater than the shorter; thus "able-bodied" would come after "able".

DX_SORT_ANC
"Sort, no case" -- sort using straight ASCII comparison, but all lowercase characters [a-z] are converted to their equivalent uppercase [A-Z] before the comparison is made, thus eliminating case distinction. For example, "able" would come before "Baker".

DX_SORT_ANW
"Sort, no whitespace" -- sort using straight ASCII comparison, but ignore "whitespace" characters (space, tab, newline, and carriage-return -- ASCII 32, 9, 10, and 13, respectively). That is, it is equivalent to first removing all whitespace from the strings, then sorting by DX_SORT_ASC.

DX_SORT_ANP
"Sort, no punctuation" -- sort using straight ASCII comparison, but ignore punctuation characters. Punctuation characters are anything that is not alphanumeric ([A-Z], [a-z], and [0-9]). That is, it is equivalent to first removing all punctuation from the strings, then sorting by DX_SORT_ASC.

DX_SORT_ANCP -- "Sort, no case, no punctuation"
DX_SORT_ANCW -- "Sort, no case, no whitespace"
DX_SORT_ANPW -- "Sort, no punctuation, no whitespace"
DX_SORT_ANCPW -- "Sort, no case, no punctuation, no whitespace"
Each of these is a combination of sorts discussed previously.

DX_SORT_AAZ
"Sort, ASCII A-Z only" -- sort ignoring all characters except a-z and A-Z, and ignore case distinction. That is, it is equivalent to removing all non-alphabetic characters from the strings and converting all uppercase characters to their lowercase equivalent, then sorting the data by DX_SORT_ASC.

DX_SORT_NUM
"Sort numerically" -- sort a column of numbers into ascending order.

DX_SORT_NAB
"Sort numerically by absolute value" -- sort a column of numbers into ascending order by magnitude (ignore the sign of the numbers).

DX_SORT_CAS
"Sort CAS numbers" -- sort Chemical Abstracts numbers into ascending order.

DX_SORT_MFM
"Sort by molecular formula" -- sorts molecular formula. Compares each element/number combination as a single "token", so that "C20" is greater than "C2N" (an ASCII sort would put the digit "0" before the letter "N").

DX_SORT_LEN
"Sort by length" -- sorts ASCII data by length; short strings are "lower" than long strings.

One function in the Merlin Toolkit handles all sort methods:

dt_mer_sort(Handle hitlist, Handle column, integer sortmethod,integer direction, RETURN integer status) ==> progress

Begins a "sort task" (see Tasks - "Time Slicing", above). Those rows currently in hitlist are sorted using the cells from column. The data are sorted into ascending or descending order according to whether direction is DX_SORT_ASCENDING or DX_SORT_DESCENDING.

The status of the sort-task is returned in the parameter status. The function's return value is either its progress on the task (see dt_done_when()), or -2 if an error is detected. If the hitlist is short enough that the server can sort it in one time-slice, the value of status will be DX_STATUS_DONE, and no task will be in progress on the server. Otherwise, the status will be DX_STATUS_IN_PROGRESS, and dt_continue() is required to finish the task.

The Merlin server will attempt to choose the most efficient sort technique for the data in the specified column. Whatever method is chosen, the following will be true:

The sort is stable. That is, if two cells are equal, the sort won't affect their relative positions in the hitlist.
The worst-case time it takes to sort a hitlist will grow as N*log(N) time, where N is the length of the hitlist. In some cases it is much faster than this.
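
The two guarantees above (stability and N*log(N) worst-case growth) are the properties of a merge sort, so a merge sort makes a convenient illustration. The sketch below sorts an array of row ids by their cell values. It is not a description of what the Merlin server actually does internally, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Merges the sorted runs hits[lo..mid) and hits[mid..hi) using tmp.
 * Taking from the left run on ties is what keeps the sort stable. */
static void merge(int *hits, int *tmp, size_t lo, size_t mid, size_t hi,
                  const double *cells)
{
    size_t i = lo, j = mid, k = lo;

    while (i < mid && j < hi)
        tmp[k++] = (cells[hits[j]] < cells[hits[i]]) ? hits[j++] : hits[i++];
    while (i < mid) tmp[k++] = hits[i++];
    while (j < hi)  tmp[k++] = hits[j++];
    memcpy(hits + lo, tmp + lo, (hi - lo) * sizeof *hits);
}

static void sort_range(int *hits, int *tmp, size_t lo, size_t hi,
                       const double *cells)
{
    size_t mid;

    if (hi - lo < 2) return;
    mid = lo + (hi - lo) / 2;
    sort_range(hits, tmp, lo, mid, cells);
    sort_range(hits, tmp, mid, hi, cells);
    merge(hits, tmp, lo, mid, hi, cells);
}

/* Sorts the row ids in hits[0..n) ascending by cells[row].
 * Returns 0 on success, -1 if scratch memory cannot be allocated. */
static int stable_sort_hits(int *hits, size_t n, const double *cells)
{
    int *tmp;

    assert(hits != NULL || n == 0);
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL && n > 0) return -1;
    sort_range(hits, tmp, 0, n, cells);
    free(tmp);
    return 0;
}
```

Rows with equal cell values come out in the same relative order they went in, which is what allows a client to sort by one column and then by another to obtain a two-level ordering.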

dt_mer_defaultsort(Handle column) ==> integer sortmethod

Returns the default sort method for the specified column. The "default" is simply the "most likely" sort a user might choose; there is no real significance to the value this function returns. Returns -1 if column is not a sortable datatype, or if it isn't a column object.

dt_mer_sortapplies(Handle column, integer sortmethod) ==> boolean applies

Returns TRUE if the column can be sorted with the specified sort type, and FALSE if not or if the specified object is not a column object.
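
The string sort methods described in this section (DX_SORT_ASC and its no-case, no-punctuation and no-whitespace variants) can be pictured as a normalization step followed by a plain ASCII comparison. The sketch below shows that idea. The flag names, normalize() and compare_cells() are hypothetical, and two assumptions are made for illustration: punctuation is taken to be any character that is neither alphanumeric nor whitespace (so that the combined modes stay distinct), and strings are truncated to 255 characters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical flags; the Toolkit's DX_SORT_* constants are single values. */
enum { NO_CASE = 1, NO_PUNCT = 2, NO_WHITE = 4 };

/* Writes a normalized copy of src into dst (capacity cap, cap >= 1). */
static void normalize(const char *src, char *dst, size_t cap, int flags)
{
    size_t n = 0;

    assert(cap >= 1);
    for (; *src != '\0' && n + 1 < cap; src++) {
        unsigned char c = (unsigned char)*src;
        int white = (c == ' ' || c == '\t' || c == '\n' || c == '\r');

        if ((flags & NO_WHITE) && white) continue;
        if ((flags & NO_PUNCT) && !isalnum(c) && !white) continue;
        if (flags & NO_CASE) c = (unsigned char)toupper(c);
        dst[n++] = (char)c;
    }
    dst[n] = '\0';
}

/* Compares a and b after normalization; the result follows strcmp(). */
static int compare_cells(const char *a, const char *b, int flags)
{
    char na[256], nb[256];

    normalize(a, na, sizeof na, flags);
    normalize(b, nb, sizeof nb, flags);
    return strcmp(na, nb);
}
```

With no flags, "Baker" compares lower than "able"; with NO_CASE the order reverses, matching the examples given for DX_SORT_ASC and DX_SORT_ANC.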

18.7 Searching

The Merlin system's most powerful feature is its ability to search a database in a variety of ways. There are five different searching functions in the Merlin Toolkit, to perform string, numeric, similarity, sub- and superstructure searches.

In spite of the variety of searches available, all of the searching functions share most of their parameters; they all look something like the following prototype. We will describe these common parameters here in this pseudo-function definition, and for each actual function only describe those parameters that are unique.

dt_mer_xxxsearch(Handle         hitlist,
                 Handle         column,
                 integer        searchtype,
                 integer        action,
                 integer        find_next,
                 RETURN integer status,
                 ...other parameters) ==> integer progress

hitlist: Where the "hits" (the rows that meet the search criteria) will be placed. Depending on the parameter action, it may also determine which rows are searched.
column: The data that are to be searched. In some cases, such as a similarity search or a column created with DX_FUNC_ALL, the column's contents also change as a result of the search.
searchtype: Most searches have "submodes" -- for example, when searching for strings, one can choose to ignore case, whitespace, and/or punctuation.
action: Specifies what is to be done with the search results. This is discussed in detail in the following subsection, Actions.
find_next: If action is one of DX_ACTION_NEXT_HIT or DX_ACTION_NEXT_NONHIT, this parameter specifies where in the hitlist the search is to begin. The search begins at the hit after this value; to search from the hitlist's beginning, specify -1. To continue searching from a previously-found hit, specify that hit's index (the value returned by the previous invocation of the search function).
status: All searches become "tasks" on the server (see the section entitled Tasks -- "Time Slicing", above). This return parameter indicates the status of the search task. If it is DX_STATUS_IN_PROGRESS, the search is not complete, and dt_continue() is required to continue the task. If it is DX_STATUS_DONE, the search is complete and the function's return value is the hitlist's length, or for "find-next" actions, the hit's index in the hitlist. If it is DX_STATUS_NOT_FOUND, the search is complete but failed to find anything; the hitlist is unchanged. If it is DX_STATUS_ERROR, an error was detected; the task is complete, and the hitlist is unchanged.
progress: The return parameter for all search functions is their progress on the task. If status is DX_STATUS_IN_PROGRESS, then the function dt_done_when() will return a number which, when divided into progress, yields the fraction of the task that is completed. If status is DX_STATUS_DONE, then progress is the hitlist's new length. If status is DX_STATUS_NOT_FOUND, progress is not defined. If status is DX_STATUS_ERROR, the progress is -2.

18.7.1 Actions

The rows that are to be searched in the pool, and how the results of a search are to be combined with the original hitlist, are defined by an action. There are seven possible actions:

DX_ACTION_NEW_LIST: The original hitlist is discarded (cleared). All rows in the database are searched; all rows that meet the search criteria are added to the hitlist.
DX_ACTION_ADD_HITS: All rows not on the original hitlist are searched; rows that meet the search criteria are added to the end of the hitlist.
DX_ACTION_ADD_NONHITS: All rows not on the original hitlist are searched; rows that don't meet the search criteria are added to the end of the hitlist.
DX_ACTION_DEL_HITS: The rows in the original hitlist are searched; rows that meet the search criteria are removed from the hitlist.
DX_ACTION_DEL_NONHITS: The rows in the original hitlist are searched; rows that do not meet the search criteria are removed from the hitlist.
DX_ACTION_NEXT_HIT: The rows in the original hitlist are searched; as soon as a row is found that meets the search criteria, its hitlist index is returned. The hitlist is unchanged. The data in derived-data columns, such as similarity and columns created using DX_FUNC_ALL, will be altered by the search even though the hitlist is unchanged. The parameter find_next to the search functions (described above) indicates where in the hitlist the search is to begin: The first row examined is find_next + 1.
DX_ACTION_NEXT_NONHIT: Like DX_ACTION_NEXT_HIT, but returns the first row that does not match the search criteria.
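
The five list-modifying actions can be thought of as set operations on the hitlist. The sketch below models them on a small in-memory pool. It is only an illustration of the semantics described above: the Action codes, the Match callback, is_even() and apply_action() are all hypothetical, the pool is limited to 64 rows, and newly added rows are assumed to be appended in pool order. The two "find-next" actions are omitted because they leave the hitlist unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action codes mirroring five of the DX_ACTION_* names. */
typedef enum { NEW_LIST, ADD_HITS, ADD_NONHITS, DEL_HITS, DEL_NONHITS } Action;

typedef int (*Match)(int row);   /* nonzero if the row meets the criteria */

/* Example search criterion used for demonstration only. */
static int is_even(int row) { return row % 2 == 0; }

/*
 * Applies `action` to hits[0..n) over a pool of rows 0..pool-1.
 * Rewrites hits in place (its capacity must be at least pool) and
 * returns the new hitlist length.
 */
static size_t apply_action(int *hits, size_t n, int pool, Action action, Match m)
{
    unsigned char on_list[64] = {0};
    size_t i, out = 0;
    int row;

    assert(pool >= 0 && pool <= 64);
    if (action == NEW_LIST) n = 0;               /* discard the old list */
    for (i = 0; i < n; i++) {
        assert(hits[i] >= 0 && hits[i] < pool);
        on_list[hits[i]] = 1;
    }

    if (action == DEL_HITS || action == DEL_NONHITS) {
        int drop_if = (action == DEL_HITS);      /* match value to remove */
        for (i = 0; i < n; i++)
            if ((m(hits[i]) != 0) != drop_if) hits[out++] = hits[i];
        return out;
    }

    /* NEW_LIST, ADD_HITS, ADD_NONHITS: append qualifying rows not on the list */
    out = n;
    for (row = 0; row < pool; row++) {
        int want = (action == ADD_NONHITS) ? !m(row) : (m(row) != 0);
        if (!on_list[row] && want) hits[out++] = row;
    }
    return out;
}
```

Starting from the hitlist {5, 2, 3} in a pool of eight rows, DEL_NONHITS with is_even() leaves {2}, and a following ADD_HITS yields {2, 0, 4, 6}.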

18.7.2 Parametric Searches

There are two types of parametric searches: string and numeric. Only one or the other applies, according to whether the column is a numeric type or not (see dt_dfnorm()).

The parameters hitlist, column, action, find_next, status, and the return value progress are described in the description of the pseudo-search function dt_mer_xxxsearch(), above.

dt_mer_strsearch(Handle hitlist,
		 Handle  column,
		 integer searchtype,
		 integer action,
		 integer find_next,
		 RETURN  integer status,
		 string  s1,
		 string  s2) ==> integer progress

Searches the specified column for string-based values s1 and/or s2 according to the parameter searchtype as detailed in the dt_mer_strsearch() manual page.

dt_mer_numsearch(Handle  hitlist,
		 Handle  column,
		 integer action,
		 integer find_next,
		 RETURN  integer ret_status,
		 float   low_limit,
		 float   high_limit) ==> integer progress

Searches the specified column for all numbers in the range low_limit to high_limit, inclusive. There is no separate "exact match" search for numbers; for this case search with the two limits equal.

Note that unlike the other search functions, this function has no searchtype parameter; there is only one type of numeric search.
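
As an illustration of the inclusive range test and of the find_next convention (start after the given index, or pass -1 to start at the beginning), the sketch below walks an array of cell values held in hitlist order. The functions next_in_range() and count_in_range() are hypothetical and operate on local data; they are not Toolkit calls.

```c
#include <assert.h>

/*
 * Returns the index of the first hit after position find_next whose cell
 * lies in [low, high], or -1 if there is none. Pass find_next = -1 to
 * start from the beginning of the hitlist.
 */
static int next_in_range(const double *cells, int nhits, int find_next,
                         double low, double high)
{
    int i;

    assert(find_next >= -1);
    for (i = find_next + 1; i < nhits; i++)
        if (cells[i] >= low && cells[i] <= high)
            return i;
    return -1;
}

/* Counts all hits in range by repeatedly resuming after the last hit. */
static int count_in_range(const double *cells, int nhits,
                          double low, double high)
{
    int count = 0, at = -1;

    while ((at = next_in_range(cells, nhits, at, low, high)) >= 0)
        count++;
    return count;
}
```

Passing the same value for both limits gives the "exact match" behavior described above.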

18.7.3 Structural Searches

There are three types of structural searches in the Merlin Toolkit: similarity, substructure and superstructure. All structural searches typically make use of "outside" data -- data not in the specified column -- in that they implicitly use the fingerprint data (datatype FP) in the database. If fingerprints are not available, structural searches will work much more slowly.

dt_mer_similarselect(Handle hitlist, Handle column, integer similartype, integer action, integer find_next, RETURN integer ret_status, string smiles, float limit, float alpha, float beta) ==> integer progress

Searches for structures similar to the structure specified by the given SMILES string. Similarity searches are unusual in that the column you specify is a derived-data column: the similarity for each row is computed and stored in the column, then compared to limit to determine if the structure meets the search criteria. Similarity searches also make implicit use of the "Fingerprint" (FP) datatype to compute the similarity values; if a particular row doesn't have a fingerprint, its similarity will be "not available".

The parameters hitlist, action, find_next, status, and the return value progress are described in the description of the pseudo-search-function dt_mer_xxxsearch(), above. The parameter column is as described in dt_mer_xxxsearch(), but additionally it must be a column of the pseudo-datatype SIMILARITY. The parameter similartype can be either DX_SIMILAR_TANIMOTO or DX_SIMILAR_EUCLIDIAN.
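
For readers unfamiliar with fingerprint similarity, the sketch below computes the standard Tanimoto coefficient (bits set in both fingerprints divided by bits set in either) for fingerprints stored as arrays of 64-bit words. This is the textbook definition, offered as an illustration; the storage layout, the function names, and the use of -1.0 to signal "not available" are assumptions, and the server's own computation may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in a 64-bit word (portable, no compiler builtins). */
static int popcount64(uint64_t w)
{
    int n = 0;

    while (w != 0) { w &= w - 1; n++; }
    return n;
}

/*
 * Tanimoto coefficient of two fingerprints of nwords 64-bit words:
 * (bits set in both) / (bits set in either). Returns -1.0 when neither
 * fingerprint has any bit set (treated here as "not available").
 */
static double tanimoto(const uint64_t *a, const uint64_t *b, size_t nwords)
{
    int both = 0, either = 0;
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < nwords; i++) {
        both   += popcount64(a[i] & b[i]);
        either += popcount64(a[i] | b[i]);
    }
    return either == 0 ? -1.0 : (double)both / (double)either;
}
```

Identical fingerprints give 1.0, and fingerprints with no bits in common give 0.0.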

dt_mer_subselect(Handle hitlist, Handle column, integer searchtype, integer action, integer find_next, RETURN integer status, string smiles) ==> integer progress

Searches for substructures of smiles. The parameter searchtype is essentially "reserved" for future use -- DX_SUBSTRUCT_SMILES is presently the only allowed value.

The parameters hitlist, column, action, find_next, status, and the return value progress are described in the description of the pseudo-search-function dt_mer_xxxsearch(), above.

dt_mer_superselect(Handle hitlist, Handle column, integer searchtype, integer action, integer find_next, RETURN integer ret_status, string smiles) ==> integer progress

Searches for superstructures of smiles. The interpretation of smiles depends on the searchtype parameter, as follows:

searchtype == DX_SUPER_SMILES: The parameter smiles is interpreted as a SMILES string. Using SMILES, one can specify "ordinary" substructures -- substructures that have exactly-specified atoms and bonds (i.e. no SMARTS expressions).
searchtype == DX_SUPER_SMARTS: The parameter smiles is interpreted as a SMARTS string. Using SMARTS, one can specify substructures that have expressions for atoms and bonds.
searchtype == DX_SUPER_SMILESPART: The parameter smiles is interpreted as a SMILES string. Using SMILES, one can specify "ordinary" substructures -- substructures that have exactly-specified atoms and bonds (i.e. no SMARTS expressions). This search type uses the special FPP<> dataitem, if available, for screening. This search is used to rapidly find substructures within dot-separated components of SMILES, typically applicable for databases of mixtures.
searchtype == DX_SUPER_SMARTSPART: The parameter smiles is interpreted as a SMARTS string. Using SMARTS, one can specify substructures that have expressions for atoms and bonds. This search type uses the special FPP<> dataitem, if available, for screening. This search is used to rapidly find substructures within dot-separated components of SMILES, typically applicable for databases of mixtures.

During SMILES and SMARTS searches, implicit use is made of the "Fingerprint" datatype for screening purposes. If fingerprints are not available, the search may be considerably slower.
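
This chapter does not spell out how screening works. A common approach with path-based fingerprints, shown here purely as an assumed illustration, is a bitwise subset test: a candidate can only contain the query as a substructure if every bit set in the query's fingerprint is also set in the candidate's. Candidates that fail can be rejected without the expensive atom-by-atom match. The function name and the 64-bit word layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if every bit set in the query fingerprint is also set in the
 * candidate fingerprint (the candidate passes the screen and still needs
 * a full match), or 0 if the candidate can be rejected outright.
 */
static int passes_screen(const uint64_t *query, const uint64_t *cand,
                         size_t nwords)
{
    size_t i;

    assert(query != NULL && cand != NULL);
    for (i = 0; i < nwords; i++)
        if ((query[i] & ~cand[i]) != 0)
            return 0;
    return 1;
}
```

A screen of this kind can produce false positives but no false negatives, which is why a full match is still needed for every candidate that passes.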

18.7.4 Program-Object searches

The Merlin server's searching capabilities can be extended via the use of user-written program objects.

The general topic is discussed in the chapter on program objects. "Attaching" a program object to a Merlin server is discussed in the merlinserver manual page. For a specific example of program objects, see the "contrib" directory:

     $DY_ROOT/contrib/src/progob/merlinbintalk.c

A Merlin server can have several program objects attached to it. They are referenced by index, which by convention in this manual we call iprogob. The following two functions tell you how many program objects there are and their names:

dt_mer_nprogobs(Handle server) ==> Integer N: Reports the number of program objects attached to the Merlin server.
dt_mer_progob2name(Handle server, Integer iprogob) ==> String name: Gets the name of the program object attached to the Merlin server. The parameter iprogob indicates which program object, and ranges from 0 to N-1, where N is the number of program objects reported by dt_mer_nprogobs(), above.

Currently, Merlin program objects work strictly with "binary" data, such as fingerprints, and produce floating-point results, in a "Similarity" column. There are two tasks each Merlin program object can perform:

Given a string (e.g. a SMILES) and some parameters, generate binary data from that string (e.g. a fingerprint).
Given binary data (e.g. a fingerprint), compare it to every row in a column of binary data and return a result (e.g. similarity).

More specifically, the following two functions perform these tasks:

dt_mer_progob_compute(Handle server, Integer iprogob, Handle string_object, Handle parameters) ==> Handle string: Sends the contents of "string_object", followed by the contents of "parameters" (a sequence of string objects), to the program object attached to "server" indicated by "iprogob", and returns a string object containing binary data computed by the program object. The binary data are ASCII encoded; see dt_binary2ascii() for details.

dt_mer_progob_compare(Handle hitlist, Handle target_column, Handle result_column, Integer iprogob, Handle pattern_binary, Handle parameters, RETURN integer status) ==> progress: Uses a program object (see dt_mer_progob_compute(3)) to compare a binary datafield to the contents of a column of binary data, and stores the results of the comparisons in a column of numeric data. The binary data are ASCII encoded; see dt_binary2ascii() for details.
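As an illustration of the "compare" task, the following is a minimal standalone sketch; it does not call the Toolkit. It assumes the binary data are fixed-length fingerprints held as raw bytes (not ASCII encoded) and uses the Tanimoto coefficient as the comparison; the text above does not say which measure a given program object computes. The names model_tanimoto() and model_compare() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the bits set in one byte. */
static int popcount8(uint8_t b)
{
    int c = 0;
    while (b) {
        c += b & 1;
        b >>= 1;
    }
    return c;
}

/* Tanimoto coefficient of two equal-length bit strings:
   (bits set in both) / (bits set in either).
   Returns 0.0 when neither has any bit set (a choice made for this model). */
double model_tanimoto(const uint8_t *a, const uint8_t *b, size_t nbytes)
{
    int both = 0, either = 0;
    for (size_t i = 0; i < nbytes; i++) {
        both += popcount8((uint8_t)(a[i] & b[i]));
        either += popcount8((uint8_t)(a[i] | b[i]));
    }
    return either == 0 ? 0.0 : (double)both / (double)either;
}

/* Compare one pattern against every row of a "column" of fingerprints
   (nrows rows of nbytes each, stored back to back) and store one
   floating-point result per row, as a "Similarity" column would. */
void model_compare(const uint8_t *pattern, const uint8_t *column,
                   size_t nrows, size_t nbytes, double *result)
{
    for (size_t r = 0; r < nrows; r++)
        result[r] = model_tanimoto(pattern, column + r * nbytes, nbytes);
}
```

In a real server the comparison runs inside the program object and is time-sliced; this model only shows the shape of the data flow (one pattern, one binary column, one numeric result column).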

Program objects may require additional parameters to direct their computations and comparisons. For example, a fingerprinting program's computation function might take parameters controlling the size of the fingerprint and the maximum pathlength to follow in generating the fingerprint; a program object that compares mass spectra might take parameters controlling the relative importance of increasing mass. The above two functions both have a parameter named, amazingly enough, "parameters". These are sequence-of-strings objects that can carry any parameters that you need to pass to the program objects. The parameters must be represented in string form (e.g. numeric parameters must be represented in printed ASCII characters). The interpretation of these parameters is strictly up to the program object; the Merlin server simply forwards them to the program object without interpretation.
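The requirement that numeric parameters travel as printed ASCII can be sketched as follows. This is a standalone illustration, not Toolkit code: the FpParams structure, its two fields, and fp_params_to_strings() are hypothetical, loosely following the fingerprinting example above. In a real client each resulting string would be placed in a string object and appended to the "parameters" sequence.

```c
#include <stdio.h>

/* Hypothetical parameter set for a fingerprinting program object. */
typedef struct {
    int fp_size;      /* fingerprint size, in bits (illustrative) */
    int max_pathlen;  /* maximum path length to follow (illustrative) */
} FpParams;

/* Render each numeric parameter as printed ASCII text.
   Returns the number of parameter strings written (always 2 here). */
int fp_params_to_strings(const FpParams *p, char out[2][32])
{
    snprintf(out[0], 32, "%d", p->fp_size);
    snprintf(out[1], 32, "%d", p->max_pathlen);
    return 2;
}
```

The order and meaning of the strings are a private agreement between the client and the program object; the server does not inspect them.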

The program objects can also supply titles for these parameters. Two functions are provided for this purpose:

dt_mer_progob_computeparams(Handle server, Integer iprogob) ==> Handle seq_of_strings: Asks a program object, via the Merlin server, to report the names and default values for the parameters used by the function dt_mer_progob_compute(..., parameters).
dt_mer_progob_compareparams(Handle server, Integer iprogob) ==> Handle seq_of_strings: Asks a program object, via the Merlin server, to report the names and default values for the parameters used by the function dt_mer_progob_compare(..., parameters).

18.8 Other Hitlist Operations

dt_mer_clear(Handle hitlist) ==> integer nhits

Clears a hitlist. Returns the number of hits in the hitlist (i.e. zero), or -2 if an error is detected.

dt_mer_combinehitlists(Handle h1, Handle h2, int action) ==> integer nhits

Combines two hitlists in the same manner as the search operations. That is, h1 is treated as the original hitlist, h2 is treated as the result of a search; the two hitlists are combined using action, and the result placed in h1.

dt_mer_hit2id(Handle hitlist, int ihit) ==> integer id

There are two ways of identifying a row in a pool:

hit index: (Called "ihit") An index into a hitlist. This index is used for all operations on Merlin hitlists.
row id: (Called "id") An arbitrary but unique integer that identifies a particular row. You can ask for the id of a row in a hitlist, then use its id to find its position in another hitlist or in a modified version of the original hitlist. An id has no other use. An id is guaranteed to be invariant and unique over the life of a pool object.

Converts a row's hitlist index to its id. Returns the id, or -2 if an error is detected. A typical use of the id is to find a row's id, perform a search or sort, then convert the id back to ihit, the row's index in the modified hitlist. See dt_mer_id2hit(), below.

dt_mer_id2hit(Handle hitlist, int id) ==> integer ihit

Converts a row's "id" to its hitlist index, "ihit" (see dt_mer_hit2id(), above). Returns the row's index ("ihit"), or -2 if an error is detected. It is an error if the specified row is not in the hitlist.
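The typical round trip (remember a row's id, reorder the hitlist, then find the row again) can be modelled with a plain array. This is a standalone sketch that does not call the Toolkit: the hitlist is assumed to be an ordered array of non-negative row ids, and model_hit2id(), model_id2hit() and model_reverse() are hypothetical stand-ins for the corresponding operations.

```c
#define MODEL_ERR (-2)  /* error value, mirroring the -2 described above */

/* Hitlist index -> row id. Ids in this model are non-negative. */
int model_hit2id(const int *hits, int nhits, int ihit)
{
    if (ihit < 0 || ihit >= nhits)
        return MODEL_ERR;
    return hits[ihit];
}

/* Row id -> hitlist index; an error if the row is not in the hitlist. */
int model_id2hit(const int *hits, int nhits, int id)
{
    for (int i = 0; i < nhits; i++)
        if (hits[i] == id)
            return i;
    return MODEL_ERR;
}

/* Reverse the order of the hitlist without changing its contents. */
void model_reverse(int *hits, int nhits)
{
    for (int i = 0, j = nhits - 1; i < j; i++, j--) {
        int t = hits[i];
        hits[i] = hits[j];
        hits[j] = t;
    }
}
```

With hits = {40, 10, 30, 20}, the row at index 1 has id 10; after model_reverse() the same id is found at index 2, and an id that is not in the list yields the error value.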

dt_mer_invert(Handle hitlist) ==> integer nhits

Inverts the hitlist: All hit rows become non-hits and all non-hit rows become hits. The current order is lost; the new hits are in native order. Returns the number of hits in the resulting hitlist, or -2 if an error is detected.
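A small standalone model of inversion, not Toolkit code. It assumes a pool of npool rows whose ids are 0 to npool-1 and, purely for illustration, that "native" order is ascending id. The name model_invert() and the MODEL_MAX_POOL limit are hypothetical.

```c
#define MODEL_MAX_POOL 64  /* arbitrary limit for this model */

/* Replace the hitlist with its complement, in ascending-id order.
   The hits array must have room for npool entries.
   Returns the new number of hits, or -2 on bad arguments. */
int model_invert(int *hits, int nhits, int npool)
{
    char is_hit[MODEL_MAX_POOL] = {0};
    int n = 0;

    if (npool < 0 || npool > MODEL_MAX_POOL || nhits < 0 || nhits > npool)
        return -2;

    /* Record current membership; the current order is discarded. */
    for (int i = 0; i < nhits; i++) {
        if (hits[i] < 0 || hits[i] >= npool)
            return -2;
        is_hit[hits[i]] = 1;
    }

    /* Every non-hit row becomes a hit, in "native" order. */
    for (int id = 0; id < npool; id++)
        if (!is_hit[id])
            hits[n++] = id;
    return n;
}
```

Inverting twice restores the original set of rows but not the original order: in this model {4, 1, 3} in a pool of six rows becomes {0, 2, 5}, and a second inversion gives {1, 3, 4}.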

dt_mer_length(Handle hitlist) ==> integer length

Returns the number of hits in a hitlist.

dt_mer_mvbottom(Handle hitlist, int ihit) ==> integer nhits

Moves the specified hit to the end of the hitlist. Returns the (unaltered) hitlist length, or -2 if an error is detected.

dt_mer_mvtop(Handle hitlist, int ihit) ==> integer nhits

Moves the specified hit to the top of the hitlist. Returns the (unaltered) hitlist length, or -2 if an error is detected.
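The two move operations can be modelled on an array of row ids. This is a standalone sketch, not Toolkit code; it assumes that the remaining hits keep their relative order, which the descriptions above do not state explicitly. The names model_mvtop() and model_mvbottom() are hypothetical.

```c
#include <string.h>

/* Move hit ihit to index 0; earlier hits shift down by one.
   Returns the (unaltered) length, or -2 if ihit is out of range. */
int model_mvtop(int *hits, int nhits, int ihit)
{
    if (ihit < 0 || ihit >= nhits)
        return -2;
    int moved = hits[ihit];
    memmove(hits + 1, hits, (size_t)ihit * sizeof hits[0]);
    hits[0] = moved;
    return nhits;
}

/* Move hit ihit to the last index; later hits shift up by one.
   Returns the (unaltered) length, or -2 if ihit is out of range. */
int model_mvbottom(int *hits, int nhits, int ihit)
{
    if (ihit < 0 || ihit >= nhits)
        return -2;
    int moved = hits[ihit];
    memmove(hits + ihit, hits + ihit + 1,
            (size_t)(nhits - ihit - 1) * sizeof hits[0]);
    hits[nhits - 1] = moved;
    return nhits;
}
```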

dt_mer_native(Handle hitlist) ==> integer nhits

Reorders the hitlist (without changing its contents) to "native" order. Returns the number of hits (which is unchanged), or -2 if an error is detected.

"Native" order is essentially arbitrary. It sometimes corresponds to the order in which data are loaded into a database, but it should not be assumed that this is the case. The only thing guaranteed about "native" order is that it won't change during the life of the parent database object (dt_parent(hitlist)).

dt_mer_reset(Handle hitlist) ==> integer nhits

Resets and reorders a hitlist so that all rows in the pool are in it in "native" order. Returns the number of hits in the hitlist, or -2 if an error is detected.

dt_mer_reverse(Handle hitlist) ==> integer nhits

Reverses the order of the hitlist, without changing its contents. Returns the (unaltered) hitlist length, or -2 if an error is detected.

dt_mer_zapabove(Handle hitlist, integer ihit) ==> integer nhits

Deletes all hits above (but not including) the specified hitlist index ihit. If ihit is greater than or equal to the number of hits, all hits are deleted. If ihit is zero or less, no hits are deleted. Returns the hitlist's new length, or -2 if an error is detected.

dt_mer_zapbelow(Handle hitlist, integer ihit) ==> integer nhits

Deletes all hits below (but not including) the specified hitlist index ihit. If ihit is greater than the number of hits, no hits are deleted. If ihit is less than zero, all hits are deleted. Returns the hitlist's new length, or -2 if an error is detected.
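The boundary cases of the two "zap" operations are easy to get wrong by one, so here is a standalone model of them on an array of row ids. It does not call the Toolkit; model_zapabove() and model_zapbelow() are hypothetical names, and "above" is taken to mean lower indices (nearer the top of the list).

```c
#include <string.h>

/* Delete hits 0 .. ihit-1; the hit at ihit becomes the new first hit.
   Returns the new length, or -2 on bad arguments. */
int model_zapabove(int *hits, int nhits, int ihit)
{
    if (nhits < 0)
        return -2;
    if (ihit <= 0)
        return nhits;          /* nothing lies above index 0 */
    if (ihit >= nhits)
        return 0;              /* everything lies above ihit */
    memmove(hits, hits + ihit, (size_t)(nhits - ihit) * sizeof hits[0]);
    return nhits - ihit;
}

/* Delete hits ihit+1 .. nhits-1; the hit at ihit becomes the last hit.
   Returns the new length, or -2 on bad arguments. */
int model_zapbelow(int *hits, int nhits, int ihit)
{
    (void)hits;                /* truncation needs no data movement */
    if (nhits < 0)
        return -2;
    if (ihit < 0)
        return 0;              /* everything lies below ihit */
    if (ihit >= nhits - 1)
        return nhits;          /* nothing lies below the last hit */
    return ihit + 1;
}
```

Under these semantics, keeping only the first N hits of a sorted hitlist corresponds to a "zap below" at index N-1.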

dt_mer_zapna(Handle hitlist, Handle column) ==> integer nhits

Deletes rows from hitlist for which there is no data in column. Returns the hitlist's new length, or -2 if an error is detected.

dt_mer_zapnonunique(Handle hitlist, Handle column) ==> integer nhits

Deletes all rows from hitlist which, in column, have the same value as the previous row in the hitlist. Returns the hitlist's new length, or -2 if an error is detected.
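Because only adjacent rows are compared, the effect resembles the Unix "uniq" command: duplicates that are not next to each other survive. The following standalone model illustrates this; it is not Toolkit code. It assumes the column holds one integer per row id, and model_zapnonunique() is a hypothetical name.

```c
/* Remove every hit whose column value equals that of the hit just
   before it. column[id] is the value for row id.
   Returns the new length, or -2 on bad arguments. */
int model_zapnonunique(int *hits, int nhits, const int *column)
{
    int n = 0;
    int prev = 0;

    if (nhits < 0)
        return -2;
    for (int i = 0; i < nhits; i++) {
        int value = column[hits[i]];
        if (i == 0 || value != prev)
            hits[n++] = hits[i];   /* first row of a run is kept */
        prev = value;
    }
    return n;
}
```

With column values 5, 5, 7, 5, 7, 7 for rows 0 to 5 and the hitlist in that order, rows 0, 2, 3 and 4 remain; the value 5 still appears twice. It follows that the hitlist would need to be sorted on the column first if every value should appear only once.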

18.9 Saving and Restoring Hitlists

It is often necessary to save a hitlist so that it may be restored for later use, or shared with other users. For example, one might be interested in a particular subset of a database; one could use Merlin's searching capabilities to make a hitlist consisting of that subset, then save it.

The Daylight system has no concept of "indices" or other arbitrary identifiers that might be used to save a hitlist. One must use identifiers such as SMILES as the contents of a saved hitlist. Users and programmers should be aware of the implications of this: If a particular row has no identifier loaded in the Merlin Pool, it can't be saved in a hitlist -- there is no way to name it.

Hitlists are stored as TDT files. Each TDT is a "miniature" version that contains only the identifier's tag and the identifier. For example, the following hitlist might be from a database that has a number of SMILES-rooted TDTs, and a number of entries for which we have no structure, only a company ID number:

     $SMI<Oc1ccccc1>|
     $SMI<OCc1ccccc1>|
     $CID<234-54A>|
     $SMI<Oc1c(O)cccc1>|
     $CID<235-55B>|

Each row in a Merlin pool has a "root identifier" -- the root of the TDT for that row. If a pool's rows are "split out" into subtree rows (i.e. the subtree identifiers have the _P in their datatype definition), their root identifier will be the subtree's root identifier. These root identifiers are used to store hitlists:

dt_mer_getroot(Handle hitlist, integer index) ==> string id

Returns the root identifier for the specified row, as a TDT string containing only the root identifier's tag and the root identifier (e.g. "$TAG<ID>|").
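A program that reads a saved hitlist may want to split each "$TAG<ID>|" string into its tag and identifier. The following is a minimal standalone sketch, not Toolkit code; model_parse_root() is a hypothetical helper, and it assumes the simple form shown above, with no quoting or escaping inside the identifier.

```c
#include <string.h>

/* Split "$TAG<ID>|" into tag and id.
   Returns 0 on success, -1 if the string does not have that shape
   or an output buffer is too small. */
int model_parse_root(const char *tdt, char *tag, size_t tagsz,
                     char *id, size_t idsz)
{
    size_t len = strlen(tdt);
    const char *lt;

    /* The shortest well-formed string, "$<>|", has four characters. */
    if (len < 4 || tdt[0] != '$')
        return -1;
    if (tdt[len - 1] != '|' || tdt[len - 2] != '>')
        return -1;

    lt = strchr(tdt, '<');           /* end of the tag */
    if (lt == NULL)
        return -1;

    size_t taglen = (size_t)(lt - (tdt + 1));
    size_t idlen = (size_t)((tdt + len - 2) - (lt + 1));
    if (taglen >= tagsz || idlen >= idsz)
        return -1;

    memcpy(tag, tdt + 1, taglen);
    tag[taglen] = '\0';
    memcpy(id, lt + 1, idlen);
    id[idlen] = '\0';
    return 0;
}
```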

dt_mer_sethits(Handle hitlist, Handle column, Handle sos) ==> integer nset

Sets hits in a hitlist, using a sequence of identifiers and a column whose datatype is that of the identifiers being restored.

The parameter sos is a sequence of string objects, each string-object of which should contain a "$TAG<ID>|" string as described in dt_mer_getroot(), above, or a SMILES string (if the first character of the string is not a "$", then it is assumed to be a SMILES; otherwise it is assumed to be a TDT). Each identifier is added to the hitlist. Note that the hitlist is NOT cleared before the additions begin; long lists of identifiers can be added by breaking them into smaller groups. For efficiency, such groups should not be too small, say several hundred to several thousand identifiers at a time.
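The first-character rule and the advice about group size can be sketched as follows. This is a standalone illustration, not Toolkit code: model_classify(), model_batch_count() and model_batch_range() are hypothetical helpers that a client might use when feeding a long list of identifiers to the server in groups.

```c
#include <stddef.h>

typedef enum { MODEL_SMILES, MODEL_TDT } ModelIdKind;

/* A string starting with '$' is taken to be a TDT; anything else
   is taken to be a SMILES. */
ModelIdKind model_classify(const char *s)
{
    return (s[0] == '$') ? MODEL_TDT : MODEL_SMILES;
}

/* Number of groups needed to submit n identifiers, at most
   batch identifiers per group. */
size_t model_batch_count(size_t n, size_t batch)
{
    return batch == 0 ? 0 : (n + batch - 1) / batch;
}

/* Half-open index range [*begin, *end) covered by group number k. */
void model_batch_range(size_t n, size_t batch, size_t k,
                       size_t *begin, size_t *end)
{
    *begin = k * batch;
    if (*begin > n)
        *begin = n;
    *end = *begin + batch;
    if (*end > n)
        *end = n;
}
```

For example, 2500 identifiers in groups of 1000 need three calls, the last covering indices 2000 to 2499. Because the hitlist is not cleared between calls, the groups accumulate.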

Returns the number of hits in the sequence that were actually set. This can be different from the sequence's length if one or more identifiers can't be found in the column, or if one or more identifiers is duplicated, or was already in the hitlist. Returns -1 on error.
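Why the return value can be smaller than the length of the sequence is shown by the following standalone model; it is not Toolkit code. It assumes the column holds one unique string per row and matches identifiers by exact string comparison, which is a simplification. The name model_sethits() is hypothetical.

```c
#include <string.h>

/* Add identifiers to a hitlist, represented here as one membership
   flag per row. Counts only rows that were found in the column and
   were not already hits. Returns that count, or -1 on bad arguments. */
int model_sethits(char *is_hit, const char *const *column, int nrows,
                  const char *const *ids, int nids)
{
    int nset = 0;

    if (nrows < 0 || nids < 0)
        return -1;
    for (int i = 0; i < nids; i++) {
        for (int r = 0; r < nrows; r++) {
            if (strcmp(column[r], ids[i]) == 0) {
                if (!is_hit[r]) {      /* skip duplicates and existing hits */
                    is_hit[r] = 1;
                    nset++;
                }
                break;
            }
        }
        /* An identifier with no matching row is silently skipped. */
    }
    return nset;
}
```

With a four-row column in which one row is already a hit, a five-entry sequence containing a duplicate, an existing hit, and an unknown identifier sets only two new hits.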

Note: As of release 4.33, only columns with datatype SMILES ($SMI) and the pseudo-datatype "RowID" ($ROWID) will work with this function. $ROWID columns are by far the most useful, since you can feed them a sequence-of-strings object with mixed datatypes, i.e. one you got from dt_mer_getroot(), above.
