18. Merlin Toolkit
18.1 Introduction
Previous chapters discussed those aspects of the Merlin Toolkit that
are common with the THOR Toolkit:
servers and security,
databases and
datatypes.
In this chapter, we will cover the Merlin-specific
capabilities of the Merlin Toolkit.
Merlin uses a "spreadsheet" model to represent a
database. This is
discussed in greater detail in the
Daylight THOR-Merlin Administration Guide. Merlin
has two objects, the
hitlist and the
column that represent this view
of the database. Two other concepts, the row and the cell, are also
important, but there are no row or cell objects in Merlin.
- A column object represents a "column" of data from a database,
i.e. one specific field from each TDT in the database. A column is
the "y axis" of the "spreadsheet" view of the database.
- A "row" is the "x axis" of the spreadsheet view of the database;
it is data from a single TDT. There is no row object; it is just an
idea we use to convey the workings of Merlin.
- A hitlist object represents an ordered set of rows from a database. That is, the
object holds a set of "hits" (rows) and a particular
ordering of those rows. Various search operations affect which rows
belong in the hitlist, and various sort operations affect the order of
the rows in the hitlist.
- A "cell" is the data at the intersection of a row and a column.
There is no cell object.
18.2 Tasks -- "Time Slicing"
Although Merlin is quite fast at searching and sorting, certain tasks
can take a significant amount of time. Since the server has to serve
many clients, tasks that take a long time have to be "sliced" into
smaller time segments so that requests from various clients can be
interleaved. This prevents any one client from "hogging" the server
for a long time. In addition, a client can abort a time-sliced task
part way through.
All sorting and searching Toolkit functions are time-sliced. These
functions, and the function dt_continue() (described below), have a
"status" parameter indicating how the task is progressing:
DX_STATUS_IN_PROGRESS | not finished, task in progress
DX_STATUS_DONE        | finished search, target found
DX_STATUS_NOT_FOUND   | finished search, target not found
DX_STATUS_ERROR       | error, operation not completed
The following three functions are used in conjunction with the
searching and sorting functions (which are described in detail below)
to carry out time-sliced functions:
-
dt_continue(Handle server, RETURN integer status) ==> integer
progress
-
Continues the current task in progress. Returns the progress on the
task; dividing this value by the value returned by dt_done_when() will
yield the fraction of the task that is completed.
You can only call this function when a task is in progress, e.g. after a
search or sort function has returned a status of
DX_STATUS_IN_PROGRESS .
-
dt_abort(Handle server) ==> integer ok
-
Aborts the current task. A server can only have one task in progress
for any particular client, so starting a second task (another search
or sort) also has the effect of aborting the current task.
-
dt_done_when(Handle serverh) ==> integer done_when
-
Indicates the "final progress"
for dt_continue(); that is, the value
of "progress" that will mean the task is complete (where
"progress" is the value
dt_continue() returns).
A general algorithm for starting and completing a sort or search task
is:
Start the task; check the task's return status
If return status is "in progress" then
done_when = dt_done_when(server)
while (status is still "in progress")
progress = dt_continue(server, status)
report progress to the user
endwhile
endif
The following C code fragment illustrates this in more concrete
terms:
/*** Do the search ***/
progress = dt_mer_similarselect(hitlist, col, searchtype, action, -1,
                                &ret_status, strlen(smiles), smiles,
                                limit, 0.0, 0.0);
if (ret_status == DX_STATUS_IN_PROGRESS) {
    done_when = dt_done_when(server);
    while (ret_status == DX_STATUS_IN_PROGRESS) {
        printf("Similarity: (%d%%)\n", (100 * progress) / done_when);
        progress = dt_continue(server, &ret_status);
    }
}

/*** Let user know how it came out ***/
if (ret_status == DX_STATUS_NOT_FOUND)
    printf("Target not found - hitlist unchanged\n");
else if (ret_status != DX_STATUS_DONE) {
    printf("Error with similarity search:\n");
    printerrors(stdout, 0);
}
else
    printf("Done: %d hits in list\n", dt_mer_length(hitlist));
18.3 Querying for Capabilities
Merlin's set of capabilities includes several ways to sort data,
search data, and otherwise examine and modify hitlists. All of
Merlin's capabilities are enumerated in the "include" files that come
with the Toolkit; however, it is sometimes desirable to design a
user interface without "hard coding" this information. That is, it
might be desirable to ask the Toolkit at "run time" for its
capabilities, and build the user interface (menus, etc.) using the
reported capabilities.
The Merlin Toolkit provides functions that allow you to ask the
Merlin system for its capabilities. For example, the sorting
function takes a "sort type" parameter that indicates how the data are
to be sorted (e.g. ASCII, numeric, etc.); using the "capabilities
functions" described below, you can ask the server how many types of
sorts are available, ask for the "name" of each one, and ask which of
these are appropriate for the particular column being sorted. Using
the information returned, the program can present this information as
a menu from which the user can select the sort-type desired.
The following "capability querying" functions are available:
-
dt_mer_action2name(Handle server, integer action)
dt_mer_function2name(Handle server, integer func)
dt_mer_search2name(Handle server, integer search)
dt_mer_similar2name(Handle server, integer similar)
dt_mer_sort2name(Handle server, integer sort)
dt_mer_subselect2name(Handle server, integer subselect)
dt_mer_superselect2name(Handle server, integer superselect)
-
Each of the above functions returns a string containing an
English-language name for the specified capability. If the capability is
unknown (the parameter is out of range) or server is not a server
object, the function returns the invalid string.
-
dt_mer_nactions(Handle serverh)
dt_mer_nfunctions(Handle serverh)
dt_mer_nsearches(Handle serverh)
dt_mer_nsimilars(Handle serverh)
dt_mer_nsorts(Handle serverh)
dt_mer_nsubselects(Handle serverh)
dt_mer_nsuperselects(Handle serverh)
-
Each of the above functions returns an integer equal to the number of
valid capabilities. If the capability is unknown, or server is not a
server object, returns -1.
The following C code fragment illustrates how one might use these
functions to print a list of all legal sorts:
nsorts = dt_mer_nsorts(server);
for (sort = 0; sort < nsorts; sort = sort + 1) {
    sort_name = dt_mer_sort2name(&alen, server, sort);
    fprintf(stdout, "%d. %.*s\n", sort, alen, sort_name);
}
Two other capability-querying functions are used to help users select
capabilities that are appropriate for particular data:
-
dt_mer_sortapplies(Handle column, integer sort) ==> boolean applies
-
Returns TRUE if the specified sort can be applied to the specified
column.
Sorting methods and the function
dt_mer_sort() are
discussed below.
-
dt_mer_funcapplies(Handle fieldtype, int func) ==> boolean applies
-
Returns TRUE if the specified function can be applied to create a
column of the specified field type.
For example, you can't use DX_FUNC_STDDEV (standard deviation) on a
fieldtype that is not numeric.
Column creation is discussed below.
18.4 Column Objects
A column object
is defined by three properties:
Property  | Description
database  | The database (Merlin pool) that is the column's parent object.
fieldtype | Defines which datatype, and which field within that datatype, is used to create the column.
function  | Describes how to extract the particular field from each TDT.
18.4.1 Column "Functions"
A single
THOR datatree
(TDT) can contain many occurrences of a
particular datatype. For example, a TDT might have dozens or
hundreds of names for a compound, such as brand names for a drug.
Likewise, it could have many measurements of a particular physical
property.
Merlin's "spreadsheet" model (rows and columns) requires that we
somehow select from among the multiple occurrences of a particular
type of data to create "columns" of data. To do this, we introduce
the idea of a column-creation function. These functions provide
various methods for choosing among the various occurrences of a
particular type of data in a TDT:
DX_FUNC_FIRST
-
Select the first occurrence of the specified field in the TDT. This
is the most commonly used function in column creation.
DX_FUNC_LAST
-
Select the last occurrence of the specified field in the TDT.
DX_FUNC_MIN
-
Select the lowest-valued occurrence of the specified field in the
TDT. For numbers, this is the lowest numerical value, using a simple
"<" less-than test. For ASCII data, it is the string that is the
lowest lexically. Recall that in ASCII, lowercase letters [a-z] are
higher than all uppercase letters [A-Z]. Thus "Baker" comes before
"able".
DX_FUNC_MAX
-
Select the highest valued occurrence of the specified field in the
TDT, as with DX_FUNC_MIN, above, except using ">".
DX_FUNC_LONGEST
-
Selects the longest (where length is the number of characters in the
string).
DX_FUNC_SHORTEST
-
Selects the shortest.
DX_FUNC_AVG
-
Creates a column of "derived data" containing the average of
all occurrences of the specified fieldtype in the row. If a
particular row has no occurrences of the specified fieldtype, the cell
in the column will be "not available" (usually indicated by
"~"). The fieldtype must be numeric.
DX_FUNC_STDDEV
-
Creates a column of "derived data" containing the standard deviation
of all occurrences of the specified fieldtype in the row. If a
particular row has one or zero occurrences of the specified
fieldtype, the cell in the column will be "not available". The
fieldtype must be numeric.
DX_FUNC_COUNT
-
Creates a column of "derived data" containing the count of the
specified fieldtype in each row. That is, goes through the rows and
sums the number of occurrences of the specified fieldtype; the
resulting sums become the column's contents.
DX_FUNC_ALL
-
Creates a column of "pseudo data" which in effect has all
occurrences of the specified field type in it. The column initially
appears empty; when a search is performed, the entire row is searched;
if a field of the specified type is found that matches the search
parameters, that field becomes the cell's value. Note that this makes
these columns behave somewhat strangely, since their data changes with
each search.
18.4.2 Creating Columns
-
dt_mer_alloc_column(Handle pool, Handle ftype, integer func) ==> Handle col
-
Creates a
column
of data from the database pool, using the datafield
specified by ftype and the function func.
-
dt_mer_getnitems(Handle pool, Handle type) ==> integer nitems
-
If type is a datatype object, returns the number of dataitems in the
pool that have the specified datatype. If type is a fieldtype object,
returns the number of datafields in the pool that have the specified
field type. (Particular implementations of the Merlin server may not
be able to report these two numbers separately. In such cases, the
server may report the number of dataitems when you request the number
of datafields.)
18.4.3 Information about Columns
-
dt_mer_defaultsort(Handle column) ==> integer sort
-
Returns the index of the "most likely" sort type for the specified
column. For numeric columns, returns DX_SORT_NUM; for CAS numbers
returns DX_SORT_CAS; for all other sortable columns returns
DX_SORT_ASC.
-
dt_mer_sortapplies(Handle column, integer sort) ==> boolean applies
-
Returns TRUE if the specified sort can be applied
to the column. For example, you can't sort a numeric column by
length, since length only applies to strings.
-
dt_mer_function(Handle column) ==> integer func
-
Returns the function that was used to create the column, or -1 if an
error is detected.
18.4.4 Polymorphic Functions on Columns
Most (but not all) columns are a "shared resource" on the
server: All clients that use a particular column refer to the same
actual data. In installations where particular data are frequently
used, it is possible to create "permanent" columns, using dt_hold(), that remain in the server's
memory, thereby improving startup performance for client programs.
For example, you might want to create a permanent SMILES column, a
column of your company's ID number, and a column of a particular
physical property that all of your users need.
Note that columns of "derived" data, such as similarity columns and
columns with the function DX_FUNC_ALL, can't be shared among clients
as their contents change with each search. It is not useful to use
dt_hold() on such columns.
-
dt_hold(Handle column, string thorpassword) ==> boolean ok
-
Marks the specified column "held", so that it will be
retained in the Merlin server's memory even when no clients are using
it.
-
dt_isheld(Handle column) ==> boolean isheld
-
Returns TRUE if column is marked "held".
-
dt_release(Handle column, string execpassword) ==> boolean ok
-
Marks the specified column "released" (not held), so that it will be
removed from the Merlin server's memory when the last client
deallocates it.
As mentioned in the chapter on datatype objects, column objects
respond to requests about datatype and datafield properties. The
following functions work when used with column objects; they are
described in more detail in the
chapter on Datatype objects:
dt_datatype(Handle column) ==> Handle datatype
dt_fieldtype(Handle column) ==> Handle fieldtype
dt_dfnorm(Handle obj, int norm) ==> boolean isnorm
dt_dfnormdata(Handle obj, int norm) ==> string normdata
dt_name(Handle obj) ==> string name
dt_briefname(Handle obj) ==> string briefname
dt_summary(Handle obj) ==> string summary
dt_tag(Handle obj) ==> string tag
dt_description(Handle obj) ==> string description
18.5 Hitlist Objects
A hitlist object represents an
ordered set of rows from a Merlin database. Client programs typically
have one primary hitlist that is used for search and sort operations,
and often have auxiliary hitlists for "save/restore" and
"undo" operations.
While it is possible to create as many hitlists as you like, you
should remember that each one uses memory in the server (4 bytes per
row in the pool). In general you should use as few as will suffice
for the task at hand.
Rows in a hitlist are identified by their index in the hitlist,
typically referred to as "ihit" (index of the hit).
18.5.1 Creating Hitlists
-
dt_mer_alloc_hitlist(Handle database) ==> Handle hitlist
-
Creates a hitlist. The hitlist is initially "reset" -- it
contains all rows in the pool in "native" order.
18.5.2 Retrieving Data: Cells
-
dt_mer_cellvalue(Handle column, Handle hitlist, integer ihit) ==>
string cell
-
Returns the value of the "cell" -- the value from the
ihit position of the hitlist in the specified column.
The string returned should be used or copied immediately; the Toolkit
may reuse the buffer that this function returns on the next call to
the Merlin Toolkit.
-
dt_mer_getdata(Handle hitlist, int ihit, Handle ftype, int n) ==> string data
-
Allows you to retrieve data without first creating a column: returns
the nth occurrence of a specific fieldtype in row ihit of hitlist.
The parameter
ftype is a fieldtype object, and indicates
what type of data is desired. It allows you to retrieve data (e.g.
SMILES, conformation) whether or not you have a column of that type.
18.6 Sorting
To sort data in Merlin, one specifies a hitlist/column pair, thus
defining the cells whose data are to be sorted, along with a sort
method. The sort method specifies how the cells are to be compared to
one another to determine which is "lowest" and which is
"highest". There is a variety of sort methods available, as
follows:
DX_SORT_ASC
-
Sort the data using straight ASCII comparison. Note that in the ASCII
character set, all uppercase letters [A-Z] are less than all lowercase
letters [a-z], so "Baker" will come before "able".
If one string is a prefix of another, the longer string is considered
to be greater than the shorter; thus "able-bodied" would
come after "able".
DX_SORT_ANC
-
"Sort, no case" -- sort using straight ASCII comparison, but
all lowercase characters [a-z] are converted to their equivalent
uppercase [A-Z] before the comparison is made, thus eliminating case
distinction. For example, "able" would come before
"Baker".
DX_SORT_ANW
-
"Sort, no whitespace" -- sort using straight ASCII comparison, but
ignore "whitespace" characters (space, tab, newline, and
carriage-return -- ASCII 32, 9, 10, and 13, respectively). That is, it is
equivalent to first removing all whitespace from the strings, then
sorting by DX_SORT_ASC.
DX_SORT_ANP
-
"Sort, no punctuation" -- sort using straight ASCII comparison, but
ignore punctuation characters. Punctuation characters are anything
that is not alphanumeric ([A-Z], [a-z], and [0-9]). That is, it is
equivalent to first removing all punctuation from the strings, then
sorting by DX_SORT_ASC.
DX_SORT_ANCP -- "Sort, no case, no punctuation"
DX_SORT_ANCW -- "Sort, no case, no whitespace"
DX_SORT_ANPW -- "Sort, no punctuation, no whitespace"
DX_SORT_ANCPW -- "Sort, no case, no punctuation, no whitespace"
-
Each of these is a combination of sorts discussed previously.
DX_SORT_AAZ
-
"Sort, ASCII A-Z only" -- sort ignoring all characters
except a-z and A-Z, and ignore case distinction. That is, it is
equivalent to removing all non-alphabetic characters from the strings
and converting all uppercase characters to their lowercase equivalent,
then sorting the data by DX_SORT_ASC.
DX_SORT_NUM
-
"Sort numerically" -- sort a column of numbers into ascending order.
DX_SORT_NAB
-
"Sort numerically by absolute value" -- sort a column of
numbers into ascending order by magnitude (ignore the sign of the
numbers).
DX_SORT_CAS
-
"Sort CAS numbers" -- sort Chemical Abstracts numbers into ascending
order.
DX_SORT_MFM
-
"Sort by molecular formula" -- sorts molecular formula.
Compares each element/number combination as a single
"token", so that "C20" is greater than
"C2N" (an ASCII sort would put the digit "0"
before the letter "N").
DX_SORT_LEN
-
"Sort by length" -- sorts ASCII data by length; short strings are
"lower" than long strings.
One function in the Merlin Toolkit handles all sort methods:
-
dt_mer_sort(Handle hitlist, Handle column, integer sortmethod,integer direction, RETURN integer status) ==> progress
-
Begins a "sort task" (see Tasks -
"Time Slicing", above). Those rows currently in hitlist
are sorted using the cells from column. The data are sorted into
ascending or descending order according to whether direction is
DX_SORT_ASCENDING or DX_SORT_DESCENDING .
The status of the sort-task is returned in the parameter
status . The function's return value is either its
progress on the task (see
dt_done_when()), or -2 if an
error is detected. If the hitlist is short enough that the server can
sort it in one time-slice, the value of status will be
DX_STATUS_DONE , and no task will be in progress on the
server. Otherwise, the status will be
DX_STATUS_IN_PROGRESS , and dt_continue() is required to
finish the task.
The Merlin server will attempt to choose the most efficient sort
technique for the data in the specified column. Whatever method is
chosen, the following will be true:
- The sort is stable. That is, if two cells are equal, the sort
won't affect their relative positions in the hitlist.
- The worst-case time to sort a hitlist grows as N*log(N), where N
is the length of the hitlist. In some cases it is much faster than
this.
-
dt_mer_defaultsort(Handle column) ==> integer sortmethod
-
Returns the default sort method for the specified column. The
"default" is simply the "most likely" sort a user might choose; there
is no real significance to the value this function returns. Returns
-1 if column is not a sortable datatype, or if it isn't a column
object.
-
dt_mer_sortapplies(Handle column, integer sortmethod) ==> boolean
applies
-
Returns TRUE if the column can be sorted with the specified sort
type, and FALSE if not or if the specified object is not a column
object.
18.7 Searching
The Merlin system's most powerful feature is its ability to search a
database in a variety of ways. There are five different searching
functions in the Merlin Toolkit, to perform string, numeric,
similarity, sub- and superstructure searches.
In spite of the variety of searches available, all of the searching
functions share most of their parameters; they all look something
like the following prototype. We will describe these common
parameters here in this pseudo-function definition, and for each
actual function only describe those parameters that are unique.
dt_mer_xxxsearch(Handle hitlist,
Handle column,
integer searchtype,
integer action,
integer find_next,
RETURN integer status,
...other parameters) ==> integer progress
-
hitlist
-
Where the "hits" (the rows that meet the search criteria)
will be placed. Depending on the parameter action, it may also
determine which rows are searched.
column
-
The data that are to be searched. In some cases, such as a
similarity search or a column created with DX_FUNC_ALL, the column's
contents also change as a result of the search.
searchtype
-
Most searches have "submodes" -- for example, when
searching for strings, one can choose to ignore case, whitespace,
and/or punctuation.
action
-
Specifies what is to be done with the search results. This
is discussed in detail in the following subsection, Actions.
find_next
-
If action is one of
DX_ACTION_NEXT_HIT or
DX_ACTION_NEXT_NONHIT , this parameter specifies where in
the hitlist the search is to begin. The search begins at the hit
after this value; to search from the hitlist's beginning, specify -1.
To continue searching from a previously-found hit, specify that hit's
index (the value returned by the previous invocation of the search
function).
status
-
All searches become "tasks" on the server (see the section
entitled Tasks -- "Time Slicing", above). This return parameter
indicates the status of the search task. If it is
DX_STATUS_IN_PROGRESS , the search is not complete, and dt_continue()
is required to continue the task. If it is DX_STATUS_DONE , the
search is complete and the function's return value is the hitlist's
length, or for "find-next" actions, the hit's index in the hitlist.
If it is DX_STATUS_NOT_FOUND , the search is complete but failed to
find anything; the hitlist is unchanged. If it is DX_STATUS_ERROR ,
an error was detected; the task is complete, and the hitlist is
unchanged.
progress
-
The return value of all search functions is their
progress on the task. If status is
DX_STATUS_IN_PROGRESS , then dividing progress by the value
returned by dt_done_when() yields the fraction of the task that is
completed. If status is DX_STATUS_DONE , then progress is the
hitlist's new length (or, for "find-next" actions, the hit's index in
the hitlist).
If status is DX_STATUS_NOT_FOUND , progress is not defined. If status
is DX_STATUS_ERROR , the progress is -2.
18.7.1 Actions
The rows that are to be searched in the pool, and how the results of
a search are to be combined with the original hitlist, are defined by
an action. There are seven possible actions:
DX_ACTION_NEW_LIST
-
The original hitlist is discarded (cleared). All rows in the
database are searched; all rows that meet the search criteria are
added to the hitlist.
DX_ACTION_ADD_HITS
-
All rows not on the original hitlist are searched; rows that meet the
search criteria are added to the end of the hitlist.
DX_ACTION_ADD_NONHITS
-
All rows not on the original hitlist are searched; rows that don't
meet the search criteria are added to the end of the hitlist.
DX_ACTION_DEL_HITS
-
The rows in the original hitlist are searched; rows that meet the
search criteria are removed from the hitlist.
DX_ACTION_DEL_NONHITS
-
The rows in the original hitlist are searched; rows that do not meet
the search criteria are removed from the hitlist.
DX_ACTION_NEXT_HIT
-
The rows in the original hitlist are searched; as soon as a row is
found that meets the search criteria, its hitlist index is returned.
The hitlist is unchanged. The data in derived-data columns, such as
similarity and columns created using
DX_FUNC_ALL , will be altered by
the search even though the hitlist is unchanged. The parameter
find_next to the search functions (described above) indicates where
in the hitlist the search is to begin: The first row examined is
find_next + 1.
DX_ACTION_NEXT_NONHIT
-
Like
DX_ACTION_NEXT_HIT , but returns the first row that does not
match the search criteria.
18.7.2 Parametric Searches
There are two types of parametric searches: string and numeric. Only
one or the other applies, according to whether the column is a
numeric type or not (see dt_dfnorm()).
The parameters hitlist , column , action ,
find_next , status , and the return value
progress are described in the description of the pseudo-search
function dt_mer_xxxsearch() , above.
dt_mer_strsearch(Handle hitlist,
Handle column,
integer searchtype,
integer action,
integer find_next,
RETURN integer status,
string s1,
string s2) ==> integer progress
-
Searches the specified column for string-based values
s1 and/or
s2 according to the parameter searchtype as detailed in
the dt_mer_strsearch() manual page.
dt_mer_numsearch(Handle hitlist,
Handle column,
integer action,
integer find_next,
RETURN integer ret_status,
float low_limit,
float high_limit) ==> integer progress
-
Searches the specified column for all numbers in the range
low_limit to high_limit , inclusive. There
is no separate "exact match" search for numbers; for this
case search with the two limits equal.
Note that unlike the other search functions, this function has no
searchtype parameter; there is only one type of numeric
search.
18.7.3 Structural Searches
There are three types of structural searches in the Merlin Toolkit:
similarity, substructure and superstructure. All structural searches
typically make use of "outside" data -- data not in the
specified column -- in that they implicitly use the fingerprint
data (datatype FP) in the database. If fingerprints are not available,
structural searches will work much more slowly.
-
dt_mer_similarselect(Handle hitlist,
Handle column,
integer similartype,
integer action,
integer find_next,
RETURN integer ret_status,
string smiles,
float limit,
float alpha,
float beta) ==> integer progress
-
Searches for structures similar to the structure specified by the
given SMILES string. Similarity searches are unusual in that the
column you specify is a derived-data column: the similarity for each
row is computed and stored in the column, then compared to
limit to determine if the structure meets the search
criteria. Similarity searches also make implicit use of the
"Fingerprint" (FP) datatype to compute the similarity
values; if a particular row doesn't have a fingerprint, its similarity
will be "not available".
The parameters hitlist , action ,
find_next , status , and the return value
progress are described in the description of the
pseudo-search-function dt_mer_xxxsearch() , above. The
parameter column is as described in
dt_mer_xxxsearch() , but additionally it must be a column
of the pseudo-datatype SIMILARITY . The parameter
similartype can be either
DX_SIMILAR_TANIMOTO or DX_SIMILAR_EUCLIDIAN .
-
dt_mer_subselect(Handle hitlist,
Handle column,
integer searchtype,
integer action,
integer find_next,
RETURN integer status,
string smiles) ==> integer progress
-
Searches for substructures of
smiles .
The parameter searchtype is essentially "reserved" for future use --
DX_SUBSTRUCT_SMILES is presently the only allowed value.
The parameters hitlist , column , action ,
find_next , status , and the
return value progress
are described in the description of the
pseudo-search-function dt_mer_xxxsearch() , above.
-
dt_mer_superselect(Handle hitlist,
Handle column,
integer searchtype,
integer action,
integer find_next,
RETURN integer ret_status,
string smiles) ==> integer progress
-
Searches for superstructures of smiles. The interpretation of
smiles depends on the searchtype
parameter, as follows:
-
searchtype == DX_SUPER_SMILES
-
The parameter smiles is interpreted as a
SMILES string. Using SMILES,
one can specify "ordinary" substructures -- substructures
that have exactly-specified atoms and bonds (i.e. no SMARTS
expressions).
-
searchtype == DX_SUPER_SMARTS
-
The parameter
smiles is interpreted as a
SMARTS string. Using
SMARTS, one can specify substructures that have expressions for atoms
and bonds.
-
searchtype == DX_SUPER_SMILESPART
-
The parameter smiles is interpreted as a
SMILES string. Using SMILES,
one can specify "ordinary" substructures -- substructures
that have exactly-specified atoms and bonds (i.e. no SMARTS
expressions). This search type uses the special FPP<> dataitem, if
available, for screening. This search is used to rapidly find substructures
within dot-separated components of SMILES, typically applicable for databases
of mixtures.
-
searchtype == DX_SUPER_SMARTSPART
-
The parameter
smiles is interpreted as a
SMARTS string. Using
SMARTS, one can specify substructures that have expressions for atoms
and bonds. This search type uses the special FPP<> dataitem, if
available, for screening. This search is used to rapidly find substructures
within dot-separated components of SMILES, typically applicable for
databases of mixtures.
During SMILES and SMARTS searches, implicit use is made of the
"Fingerprint" datatype for screening purposes. If fingerprints
are not available, the search may be considerably slower.
18.7.4 Program-Object Searches
The Merlin server's searching capabilities can be extended via the use of
user-written program objects.
The general topic is discussed in the
chapter on program objects.
"Attaching" a program object to a Merlin server is discussed in the
merlinserver manual page.
For a specific example of program objects, see the "contrib"
directory:
$DY_ROOT/contrib/src/progob/merlinbintalk.c
A Merlin server can have several program objects attached to it. They
are referenced by index, which by convention in this manual we call
iprogob . The following two functions tell you how many
program objects there are and their names:
-
dt_mer_nprogobs(Handle server) ==> Integer N
-
Reports the number of program objects attached to the Merlin server.
-
dt_mer_progob2name(Handle server, Integer iprogob) ==> String name
-
Gets the name of the program object attached to the Merlin server.
The parameter
iprogob indicates which program object, and
ranges from 0 to N-1, where N is the number of program objects
reported by
dt_mer_nprogobs(), above.
Currently, Merlin program objects work strictly with "binary" data,
such as fingerprints, and produce floating-point results, in a
"Similarity" column. There are two tasks each Merlin program object
can perform:
- Given a string (e.g. a SMILES) and some parameters, generate binary
data from that string (e.g. a fingerprint).
- Given binary data (e.g. a fingerprint), compare it to every row
in a column of binary data and return a result (e.g. similarity).
More specifically, the following two functions perform these tasks:
-
dt_mer_progob_compute(Handle server,
Integer iprogob,
Handle string_object,
Handle parameters) ==> Handle string
-
Sends the contents of "string_object", followed by the contents of
"parameters" (a sequence of string objects), to the program object
attached to "server" indicated by "iprogob", and returns a string
object containing binary data computed by the program object.
The binary data are ASCII encoded; see
dt_binary2ascii() for details.
-
dt_mer_progob_compare(Handle hitlist,
Handle target_column,
Handle result_column,
Integer iprogob,
Handle pattern_binary,
Handle parameters,
RETURN integer status) ==> progress
-
Uses a program object (see dt_mer_progob_compute(3)) to compare
a binary datafield to the contents of a column of binary data, and stores the
results of the comparisons in a column of numeric data.
The binary data are ASCII encoded; see
dt_binary2ascii() for details.
Program objects may require additional parameters to direct their
computations and comparisons. For example, a fingerprinting program's
computation function might take parameters controlling the size of the
fingerprint and the maximum pathlength to follow in generating the
fingerprint; a program object that compares mass spectra might take
parameters controlling the relative importance of increasing mass.
The above two functions both have a parameter named, amazingly enough,
"parameters". These are sequence-of-strings objects that
can take any arbitrary parameters that you need to pass to the program
objects. The parameters must be represented in string form (e.g.
numeric parameters must be represented in printed ASCII characters).
The interpretation of these parameters is strictly up to the program
object; the Merlin server simply forwards them to the program object
without interpretation.
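Since every parameter must travel as printed ASCII, numeric values have to be formatted before they are placed in the sequence. The following is a minimal sketch of that formatting step only; the helper names param_from_int() and param_from_double() and the PARAM_LEN buffer size are hypothetical and not part of the Toolkit, and building the actual sequence-of-strings object is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PARAM_LEN 32   /* hypothetical buffer size for one parameter */

/* Render an integer parameter as printed ASCII text.
 * Returns 0 on success, -1 if the text does not fit in the buffer. */
int param_from_int(char out[PARAM_LEN], long value)
{
    int n;
    assert(out != NULL);
    n = snprintf(out, PARAM_LEN, "%ld", value);
    return (n < 0 || n >= PARAM_LEN) ? -1 : 0;
}

/* Render a floating-point parameter as printed ASCII text.
 * "%g" keeps the text short; a program object that needs more
 * digits would call for a different format. */
int param_from_double(char out[PARAM_LEN], double value)
{
    int n;
    assert(out != NULL);
    n = snprintf(out, PARAM_LEN, "%g", value);
    return (n < 0 || n >= PARAM_LEN) ? -1 : 0;
}
```

Each resulting string would then become one element of the "parameters" sequence. What a given position in the sequence means is entirely up to the program object.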
The program objects can also supply titles for these parameters. Two
functions are provided for this purpose:
-
dt_mer_progob_computeparams(dt_Handle server, dt_Integer iprogob) ==> Handle seq_of_strings
-
Asks a program object, via the Merlin server, to report the names
and default values for the parameters used by the function
dt_mer_progob_compute(..., parameters).
-
dt_mer_progob_compareparams(dt_Handle server, dt_Integer iprogob) ==> Handle seq_of_strings
-
Asks a program object, via the Merlin server, to report the names
and default values for the parameters used by the function
dt_mer_progob_compare(..., parameters).
18.8 Other Hitlist Operations
-
dt_mer_clear(Handle hitlist) ==> boolean ok
-
Clears a hitlist. Returns the number of hits in the hitlist (i.e.
zero), or -2 if an error is detected.
-
dt_mer_combinehitlists(Handle h1, Handle h2, int action) ==> integer nhits
-
Combines two hitlists in the same manner as the search operations.
That is, h1 is treated as the original hitlist and h2 as the result
of a search; the two hitlists are combined using action, and the
result is placed in h1.
-
dt_mer_hit2id(Handle hitlist, int ihit) ==> integer id
-
There are two ways of identifying a row in a pool:
- hit index
-
(Called "ihit") An index into a hitlist. This index is used for all
operations on Merlin hitlists.
- row id
-
(Called "id") An arbitrary but unique integer that
identifies a particular row. You can ask for the id of a row in a
hitlist, then use its id to find its position in another hitlist or
in a modified version of the original hitlist. An id has no other
use. An id is guaranteed to be invariant and unique over the life of
a pool object.
Converts a row's hitlist index to its id. Returns the id, or -2 if
an error is detected. A typical use is to get a row's id, perform a
search or sort, then convert the id back to ihit, the row's index in
the modified hitlist. See dt_mer_id2hit(), below.
-
dt_mer_id2hit(Handle hitlist, int id) ==> integer ihit
-
Converts a row's "id" to its hitlist index, "ihit" (see
dt_mer_hit2id(), above). Returns the row's index ("ihit"), or -2 if
an error is detected. It is an error if the specified row is not in
the hitlist.
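The round trip between hit index and row id can be illustrated with a small self-contained model. This is a sketch only: ModelHitlist, model_hit2id(), model_id2hit() and model_reverse() are hypothetical stand-ins written for this example, not Toolkit calls, and the model assumes row ids are non-negative so that -2 can signal an error.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ERROR (-2)   /* mirrors the -2 error return described above */

/* Toy hitlist: the position in ids[] is the hit index ("ihit"),
 * the stored value is the row id. */
typedef struct {
    int ids[16];
    int nhits;
} ModelHitlist;

/* Hit index -> row id, or MODEL_ERROR if ihit is out of range. */
int model_hit2id(const ModelHitlist *h, int ihit)
{
    if (h == NULL || ihit < 0 || ihit >= h->nhits)
        return MODEL_ERROR;
    return h->ids[ihit];
}

/* Row id -> hit index, or MODEL_ERROR if the row is not in the hitlist. */
int model_id2hit(const ModelHitlist *h, int id)
{
    int i;
    if (h == NULL)
        return MODEL_ERROR;
    for (i = 0; i < h->nhits; i++)
        if (h->ids[i] == id)
            return i;
    return MODEL_ERROR;
}

/* Reverse the order of the hits without changing the contents. */
void model_reverse(ModelHitlist *h)
{
    int lo, hi;
    assert(h != NULL);
    for (lo = 0, hi = h->nhits - 1; lo < hi; lo++, hi--) {
        int tmp = h->ids[lo];
        h->ids[lo] = h->ids[hi];
        h->ids[hi] = tmp;
    }
}
```

In this model, a caller remembers a row's id before reordering and asks for its new index afterwards. A row that has since been removed from the hitlist yields the error value rather than an index.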
-
dt_mer_invert(Handle hitlist) ==> boolean ok
-
Inverts the hitlist: all hit rows become non-hits and all non-hit rows
become hits. The current order is lost; the new hits are in native
order. Returns the number of hits in the resulting hitlist, or -2 if
an error is detected.
-
dt_mer_length(Handle hitlist) ==> integer length
-
Returns the number of hits in a hitlist.
-
dt_mer_mvbottom(Handle hitlist, int ihit) ==> integer nhits
-
Moves the specified hit to the end of the hitlist. Returns the
(unaltered) hitlist length, or -2 if an error is detected.
-
dt_mer_mvtop(Handle hitlist, int ihit) ==> integer nhits
-
Moves the specified hit to the top of the hitlist. Returns the
(unaltered) hitlist length, or -2 if an error is detected.
-
dt_mer_native(Handle hitlist) ==> integer nhits
-
Reorders the hitlist (without changing its contents) to "native"
order. Returns the number of hits (which is unchanged), or -2 if an
error is detected.
"Native" order is essentially arbitrary. It sometimes
corresponds to the order in which data are loaded into a database, but
it should not be assumed that this is the case. The only thing
guaranteed about "native" order is that it won't change
during the life of the parent database object (dt_parent(hitlist)).
-
dt_mer_reset(Handle hitlist) ==> integer nhits
-
Resets and reorders a hitlist so that all rows in the pool are in it,
in "native" order. Returns the number of hits in the hitlist, or -2
if an error is detected.
-
dt_mer_reverse(Handle hitlist) ==> integer nhits
-
Reverses the order of the hitlist, without changing its contents.
Returns the (unaltered) hitlist length, or -2 if an error is
detected.
-
dt_mer_zapabove(Handle hitlist, integer ihit) ==> integer nhits
-
Deletes all hits above (but not including) the specified hitlist index
ihit. If ihit is greater than or equal to the number of hits, all
hits are deleted. If ihit is zero or less, no hits are deleted.
Returns the hitlist's new length, or -2 if an error is detected.
-
dt_mer_zapbelow(Handle hitlist, integer ihit) ==> integer nhits
-
Deletes all hits below (but not including) the specified hitlist index
ihit. If ihit is greater than the number of hits, no hits are
deleted. If ihit is less than zero, all hits are deleted. Returns
the hitlist's new length, or -2 if an error is detected.
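The boundary cases of the two "zap" operations above are easy to get wrong, so here is a minimal model of them on a plain array. It is a sketch under stated assumptions: model_zapabove() and model_zapbelow() are hypothetical names, index 0 is taken to be the top of the hitlist, and "above" is taken to mean "at a smaller hit index", which is what the documented boundary cases imply.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Delete every hit above (at a smaller index than) ihit.
 * hits[0] is the top of the list. Returns the new length. */
int model_zapabove(int *hits, int nhits, int ihit)
{
    assert(hits != NULL && nhits >= 0);
    if (ihit <= 0)
        return nhits;            /* nothing lies above the top hit */
    if (ihit >= nhits)
        return 0;                /* every hit lies above ihit */
    memmove(hits, hits + ihit, (size_t)(nhits - ihit) * sizeof hits[0]);
    return nhits - ihit;
}

/* Delete every hit below (at a larger index than) ihit.
 * The surviving hits keep their positions. Returns the new length. */
int model_zapbelow(int *hits, int nhits, int ihit)
{
    assert(hits != NULL && nhits >= 0);
    if (ihit < 0)
        return 0;                /* every hit lies below ihit */
    if (ihit >= nhits - 1)
        return nhits;            /* nothing lies below the last hit */
    return ihit + 1;
}
```

In both cases the hit at ihit itself survives, so a caller can keep hits i through j by zapping below j first and then above i.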
-
dt_mer_zapna(Handle hitlist, Handle column) ==> integer nhits
-
Deletes rows from hitlist for which there is no data in column.
Returns the hitlist's new length, or -2 if an error is detected.
-
dt_mer_zapnonunique(Handle hitlist, Handle column) ==> integer nhits
-
Deletes all rows from hitlist which, in column, have the same value as
the previous row in the hitlist. Returns the hitlist's new length, or
-2 if an error is detected.
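Because dt_mer_zapnonunique() compares each row only with the row before it, its effect depends on the current order of the hitlist. The sketch below models that rule on an array of cell values; model_zapnonunique() is a hypothetical name, and the model assumes every row has a value (rows with no data are not considered).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Model of "delete each row whose value equals the previous row's".
 * values[i] is the cell value of hit i in the chosen column.
 * Surviving values are compacted to the front of the array.
 * Returns the new length. */
int model_zapnonunique(const char **values, int nhits)
{
    int i, kept;
    assert(values != NULL || nhits <= 0);
    if (nhits <= 0)
        return 0;
    kept = 1;                               /* the first row always survives */
    for (i = 1; i < nhits; i++) {
        if (strcmp(values[i], values[kept - 1]) != 0)
            values[kept++] = values[i];     /* differs from previous: keep */
    }
    return kept;
}
```

In this model the values A, A, B, A, A, A, C reduce to A, B, A, C: the second run of A survives because it is not adjacent to the first. Sorting the hitlist on the same column beforehand would bring equal values together, so that one row per distinct value remains.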
18.9 Saving and Restoring Hitlists
It is often necessary to save a hitlist so that it may be restored
for later use, or shared with other users. For example, one might be
interested in a particular subset of a database; one could use
Merlin's searching capabilities to make a hitlist consisting of that
subset, then save it.
The Daylight system has no concept of "indices" or other arbitrary
identifiers that might be used to save a hitlist. One must use
identifiers such as SMILES as the contents of a saved hitlist. Users
and programmers should be aware of the implications of this: If a
particular row has no identifier loaded in the Merlin Pool, it can't
be saved in a hitlist -- there is no way to name it.
Hitlists are stored as TDT files. Each TDT is a "miniature" version
that contains only the identifier's tag and the identifier. For
example, the following hitlist might be from a database that has a
number of SMILES-rooted TDTs, and a number of entries for which we
have no structure, only a company ID number:
$SMI<Oc1ccccc1>|
$SMI<OCc1ccccc1>|
$CID<234-54A>|
$SMI<Oc1c(O)cccc1>|
$CID<235-55B>|
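A program that reads such a file back needs to split each line into its tag and its identifier. The following is a minimal sketch of that step; parse_saved_hit() is a hypothetical helper, not a Toolkit function, and it assumes one "$TAG<ID>|" entry per line with no TDT quoting or escaping inside the identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split one saved-hitlist line of the form "$TAG<ID>|" into tag and id.
 * A trailing newline is tolerated. Returns 0 on success, -1 if the line
 * is malformed or does not fit in the supplied buffers. */
int parse_saved_hit(const char *line, char *tag, size_t tagsz,
                    char *id, size_t idsz)
{
    size_t len, taglen, idlen;
    const char *lt;

    assert(tag != NULL && id != NULL);
    if (line == NULL || line[0] != '$')
        return -1;

    len = strlen(line);
    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        len--;
    if (len < 4 || line[len - 2] != '>' || line[len - 1] != '|')
        return -1;

    lt = memchr(line, '<', len - 2);        /* first '<' ends the tag */
    if (lt == NULL)
        return -1;

    taglen = (size_t)(lt - (line + 1));
    idlen = (size_t)((line + len - 2) - (lt + 1));
    if (taglen == 0 || taglen >= tagsz || idlen >= idsz)
        return -1;

    memcpy(tag, line + 1, taglen);
    tag[taglen] = '\0';
    memcpy(id, lt + 1, idlen);
    id[idlen] = '\0';
    return 0;
}
```

Applied to the first line of the example above, this yields the tag "SMI" and the identifier "Oc1ccccc1".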
Each row in a Merlin pool has a "root identifier" -- the
root of the TDT for that row. If a pool's rows are "split
out" into subtree rows (i.e. the subtree identifiers have
_P in their datatype definition), their root identifier will be the
subtree's root identifier. These root identifiers are used to store
hitlists:
-
dt_mer_getroot(Handle hitlist, integer index) ==> string id
-
Returns the root identifier for the specified row, as a TDT string
containing only the root identifier's tag and the root identifier
(e.g. "$TAG<ID>|").
-
dt_mer_sethits(Handle hitlist, Handle column, Handle sos) ==> integer nset
-
Sets hits in a hitlist, using a sequence of identifiers and a column
whose datatype is that of the identifiers being restored.
The parameter sos is a sequence of string objects, each of which
should contain either a "$TAG<ID>|" string as described in
dt_mer_getroot(), above, or a SMILES string. If the first character
of the string is not a "$", it is assumed to be a SMILES; otherwise
it is assumed to be a TDT. Each identifier is added to the hitlist.
Note that the hitlist is NOT cleared before the additions begin; long
lists of identifiers can be added by breaking them into smaller
groups. For efficiency, such groups should not be too small, say
several hundred to several thousand identifiers at a time.
Returns the number of hits in the sequence that were actually set.
This can be different from the sequence's length if one or more
identifiers can't be found in the column, or if one or more
identifiers is duplicated, or was already in the hitlist. Returns -1
on error.
Note: As of release 4.33, only columns with datatype SMILES ($SMI)
and the pseudo-datatype "RowID" ($ROWID) will work with this
function. $ROWID columns are by far the most useful, since you can
feed them a sequence-of-strings object with mixed datatypes, i.e. one
you got from dt_mer_getroot(), above.
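Two details of dt_mer_sethits() lend themselves to a short illustration: the rule that decides whether an entry is read as a TDT or as a SMILES, and the arithmetic for splitting a long list into groups. The sketch below models both in plain C. The names classify_entry(), batch_count() and batch_length() are hypothetical, and the batch size in the note that follows is only an example within the range suggested above.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ENTRY_SMILES, ENTRY_TDT } EntryKind;

/* An entry starting with '$' is read as a TDT; anything else as a SMILES. */
EntryKind classify_entry(const char *s)
{
    return (s != NULL && s[0] == '$') ? ENTRY_TDT : ENTRY_SMILES;
}

/* Number of groups needed to submit `total` identifiers,
 * `batch_size` at a time. */
int batch_count(int total, int batch_size)
{
    assert(batch_size > 0);
    if (total <= 0)
        return 0;
    return (total + batch_size - 1) / batch_size;
}

/* Length of group number ibatch (0-based); the last group may be short.
 * Returns 0 for a group index outside the list. */
int batch_length(int total, int batch_size, int ibatch)
{
    int start;
    assert(batch_size > 0);
    if (ibatch < 0)
        return 0;
    start = ibatch * batch_size;
    if (start >= total)
        return 0;
    return (total - start < batch_size) ? total - start : batch_size;
}
```

With 2500 identifiers and a group size of 1000, this gives three groups of 1000, 1000 and 500. Since the hitlist is not cleared between calls, the counts returned by successive calls would be summed to get the total number of hits actually set.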