MCL is an English language interface to Merlin. In general:
o MCL is intended to be readable and writable by both machines and humans.
o MCL statements consist of reserved words and other text, mainly user-defined names for things such as columns and hitlists.
o MCL statements resemble English sentences: the first word starts with a capital letter; statements end with a period; everything is case sensitive. Correct MCL grammar is also correct English grammar.
o Whitespace outside quoted text is ignored. Line numbers are reported for errors; line-breaks (newlines) are otherwise ignored.
o Non-reserved text needs to be quoted only if it collides with reserved words or contains characters other than letters or digits. For clarity, such text is uniformly quoted in this document.
2. CONVENTIONS USED IN THIS DOCUMENT
The following section includes prototypes and brief explanations for all MCL statements, using these conventions.
text ..... user text prototype; quotes aren't needed if the text doesn't collide with a reserved word and contains only letters and digits.
word ..... unquoted words are case-sensitive reserved words
[A] ...... optional specification, i.e., A may be omitted
A | B .... optional choice, i.e., accepts A or B or neither
[A ...] .. repeating specification, i.e., A may be repeated
The following prototypes are used in this document:
"column" .......... column's name (from previous Create column...)
"database" ........ database name in the form: base@host:service:user
"field" ........... field name ("pH") or field number ("#5")
"fontname" ........ short font name as per option, or "default"
"hitlist" ......... hitlist name
"hitlist2" ........ name of a different hitlist than "hitlist"
"integer" or "n" .. an integer, e.g., just digits
"lpr" ............. lines per row, an integer in range 1-8
"max" ............. maximum value, form is context dependent
"min" ............. minimum value, form is context dependent
"name" ............ user supplied name, max length 128 chars
"regexp" .......... regular expression
"smarts" .......... molecular pattern in the SMARTS language
"smiles" .......... molecule in the SMILES language
"string" .......... string of characters except NUL (\" means ").
"tag" ............. datatype tag
"tan" ............. value in the range 0.0 - 1.0
User entry conventions:
+---------------+---------------------------+
| this entry    | is the same as            |
+===============+===========================+
| "column"      | [ column ] "column"       |
+---------------+---------------------------+
| ["hitlist"]   | [ [ hitlist ] "hitlist" ] |
+---------------+---------------------------+
| alpha         | alpha "alpha"             |
+---------------+---------------------------+
| beta          | beta "beta"               |
+===============+===========================+
Assume this has been done: Set default hitlist "defhits".
+-------------------------------+------------------------------------+
| this statement                | is equivalent to full statement    |
+===============================+====================================+
| Remove "CLUSTER" repeats.     | Remove column "CLUSTER" repeats.   |
+-------------------------------+------------------------------------+
| Reset "defhits".              | Reset hitlist "defhits".           |
+-------------------------------+------------------------------------+
| Reset.                        | Reset hitlist "defhits".           |
+-------------------------------+------------------------------------+
| Move to row 10 of "defhits".  | Move to row 10 of "defhits".       |
+-------------------------------+------------------------------------+
| Move to row 10.               | Move to row 10 of "defhits".       |
+-------------------------------+------------------------------------+
| ... Tversky 0.9 0.1 ...       | ... Tversky alpha 0.9 beta 0.1 ... |
+===============================+====================================+
3. RESERVED WORDS
The following (case-sensitive) words are reserved in MCL:
above          default        least          Print          structure(s)
Add            depiction      less           Put            submixture(s)
alpha          Display        lines          range          substring
ares           down           list           Read           substructure(s)
as             entitled       matching       regexp         supermixture(s)
at             Exchange       missing        Remove         superstructure(s)
available      field          mixture(s)     repeats        table
below          file           molecule(s)    Reset          tautomer(s)
beta           font           Move           Reverse        text
by             Free           named          row            than
Clear          from           nativeorder    Select         to
column         function       next           Set            Tversky
containing     graph(s)       non            similar        tversky
Copy           hitlist        not            similarity     up
Create         in             of             Sort           value(s)
database       into           pattern        status         with
datatype       Invert         per            string(s)      Write
The following words (sort flags) are also reserved in MCL:
/AAZ /ANC /ANCP /ANCPW /ANCW /ANP /ANPW /ANW /ASC /CAS /LEN /MF /NAB /NUM
The following pairs of reserved words are synonyms:
graph ........... graphs
molecule ........ molecules
string .......... strings
structure ....... structures
substructure .... substructures
superstructure .. superstructures
tautomer ........ tautomers
value ........... values
Additionally, the symbols $1 - $9 are reserved as formal parameters.
4. THE MCL ENVIRONMENT
MCL is designed to run in at least two different environments: internally in an interactive program such as xvmerlin, and as a non-interactive "batch" process. Actually, in both cases MCL runs in a non-interactive way, i.e., a whole program (or at least a whole statement) is expected to be delivered to the MCL processor. MCL features which are environment-specific are described here.
A number of MCL output-control statements are provided for use in an xvmerlin-like environment. These statements are ignored in an ASCII-oriented batch environment. Examples of such commands are:
Set font "Times-14".
Display smiles as depiction.
In an interactive environment like xvmerlin, user input is typically transcribed into an MCL program, which is then run. Although this is possible in a batch environment, it is often more convenient to use a fixed MCL program with data supplied externally, e.g., via command line arguments in the `mcl' program. The symbols $1 - $9 are used to refer to formal parameters of the MCL program. Each symbol represents an externally supplied string which is used as an MCL language token. If a parameter is referenced in MCL but not supplied by the environment, the MCL program generates an error and quits.
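The formal-parameter rule above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not the `mcl' program's implementation: the function name substitute_parameters is hypothetical, and wrapping each supplied string in double quotes is an assumption made so that the value is always read as a single token.

```python
import re


def substitute_parameters(program: str, args: list[str]) -> str:
    """Replace $1..$9 in MCL text with externally supplied strings.

    Hypothetical helper: models the rule that a referenced but
    unsupplied parameter is an error.
    """
    def replace(match: re.Match) -> str:
        index = int(match.group(1))
        if index > len(args):
            raise ValueError(f"parameter ${index} referenced but not supplied")
        # Assumption: quote the value so it is treated as one token.
        return '"' + args[index - 1] + '"'

    return re.sub(r"\$([1-9])", replace, program)


program = 'Put "SMILES" superstructures of $1 into "hits".'
print(substitute_parameters(program, ["c1ccccc1"]))
```

Datatype tags such as $SMI are left alone because only a dollar sign followed by a digit 1-9 is treated as a parameter.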
A few (very few) other MCL statements operate differently in different environments. The most important example is "Select database...". In an interactive environment such as xvmerlin, the database to be selected is expected to be already open; if not, the statement fails and a warning is issued. In a batch environment, "Select database..." connects to servers and opens the specified database(s); the statement only fails if a database fails to open for some reason.
5. MCL OBJECTS
The most fundamental MCL object is the database, selected with the "Select database..." statement. All MCL statements apply to the currently selected database.
To be useful, each database must have one or more named columns, which represent a kind of data (actually, one field of a datatype) and a function (e.g., the FIRST, LONGEST, AVERAGE, COUNT, etc. of that kind of data).
To be useful, each database must also have one or more named hitlists, which represent an ordered set of entries in the database. Positions in a hitlist are referred to as "rows". On initial creation, hitlists represent all entries in the database, i.e., there are the same number of rows in a hitlist as there are entries in its database, and the initial order is defined as the "native order" of the database.
Hitlists have a "current position" property. Although this is referred to by its numeric (1-origin) position in the hitlist, the "current position" is defined by the entry in that position, and is stable with respect to operations which change the hitlist order. In other words, if the entry at the "current position" before an operation is present after the operation, it will still be current.
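The "current position" rule can be illustrated with a toy Python model in which the hitlist remembers the current entry rather than its row number. The class and method names are hypothetical, and two details are assumptions the text does not settle: a new hitlist starts with its first row current, and a removed current entry yields no row number.

```python
class Hitlist:
    """Toy model of an MCL hitlist: ordered entries plus a current entry."""

    def __init__(self, entries):
        self.rows = list(entries)
        # Assumption: a new hitlist starts with its first row current.
        self.current = self.rows[0] if self.rows else None

    def move_to_row(self, row):
        """Make the entry at the given 1-origin row current."""
        self.current = self.rows[row - 1]

    def current_row(self):
        """Return the 1-origin row of the current entry, or None if it is gone."""
        if self.current in self.rows:
            return self.rows.index(self.current) + 1
        return None

    def reverse(self):
        """Reverse the row order; the current entry is unaffected."""
        self.rows.reverse()


hits = Hitlist(["e1", "e2", "e3", "e4"])
hits.move_to_row(2)       # entry "e2" is now current
hits.reverse()            # order is now e4, e3, e2, e1
print(hits.current, hits.current_row())
```

After the reversal the same entry is still current, although its row number has changed from 2 to 3.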
6. DEFAULTS AND STYLE
To produce readable MCL code, it is wise to select names carefully. In our examples we give columns short names consisting of capital letters (which tends to make them stand out) and hitlists names that end with the word "hits", e.g., "hits" or "curhits" or "savehits".
A number of conventions are used in our examples for clarity: hitlists are usually specified explicitly, user-supplied names are shown quoted, and each statement is shown on its own line, e.g.:
Reset "hits".
Sort "hits" by "COST".
Sort "hits" by "CLUSTER".
Remove "CLUSTER" repeats in "hits".
In general, columns and hitlists may be specified using only their names. If it seems clearer to do so, the keywords "column" or "hitlist" may precede column and hitlist names, although they are never required. The following MCL code produces the same program as that above:
Reset hitlist "hits".
Sort hitlist "hits" by column "COST".
Sort hitlist "hits" by column "CLUSTER".
Remove column "CLUSTER" repeats in hitlist "hits".
However, it is perfectly acceptable to omit quotes and references to the default hitlist as long as the meaning remains clear, e.g., the following MCL code also means the same thing as that shown above:
Reset.
Sort by COST.
Sort by CLUSTER.
Remove CLUSTER repeats.
Tautomer and graph searches are unusual because they are done by the Thor server directly from SMILES (all other searches are done by the Merlin server on data in columns). In these searches (only) the search is always done with data from a SMILES ($SMI) column; MCL syntax gives you no opportunity to move to the next tautomer in an existing hitlist or specify which column is to be searched, e.g.:
Put tautomers of "O=c1[nH]c2cncnc2[nH]1" into "hits".
7. SEARCH LOGIC
Merlin provides 10 different types of searches:
o find superstructures of a given molecule (replacing H's)
o find molecules which match a given SMARTS pattern
o find substructures of a given molecule (structures embedded in it)
o find tautomers of a given molecule
o find molecules with the same oxidation-suppressed graph as a given molecule
o find structures which are similar to a given molecule
o find strings which contain a given embedded substring
o find strings which match a regular expression exactly
o find strings which match a regular expression approximately
o find values in a given range
Each type of search may be invoked with one of seven different actions:
o create a new hitlist of matching entries
o add matching entries to a hitlist
o add non-matching entries to a hitlist
o delete matching entries from a hitlist
o delete non-matching entries from a hitlist
o find matching entry in a hitlist
o find non-matching entry in a hitlist
The English-like MCL language allows these (70) different searches to be specified quite naturally, e.g., to create a hitlist of dopamine superstructures:
Put SMILES superstructures of "NCCc1ccc(O)c(O)c1" into "hitlist".
Note that there is no way to create a hitlist of non-matches using just one statement. Use two statements to do this, e.g.,
Put SMILES superstructures of "NCCc1ccc(O)c(O)c1" into "hitlist".
Invert "hitlist".
or use the reverse logic:
Clear "hitlist".
Add SMILES non superstructures of "NCCc1ccc(O)c(O)c1" to "hitlist".
The word "non" or "not" reverses the meaning of a match. The one exception is similarity search, where the following two statements refer to complementary sets of structures:
Remove SMILES structures at least 0.9 similar to "NCCc1ccc(O)c(O)c1".
Remove SMILES structures less than 0.9 similar to "NCCc1ccc(O)c(O)c1".
In an attempt to clarify the effect of a search operation, MCL syntax requires that the preposition match the verb, e.g.:
Put ....... into "hitlist".
Add ....... to "hitlist".
Remove .... from "hitlist".
Move to ... in "hitlist".
"Move to ..." searches cause the current position to be moved to the next entry which meets the given requirements, i.e., the next position in the current hitlist, starting with the current position. For example, this MCL code produces a hitlist of entries with known pKa's, sorted by pKa:
Reset "hits".
Remove missing "pKa" from "hits".
Sort "hits" by "pKa".
If we move to the first entry in this list which meets another criterion, e.g., a given pharmacological activity, we are assured that we are pointing to the one which has the lowest-valued pKa:
Move to row 1.
Move to "ACTIVITY" string containing "NARCOTIC" in "hits".
We can remove entries with lower pKa values (row 0 is current position):
Remove above row 0 of "hits".
And then do the reverse:
Reverse "hits".
Move to row 1.
Move to "ACTIVITY" string containing "NARCOTIC" in "hits".
Remove above row 0 of "hits".
The resulting hitlist now contains "all entries with pKa's in the range of pKa's of known narcotics". One might repeat this type of search for other properties (LogP, CMR, etc.) to select structures which meet an observed physicochemical profile (e.g., of known narcotics).
All Merlin (and thus MCL) sorts are stable, i.e., after a sort, the order of entries with equal value is unchanged. For example, the following MCL code produces a hitlist containing only the lowest-cost member of each cluster.
Reset "hits".
Sort "hits" by "COST".
Sort "hits" by "CLUSTER".
Remove "CLUSTER" repeats in "hits".
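Why the two sorts followed by a repeat removal leave the cheapest member of each cluster can be checked with a small Python model. This is a sketch only: the entries, their values, and the function name are made up for illustration, and Python's list.sort stands in for Merlin's stable sort.

```python
def lowest_cost_per_cluster(entries):
    """Model of: Reset. Sort by COST. Sort by CLUSTER. Remove CLUSTER repeats."""
    hits = list(entries)                       # Reset "hits".
    hits.sort(key=lambda e: e["COST"])         # Sort "hits" by "COST".
    # Stable sort: within one cluster the cost order from above survives.
    hits.sort(key=lambda e: e["CLUSTER"])      # Sort "hits" by "CLUSTER".
    kept = []
    for entry in hits:                         # Remove "CLUSTER" repeats.
        # Drop a row whose value equals the value in the previous row.
        if not kept or kept[-1]["CLUSTER"] != entry["CLUSTER"]:
            kept.append(entry)
    return kept


# Hypothetical data: four entries in two clusters.
entries = [
    {"id": "a", "CLUSTER": 2, "COST": 30.0},
    {"id": "b", "CLUSTER": 1, "COST": 12.0},
    {"id": "c", "CLUSTER": 2, "COST": 5.0},
    {"id": "d", "CLUSTER": 1, "COST": 40.0},
]
print([e["id"] for e in lowest_cost_per_cluster(entries)])
```

With an unstable sort the second step could scramble the cost order inside each cluster, and the first row of a cluster would no longer be guaranteed to be its cheapest member.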
8. MCL STATEMENTS
Select database "database" [ "thorbase" ].
Selects a database by name in the form:
base@host:service:user
"database" is used for Merlin access; "thorbase" for Thor access. If "thorbase" is not specified, "database" is used with the service name "thor". Only the "base" part of the name is needed, e.g. "wdi93". Default values are the local machine name for "host"; "merlin" or "thor" for "service"; and the user's login name for "user".
Create column of datatype "tag" [field "field"] [function <func>] named "name".
Creates a named column. "tag" is an internal tag, e.g. "$CAS". "field" may be either a field name (e.g., "Reference") or the 0-origin index of the field, preceded by '#' (e.g., "#0" refers to the first field). If unspecified, the first field ("#0") is used.
If specified, <func> must be one of the following:
ALL AVERAGE COUNT FIRST LAST LONGEST MAXIMUM MINIMUM SHORTEST STDDEV
If unspecified, FIRST (the first dataitem) is used.
Two kinds of columns are special in that they are required for certain kinds of searches: a column of type "$SMI" (SMILES) is required for tautomer and graph searches and a column of type "SIMILARITY" is required for sorting or searching by structural similarity.
By convention, columns are named after the type of data and are composed of capital letters. However, any string can be used (though you might have to quote it), e.g.,
Create column of datatype P function AVERAGE named "Mean(LogP(o/w))".
Create hitlist "hitlist".
Creates an empty, named hitlist. Hitlists are relatively expensive, so it's a good idea to reuse them. By convention, hitlist names end with "hits", e.g., "hits", "curhits" and "bkphits". However, any string may be used (though you might have to put quotes around it).
Free column "column".
Free hitlist "hitlist".
Free database "database".
Free object by name in current context (database). Freeing a database automatically frees its columns and hitlists.
Set default hitlist "hitlist".
Make named hitlist default -- if the hitlist is not specified in MCL statements where it is optional to do so, this hitlist will be used. The first hitlist created for each database is normally the default and need not be set explicitly. It is important to set the default hitlist only if you want to use a different one or if you deallocate an existing default hitlist (with Free hitlist...).
Reset ["hitlist"].
Reset the named (or default) hitlist such that all rows are hit, i.e., same as xvmerlin's "Set all hit" menu item.
Note: this doesn't quite mean "reset to initial state", because the current hit (if any) remains current. You can truly reset to the initial state by:
Reset "hitlist".
Move to row 1 of "hitlist".
Clear ["hitlist"].
Clear the named (or default) hitlist such that no rows are hit, i.e., same as the sequence:
Reset "hitlist".
Invert "hitlist".
Invert ["hitlist"].
Invert the named (or default) hitlist, i.e., make non-hits hits and vice versa.
Reverse ["hitlist"].
Reverse the order of the named (or default) hitlist.
Copy "hitlist" to "hitlist2".
Copy the contents of one extant hitlist to another. This can be used for backup (e.g., Copy curhits to bkphits.) or for later operations such as restoration (e.g., Copy bkphits to curhits.).
Add "hitlist" to "hitlist2".
Add hits in the first hitlist to those in the second. E.g., the statement:
Add bkphits to curhits.
replaces curhits with the union of curhits and bkphits.
Select "hitlist" in "hitlist2".
Remove hits in the second hitlist which are not in the first. E.g.,
Select bkphits in curhits.
replaces curhits with the intersection of curhits and bkphits.
Exchange "hitlist" with "hitlist2".
Exchange the contents of the two hitlists, e.g., the implementation of the "undo" facility in xvmerlin is equivalent to:
Exchange undohits with curhits.
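The four hitlist-to-hitlist statements (Copy, Add, Select, Exchange) behave like set operations. The Python sketch below models them on plain sets held in a dictionary keyed by hitlist name. It deliberately ignores row order and the current position, which real hitlists also carry, and all function names are hypothetical.

```python
def copy_hits(lists, src, dst):
    """Copy "src" to "dst": dst becomes an independent copy of src."""
    lists[dst] = set(lists[src])


def add_hits(lists, src, dst):
    """Add "src" to "dst": dst becomes the union of both."""
    lists[dst] = lists[dst] | lists[src]


def select_hits(lists, src, dst):
    """Select "src" in "dst": dst keeps only hits that are also in src."""
    lists[dst] = lists[dst] & lists[src]


def exchange_hits(lists, a, b):
    """Exchange "a" with "b": swap the contents of the two hitlists."""
    lists[a], lists[b] = lists[b], lists[a]


# Hypothetical entry numbers in two hitlists.
lists = {"curhits": {1, 2, 3}, "bkphits": {3, 4}}
add_hits(lists, "bkphits", "curhits")       # Add bkphits to curhits.
print(sorted(lists["curhits"]))
select_hits(lists, "bkphits", "curhits")    # Select bkphits in curhits.
print(sorted(lists["curhits"]))
```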
Move to row "integer" [of "hitlist"].
Set the current hitlist position, where "integer" is interpreted as:
unsigned number ... absolute row number (1 is first row)
zero .............. current row
signed number ..... number relative to current row
Moves to the appropriate extreme value (first or last row) if out-of-range, e.g., Move to row 9999999... will move to the last row of most databases.
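The row-number rules can be written down as a small Python function. It is a sketch of the interpretation described above rather than MCL's own code: the name resolve_row is hypothetical, and a hitlist with at least one row is assumed.

```python
def resolve_row(token: str, current: int, length: int) -> int:
    """Turn a row "integer" into a 1-origin row number.

    token   -- the integer as written, e.g. "10", "0", "+10" or "-10"
    current -- the current row number (1-origin)
    length  -- number of rows in the hitlist (assumed to be >= 1)
    """
    if token[0] in "+-":
        row = current + int(token)    # signed: relative to current row
    elif int(token) == 0:
        row = current                 # zero: the current row
    else:
        row = int(token)              # unsigned: absolute row number
    # Out-of-range values move to the first or last row.
    return max(1, min(length, row))


print(resolve_row("9999999", current=5, length=100))
print(resolve_row("-10", current=5, length=100))
```

The first call clamps to the last row (100); the second clamps to the first row (1).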
Remove row "integer" [of "hitlist"].
Remove above row "integer" [of "hitlist"].
Remove below row "integer" [of "hitlist"].
Removes row(s) from hitlist.
Remove "column" repeats [in "hitlist"].
Removes rows from hitlist with values in given column which are identical to the value in the previous row. Typically used to select the first row of each value in a sorted list.
Remove missing "column" [from "hitlist"].
Removes rows from hitlist for which data is not available in given column.
Put "column" superstructures of "smiles" [into "hitlist"].
Add "column" [non] superstructures of "smiles" [to "hitlist"].
Remove "column" [non] superstructures of "smiles" [from "hitlist"].
Move to "column" [non] superstructures of "smiles" [in "hitlist"].
The classic "substructure search". The "Put" form replaces the given hitlist; "Add" and "Remove" modify it; "Move" sets the current position but leaves the hitlist unchanged. $SMI or ISM columns are typically used.
Put "column" structures matching "smarts" [into "hitlist"].
Add "column" structures [not] matching "smarts" [to "hitlist"].
Remove "column" structures [not] matching "smarts" [from "hitlist"].
Move to "column" structure [not] matching "smarts" [in "hitlist"].
Search for a SMARTS pattern. "Put" form replaces the given hitlist; "Add" and "Remove" modify it; "Move" resets the current position but leaves the hitlist unchanged. $SMI or ISM columns are typically used.
Put "column" substructures of "smiles" [into "hitlist"].
Add "column" [non] substructures of "smiles" [to "hitlist"].
Remove "column" [non] substructures of "smiles" [from "hitlist"].
Move to "column" [non] substructure of "smiles" [in "hitlist"].
Converse of the classic "substructure" search, this one looks for structures embedded in the given molecule. The "Put" form replaces the given hitlist; "Add" and "Remove" modify it; "Move" resets the current position but leaves the hitlist unchanged. $SMI or ISM columns are typically used for "column".
Put "column" structures <op> "tan" similar to "smiles" [into "hitlist"].
Add "column" structures <op> "tan" similar to "smiles" [to "hitlist"].
Remove "column" structures <op> "tan" similar to "smiles" [from "hitlist"].
Move to "column" structure <op> "tan" similar to "smiles" [in "hitlist"].
... where <op> is <at least | less than>.
Select (or find) structures based on similarity to a given molecule. "tan" is a number indicating Tanimoto similarity for qualification where 1.0 is perfect similarity (identity). These values are typically used:
0.90 ...... very highly similar
0.75 ...... highly similar
0.60 ...... moderately similar
0.50 ...... roughly similar
0.00 ...... select all structures
$SMI or ISM columns are typically used for "column", e.g.,
Add SMILES structures at least 0.9 similar to "NCCc1ccc(O)c(O)c1".
Note: a column of type SIMILARITY must exist to do these searches.
Put "column" structures <op> "value" Tversky alpha beta to "smiles"
[into "hitlist"].
Add "column" structures <op> "value" Tversky alpha beta to "smiles"
[to "hitlist"].
Remove "column" structures <op> "value" Tversky alpha beta to "smiles"
[from "hitlist"].
Move to "column" structure <op> "value" Tversky alpha beta to "smiles"
[in "hitlist"].
... where <op> is <at least | less than>
Select (or find) structures based on Tversky similarity to a given molecule using given alpha/beta parameters. Alpha and beta are typically in the range 0.0 - 1.0. "value" is a number indicating Tversky similarity; the meaning depends on specific alpha/beta settings but higher values are typically used than with Tanimoto similarity, e.g.,
0.95 ...... very highly similar
0.90 ...... highly similar
0.85 ...... moderately similar
0.80 ...... roughly similar
0.00 ...... select all structures
$SMI or ISM columns are typically used for "column", e.g.,
Add SMILES structures at least 0.9 similar Tversky alpha 0.9 beta 0.1 to "NCCc1ccc(O)c(O)c1".
Note: a column of type SIMILARITY must exist to do these searches.
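For readers unfamiliar with the two measures, the standard Tversky index and its Tanimoto special case (alpha = beta = 1) can be computed as below. This is a sketch under stated assumptions, not Merlin's implementation: molecules are represented by sets of "on" fingerprint bits, alpha is taken to weight features found only in the query, beta those found only in the target, and two empty sets are given similarity 0.0.

```python
def tversky(query: set, target: set, alpha: float, beta: float) -> float:
    """Tversky index of two feature sets.

    Assumption: alpha weights features only in the query,
    beta weights features only in the target.
    """
    common = len(query & target)
    only_query = len(query - target)
    only_target = len(target - query)
    denominator = alpha * only_query + beta * only_target + common
    # Assumption: nothing to compare means no similarity.
    return common / denominator if denominator else 0.0


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity: the Tversky index with alpha = beta = 1."""
    return tversky(a, b, 1.0, 1.0)


# Hypothetical fingerprint bits for a query and a target structure.
query = {1, 2, 3, 4}
target = {3, 4, 5}
print(tanimoto(query, target))
print(tversky(query, target, alpha=0.9, beta=0.1))
```

Because alpha and beta below 1.0 shrink the denominator, the Tversky value for the same pair is never lower than the Tanimoto value, which is consistent with the higher thresholds suggested above.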
Put <tautomers | graphs> of "smiles" [into "hitlist"].
Add [non] <tautomers | graphs> of "smiles" [to "hitlist"].
Remove [non] <tautomers | graphs> of "smiles" [from "hitlist"].
Select tautomers or graphs of a given molecule. A "graph" match discounts oxidation state completely, i.e., the molecules' heavy atoms are connected in the same way (ignoring hydrogens, bond orders, and charges). A "tautomer" match requires a graph match, and the net charge and hydrogen count must also match, i.e., it finds molecules which differ only in positions of H's (protons) and electrons. (Labile hydrogens are not distinguished from non-labile ones.)
Note: to do these searches, a column of type $SMI (SMILES) must exist and the Thor database must have a $SMI datatype with AUTOGRAPH normalization (this is true for default Thor databases).
Put "column" strings containing[/cmp] "string"
[into "hitlist"].
Add "column" strings [not] containing[/cmp] "string"
[to "hitlist"].
Remove "column" strings [not] containing[/cmp] "string"
[from "hitlist"].
Move to "column" string [not] containing[/cmp] "string"
[in "hitlist"].
... where cmp is one of:
/ASC .. ASCII (default)        /ANCW ... ignore case and w/s
/ANC .. ignore case            /ANCP ... ignore case and punct
/ANW .. ignore whitespace      /ANPW ... ignore punct and w/s
/ANP .. ignore punctuation     /ANCPW .. ignore case, punct, and w/s
Select (or find) strings containing a given substring. Various character classes can be ignored by appending an option to the keyword "containing". For instance, the statement:
Put "NAME" strings containing/ANCPW "METHYLPYRROLE" into "curhits".
does comparisons ignoring case, punctuation, and whitespace (spaces, tabs, and newlines), i.e., makes a hitlist of entries with names containing the substrings "Methyl Pyrrole", "methyl-pyrrole", "methylpyrrole", etc.
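The effect of the comparison options can be sketched in Python by normalizing both strings before the substring test. The function names are hypothetical, and the exact set of characters Merlin treats as punctuation is not stated here, so Python's string.punctuation is used as a stand-in.

```python
import string


def normalize(text: str, cmp: str = "/ASC") -> str:
    """Drop the character classes that a comparison option ignores."""
    if cmp == "/ASC":
        return text                      # plain ASCII comparison
    options = cmp[3:]                    # "/ANCPW" -> "CPW"
    if "C" in options:
        text = text.lower()              # ignore case
    if "P" in options:                   # ignore punctuation (assumed set)
        text = "".join(ch for ch in text if ch not in string.punctuation)
    if "W" in options:                   # ignore spaces, tabs, newlines
        text = "".join(ch for ch in text if not ch.isspace())
    return text


def contains(value: str, substring: str, cmp: str = "/ASC") -> bool:
    """Model of: "column" string containing[/cmp] "string"."""
    return normalize(substring, cmp) in normalize(value, cmp)


for name in ["Methyl Pyrrole", "methyl-pyrrole", "methylpyrrole"]:
    print(name, contains(name, "METHYLPYRROLE", "/ANCPW"))
```

All three names match under /ANCPW, while none of them matches under the default /ASC comparison.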
Put "column" strings matching "regexp" [into "hitlist"].
Add "column" strings [not] matching "regexp" [to "hitlist"].
Remove "column" strings [not] matching "regexp" [from "hitlist"].
Move to "column" string [not] matching "regexp" [in "hitlist"].
Select (or find) strings matching a regular expression.
Put "column" values in range "min" to "max"
[into "hitlist"].
Add "column" values [not] in range "min" to "max"
[to "hitlist"].
Remove "column" values [not] in range "min" to "max"
[from "hitlist"].
Move to "column" value [not] in range "min" to "max"
[in "hitlist"].
Select (or find) values by range specification.
Sort[/cmp] ["hitlist"] by "column".
... where cmp is one of:
/ASC .... ASCII                     /ANCPW .. ignore case, punct, w/s
/ANC .... ignore case               /AAZ .... letters only, ignore case
/NUM .... numeric                   /ANP .... ignore punctuation
/NAB .... absolute numeric          /ANW .... ignore whitespace
/CAS .... CAS Number compare        /ANCP ... ignore case and punct
/ANCW ... ignore case, w/s          /MF ..... molecular formula
/ANPW ... ignore punct, w/s         /LEN .... length of string
Sort hitlist by value of given column. If cmp is not specified, the "normal" comparison type for the column's datatype will be used.
Sort ["hitlist"] by nativeorder.
Rearrange the given hitlist to original pool order.
Sort ["hitlist"] by similarity ["column"] [to "smiles"].
Sort hitlist by Tanimoto similarity to given "smiles", saving similarity values in similarity column "column", if specified. If "column" is omitted, the default similarity column will be used. If "smiles" is omitted, existing values in the column will be used.
This produces a descending sort in similarity, i.e., the most similar structures (high similarity values) sort to the top of the list.
Set font "fontname".
Set lines per row to "lpr".
Display smiles as <text | depiction>.
Special functions for interactive environments such as xvmerlin. "fontname" must be as specified in a FONT_ option (or "default"). "lpr" (lines per row) must be in the range 1-8.
Note: These statements are disabled in non-interactive environments such as the `mcl' program (they do nothing, successfully).
Print status.
Print the status of the current environment: name the database, name all columns and show their datatypes, name all hitlists and show their length, indicate which hitlist is default and report the current hitlist position.
Print "string" ["string" ...].
Print string literally, followed by newline. If more than one string is supplied, they are concatenated and then output. Special characters such as newline and bell are copied to output literally. There is no way to print the NUL character to output.
Print <list | table> [of "hitlist"] from row "integer" to row "integer"
[ containing "column" ["column" ...] ]
[ entitled "string" ["string" ...] ].
... where integer row values are interpreted as:
unsigned number ... absolute row number (1 is first row)
zero .............. current row
signed number ..... number relative to current row
If the `containing ...' phrase is specified, only those columns will be printed; if omitted, all columns will be printed.
If the `entitled ...' phrase is specified, the following strings will be printed as a title; if omitted, no title will be printed.
For example, the statement:
Print table of curhits from row 1 to row 100 containing "SMILES" "Name" entitled "Table 1. The first 100 entries in this hitlist.".
produces a 2-column table of the first 100 rows in the hitlist "curhits". If the hitlist is shorter than 100 rows, the table will be truncated.
Similarly, the statement:
Print list from row -10 to row +10.
prints data from all columns in the default hitlist for the current position and 10 rows of context above and below (truncated as needed).