21. Reentrant Toolkit Interface

21.1 Introduction

Beginning with version 4.93, the Daylight toolkits can be used effectively in a multithreading environment. There are several driving forces for this development:
With 4.93 we are providing a general, reentrant, multithreading interface to the Daylight toolkits via POSIX threads. The multithreading interface does not substantially change the current external toolkit interface and causes minimal impact on performance of single-threaded toolkit programs.

21.2 Data Issues

Above and beyond the normal programming concerns, the main additional issue to be aware of when writing multithreaded toolkit programs is the potential need to share objects and other resources within the toolkits across multiple threads. Because the Daylight Toolkit provides an opaque object model (the internal structure and implementation of the objects are not visible to the programmer), we must provide rules to guide the sharing of objects.

There are two main programming models which can be used in a multithreaded program. Both models share some common features.

First, all error handling and error queues are implemented on a per-thread basis. That is, if a toolkit function results in an error, that error is placed on the error queue of the thread which ran the function and is not visible to any other thread. The functions dt_errors(3), dt_errorworst(3), dt_errorsave(3), and dt_errorclear(3) operate on the current thread's error queue only.

Second, heap data (e.g. from malloc()) is shared across the entire process. This is as expected for a multithreaded program, but the implications for Daylight Toolkit objects are worth mentioning. All of the internal toolkit object implementations use heap data, and the toolkits perform large numbers of malloc() and free() calls. Hence, the performance of the system malloc library is critical to the overall throughput of a multithreaded Daylight toolkit program. On some platforms it is desirable to use an alternative malloc library instead of the default system malloc library. On Solaris the library /usr/lib/libmtmalloc.a is optimized for multithreaded programs and improves performance significantly. On Linux and SGI the default system malloc library gives good performance.

The other issue with heap data is strings within the toolkit. Most of the toolkit functions which return strings (e.g. dt_cansmiles()) actually return pointers to strings which are owned by the object itself. It is valid to use these strings across threads, but one must make sure that the owning object continues to exist for as long as the string is in use.

    int funca(char *arg)
    {
      pthread_t tid;
      dt_Handle mol;
      dt_String str;
      dt_Integer slen;

      mol = dt_smilin(strlen(arg), arg);
      str = dt_cansmiles(&slen, mol, 1);    /* str points into storage owned by mol */
      pthread_create(&tid, NULL, funcb, str);
      dt_dealloc(mol);                      /* str is invalid from here on */
      return 0;
    }

In the above example, a new thread running funcb() is created. The problem is that the canonical SMILES string (str), which is passed to the child thread, is removed when the molecule object is deallocated. It is likely that funcb() will fail as soon as it tries to use the string. In this case it would be better to duplicate 'str' and pass the duplicate to the child thread.

21.3 Per-Thread Object Model

The first model for programming uses per-thread objects. Each thread maintains its own handle table for dispatching handles to their underlying objects. The objects are not shared across threads. This is the simpler model to implement, as no locking of objects needs to be performed. The basic program requirements are as follows:
A simple example is the smarts_filter_mt.c program, which reads SMILES on stdin and writes any SMILES which match a given SMARTS query to stdout.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <pthread.h>
    /* Daylight toolkit headers and the MAXSMI definition are not shown. */

    #define THR_COUNT 4

    void *do_smarts_forever(void *arg)
    {
      static const int ok = 1;
      static const int fail = -1;
      char line[MAXSMI];
      dt_Handle mol, pattern, pathset;
      char *smarts = (char *)arg;

      /* Each thread parses its own pattern object; nothing is shared. */
      pattern = dt_smartin(strlen(smarts), smarts);
      if (pattern == NULL_OB)
      {
        fprintf(stderr, "Can't parse SMARTS <%s> in child thread\n", smarts);
        return ((void *)&fail);
      }

      /* fgets() replaces the unsafe gets(); the loop ends at EOF or on error. */
      while (fgets(line, sizeof(line), stdin))
      {
        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
        mol = dt_smilin(strlen(line), line);
        if (mol != NULL_OB)
        {
          pathset = dt_match(pattern, mol, TRUE);
          if (pathset != NULL_OB)
          {
            dt_dealloc(pathset);
            printf("%s\n", line);
          }
          dt_dealloc(mol);
        }
      }
      dt_dealloc(pattern);
      return ((void *)&ok);
    }

    int main(int argc, char *argv[])
    {
      pthread_t tid[THR_COUNT];
      int i;

      dt_mp_initialize(DX_MP_PER_THREAD_HANDLES);

      /*** Get SMARTS from command line ***/
      if (2 != argc)
      {
        fprintf(stderr, "usage: %s SMARTS\n", argv[0]);
        exit(1);
      }

      /* Pass the SMARTS string itself, not the address of argv[1]. */
      for (i = 0; i < THR_COUNT; i++)
        pthread_create(&tid[i], NULL, do_smarts_forever, (void *)argv[1]);

      /* Wait for the workers; returning from main() would end the process. */
      for (i = 0; i < THR_COUNT; i++)
        pthread_join(tid[i], NULL);
      return (0);
    }

The main points illustrated in the smarts_filter example are:

21.4 Global Object Model

The second model for programming uses global objects. That is, every object allocated within the application is visible to all threads. This model is more complicated to implement, as the programmer must synchronize access to any objects that are shared between threads. The basic program requirements are as follows:
The analogous smarts_filter_mt.c is shown below, where the SMARTS pattern is shared between all threads.

    #define THR_COUNT 4

    static dt_Handle pattern;   /* shared by all threads; lock before use */

    void *do_smarts_forever(void *arg)
    {
      static const int ok = 1;
      char line[MAXSMI];
      dt_Handle mol, pathset;

      (void)arg;   /* unused: the pattern is reached through the global */

      while (fgets(line, sizeof(line), stdin))
      {
        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
        mol = dt_smilin(strlen(line), line);
        if (mol != NULL_OB)
        {
          /* mol is local to this thread; only the shared pattern is locked. */
          dt_mp_lock(pattern);
          pathset = dt_match(pattern, mol, TRUE);
          dt_mp_unlock(pattern);
          if (pathset != NULL_OB)
          {
            dt_dealloc(pathset);
            printf("%s\n", line);
          }
          dt_dealloc(mol);
        }
      }
      return ((void *)&ok);
    }

    int main(int argc, char *argv[])
    {
      pthread_t tid[THR_COUNT];
      int i;

      /* Global handle table shared by all threads; confirm the mode
         constant against dt_mp_initialize(3). */
      dt_mp_initialize(DX_MP_GLOBAL_HANDLES);

      /*** Get SMARTS from command line ***/
      if (2 != argc)
      {
        fprintf(stderr, "usage: %s SMARTS\n", argv[0]);
        exit(1);
      }

      pattern = dt_smartin(strlen(argv[1]), argv[1]);
      if (pattern == NULL_OB)
      {
        fprintf(stderr, "Can't parse SMARTS <%s>\n", argv[1]);
        exit(1);
      }

      for (i = 0; i < THR_COUNT; i++)
        pthread_create(&tid[i], NULL, do_smarts_forever, NULL);

      /* Wait for the workers; returning from main() would end the process. */
      for (i = 0; i < THR_COUNT; i++)
        pthread_join(tid[i], NULL);
      dt_dealloc(pattern);
      return (0);
    }

In the above example, the pattern object is created in the parent (main) thread. Each child uses the same pattern object for the dt_match() operation. The pattern object must be locked before executing the dt_match() function and unlocked after the match is complete.

It is important to note that the dt_mp_lock()/dt_mp_unlock() mechanism is a cooperative locking scheme. Every thread which needs to access a shared object must lock it. When one thread locks an object, this does not automatically prevent other threads from accessing the object or calling its methods; the programmer has the responsibility of guarding against access by other threads.

21.5 Object Granularity

Since the toolkit is implemented using an opaque interface, the internal behaviors of toolkit functions are not rigorously defined. Only external behaviors are well defined.
This is a very pleasant model for Daylight; we are free to use lazy evaluation, to organize internal data however we choose, and to change these internal data organizations at will. The unfortunate side effect of this opaque interface is that we can't precisely describe the results of a given modification of one object. A seemingly simple modification to an object may have far-reaching impacts upon other data structures within the toolkit. Given that the key to a multithreaded library implementation is the control of data modifications, this leads to a problem. The solution is to define the allowed concurrency at a larger granularity than would otherwise be possible.

The basic granularity of access allowed for objects within the toolkit is the object family. The 'object family' is a new concept within the toolkit which refers to a collection of objects which are related to one another as parent/children or base/derivatives. All objects within an object family share the same object as their ancestor (dt_ancestor(3)). The ancestor object is the ultimate parent of all of the objects within the family.

If a thread is accessing or manipulating any object within an object family, it is not safe for another thread to access or manipulate any other object within that object family. Because we can guarantee that side effects caused by one thread manipulating an object are contained within the object family, this gives well-defined behavior for multithreaded programs.

There is one complication, however: toolkit functions which depend on multiple objects (e.g. dt_match()). In those cases each thread must have exclusive access to all of the object families needed by the function in order to be thread-safe. In the case of dt_match(), the thread must lock both the pattern object and the target object. Alternatively, if either object is known to be local to a thread (e.g. in the smarts_filter_mt.c example above the molecule is local to the thread), then it is not necessary to lock that object.

21.6 Thread Safety versus Reentrancy

The toolkit is reentrant, not thread-safe. It is the responsibility of the programmer to take the tools and write multithreaded applications.

If desired, one can implement a heavy-handed thread-safe toolkit interface as follows. Every toolkit function could be wrapped with a layer which locks the object family before operating on an object; this wrapper layer would then provide a completely thread-safe toolkit API. Such a wrapper would probably have fairly poor performance.

    dp_xxx(ob)
    {
      dt_mp_lock(ob);
      rc = dt_xxx(ob);
      if (type(rc) is string) duplicate string;
      dt_mp_unlock(ob);
      return rc or duplicate string;
    }

Note that in the above wrapper any returned strings are duplicated. This eliminates the previously mentioned warning about accessing strings within shared objects.

Nothing prevents a programmer from misusing threads, or from accessing the same object from multiple threads. A thread is guaranteed exclusive access to an object only if every thread always obtains a lock for the object family before modifying it. If one or more threads do not obey this convention, then the program may not be thread-safe.
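
The dp_xxx() wrapper pseudocode above can be made concrete without the toolkit itself. The following is a minimal sketch, not Daylight code: demo_object, demo_get_text() and the other demo_ names are hypothetical stand-ins, and a plain pthread mutex plays the role that dt_mp_lock()/dt_mp_unlock() would play for a real object family. It shows the two steps the wrapper relies on: hold the lock for the whole call, and copy any object-owned string before releasing it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a toolkit object; NOT a Daylight type. */
typedef struct {
    pthread_mutex_t lock;   /* stands in for dt_mp_lock()/dt_mp_unlock() */
    char text[64];          /* string storage owned by the object */
    size_t length;
} demo_object;

void demo_init(demo_object *ob, const char *s)
{
    pthread_mutex_init(&ob->lock, NULL);
    strncpy(ob->text, s, sizeof(ob->text) - 1);
    ob->text[sizeof(ob->text) - 1] = '\0';
    ob->length = strlen(ob->text);
}

/* Stand-in for a dt_xxx() call: returns a pointer INTO the object. */
const char *demo_get_text(demo_object *ob, size_t *len)
{
    *len = ob->length;
    return ob->text;
}

/* Modifies the object under the same lock, as every cooperating thread must. */
void demo_set_text(demo_object *ob, const char *s)
{
    pthread_mutex_lock(&ob->lock);
    strncpy(ob->text, s, sizeof(ob->text) - 1);
    ob->text[sizeof(ob->text) - 1] = '\0';
    ob->length = strlen(ob->text);
    pthread_mutex_unlock(&ob->lock);
}

/* Wrapper in the spirit of dp_xxx(): lock, call, duplicate, unlock.
 * Returns a malloc()ed, NUL-terminated copy which the caller must free(),
 * or NULL if the allocation fails. */
char *demo_get_text_safe(demo_object *ob)
{
    const char *owned;
    char *copy;
    size_t len;

    assert(ob != NULL);
    pthread_mutex_lock(&ob->lock);
    owned = demo_get_text(ob, &len);
    copy = malloc(len + 1);
    if (copy != NULL) {
        memcpy(copy, owned, len);   /* copy while the lock is still held */
        copy[len] = '\0';
    }
    pthread_mutex_unlock(&ob->lock);
    return copy;
}

/* Clears the object's storage, simulating deallocation of the owner. */
void demo_destroy(demo_object *ob)
{
    pthread_mutex_destroy(&ob->lock);
    memset(ob->text, 0, sizeof(ob->text));
    ob->length = 0;
}
```

A caller would obtain a string with demo_get_text_safe(), hand it to another thread, and free() it when finished. Because the copy is independent of the object, it stays valid even if the object is later modified or destroyed. The cost is one lock and one allocation per call, which is the performance penalty the text anticipates.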

21.7 Limitations

There are several areas of the toolkit which are not reentrant and cannot be used by multiple threads at the same time.

The database access toolkits (Thor and Merlin client toolkits) are not reentrant and cannot be used by more than one thread in a program at a time. It is possible for a multithreaded program to include these toolkits provided that access to both toolkits is serialized completely within the application.

The Rubicon toolkit is not reentrant and cannot be used by more than one thread at a time. It is possible for a multithreaded program to include Rubicon functionality provided that access to the toolkit is serialized completely within the application.

Within the depict toolkit, use of the dt_depict() function and the drawing library must be serialized across the entire application. It is not possible for multiple dt_depict() calls to be in progress in different threads at the same time, since each will result in the invocation of global drawing library functions. In practice, the drawing library that a user implements must reference one or more global variables (locked along with the dt_depict() function) which can provide the local thread context necessary for the desired drawing operations.

Several obsolete toolkit functions are not reentrant. These include dt_smilinerrtext(), dt_alloc_fp(), dt_fp_fingerprint(), dt_fp_fold(), dt_fp_mindensity(), dt_fp_minsize(), dt_fp_size(), dt_fp_setmindensity(), dt_fp_setminsize() and dt_fp_setsize(). These functions were previously made obsolete because their API did not fit well within the Daylight toolkit model; they should not be used in new programs.
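
One way to serialize access "completely within the application" is to route every call into the non-reentrant code through a single process-wide mutex. The sketch below illustrates that idea under stated assumptions: legacy_call() is a hypothetical stand-in for any non-reentrant toolkit function (it is not a Daylight function), and toolkit_lock, legacy_call_serialized() and run_workers() are names invented for this example.

```c
#include <assert.h>
#include <pthread.h>

#define WORKERS 4
#define CALLS_PER_WORKER 10000

/* One process-wide lock guards every call into the non-reentrant code. */
static pthread_mutex_t toolkit_lock = PTHREAD_MUTEX_INITIALIZER;

/* Unprotected global state, as a non-reentrant library would have. */
static long global_state = 0;

/* Hypothetical stand-in for a non-reentrant call: it updates the global
 * state without any locking of its own. */
void legacy_call(void)
{
    global_state = global_state + 1;
}

/* The only entry point the application uses to reach legacy_call(). */
void legacy_call_serialized(void)
{
    pthread_mutex_lock(&toolkit_lock);
    legacy_call();
    pthread_mutex_unlock(&toolkit_lock);
}

void *worker(void *arg)
{
    int i;

    (void)arg;   /* unused */
    for (i = 0; i < CALLS_PER_WORKER; i++)
        legacy_call_serialized();
    return NULL;
}

/* Runs WORKERS threads to completion and returns the final state. */
long run_workers(void)
{
    pthread_t tid[WORKERS];
    int i;

    global_state = 0;
    for (i = 0; i < WORKERS; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < WORKERS; i++)
        pthread_join(tid[i], NULL);
    return global_state;   /* safe to read: all workers have been joined */
}
```

With the lock in place, every increment is applied exactly once, so run_workers() returns WORKERS * CALLS_PER_WORKER. The same pattern extends to the dt_depict() case described above: the thread-specific context that the drawing library reads from global variables would be set while the lock is held, immediately before the call.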