21. Reentrant Toolkit Interface

21.1 Introduction

Beginning with version 4.93, the Daylight toolkits can be used effectively in a multi-threading environment. There are several driving forces for this development:
  • Database searching. In database systems such as Daycart and Merlinserver, multithreading can be used to seamlessly increase search throughput and performance.

  • Java. Java supports full multithreading in its object model and a robust toolkit interface to Java should be multithreaded. Previous Java wrappers for the Daylight toolkits were limited by the concurrency issues within the Daylight toolkit.

  • Pthreads. The POSIX thread interface (which includes threads, mutexes, condition variables, and signal handling) provides cross-platform capabilities for developing multithreaded applications. This environment allows standard programming methods to be applied to multithreaded applications.

With 4.93 we are providing a general, reentrant, multithreading interface to the Daylight toolkits via POSIX threads. The multithreading interface does not substantially change the current external toolkit interface and causes minimal impact on performance of single-threaded toolkit programs.

21.2 Data Issues

Above and beyond the normal programming concerns, the main additional issue which one must be aware of when writing multithreading toolkit programs is the potential need to share objects and other resources within the toolkits across multiple threads.

Because the Daylight Toolkit provides an opaque object model (the internal structure and implementation of the objects is not visible to the programmer) we must provide rules to guide the sharing of objects. There are two main programming models which can be used in a multithreaded program. In both models, however, there are some common features.

First, all error-handling and error queues are implemented on a per-thread basis. That is, if a toolkit function results in an error, that error will be placed on the error queue for the thread which ran the function. The error will not be visible to any other threads. The functions dt_errors(3), dt_errorworst(3), dt_errorsave(3), and dt_errorclear(3) operate on the current thread's error queue only.
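The effect of per-thread error state can be illustrated without the toolkit. The following is a minimal sketch using C11 thread-local storage; the names record_error(), errors_in_this_thread() and errors_seen_by_worker() are hypothetical stand-ins, and nothing here describes how the toolkit implements its error queues internally.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-thread error queue: one counter per thread. */
static _Thread_local int error_count = 0;

/* Record an error against the calling thread only. */
static void record_error(void)
{
  error_count++;
}

/* Number of errors recorded by the calling thread. */
static int errors_in_this_thread(void)
{
  return error_count;
}

/* Thread body: records two errors and reports how many it can see. */
static void *worker(void *arg)
{
  int *seen = (int *)arg;

  assert(seen != NULL);
  record_error();
  record_error();
  *seen = errors_in_this_thread();
  return NULL;
}

/* Run one worker; returns the error count it saw, or -1 on failure. */
static int errors_seen_by_worker(void)
{
  pthread_t tid;
  int seen = 0;

  if (pthread_create(&tid, NULL, worker, &seen) != 0)
    return -1;
  if (pthread_join(tid, NULL) != 0)
    return -1;
  return seen;
}
```

Each worker sees only the two errors it recorded itself, and the calling thread's own count is unaffected by anything the workers do.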

Heap data (e.g., from malloc()) is shared across the entire process. This is as expected for a multithreaded program; however, the implications for Daylight Toolkit objects are worth mentioning. All of the internal toolkit object implementations use heap data, and the toolkits perform large numbers of malloc() and free() calls. Hence, the performance of the system malloc library is critical to the overall throughput of a multithreaded Daylight toolkit program. On some platforms it is desirable to use an alternative malloc library instead of the default system malloc library. On Solaris the library /usr/lib/libmtmalloc.a is optimized for multithreaded programs and improves performance significantly. On Linux and SGI the default system malloc library gives good performance.

The other issue with heap data is strings within the toolkit. Most of the toolkit functions which return strings (e.g., dt_cansmiles()) actually return pointers to strings which are owned by the object itself. It is valid to use these strings across threads; however, one must make sure that the owning object continues to exist for as long as any thread uses the string.

int funca(char *arg)
{
  pthread_t tid;

  dt_Handle mol;
  dt_String str;
  dt_Integer slen;

  mol = dt_smilin(strlen(arg), arg);
  str = dt_cansmiles(&slen, mol, 1);

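  /* The child receives a pointer into storage owned by 'mol'. */
  pthread_create(&tid, NULL, funcb, str);

  /* BUG: this frees the string that funcb() may still be reading. */
  dt_dealloc(mol);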
  return 0;
}

In the above example, a new thread running funcb() is created. The problem is that the canonical SMILES string (str), which is passed to the child thread, is owned by the molecule object and is freed when that object is deallocated. funcb() is likely to fail as soon as it tries to use the string. In this case it would be better to duplicate 'str' and pass the duplicate to the child thread.
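One way to do the duplication is sketched below. It uses only the C library and pthreads, so an ordinary character buffer stands in for the string returned by dt_cansmiles(). The helper start_with_copy() and this version of funcb() are hypothetical, and the convention that the child thread frees its copy is an assumption of the sketch. The string length is passed explicitly so that the copy does not depend on the source being NUL-terminated.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical worker: owns its argument, counts the 'C' characters in it,
   frees it, and returns the count as a heap-allocated long (or NULL). */
static void *funcb(void *arg)
{
  char *copy = (char *)arg;
  long *count = malloc(sizeof(*count));
  const char *p;

  if (count != NULL)
    {
      *count = 0;
      for (p = copy; *p != '\0'; p++)
        if (*p == 'C')
          (*count)++;
    }
  free(copy);
  return count;
}

/* Duplicate 'slen' bytes of 'str' and hand the duplicate to a new thread.
   Returns 0 on success, -1 on failure. The caller may release or change
   'str' as soon as this function returns. */
static int start_with_copy(pthread_t *tid, const char *str, size_t slen)
{
  char *copy;

  assert(tid != NULL && str != NULL);
  copy = malloc(slen + 1);
  if (copy == NULL)
    return -1;
  memcpy(copy, str, slen);
  copy[slen] = '\0';

  if (pthread_create(tid, NULL, funcb, copy) != 0)
    {
      free(copy);
      return -1;
    }
  return 0;
}
```

With this arrangement the parent can deallocate the owning object immediately after start_with_copy() returns, because the child never touches the original storage.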

21.3 Per-Thread Object Model

The first model for programming uses per-thread objects. Each thread maintains its own handle table for dispatching handles to their underlying objects. The objects are not shared across threads. This is the simpler model to implement as no locking of objects needs to be performed.

The basic program requirements are as follows:

  • The function dt_mp_initialize(DX_MP_PER_THREAD_HANDLES) must be called before any toolkit objects are allocated. It is generally best to call this function from the main thread during startup.
  • Each child thread created in the program will only have access to objects created within that thread. It is important to note that every thread *may* use a given handle ID to represent a different local object. That is, every thread has its own internal table of handle IDs and the same ID numbers will be used by multiple threads. So if a programmer takes a handle allocated in one thread and attempts to access it in a second thread, the second thread will find that the handle is either invalid or refers to a different (local) object.

    If an object is required in multiple threads it is necessary to create that object from its string representation in each thread. Typically one thread can create a string from the object and pass that string to the other threads that need the object. They will then instantiate a local object.

A simple example is the smarts_filter_mt.c program, which reads SMILES on stdin and writes any SMILES which match a given SMARTS query to stdout.

void *do_smarts_forever(void *arg)
{
  static const int ok = 1;
  static const int fail = -1;
  char      line[MAXSMI];
  dt_Handle mol, pattern, pathset;
  char *smarts = (char *)arg;

  pattern = dt_smartin(strlen(smarts), smarts);
  if (pattern == NULL_OB)
    {
      fprintf(stderr, "Can't parse SMARTS <%s> in child thread\n", smarts);
      return((void *)&fail);
    }

  while (fgets(line, sizeof(line), stdin))
    {
      /* fgets() keeps the trailing newline; drop it before parsing */
      line[strcspn(line, "\n")] = '\0';

      mol = dt_smilin(strlen(line), line);
      if (mol != NULL_OB)
        {
          pathset = dt_match(pattern, mol, TRUE);
          if (pathset != NULL_OB)
            {
              dt_dealloc(pathset);
              printf("%s\n", line);
            }
          dt_dealloc(mol);
        }
    }
  return((void *)&ok);
}

#define THR_COUNT 4

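int main(int argc, char *argv[])
{
  pthread_t tid[THR_COUNT];
  int i;

  dt_mp_initialize(DX_MP_PER_THREAD_HANDLES);

  /*** Get SMARTS from command line ***/

  if (2 != argc)
    {
      fprintf(stderr, "usage: %s SMARTS\n", argv[0]);
      exit(1);
    }

  /*** Pass the SMARTS string itself; the thread casts its argument to char * ***/

  for (i = 0; i < THR_COUNT; i++)
    if (pthread_create(&tid[i], NULL, do_smarts_forever, (void *)argv[1]) != 0)
      {
        fprintf(stderr, "Can't create thread %d\n", i);
        exit(1);
      }

  /*** Wait for the workers; returning from main() would end the process ***/

  for (i = 0; i < THR_COUNT; i++)
    pthread_join(tid[i], NULL);
  return (0);
}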

The main points illustrated in the smarts_filter example are:

  • Each thread creates its own local 'pattern' object from the given SMARTS string rather than attempting to share a single pattern.
  • The coordination of the input and output streams (the line reads from stdin and the printf() calls) is handled by the stdio library. Because the I/O is performed a line at a time, and on POSIX systems stdio locks a stream for the duration of each call, lines are read and written intact. The programmer does not need to do anything special.

21.4 Global Object Model

The second model for programming uses global objects. That is, every object allocated within the application is visible to all threads. This model is more complicated to implement, as the programmer must synchronize access to any objects that are shared between threads.

The basic program requirements are as follows:

  • The function dt_mp_initialize(DX_MP_GLOBAL_HANDLES) must be called before any toolkit objects are allocated. It is generally best to call this function from the main thread during startup.
  • There is a single handle ID namespace, so handles can be passed between threads and reference the same object in both threads. Objects which are created and used locally within a thread do not need special handling; they can be used without any locking or synchronization.
  • If an object is required in multiple threads the program must take care to allow only one thread at a time to access the object and its methods. The functions dt_mp_lock(), dt_mp_trylock(), and dt_mp_unlock() are provided as a convenience for these operations.

The analogous smarts_filter_mt.c is shown below, where the SMARTS pattern is shared between all threads.


static dt_Handle pattern;

void *do_smarts_forever(void *arg)
{
  static const int ok = 1;
  static const int fail = -1;
  char      line[MAXSMI];
  dt_Handle mol, pathset;

  while (fgets(line, sizeof(line), stdin))
    {
      /* fgets() keeps the trailing newline; drop it before parsing */
      line[strcspn(line, "\n")] = '\0';

      mol = dt_smilin(strlen(line), line);

      if (mol != NULL_OB)
        {
          /* 'pattern' is shared, so hold its lock for the match */
          dt_mp_lock(pattern);
          pathset = dt_match(pattern, mol, TRUE);
          dt_mp_unlock(pattern);
          if (pathset != NULL_OB)
            {
              dt_dealloc(pathset);
              printf("%s\n", line);
            }
          dt_dealloc(mol);
        }
    }
  return((void *)&ok);
}

#define THR_COUNT 4

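int main(int argc, char *argv[])
{
  pthread_t tid[THR_COUNT];
  int i;

  /*** Global handles: one handle namespace shared by all threads ***/

  dt_mp_initialize(DX_MP_GLOBAL_HANDLES);

  /*** Get SMARTS from command line ***/

  if (2 != argc)
    {
      fprintf(stderr, "usage: %s SMARTS\n", argv[0]);
      exit(1);
    }

  pattern = dt_smartin(strlen(argv[1]), argv[1]);
  if (pattern == NULL_OB)
    {
      fprintf(stderr, "Can't parse SMARTS <%s>\n", argv[1]);
      exit(1);
    }

  for (i = 0; i < THR_COUNT; i++)
    if (pthread_create(&tid[i], NULL, do_smarts_forever, NULL) != 0)
      {
        fprintf(stderr, "Can't create thread %d\n", i);
        exit(1);
      }

  /*** Wait for the workers; returning from main() would end the process ***/

  for (i = 0; i < THR_COUNT; i++)
    pthread_join(tid[i], NULL);
  return (0);
}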

In the above example, the pattern object is created in the parent (main) thread. Each child uses the same pattern object for the dt_match() operation. The pattern object must be locked before executing the dt_match() function and unlocked after the match is complete.

It is important to note that the dt_mp_lock()/dt_mp_unlock() mechanism is a cooperative locking scheme. Every thread which needs to access a shared object must lock it. When one thread locks an object it does not automatically prevent other threads from accessing the object or calling its methods; the programmer has the responsibility of guarding against access by other threads.
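The same cooperative discipline can be shown with a plain POSIX mutex. The sketch below is an analogy only: shared_lock and shared_hits are hypothetical stand-ins for a shared toolkit object and its lock, and pthread_mutex_trylock() plays the role of a non-blocking lock attempt. It makes no claim about how dt_mp_lock() is implemented.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define WORKERS 4
#define ROUNDS  10000

/* Hypothetical shared object and the lock that guards it by convention. */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_hits = 0;

/* Every thread that touches shared_hits takes the lock first. */
static void *cooperating_worker(void *arg)
{
  const int *rounds = (const int *)arg;
  int i;

  assert(rounds != NULL);
  for (i = 0; i < *rounds; i++)
    {
      pthread_mutex_lock(&shared_lock);
      shared_hits++;
      pthread_mutex_unlock(&shared_lock);
    }
  return NULL;
}

/* Non-blocking attempt: returns 1 if the update was made, 0 if the
   lock was busy and the update was skipped. */
static int bump_if_free(void)
{
  if (pthread_mutex_trylock(&shared_lock) != 0)
    return 0;
  shared_hits++;
  pthread_mutex_unlock(&shared_lock);
  return 1;
}

/* Run WORKERS threads; returns the final count, or -1 on failure. */
static long run_cooperating_workers(void)
{
  pthread_t tid[WORKERS];
  int rounds = ROUNDS;
  int i, started = 0;

  shared_hits = 0;
  for (i = 0; i < WORKERS; i++)
    {
      if (pthread_create(&tid[i], NULL, cooperating_worker, &rounds) != 0)
        break;
      started++;
    }
  for (i = 0; i < started; i++)
    pthread_join(tid[i], NULL);
  return (started == WORKERS) ? shared_hits : -1;
}
```

The count is exact only because every worker follows the convention. A thread that incremented shared_hits without taking shared_lock would not be stopped by the mutex, which is the sense in which the scheme is cooperative.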

21.5 Object Granularity

Since the toolkit is implemented using an opaque interface, internal behaviors of toolkit functions are not rigorously defined. Only external behaviors are well defined. This is a very pleasant model for Daylight; we are free to use lazy evaluation, to organize internal data however we choose, and to change these internal data organizations at will.

The unfortunate side effect of this opaque interface is that we can't precisely describe the results of a given modification of one object. A seemingly simple modification to an object may have far-reaching impacts upon other data structures within the toolkit. Given that the key to a multithreaded library implementation is the control of data modifications, this leads to a problem. The solution to this problem is simply to define the allowed concurrency at a larger granularity than would otherwise be possible.

The basic granularity of access allowed for objects within the toolkit is the object family. The 'object family' is a new concept within the toolkit which refers to a collection of objects which are related to one another as parent/children or base/derivatives. All objects within an object family share the same object as their ancestor (dt_ancestor(3)). The ancestor object is the ultimate parent of all of the objects within the family.

If a thread is accessing/manipulating any object within an object family it is not safe for another thread to be accessing/manipulating any other objects within that object family. Because we can guarantee that side-effects caused by one thread manipulating an object will be contained within the object family, this works out to allow well-defined behavior for multithreaded programs.

There is one complication, however: toolkit functions which depend on multiple objects (e.g., dt_match()). In those cases each thread must have exclusive access to all of the object families needed by the function in order to be thread safe. In the case of dt_match() the thread must lock both the pattern object and the target object. The exception is an object known to be local to a thread (e.g., the molecule in the smarts_filter_mt.c example above), which does not need to be locked.
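When two shared families must be held at once, the order in which they are locked matters: two threads that take the same pair in opposite orders can deadlock. The sketch below shows the usual pthreads remedy of always locking in a fixed order. This is general practice rather than anything the toolkit prescribes; struct family and the functions below are hypothetical stand-ins, and the sketch assumes that distinct families carry distinct id values.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an object family: an id, its lock, a use count. */
struct family
{
  int id;
  pthread_mutex_t lock;
  long uses;
};

/* Returns 0 on success. */
static int family_init(struct family *f, int id)
{
  f->id = id;
  f->uses = 0;
  return pthread_mutex_init(&f->lock, NULL);
}

/* Lock both families, lowest id first, so that two threads asking for
   the same pair in opposite order cannot deadlock. */
static void lock_pair(struct family *a, struct family *b)
{
  assert(a != NULL && b != NULL);
  if (a == b)
    {
      pthread_mutex_lock(&a->lock);
      return;
    }
  if (a->id > b->id)
    {
      struct family *tmp = a;
      a = b;
      b = tmp;
    }
  pthread_mutex_lock(&a->lock);
  pthread_mutex_lock(&b->lock);
}

static void unlock_pair(struct family *a, struct family *b)
{
  pthread_mutex_unlock(&a->lock);
  if (a != b)
    pthread_mutex_unlock(&b->lock);
}

/* A job names two families and how many two-object operations to run. */
struct pair_job
{
  struct family *first;
  struct family *second;
  int rounds;
};

/* Thread body: each round holds both families while it uses them. */
static void *pair_worker(void *arg)
{
  struct pair_job *job = (struct pair_job *)arg;
  int i;

  for (i = 0; i < job->rounds; i++)
    {
      lock_pair(job->first, job->second);
      job->first->uses++;
      job->second->uses++;
      unlock_pair(job->first, job->second);
    }
  return NULL;
}
```

If one of the two objects is local to the calling thread, as the molecule is in the filter example, only the shared one needs a lock and the ordering question does not arise.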

21.6 Thread Safety versus Reentrancy

The toolkit is reentrant, not thread-safe. It is the responsibility of the programmer to take the tools and write multithreaded applications. If desired, one can implement a heavy-handed thread-safe toolkit interface as follows: wrap every toolkit function in a layer which locks the object family before operating on an object. This wrapper layer would provide a completely thread-safe toolkit API, but would probably have fairly poor performance.

dp_xxx(ob)
{
  dt_mp_lock(ob);
  rc = dt_xxx(ob);

  if (type(rc) is string)
    duplicate string;
  dt_mp_unlock(ob);
  return rc or duplicate string;
}

Note that in the above wrapper any returned strings are duplicated. This eliminates the previously mentioned warning about accessing strings within shared objects.
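The wrapper pattern can be made concrete with a stand-in object. In the sketch below, struct shared_obj, get_name_copy() and set_name() are hypothetical; a pthread mutex takes the place of dt_mp_lock()/dt_mp_unlock(), and the caller is assumed to free the returned copy.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared object that owns a string. */
struct shared_obj
{
  pthread_mutex_t lock;
  char *name;               /* owned by the object; may be NULL */
};

/* Wrapper in the style of dp_xxx(): lock, read, duplicate, unlock.
   Returns a copy the caller must free, or NULL if there is no string
   or the allocation fails. */
static char *get_name_copy(struct shared_obj *ob)
{
  char *dup = NULL;
  size_t len;

  assert(ob != NULL);
  pthread_mutex_lock(&ob->lock);
  if (ob->name != NULL)
    {
      len = strlen(ob->name);
      dup = malloc(len + 1);
      if (dup != NULL)
        memcpy(dup, ob->name, len + 1);
    }
  pthread_mutex_unlock(&ob->lock);
  return dup;
}

/* Replace the owned string under the same lock. Returns 0 on success. */
static int set_name(struct shared_obj *ob, const char *name)
{
  size_t len;
  char *fresh;

  assert(ob != NULL && name != NULL);
  len = strlen(name);
  fresh = malloc(len + 1);
  if (fresh == NULL)
    return -1;
  memcpy(fresh, name, len + 1);

  pthread_mutex_lock(&ob->lock);
  free(ob->name);
  ob->name = fresh;
  pthread_mutex_unlock(&ob->lock);
  return 0;
}
```

Because the copy is made while the lock is held, a later set_name() in another thread cannot invalidate a string that a caller has already received.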

Nothing prevents a programmer from misusing threads, or from accessing the same object from multiple threads. Only if the programmer always obtains a lock for the object family before modification is exclusive access to the object guaranteed. If one or more threads do not obey this convention, then the program may not be thread-safe.

21.7 Limitations

There are several areas of the toolkit which are not reentrant and cannot be used by multiple threads at the same time.

The database access toolkits (Thor and Merlin client toolkits) are not reentrant and cannot be used by more than one thread in a program at a time. It is possible for a multithreaded program to include these toolkits provided that access to both toolkits is serialized completely within the application.

The Rubicon toolkit is not reentrant and cannot be used by more than one thread at a time. It is possible for a multithreaded program to include Rubicon functionality provided that access to the toolkit is serialized completely within the application.

Within the depict toolkit, use of the dt_depict() function and the drawing library must be serialized across the entire application. It is not possible for multiple dt_depict() calls to be in progress in different threads at the same time, since each will result in the invocation of global drawing library functions. In practice, the drawing library that a user implements must reference one or more global variables (locked along with the dt_depict() function) which can provide the local thread context necessary for the desired drawing operations.
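The arrangement described above, one lock covering both the depiction call and the global that carries the thread's drawing context, can be sketched as follows. Everything here is a hypothetical stand-in: fake_depict() takes the place of dt_depict(), draw_one_line() takes the place of a user-written drawing library function, and the "canvas" is just a per-thread counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEPICT_THREADS 4
#define DEPICT_ROUNDS  1000

/* One lock for the whole drawing path, plus the global the callbacks read. */
static pthread_mutex_t depict_lock = PTHREAD_MUTEX_INITIALIZER;
static void *current_canvas = NULL;   /* valid only while depict_lock is held */

/* Hypothetical drawing callback: draws on whatever current_canvas names. */
static void draw_one_line(void)
{
  int *lines_drawn = (int *)current_canvas;

  (*lines_drawn)++;
}

/* Hypothetical stand-in for a call that invokes the global drawing library. */
static void fake_depict(void)
{
  draw_one_line();
}

/* Serialize the depiction and publish the caller's canvas for its duration. */
static void depict_serialized(void *canvas)
{
  assert(canvas != NULL);
  pthread_mutex_lock(&depict_lock);
  current_canvas = canvas;
  fake_depict();
  current_canvas = NULL;
  pthread_mutex_unlock(&depict_lock);
}

/* Thread body: draw DEPICT_ROUNDS times on this thread's own canvas. */
static void *depict_worker(void *arg)
{
  int i;

  for (i = 0; i < DEPICT_ROUNDS; i++)
    depict_serialized(arg);
  return NULL;
}

/* Returns 1 if every thread's canvas received exactly its own lines. */
static int run_depict_workers(void)
{
  pthread_t tid[DEPICT_THREADS];
  int canvas[DEPICT_THREADS] = { 0 };
  int i, started = 0, ok = 1;

  for (i = 0; i < DEPICT_THREADS; i++)
    {
      if (pthread_create(&tid[i], NULL, depict_worker, &canvas[i]) != 0)
        break;
      started++;
    }
  for (i = 0; i < started; i++)
    pthread_join(tid[i], NULL);
  for (i = 0; i < DEPICT_THREADS; i++)
    if (canvas[i] != DEPICT_ROUNDS)
      ok = 0;
  return ok && started == DEPICT_THREADS;
}
```

The same single-lock approach applies to the other non-reentrant toolkits mentioned in this section: every call into them goes through one application-wide mutex.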

Several obsolete toolkit functions are not reentrant. These include dt_smilinerrtext(), dt_alloc_fp(), dt_fp_fingerprint(), dt_fp_fold(), dt_fp_mindensity(), dt_fp_minsize(), dt_fp_size(), dt_fp_setmindensity(), dt_fp_setminsize() and dt_fp_setsize(). These functions had previously been made obsolete because their API did not fit well within the Daylight toolkit model and should not be used for new programs.