Overhead of File-Based Data Transfer

Example Usage (Running a 4-component reaction & getting properties):

  • Import 4 components (17 x 17 x 17 x 1): 0.02 sec.
    • real 0m0.018s
    • user 0m0.000s
    • sys 0m0.000s
  • React to form 4,913 products: 38.62 sec
    • real 0m38.623s
    • user 0m38.480s
    • sys 0m0.040s
  • Compute common properties: 3.98 sec
    • real 0m3.983s
    • user 0m3.920s
    • sys 0m0.050s
  • Total: 42.62 sec.
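
The per-step figures above are the kind reported by the Unix time command. As a rough illustration, a pipeline like this could be timed from Python as sketched below; the tool and file names are placeholders, not the actual programs used here:

```python
import os
import subprocess
import time

# Placeholder pipeline: each stage is a separate process that reads and
# writes intermediate files, mirroring the file-based transfer measured above.
STEPS = [
    ("import", ["import_components", "components.smi", "components.xml"]),
    ("react",  ["react", "components.xml", "products.xml"]),
    ("props",  ["compute_props", "products.xml", "properties.xml"]),
]

for name, cmd in STEPS:
    wall0 = time.perf_counter()
    t0 = os.times()                  # children's CPU time consumed so far
    subprocess.run(cmd, check=True)
    t1 = os.times()
    print(f"{name}: real {time.perf_counter() - wall0:.3f}s  "
          f"user {t1.children_user - t0.children_user:.3f}s  "
          f"sys {t1.children_system - t0.children_system:.3f}s")
```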

What’s the "ouch" level for this example?

From baseline tests, we know it takes about 0.157 sec per 1,000 molecules to read, parse, create, and free molecules. So the XML-handling overhead in step 3 (computing properties) is about 0.77 sec, or about 19% of that step's 3.98 sec.
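
The arithmetic is simple enough to check. A minimal Python sketch, using only the figures quoted above:

```python
# Figures from the text: baseline parse cost and the measured step-3 time.
parse_cost = 0.157 / 1000       # sec per molecule (read, parse, create & free)
n_products = 17 * 17 * 17 * 1   # 4,913 products
step3_time = 3.98               # sec, "compute common properties"

overhead = parse_cost * n_products
print(f"{overhead:.2f} sec, {overhead / step3_time:.0%} of step 3")
# -> 0.77 sec, 19% of step 3
```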

A more efficient 3GL program would read in all 52 components, parse and save them, produce each product, calculate that product's properties, and then output each result. Even with no context switching at all, the best we could get would be about 38.62 + 3.98 - 0.77 = 41.83 sec, about a 2% improvement.
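
Here is the same best-case comparison as a sketch; the only saving available to the fused program is the single avoided parse of the product stream:

```python
# Measured file-based pipeline vs. the idealized fused (3GL-style) program.
file_based = 0.02 + 38.62 + 3.98   # import + react + compute properties
fused_best = 38.62 + 3.98 - 0.77   # same work, minus one XML parse pass

saving = (file_based - fused_best) / file_based
print(f"{fused_best:.2f} sec best case, {saving:.1%} improvement")
# -> 41.83 sec best case, 1.9% improvement
```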

Discussion

We could add more steps to our example to widen the disparity. For example, we could add a filter step at the end. That would add another file-processing step with roughly a ¾ to 1 sec penalty, bringing the difference in this example to about 3.8%. By adding many such steps, it’s conceivable (though unlikely) that our overall overhead would approach that of the final step, namely 19%. To reach that level of overhead, however, we’d need to add more than 100 steps with little or no reduction in the size of our data stream. It’s more likely that the data stream will be filtered down, possibly joined with another data stream in a second or third reaction, and filtered again.

Somewhere in the vicinity of six steps involving large data streams seems a more realistic worst-case scenario, carrying a penalty of about 8-13%, depending on how much the data stream is reduced by filtering. More commonly, we’d expect to see one, two, or three steps handling the largest data stream, carrying a penalty somewhere between 2% and 10%. The sketch below works through these figures.
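
A rough model shows how the overhead fraction grows as steps are chained. The sketch below assumes, pessimistically, that every step re-parses the full 4,913-molecule stream at ~0.77 sec with no data reduction; the ~1 sec total cost for a cheap filter step is an assumption, not a measurement:

```python
REACT = 38.62   # sec, fixed reaction step
PARSE = 0.77    # sec to re-parse the full product stream at each step

def overhead_fraction(n_steps, step_total):
    """Fraction of total runtime spent on file parsing.

    step_total is the whole per-step cost, parse included;
    3.98 sec matches the measured property step above.
    """
    return n_steps * PARSE / (REACT + n_steps * step_total)

for n in (1, 2, 6, 100):
    print(f"{n:3d} property-sized steps: {overhead_fraction(n, 3.98):.1%}")
# -> 1.8%, 3.3%, 7.4%, 17.6% -- creeping toward the 19% asymptote

for n in (2, 6):
    print(f"{n:3d} cheap filter steps:   {overhead_fraction(n, 1.0):.1%}")
# -> 3.8% and 10.4%, in line with the penalties quoted above
```

Filtering that actually shrinks the stream would lower the parse cost of later steps, which is why the realistic penalties land well below the pessimistic asymptote.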

Precise estimates can only be made from overall usage patterns: the number of process steps handling millions of molecules, the number of simultaneous users, the point at which data streams are filtered, and how much data are carried forward at each step all affect efficiency.

Based on these estimates, however, it seems reasonable to conclude that the overhead of using this 4GL is surprisingly small, even with the very verbose XML data stream.
