Along with numerous updates to the Chem package over the last few weeks, we’ve also implemented the benchmark (link) according to the spec provided by Andrew @skalkin. While the details of the benchmark’s implementation are still open to discussion and improvement, we already have enough grounds to reason about further performance-related directions, client-side RDKit support in particular.
We ran the benchmark on a 100,000-molecule subset extracted from the 1.94M molecules of the ChEMBL database. The test was run on a 6-core Intel Core i7-8750H at 2.21 GHz with 32 GB of RAM.
We got the following results:
1. Rendering 1000 random molecules: 13095 ms
2. Horizontal scrolling (20 random molecules, 100 times): 9679 ms
3. Vertical scrolling (20 random molecules, 100 times): 19562 ms
4. Substructure search, building a library of 100000 molecules: 93403 ms
5. Substructure search, searching benzene in 100000 molecules: 3539 ms
6. Substructure search, searching aspirin in 100000 molecules: 102 ms
7. Similarity scoring, building a library of 100000 molecules: 84099 ms
8. Similarity scoring, search for 10 samples in 100000 molecules: 1399 ms
9. Substructure search (server), searching benzene in 100000 molecules: 25324 ms
10. Substructure search (server), searching aspirin in 100000 molecules: 7300 ms
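To put the rendering and scrolling totals on a common footing, they can be divided by the number of molecules drawn. This is only a rough sketch: it assumes every scrolling step redraws all 20 molecules, which the benchmark spec may or may not do, and the variable and function names are ours, not part of the benchmark.

```python
# Totals taken from items 1-3 of the results list, in milliseconds.
render_total_ms = 13095      # item 1: 1000 random molecules
h_scroll_total_ms = 9679     # item 2: 20 molecules, 100 times
v_scroll_total_ms = 19562    # item 3: 20 molecules, 100 times


def per_molecule_ms(total_ms, molecules):
    """Average cost per molecule, assuming every molecule is redrawn."""
    return total_ms / molecules


render_ms = per_molecule_ms(render_total_ms, 1000)
h_scroll_ms = per_molecule_ms(h_scroll_total_ms, 20 * 100)
v_scroll_ms = per_molecule_ms(v_scroll_total_ms, 20 * 100)

print(f"rendering:            {render_ms:.1f} ms per molecule")
print(f"horizontal scrolling: {h_scroll_ms:.1f} ms per molecule")
print(f"vertical scrolling:   {v_scroll_ms:.1f} ms per molecule")
```

Under that assumption, vertical scrolling costs roughly twice as much per molecule as horizontal scrolling.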
Notes
- We’ve parallelized the JS-RDKit substructure search via JS Web Workers. Each Web Worker constructs and owns its own SubstructLibrary instance. A search then sends the substructure query to all Web Workers and merges their results. Note that SubstructLibrary also has built-in support for parallel construction using C++ threads, but this isn’t available in JS-RDKit right now. It would be interesting to compare that version with our Web Workers variant.
- The number of cores for the JS-RDKit substructure search (through Chem) was set to 10, whereas the server RDKit substructure search was parallelized to run on 16 cores.
- Our version of the JS-RDKit-based similarity scoring is not parallelized yet; the reported performance is for 1 core. We haven’t run the server version of RDKit similarity scoring because of a problem with parsing some molecules from the dataset, which we are going to address soon.
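The scatter/gather idea behind the Web Workers variant can be sketched roughly as follows. This is not the Chem package code: Python threads stand in for Web Workers, `naive_match` is a placeholder substring test on SMILES text rather than real RDKit substructure matching, and we assume each worker holds a disjoint slice of the dataset. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_shards(items, n_shards):
    """Deal items round-robin into n_shards lists of (global_index, item)."""
    shards = [[] for _ in range(n_shards)]
    for index, item in enumerate(items):
        shards[index % n_shards].append((index, item))
    return shards


def naive_match(molecule, query):
    # Placeholder only: a plain substring test on SMILES text, NOT chemistry.
    # A real worker would query the substructure library it owns instead.
    return query in molecule


def search_shard(shard, query):
    """Work done by one worker: global indices of the matches in its shard."""
    return [index for index, molecule in shard if naive_match(molecule, query)]


def parallel_search(shards, query):
    """Send the same query to every shard and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_hits = pool.map(lambda shard: search_shard(shard, query), shards)
        return sorted(hit for hits in partial_hits for hit in hits)


molecules = ["c1ccccc1", "CCO", "CC(=O)Oc1ccccc1C(=O)O", "CCN", "Cc1ccccc1"]
shards = split_into_shards(molecules, 3)   # built once, then reused per query
hits = parallel_search(shards, "c1ccccc1")
print(hits)
```

Keeping the global index next to each molecule lets the merge step return positions in the original dataset regardless of which worker found the match.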
Discussion
It is clear that the server-based RDKit substructure search outperforms the JS-RDKit-based one. With 1.6 times as many cores (16 vs. 10), it produced the benzene result for both library construction and searching, including the server round-trip, almost 4 times faster than the JS-WASM-based version of the same functionality (25324 ms vs. 93403 + 3539 = 96942 ms). On the other hand, in a separate benchmark run we found that our JS-RDKit substructure search on 10 cores is only 4 times faster than the same search on 1 core. Since the speedup is already well below linear, adding even more cores to the laptop won’t bring us on par with the server version. We shouldn’t count on having a beefy laptop either, as the experience with Datagrok should be interactive even on weak laptops and tablets.
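As a back-of-the-envelope check of this argument, the numbers can be plugged into Amdahl's law. Note that Amdahl's law is our simplifying assumption here, not something the benchmark measured, and that 10 workers on a 6-core machine are oversubscribed, so the estimate is only indicative.

```python
# Timings from the results list, in milliseconds.
js_build_ms = 93403          # item 4: building the library in JS-RDKit
js_benzene_ms = 3539         # item 5: benzene search in JS-RDKit
server_benzene_ms = 25324    # item 9: benzene search on the server

js_total_ms = js_build_ms + js_benzene_ms
server_advantage = js_total_ms / server_benzene_ms


def amdahl_parallel_fraction(speedup, workers):
    """Solve speedup = 1 / ((1 - p) + p / workers) for the fraction p."""
    return (1 - 1 / speedup) / (1 - 1 / workers)


def amdahl_speedup(p, workers):
    """Speedup predicted by Amdahl's law for parallel fraction p."""
    return 1 / ((1 - p) + p / workers)


# Observed: 10 workers are 4 times faster than 1 worker.
p = amdahl_parallel_fraction(speedup=4, workers=10)
ceiling = 1 / (1 - p)        # limit of the speedup with unlimited workers
remaining_gain = ceiling / 4

print(f"server advantage over JS (benzene): {server_advantage:.2f}x")
print(f"estimated parallel fraction:        {p:.3f}")
print(f"speedup ceiling under this model:   {ceiling:.1f}x")
print(f"best-case further gain over 10:     {remaining_gain:.2f}x")
```

Under this model the best-case further gain from adding workers is smaller than the gap to the server, which is consistent with the conclusion above.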
We are in touch with the RDKit authors about improving JS-RDKit performance.
We also noticed that scrolling performance needs improvement. We plan to extend this benchmark to compare rendering and scrolling with both RDKit and OpenChemLib-JS. We also plan to extend the molecule caching implemented in Chem to cache the actual renders too, which we expect to further improve scrolling performance.
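One possible shape for such a render cache is a small least-recently-used (LRU) map keyed by molecule and cell size. This is a sketch under our own assumptions, not the planned Chem implementation: `RenderCache`, `fake_render`, and the key layout are all hypothetical.

```python
from collections import OrderedDict


class RenderCache:
    """LRU cache of rendered images keyed by (smiles, width, height)."""

    def __init__(self, capacity, render):
        self.capacity = capacity
        self.render = render          # function that does the expensive drawing
        self.entries = OrderedDict()  # oldest entry first, newest last
        self.misses = 0

    def get(self, smiles, width, height):
        key = (smiles, width, height)
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        image = self.render(smiles, width, height)
        self.entries[key] = image
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return image


def fake_render(smiles, width, height):
    # Stand-in for the real drawing call; returns a label instead of pixels.
    return f"image({smiles}, {width}x{height})"


cache = RenderCache(capacity=2, render=fake_render)
cache.get("CCO", 100, 50)        # miss: rendered and stored
cache.get("CCO", 100, 50)        # hit: served from the cache
cache.get("CCN", 100, 50)        # miss
cache.get("c1ccccc1", 100, 50)   # miss: evicts "CCO"
cache.get("CCO", 100, 50)        # miss again: evicts "CCN"
print(cache.misses)
```

Including the cell size in the key means a resized column triggers a fresh render instead of reusing an image at the wrong resolution.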