RDKit is a modern cheminformatics toolkit and one of the most widely recognized in the industry. WebAssembly, or WASM, is an open standard for shipping software that runs uniformly in any browser with near-native performance. This revolutionizes software architectures by dissolving the boundaries between client and server, and between native and web applications, bringing the computation right to where the data lives.
Here we plan to discuss innovative ways of using RDKit via WebAssembly, including molecule rendering on canvas, multithreading (for substructure search in particular), substructure alignment, and the performance of each.
Since performance is pivotal here, I think it makes sense to start with a set of benchmarks, so that we can set goals and track improvements. Let’s keep it very simple, yet representative of a chemist’s everyday work. Here are some ideas:
Rendering
Dataset: 1,000 random ChEMBL molecules
Goal: make rendering seamless, with no visible lag (currently, RDKit is slower than OpenChemLib in this area)
Render 20 molecules 100 times (horizontal scrolling)
Render 20 molecules 100 times with a sliding window (vertical scrolling)
Substructure Search
Dataset: 100,000 random ChEMBL molecules
Goal: make substructure search in 1 million molecules an interactive experience
Let’s start with plain substructures and ignore complex SMARTS for the moment. For each test, we will run two searches (the first one might involve calculating fingerprints, which we can do speculatively in the background).
Search for a benzene ring
Search for aspirin
Similarity Search
Dataset: 100,000 random ChEMBL molecules
Goal: make similarity search in 1 million molecules an interactive experience
For each of 10 random molecules, find the 50 most similar molecules
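For reference, the similarity benchmark boils down to ranking the dataset by a similarity score against each query. Below is a minimal TypeScript sketch of that idea. It assumes fingerprints are already available as packed bit vectors of equal length and that Tanimoto is the similarity measure; neither assumption is stated in the spec above, and the names popcount32, tanimoto and mostSimilar are illustrative, not RDKit or Chem functions.

```typescript
// Assumption: a fingerprint is a packed bit vector; all vectors have equal length.
type Fingerprint = Uint32Array;

// Counts the set bits of a 32-bit word.
function popcount32(x: number): number {
  x = x - ((x >>> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >>> 2) & 0x33333333);
  x = (x + (x >>> 4)) & 0x0f0f0f0f;
  return Math.imul(x, 0x01010101) >>> 24;
}

// Tanimoto similarity: |A and B| / |A or B|. Two empty fingerprints score 0 by convention here.
function tanimoto(a: Fingerprint, b: Fingerprint): number {
  let common = 0;
  let onA = 0;
  let onB = 0;
  for (let i = 0; i < a.length; i++) {
    common += popcount32(a[i] & b[i]);
    onA += popcount32(a[i]);
    onB += popcount32(b[i]);
  }
  const union = onA + onB - common;
  return union === 0 ? 0 : common / union;
}

// Returns the k library entries most similar to the query, best first.
function mostSimilar(
  query: Fingerprint,
  library: Fingerprint[],
  k: number
): { index: number; score: number }[] {
  return library
    .map((fp, index) => ({ index, score: tanimoto(query, fp) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Sorting the whole score list is the simplest thing that works; for 1 million molecules and k = 50 a bounded heap would avoid the full sort, but that is an optimization to measure rather than assume.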
In the latest version of RDKit WASM with SubstructLibrary support (thanks to @ptosco), the compilation option ALLOW_MEMORY_GROWTH opened up the entire 4 GB memory space to the library, compared to just 16 MB available by default. We’ve also posted recent benchmarks in Cheminformatics, which show a consistent 10x speedup compared to a “naive” graph-based search.
We’ve also learned that one needs to use add_smiles instead of add_trusted_smiles when the SMILES does not come from RDKit itself and is therefore not a normalized (trusted) SMILES. For example, this is not a normalized SMILES: COc1ccc(c2c1cccc2)C(=O)CCC(=O)O, but this one is: COc1ccc(C(=O)CCC(=O)O)c2ccccc12.
There are some remaining questions. I hope @ptosco can find some time to answer them.
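One way to keep this choice explicit in calling code is to record where each SMILES came from and dispatch on that. The sketch below is only an illustration: the interface declares just the two method names mentioned above, the numeric return type is an assumption, and addToLibrary and SmilesSource are hypothetical names that do not exist in RDKit or Chem.

```typescript
// Only the two method names come from the discussion; the return type is an assumption.
interface SmilesLibrary {
  add_smiles(smiles: string): number;
  add_trusted_smiles(smiles: string): number;
}

// Hypothetical tag describing the origin of a SMILES string.
type SmilesSource = "rdkit" | "external";

// Uses the trusted path only for SMILES that RDKit itself produced.
function addToLibrary(lib: SmilesLibrary, smiles: string, source: SmilesSource): number {
  return source === "rdkit"
    ? lib.add_trusted_smiles(smiles)
    : lib.add_smiles(smiles);
}
```

With this helper, a SMILES such as COc1ccc(c2c1cccc2)C(=O)CCC(=O)O coming from a user file would be tagged "external" and go through add_smiles.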
What is the difference between get_mol and get_qmol?
In general, what is a safe way to estimate whether there is enough memory for a given number of molecules, say, for an array of N molecules?
How would performance change compared to SubstructLibrary in the following scenario? First (1), we compute only the fingerprints for the molecules, using the fingerprint type that substructure search relies on (which type does the RDKit SubstructLibrary use?). Second (2), we simply go through these pre-computed fingerprints and match them against the pattern fingerprint, perhaps with some additional matching logic, perhaps via an additional function that RDKit could expose for matching fingerprints. How much more than this does SubstructLibrary do? We are looking into this use case because in many applications we can compute and cache these fingerprints once and then reuse them all the time.
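To make step (2) of the last question concrete, here is a minimal sketch of the generic fingerprint screening idea: a target can contain the query only if every bit set in the query fingerprint is also set in the target fingerprint. The packed Uint32Array representation and the names mayContain and screen are assumptions for illustration; this is not a description of what SubstructLibrary does internally.

```typescript
// True when every bit set in the query is also set in the target,
// i.e. the target cannot be ruled out by the fingerprint alone.
function mayContain(target: Uint32Array, query: Uint32Array): boolean {
  for (let i = 0; i < query.length; i++) {
    // Any query bit missing from the target rules the target out.
    if ((query[i] & ~target[i]) !== 0) return false;
  }
  return true;
}

// Returns the indices of cached fingerprints that pass the screen.
function screen(cachedFps: Uint32Array[], queryFp: Uint32Array): number[] {
  const candidates: number[] = [];
  cachedFps.forEach((fp, i) => {
    if (mayContain(fp, queryFp)) candidates.push(i);
  });
  return candidates;
}
```

A screen like this can only exclude molecules. The surviving candidates may still include false positives, so they would need an actual atom-by-atom match afterwards.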
Along with numerous updates to the Chem package over the last weeks, we’ve also implemented the benchmark (link) following the spec provided by Andrew @skalkin. While the details of the benchmark’s implementation are open to discussion and improvement, we already have grounds to reason about further performance-related directions, for client-side RDKit support in particular.
We ran the benchmark on a 100,000-molecule subset extracted from the 1.94M molecules of the ChEMBL database. The test was run on a 6-core Intel Core i7-8750H at 2.21 GHz with 32 GB of RAM.
We got the following results:
1. Rendering 1000 random molecules: 13095 ms
2. Horizontal scrolling (20 random molecules, 100 times): 9679 ms
3. Vertical scrolling (20 random molecules, 100 times): 19562 ms
4. Substructure search, building a library of 100000 molecules: 93403 ms
5. Substructure search, searching benzene in 100000 molecules: 3539 ms
6. Substructure search, searching aspirin in 100000 molecules: 102 ms
7. Similarity scoring, building a library of 100000 molecules: 84099 ms
8. Similarity scoring, search for 10 samples in 100000 molecules: 1399 ms
9. Substructure search (server), searching benzene in 100000 molecules: 25324 ms
10. Substructure search (server), searching aspirin in 100000 molecules: 7300 ms
Notes
We’ve parallelized the JS-RDKit substructure search via JS Web Workers. Each Web Worker constructs and owns its own SubstructLibrary instance. The actual search then queries all Web Workers with the substructure query, and the results are merged. Note that SubstructLibrary also has built-in support for parallel construction using C++ threads, but this isn’t available in JS-RDKit right now. It would be interesting to compare that version with our Web Workers variant.
The number of cores for the JS-RDKit substructure search (through Chem) was set to 10, whereas the server-side RDKit substructure search was parallelized to run on 16 cores.
Our version of JS-RDKit-based similarity scoring is not parallelized yet; the reported performance is for 1 core. We haven’t run the server version of RDKit similarity scoring because of a problem with parsing some molecules from the dataset, which we are going to address soon.
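The bookkeeping behind "query all workers and merge" can be sketched without any actual Web Workers. The example below assumes the dataset is split into contiguous shards, one per worker, and that each worker reports hit indices local to its shard; how Chem actually partitions the data is not described here, and makeShards and mergeHits are hypothetical names.

```typescript
// A contiguous slice of the dataset owned by one worker.
interface Shard {
  offset: number;
  length: number;
}

// Splits indices 0..total-1 into `workers` contiguous shards of near-equal size.
function makeShards(total: number, workers: number): Shard[] {
  const shards: Shard[] = [];
  const base = Math.floor(total / workers);
  const extra = total % workers;
  let offset = 0;
  for (let w = 0; w < workers; w++) {
    // The first `extra` shards take one additional molecule each.
    const length = base + (w < extra ? 1 : 0);
    shards.push({ offset, length });
    offset += length;
  }
  return shards;
}

// Converts per-worker local hit indices back to dataset-wide indices.
function mergeHits(shards: Shard[], localHits: number[][]): number[] {
  const merged: number[] = [];
  shards.forEach((shard, w) => {
    for (const hit of localHits[w]) merged.push(shard.offset + hit);
  });
  return merged;
}
```

Because shards are contiguous and visited in order, the merged list stays sorted by dataset index as long as each worker returns its hits in ascending order.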
Discussion
It is clear that the server-based RDKit substructure search outperforms the JS-RDKit-based one. With 1.6 times as many cores (16 vs. 10), it produced the result for both library construction and searching, including the server round-trip, almost 4 times faster than the JS-WASM-based version of the same functionality. On the other hand, in a separate benchmark run we found that our JS-RDKit substructure search on 10 cores is only 4 times faster than the same search on 1 core, so it scales far from linearly. Thus, adding even more cores to the laptop won’t bring us on par with the server version. We shouldn’t count on having a beefy laptop either, as the experience with Datagrok should be interactive even on weak laptops and tablets.
We are in touch with the RDKit authors about improving JS-RDKit performance.
We also noticed that scrolling performance should be improved. We plan to enhance this benchmark to compare rendering and scrolling in both RDKit and OpenChemLib-JS. We also plan to extend the molecule caching implemented in Chem to cache the actual renders too, which is expected to further improve scrolling performance.
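The arithmetic behind these statements can be checked against the results list above. The sketch below reads "almost 4 times faster" as items 4 plus 5 compared with item 9 (benzene), which is my interpretation of the numbers. The last part applies Amdahl's law as a rough model; that model is an assumption, not something measured in the benchmark.

```typescript
// Timings in ms, copied from the results list above.
const jsBuild = 93403;       // item 4: JS-RDKit library construction
const jsBenzene = 3539;      // item 5: JS-RDKit benzene search
const serverBenzene = 25324; // item 9: server, construction + search + round-trip

// Server advantage for the benzene query.
const serverAdvantage = (jsBuild + jsBenzene) / serverBenzene; // about 3.83

// Parallel efficiency of the Web Worker version: 4x speedup on 10 cores.
const efficiency = 4 / 10; // 0.4

// Assumption: Amdahl's law, speedup(n) = 1 / (s + (1 - s) / n),
// solved here for the serial fraction s.
function serialFraction(speedup: number, n: number): number {
  return (n / speedup - 1) / (n - 1);
}

const s = serialFraction(4, 10); // 1/6
const speedupCeiling = 1 / s;    // 6x over one core, however many cores are added
```

Under this model the client could gain at most 6 / 4 = 1.5x over the current 10-core timing, well short of the roughly 3.8x gap to the server.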
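As an illustration of what caching actual renders could look like, here is a minimal least-recently-used cache sketch. It is not the Chem implementation: the RenderCache class, the key format, and the render callback are all hypothetical, and the cached value could be anything from an ImageBitmap to an offscreen canvas.

```typescript
// A small LRU cache: keeps at most `capacity` renders, evicting the least recently used.
class RenderCache<T> {
  private readonly entries = new Map<string, T>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  // Returns the cached render for `key`, or calls `render` and stores the result.
  get(key: string, render: () => T): T {
    if (this.entries.has(key)) {
      const cached = this.entries.get(key) as T;
      // Re-insert to mark the entry as most recently used.
      this.entries.delete(key);
      this.entries.set(key, cached);
      return cached;
    }
    const fresh = render();
    this.entries.set(key, fresh);
    if (this.entries.size > this.capacity) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
    return fresh;
  }
}

// Hypothetical key: the same molecule at a different cell size is a different render.
function renderKey(smiles: string, width: number, height: number): string {
  return smiles + "|" + width + "x" + height;
}
```

During scrolling, cells that re-enter the viewport would then be served from the cache instead of being redrawn, at the cost of memory proportional to the chosen capacity.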