Cheminformatics updates

This is the root topic for discussing cheminformatics, including the features of our Chem package.

The Chem package provides first-class cheminformatics support for the Datagrok platform. See it in action on YouTube. Given the platform’s existing capabilities in rich exploratory data analysis, advanced data mining, and out-of-the-box support for predictive modeling and scientific computations, this package turns Datagrok into a comprehensive platform for working with chemical and biological data.

Our performance goal is to open chemical datasets of up to 10 million small molecules completely in the browser, and to interactively perform commonly used operations such as substructure and similarity search without relying on a server. To hit this goal, we use a couple of techniques. First, we leverage Datagrok’s capability to work efficiently with relational data. For cheminformatics, we rely on the RDKit library compiled to WebAssembly. Not only does this let us execute C++ code at native speed, it also lets us take full advantage of modern multicore CPUs by running computations in multiple threads.

Here are some of the Chem features:

  • Works completely on the client side where possible
  • RDKit-based rendering:
    • Highlighting substructures on search (in progress)
    • Aligning to scaffolds (in progress)
    • Rendering options (in progress)
  • Fingerprint-based similarity and diversity analyses (see video)
  • Efficient in-memory substructure and similarity searching
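To illustrate what fingerprint-based similarity scoring involves, here is a minimal sketch. It assumes the Tanimoto coefficient as the metric and represents a fingerprint as a set of “on” bit positions; both are assumptions made for illustration, and the function names are hypothetical rather than part of the Chem package API.

```javascript
// Hypothetical representation: a fingerprint is a Set of "on" bit positions.
// Tanimoto coefficient: |A ∩ B| / |A ∪ B| (defined here as 0 for two empty sets).
function tanimoto(a, b) {
  let common = 0;
  for (const bit of a) {
    if (b.has(bit)) common++;
  }
  const union = a.size + b.size - common;
  return union === 0 ? 0 : common / union;
}

// Scores every fingerprint against the query and sorts most similar first.
function rankBySimilarity(query, fingerprints) {
  return fingerprints
    .map((fp, index) => ({index, score: tanimoto(query, fp)}))
    .sort((x, y) => y.score - x.score);
}

const query = new Set([1, 4, 7, 9]);
const fingerprints = [new Set([30, 31]), new Set([1, 4, 20]), new Set([1, 4, 7, 9])];
const ranked = rankBySimilarity(query, fingerprints);
console.log(ranked);
```

In practice the fingerprints would come from a cheminformatics library, and a packed bit vector would likely be more efficient than a Set.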

The following Chem features are still in the core, but we plan to move them out to this package:

  • Molecule sketching
    • OpenChemLib
  • SAR analysis
  • Property calculators (server-side)
  • 3D: coordinate calculation using RDKit, rendering using NGL Viewer
  • ChEMBL integration
  • PubChem integration
  • “Sketch-to-predict”: run predictive models as you sketch the molecule

Join our discussion here if you are interested in high-performance cheminformatics, and check what’s already available with the Chem package.

Chem 0.3 (Chem.0.3.0-8b678c) is released, bringing a number of performance improvements and new functionality.

  1. grok.chem.substructureSearch now supports both “naive” graph-based search and library-based search via WASM RDKit’s substructLibrary. The latter is the default for this method. Microbenchmarks are added too (more on performance below). Thanks to Paolo Tosco, we are using a pre-release version of WASM RDKit containing the substructLibrary methods. This version will soon be merged into RDKit’s MinimalLib master.

  2. RDKit-based rendering now uses LRU caching of RDKit’s molecule structures, both for as-is renders and for scaffold-aligned renders. Coordgen-based rendering is turned off by default (until we improve on it further in Chem), which gives a better scrolling experience.

All this is offered completely in the browser, without involving any server-based computations.
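For readers unfamiliar with LRU caching, the sketch below shows the general idea applied to parsed molecules. It is not the package’s implementation: the class name, the capacity, and the `parseMolecule` stand-in are all hypothetical.

```javascript
// Hypothetical LRU cache. A JS Map iterates keys in insertion order,
// so the first key is always the least recently used one.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = new Map();
  }

  get(key) {
    if (!this.items.has(key)) return undefined;
    const value = this.items.get(key);
    this.items.delete(key); // re-insert to mark as most recently used
    this.items.set(key, value);
    return value;
  }

  set(key, value) {
    this.items.delete(key);
    this.items.set(key, value);
    if (this.items.size > this.capacity) {
      const oldest = this.items.keys().next().value;
      this.items.delete(oldest); // evict the least recently used entry
    }
  }
}

// Stand-in for an expensive parse step; counts how often it actually runs.
let parseCount = 0;
function parseMolecule(smiles) {
  parseCount++;
  return {smiles};
}

const cache = new LruCache(2);
function getMolecule(smiles) {
  let mol = cache.get(smiles);
  if (mol === undefined) {
    mol = parseMolecule(smiles);
    cache.set(smiles, mol);
  }
  return mol;
}

getMolecule('CCO');      // miss
getMolecule('c1ccccc1'); // miss
getMolecule('CCO');      // hit, 'CCO' becomes the most recent entry
getMolecule('CC');       // miss, evicts 'c1ccccc1'
getMolecule('CCO');      // hit
console.log(parseCount);
```

With WASM-backed objects, an eviction step would typically also need to release the evicted object explicitly, since such objects are not garbage-collected by the JS engine.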

One area of particular interest for our users was the performance of substructure search. In fact, even with “naive” graph-based search, where we match the given pattern molecule against every entry of a dictionary in a loop using get_substruct_match, the performance was reasonable on small datasets. Yet this approach is a poor fit for the most typical use case, where many successive pattern searches are applied to a single dataset that rarely changes.

That’s where RDKit’s substructLibrary comes in handy. It first builds a so-called library, against which we then run further pattern searches. Let’s look at some of our microbenchmarking results for this approach. We take a large dataset of 40,000+ molecules and search for 3 small patterns written in SMILES: ['c1ccccc1' (benzene), 'C1CCCCC1' (cyclohexane), 'CC' (ethane)]. We linearly increase the portion of the dataset being searched from 2,000 to 12,000 molecules and track the total time of these 3 searches, averaged over 10 independent runs. The results for the input file demo/chem/zbb/99_p3_4.5-6.csv are in the bottom picture, and those for demo/chem/zbb/99_p1_ph6-8.csv are in the top picture.

You can check these numbers at Datagrok: https://public.datagrok.ai/p/d9f38d00-2e30-11eb-eb2f-cf7849c78d19. We encourage you to try the microbenchmark yourself and get the figures on your own computer: https://public.datagrok.ai/js/samples/domains/chem/substructure-search-library.
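If you would like to time something similar outside the platform, a harness of roughly the following shape may help. It is only a sketch: the workload is a placeholder rather than a real substructure search, the step of 2,000 between dataset sizes is an assumption, and all names are hypothetical.

```javascript
// Use the high-resolution timer when available, otherwise fall back to Date.now().
const now = () => (typeof performance !== 'undefined' ? performance.now() : Date.now());

// Runs `task` several times and returns the mean wall-clock time in milliseconds.
function averageMs(task, runs = 10) {
  let total = 0;
  for (let i = 0; i < runs; i++) {
    const start = now();
    task();
    total += now() - start;
  }
  return total / runs;
}

// Placeholder workload. In a real benchmark this would run the three
// pattern searches over the first `size` molecules of the dataset.
function placeholderSearch(size) {
  let hits = 0;
  for (let i = 0; i < size; i++) {
    if (i % 7 === 0) hits++;
  }
  return hits;
}

// Dataset sizes from 2,000 to 12,000 (the step of 2,000 is an assumption).
const sizes = [2000, 4000, 6000, 8000, 10000, 12000];
const timings = sizes.map((size) => ({
  size,
  ms: averageMs(() => placeholderSearch(size), 10),
}));
console.log(timings);
```

Averaging over several runs smooths out JIT warm-up and scheduling noise, which matters for measurements in the millisecond range.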

It’s clear how much faster the substructLibrary-based search is compared to the graph-based one. A 10x speed-up is consistently observed on both 99_p3_4.5-6.csv and 99_p1_ph6-8.csv. We also see that the initialization time of the library depends linearly on the size of the dataset being indexed, and is roughly comparable to the time of a single graph-based substructure search.

These numbers are already great, yet there is plenty of room for improvement by going parallel and enabling library construction using pthreads via WASM. In fact, the infrastructure for that is already in place, and we are now working out the details of making these threads run efficiently on WebAssembly.

Some plans for the future versions of Chem:

  1. Offer a choice between OCL and RDKit-based rendering as a package option.
  2. Offer a choice between Coordgen-based and RDKit-native rendering as a package option.
  3. Add caching (a precomputed search structure) to similarity scoring, in a similar fashion to substructure search.
  4. Detach computations in all the cheminformatics methods from the main thread via a standalone JS web worker. That should improve the user experience, as the UI won’t be blocked while large pieces of data are processed, even when the processing takes a few seconds.
  5. Speed up preparation of the dictionary used for similarity scoring via parallel computation on JS web workers.
  6. Speed up substructure search library creation by activating the pthread-based parallelism already provided in substructLibrary, but in the WASM version that we are using right now.

Please let us know about other possible improvements to the cheminformatics functionality shipped in the browser, as well as whether you’d want to see more microbenchmarks from the Datagrok team. We are also going to implement an advanced benchmark based on the summary.

This is a follow-up on a feature currently in design, intended for dropping incompatible molecule coordinates.

In the current Chem version 0.8.18, we’ve implemented coordinate regeneration covering a variety of cases.

One may want to drop the given coordinates and just render the molecule from SMILES. But if a scaffold has coordinates incompatible with the RDKit coordinate system (drawn by hand, obtained from a non-RDKit utility, etc.), it first needs to be converted (via SMILES) to the “default coordinates” of RDKit; after that it can still serve as a scaffold to which other molecules are aligned.

This is what the regenerate-coords tag is for. If it is set on a column, each molecule that isn’t in SMILES is passed through cleaning: the existing coordinates are dropped, and clean RDKit coordinates are regenerated, rendered, and used for alignment if requested (with a scaffold-col tag). In addition, if regenerate-coords is set on a scaffold column, then the column being aligned to it is also set to regenerate-coords, so that its coordinates properly match RDKit’s. As a result, all alignments in all cases are visually correct.
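The propagation rule described above can be sketched in a few lines. This models columns as plain objects with a tag dictionary rather than using the platform’s column API, and it assumes the tag value is the string 'true'; the helper names are hypothetical.

```javascript
// Hypothetical stand-in for a platform column: a name plus a tag dictionary.
function makeColumn(name, tags = {}) {
  return {name, tags: {...tags}};
}

// If a column's scaffold column has regenerate-coords set, set it on the
// aligned column too, so that both end up in RDKit's coordinate system.
function propagateRegenerateCoords(columns) {
  const byName = new Map(columns.map((col) => [col.name, col]));
  for (const col of columns) {
    const scaffold = byName.get(col.tags['scaffold-col']);
    if (scaffold && scaffold.tags['regenerate-coords'] === 'true') {
      col.tags['regenerate-coords'] = 'true';
    }
  }
  return columns;
}

// Molecules already given as SMILES carry no coordinates, so they skip cleaning.
function needsCleaning(column, moleculeIsSmiles) {
  return column.tags['regenerate-coords'] === 'true' && !moleculeIsSmiles;
}

const scaffolds = makeColumn('scaffold', {'regenerate-coords': 'true'});
const molecules = makeColumn('molecule', {'scaffold-col': 'scaffold'});
const unrelated = makeColumn('other');
propagateRegenerateCoords([scaffolds, molecules, unrelated]);
console.log(molecules.tags);
```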

We’ve noticed, though, that when hand-made coordinates are given for a scaffold, chemists have a strong reason to keep molecules aligned in that original way. In that case, setting regenerate-coords on the scaffold column may be just a fallback rather than a permanent solution (short of fixing the actual coordinates by hand).

However, it isn’t clear how to simplify the scheme when the scaffold is given in a coordinate system incompatible with that of the molecule column being aligned:

  1. If we drop the coordinate data from the molecule after it is aligned, we’d lose the visual alignment

  2. If we render the molecule being aligned while dropping its coordinate data (using only SMILES), the alignment may be distorted in the same way as it was without dropping the coordinate data, as the source scaffold is still in the alien coordinate system

  3. If we render the scaffold column simply by dropping its coordinate data (using only SMILES), we could not align to it; that is why we recreate the “default RDKit coordinates” for it before it is used as an alignment target

Please share your thoughts @dpetrov.gnf.org, @asantrosyan.gnf.org Andrew @skalkin.

We’ve updated Chem to 0.9.0. This update includes:

  • Scaffold highlight with alignment when filtering by a substructure
  • Scaffold highlight with alignment when aligning to a scaffold via a column property panel “Chem”
  • For a given column, selecting a column with a source scaffold to align to, both visually via the column property panel “RDKit Settings” and programmatically through a scaffold-col tag
  • Optional highlighting of the scaffold specified in the scaffold column above
  • An option to regenerate the coordinates of a column, which comes in handy when the coordinate system of the column’s entries isn’t native to RDKit and thus hinders alignment. This option “forgets” the coordinates provided for the MolBlock molecule and regenerates them with RDKit. Available both visually via the column property panel “RDKit Settings” and programmatically through a regenerate-coords tag
  • In the package properties, an option to choose the molecule renderer between JS-RDKit and OpenChemLib (reload Datagrok for the new choice to take effect)
  • The recent JS-RDKit version 2021.03, with stability and rendering improvements

We’ve also improved the cache of molecule renders, which makes both horizontal and vertical scrolling 13-19 times faster and, therefore, visually smoother. Here is how the new timings compare with the previously published Chem Benchmark results:

Horizontal scrolling (20 random molecules, 100 times): 526 ms (earlier: 9679 ms)
Vertical scrolling (20 random molecules, 100 times): 1324 ms (earlier: 19562 ms)

Check out this short GIF showcasing the new features, and give them a try at Datagrok!

RDKit-based structure depiction is now completely integrated with the platform:

  • grid
  • tooltip
  • form
  • tile viewer
  • other viewers (bar chart, trellis plot, etc)

Recently introduced value comparators could of course be used for cheminformatics purposes as well. In the picture below, a trivial comparator based on the SMILES length is used to roughly order the molecules by complexity. All visualizations pick it up automatically.
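As a rough idea of what such a comparator does, here is a minimal standalone sketch. How a comparator is registered with the platform is not shown, and the function name is hypothetical.

```javascript
// Orders SMILES strings by length, used as a crude proxy for molecular complexity.
const bySmilesLength = (a, b) => a.length - b.length;

const smilesList = ['c1ccccc1', 'CC', 'CC(=O)Oc1ccccc1C(=O)O', 'C1CCCCC1'];
const ordered = [...smilesList].sort(bySmilesLength);
console.log(ordered);
```

SMILES length is only a rough heuristic: ring-closure digits, brackets, and branches all add characters without adding atoms.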

@nikolaus.stiefl.novartis.com @nico.pulver.novartis.com @dpetrov.gnf.org @asantrosyan.gnf.org @ptosco

The “Save as SDF” function is now exposed in the main “Save as” menu. At the moment it simply exports the first structure column, with all other columns written as properties, but we will add options as well:

  • choosing a structure column
  • choosing properties
  • choosing file format
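For reference, the export described above (one structure column plus the remaining columns as properties) maps onto the SDF layout roughly as in the sketch below. This is not the platform’s exporter: rows are modeled as plain objects, and the function names and sample data are hypothetical.

```javascript
// Builds one SDF record: the molblock, then one data item per property,
// then the "$$$$" record delimiter.
function toSdfRecord(molblock, properties) {
  const lines = [molblock.replace(/\s+$/, '')];
  for (const [name, value] of Object.entries(properties)) {
    lines.push(`>  <${name}>`, String(value), ''); // a blank line ends each data item
  }
  lines.push('$$$$');
  return lines.join('\n');
}

// Exports rows, taking `structureColumn` as the structure and all other keys as properties.
function tableToSdf(rows, structureColumn) {
  return rows
    .map((row) => {
      const {[structureColumn]: molblock, ...properties} = row;
      return toSdfRecord(molblock, properties);
    })
    .join('\n') + '\n';
}

// A single-atom (methane) V2000 molblock used as sample data.
const methane = [
  '',
  '  hypothetical-example',
  '',
  '  1  0  0  0  0  0  0  0  0  0999 V2000',
  '    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0',
  'M  END',
].join('\n');

const sdf = tableToSdf(
  [
    {structure: methane, name: 'methane', weight: 16.04},
    {structure: methane, name: 'methane-copy', weight: 16.04},
  ],
  'structure'
);
console.log(sdf);
```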
