Since performance is pivotal here, I think it makes sense to start with a set of benchmarks, so that we can set goals and track improvements. Let’s keep them very simple, yet representative of a chemist’s everyday work. Here are some ideas:
Rendering
Dataset: 1,000 random ChEMBL molecules
Goal: make rendering seamless, with no visible lag (currently, RDKit is slower than OpenChemLib in this area)
- Render 1,000 molecules (overall rendering performance)
- Render 20 molecules 100 times (horizontal scrolling)
- Render 20 molecules 100 times with a sliding window (vertical scrolling)
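The three rendering scenarios above could be driven by one small timing harness. This is only a sketch under a few assumptions: `render` is a hypothetical callable standing in for the real drawing routine, horizontal scrolling is modelled as re-rendering the same 20 molecules, and vertical scrolling as a window that advances one row per step.

```python
import time
from typing import Callable, Iterable, Iterator, List, Sequence


def fixed_window(items: Sequence[str], size: int, repeats: int) -> Iterator[List[str]]:
    """Yield the same first `size` items `repeats` times (assumed horizontal scroll)."""
    view = list(items[:size])
    for _ in range(repeats):
        yield view


def sliding_window(items: Sequence[str], size: int, steps: int) -> Iterator[List[str]]:
    """Yield `steps` windows of `size` items, advancing one row per step
    (assumed vertical scroll)."""
    last_start = max(len(items) - size, 0)
    for step in range(steps):
        start = min(step, last_start)  # clamp once the end of the data is reached
        yield list(items[start:start + size])


def time_rendering(render: Callable[[str], object],
                   windows: Iterable[List[str]]) -> float:
    """Return total seconds spent calling `render` on every molecule of every window."""
    started = time.perf_counter()
    for window in windows:
        for molecule in window:
            render(molecule)
    return time.perf_counter() - started


if __name__ == "__main__":
    # Placeholder data and renderer: swap in the ChEMBL sample and the real
    # drawing call. `len` is used here only so that the script runs.
    molecules = [f"mol-{i}" for i in range(1000)]
    render = len
    print("render all 1,000:", time_rendering(render, [molecules]))
    print("horizontal scroll:", time_rendering(render, fixed_window(molecules, 20, 100)))
    print("vertical scroll:", time_rendering(render, sliding_window(molecules, 20, 100)))
```

Keeping the window generators separate from the timer means the same harness can be reused for any renderer we want to compare.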
Substructure Search
Dataset: 100,000 random ChEMBL molecules
Goal: make substructure search in 1 million molecules an interactive experience
Let’s start with plain substructures and ignore complex SMARTS for the moment. For each test, we will run the search twice: the first run may include computing the fingerprints (which we could do speculatively in the background), while the second run reuses them.
- Search for benzene ring
- Search for aspirin
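To make the "two searches" idea concrete, here is a rough sketch of a search that builds its fingerprint cache on the first run and reuses it on the second, plus a helper that times both runs. Everything named here is hypothetical: `CachedScreen`, `time_twice`, and the toy fingerprint and matcher are stand-ins, and the toy matcher is plain string containment, not real substructure matching. The only property the sketch relies on is the usual screening rule: a molecule can match only if every bit of the query fingerprint is also set in its own fingerprint.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple


class CachedScreen:
    """Search with a lazily built, reusable fingerprint cache."""

    def __init__(self, molecules: Sequence[str],
                 fingerprint: Callable[[str], int],
                 is_match: Callable[[str, str], bool]) -> None:
        self.molecules = molecules
        self.fingerprint = fingerprint  # hypothetical toolkit call
        self.is_match = is_match        # hypothetical exact matcher
        self._fps: Dict[int, int] = {}

    def search(self, query: str) -> List[str]:
        query_fp = self.fingerprint(query)
        hits = []
        for i, molecule in enumerate(self.molecules):
            if i not in self._fps:
                # This cost is paid only on the first search.
                self._fps[i] = self.fingerprint(molecule)
            # Cheap screen first, expensive exact match only for survivors.
            if self._fps[i] & query_fp == query_fp and self.is_match(molecule, query):
                hits.append(molecule)
        return hits


def time_twice(search: Callable[[str], object], query: str) -> Tuple[float, float]:
    """Run the same search twice and return (cold_seconds, warm_seconds)."""
    timings = []
    for _ in range(2):
        started = time.perf_counter()
        search(query)
        timings.append(time.perf_counter() - started)
    return timings[0], timings[1]


def toy_fingerprint(text: str) -> int:
    """Toy stand-in: one bit per distinct character (NOT a chemical fingerprint)."""
    bits = 0
    for ch in set(text):
        bits |= 1 << (ord(ch) % 64)
    return bits


def toy_is_match(molecule: str, query: str) -> bool:
    """Toy stand-in: string containment (NOT real substructure matching)."""
    return query in molecule


if __name__ == "__main__":
    screen = CachedScreen(["c1ccccc1O", "CCO", "c1ccccc1"], toy_fingerprint, toy_is_match)
    cold, warm = time_twice(screen.search, "c1ccccc1")
    print(f"cold: {cold:.6f}s, warm: {warm:.6f}s")
```

Reporting the cold and warm timings separately should tell us how much of the first search is fingerprint computation that could be moved to the background.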
Similarity Search
Dataset: 100,000 random ChEMBL molecules
Goal: make similarity search in 1 million molecules an interactive experience
- Find the 50 most similar molecules for each of 10 random query molecules
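As a reference point for the similarity test, a top-50 query can be sketched as below. Note the assumptions: the similarity metric is not specified above, so this uses Tanimoto similarity, and fingerprints are represented as plain Python integers used as bit vectors. The function names are made up for illustration.

```python
import heapq
from typing import List, Sequence, Tuple


def popcount(bits: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two bit vectors: |a AND b| / |a OR b|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def top_k_similar(query_fp: int, library_fps: Sequence[int],
                  k: int = 50) -> List[Tuple[float, int]]:
    """Return up to `k` (similarity, library index) pairs, most similar first."""
    scored = ((tanimoto(query_fp, fp), idx) for idx, fp in enumerate(library_fps))
    # nlargest keeps only k candidates in memory instead of sorting everything.
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    library = [0b1111, 0b1100, 0b0011, 0b1110]  # tiny made-up fingerprints
    for score, idx in top_k_similar(0b1111, library, k=2):
        print(idx, score)
```

The benchmark would call `top_k_similar` once per query molecule (10 times in total) and report the overall time.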
Miscellaneous
- R-group analysis (we need a good dataset for that)
- Computing Lipinski properties for 100,000 molecules
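For the Lipinski item, I am assuming the usual four rule-of-five descriptors: molecular weight, logP, hydrogen-bond donors, and hydrogen-bond acceptors. The sketch below only counts violations from values that are already computed; the descriptor calculation itself, which is the part the benchmark would time, is left to the toolkit. `LipinskiProps` and `rule_of_five_violations` are illustrative names, and the example values are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LipinskiProps:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float   # daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors


def rule_of_five_violations(p: LipinskiProps) -> int:
    """Count how many of the four rule-of-five limits are exceeded."""
    return sum([
        p.mol_weight > 500,
        p.log_p > 5,
        p.h_donors > 5,
        p.h_acceptors > 10,
    ])


if __name__ == "__main__":
    # Approximate, illustrative values for aspirin.
    aspirin = LipinskiProps(mol_weight=180.16, log_p=1.2, h_donors=1, h_acceptors=4)
    print(rule_of_five_violations(aspirin))
```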