Dimensionality reduction

This topic covers updates to Datagrok's dimensionality reduction capabilities, including chemical space, sequence space, and a new method that supports multiple columns of different types at once.

Chemical space and Sequence space now support modifying the parameters of encoding functions and clustering embeddings.

You can now choose which molecular fingerprints are used for chemical space analysis, and cluster the embeddings after they are calculated. The fingerprint option is available through the gear icon (:gear:) to the left of the encoding function input and offers a choice of Morgan, Pattern, and RDKit. At the bottom of the dialog, the Cluster embeddings checkbox lets you cluster entries by their positions in the embedding using the DBSCAN algorithm. The clustering parameters are also available through the gear icon. If the Cluster embeddings checkbox is enabled, a new cluster column is generated and the chemical space scatterplot is colored by this column.
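
To make the clustering step more concrete, here is a minimal, self-contained sketch of DBSCAN applied to 2D embedding coordinates. It is an illustration only, not Datagrok's implementation: the names `dbscan`, `epsilon`, and `minPts` are assumptions chosen for this example, and the parameter names and defaults shown in the dialog may differ.

```typescript
// Illustrative sketch only; not the code used by Datagrok.
type Point = [number, number];

// Euclidean distance between two 2D embedding points.
function euclidean(a: Point, b: Point): number {
  return Math.hypot(a[0] - b[0], a[1] - b[1]);
}

// Returns one label per point: 0 means noise, 1..k are cluster ids.
// epsilon: neighborhood radius; minPts: minimum neighbors (including the
// point itself) for a point to count as a core point. Both names are
// hypothetical and chosen for this example.
function dbscan(points: Point[], epsilon: number, minPts: number): number[] {
  const UNVISITED = -1;
  const NOISE = 0;
  const labels: number[] = new Array(points.length).fill(UNVISITED);

  // Brute-force neighbor search, O(n) per query.
  const neighbors = (i: number): number[] => {
    const result: number[] = [];
    for (let j = 0; j < points.length; j++)
      if (euclidean(points[i], points[j]) <= epsilon) result.push(j);
    return result;
  };

  let cluster = 0;
  for (let i = 0; i < points.length; i++) {
    if (labels[i] !== UNVISITED) continue;
    const seeds = neighbors(i);
    if (seeds.length < minPts) {
      labels[i] = NOISE; // may later be claimed as a border point
      continue;
    }
    cluster++;
    labels[i] = cluster;
    const queue = seeds.filter((j) => j !== i);
    while (queue.length > 0) {
      const j = queue.shift()!;
      if (labels[j] === NOISE) labels[j] = cluster; // border point
      if (labels[j] !== UNVISITED) continue;
      labels[j] = cluster;
      const reachable = neighbors(j);
      if (reachable.length >= minPts) queue.push(...reachable); // expand from core points only
    }
  }
  return labels;
}

const embedding: Point[] = [
  [0, 0], [0, 1], [1, 0],
  [10, 10], [10, 11], [11, 10],
  [50, 50],
];
const clusterLabels = dbscan(embedding, 1.5, 3);
```

In this sketch the resulting labels play the role of the generated cluster column: points that share a label would get the same color on the scatterplot, and points labeled 0 are treated as noise.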


Similarly, the Bio package uses the same approach for Sequence space. There, you can choose which type of fingerprints is used to generate the monomer substitution matrix, and set additional parameters for the Needleman-Wunsch distance function (gap open penalty and gap extension penalty).
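
To show how the two gap parameters interact, here is a minimal sketch of Needleman-Wunsch global alignment scoring with affine gaps (the Gotoh formulation), where a gap of length k costs gapOpen + (k - 1) * gapExtend. It is not the Bio package's code. The function name `alignmentScore` and the match/mismatch scorer `simpleSubstitution` are assumptions for this example; the latter stands in for the fingerprint-derived monomer substitution matrix. Converting the score into a distance is also left out.

```typescript
// Illustrative sketch only; not the Bio package implementation.
type Substitution = (x: string, y: string) => number;

// Hypothetical placeholder for a monomer substitution matrix.
const simpleSubstitution: Substitution = (x, y) => (x === y ? 1 : -1);

// Global alignment score with affine gap penalties.
// a, b: sequences as arrays of monomers.
function alignmentScore(
  a: string[], b: string[], gapOpen: number, gapExtend: number,
  substitution: Substitution = simpleSubstitution): number {
  const n = a.length;
  const m = b.length;
  const makeTable = (): number[][] =>
    Array.from({length: n + 1}, () => new Array<number>(m + 1).fill(-Infinity));
  const M = makeTable(); // best score ending with a[i-1] aligned to b[j-1]
  const X = makeTable(); // best score ending with a[i-1] aligned to a gap
  const Y = makeTable(); // best score ending with b[j-1] aligned to a gap
  M[0][0] = 0;
  for (let i = 0; i <= n; i++) {
    for (let j = 0; j <= m; j++) {
      if (i > 0 && j > 0) {
        M[i][j] = substitution(a[i - 1], b[j - 1]) +
          Math.max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]);
      }
      if (i > 0) {
        // Open a new gap or extend the one already open in the same direction.
        X[i][j] = Math.max(
          M[i - 1][j] - gapOpen, X[i - 1][j] - gapExtend, Y[i - 1][j] - gapOpen);
      }
      if (j > 0) {
        Y[i][j] = Math.max(
          M[i][j - 1] - gapOpen, Y[i][j - 1] - gapExtend, X[i][j - 1] - gapOpen);
      }
    }
  }
  return Math.max(M[n][m], X[n][m], Y[n][m]);
}

const identical = alignmentScore('ACGT'.split(''), 'ACGT'.split(''), 2, 1);
const withGap = alignmentScore('ACGT'.split(''), 'AT'.split(''), 2, 1);
```

With an open penalty of 2 and an extension penalty of 1, a single gap of length two costs 3, while two separate gaps of length one would cost 4. A higher gap open penalty relative to the extension penalty therefore favors fewer, longer gaps.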


Under the hood

In contrast to the previous approach, where sequence space and chemical space used separate methods, both now use a single improved, unified method that supports columns of different types and semantic types. Based on the type and semantic type of a column, the dimensionality reduction dialog searches for the encoding functions and distance functions that support it and offers them as options. In other words, instead of implementing all of chemical space inside the Chem package, the package now only needs to provide an encoding function with special tags. For Chem, this encoding function converts molecules to fingerprints, and the supported distance functions are those for bit arrays.

Encoding functions can also be parametrized, so developers can add as many options as they like. For Chem, for example, this could be the type of fingerprints used for conversion (Morgan, Pattern, RDKit…), and these additional parameter inputs are automatically included in the dialog. The default encoding function for fingerprints looks like this:

//name: Fingerprints
//tags: dim-red-preprocessing-function
//meta.supportedSemTypes: Molecule
//meta.supportedDistanceFunctions: Tanimoto,Asymmetric,Cosine,Sokal
//input: column col {semType: Molecule}
//input: string _metric {optional: true}
//input: string fingerprintType = Morgan {caption: Fingerprint type; optional: true; choices: ['Morgan', 'RDKit', 'Pattern']}
//output: object result
export async function getFingerprints(
  col: DG.Column, _metric: string, fingerprintType: Fingerprint = Fingerprint.Morgan) {
  const fpColumn = await chemSearches.chemGetFingerprints(col, fingerprintType, false);
  malformedDataWarning(fpColumn, col);
  return {entries: fpColumn, options: {}};