Virtual columns

Datagrok’s data engine enables interactive analysis of big datasets right in the browser. To achieve that, we apply the kind of optimizations you would expect to see in a well-tuned, performance-critical C++ application rather than in a web app: manual memory management with a column-based layout, adaptive bit storage, custom serialization codecs, and many others. This approach works great and lets us handle datasets consisting of tens of millions of rows, or of millions of columns. However, columns are restricted to the pre-defined scalar types, such as int, double, string, bool, or DateTime.

Due to the complexity of business needs, sometimes having a real JavaScript object in a table cell simplifies things a lot. Previously, it was possible to do that by creating a column of type “Object” and populating it manually. While this approach works, it uses a lot more memory and has other drawbacks.

To get the best of both worlds, we have decided to introduce a special “virtual” column type. When creating a virtual column, you would pass a function that takes an index and returns an object, and the platform would call this function whenever a table cell is accessed (for instance, when rendering a grid). This way, the data still resides in the most efficient way in the compressed columnar format, but at the same time application developers have an option to access it using custom, domain-specific objects that get constructed on the fly.
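As a rough illustration of the idea, here is a minimal, self-contained sketch in plain JavaScript. It does not use the Datagrok API: the `VirtualColumn` and `Car` classes and the sample data are hypothetical stand-ins, and the typed arrays only approximate the platform's columnar storage. The point is that the virtual column holds no values of its own, only a function that builds an object from the primitive columns when a row is accessed.

```javascript
// Hypothetical sketch: columnar storage plus a virtual column built on top of it.
// None of these names come from the Datagrok API.
const makes = ['Audi', 'BMW', 'Volvo'];                      // string column
const years = Int32Array.from([2015, 2018, 2020]);           // compact typed storage
const prices = Float64Array.from([21000, 35500.5, 42000]);   // compact typed storage

// Domain-specific object that is constructed on the fly.
class Car {
  constructor(make, year, price) {
    this.make = make;
    this.year = year;
    this.price = price;
  }
  describe() {
    return `${this.year} ${this.make}`;
  }
}

// A virtual column stores no values, only the function that builds them.
class VirtualColumn {
  constructor(name, length, getValue) {
    this.name = name;
    this.length = length;
    this.getValue = getValue;
  }
  // Called whenever a cell is accessed, for example while rendering a grid.
  get(i) {
    if (!Number.isInteger(i) || i < 0 || i >= this.length)
      throw new RangeError(`Row index out of range: ${i}`);
    return this.getValue(i);
  }
}

const cars = new VirtualColumn('car', makes.length,
  (i) => new Car(makes[i], years[i], prices[i]));

console.log(cars.get(1).describe()); // prints "2018 BMW"
```

In this sketch every access constructs a fresh object, so memory use stays close to that of the primitive columns, at the cost of some allocation per cell access.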

The screenshot below demonstrates the concept; the functionality is not yet released, and your feedback is appreciated.


Could you please add a reference to the virtual column as an additional parameter to the function’s signature, so that the signature would look like (i, col) => new Car()?

In our application, the association between the JS object and the columns of primitives is determined on the fly in a loop. That’s why we cannot hardcode columns in the Car’s constructor as in the example above. The parameters of the association will be stored in the virtual column (as a custom semantic type object), and on each call the function will read these parameters from there.

We’ll consider changing the API to make it more expressive, but in any case what you want to achieve can be done with one additional line, by declaring the column variable and using it in the closure like this:
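The closure approach can be sketched as follows. To keep the snippet runnable outside the platform, it uses a small stand-in class, `ColumnListStub`, which is hypothetical and not part of the Datagrok API; only the method name `addNewVirtual` and the callback taking a row index come from this thread. The `params` property used to hold the association parameters is also an assumption made for the example.

```javascript
// Minimal stand-in for a column list. NOT the Datagrok API, only enough to run the sketch.
class ColumnListStub {
  constructor() {
    this.byName = new Map();
  }
  addNewVirtual(name, getValue) {
    // 'params' is a hypothetical place to keep the association parameters.
    const col = { name, params: {}, get: (i) => getValue(i) };
    this.byName.set(name, col);
    return col;
  }
}

const columns = new ColumnListStub();

// The one additional line: declare the variable so the callback can capture it.
let col;
col = columns.addNewVirtual('car', (i) => {
  // The callback runs later, on cell access, so 'col' is already assigned here.
  return { row: i, source: col.params.source };
});

// Association parameters are decided on the fly and stored on the column.
col.params.source = 'make';

console.log(col.get(3)); // the object reflects the parameters stored on the column
```

Because the parameters are read inside the callback on every call, changing them on the column later changes what subsequent calls return, without re-creating the column.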

There is a recently introduced bug in the addNewVirtual method.

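let dframe = grok.data.demo.randomWalk(100, 100);
try {
  // Add a virtual column that returns 3 for every row
  dframe.columns.addNewVirtual('Virt Column', function (nRow) { return 3; });
} catch (e) {
  // e.message holds the error raised by addNewVirtual
  let err = e.message;
}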

Thanks for reporting, and apologies for breaking it. This is now fixed; check out our dev server in 30 minutes. We are also adding virtual columns to our unit test suite to prevent similar bugs in the future.