I was looking at the useful “Categorize…” feature, which converts a string column into a categorical column based on given patterns. There is also a very useful popup on the column header showing the existing categories with their frequencies.
Is there a feature which allows converting a string column into a categorical column with automatic categories enumeration, say, in order of decreasing frequency, or something similar?
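To make "enumeration in order of decreasing frequency" concrete, here is a minimal sketch in plain Python. It is not a platform feature or API; the function name `encode_by_frequency` and the sample data are made up for illustration, and breaking ties by first appearance is my own assumption.

```python
from collections import Counter


def encode_by_frequency(values):
    """Give each distinct string an integer rank: 0 for the most frequent.

    Ties are broken by first appearance (an assumption for this sketch).
    Returns the encoded list and the category -> code mapping.
    """
    counts = Counter(values)
    # Remember the index at which each category was first seen.
    first_seen = {}
    for index, value in enumerate(values):
        first_seen.setdefault(value, index)
    # Sort categories by descending count, then by first appearance.
    ranked = sorted(counts, key=lambda v: (-counts[v], first_seen[v]))
    codes = {value: rank for rank, value in enumerate(ranked)}
    return [codes[value] for value in values], codes


# Hypothetical sample column.
freq_column = ["b", "a", "a", "c", "a", "b"]
freq_encoded, freq_mapping = encode_by_frequency(freq_column)
print(freq_encoded)  # [1, 0, 0, 2, 0, 1]
print(freq_mapping)  # {'a': 0, 'b': 1, 'c': 2}
```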
I’m not exactly sure what the question is. What do you mean by “converting a string column into a categorical column with automatic categories enumeration”? There are several aspects to that:
API: Internally, all string columns are represented as categorical columns. This is an extremely convenient and efficient representation that is ideal for the purposes of interactive analytics and exploratory data analysis.
Exploration: there are several methods for exploring string columns, including tooltips, filters, stacked bar charts (with a custom sorting function, so it is possible to enumerate categories by frequency in descending order, like you mentioned), trellis plots, property panels, and others.
Manipulation: there are multiple ways for working with string values, including categorization, batch edits, etc.
Technically, I am converting a string column with categories into a numeric column.
In the new column, numbers are assigned to categories in order of their appearance. This is useful in many cases: for instance, to anonymize data, or to feed datasets into machine learning algorithms that do not work with string categorical columns.
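For clarity, the conversion I have in mind can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the platform's API; the function name `encode_by_appearance` and the sample data are hypothetical.

```python
def encode_by_appearance(values):
    """Map each distinct string to an integer in order of first appearance.

    Returns the encoded list and the category -> code mapping.
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            # The next unused integer becomes the code of a new category.
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical sample column.
column = ["red", "green", "red", "blue", "green", "red"]
encoded, mapping = encode_by_appearance(column)
print(encoded)  # [0, 1, 0, 2, 1, 0]
print(mapping)  # {'red': 0, 'green': 1, 'blue': 2}
```

It is a single pass over the column with one dictionary lookup per value, so it stays fast well beyond 10’000 elements.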
This GIF also shows an approach for Excel. In fact, it only works well in Excel on small datasets and takes considerable time on reasonably sized datasets of 10’000+ elements.
I know that we have .categories map on the dataframe’s column, but this isn’t quite what I was looking for. I thought it’d be useful to have this data augmentation mechanism out of the box in the UI.
Maybe there’s a formula I could use in Add New Column that I’m not aware of? In any case, I wasn’t going to use the Excel one from the GIF, as it’s more of a peculiar fact than a daily tool.