Handling decimal notations

August 20, 2016

The most common way to display large and decimal numbers looks like this: 1,234,567.89

However, many other formats exist. Depending on the country, the same number can be written with different thousands and decimal separators.

Because DSS has to make many different systems work together, and those systems do not agree on number formats, DSS only treats "computer-notation" numbers as decimals out of the box.

Thus, both for the float and double storage types and for the Decimal meaning, DSS will only accept the following kinds of notation:

  • 1234567.89
  • 1.23456789E6
  • -1234.33
  • ...
You might want to re-read our documentation about storage types and meanings.

While DSS could recognize more forms, other systems like Hive would not, and that would cause various inconsistencies.

Thus, for example, 1,234,567.89 will be recognized as a String by DSS, not a number.
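To make the rule above concrete, here is a minimal Python sketch of a check for "computer-notation" numbers. It is not the parser DSS actually uses: the pattern and the function name looks_like_computer_notation are assumptions made for illustration, based only on the accepted notations listed above.

```python
import re

# Hypothetical pattern (an assumption, not the DSS implementation):
# optional sign, digits with an optional fractional part, optional exponent.
COMPUTER_NOTATION = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")


def looks_like_computer_notation(text: str) -> bool:
    """Return True if the whole string is a plain or scientific decimal."""
    return COMPUTER_NOTATION.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for sample in ["1234567.89", "1.23456789E6", "-1234.33", "1,234,567.89"]:
        print(sample, looks_like_computer_notation(sample))
```

With this sketch, the first three samples are accepted, while 1,234,567.89 is rejected because of its thousands separators, which mirrors why DSS treats it as a String.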

Normalizing in a preparation script

The best way to handle datasets containing this kind of notation is to use a preparation script (either in a visual analysis or in a recipe). Visual data preparation includes a Convert number formats processor, which translates between various numerical representations.

Here is an example with a dataset containing one "US-formatted" column and a "French-formatted" one.

DSS does not recognize either column as a valid "Decimal". It does, however, detect the French format as "Decimal (comma)", so for that column DSS automatically suggests the conversion.

For the first column, we need to create the processor manually. Open the processors library and search for the Convert number formats processor.

Select the input column and the output column. The input format is "English", and the output format is "Raw" (meaning "raw decimals").

We have now converted both columns. In the output dataset, they can be treated as decimals and processed as such by all compute engines that DSS supports.
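For readers who want to see what this kind of normalization amounts to, here is a minimal Python sketch of the same idea outside the visual processor. It does not reproduce how the Convert number formats processor is implemented: the helper names (to_raw, english_to_raw, french_to_raw) are hypothetical, and the sketch assumes that the French format uses spaces (regular or non-breaking) as thousands separators and a comma as the decimal separator.

```python
def to_raw(text: str, group_seps: str, decimal_sep: str) -> str:
    """Strip thousands separators and use '.' as the decimal separator.

    group_seps is a string whose characters are each treated as a
    thousands separator; decimal_sep is the input decimal separator.
    """
    cleaned = text.strip()
    for sep in group_seps:
        cleaned = cleaned.replace(sep, "")
    return cleaned.replace(decimal_sep, ".")


def english_to_raw(text: str) -> str:
    # "English" (US) format: comma for thousands, dot for decimals.
    return to_raw(text, ",", ".")


def french_to_raw(text: str) -> str:
    # Assumed French format: space, no-break space, or narrow no-break
    # space for thousands, comma for decimals.
    return to_raw(text, " \u00a0\u202f", ",")


if __name__ == "__main__":
    print(english_to_raw("1,234,567.89"))
    print(french_to_raw("1 234 567,89"))
    print(float(french_to_raw("1 234 567,89")))
```

Both helpers return a string in "raw" computer notation, which float() can then parse. The sketch performs no validation, so malformed input is passed through rather than rejected.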