KNIME’s Binning Nodes
Exploring KNIME’s Auto-Binner and Numeric Binner Nodes
Watch The Video! https://youtu.be/nNmA3a8FdXM
Today’s we are exploring KNIME’s binning nodes — specifically the Auto-Binner and Numeric Binner nodes. These nodes allow us to quickly and easily bucket or categorize our data to summarize it, prepare it for further analysis or visualize. Figure 1.
Download the follow-along workflow here: https://kni.me/w/wKkbivQvvaXupilL.
KNIME Analytics Platform is free for Windows, macOS and Linux. Download to get started: https://www.knime.com/downloads.
Binning is a convenient way to bucket, or categorize our data to reduce the number of possible values. It’s a way for us to take continuous data and convert it to discrete forms — often into intervals. Figure 2.
Remember that discrete data ( Figure 3) is an isolated finite value, something that is counted, like the number of purchases made or number of staff in a department. Whereas continuous data can fluctuate in value and change, examples being heights, weights or as we will see in our example, salaries.
Binning can be helpful when we are visually exploring our data in the form of a histogram or if we want to summarize it to understand it better. The relative success of our binning efforts can depend in part on our domain knowledge. Figure 4.
In some cases, binning can be a process of trial and error as we work to best categorize our data.
Binning also includes some specific notation that we will cover, so when you hear or read “left side closed” or “right side open” it will hopefully make more sense. Figure 5,6.
Finally, binning serves a number of important roles. It assists in providing protection against outliers (through the collection into a ranged bucket), provides support to sequester missing values in a dataset and can be key when preparing data for data modeling.
The Dataset
The dataset we will be exploring is Employee Salaries data from the City of Virginia Beach Website. Figure 7. Its 6,954 rows by 8 columns and was last updated March 12th of 2019. It includes columns that provide salary and hourly rate, position title, department, hire date and more. The dataset is included in the companion workflow which can be downloaded for free from the KNIME Hub.
As you may know, data is not always organized how we need it when we get it, so in this yellow component (Label, Split and Domain Re-Calc) we label, transform and split it into hourly and salaried rates of pay. Figures 8,9.
Domain Re-Calculation
Splitting data into hourly and salaried rates of pay means we need to recalculate the table domain — the minimum and maximum values identified by the table spec for our column. The Domain Calculator node will update our table specification with the correct minimum and maximum values which you can see on the Spec tab in the lower and upper bound columns next to Salary. If we don’t do this the Auto-Binner node will not behave as expected. Figure 10.
Auto-Binner Configuration
Let’s start with the Auto-Binner node on the hourly rate data. When we open the Auto-Binner node configuration the node needs a numeric data type to go into the side. If our dataset lacked numeric data, this Exclude and Include twin-list would be empty. We have Wildcard and Regex selection options here as well. Figure 11.
Equal Width Binning
The first big decision here is how exactly we want to bin our data.
Our first choice is Fixed number of bins. Within Fixed number of bins we set the number of bins as an integer value.
Next, we select what equal method we want to use.
The first dropdown choice is Equal width. This means that the value range in the bins themselves will be equal based on the number of bins we select. So maybe each bin is 10. 0–10, 11–21 and so on and so forth. This can be helpful when we are trying to visualize the data whether it’s salary data (as in this example) age, miles per gallon etc. It’s easier to understand the breakdown of the data, but our number of values per bin may vary. Figure 12.
According to the node documentation this equal width is over the domain range — remember the range in the table specification for our filtered table?
Equal Frequency Binning
The next choice is Equal frequency, where the bins have the same if not close to the same number of values per bin so we don’t see big differences in bins like with equal width. This kind of binning can be helpful in certain feature engineering tasks and modeling/algorithm use. Figure 13.
Sample Quantile Binning
We can also bin by sample quantiles and are free to adjust these quantiles as we see fit. According to the node documentation ( Figure 14) the smallest element corresponds to a probability of zero and the largest to a probability of 1. Figure 15.
Ideally, there will be an equal number of values per bin for both quantile and equal frequency binning but be aware of the that tied values at boundaries can potentially increase the size (or value count) of the bins.
Next we can select our bin naming. We have the choice between numbered where we get a Bin 1, Bin 2 etc. label. Borders, which give us the left and right open and closed notation and finally we can get midpoint data for each bin.
For this example, let’s check the Borders option so I can cover the notation which is different here than in the Numeric Binner node.
We can force integer bounds which means that the lower value (the one on the left side, gets converted to its floor and the upper value (the one on the right side) gets converted to its ceiling. If we select this, duplicate edges will be removed. Figure 16.
If we see fit, we can choose to replace our target column(s) on output. If we selected this, the Salary column from our dataset would get replaced with the bins.
The number format settings tab ( Figure 17) allows us to adjust the number formatting, from how its output (standard string to engineering string) to precision, precision mode and rounding mode. Don’t forget this tab is here it could come in handy!
Let’s just leave it at default formatting, click OK and execute the node or press F7. Right click and look at the Binned data output. We still have our table of data, but now we have a new column called Salary [Binned] which includes our bins. Figure 18.
You’ll notice since we selected the Borders option for our bin naming, we see values in each row.
Since we forced integer bounds, we have whole number values here and the notation can be read as follows. Figure 19.
The parenthesis means that its open and the bracket means closed. A value coming after the parenthesis is not included in the listed range, whereas a value coming before the end bracket means it’s included.
With a GroupBy node, we can really explore the data by summarizing the bins and counting the number of salaries per bin, departments, position titles and more. I highly encourage you to explore!
From the Auto-Binner where we selected equal width, the bin that ranges from greater than 11 dollars per hour (it’s left-hand side open) to 21 dollars per hour (it’s right-hand side closed) has the most recorded salaries based on our dataset and binning choices. Figure 20.
As I mentioned earlier and as you can see here, the missing data gets its own bin. You can probably understand where domain knowledge is a necessity when binning values and can greatly assist in successfully summarizing and drawing insights from the data.
The Auto-Binner also includes a PMML or Predictive Model Markup Language processing fragment outport, to connect these configurations to PMML models.
Included in this workflow are many other examples for you to step through and the same examples using the salaried data are available as well.
For these examples Force Integer Bounds has been selected to clean up the outputs a bit.
Further Exploration
We can explore all these different Auto-Binning options visually by configuring Histogram nodes the same way we configured the Auto-Binner . As you can see (Figure 21), I’ve configured all of them so you can explore the different outputs and visually understand how the bins changes based on the configurations.
Numeric Binner Node
Next, let’s dive into the Numeric Binner. This node allows us to define our own values and give them custom names or use generated names. Just as with the Auto-Binner node we can choose to overwrite the target column or Append a new column.
A key thing to recall here is that the notation is different. Rather than a parenthesis and a bracket, the notation here is just brackets. Figure 22.
I’ve included a simple test dataset to practice this on, something I highly recommend. Figure 23.
We start by adding our target column. The first bin basically covers from negative infinity through infinity. We need to add another bin to open the ranges and get things started. Now you can see that the first bin automatically updates to cover from negative infinity through 0. Figure 24.
As we add range limits and then new columns, the range minimums will update. We still must be mindful about what bins our values are going to reside in. Pay careful attention to the values and remember the number types of your dataset, integer vs. double for instance.
The nice thing here is that rather than need to write a long else-if statement to create categories, we can do it here and adjust on the fly.
For this example, we want 4 bins each with 5 values. We can just click Add for each bin and enter the range we want the bin to include. Then we add a descriptive name and move to the next.
Think about how we might want the final bins defined. In this example, we ensure that 16–20 is in its own bin and then create a 21+ bin. The nice thing is that we can add additional bins here to account for future data that may not be in the current dataset. Figure 25.
Let’s select Append New Column and execute the node. Take a look at the output. We can clearly see how the values were binned based on our defined ranges in the node configuration. Figure 26.
Flow Variable Control
As with the vast majority of KNIME nodes, the Auto-Binner and Numeric Binner have several controllable flow variables shown here. The most practical use of flow variables here would be with the Auto-Binner node and in the form of a component allowing someone to set the number of bins and equality method. Figures 26,27.
That’s it for the Auto-Binner and Numeric Binner nodes. These nodes can make working with our data much easier by allowing us to categorize it in ways that make sense to enable clearer reporting or targeted analysis.
Thanks for reading! If you have questions please post them on the YouTube video for this article, or email me at info@knime.tips!
Originally published at https://knime.tips.