KNIME’s Binning Nodes

John Denham
10 min readNov 22, 2021

Exploring KNIME’s Auto-Binner and Numeric Binner Nodes

Watch The Video! https://youtu.be/nNmA3a8FdXM

Today’s we are exploring KNIME’s binning nodes — specifically the Auto-Binner and Numeric Binner nodes. These nodes allow us to quickly and easily bucket or categorize our data to summarize it, prepare it for further analysis or visualize. Figure 1.

Download the follow-along workflow here: https://kni.me/w/wKkbivQvvaXupilL.

KNIME Analytics Platform is free for Windows, macOS and Linux. Download to get started: https://www.knime.com/downloads.

Figure 1: KNIME’s Binning Nodes

Binning is a convenient way to bucket, or categorize our data to reduce the number of possible values. It’s a way for us to take continuous data and convert it to discrete forms — often into intervals. Figure 2.

Figure 2: Putting Data Into Buckets

Remember that discrete data ( Figure 3) is an isolated finite value, something that is counted, like the number of purchases made or number of staff in a department. Whereas continuous data can fluctuate in value and change, examples being heights, weights or as we will see in our example, salaries.

Figure 3: Discrete vs Continuous Data

Binning can be helpful when we are visually exploring our data in the form of a histogram or if we want to summarize it to understand it better. The relative success of our binning efforts can depend in part on our domain knowledge. Figure 4.

Figure 4: Domain Knowledge Defined

In some cases, binning can be a process of trial and error as we work to best categorize our data.

Binning also includes some specific notation that we will cover, so when you hear or read “left side closed” or “right side open” it will hopefully make more sense. Figure 5,6.

Figure 5: Output Notation Defined
Figure 6: Output Notation Defined 2

Finally, binning serves a number of important roles. It assists in providing protection against outliers (through the collection into a ranged bucket), provides support to sequester missing values in a dataset and can be key when preparing data for data modeling.

The Dataset

The dataset we will be exploring is Employee Salaries data from the City of Virginia Beach Website. Figure 7. Its 6,954 rows by 8 columns and was last updated March 12th of 2019. It includes columns that provide salary and hourly rate, position title, department, hire date and more. The dataset is included in the companion workflow which can be downloaded for free from the KNIME Hub.

Figure 7: The Source Website

As you may know, data is not always organized how we need it when we get it, so in this yellow component (Label, Split and Domain Re-Calc) we label, transform and split it into hourly and salaried rates of pay. Figures 8,9.

Figure 8: The Component
Figure 9: Inside The Component

Domain Re-Calculation

Splitting data into hourly and salaried rates of pay means we need to recalculate the table domain — the minimum and maximum values identified by the table spec for our column. The Domain Calculator node will update our table specification with the correct minimum and maximum values which you can see on the Spec tab in the lower and upper bound columns next to Salary. If we don’t do this the Auto-Binner node will not behave as expected. Figure 10.

Figure 10: Re-Calculated Domain For Hourly Rates Of Pay

Auto-Binner Configuration

Let’s start with the Auto-Binner node on the hourly rate data. When we open the Auto-Binner node configuration the node needs a numeric data type to go into the side. If our dataset lacked numeric data, this Exclude and Include twin-list would be empty. We have Wildcard and Regex selection options here as well. Figure 11.

Figure 11: Auto Binner Configuration Setting Tab

Equal Width Binning

The first big decision here is how exactly we want to bin our data.

Our first choice is Fixed number of bins. Within Fixed number of bins we set the number of bins as an integer value.

Next, we select what equal method we want to use.

The first dropdown choice is Equal width. This means that the value range in the bins themselves will be equal based on the number of bins we select. So maybe each bin is 10. 0–10, 11–21 and so on and so forth. This can be helpful when we are trying to visualize the data whether it’s salary data (as in this example) age, miles per gallon etc. It’s easier to understand the breakdown of the data, but our number of values per bin may vary. Figure 12.

Figure 12: Equal Width Bins

According to the node documentation this equal width is over the domain range — remember the range in the table specification for our filtered table?

Equal Frequency Binning

The next choice is Equal frequency, where the bins have the same if not close to the same number of values per bin so we don’t see big differences in bins like with equal width. This kind of binning can be helpful in certain feature engineering tasks and modeling/algorithm use. Figure 13.

Figure 13: Equal Frequency Bins

Sample Quantile Binning

We can also bin by sample quantiles and are free to adjust these quantiles as we see fit. According to the node documentation ( Figure 14) the smallest element corresponds to a probability of zero and the largest to a probability of 1. Figure 15.

Figure 14: Node Sample Quantiles Defined
Figure 15: Sample Quantiles

Ideally, there will be an equal number of values per bin for both quantile and equal frequency binning but be aware of the that tied values at boundaries can potentially increase the size (or value count) of the bins.

Next we can select our bin naming. We have the choice between numbered where we get a Bin 1, Bin 2 etc. label. Borders, which give us the left and right open and closed notation and finally we can get midpoint data for each bin.

For this example, let’s check the Borders option so I can cover the notation which is different here than in the Numeric Binner node.

We can force integer bounds which means that the lower value (the one on the left side, gets converted to its floor and the upper value (the one on the right side) gets converted to its ceiling. If we select this, duplicate edges will be removed. Figure 16.

Figure 16: Example Of Forcing Integer Bounds

If we see fit, we can choose to replace our target column(s) on output. If we selected this, the Salary column from our dataset would get replaced with the bins.

Figure 17: The Number Format Settings Tab

The number format settings tab ( Figure 17) allows us to adjust the number formatting, from how its output (standard string to engineering string) to precision, precision mode and rounding mode. Don’t forget this tab is here it could come in handy!

Let’s just leave it at default formatting, click OK and execute the node or press F7. Right click and look at the Binned data output. We still have our table of data, but now we have a new column called Salary [Binned] which includes our bins. Figure 18.

Figure 18: Dataset With Bins!

You’ll notice since we selected the Borders option for our bin naming, we see values in each row.

Since we forced integer bounds, we have whole number values here and the notation can be read as follows. Figure 19.

Figure 19: Auto-Binner Output Notation

The parenthesis means that its open and the bracket means closed. A value coming after the parenthesis is not included in the listed range, whereas a value coming before the end bracket means it’s included.

With a GroupBy node, we can really explore the data by summarizing the bins and counting the number of salaries per bin, departments, position titles and more. I highly encourage you to explore!

From the Auto-Binner where we selected equal width, the bin that ranges from greater than 11 dollars per hour (it’s left-hand side open) to 21 dollars per hour (it’s right-hand side closed) has the most recorded salaries based on our dataset and binning choices. Figure 20.

Figure 20: Group-By Node Bin Aggregations

As I mentioned earlier and as you can see here, the missing data gets its own bin. You can probably understand where domain knowledge is a necessity when binning values and can greatly assist in successfully summarizing and drawing insights from the data.

The Auto-Binner also includes a PMML or Predictive Model Markup Language processing fragment outport, to connect these configurations to PMML models.

Included in this workflow are many other examples for you to step through and the same examples using the salaried data are available as well.

For these examples Force Integer Bounds has been selected to clean up the outputs a bit.

Further Exploration

We can explore all these different Auto-Binning options visually by configuring Histogram nodes the same way we configured the Auto-Binner . As you can see (Figure 21), I’ve configured all of them so you can explore the different outputs and visually understand how the bins changes based on the configurations.

Figure 21: Lot’s Of Options To Explore!

Numeric Binner Node

Next, let’s dive into the Numeric Binner. This node allows us to define our own values and give them custom names or use generated names. Just as with the Auto-Binner node we can choose to overwrite the target column or Append a new column.

A key thing to recall here is that the notation is different. Rather than a parenthesis and a bracket, the notation here is just brackets. Figure 22.

Figure 22: Numeric Binner Notation

I’ve included a simple test dataset to practice this on, something I highly recommend. Figure 23.

Figure 23: The Test Data

We start by adding our target column. The first bin basically covers from negative infinity through infinity. We need to add another bin to open the ranges and get things started. Now you can see that the first bin automatically updates to cover from negative infinity through 0. Figure 24.

Figure 24: Automatic Bin Values In The Numeric Binner

As we add range limits and then new columns, the range minimums will update. We still must be mindful about what bins our values are going to reside in. Pay careful attention to the values and remember the number types of your dataset, integer vs. double for instance.

The nice thing here is that rather than need to write a long else-if statement to create categories, we can do it here and adjust on the fly.

For this example, we want 4 bins each with 5 values. We can just click Add for each bin and enter the range we want the bin to include. Then we add a descriptive name and move to the next.

Think about how we might want the final bins defined. In this example, we ensure that 16–20 is in its own bin and then create a 21+ bin. The nice thing is that we can add additional bins here to account for future data that may not be in the current dataset. Figure 25.

Figure 25: Our Bin Configurations

Let’s select Append New Column and execute the node. Take a look at the output. We can clearly see how the values were binned based on our defined ranges in the node configuration. Figure 26.

Figure 26: Output As Expected!

Flow Variable Control

As with the vast majority of KNIME nodes, the Auto-Binner and Numeric Binner have several controllable flow variables shown here. The most practical use of flow variables here would be with the Auto-Binner node and in the form of a component allowing someone to set the number of bins and equality method. Figures 26,27.

Figure 26: Auto-Binner Flow Variables
Figure 27: Numeric Binner Flow Variables

That’s it for the Auto-Binner and Numeric Binner nodes. These nodes can make working with our data much easier by allowing us to categorize it in ways that make sense to enable clearer reporting or targeted analysis.

Thanks for reading! If you have questions please post them on the YouTube video for this article, or email me at info@knime.tips!

Originally published at https://knime.tips.

--

--

John Denham

I am a Data Scientist who is passionate about empowering people to make the most of their data. I run the website KNIME.tips.