Binning Continuous Variables in Boxplot R
[This article was first published on Revolutions, and kindly contributed to R-bloggers.]
by Herman Jopia
What is Binning?
Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization, the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.
Why Binning?
Though there is some reticence about it [1], the benefits of binning are pretty straightforward:
- It allows missing data and other special calculations (e.g. division by zero) to be included in the model.
- It controls or mitigates the impact of outliers on the model.
- It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.
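The first two benefits can be illustrated with a minimal base R sketch. The values and cutpoints below are hypothetical and chosen only for illustration: open-ended first and last intervals absorb extreme values, and missing records get a bin of their own instead of being dropped.

```r
# Hypothetical values: one missing record and one extreme outlier (600)
tob <- c(3, 14, 27, NA, 45, 600)

# Hypothetical cutpoints; -Inf and Inf make the outer bins open-ended,
# so an outlier simply falls into the last bin
cuts <- c(12, 24, 36)
bins <- cut(tob, breaks = c(-Inf, cuts, Inf))

# Give missing values their own level instead of dropping them
bins <- addNA(bins)
levels(bins)[is.na(levels(bins))] <- "Missing"

print(table(bins))
```

Here the record with value 600 ends up in the same bin as the record with value 45, and the NA record is counted under "Missing".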
Unsupervised Discretization
Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.
Equal length intervals
- Objective: Understand the distribution of a variable.
- Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).
- Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.
                      
Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.
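A minimal sketch of equal length binning in base R, assuming simulated right-skewed data as a stand-in for Time on Books (this is not the data behind Table 1). Sturges' rule sets the number of bins, and the bin width follows from the range of the variable.

```r
# Simulated stand-in for a "time on books" variable (months); an assumption
set.seed(1)
tob <- rexp(1000, rate = 1 / 24)

# Sturges' rule gives the number of bins
k <- nclass.Sturges(tob)

# Equal length: k intervals of identical width spanning the observed range
breaks <- seq(min(tob), max(tob), length.out = k + 1)
bins <- cut(tob, breaks = breaks, include.lowest = TRUE)

# Records per bin; with skewed data the upper bins tend to be sparse
counts <- table(bins)
print(counts)
```

With a skewed variable like this one, the upper bins tend to hold very few records, which is the disadvantage described above.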
Equal frequency intervals
- Objective: Analyze the relationship with a binary target variable through metrics like bad rate.
- Example: Quartiles or Percentiles.
- Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.
                      
Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).
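Equal frequency binning can be sketched the same way, again on simulated data rather than the data behind Table 2. The cutpoints are the quartiles of the variable, so each bin receives roughly the same number of records regardless of how the values are spread.

```r
# Simulated continuous stand-in for "time on books" (months); an assumption
set.seed(1)
tob <- rexp(1000, rate = 1 / 24)

# Quartile cutpoints: 0%, 25%, 50%, 75%, 100%
cutpoints <- quantile(tob, probs = seq(0, 1, by = 0.25))

# Assign each record to its quartile bin
bins <- cut(tob, breaks = cutpoints, include.lowest = TRUE)

# Records per bin
counts <- table(bins)
print(counts)
```

Note that the cutpoints depend only on the distribution of the variable itself. The target variable plays no role in choosing them, which is why they may not separate goods from bads as well as they could.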
Supervised Discretization
Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.
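To make the idea of a target-driven cutpoint concrete, here is a deliberately simplified sketch: a brute-force search for a single cutpoint that maximizes the chi-square statistic between the two resulting groups and a binary target. The data are simulated with an assumed jump in bad rate at 24 months, and the helper name chisq_at is hypothetical. This is not the algorithm used by any particular package; real methods search for several cutpoints and apply stopping rules.

```r
# Simulated data: bad rate assumed to drop from 30% to 10% after 24 months
set.seed(42)
n <- 2000
tob <- runif(n, 0, 60)
bad <- rbinom(n, 1, ifelse(tob < 24, 0.30, 0.10))

# Chi-square statistic of the 2x2 table (below/above cutpoint) x (good/bad)
chisq_at <- function(cutpoint, x, bad) {
  tab <- table(x <= cutpoint, bad)
  unname(chisq.test(tab, correct = FALSE)$statistic)
}

# Candidate cutpoints: every 5th percentile, so both groups are never empty
candidates <- quantile(tob, probs = seq(0.05, 0.95, by = 0.05))
stats <- sapply(candidates, chisq_at, x = tob, bad = bad)

# The candidate that separates goods and bads the most
best_cut <- unname(candidates[which.max(stats)])
print(best_cut)
```

Unlike the unsupervised methods above, the chosen cutpoint here depends on the target variable, so it should land close to the point where the simulated bad rate changes.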
In the past, analysts would iteratively move from Fine Binning to Coarse Binning, a very time-consuming process of finding the right cutpoints manually and visually (if they were found at all). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two of several techniques available [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
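As a rough sketch of how these two metrics are commonly computed in scorecard work: for each bin, Weight of Evidence (WoE) is the natural log of the ratio between the bin's share of all goods and its share of all bads, and Information Value (IV) sums (share of goods minus share of bads) times WoE across bins. The counts below are hypothetical and the function name woe_iv is illustrative, not part of any package.

```r
# Hypothetical good/bad counts per bin (not taken from the article's tables)
good <- c(100, 200, 300, 400)
bad  <- c( 80,  60,  40,  20)

# WoE per bin and total IV, using the common scorecard convention
woe_iv <- function(good, bad) {
  pct_good <- good / sum(good)      # bin's share of all goods
  pct_bad  <- bad / sum(bad)        # bin's share of all bads
  woe <- log(pct_good / pct_bad)    # Weight of Evidence per bin
  iv  <- (pct_good - pct_bad) * woe # each bin's contribution to IV
  list(woe = woe, iv = iv, total_iv = sum(iv))
}

res <- woe_iv(good, bad)
print(res$woe)
print(res$total_iv)

# A bin with no bads makes the ratio infinite, so WoE and IV are not finite
sparse <- woe_iv(good = c(10, 10), bad = c(5, 0))
print(sparse$woe)
```

The last call shows why a bin without bads, like the one in Table 1, produces indeterminate metrics: the ratio inside the logarithm has a zero denominator.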
An Example With 'smbinning'
Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking into account the Credit Performance (Good/Bad) to establish the optimal cutpoints and obtain meaningful, statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).
# Load package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train <- subset(chileancredit, FlagSample == 1)
chileancredit.test <- subset(chileancredit, FlagSample == 0)

# Run and save results
result <- smbinning(df = chileancredit.train, y = "FlagGB", x = "TOB", p = 0.05)
result$ivtable

# Relevant plots (2x2 Page)
par(mfrow = c(2, 2))
boxplot(chileancredit.train$TOB ~ chileancredit.train$FlagGB,
        horizontal = TRUE, frame = FALSE, col = "lightgray", main = "Distribution")
mtext("Time on Books (Months)", 3)
smbinning.plot(result, option = "dist", sub = "Time on Books (Months)")
smbinning.plot(result, option = "badrate", sub = "Time on Books (Months)")
smbinning.plot(result, option = "WoE", sub = "Time on Books (Months)")
          
Table 3. Time on Books cutpoints mapped to Credit Performance.
                  
Figure 1. Plots generated by the package.
In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, significantly reduces the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative to run their analysis faster.
For more information about binning, the package's documentation on CRAN lists references related to the algorithm behind it, and its supporting website lists references on scoring model development.
References
[1] Dinero, T. (1996) Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.
[2] Garcia, S. et al. (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, 25(4).
Source: https://www.r-bloggers.com/2015/03/r-package-smbinning-optimal-binning-for-scoring-modeling/