Machine Learning with Decision Tree Classification

Data Classification

Another form of data mining is data classification with decision tree induction. It is a way to organize data into buckets with items labeled as “blue apples” or “red apples” or “beer”. In computing terms, it’s data association as it relates to a given set of attributes that will yield pure classification. More on this later. In more practical terms, it is classifying a customer’s credit history data as risky or safe in a loan application process. A classification of stocks as a “buy” or a “sell” based on financial features such as one-year growth, earnings, revenues, and so on.

These classifications are built using models called classifiers, which describe the classes of data used in machine learning, pattern recognition, and statistics. A model can be an If-else condition or a decision tree where the data is applied to an algorithm that generates a tree-like data structure with roots, branches, and leaves with the leaves as the resulting classified output.

The illustration above is a basic example of classifying data by creating a tree-like data structure.

General Approach:

The general approach to data classification can be broken down into 2 general processes: First, the learning step. This is where a model is generated based on sample data. In case you’re wondering what a model is, it’s a representation of data. In mathematical terms E=mc squared, for example, can be thought of as a model of mass and energy relation. A representation of their equivalence. In data classification, a model can be attributes a and b = c, with a certain threshold of c being labeled as “cat”, for example (with a representing whiskers and b as claws).

The second step is the classification step where the model is applied to a sample data set, the training data set. I was confused on what training data set means since the materials I have read did not really explain it to a level that I can understand – a 5th grade level. No shame there. And in case you are wondering what it means as well, in its most simplest concept, which is what this blog is about – over simplifying things, at least to start, I will provide a very simple example.

Consider your financial statement at the end of the month where you have a negative account balance and you cannot make your rent payment for that month. You review your statement and see three transactions that go over $300. You realized that this is why you have no money at the end of the month. Then, you review past months statement and you see a similar pattern. That whenever you spend more than $300 at least three times for a given month your account goes negative. You then come to a conclusion that you cannot afford spending $300 three times in a month. Mathematically you might write this as (s  → d) * n / Mi = “bad”. With s as spending, d is the dollar amount ties, n is the number of concurrences over M sub i, with M as the month and i as the index for each month. This is your model, your “classifier” that is generated from training data; the current month, the previous month, and the month before that.

After defining your model you also reviewed many previous months to validate the accuracy of your conclusion, your model. This is called “model evaluation”, an important aspect in data classification where the accuracy of the model is compared based on its performance on another set of sample data. If the accuracy of the model is not within an expected or acceptable threshold then the model is revised with updated conditions or by applying statistical logic to fit the desired accuracy threshold. Or, adjusting the threshold limit with the knowledge that this is simply what the data sample can provide while adjusting the model as more sample data is available. It’s an iterative step until the desired outcome is reached. Model evaluation is not covered in this post. Perhaps later.

In more practical terms, the illustration below shows how a bank loan training data set can be classified, either manually or as part of the data, with a label; low, safe, and risky.


The class attribute in this case is the loan_decision with the training data set parsed by a classification algorithm to generate the classification rules. In the example above, an if-else condition is generated, a form of decision tree. The generation of this if-then classifier is based on the pattern from the training data set. The combination of attribute patterns that has the loan_decision label.

This process of classifying tuples with known labels is called “supervised” classification, or more commonly known as “supervised learning.” It is called supervised because the classes of data are known in advance. This is in contrast with unsupervised learning, a form of data clustering. This topic will be covered in my succeeding post.


Decision Tree Induction

Decision tree induction is where a tree is generated given the classification steps we have learned so far. Side note: I personally think of it as decision tree “generation” instead of induction, as it is known in data mining concepts. Generation makes more sense than induction because it is a process where a decision tree is generated by the classification algorithm from the if-then rules using a decision tree algorithm.

In case you’re wondering, like I did when I was learning this, why do we need to generate a decision tree if we already have our if-then rules? The reason is that decision trees are very efficient in storing, processing, and retrieving data compared to if-then statements. The way the data is organized lends itself well to information processing by low-level programs (like C or Java). It is commonly used in data structures, and the concepts of binary trees are widely used in all forms of computing and data storage. It is an important form of data structure with many applications.

In a decision tree, the internal nodes are tests (if-else) decisions on certain conditions with each binary output (either yes or no) branching to the lower levels of the tree. Lower, because the common representation of a data tree is an upside down tree, with the root node being drawn on top branching downwards to the leaf nodes. The leaf nodes holds the class label.

The ID3 algorithm is the algorithm for generating decision trees and is typically used in Machine Learning and Natural Language processing.

More details can be found here:


The idea behind the ID3 algorithm is to identify the attribute that will lead to the highest information gain or the one with the least entropy (information loss). An attribute with high Information gain means that it can serve as a decision point in the tree structure. Simply put, it is an attribute where other attributes can be associated to given its high number of occurrence, that is, in relation to other attributes in the data set. Determining this entropy business can be daunting. But luckily, in the world of information science, this has been modeled for us. Thanks to mathematician, Claude Shannon.

H(S) represents entropy, or more specifically Shannon Entropy. S can have the value x1, x2, xn, which can represent any possible values of attributes in the data set. The p is the probability of an attribute value from occurring, which is calculated as the number of expected value divided by the total number of occurrences.

A very good illustration and example of this concept can be found here:

Key to classifying data by inducing a decision tree from a data set is determining the criteria for splitting (horizontally, or by columns) the data into individual classes. This means that classes must already be labeled. Data with labeled classes in Machine Learning nomenclature as supervised learning while data with unlabeled classes is known as unsupervised learning. More on this on my future article.

Identifying the splitting attributes for a data set means ranking each attribute by score, by information gain using the Shannon Entropy model above. The attribute with the highest score will serve as the splitting attribute, and will continue to be split until it has reached the score of 100% or 0%.  A score of 100% or 0% means that no other data attributes can be associated to a splitting attribute for further categorization. They now all belong to the same class. This is known as attribute purity, with 100% and 0% indicating binary output 1 or 0, or “yes” or “no”, or “black” or “white”.

If the attribute is not pure, meaning there are parameters that have non-zero or non-one hundred values, split it again using the next attribute. This attribute selection cycle is what generates the tree data structure with branches growing from each node until it can no longer be split, the leaf node.




Data Mining Concepts and Techniques, 3rd Edition. By Jiawei Han, Micheline Kamber, and Jian Pei


Leave a Reply

Your email address will not be published. Required fields are marked *