Facebook Data Mining

Facebook has incredible amounts of data that can be mined for analysis and insights on user or customer behavior, social trends, market analysis, and so on. The Graph API is the interface for extracting this data: it is the primary way to get data into and out of the Facebook platform.

I created a project called Facebook Miner for extracting this data using Python. This project extracts Facebook page posts along with their comments and sub-comments.

The project is located on GitHub. To run it, simply execute the Python script. It does require some preliminary setup, such as the FB page list file for extraction. These preliminary steps are detailed in the GitHub readme file.

Facebook Graph API for Data Mining

Facebook provides an API for extracting company and customer comments called Graph API. It is the primary way to get data into and out of the Facebook platform. It’s a low-level HTTP-based API that apps can use to programmatically query data, post new stories, manage ads, upload photos, and perform a wide variety of other tasks[1].

Before the Graph API can be accessed, a user token is required[2]. An access_token can be acquired from Facebook’s developer tools located at https://developers.facebook.com/tools/

Once acquired, an API endpoint can be accessed as follows (the URL shown is an illustrative posts query; substitute the page name and your own token):

curl -i -X GET \
  "https://graph.facebook.com/<page_name>/posts?access_token=<access_token>"

The Graph API is named after the idea of a “social graph” — a representation of the information on Facebook. It’s composed of:

  • nodes — basically individual objects, such as a User, a Photo, a Page, or a Comment
  • edges — connections between a collection of objects and a single object, such as Photos on a Page or Comments on a Photo
  • fields — data about an object, such as a User’s birthday, or a Page’s name

Typically, you use nodes to get data about a specific object, use edges to get collections of objects on a single object, and use fields to get data about a single object or each object in a collection.
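To make the node/edge/field distinction concrete, here is a minimal Python sketch of how the three concepts map onto request URLs. The helper names and the v2.12 version prefix are my own illustrative assumptions, not part of the project:

```python
# Sketch of how node, edge, and field queries map to Graph API URLs.
GRAPH_ROOT = 'https://graph.facebook.com/v2.12'

def node_url(node_id, fields=None):
    # A node query returns data about a single object; the optional
    # "fields" parameter restricts which fields come back.
    url = GRAPH_ROOT + '/' + node_id
    if fields:
        url += '?fields=' + ','.join(fields)
    return url

def edge_url(node_id, edge):
    # An edge query returns a collection of objects on a node,
    # e.g. the Posts on a Page or the Comments on a Photo.
    return GRAPH_ROOT + '/' + node_id + '/' + edge

print(node_url('some_page', ['name', 'about']))
# https://graph.facebook.com/v2.12/some_page?fields=name,about
print(edge_url('some_page', 'posts'))
# https://graph.facebook.com/v2.12/some_page/posts
```

In a real request, an access_token parameter would be appended to either URL.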

User Page Data Structure

Each post by a user on their page in Facebook is published on the main page, accessible using a URL of the form http://www.facebook.com/<page_name>. There are variations in how content gets published on a company page. Some companies only allow authorized personnel to post content, while others are open to all Facebook users.

Below is the structure of posts for a company page:

- User A Page
  -> Page Post by Friends (or enemies)
     -> Page Post Comments
        -> Page Post Comment Sub-Comments
  -> Page Post by Friends (or enemies)
     -> Page Post Comments
        -> Page Post Comment Sub-Comments
- User B Page
- User C Page

See screenshot below for illustration:

For each post, customers can add comments as well as sub-comments. To retrieve these, a GET request must include the post's node identifier in the request URL. Doing so will return all the comments associated with that post.

Python Code for Extracting Facebook Posts

This GitHub Python project (https://github.com/rayitis/Facebook-Miner) uses a pandas DataFrame[3] to load and manipulate data for processing. A DataFrame is a two-dimensional data structure, i.e., data aligned in a tabular fashion in rows and columns. The following is the structure of the methods in the Python program for extracting Facebook data.

import pandas as pd
import json
import os
import requests

# Load company list
# companyList = pd.read_csv('DataSet/Fortune 500 2017 - Fortune 500-small.csv')
companyList = pd.read_csv('DataSet/Fortune 500 2017 - Fortune 500.csv')

def constructUrl(contentType, variableStr):
    # contentType is either post, comment, or subcomment
    # variableStr is either the company name, post id (for comments),
    # or comment id (for sub-comments)
    if contentType == 'post':
        url = '<URL txt here>'  # Graph API posts endpoint for the page
    return url

def graphApiRequest(url):
    requestResult = requests.get(url)
    return requestResult

def writeToFile(folderName, fileName, extractedData):
    if not os.path.exists(folderName):
        os.makedirs(folderName)
    with open(folderName + fileName, 'w') as outFile:
        outFile.write(extractedData.text)

def getNextPageUrl(folderName, fileName):
    with open(folderName + fileName) as jsonFile:
        jsonData = json.load(jsonFile)
    try:
        nextPageUrl = jsonData["paging"]["next"]
    except KeyError:
        nextPageUrl = 'null'
    return nextPageUrl

for i in range(companyList['FBPage'].size):
    pageNum = 0
    companyFbName = companyList['FBPage'][i]
    folderName = 'DataSet/' + str(companyList['Rank'][i]) + '-' + companyFbName + '/1-CompanyPost/'
    url = constructUrl('post', companyFbName)
    while url != 'null':
        pageNum += 1
        fileName = companyFbName + "_posts_page" + str(pageNum) + ".json"
        extractedData = graphApiRequest(url)
        writeToFile(folderName, fileName, extractedData)
        url = getNextPageUrl(folderName, fileName)
        print(companyFbName + ' page ' + str(pageNum) + ': ' + url)
        if url == 'null' and pageNum == 1:
            # Record pages that returned no data at all
            with open('DataSet/Exceptions.txt', "a") as exceptionFile:
                exceptionFile.write(',' + companyFbName)

To extract user comments from Facebook, the code set above is leveraged with the addition of processing logic that is suited for the user comments data structure provided by Facebook’s Graph API.

The code process and architecture are as follows:

  1. Read each extracted company post JSON file for each company. There are several JSON files per company, representing multiple pages. This means that a new nested loop is added that loops through each post in each JSON file.
  2. Extract the company post ID element in the JSON file. This post ID is the unique identifier for a given company post that is read by the “while” code snippet in step 1 above.
  3. Invoke the Graph API with this post ID, which is represented as the variable “variableStr.” The user comments Graph API URL is slightly different from the company post URL, and therefore a new URL string is generated.
  4. Write the output of the JSON request to a file. Each file holds the user comments for a given company post. Similar to the company post files, this is also broken down into multiple pages, with each company post file generating an average of 5,000 user comment files.
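The four steps above can be sketched roughly as follows. The helpers constructUrl, graphApiRequest, and writeToFile are those from the listing earlier; extractPostIds, the "data" key, and the folder layout are my own stand-ins for illustration:

```python
import json
import os

def extractPostIds(jsonData):
    # Step 2: pull the post ID element out of a parsed posts-page JSON.
    # Graph API listings are assumed to carry their entries under "data".
    return [post['id'] for post in jsonData.get('data', [])]

def extractCommentsForCompany(companyFolder, commentFolder):
    # Steps 1, 3, and 4: loop over every saved posts page for a company,
    # and for each post ID request and save its comments.
    postsFolder = companyFolder + '1-CompanyPost/'
    for fileName in sorted(os.listdir(postsFolder)):          # step 1: each posts page
        with open(postsFolder + fileName) as jsonFile:
            jsonData = json.load(jsonFile)
        for postId in extractPostIds(jsonData):               # step 2: each post ID
            url = constructUrl('comment', postId)             # step 3: comments URL
            extractedData = graphApiRequest(url)
            writeToFile(commentFolder, postId + '_comments_page1.json', extractedData)  # step 4

# Example of the ID extraction on a minimal posts payload:
sample = {'data': [{'id': '123_456', 'message': 'hello'}, {'id': '123_789'}]}
print(extractPostIds(sample))  # ['123_456', '123_789']
```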


[1] Facebook Graph API for Developers. URL: https://developers.facebook.com/docs/graph-api/overview/

[2] Facebook Access Tokens. URL: https://developers.facebook.com/docs/facebook-login/access-tokens

[3] Pandas DataFrame. URL: https://pandas.pydata.org/pandas-docs/stable/dsintro.html

Unsupervised Machine Learning – Primer

Clustering, in data science, is a form of unsupervised learning: a type of machine learning where there are no predefined classes of data. Data classification, as I covered in my previous post, relies on knowing the data features and their classes; the “safe” or “risky” classes of data are determined by a set of values in their associated data features. In unsupervised learning, the classes are unknown. This is what makes data clustering a complex topic. It is an active area of research and development in the field of data science, and a key foundation in the field of artificial intelligence.

Feature learning

The general idea of data clustering, as the nomenclature suggests, is to cluster (or group) data based on similar characteristics, so that the data points in each group not only share common traits but are also dissimilar from those in other groups. The outcome of this grouping is a set of data classes, a process called “feature learning” – the machine learning from the data set without a known or predefined class: unsupervised.

When the features are learned, or the classes of data identified, data classification algorithms can then be applied. Clustering can therefore be a first step in machine learning: the data and its structure are unknown, but given a set of computational algorithms, patterns can be identified for further processing such as pattern identification and classification.

General Approach

As you might have imagined, this is quite a complex subject with multi-dimensional facets requiring many algorithms and much computational processing. However, the intuition is straightforward; the complexity lies in the details of the various clustering methods and approaches, both technical and theoretical. One such method is K-means clustering, a widely used method that serves as a foundation for clustering.

K-Means Algorithm

First, K stands for (notice I did not use “means”, as in K means this… I used “stands” instead to avoid confusion) a pre-defined variable, set by a clustering algorithm or a data programmer. It represents the number of partitions or clusters we want for a data set with unknown classes. The “means” in K-Means is exactly what it says – means, or averages of the feature values. K, for example, can be 2 (a randomly selected value, or one chosen via an algorithmic process), which means that the entire data set will be clustered into two groups once the computation is done. Next, K means are selected at random, in this case 2 random means. Each mean value represents the center of a cluster – incorrect initially, given that the values were selected at random, but corrected by iterating through the entire data set. Each data scan iteration updates the mean values, and this iterate-and-update process continues until the mean values no longer change, which means that the centroids of the data set have been found.

The K-Means algorithm for partitioning a data set is as follows.

1. Arbitrarily choose k, the number of clusters you want out of the data set.

2. Arbitrarily choose a mean value for each of the k clusters. This can use some prior knowledge of the data set, such as statistical values like its overall mean or range.

3. Assign each data object to a cluster by computing its nearest centroid, the arbitrarily chosen mean value of that cluster.

Each point x is assigned to the cluster whose centroid minimizes the squared distance:

argmin(ci ∈ C) dist(ci, x)²

dist() is the Euclidean distance between ci and x – simply the difference between the two data points, with x the data point and ci the k-mean value. This value is squared so that the results are as separated as possible. In 5th grade speak – squaring exaggerates the values so that they are more distinguishable for computational purposes, or for visualization purposes when presented in a graph. Summing these squared distances over all points gives the overall objective the algorithm minimizes. argmin is the value of ci, among the set of centroids C, that minimizes this distance.

4. Update the mean value of each cluster by calculating the mean of all data points assigned to cluster ci:

ci = (1 / |Si|) × Σ(xi ∈ Si) xi

Where Si is the set of data points assigned to the ith centroid’s cluster, with the sum of xi computed over all elements in Si.

5. Repeat steps 3 and 4 until the cluster mean values no longer change and the data point assignments no longer change. This is what is called the convergence of the algorithm.
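To ground steps 1 through 5, here is a minimal pure-Python sketch of the algorithm. It is a toy implementation under my own simplifying assumption that the initial means are just the first k points (rather than random picks), not production code:

```python
def kmeans(points, k):
    # Steps 1 & 2: choose k and pick k initial means arbitrarily.
    # For reproducibility this sketch just takes the first k points.
    means = list(points[:k])
    while True:
        # Step 3: assign each point to its nearest centroid,
        # minimizing the squared Euclidean distance.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            clusters[nearest].append(x)
        # Step 4: update each mean to the average of its assigned points.
        newMeans = [tuple(sum(c) / len(cluster) for c in zip(*cluster))
                    if cluster else means[i]
                    for i, cluster in enumerate(clusters)]
        # Step 5: iterate until the means (and hence assignments) stop changing.
        if newMeans == means:
            return means, clusters
        means = newMeans

# Two well-separated groups of 2-D points; k = 2 recovers their centroids.
data = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (8.5, 9.0), (7.8, 8.2)]
centers, clusters = kmeans(data, 2)
print(sorted((round(a, 2), round(b, 2)) for (a, b) in centers))
# [(1.23, 1.27), (8.1, 8.4)]
```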


The algorithm steps I have outlined above are a much more simplified version of what you’d find in textbooks and online resources. Wikipedia (https://en.wikipedia.org/wiki/K-means_clustering), for example, defines it as a 2-step process with the details defined mathematically.

The idea behind unsupervised machine learning, as in the way humans learn, is to first discover patterns in the data, whether through similarities, categorization, or outliers. Indeed, data clustering is a deep subject, and there is significantly more to it. K-Means clustering, for example, is just one of many clustering approaches: it is a partitioning approach, while others are hierarchical, density-based, or grid-based methods. The approach in this article is the organization of data by partitioning and clustering it in a way that yields meaningful associations. Granted, this article is an over-simplification of data clustering, but it serves to demystify the idea behind unsupervised learning, with K-Means serving as an important foundation for machine learning.


Data Mining Concepts and Techniques, 3rd Edition. By Jiawei Han, Micheline Kamber, and Jian Pei



Machine Learning with Decision Tree Classification

Data Classification

Another form of data mining is data classification with decision tree induction. It is a way to organize data into buckets, with items labeled as “blue apples” or “red apples” or “beer”. In computing terms, it’s data association as it relates to a given set of attributes that will yield a pure classification. More on this later. In more practical terms, it is classifying a customer’s credit history data as risky or safe in a loan application process, or classifying stocks as a “buy” or a “sell” based on financial features such as one-year growth, earnings, revenues, and so on.

These classifications are built using models called classifiers, which describe the classes of data used in machine learning, pattern recognition, and statistics. A model can be an if-else condition or a decision tree, where the data is applied to an algorithm that generates a tree-like data structure with a root, branches, and leaves, the leaves being the resulting classified output.

The illustration above is a basic example of classifying data by creating a tree-like data structure.

General Approach:

The general approach to data classification can be broken down into 2 general processes. First, the learning step: this is where a model is generated based on sample data. In case you’re wondering what a model is, it’s a representation of data. In mathematical terms, E = mc², for example, can be thought of as a model of the mass-energy relation, a representation of their equivalence. In data classification, a model can be “attributes a and b = c”, with a certain threshold of c being labeled as “cat”, for example (with a representing whiskers and b claws).

The second step is the classification step, where the model is applied to a sample data set: the training data set. I was confused about what “training data set” means, since the materials I read did not really explain it at a level I could understand – a 5th grade level. No shame there. In case you are wondering as well, here is a very simple example, in keeping with what this blog is about: over-simplifying things, at least to start.

Consider your financial statement at the end of the month, where you have a negative account balance and cannot make your rent payment. You review your statement and see three transactions that go over $300. You realize that this is why you have no money at the end of the month. Then you review past months’ statements and see a similar pattern: whenever you spend more than $300 at least three times in a given month, your account goes negative. You conclude that you cannot afford spending $300 three times in a month. Mathematically you might write this as (s → d) * n / Mi = “bad”, with s as spending, d as the dollar amount, n as the number of occurrences over Mi, where M is the month and i is the index for each month. This is your model, your “classifier”, generated from training data: the current month, the previous month, and the month before that.
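The toy model above can be written directly as code. This is only an illustration of the idea; the function name and thresholds are the made-up ones from the example:

```python
def month_label(transactions, dollar_threshold=300, count_threshold=3):
    # The toy classifier from the example: a month is "bad" when spending
    # over the dollar threshold occurs at least count_threshold times.
    big_spends = sum(1 for t in transactions if t > dollar_threshold)
    return 'bad' if big_spends >= count_threshold else 'ok'

# Two sample months of transactions:
print(month_label([320, 45, 410, 350, 20]))   # bad (three spends over $300)
print(month_label([120, 45, 310, 80]))        # ok  (only one spend over $300)
```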

After defining your model, you also reviewed many previous months to validate the accuracy of your conclusion, your model. This is called “model evaluation”, an important aspect of data classification where the accuracy of the model is measured by its performance on another set of sample data. If the accuracy of the model is not within an expected or acceptable threshold, the model is revised with updated conditions or by applying statistical logic to fit the desired accuracy threshold. Alternatively, the threshold limit is adjusted with the knowledge that this is simply what the data sample can provide, while the model is refined as more sample data becomes available. It’s an iterative process that continues until the desired outcome is reached. Model evaluation is not covered in this post. Perhaps later.

In more practical terms, the illustration below shows how a bank loan training data set can be classified, either manually or as part of the data, with a label: low, safe, or risky.


The class attribute in this case is loan_decision, with the training data set parsed by a classification algorithm to generate the classification rules. In the example above, an if-else condition is generated, a form of decision tree. The generation of this if-then classifier is based on patterns in the training data set: the combinations of attribute values that occur with each loan_decision label.
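As an illustration of what such a generated if-then classifier might look like in code (the attribute names and rules here are hypothetical, not derived from a real training set):

```python
def loan_decision(income, credit_history):
    # An if-then classifier of the kind a classification algorithm might
    # generate from training data; the attributes and cut-offs are made up.
    if credit_history == 'excellent':
        return 'safe'
    if credit_history == 'fair' and income == 'high':
        return 'safe'
    return 'risky'

print(loan_decision('high', 'excellent'))  # safe
print(loan_decision('low', 'fair'))        # risky
```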

This process of classifying tuples with known labels is called “supervised” classification, or more commonly known as “supervised learning.” It is called supervised because the classes of data are known in advance. This is in contrast with unsupervised learning, a form of data clustering. This topic will be covered in my succeeding post.


Decision Tree Induction

Decision tree induction is where a tree is generated from the classification steps we have learned so far. Side note: I personally think of it as decision tree “generation” instead of induction, as it is known in data mining texts. Generation makes more sense to me because it is a process where a decision tree is generated from the if-then rules by a decision tree algorithm.

In case you’re wondering, like I did when I was learning this, why we need to generate a decision tree if we already have our if-then rules: the reason is that decision trees are very efficient for storing, processing, and retrieving data compared to chains of if-then statements. The way the data is organized lends itself well to information processing by low-level programs (in languages like C or Java). Trees are commonly used data structures, and binary trees in particular are widely used in all forms of computing and data storage.

In a decision tree, the internal nodes are tests (if-else decisions) on certain conditions, with each binary output (yes or no) branching to the lower levels of the tree. Lower, because the common representation of a data tree is upside down, with the root node drawn on top branching downward to the leaf nodes. The leaf nodes hold the class labels.

The ID3 algorithm is a classic algorithm for generating decision trees and is typically used in Machine Learning and Natural Language Processing.

More details can be found here: https://en.wikipedia.org/wiki/ID3_algorithm


The idea behind the ID3 algorithm is to identify the attribute that will lead to the highest information gain, or the one with the least entropy (information loss). An attribute with high information gain can serve as a decision point in the tree structure. Simply put, it is an attribute that many other attributes can be associated with, given its high number of occurrences relative to the other attributes in the data set. Determining this entropy business can be daunting. But luckily, in the world of information science, this has been modeled for us. Thanks to mathematician Claude Shannon.

H(S) represents entropy, or more specifically Shannon entropy:

H(S) = − Σ p(xi) · log2 p(xi)

S can take the values x1, x2, …, xn, which represent the possible values of an attribute in the data set. p(xi) is the probability of an attribute value occurring, calculated as the number of occurrences of that value divided by the total number of records.
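A minimal sketch of the entropy computation, applied to a list of class labels:

```python
import math

def entropy(labels):
    # H(S) = -sum(p(x) * log2(p(x))) over each distinct value x in S,
    # where p(x) is the count of x divided by the total number of records.
    total = len(labels)
    probs = [labels.count(v) / total for v in set(labels)]
    h = -sum(p * math.log2(p) for p in probs)
    return h + 0.0  # normalize -0.0 to 0.0 for pure sets

print(entropy(['yes', 'no', 'yes', 'no']))    # 1.0 (maximally mixed)
print(entropy(['yes', 'yes', 'yes', 'yes']))  # 0.0 (pure: a leaf node)
```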

A very good illustration and example of this concept can be found here:


Key to classifying data by inducing a decision tree from a data set is determining the criteria for splitting the data (horizontally, or by columns) into individual classes. This means that classes must already be labeled. Data with labeled classes is known in Machine Learning nomenclature as supervised learning, while data with unlabeled classes is known as unsupervised learning. More on this in a future article.

Identifying the splitting attributes for a data set means ranking each attribute by score – by information gain, using the Shannon entropy model above. The attribute with the highest score serves as the splitting attribute, and splitting continues until a branch reaches a score of 100% or 0%. A score of 100% or 0% means that no other data attributes can be associated with a splitting attribute for further categorization; the records now all belong to the same class. This is known as attribute purity, with 100% and 0% indicating a binary output of 1 or 0, “yes” or “no”, “black” or “white”.

If the attribute is not pure, meaning there are values that are neither zero nor one hundred percent, it is split again using the next attribute. This attribute selection cycle is what generates the tree data structure, with branches growing from each node until a node can no longer be split: the leaf node.
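Putting entropy to work, the following sketch ranks attributes by information gain, Gain(S, A) = H(S) − Σ (|Sv|/|S|) · H(Sv), and picks the splitting attribute, in the spirit of ID3. The tiny table and attribute names are hypothetical:

```python
import math

def entropy(labels):
    # Shannon entropy over the class labels in a (sub)set of records.
    total = len(labels)
    return -sum((labels.count(v) / total) * math.log2(labels.count(v) / total)
                for v in set(labels)) + 0.0

def information_gain(rows, attribute, class_attr):
    # Gain(S, A) = H(S) - sum over values v of A of (|Sv|/|S|) * H(Sv)
    base = entropy([r[class_attr] for r in rows])
    remainder = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[class_attr] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# A tiny hypothetical loan table: credit perfectly predicts the class,
# so it scores the full H(S) and is chosen as the splitting attribute.
rows = [
    {'income': 'high', 'credit': 'good', 'loan': 'safe'},
    {'income': 'low',  'credit': 'good', 'loan': 'safe'},
    {'income': 'high', 'credit': 'bad',  'loan': 'risky'},
    {'income': 'low',  'credit': 'bad',  'loan': 'risky'},
]
best = max(['income', 'credit'], key=lambda a: information_gain(rows, a, 'loan'))
print(best)  # credit
```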




Data Mining Concepts and Techniques, 3rd Edition. By Jiawei Han, Micheline Kamber, and Jian Pei


Apriori, an Algorithm for Frequent Pattern Mining

Frequent pattern mining can be implemented in many ways. It can be as simple as scanning a data array and performing a count, aggregating that count against another column, and validating whether it meets a certain threshold. It can be as complex as cross-referencing data across multiple data features and applying conditional logic based on some external dataset. In data science, however, the simplest form of frequent pattern mining is the Apriori algorithm.

The Apriori algorithm generates strong association rules from frequent itemsets by using prior knowledge of itemset properties. It is an iterative approach that scans the database several times to accumulate counts of frequently occurring itemsets: first L1, the frequent 1-itemsets, which are used to find L2, the frequent 2-itemsets, then L3, and so on until no more frequent itemsets can be found.

These repeated scans can consume large amounts of system resources and can take a significant amount of time to complete. In my experience, a program that I wrote in Python took three days to run as it scanned through several datasets totaling several gigabytes in size. Indeed, it was a naïve solution and there were plenty of areas where the algorithm could have been improved, but it was a class project and I had to shift focus to other areas of the project. As long as it runs, I have time to wait, and I know the output is accurate, why mess with it. That was my thinking. But I digress. Improving the Apriori algorithm is a large subject on its own and is beyond the scope of this blog. Perhaps later.

The Apriori algorithm, however, does have a property that makes it efficient by minimizing the scan space. The apriori property reduces the scan space by eliminating all supersets of a non-frequent itemset during a scan. To illustrate this simply, consider the binary search algorithm and how it eliminates half of the data population when a certain value is identified as being less than or equal to the search value, thus significantly minimizing the scan space. This intuition is applicable to the apriori property, which skips parts of the data population during a scan by only counting candidate itemsets built from frequent itemsets and ignoring the rest, the supersets of non-frequent itemsets.

Formally, the apriori property is: “All nonempty subsets of a frequent itemset must also be frequent.”


Apriori Algorithm, How?

The algorithm is a 2-step process[1]: the join step and the prune step. In the join step, items are “joined” together into itemsets, with an item added incrementally on each iterative scan. A count is performed for each itemset and aggregated in memory as Lk, with the subscript “k” indicating the number of items in the itemsets of the list L.

In the prune step, all items or itemsets that do not meet the minimum support threshold (the minimum number of occurrences, designated by a knowledge worker or domain expert) are removed.

These two steps are performed iteratively, starting with a count of each individual item in the dataset, then adding (aka joining) the next item, counting each occurrence of that itemset, then rescanning and excluding (aka pruning) the itemsets that do not meet the minimum support requirement.

In writing, it can be confusing. But the illustration below helps visualize the process.

A scan is performed on a dataset, D, counting each occurrence, Sup. count, of the itemsets in C1. This is box 1 above, with item I1 showing a frequency (Sup. count) of 6, I2 a frequency of 7, and so on. This generated table is called candidate 1, hence the variable C1.

Next is the pruning step, where we generate the itemset list L1 from C1, excluding itemsets that do not meet the minimum support. In the example diagram above, no itemset is eliminated, since the minimum threshold in this example is 2.

The next step is generating Candidate 2, C2, for the next round of iteration (same as above – scanning and joining, then pruning), using L1. This is illustrated in the diagram below:

Using L1, we add the next item in the list, creating C2, and perform a scan and count for each itemset, i.e., the number of transactions where the pair {I1, I2}, {I1, I3}, etc. is found. We then check whether the count meets the minimum support requirement and create the next itemset list, L2 – a pruned list that only includes itemsets meeting the minimum support requirement.

This iteration of joining and pruning continues until all itemsets, Lk, are generated. Lk can then be used for further processing and analysis.
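The join-and-prune iteration described above can be sketched as follows. This is a naive implementation for illustration (it rescans the transactions for every support count); the transactions mirror the classic textbook example with items I1–I5 and a minimum support count of 2:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # First scan and prune: L1, the frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    frequent = {s: support(s) for s in Lk}
    while Lk:
        # Join step: merge pairs of k-itemsets into (k+1)-candidates, keeping
        # only those whose k-subsets are all frequent (the apriori property).
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == len(a) + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, len(c) - 1))}
        # Prune step: drop candidates below the minimum support count.
        Lk = {c for c in candidates if support(c) >= min_sup}
        frequent.update({s: support(s) for s in Lk})
    return frequent

# Nine transactions over items I1..I5, min_sup = 2.
D = [{'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'}, {'I1', 'I2', 'I4'},
     {'I1', 'I3'}, {'I2', 'I3'}, {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'},
     {'I1', 'I2', 'I3'}]
L = apriori(D, 2)
print(L[frozenset({'I1', 'I2', 'I3'})])  # 2
```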

Going back to the apriori property mentioned previously, and how it relates to this algorithm: the pruning process is the application of this property. Because of the apriori property, the algorithm avoids further scanning of non-frequent itemsets, only generating and counting candidate itemsets whose subsets are all frequent.

That is the Apriori algorithm. However, to make use of the itemsets found, association rules can be generated from them, which can then be applied to future transaction datasets. In my previous post I mentioned strong association rules, where strong means that the rules satisfy both the minimum support and the minimum confidence set by a domain expert. Support and confidence are calculated as follows:

support (A ⇒ B) = P(A ∪ B)
confidence (A ⇒ B) = P(B | A)

To extend this equation further, P(B | A) is computed as:

             Support_count(A ∪ B)
P(B | A) = -------------------------
               Support_count(A)

The support count is the number of transactions containing a given itemset, with Support_count(A ∪ B) the count of all transactions containing both A and B. Support_count(A) is simply the count of transactions containing the items of A.

Association rules can now be generated based on this equation by generating all nonempty subsets, s, of each frequent itemset, l. For every s of l, we can output the rule “s ⇒ (l - s)” if its confidence, Support_count(l) / Support_count(s), is greater than or equal to the minimum confidence threshold.
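The rule-generation step can be sketched as follows, taking the support counts produced by an Apriori run. The counts below match the earlier worked example (I1 with support 6, I5 and {I1, I5} with support 2):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    # For every frequent itemset l and every nonempty proper subset s of l,
    # output the rule s => (l - s) when support_count(l) / support_count(s)
    # meets the minimum confidence threshold.
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(l, size)):
                conf = count_l / frequent[s]
                if conf >= min_conf:
                    rules.append((s, l - s, conf))
    return rules

# Support counts for one frequent itemset and its subsets:
frequent = {frozenset({'I1'}): 6, frozenset({'I5'}): 2,
            frozenset({'I1', 'I5'}): 2}
for s, rest, conf in generate_rules(frequent, 0.7):
    print(set(s), '=>', set(rest), round(conf, 2))  # {'I5'} => {'I1'} 1.0
```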


This basic frequent pattern mining algorithm serves as the basis for more complex pattern mining. With candidate and itemset generation performed iteratively, statistical computing concepts can be applied in generating association rules, as well as correlations between cross-referenced itemsets across varying datasets.




[1] Data Mining Concepts and Techniques, 3rd Edition. By Jiawei Han, Micheline Kamber, and Jian Pei



Frequent Pattern Mining – Basic Concepts

Frequent pattern mining searches for recurring relationships in a given data set for the discovery of interesting associations and correlations between itemsets in transactional and relational databases[1]. That sounds technical. Of course it is; it’s from a textbook. But it’s basically finding patterns in data tuples – a row, a record in a data set, such as customer transactions, with each transaction a payment entry in the database. Pattern mining can be as simple as finding items that are commonly purchased together, such as coffee and creamer. As simple as this example may be, it is the basic premise of the “market basket analysis” software used to identify trends in customer buying behavior, which can lead to the development of a recommendation system strategy in purchasing systems.

Frequent Pattern Mining, How?

These patterns can be represented as association rules, whose interestingness is judged against thresholds set by domain experts. Using our coffee and creamer example, the rule is represented as follows:


Coffee ⇒ Creamer [support = 2%, confidence = 60%]

Both support and confidence are measures of an association’s interestingness. Is the support or confidence high enough to justify a marketing strategy that can lead to higher sales?

The 2% support in this example means that coffee and creamer are purchased together in 2% of all transactions, while the 60% confidence means that 60% of the customers who bought coffee also bought creamer. Typically, support and confidence are considered interesting if they meet a minimum support threshold and a minimum confidence threshold. These are the thresholds set by the domain expert, a person who understands the business.

And this is how support and confidence are calculated:

support (A ⇒ B) = P(A ∪ B)
confidence (A ⇒ B) = P(B | A)

A ⇒ B is read as “if A then B”, with A and B each an item or a set of items. P(A ∪ B) is read as the probability (P) of A union B: A together with B occurring in the same transaction. P(B | A) is read as the probability of B given A, which means the probability of B occurring in a transaction given that the transaction also contains A.

Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong, with frequency representing the number of transactions that contain the itemset. By convention, we write support and confidence as values between 0% and 100%.

In general, association rule mining can be viewed as a two-step process [2]:

  1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup.
  2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
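As a concrete illustration of the two measures, here is a minimal sketch that computes support and confidence straight from a list of transactions. The baskets are made up; confidence is computed as support_count(A ∪ B) / support_count(A):

```python
def support(transactions, itemset):
    # support(A => B) = P(A ∪ B): the fraction of transactions
    # containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, A, B):
    # confidence(A => B) = P(B | A) = support_count(A ∪ B) / support_count(A)
    count_A = sum(1 for t in transactions if A <= t)
    count_AB = sum(1 for t in transactions if (A | B) <= t)
    return count_AB / count_A

# Five hypothetical baskets: coffee appears in four, creamer alongside it in three.
baskets = [{'coffee', 'creamer'}, {'coffee', 'creamer', 'sugar'},
           {'coffee'}, {'coffee', 'creamer'}, {'bread'}]
print(support(baskets, {'coffee', 'creamer'}))       # 0.6
print(confidence(baskets, {'coffee'}, {'creamer'}))  # 0.75
```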

To those more familiar with RDBMSs (relational database software), like I am, these basic frequent pattern mining concepts can be applied using traditional SQL statements in a relational database. However, SQL is limited when it comes to complex, big-data-style processing. As frequent itemset combinations increase, along with cross-referencing other data or applying additional rules in frequency calculation, scalability becomes an issue. Processing, associating, and correlating large amounts of data requires significant computing resources. This is why Python, along with distributed platforms like Apache Spark, has become popular in recent years, as has R, the lingua franca[3] of statistical computing, which I’ll be blogging about in the near future – computational statistics with Python and R.

In summary, frequent pattern mining is conceptually simple but can grow into highly complex pattern identification. The intuition is simple: patterns of words in an email, such as “Viagra” or “free tickets”, might mark it as spam. In this post I covered support and confidence and their computation using probabilities (although I did not cover how to compute the probabilities themselves, the math is straightforward; it involves ratios and percentages).

What’s next in this blog series is the Apriori algorithm, the most basic algorithm for mining frequent itemsets, which builds on the concepts I have outlined above.



[1] Data Mining Concepts and Techniques, 3rd Edition. By Jiawei Han, Micheline Kamber, and Jian Pei

[2] Data Mining Concepts and Techniques, 3rd Edition, Chapter 6.1 Basic Concepts. By Jiawei Han, Micheline Kamber, and Jian Pei

[3] Vance, Ashlee (2009-01-06). “Data Analysts Captivated by R’s Power”. New York Times. Retrieved 2018-08-06. “R is also the name of a popular programming language used by a growing number of data analysts inside corporations and academia. It is becoming their lingua franca…”

Mining the World for Data

It is important to note that the field of data science resembles how the mind works. We are extracting knowledge, after all, and deriving from it an understanding of the world around us. This is why data science mirrors many aspects of our mental processing. Deep learning and neural networks, for example, follow algorithmic patterns inspired by how the brain works. Likewise, supervised and unsupervised machine learning resemble how we process and learn from the information we gather around us. And before any of this learning and processing can take place, the first step is to collect and mine data.

The world and everything around us is data. This is probably why the human mind’s primary skill is making sense of the world by collecting and processing data. According to the US National Library of Medicine [1], pattern processing is “the fundamental basis of most, if not all, unique features of the human brain including intelligence, language, imagination, and invention.” We find familiar patterns, classify them into known sets based on past experiences, or group them when they don’t fit and apply generalizations. What we see, feel, and taste is all based on data collected by our five senses, processed through unnoticed mathematical and algorithmic approaches.


Data Mining, Why?

Our gadgets collect data in the cloud, terabytes upon terabytes of it, a sea of data with varying potential, different shades of gold. These gadgets collect not only our emails, chat messages, and documents, but even the way we interact with each application. All of it is potential material for mining valuable information about each interaction and each activity.

Gold mining means digging through mountains of worthless rock and sediment, cleaning, sorting, and crushing it, then filtering it through a fine mesh until that rare golden glow appears among the fine-grained sand and crushed rock. But what do we mine in data? What is our gold? What shines among the deep sediments of worthless noise in a sea of never-ending system logs and chat messages? Gold, of course, metaphorically speaking, but not the kind you melt down into jewelry. Gold, in this case, is whatever is valuable to a given entity; it all depends on the stakeholders. Organizations, for example, equate the ability to learn about customer behavior with gold: knowing it lets them target specific products for marketing, and tailor manufacturing and inventory to market demands. In short, profit.

Today’s data is big, really big. Think terabytes upon terabytes of information with hundreds, perhaps even thousands, of attributes (also known as features). This big data has four traits: volume, variety, velocity, and veracity, known as the four V’s of Big Data.

Volume is the extremely large amount of data we have generated since the inception of the internet. Variety, the second V, is the thousands of variations in that data: chat messages, emails, photos, videos, and so on. Third is velocity, the speed at which we generate data; according to IBM industry research [2], we create 2.5 quintillion bytes of data every day. And last is veracity, the trustworthiness of data: can we rely on it as a source of information for predictive analysis, yielding expected and repeatable results?

There is a fifth V in Big Data characteristics: Value. Some online materials and Big Data articles include it as an inherent characteristic of Big Data. However, I personally do not consider it part of Big Data; I consider it the “outcome” of data mining, the potential of Big Data. As I described earlier, it’s the gold we mine in data mining – the knowledge.

Data mining, therefore, turns datasets into knowledge.

I consider Value the outcome of Big Data rather than part of it because it is something data miners must derive from the data, in line with organizational or stakeholder goals. The value must answer what is valuable to them and what can potentially yield profit, directly or indirectly.

Deriving value out of Big Data can be a daunting task, given its characteristics. However, breaking the process down into small, manageable steps minimizes the complexity and gives data miners and stakeholders transparency into how the value was gathered. Knowledge discovery can be broken down into these high-level areas: 1) data pre-processing, 2) data analysis, and 3) data post-processing. The subtasks under each of these areas are illustrated below [3]:
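As a minimal sketch of these three stages as plain Python functions (the stage contents are illustrative stand-ins, not a prescribed pipeline):

```python
from collections import Counter

# Stage names follow the three high-level areas above; the work each stage
# does here (trimming, word counting, ranking) is a made-up stand-in.

def preprocess(raw_records):
    """Data pre-processing: drop empty records, trim, and normalize case."""
    return [r.strip().lower() for r in raw_records if r.strip()]

def analyze(records):
    """Data analysis: a simple frequency count stands in for the mining step."""
    return Counter(records)

def postprocess(counts, top_n=3):
    """Data post-processing: rank and present the mined patterns."""
    return counts.most_common(top_n)

raw = ["Spam", "  ham", "spam", "", "Eggs", "spam"]
result = postprocess(analyze(preprocess(raw)))
print(result)  # [('spam', 3), ('ham', 1), ('eggs', 1)]
```

Keeping the stages as separate functions is what gives stakeholders the transparency mentioned above: each intermediate result can be inspected before the next stage runs.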

It’s an iterative process from pre-processing to post-processing, with analysis and post-processing iterated most often in the discovery pipeline.

Data Mining, How?

In data mining, the gold is not the shine on a pan of wet dirt filtered through nets or cloth, but patterns that can predict future outcomes – the knowledge we gain from mining. They are the customer buying patterns that help marketers identify the next best product offering for certain demographics. It’s the Fitbit tracker: where you’ve been, how, and how fast tells companies about your lifestyle, along with the restaurant locations and coffee shop stops you frequent. Knowing where you go for breakfast, lunch, or dinner, coupled with a few more data features – weather conditions, for example – lets marketers know what products you’ll buy next.

These outcomes fall into three general categories of data mining [3]. Pattern mining, as the name suggests, finds frequently occurring patterns in sets of data. Classification is a form of data analysis that extracts models describing important data classes; such models, called classifiers, predict categorical (discrete, unordered) class labels. And clustering is the process of grouping data objects into sets so that objects in the same cluster are similar to one another. It is useful when the classes are not known and defining them in advance is not possible; instead, a model finds common attributes among observations, showing the similarity or dissimilarity of data points in a data set. This can lead to the discovery of previously unknown groups or patterns within the data.
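As a small illustration of the clustering category, here is a one-dimensional k-means pass in pure Python. No class labels are given; the algorithm discovers the two groups on its own. The points and starting centroids are made up:

```python
# One-dimensional k-means in pure Python; points and starting centroids are made up.
def kmeans_1d(points, centroids, iterations=10):
    """Repeatedly assign each point to its nearest centroid, then recenter."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(centroids[i] - p))
            clusters[nearest].append(p)
        # Recenter each centroid on the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 10.0])
print(centroids)  # [1.5, 10.5]
print(clusters)   # [[1.0, 1.5, 2.0], [10.0, 10.5, 11.0]]
```

Nothing told the algorithm there were two kinds of points; the grouping emerged purely from the similarity (distance) between observations, which is the essence of clustering.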

One can imagine how limitless data gathering and analysis can be. It is a scary thought, but this is simply what the new world brings: a world where everything is connected and constantly gathering data.


Data mining creates a multitude of outcomes, each molded by the miner, the data engineer, to meet stakeholder requirements. Unlike traditional mining, the value can take a different form with each data set, yielding many possible outcomes. Adding a feature to a data set, or slightly varying how the feature set is mined in terms of patterns, classifications, or clustering, can change the form of the mined value each time.

A Fortune Magazine article [4] defined the sea of data we now have available as the new oil, the new gold rush. It is no wonder that organizations are collecting data: it strengthens artificial intelligence by providing enough training data to yield highly insightful information. Insights we have not seen before. Insights that will change the way we do things, from the average number of steps we take in a day to addressing complex social issues. In fact, many organizations offer free services to users and consumers for the sole purpose of collecting data that can be parsed and mined for patterns. The more data they collect, the larger their mountain, their premium land from which to extract oil or gold.


This is the first of many blogs on my journey with data mining, and data science in general. Next, I plan to blog about each of the topics I have outlined here, covering many aspects of data analysis: frequent pattern mining, data classification, and cluster analysis, extending into how they are used in statistical computing and machine learning to derive gold, that is… knowledge.



[1] US National Library of Medicine, Superior pattern processing is the essence of the evolved human brain, July 2014.

[2] IBM, Industry Insights, 2.5 Quintillion Bytes of Data, July 2013.

[3] Data Mining Concepts and Techniques, 3rd Edition. By Jiawei Han, Micheline Kamber, and Jian Pei

[4] Fortune Magazine, Why Data is the New Oil, July 2016.