Mining the World for Data

Data science resembles how the mind works. We are, after all, extracting knowledge and deriving from it an understanding of the world around us, which is why the field borrows so much from our own mental processing. Deep learning and neural networks, for example, are loosely modeled on how the brain works. Supervised and unsupervised machine learning likewise mirror how we process and learn from the information we gather. But before any learning or processing can take place, the first step is to collect and mine data.

The world and everything around us is data. This is probably why the human mind’s primary skill is making sense of the world by collecting and processing data. According to an article in the US National Library of Medicine [1], pattern processing is “the fundamental basis of most, if not all, unique features of the human brain including intelligence, language, imagination, and invention.” We make sense of the world by finding familiar patterns, classifying them into known sets based on past experience, or grouping them and applying generalizations when they don’t fit what we know. What we see, feel, and taste is all based on data collected by our five senses, which the mind processes with mathematical, algorithm-like steps we never notice.

 

Data Mining, Why?

Our gadgets collect data in the cloud, terabytes upon terabytes of it: a sea of data with varying potential, different shades of gold. These gadgets collect not only our emails, chat messages, and online documents, but even the way we interact with each application. All of it can be mined for valuable information about each interaction and each activity.

Data mining is like gold mining: digging through mountains of worthless rock and sediment, cleaning and sorting it, crushing it, then filtering it through a fine mesh until we can see that rare golden glow among the fine-grained sand and crushed rock. But what do we mine in data? What is our gold? What shines among the deep sediment of worthless noise in a sea of never-ending system logs and chat messages? Gold, of course, metaphorically speaking, but not the kind you melt down into jewelry. Gold, in this case, is whatever is valuable to a given entity; it all depends on the stakeholders. Organizations, for example, treat the ability to learn about their customers’ behavior as gold. Knowing it lets them target specific products for marketing, and then tailor manufacturing and inventory to market demand. In short, profit.

Today’s data is big, really big. Think terabytes upon terabytes of information with hundreds, perhaps even thousands, of attributes (also known as features). Big data has four traits: volume, variety, velocity, and veracity. These are known as the four V’s of Big Data.

Volume is the extremely large amount of data we have generated since the inception of the internet. That data comes in thousands of forms, such as chat data, emails, photos, and videos; this is variety, the second V. Third is velocity, the speed at which we generate data: according to IBM industry research [2], 2.5 quintillion bytes of data are created every day. Last is veracity, the trustworthiness of the data: can we rely on it as a source of information for predictive analysis, so that our predictions yield expected and repeatable results?
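
To put the velocity figure in more familiar units, here is a small back-of-the-envelope conversion. It is only a sketch: it assumes decimal (SI) units, where one terabyte is 10^12 bytes and one exabyte is 10^18 bytes, and the variable names are my own.

```python
# Figure quoted in the text: 2.5 quintillion bytes created per day.
BYTES_PER_DAY = 2_500_000_000_000_000_000  # 2.5 * 10**18

# Assumption: decimal (SI) units, 1 TB = 10**12 bytes, 1 EB = 10**18 bytes.
TERABYTE = 10**12
EXABYTE = 10**18
SECONDS_PER_DAY = 24 * 60 * 60

terabytes_per_day = BYTES_PER_DAY / TERABYTE
exabytes_per_day = BYTES_PER_DAY / EXABYTE
terabytes_per_second = terabytes_per_day / SECONDS_PER_DAY

print(f"{exabytes_per_day:.1f} EB per day")
print(f"{terabytes_per_day:,.0f} TB per day")
print(f"{terabytes_per_second:.1f} TB per second")
```

Under those assumptions, the quoted figure works out to about 2.5 exabytes, or 2.5 million terabytes, every day.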

There is a fifth V in Big Data characteristics: Value. Some online materials and Big Data articles include it as an inherent characteristic of Big Data. However, I personally do not consider it part of Big Data. I consider it the “outcome” of data mining, the potential of Big Data. As I said earlier, it’s the gold we mine in data mining: the knowledge.

Data mining, therefore, turns datasets into knowledge.

I consider Value the outcome of Big Data rather than a part of it because it is something data miners must derive from the data, in line with organizational or stakeholder goals. The value must answer what matters to them: what can potentially yield profit, directly or indirectly.

Deriving value out of Big Data can be a daunting task, given its characteristics. However, breaking the processes and procedures down into small, manageable pieces minimizes the complexity and gives data miners and stakeholders transparency into how the value was derived. Knowledge discovery can be broken down into these high-level areas: 1) data pre-processing, 2) data analysis, and 3) data post-processing. The subtasks under each of these areas are illustrated below [3]:

It’s an iterative process from pre-processing to post-processing, with analysis and post-processing iterating the most in the discovery pipeline.
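
As a rough illustration of the three areas, here is a minimal sketch in Python. The record layout (`user`, `item`), the function names, and the `min_count` threshold are all hypothetical choices of mine, and each stage is reduced to a single toy step; a real pipeline would do far more in every stage.

```python
from collections import Counter


def preprocess(records):
    """Pre-processing: drop records with missing fields and normalize text."""
    return [
        {"user": r["user"].strip().lower(), "item": r["item"].strip().lower()}
        for r in records
        if r.get("user") and r.get("item")
    ]


def analyze(clean_records):
    """Analysis: count how often each item appears."""
    return Counter(r["item"] for r in clean_records)


def postprocess(item_counts, min_count):
    """Post-processing: keep only items seen at least min_count times,
    most common first."""
    return [(item, n) for item, n in item_counts.most_common() if n >= min_count]


# Hypothetical raw data with the kinds of flaws pre-processing has to handle.
raw = [
    {"user": "Ann", "item": "Coffee "},
    {"user": "Bob", "item": "coffee"},
    {"user": "", "item": "tea"},
    {"user": "Cy", "item": "Tea"},
    {"user": "Dee", "item": None},
]

report = postprocess(analyze(preprocess(raw)), min_count=2)
print(report)
```

In practice you would rerun the later stages with different thresholds or questions, which is where the iteration comes from.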

Data Mining, How?

In data mining, the gold is not the shine in a pan of wet dirt being filtered through nets or cloth. It is the patterns that can predict future outcomes, the knowledge we gain from mining. It is the customer buying patterns that help marketers identify the next best product offering for certain demographics. It’s the Fitbit tracker: where you’ve been, how you got there, and how fast tell companies about your lifestyle, and with it the restaurants and coffee shops you frequent. Knowing where you go for breakfast, lunch, or dinner, coupled with a few more data features (weather conditions, for example), allows marketers to predict what products you’ll be buying next.

These outcomes depend on three general categories in data mining [3]. Pattern mining is finding frequently occurring patterns in sets of data, as the name suggests. Classification is a form of data analysis that extracts models describing important data classes; such models, called classifiers, predict categorical (discrete, unordered) class labels. Data clustering is the process of grouping data objects into sets so that the objects in a cluster are similar to one another and dissimilar to objects in other clusters. It is useful when the class labels are not known in advance and cannot be defined up front. Instead, a model looks for common attributes across observations, measuring the similarity or dissimilarity of data points in a data set. This can lead to the discovery of previously unknown groups or patterns within the data.
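
To make the three categories a little more concrete, here is a toy sketch of each using only the Python standard library. These are deliberately simple stand-ins that I picked for illustration (pair counting for pattern mining, a 1-nearest-neighbor rule for classification, a few rounds of k-means for clustering), not the specific algorithms from [3], and all the data and names below are made up.

```python
from collections import Counter
from itertools import combinations
from math import dist


def frequent_pairs(baskets, min_support):
    """Pattern mining: item pairs found together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def classify(point, labeled_points):
    """Classification: return the label of the closest labeled example (1-NN)."""
    nearest = min(labeled_points, key=lambda item: dist(point, item[0]))
    return nearest[1]


def cluster(points, centers, rounds=10):
    """Clustering: assign each point to its nearest center, then move each
    center to the mean of its group; returns the groups from the last round."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[idx].append(p)
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return groups


# Hypothetical purchase baskets.
baskets = [
    ["coffee", "bagel"],
    ["coffee", "bagel", "juice"],
    ["coffee", "muffin"],
    ["bagel", "juice"],
]
pairs = frequent_pairs(baskets, min_support=2)

# Hypothetical labeled examples: (average speed, daily distance) -> label.
labeled = [((1.0, 1.0), "walker"), ((9.0, 8.0), "runner")]
label = classify((8.0, 7.5), labeled)

# Hypothetical unlabeled points and two starting centers.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5)]
groups = cluster(points, centers=[(0, 0), (10, 10)])

print(pairs)
print(label)
print(groups)
```

Note the difference in inputs: classification needs labels that someone already assigned, while clustering starts with nothing but the points and discovers the groups on its own.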

One can imagine how limitless data gathering and analysis can be. It is a scary thought, but this is simply what the new world brings us, the world where everything is connected and constantly gathering data.

Therefore…

Data mining creates a multitude of outcomes, each molded by the miner, the data engineer, to meet stakeholder requirements. Unlike in traditional mining, the value can take different forms, with each data set yielding many possible outcomes. Adding a feature to a data set, or slightly varying how the feature set is mined for patterns, classifications, or clusters, can make the mined value take a different form each time.

A Fortune Magazine article [4] describes the sea of data we now have available as the new oil, the new gold rush. It is no wonder that organizations are now collecting data. Data strengthens artificial intelligence by giving it enough training data to produce highly insightful information: insights that we have not seen before, insights that will change the way we do things, from the average number of steps we take in a given day to addressing complex social issues. In fact, many organizations offer free services to users and consumers for the sole purpose of collecting data that can be parsed and mined for patterns. The more data they collect, the larger their mountain, their premium land from which to extract oil or gold.

 

This is the first of many blog posts about my journey with data mining, and data science in general. Next, I plan to write about each of the topics I have outlined here, covering many aspects of data analysis: frequent pattern mining, data classification, and cluster analysis, and how they are used in statistical computing and machine learning to derive gold, that is… knowledge.

 

References:

[1] US National Library of Medicine, Superior pattern processing is the essence of the evolved human brain, July 2014.

[2] IBM, Industry Insights, 2.5 Quintillion Bytes of Data, July 2013.

[3] Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition.

[4] Fortune Magazine, Why Data is the New Oil, July 2016.
