Frequent Pattern Mining – Basic Concepts

Frequent pattern mining searches for recurring relationships in a given data set for the discovery of interesting associations and correlations between item sets in transactional and relational databases [1]. That sounds technical. Of course it is; it's from a textbook. But it's basically finding patterns in data tuples (a tuple being a row, a record in a data set), such as customer transactions, with each transaction stored as a payment entry in the database. Pattern mining can be as simple as finding items that are commonly purchased together, such as coffee and creamer. As simple as this example may be, it is the basic premise of "market basket analysis" software used to identify trends in customer buying behavior, which can lead to the development of recommendation systems in purchasing platforms.

Frequent Pattern Mining, How?

These patterns can be represented using association rules, each annotated with support and confidence measures that are compared against thresholds set by domain experts to determine what constitutes an interesting pattern. Using our coffee and creamer example, the rule is written as follows:


Coffee ⇒ Creamer [support = 2%, confidence = 60%]

Both support and confidence are measures of an association's interestingness. Are the support and confidence high enough to justify a marketing strategy that can lead to higher sales?

The 2% support in this example means that 2% of all transactions show coffee and creamer purchased together, while the 60% confidence means that 60% of the customers who bought coffee also bought creamer. Support and confidence are typically considered interesting if they meet a minimum support threshold and a minimum confidence threshold. These are the thresholds set by the domain expert, a person who understands the business.

And this is how support and confidence are calculated:

support (A ⇒ B) = P(A ∪ B)
confidence (A ⇒ B) = P(B | A)

A ⇒ B is read as "if A then B", with A and B each being an item or a set of items. P(A ∪ B) is read as the probability (P) of A union B, that is, the probability that a transaction contains both A and B. P(B | A) is read as the probability of B given A: of the transactions that contain A, the fraction that also contain B.
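To make the arithmetic concrete, here is a minimal Python sketch of the two measures. The transaction list is a made-up toy example, and the function names are my own:

```python
# Toy transaction data (hypothetical, for illustration only).
transactions = [
    {"coffee", "creamer", "bread"},
    {"coffee", "creamer"},
    {"coffee", "milk"},
    {"bread", "butter"},
    {"creamer", "sugar"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A ∪ B) / support(A): of the transactions with A, how many also have B."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"coffee", "creamer"}, transactions))       # 2/5 = 0.4
print(confidence({"coffee"}, {"creamer"}, transactions))  # 2/3 ≈ 0.67
```

With this toy data, coffee ⇒ creamer has 40% support (2 of 5 transactions contain both) and roughly 67% confidence (2 of the 3 coffee transactions also contain creamer).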

Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong, where the occurrence frequency (or support count) of an itemset is the number of transactions that contain it. By convention, support and confidence values are written as percentages between 0% and 100%.

In general, association rule mining can be viewed as a two-step process [2] (a naive sketch of both steps follows the list):

  1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min sup.
  2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence.
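Here is the naive sketch promised above: brute-force enumeration of frequent itemsets followed by rule generation. It is illustrative only; the data and thresholds are placeholders, and enumerating every item combination is exponential in the number of items, which is exactly the search that algorithms like Apriori exist to prune:

```python
from itertools import chain, combinations

transactions = [
    {"coffee", "creamer", "bread"},
    {"coffee", "creamer"},
    {"coffee", "milk"},
]
min_sup, min_conf = 0.5, 0.6  # placeholder thresholds a domain expert would set

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 1: find all frequent itemsets by brute-force enumeration.
items = set(chain.from_iterable(transactions))
frequent = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(frozenset(c)) >= min_sup
]

# Step 2: generate strong rules A => B by splitting each frequent itemset.
for itemset in frequent:
    for k in range(1, len(itemset)):
        for a in combinations(itemset, k):
            A = frozenset(a)
            B = itemset - A
            conf = support(itemset) / support(A)
            if conf >= min_conf:
                print(f"{set(A)} => {set(B)} "
                      f"[support={support(itemset):.0%}, confidence={conf:.0%}]")
```

On this toy data the sketch finds {coffee}, {creamer}, and {coffee, creamer} as frequent, and prints both coffee ⇒ creamer and creamer ⇒ coffee as strong rules.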

To those more familiar with RDBMS (relational database management systems), like I am, these basic frequent pattern mining concepts can be applied using traditional SQL statements in a relational database. However, SQL is limited when it comes to complex, big-data-scale processing. As frequent itemset combinations increase, along with cross-referencing other data or applying additional rules in frequency calculation, scalability becomes an issue. Processing, associating, and correlating large amounts of data requires significant computing resources. This is why Python, along with distributed computing platforms like Apache Spark, has become popular in recent years, as has R, the lingua franca [3] of statistical computing, which I'll be blogging about in the near future: computational statistics with Python and R.
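For a sense of what that scaling path looks like, Spark ships a distributed frequent pattern miner, FPGrowth in pyspark.ml.fpm (it implements FP-growth rather than Apriori, but it mines the same frequent itemsets and association rules). This is a rough sketch, assuming a working PySpark installation; the data and thresholds are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpm-demo").getOrCreate()

# Each row holds one transaction's list of items (toy data for illustration).
df = spark.createDataFrame(
    [(["coffee", "creamer", "bread"],),
     (["coffee", "creamer"],),
     (["coffee", "milk"],)],
    ["items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(df)

model.freqItemsets.show()      # frequent itemsets with their support counts
model.associationRules.show()  # rules with antecedent, consequent, confidence
```

The appeal is that the same few lines run unchanged whether the DataFrame holds three toy transactions or millions spread across a cluster.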

In summary, frequent pattern mining is conceptually simple but can grow into highly complex pattern identification. The intuition is simple: patterns of words in an email, such as "Viagra" or "free tickets", might flag it as spam. In this blog I covered support and confidence and their computation using probabilities (I did not cover how to compute the probabilities themselves, but the math is straightforward; it involves ratios and percentages).

What's next in this blog series is the Apriori algorithm, the most basic algorithm for mining frequent itemsets, which builds on the concepts I have outlined above.


References:

[1] Han, Jiawei; Kamber, Micheline; Pei, Jian. Data Mining: Concepts and Techniques, 3rd Edition. Morgan Kaufmann, 2011.

[2] Han, Jiawei; Kamber, Micheline; Pei, Jian. Data Mining: Concepts and Techniques, 3rd Edition, Chapter 6.1, "Basic Concepts". Morgan Kaufmann, 2011.

[3] Vance, Ashlee (2009-01-06). "Data Analysts Captivated by R's Power". The New York Times. Retrieved 2018-08-06. "R is also the name of a popular programming language used by a growing number of data analysts inside corporations and academia. It is becoming their lingua franca…"
