Apriori, an Algorithm for Frequent Pattern Mining

Frequent pattern mining can be implemented in many ways. It can be as simple as scanning a data array and counting occurrences, aggregating those counts against another column, and checking whether the total meets a certain threshold. It can be as complex as cross-referencing data across multiple features and applying conditional logic driven by some external dataset. In data science, however, the simplest form of frequent pattern mining is the Apriori algorithm.

The Apriori algorithm generates strong association rules from frequent itemsets by using prior knowledge of itemset properties. It is an iterative approach that scans the database several times: the first scan accumulates the counts of the frequent 1-itemsets, L1, which is used to find the frequent 2-itemsets, L2, then L3, and so on until no more frequent itemsets can be found.

These repeated scans can consume large amounts of system resources and take a significant amount of time to complete. In my experience, a program that I wrote in Python took three days to run as it scanned through several datasets totaling several gigabytes in size. Indeed, it was a naïve solution and there were plenty of areas where the algorithm could have been improved, but it was a class project and I had to shift focus to other areas of the project. As long as it runs, I have time to wait, and I know the output is accurate, why mess with it? That was my thinking. But I digress. Improving the Apriori algorithm is a large subject on its own and beyond the scope of this blog. Perhaps later.

The Apriori algorithm, however, does have a property that makes it efficient by minimizing the scan space. The apriori property reduces the scan space by eliminating every superset of a non-frequent itemset from further consideration during a scan: if an itemset is not frequent, nothing built on top of it can be frequent either. To illustrate this simply, consider how binary search eliminates half of the remaining data once the middle value is compared against the search value, significantly shrinking the scan space. The same intuition applies to the apriori property, which skips portions of the candidate space during a scan by only counting candidates whose subsets are all frequent and ignoring the rest.

Formally, the apriori property is: “All nonempty subsets of a frequent itemset must also be frequent.” [1]
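The contrapositive is what typically gets used in code: if any (k−1)-subset of a candidate is not frequent, the candidate itself cannot be frequent and can be skipped before any counting. Below is a minimal sketch of that check in Python; the function name `has_infrequent_subset`, the item labels, and the data are mine, purely for illustration.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """Return True if any (k-1)-item subset of `candidate` is not among the
    frequent (k-1)-itemsets, meaning the candidate can be pruned outright."""
    k = len(candidate)
    return any(
        frozenset(sub) not in frequent_prev
        for sub in combinations(candidate, k - 1)
    )

# {I1, I2} was found infrequent, so any candidate containing it can be
# discarded without another database scan.
frequent_2 = {frozenset({"I1", "I3"}), frozenset({"I2", "I3"})}
print(has_infrequent_subset(frozenset({"I1", "I2", "I3"}), frequent_2))  # True
```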


Apriori Algorithm, How?

The algorithm is a 2-step process[1]: the Join step and the Prune step. In the join step, items are “joined” together into itemsets, with one more item added with each iterative scan. A count is performed for each candidate itemset and accumulated in memory as the candidate list Ck; after pruning, the surviving itemsets form the list Lk, with the subscript “k” indicating the number of items in each itemset.

In the prune step, all items or itemsets that do not meet the minimum support threshold (the minimum number of occurrences designated by a knowledge worker or domain expert) are removed.

These two steps are performed iteratively: start with a count of each item in the dataset, then add (i.e., join) the next item, count each occurrence of the resulting itemset, then rescan and exclude (i.e., prune) any itemset that does not meet the minimum support requirement.
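As a rough sketch, one join-and-prune round might look like the following in Python. The function names and the tiny example data are mine, not from the textbook, and this is the naïve count-everything version rather than an optimized one.

```python
from collections import Counter

def join_step(prev_frequent, k):
    """Join frequent (k-1)-itemsets to form the candidate k-itemsets, Ck."""
    prev = list(prev_frequent)
    return {a | b for i, a in enumerate(prev) for b in prev[i + 1:] if len(a | b) == k}

def prune_step(candidates, transactions, min_sup):
    """Count each candidate's occurrences across the transactions and keep
    only those meeting the minimum support, producing Lk."""
    counts = Counter(c for t in transactions for c in candidates if c <= t)
    return {c: n for c, n in counts.items() if n >= min_sup}

# Example: pair up frequent 1-itemsets and keep pairs seen at least twice.
transactions = [frozenset(t) for t in ({"I1", "I2"}, {"I1", "I2"}, {"I2", "I3"})]
l1 = {frozenset({"I1"}), frozenset({"I2"}), frozenset({"I3"})}
print(prune_step(join_step(l1, 2), transactions, min_sup=2))  # -> {frozenset({'I1', 'I2'}): 2}
```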

In writing, this can be confusing, but the illustration below helps visualize the process.

A scan is performed on the dataset, D, counting the occurrences (Sup. Count) of each item to build the itemset table C1. This is box 1 above, with item I1 showing a frequency (Sup. Count) of 6, I2 a frequency of 7, and so on. This generated table is called Candidate 1, hence the variable C1.

Next is the pruning step, where we generate the itemset list L1 from C1, excluding any itemset that does not meet minimum support. In the example diagram above, no itemset is eliminated since the minimum support threshold in this example is 2.
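The same two steps are easy to express in code. Here is a small sketch using made-up transactions (not necessarily the diagram's exact data) and the same minimum support of 2:

```python
from collections import Counter

# Hypothetical transactions, for illustration only.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_sup = 2

# Scan D once, counting every single item: this table is C1.
c1 = Counter(item for t in transactions for item in t)

# Prune C1 against min_sup to get L1 (here nothing is eliminated).
l1 = {item: count for item, count in c1.items() if count >= min_sup}
print(l1)  # e.g. {'I1': 6, 'I2': 7, 'I3': 6, 'I4': 2, 'I5': 2}
```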

The next step is generating Candidate 2, C2, for the next round of iteration (same as above – scanning and joining, then pruning), using L1. This is illustrated in the diagram below:

Using L1, we join its items to form the candidate pairs C2, then perform a scan and count for each candidate, i.e., the number of transactions in which the pair {I1, I2}, {I1, I3}, etc. is found. We then check whether each count meets the minimum support requirement and create the next list of itemsets, L2, a pruned list that only includes itemsets meeting the minimum support requirement.
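Continuing the sketch, generating L2 is the same count-and-filter idea applied to pairs. Again, the helper name and the data are mine, purely illustrative:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(l1_items, transactions, min_sup):
    """Join the frequent single items into candidate pairs (C2), count the
    transactions containing each pair, and keep those meeting min_sup (L2)."""
    c2 = [frozenset(p) for p in combinations(sorted(l1_items), 2)]
    counts = Counter(pair for t in transactions for pair in c2 if pair <= t)
    return {pair: n for pair, n in counts.items() if n >= min_sup}

# Same made-up transactions as above.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
print(frequent_pairs(["I1", "I2", "I3", "I4", "I5"], transactions, min_sup=2))
```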

This iteration of joining and pruning continues until no further frequent itemsets, Lk, can be generated. The resulting Lk lists can then be used for further processing and analysis.
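Put together, the whole loop fits in a short function. This is a naïve sketch of the iteration described above (it counts every candidate rather than using the textbook's optimized candidate generation), run against the same made-up data:

```python
from collections import Counter

def apriori(transactions, min_sup):
    """Return {k: Lk} for every k until no more frequent itemsets are found."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: C1 counts, pruned to L1.
    counts = Counter(item for t in transactions for item in t)
    lk = {frozenset([i]): n for i, n in counts.items() if n >= min_sup}
    all_frequent, k = {1: lk}, 2

    while lk:
        # Join: merge frequent (k-1)-itemsets that differ by one item into Ck.
        prev = list(lk)
        ck = {a | b for i, a in enumerate(prev) for b in prev[i + 1:] if len(a | b) == k}
        # (A fuller version would also drop any candidate that has an
        # infrequent subset here, per the apriori property, before counting.)

        # Prune: rescan and keep only candidates meeting min_sup, giving Lk.
        counts = Counter(c for t in transactions for c in ck if c <= t)
        lk = {c: n for c, n in counts.items() if n >= min_sup}
        if lk:
            all_frequent[k] = lk
        k += 1

    return all_frequent

# Tiny usage with the same made-up data as above.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
for k, lk in apriori(transactions, min_sup=2).items():
    print(k, lk)
```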

Going back to the apriori property mentioned earlier, and how it relates to this algorithm: the pruning process is the application of that property. Because of the apriori property, the algorithm never extends or rescans non-frequent itemsets; it only scans and counts candidates whose subsets are themselves frequent itemsets.

That is the Apriori algorithm. However, to make use of the frequent itemsets found, association rules can be generated from them and applied to future transaction datasets. In my previous post I mentioned strong association rules, where strong means that the rules satisfy both the minimum support and the minimum confidence set by a domain expert. Support and confidence are calculated as follows:

support (A ⇒ B) = P(A ∪ B)
confidence (A ⇒ B) = P(B | A)

To extend this further, P(B | A) is computed as:

P(B | A) = support_count(A ∪ B) / support_count(A)

The support count of an itemset is the number of transactions that contain it: support_count(A ∪ B) is the count of all transactions containing every item of both A and B, and support_count(A) is simply the count of transactions containing the items of A.
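As a quick, hedged illustration using the made-up counts from the sketches above (the itemset {I1, I2} appears in 4 of the 9 transactions and {I1} in 6):

```python
def support(support_count_ab, num_transactions):
    """support(A => B) = P(A ∪ B): fraction of transactions containing A and B."""
    return support_count_ab / num_transactions

def confidence(support_count_ab, support_count_a):
    """confidence(A => B) = P(B | A) = support_count(A ∪ B) / support_count(A)."""
    return support_count_ab / support_count_a

# I1 => I2 with the illustrative counts above.
print(support(4, 9))     # ≈ 0.44
print(confidence(4, 6))  # ≈ 0.67
```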

Association rules can now be generated based on this equation by generating all nonempty proper subsets, s, of the frequent itemset, l. For every such s of l we can output the rule “s ⇒ (l − s)” …

If…

support_count(l) / support_count(s)

… is greater than or equal to the minimum confidence threshold.
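A short sketch of that rule-generation step, with made-up support counts for one frequent itemset (the helper name and the numbers are mine):

```python
from itertools import combinations

def generate_rules(l, support_counts, min_conf):
    """For a frequent itemset l, output every rule s => (l - s) whose
    confidence = support_count(l) / support_count(s) meets min_conf."""
    rules = []
    for size in range(1, len(l)):
        for s in map(frozenset, combinations(l, size)):
            conf = support_counts[l] / support_counts[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Illustrative support counts for l = {I1, I2, I5} and its subsets.
support_counts = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}
for lhs, rhs, conf in generate_rules(frozenset({"I1", "I2", "I5"}), support_counts, 0.7):
    print(lhs, "=>", rhs, f"confidence={conf:.2f}")
```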

Therefore…

This basic frequent pattern mining algorithm serves as the basis for more complex pattern mining. Because candidate and itemset generation are performed iteratively, statistical computing concepts can be applied to generate association rules, as well as correlations between itemsets cross-referenced across varying datasets.


References:

[1] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd Edition. Morgan Kaufmann, 2011.
