Finding Interesting Associations without Support Pruning
- Devising an efficient algorithm for finding rules that have extremely high confidence (close to 100%) but no minimum support requirement.
- The algorithm should place an emphasis on the case where the database is large and the data is disk-resident.
- In this paper, the primary focus is on the problem of identifying all pairs of columns (both LHS and RHS have only one attribute) with similarity (not confidence) exceeding a threshold.
A family of algorithms is presented for this problem, all of which follow a three-phase approach:
- Compute hash signatures: in a single pass over the table, generate a small hash signature for each column; together the signatures form a summary of the table that fits in main memory.
- Generate candidates: candidate pairs are generated from the column signatures in main memory.
- Prune candidates: make another pass over the original table, determining for each candidate pair whether it indeed has high similarity. (A runnable sketch of all three phases appears after the list of hashing schemes below.)
For the first two phases, two families of hashing schemes are developed to compute signatures:
- Min-Hashing schemes
- Locality-Sensitive Hashing schemes
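To make the three-phase structure concrete, here is a minimal runnable Python sketch using min-hashing. The column representation, parameter values, and function names are my own illustration under simplifying assumptions, not the paper's actual schemes; in the paper only the signatures live in main memory while the table stays on disk, whereas this toy keeps everything in memory.

    import random
    from itertools import combinations

    # Columns are represented as (non-empty) sets of row indices holding a 1.

    def min_hash_signature(column, permutations):
        # Phase 1: each signature component is the smallest permuted row
        # index among the column's 1-rows.
        return [min(perm[row] for row in column) for perm in permutations]

    def estimated_similarity(sig_a, sig_b):
        # Fraction of matching components; for min-hashing this estimates
        # |Ci AND Cj| / |Ci OR Cj|, the paper's similarity measure.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    def exact_similarity(col_a, col_b):
        return len(col_a & col_b) / len(col_a | col_b)

    def similar_pairs(columns, n_rows, k=100, threshold=0.5, seed=0):
        rng = random.Random(seed)
        perms = [rng.sample(range(n_rows), n_rows) for _ in range(k)]
        # Phase 1: one pass over the table yields small in-memory signatures.
        sigs = {c: min_hash_signature(col, perms) for c, col in columns.items()}
        # Phase 2: generate candidate pairs from the signatures alone.
        candidates = [(a, b) for a, b in combinations(columns, 2)
                      if estimated_similarity(sigs[a], sigs[b]) >= threshold]
        # Phase 3: a second pass over the table prunes false positives.
        return [(a, b) for a, b in candidates
                if exact_similarity(columns[a], columns[b]) >= threshold]

    cols = {"c1": set(range(0, 500)),     # overlaps c2 heavily (0.667)
            "c2": set(range(100, 600)),
            "c3": set(range(700, 900))}   # disjoint from both
    print(similar_pairs(cols, n_rows=1000))   # expected: [('c1', 'c2')]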
Synthetic data and real data are used to evaluate the performance of the different algorithms.
- Synthetic data: 10k columns; 10k-1000k rows; column densities 1%-5%.
- Real data: an HTTP request log with 13k columns, >200k rows, and column density < 0.01.
A news-article dataset is used to compare their algorithms with Apriori, by setting a low support threshold (0.1%-0.2%) for Apriori.
Extensive experiments are conducted to compare their four algorithms with respect to the trade-offs between speed and accuracy.
- Their algorithms outperform Apriori by nearly an order of magnitude; at low support (0.1%), Apriori runs out of memory.
- Regarding the comparisons among the four algorithms, I have no concise way to summarize the complex curves presented, which emphasize performance under different parameter settings. In general, a slower scheme is recommended to avoid false negatives, and a faster scheme is more suitable if speed is of greater concern than complete accuracy.
The authors' motivation for seeking high-confidence rules without any support requirement is that most rules with high support are obvious and well known, providing no new insight; the interesting rules may have very low support. Examples of applications:
- Copy detection: identifying identical or similar documents and web pages.
- Clustering: identifying similar vectors in high-dimensional spaces for clustering data.
- Collaborative filtering: tracking user behavior and making recommendations to individuals based on the similarity of their preferences to those of other users.
- Detecting causality.
Limitations of the approach:
- Their data is restricted to a sparse 0/1 matrix (i.e., all attributes are boolean), a simplified form of basket data; there is no pre-defined class attribute.
- They only deal with pairs of columns; that is, the LHS and the RHS each have exactly one attribute.
- They introduce a symmetric, bi-directional measure of interest, similarity, instead of using confidence: the similarity of Ci and Cj is the fraction of rows, among those containing a 1 in either Ci or Cj, that contain a 1 in both Ci and Cj, i.e., similarity(Ci, Cj) = |Ci AND Cj| / |Ci OR Cj| (the Jaccard coefficient; a toy computation appears after this list).
- There is a trade-off between accuracy and speed. If the hash scheme is required to be very fast and to produce small signatures, it risks false negatives (highly similar pairs are missed); on the other hand, avoiding false negatives tends to increase false positives (candidate pairs that are not really highly similar), which increases the time needed in the pruning phase and may yield a candidate set too large to fit in main memory.
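Two small illustrations of the points above: part (1) computes the similarity measure on two hypothetical columns; part (2) shows the accuracy/speed trade-off using the banding construction standard in locality-sensitive hashing (the paper's actual schemes differ in their details, and all numbers here are made up).

    # (1) The similarity measure on two hypothetical columns, represented
    #     as sets of the row indices that contain a 1.
    ci, cj = {1, 2, 3, 4}, {3, 4, 5}
    print(len(ci & cj) / len(ci | cj))   # |Ci AND Cj| / |Ci OR Cj| = 2/5 = 0.4

    # (2) With b bands of r signature components each, a pair with
    #     similarity s becomes a candidate with probability 1 - (1 - s**r)**b.
    def candidate_probability(s, r, b):
        return 1 - (1 - s ** r) ** b

    for r, b in [(5, 20), (10, 10), (20, 5)]:   # same total signature length
        print(f"r={r:2d}, b={b:2d}: "
              f"P(s=0.9)={candidate_probability(0.9, r, b):.3f}, "
              f"P(s=0.3)={candidate_probability(0.3, r, b):.3f}")
    # Small r misses few truly similar pairs (few false negatives) but lets
    # many dissimilar pairs through (more false positives for the pruning
    # pass); large r does the opposite.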
The authors' defense of the potentially large rule output is as follows:
- In most cases, the output rules are intended not for human analysis but for automated analysis (however, the automated analysis mechanism is not discussed).
- Requiring extremely high confidence (close to 100%) substantially reduces the output size and leaves in only the rules that are of interest.
- This paper proposes a set of fancy and complicated algorithms, all of which are too restricted: both the input data and the output rules are confined to a limited form. Their research is at a conceptual rather than a practical level. From this point of view, they do not really outperform Apriori, since they cannot generate really useful rules.
- They invent another measure, similarity, for mining rules. Although they say that one of their four algorithms can be extended to discover confidence-based association rules of the type Ci -> Cj, their output consists of similarity rules rather than the confidence rules they claim as their main goal.
- Extensions beyond pairs of columns and beyond simple boolean attributes will suffer an exponential overhead in the number of columns, which would wipe out the efficiency of algorithms designed with an emphasis on disk-resident databases.
By reading this paper, I can see a clear distinction between our research and their approach: they place their emphasis on pure algorithms that deal with
- no support requirement
- a disk-based mining strategy
while we place our emphasis on finding
- rules that have convincing predictability
- a small number of variables that can be practically identified and applied to real domains.