Tuesday, June 29, 2010

Efficient Parallel Set-Similarity Joins Using MapReduce


Abstract
1. Introduction
2. Preliminaries
2.1 MapReduce
2.2 Parallel Set-Similarity Joins
2.3 Set-Similarity Filtering
3. Self-Join Case
3.1 Stage 1: Token Ordering
3.1.1 Basic Token Ordering (BTO)
3.1.2 Using One Phase to Order Tokens (OPTO)
3.2 Stage 2: RID-Pair Generation
3.2.1 Basic Kernel (BK)
3.2.2 Indexed Kernel (PK)
3.3 Stage 3: Record Join
3.3.1 Basic Record Join (BRJ)
3.3.2 One-Phase Record Join (OPRJ)
4. R-S Join Case
5. Handling Insufficient Memory
- Map-Based Block Processing
- Reduced-Based Block Processing
- Handling R-S Joins
6. Experimental Evaluation
6.1 Self-Join Performance
6.1.1 Self-Join Speedup
6.1.2 Self-Join Scaleup
6.1.3 Self-Join Summary
6.2 R-S Join Performance
6.2.1 R-S Join Speedup
6.2.2 R-S Join Scaleup
7. Related Work
8. Conclusions
9. References

Appendix

A. Self-Join Algorithms
B. Experimental Results
- Self-Join Performance
- R-S Join Performance