Tuesday, July 6, 2010

Ten Simple Rules (Research Field)

Mathematical Concepts

1 The Axiom of Choice
2 The Axiom of Determinacy
3 Bayesian Analysis
4 Braid Groups
5 Buildings
6 Calabi-Yau Manifolds
7 Cardinals
8 Categories
9 Compactness and Compactification
10 Computational Complexity Classes
11 Countable and Uncountable Sets
12 C*-Algebras
13 Curvature
14 Designs
15 Determinants
16 Differential Forms and Integration
17 Dimension
18 Distributions
19 Duality
20 Dynamical Systems and Chaos
21 Elliptic Curves
22 The Euclidean Algorithm and Continued Fractions
23 The Euler and Navier-Stokes Equations
24 Expanders
25 The Exponential and Logarithmic Functions
26 The Fast Fourier Transform
27 The Fourier Transform
28 Fuchsian Groups
29 Function Spaces
30 Galois Groups
31 The Gamma Function
32 Generating Functions
33 Genus
34 Graphs
35 Hamiltonians
36 The Heat Equation
37 Hilbert Spaces
38 Homology and Cohomology
39 Homotopy Groups
40 The Ideal Class Group
41 Irrational and Transcendental Numbers
42 The Ising Model
43 Jordan Normal Form
44 Knot Polynomials
45 K-Theory
46 The Leech Lattice
47 L-Functions
48 Lie Theory
49 Linear and Nonlinear Waves and Solitons
50 Linear Operators and Their Properties
51 Local and Global in Number Theory
52 The Mandelbrot Set
53 Manifolds
54 Matroids
55 Measures
56 Metric Spaces
57 Models of Set Theory
58 Modular Arithmetic
59 Modular Forms
60 Moduli Spaces
61 The Monster Group
62 Normed Spaces and Banach Spaces
63 Number Fields
64 Optimization and Lagrange Multipliers
65 Orbifolds
66 Ordinals
67 The Peano Axioms
68 Permutation Groups
69 Phase Transitions
70 pi
71 Probability Distributions
72 Projective Space
73 Quadratic Forms
74 Quantum Computation
75 Quantum Groups
76 Quaternions, Octonions, and Normed Division Algebras
77 Representations
78 Ricci Flow
79 Riemann Surfaces
80 The Riemann Zeta Function
81 Rings, Ideals, and Modules
82 Schemes
83 The Schrödinger Equation
84 The Simplex Algorithm
85 Special Functions
86 The Spectrum
87 Spherical Harmonics
88 Symplectic Manifolds
89 Tensor Products
90 Topological Spaces
91 Transforms
92 Trigonometric Functions
93 Universal Covers
94 Variational Methods
95 Varieties
96 Vector Bundles
97 Von Neumann Algebras
98 Wavelets
99 The Zermelo-Fraenkel Axioms
100 Mathematical Statistics

Sunday, July 4, 2010

Machine Learning: A Probabilistic Approach


- A new textbook by Kevin Murphy (in preparation,
to be published by MIT Press, c. December 2011)


- Introduction to Machine Learning
- Publications by Kevin Murphy
- UBC Machine Learning Reading Group (MLRG)
- Kevin Murphy @ UBC (great resource for students)
- Kevin Murphy / Teaching
- Graduate, CS 540 (Machine learning) Fall 2008
- Graduate, Stat 406 (Algorithms for Classification and Prediction)

- Causal models as conditional density models
- Product Partition Models for Modelling
Changing Dependency Structure in Time Series



1 Introduction
1.1 Introduction
1.2 Classification
1.2.1 Example
1.2.2 Probabilistic output
1.2.3 Decision trees
1.2.4 K-nearest neighbor classifier
1.2.5 Logistic regression
1.2.6 Linear separability
1.2.7 Transforming the input variables
1.2.8 Kernel functions
1.2.9 Multinomial logistic regression
1.2.10 No free lunch theorem
1.2.11 Real world applications
1.3 Regression
1.3.1 Linear regression
1.3.2 Maximum likelihood and least squares
1.3.3 Real world applications
1.4 Unsupervised learning
1.4.1 Discovering clusters
1.4.2 Discovering latent factors
1.4.3 Discovering graph structure
1.4.4 Discovering relevant variables
1.4.5 Data compression
1.4.6 Data imputation
1.4.7 Anomaly detection
1.5 Other kinds of machine learning
1.5.1 Relational learning
1.5.2 Semi-supervised learning
1.5.3 Reinforcement learning
1.6 Overfitting
1.6.1 Examples of overfitting
1.6.2 The benefits of more data
1.6.3 Regularization
1.7 A probabilistic approach: what and why?
1.7.1 Degrees of Bayesianism
1.7.2 Representing uncertainty
1.7.3 Predicting the future
1.7.4 The mall N, large D problem
1.7.5 Model selection
1.7.6 Discussion and further reading

==============
I Probability
==============

2 Probability I
2.1 Introduction
2.2 What is probability?
2.3 Discrete random variables
2.4 Continuous random variables
2.5 Summaries of distributions
2.6 Some useful continuous distributions

3 Probability II
3.1 Introduction
3.2 Joint probability distributions
3.3 Functions of random variable
3.4 The exponential family *
3.5 Exchangeability *

4 The multivariate Gaussian and friends
4.1 Introduction
4.2 Definition and basic properties
4.3 Mahalanobis distance
4.4 Eigenanalysis of covariance matrices
4.5 Probabilistic inference
4.6 Information form *
4.7 Linear Gaussian systems
4.8 The multivariate Student T distribution *
4.9 Wishart distribution

5 Generative models for classification
5.1 Introduction
5.2 Discriminant analysis
5.3 Robust discriminant analysis
5.4 Naive Bayes classifiers
5.5 Log-sum-exp trick
5.6 Generative vs discriminative classifiers

6 Markov models
6.1 Introduction
6.2 Transition matrix
6.3 Language modeling
6.4 Stationary distribution
6.5 Google's PageRank algorithm
6.6 Conditions under w/c a stationary distribution exists: finite-state case
6.7 Conditions under w/c a stationary distribution exists: general case *

7 Directed graphical models (Unfinished)
7.1 Introduction
7.2 Example: medical diagnosis
7.3 Plate notation
7.4 Inference
7.5 Using DGMs to make generative classifiers
7.6 Conditional independence properties of DAGs
7.7 Casual interpretation of DAGs *

8 Information theory
8.1 Introduction
8.2 Entropy
8.3 Mutual information
8.4 Relative entropy (KL divergence)
8.5 Jensen's inequality and its consequences
8.6 Maximum entropy principle
8.7 Data compression, probability and entropy
8.8 Minimum description length (MDL) principle

==============
II Statistics
==============

9 Maximum likelihood estimation
9.1 Introduction
9.2 MLE for linear regression
9.3 MLE for univariate Gaussian distribution
9.4 MLE for binomial and Bernoulli distributions
9.5 MLE for multinomial and multinoulli distributions
9.6 MLE for a naive Bayes classifier
9.7 MLE for Markov models
9.8 MLE for DGMs
9.9 MLE for multivariate Gaussian
9.10 MLE for discriminant analysis
9.11 Sufficient statistics
9.12 MLE for the exponential family *
9.13 Some problems with maximum likelihood estimation

10 Bayesian inference for single parameter models
10.1 Statistical inference
10.2 Bayesian model of human concept learning
(inference of a discrete parameter)
10.3 Tossing coins (inference for the binomial rate parameter)
10.4 Summarizing posterior distributions
10.5 Why not just use MAP estimations? *
10.6 Priors
10.7 The marginal likelihood (evidence)

11 Bayesian inference for multi parameter models
11.1 Bayesian fitting of discrete variable models
11.2 Bayesian fitting of univariate Gaussian
11.3 Bayesian fitting of multivariate Gaussian *
11.4 Bayesian fitting of discriminant analysis models
11.5 Bayesian fitting of a linear regression model
11.6 Bayesian fitting of the exponential family *

12 Decision theory
12.1 Introduction
12.2 Bayesian decision theory
12.3 Frequentist decision theory *
12.4 Empirical risk minimization *
12.5 The false positive vs false negative tradeoff

13 Bayesian model selection and checking
13.1 Introduction
13.2 Bayes factor
13.3 Testing equality of two Binomial rate parameters
13.4 Testing equality of two Gaussian means (Bayesian T-tests) *
13.5 Testing for differentially expressed genes *
13.6 Testing for Markovianity
13.7 BIC approximation to log marginal likelihood
13.8 Jeffreys-Lindley paradox *
13.9 Model checking
13.10 Posterior-predictive p-values

14 Empirical and Hierarchical Bayes
14.1 Empirical Bayes
14.2 Empirical Bayes for linear regression
14.3 Hierarchical Bayes *
14.4 Multiple binomial variables
14.5 Multiple Gaussian variables
14.6 Bayesian fitting of Markov language models
14.7 Fitting multiple related linear regressions

================
III Computation
================

15 Unconstrained optimization
15.1 Introduction
15.2 Kinds of optimization problems
15.3 Local vs global optima
15.4 Computational complexity of finding an optimum *
15.5 Overview of algorithms
15.6 Examples of optimization problems
15.7 First-order methods
15.8 Second-order methods
15.9 Comparison of algorithms
15.10 Mathlab software
15.11 Gradient-free unconstrained optimization

16 Expectation maximization (EM) algorithm
16.1 Introduction
16.2 Basic idea
16.3 Example: computing MLE for MVN from partially observed data
16.4 Example: MLE for the multivariate Student distribution *
16.5 EM theory *
16.6 Bound optimization / MM algorithms
16.7 EM variants *
16.8 Convergence rate of EM *

17 Deterministic algorithms for approximate inference (Unfinished)
17.1 Introduction
17.2 Laplace approximation
17.3 Variational Bayes *
17.4 Bits-back coding
17.5 Variational free energy
17.6 The mean field method
17.7 VB for univariate Gaussian
17.8 Forward or reverse KL?
17.9 Software for variational inference

18 Stochastic algorithms for approximate inference (Unfinished)
18.1 Introduction
18.2 Generating random samples from a univariate distribution
18.3 Importance sampling (Unfinished)
18.4 Markov Chain Monte Carlo (MCMC)
18.5 Gibbs sampling
18.6 Metropolis Hastings algorithm
18.7 Accuracy of MCMC
18.8 Convergence
18.9 More advanced MCMC algorithms

19 Constrained optimization
19.1 Introduction
19.2 Lagrange multipliers and the KKT conditions
19.3 Examples
19.4 Primal-dual problems *
19.5 Overview of algorithms
19.6 Conjugate duality

20 Discrete optimization
20.1 Introduction
20.2 Dynamic programming
20.3 Branch and bound
20.4 Hill climbing/coordinate ascent
20.5 Stochastic local search
20.6 Tabu search
20.7 Evolutionary algorithms
20.8 Matlab software

============================
IV More supervised learning
============================

21 Linear regression
21.1 Introduction
21.2 Maximum likelihood and least squared revisited
21.3 Robust regression *
21.4 Censored regression *
21.5 Ridge regression revisited
21.6 Bayesian linear regression (unknown o^2)
21.7 Empirical Bayes for linear regression
21.8 Variational Bayes for linear regression *
21.9 Kernel smoothers *
21.10 Semi-parametric regression *

22 Generalized linear models
22.1 Introduction
22.2 Multinomial logistic regression
22.3 Bayesian logistic regression
22.4 Empirical Bayes for logistic regression
22.5 Variational Bayes for logistic regression
22.6 Generalized linear models
22.7 Probit regression *

23 Neural networks
23.1 Introduction
23.2 Kinds of neural networks
23.3 Parameter estimation, or, how to train a neural net
23.4 Implementation *
23.5 Bayesian neural networks *

24 Variable selection
24.1 Introduction
24.2 Detecting marginally relevant variables
24.3 Bayesian variable selection
24.4 Subset selection
24.5 l1 regularization
24.6 Variants on l1 regularization *
24.7 Non-convex regularization
24.8 Automatic relevancy determination (ARD)

25 Sparse kernel machines
25.1 Introduction
25.2 Kernels
25.3 The kernel trick
25.4 Support vector machines
25.5 Experimental comparison of kernel methods
25.6 When should you use kernel methods?

26 Gaussian processes (Unfinished)
26.1 Introduction
26.2 GPs for regression
26.3 GPs for classification (Unfinished)

27 Trees, forests and boosting (Unfinished)
27.1 Introduction

=============================
V More unsupervised learning
=============================

28 Mixture models
28.1 Introduction
28.2 ML estimation
28.3 MAP estimation
28.4 Selecting the number of mixture components
28.5 VB for mixtures of Gaussians
28.6 Non-parametric density estimation
28.7 Using mixture models for classification

29 Latent factor models (Unfinished)
29.1 Introduction
29.2 Factor analysis
29.3 Mixture of factor analyzers
29.4 Principal components analysis (PCA)
29.5 EPCA
29.6 Variational EM for discrete PPCA
29.7 Variational Bayes for PPCA

30 Hidden Markov models
30.1 What are HMMs?
30.2 The main inferential tasks, and some applications
30.3 The forwards-backwards algorithm
30.4 The Viterbi algorithm
30.5 Forwards filtering, backwards sampling
30.6 Parameter estimation
30.7 HMM variants *

31 State space models
31.1 Linear dynamical systems
31.2 Kalman filtering
31.3 Kalman smoothing
31.4 Paramter estimation *
31.5 Inference in nonlinear, non-Gaussian systems *
31.6 Particle filtering

=========================
VI More graphical models
=========================

32 Undirected graphical models (Unfinished)
32.1 Introduction
32.2 Representing a joint distribution with a DGM
32.3 Inference
32.4 Learning (parameter estimation)
32.5 MRFs
32.6 CI properties of UGMs
32.7 Representing the joint distribution of a UGM
32.8 Conditional random fields
32.9 Inference
32.10 Learning (parameter estimation)

33 Inference in graphical models (Unfinished)
33.1 Introduction
33.2 Message passing (belief propagation)
33.3 Message passing (belief propagation) on trees
33.4 Variable elimination
33.5 Junction tree algorithm
33.6 Computational hardness of discrete inference
33.7 Inference in Gaussian models
33.8 Approximate inference
33.9 Mean field revisited
33.10 Loopy belief propagation (Unfinished)
33.11 Expectation propagation (Unfinished)

34 Network structure learning (Unfinished)
34.1 Structure learning for trees
34.2 Structure learning for DAGMs
34.3 Structure learning for Gaussian UGMs
34.4 Structure learning for discrete UGMs
34.5 Dependency networks

===========================
VII Supplementary material
===========================

35 Notation
35.1 General notation
35.2 Graph theory notation
35.3 Probability distributions
35.4 List of commonly used abbreviations
35.5 Big-O notation

36 Non-Bayesian inference (frequentist statistics)
36.1 Introduction
36.2 Sampling distribution of estimators
36.3 Desirable properties of estimators
36.4 Confidence intervals
36.5 Hypothesis testing
36.6 Pathologies of frequentist statistics
36.7 Why isn't everyone a Bayesian?

37 Linear algebra
37.1 Introduction
37.2 Basic notation
37.3 Matrix Multiplication
37.4 Operations of matrices
37.5 Properties of matrices
37.6 Special kinds of matrices
37.7 Cholesky decomposition
37.8 QR decomposition
37.9 Eigenvalue decomposition (EVD)
37.10 Singular value decomposition (SVD)

38 Calculus
38.1 Single variable calculus
38.2 Matrix calculus

Math review (some), needing a handbook


Mathematics Handbook for Science and Engineering

1 Fundamentals, Discrete Mathematics
1.1 Logic
1.2 Set Theory
1.3 Binary Relations and Functions
1.4 Algebraic Structures
1.5 Graph Theory
1.6 Codes

2 Algebra

3 Geometry and Trigonometry

4 Linear Algebra
4.1 Matrices
4.2 Determinants
4.3 Systems of Linear Equations
4.4 Linear Coordinate Transformation
4.5 Eigenvalues, Diagonalization
4.6 Quadratic Forms
4.7 Linear Spaces
4.8 Linear Mappings
4.9 Tensors
4.8 Complex matrices

5 The Elementary Functions

6 Differential Calculus (one variable)
6.1 Some Basic Concepts
6.2 Limits and Continuity
6.3 Derivatives
6.4 Monotonicity, Extremes of Functions

7 Integral Calculus
7.1 Indefinite Integrals
7.2 Definite Integrals
7.3 Applications of Differential and Integral Calculus
7.4 Tables of Indefinite Integrals
7.5 Tables of Definite Integrals

8 Sequences and Series
8.1 Sequences of Numbers
8.2 Sequences of Functions
8.3 Series of Constant Terms
8.4 Series of Functions
8.5 Taylor Series
8.6 Special Sums and Series

9 Ordinary Differential Equations (ODE)
9.1 Differential Equations of the First Order
9.2 Differential Equations of the Second Order
9.3 Linear Differential Equations
9.4 Autonomous systems
9.5 General Concepts and Results
9.6 Linear Differential Equations

10 Multidimensional Calculus
10.1 The Space R^n
10.2 Surfaces. Tangent Planes
10.3 Limits and Continuity
10.4 Partial Derivatives
10.5 Extremes of Functions
10.6 Functions f:R^n -> R^m (R^n -> R^m)
10.7 Double Integrals
10.8 Triple Integrals
10.9 Partial Differential Equations

11 Vector Analysis
11.1 Curves
11.2 Vector Fields
11.3 Line Integrals
11.4 Surface Integrals

12 Orthogonal Series and Special Functions

13 Transforms

14 Complex Analysis

15 Optimizations
15.1 Calculus of Variations
15.2 Linear Optimizations
15.3 Integer and Combinatorial Optimization
15.4 Nonlinear Optimization
15.5 Dynamic Optimization

16 Numerical Analysis

17 Probability Theory
17.1 Basic Probability Theory
17.2 Probability Distributions
17.3 Stochastic Processes
17.4 Algorithms for Calculation of Probability Distributions
17.5 Simulation
17.6 Queueing Systems
17.7 Reliability
17.8 Tables

18 Statistics
18.1 Descriptive Statistics
18.2 Point Estimation
18.3 Confidence Intervals
18.4 Table for Confidence Intervals
18.5 Tests of Significance
18.6 Linear Models
18.7 Distribution-free Methods
18.8 Statistical Quality Control
18.9 Factorial Experiments
18.10 Analysis of life time (failure time) data
18.11 Statistical glossary

Wednesday, June 30, 2010

Graph Theory and Complex Networks

via @Werner / presentation slide


6. Network analysis
6.1 Vertex degrees
Degree distribution
Degree correlations
6.2 Distance statistics
6.3 Clustering coefficient
Some effects of clustering
Local view
Global view
6.4 Centrality

8. Random networks
7.1 Introduction
7.2 Classical random networks
Degree distribution
Other metrics for random graphs
7.3 Small worlds
7.4 Scale-free networks
Fundamentals
Properties of scale-free networks
Related networks

9. Social networks
9.1 Social network analysis: introduction
9.2 Some basic concepts
Centrality and prestige
Structural balance
Cohesive subgroups
Affiliation networks
9.3 Equivalence
Structural equivalence
Automorphic equivalence
Regular equivalence

The Fourth Paradigm: Data-Intensive Scientific Discovery

Head First Data Analysis


1. Introduction to Data Analysis: Break It Down
2. Experiements: Test Your Theories
3. Optimization: Take It to the Max
4. Data Visualization: Pictures Make You Smarter
5. Hypothesis Testing: Say It Ain't So
6. Bayesian Statistics: Get Past First Base
7. Subjective Probabilities: Numerical Belief
8. Heuristics: Analyze Like a Human
9. Histograms: The Shape of Numbers
10. Regression: Prediction
11. Error: Err Well
12. Relational Database: Can You Relate?
13. Cleaning Data: Impose Order

i. Leftovers: Top Ten Things (We Didn't Cover)
ii. Install R: Start R Up!
iii. Install Excel Analysis Tools: The ToolPak

Tuesday, June 29, 2010

Handbook of Statistical Analysis and Data Mining


I. History of Phases of Data Analysis,
Basic Theory, and Data Mining Process

1. The Background for Data Mining Practice
2. Theoretical Considerations for Data Mining
3. The Data Mining Process
4. Data Understanding and Preparation
5. Feature Selection
6. Accessory Tools for Doing Data Mining

II. The Algorithms in Data Mining and Text Mining,
The Organization of the Three Most Common Data
Mining Tools, and Selected Specialized Areas Using
Data Mining

7. Basic Algorithms for Data Mining: A Brief Overview
8. Advanced Algorithms for Data Mining
9. Text Mining and Natural Language Processing
10. The Three Most Common Data Mining Software Tools
11. Classification
12. Numerical Prediction
13. Model Evaluation and Enhancement
14. Medical Informatics
15. Bioinformatics
16. Customer Response Modeling
17. Fraud Detection

III. Tutorials - Step-by-Step Case Studies as A Staring
Point to Learn How to Do Data Mining Analyses

A. How to Use Data Miner Recipe
B. Data Mining for Aviation Safety
C. Predicting Movie Box-Office Receipts
D. Detecting Unsatisfied Customers: A Case Study
E. Credit Scoring
F. Churn Analysis
G. Text Mining: Automobile Brand Review
H. Predictive Process Control: QC-Data Mining
I. Business Administration in a Medical Industry
J. Clinical Psychology: Making Decision About
Best Therapy for a Client
K. Education-Leadership Training for Business
and Education
L. Dentistry: Facial Pain Study
M. Profit Analysis of the German Credit Data
N. Predicting Self-Reported Health Status Using
Artificial Neural Networks

IV. Measuring True Complexity, The "Right Model
for the Right Use", Top Mistakes, and the Future
of Analytics

18. Model Complexity (and How Ensembles Help)
19. The Right Model for the Right Purpose:
When Less is Good Enough
20. Top 10 Data Mining Mistakes
21. Prospects for the Future of Data Mining
and Text Mining as Part of Our Everyday Lives
22. Summary: Our Design

Data Mining: Practical Machine Learning Tools and Techniques


I. Machine learning tools and techniques

1. What's it all about?
2. Input: Concepts, instances, and attributes
3. Output: Knowledge representation
4. Algorithms: The basic methods
- Inferring rudimentary rules
- Statistical modeling
- Divide-and-conquer: Constructing decision trees
- Covering algorithms: Constructing rules
- Mining association rules
- Linear models
- Instance-based learning
- Clustering
5. Credibility: Evaluating what's been learned
6. Implementations: Real machine learning schemes
7. Transformations: Engineering the input and output
8. Moving on: Extensions and applications

Programming Collective Intelligence


1. Introduction to Collective Intelligence
- What is Collective Intelligence?
- What is Machine Learning
- Limits of Machine Learning
- Real-Life Examples
- Other Uses for Learning Algorithms
2. Making Recommendations
- Collaborative Filtering
- Collecting Preferences
- Finding Similar Users
- Recommending Items
- Matching Products
- Building a del.icio.us Link Recommender
- Item-Based Filtering
- Using the MovieLens Dataset
- User-Based or Item-Based Filtering?
3. Discovering Groups
4. Searching and Ranking
5. Optimization
6. Document Filtering
7. Modeling with Decision Trees
8. Building Price Models
- Building a Sample Dataset
- k-Nearest Neighbors
- Weighted Neighbors
- Cross-Validation
- Heterogeneous Variables
- Optimizing the Scale
- Uneven Distributions
- Using Real Data - the eBay API
- When to Use k-Nearest Neighbors
9. Advanced Classifications: Kernel Methods and SVMs
- Matchmaker Dataset
- Difficulties with the Data
- Basic Linear Classification
- Categorical Features
- Scaling the Data
- Understanding Kernel Methods
- Support-Vector Machines
- Using LIBSVM
- Matching on Facebook
10. Finding Independent Features
- A Corpus of News
- Previous Approaches
- Non-Negative Matrix Factorization
- Displaying the Results
- Using Stock Market Data
11. Evolving Intelligence
12. Algorithm Summary
- Bayesian Classifier
- Decision Tree Classifier
- Neural Networks
- Support-Vector Machines
- k-Nearest Neighbors
- Clustering
- Multidimensional Scaling
- Non-Negative Matrix Factorization
- Optimization

A. Third-Party Libraries
B. Mathematical Formulas

Efficient Parallel Set-Similarity Joins Using MapReduce


Abstract
1. Introduction
2. Preliminaries
2.1 MapReduce
2.2 Parallel Set-Similarity Joins
2.3 Set-Similarity Filtering
3. Self-Join Case
3.1 Stage 1: Token Ordering
3.1.1 Basic Token Ordering (BTO)
3.1.2 Using One Phase to Order Tokens (OPTO)
3.2 Stage 2: RID-Pair Generation
3.2.1 Basic Kernel (BK)
3.2.2 Indexed Kernel (PK)
3.3 Stage 3: Record Join
3.3.1 Basic Record Join (BRJ)
3.3.2 One-Phase Record Join (OPRJ)
4. R-S Join Case
5. Handling Insufficient Memory
- Map-Based Block Processing
- Reduced-Based Block Processing
- Handling R-S Joins
6. Experimental Evaluation
6.1 Self-Join Performance
6.1.1 Self-Join Speedup
6.1.2 Self-Join Scaleup
6.1.3 Self-Join Summary
6.2 R-S Join Performance
6.2.1 R-S Join Speedup
6.2.2 R-S Join Scaleup
7. Related Work
8. Conclusions
9. References

Appendix

A. Self-Join Algorithms
B. Experimental Results
- Self-Join Performance
- R-S Join Performance

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs


Abstract
1. Introduction
2. Related Work
3. Why not Traditional Methods?
- Spectral Clustering
- Graph Partitioning Methods
4. EigenSpokes
4.1 A Surprise: Spokes
4.2 Justification and Proofs
4.3 Ubiquity of Spokes
4.4 Recreating Spokes
5. SpokeEn: Exploiting EigenSpokes
5.1 Designing SpokeEn
5.2 Discussion
5.3 Emperical Results
6. Successes with Real-World Graphs
7. Conclusions
References

Design Patterns for Efficient Graph Algorithms in MapReduce


Abstract
1. Introduction
2. MapReduce
3. Graph Algorithms
4. Basic Implementation
4.1 Message Passing
4.2 Local Aggregation
5. Algorithm Optimizations
5.1 In-Mapper Combining
5.2 Schimmy
5.3 Range Partitioning
6. Results
7. Future Works and Conclusions
8. Acknowledgements
9. References

Hadoop Summit 2010 - Presentation Slides & Videos

==========================================
AGENDA
==========================================
(1)Big Data and the Power of Hadoop
Blake Irving, Executive Vice President
and Chief Products Officer, Yahoo!
- Article: Yahoo announces SOX compliance coming for Hadoop
- VIDEO

(2)Hadoop and The Future of Internet Scale Cloud Computing
Shelton Shugar, Senior Vice President, Cloud Computing, Yahoo!
- VIDEO

(3)Scaling Hadoop
Eric Baldeschwieler, Vice President,
Hadoop Software Development, Yahoo!
- VIDEO

(4)Making Hadoop Enterprise Ready with Amazon Elastic MapReduce
Peter Sirota, General Manager, Elastic Map Reduce
- VIDEO

(5)Hadoop Grows Up
Doug Cutting, Cloudera
- VIDEO

(6)Inside Large-Scale Analytics at Facebook
Mike Schroepfer, VP of Engineering, Facebook
- VIDEO

==========================================
DEVELOPERS TRACK
==========================================
(1)Hadoop Security in Detail
Owen O'Malley, Yahoo!
- PRESENTATION SLIDE
- Hadoop Security Preview
- VIDEO

(2)Hive integration: HBase and Rcfile
John Sichi and Yongqiang He, Facebook
- PRESENTATION SLIDE
- List of presentations mainly focused on Hive
- HBase Presentations
- PoweredBy Hive (some)
- PoweredBy HBase (some)
- Blog: Integrating Hive and HBase
- VIDEO

(3)Hadoop and Pig at Twitter
Kevin Weil, Twitter
- PRESENTATION SLIDE
- VIDEO

(4)Developer's Most Frequent Hadoop Headaches & How to Address Them
Shevek Mankin, Karmasphere
- PRESENTATION SLIDE
- VIDEO

(5)Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!
- PRESENTATION SLIDE
- VIDEO

(6)Cascalog: an Interactive Query Language for Hadoop
Nathan Marz, BackType
- PRESENTATION SLIDE
- VIDEO

(7)Honu - A Large Scale Streaming Data Collection
and Processing Pipeline
Jerome Boulon, Netflix
- PRESENTATION SLIDE
- VIDEO

(8)Hadoop Frameworks Panel: Pig, Hive, Cascading,
Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird
Moderator: Sanjay Radia, Yahoo!
- Twitter ElephantBird - updated slide
- PRESENTATION SLIDE
- VIDEO

==========================================
APPLICATIONS TRACK
==========================================
(1)Disruptive Applications with Hadoop
Rod Smith, VP, IBM Emerging Internet Technologies
- PRESENTATION SLIDE
- VIDEO

(2)ZettaVox: Content Mining and Analysis Across
Heterogeneous Compute Clouds
Mark Davis, Kitenga
- PRESENTATION SLIDE
- VIDEO

(3)Biometric Databases and Hadoop
Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton
- PRESENTATION SLIDE
- VIDEO

(4)Hadoop - Integration Patterns and Practices
Eric Sammer, Cloudera
- PRESENTATION SLIDE
- VIDEO

(5)Winning the Big Data SPAM Challenge
Stefan Groschupf, Datameer; Florian Leibert, Erich Nachbar
- PRESENTATION SLIDE
- VIDEO

(6)Data Applications and Infrastructure at LinkedIn
Jay Kreps, LinkedIn
- PRESENTATION SLIDE
- VIDEO

(7)Online Content Optimization with Hadoop
Amit Phadke, Yahoo!
- PRESENTATION SLIDE
- VIDEO

(8)Hadoop Customer Panel: Amazon Elastic MapReduce
Moderator: Deepak Singh, Amazon Web Services
- VIDEO

==========================================
RESEARCH TRACK
==========================================
(1)Design Patterns for Efficient Graph Algorithms in MapReduce
Jimmy Lin, Michael Schatz, University of Maryland
- PRESENTATION SLIDE
- RESEARCH PAPER
- BOOK
- VIDEO

(2)Mining Billion-node Graphs: Patterns, Generators and Tools
Christos Faloutsos, Carnegie Mellon University
- PRESENTATION SLIDE
- RESEARCH PAPER

(3)XXL Graph Algorithms
Sergei Vassilvitskii, Yahoo! Labs
- PRESENTATION SLIDE

(4)Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li, University of California, Irvine
- PRESENTATION SLIDE
- RESEARCH PAPER
- VIDEO

(5)Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov, Cloudera
- PRESENTATION SLIDE
- VIDEO

(6)Hadoop for Scientific Workloads
Lavanya Ramakrishnan, Lawrence Berkeley National Lab
- PRESENTATION SLIDE
- VIDEO

(7)Hadoop for Genomics
Jeremy Bruestle, Spiral Genetics
- PRESENTATION SLIDE
- VIDEO

(8)Parallel Distributed Image Stacking and Mosaicing with Hadoop
Keith Wiley, University of Washington
- PRESENTATION SLIDE

Related:
- Massive Data
- List of presentations about Hadoop
- Past, 2008 Hadoop Summit slides and videos
- Apache Hadoop Wiki
- Cloudera training videos on Hadoop
- Yahoo! Hadoop Tutorial
- PoweredBy Hadoop
- Google Code University: Distributed Systems
- University of Washington: Scalable Systems Course
- Mapreduce & Hadoop Algorithms in Academic Papers
- Machine Learning on Hadoop
- Reference: Graph Theory and Complex Networks, Maarten van Steen
(via @Werner)
- Mathematics of Batch Processing
- Pig at LinkedIn, Open Source and Understanding Systems

- Yahoo's Commitment to Hadoop and Open Source
- Hadoop Trends, Opportunities, and Challenges
- Multiple Sequence Alignment Using Hadoop
- Key Challenges in Cloud Computing and Yahoo!'s Approach
- Hadoop @ Yahoo! - Internet Scale Data Processing
- Hadoop, Pig, HBase at Twitter
- CDH3 Installation and Configuration Guide

- Testing Hadoop
- Atlassian Clover - code coverage
- Challenges And Uniqueness Of Qe And Re Processes In Hadoop
- Data Management On Grid
- Benchmarking and Optimizing Hadoop
- Data Management on Hadoop @ Yahoo!
- Tuning Hadoop To Deliver Performance To Your Application

Saturday, June 26, 2010

Some activities in a data science team


From @hackingdata:

- author a multistage processing pipeline in Python
- design a hypothesis test
- perform a regression analysis over data samples with R
- design and implement an algorithm for some data-intensive product
or service in Hadoop
- communicate the results of analyses to other members of the org
in a clear and concise fashion

MapReduce: Simplified Data Processing on Large Clusters


Abstract
1. Introduction
2. Programming Model
2.1 Example
2.2 Types
2.3 More Examples
3. Implementation
3.1 Execution Overview
3.2 Master Data Structure
3.3 Fault Tolerance
3.3.1 Worker Failure
3.3.2 Master Failure
3.3.3 Semantics in the Presence of Failures
3.4 Locality
3.5 Task Granularity
3.6 Backup Tasks
4. Refinements
4.1 Partitioning Function
4.2 Ordering Guarantees
4.3 Combiner Function
4.4 Input and Output Types
4.5 Side-effects
4.6 Skipping Bad Records
4.7 Local Execution
4.8 Status Information
4.9 Counters
5. Performance
5.1 Cluster Configuration
5.2 Grep
5.3 Sort
5.4 Effect of Backup Tasks
5.5 Machine Failures
6. Experience
6.1 Large-Scale Indexing
7. Related Work
8. Conclusions

Acknowledgements
References

Thursday, June 24, 2010

Pregel: A System for Large-Scale Graph Processing


Abstract
1. Introduction
2. Model of Computation
3. The C++ API
3.1 Message Passing
3.2 Combiners
3.3 Aggregators
3.4 Topology Mutations
3.5 Input and output
4. Implementation
4.1 Basic architecture
4.2 Fault tolerance
4.3 Worker implementation
4.4 Master implementation
4.5 Aggregators
5. Applications
5.1 PageRank
5.2 Shortest Paths
5.3 Bipartite Matching
5.4 Semi-Clustering
6. Experiments
7. Related Work
8. Conclusions and Future Work
9. Acknowledgements
10. References

FlumeJava: Easy, Efficient Data-Parallel Pipelines


Abstract
1. Introduction
2. Background on MapReduce
3. The FlumeJava Library
3.1 Core Abstractions
3.2 Derived Operations
3.3 Deferred Evaluation
3.4 PObjects
4. Optimizer
4.1 ParallelDo Fusion
4.2 The MapShuffleCombineReduce (MSCR) Operation
4.3 MSCR Fusion
4.4 Overall Optimizer Strategy
4.5 Example: SiteData
4.6 Optimizer Limitations and Future Work
5. Executor
6. Evaluation
6.1 User Adoption and Experience
6.2 Optimizer Effectiveness
6.3 Execution Performance
7. Related Work
8. Conclusion

Hadoop: The Definitive Guide

TOC
1. Meet Hadoop
2. MapReduce
3. The Hadoop Distributed Filesystem
4. Hadoop I/O
5. Developing MapReduce Applications
6. How MapReduce Works
7. MapReduce Types and Formats
8. MapReduce Features
9. Setting Up a Hadoop Cluster
10. Administering Hadoop
11. Pig
12. HBase
13. ZooKeeper
14. Case Studies

A. Installing Apache Hadoop
B. Cloudera's Distribution of Hadoop
C. Preparing the NCDC Weather Data