Practicals - Data Science: June 2010

Wednesday, June 30, 2010

Graph Theory and Complex Networks

6. Network analysis
    6.1 Vertex degrees
        Degree distribution
        Degree correlations
    6.2 Distance statistics
    6.3 Clustering coefficient
        Some effects of clustering
        Local view
        Global view
    6.4 Centrality
 
7. Random networks
    7.1 Introduction
    7.2 Classical random networks
        Degree distribution
        Other metrics for random graphs
    7.3 Small worlds
    7.4 Scale-free networks
        Fundamentals
        Properties of scale-free networks
        Related networks
 
9. Social networks
    9.1 Social network analysis: introduction
    9.2 Some basic concepts
        Centrality and prestige
        Structural balance
        Cohesive subgroups
        Affiliation networks
    9.3 Equivalence
        Structural equivalence
        Automorphic equivalence
        Regular equivalence

The Fourth Paradigm: Data-Intensive Scientific Discovery

Head First Data Analysis


1. Introduction to Data Analysis: Break It Down
2. Experiments: Test Your Theories
3. Optimization: Take It to the Max
4. Data Visualization: Pictures Make You Smarter
5. Hypothesis Testing: Say It Ain't So
6. Bayesian Statistics: Get Past First Base
7. Subjective Probabilities: Numerical Belief
8. Heuristics: Analyze Like a Human
9. Histograms: The Shape of Numbers
10. Regression: Prediction
11. Error: Err Well
12. Relational Database: Can You Relate?
13. Cleaning Data: Impose Order
 
i. Leftovers: Top Ten Things (We Didn't Cover)
ii. Install R: Start R Up!
iii. Install Excel Analysis Tools: The ToolPak

Tuesday, June 29, 2010

Handbook of Statistical Analysis and Data Mining


I. History of Phases of Data Analysis, 
Basic Theory, and Data Mining Process
 
1. The Background for Data Mining Practice
2. Theoretical Considerations for Data Mining
3. The Data Mining Process
4. Data Understanding and Preparation
5. Feature Selection
6. Accessory Tools for Doing Data Mining
 
II. The Algorithms in Data Mining and Text Mining,
The Organization of the Three Most Common Data 
Mining Tools, and Selected Specialized Areas Using
Data Mining
 
7. Basic Algorithms for Data Mining: A Brief Overview
8. Advanced Algorithms for Data Mining
9. Text Mining and Natural Language Processing
10. The Three Most Common Data Mining Software Tools
11. Classification
12. Numerical Prediction
13. Model Evaluation and Enhancement
14. Medical Informatics
15. Bioinformatics
16. Customer Response Modeling
17. Fraud Detection
 
III. Tutorials - Step-by-Step Case Studies as a Starting
Point to Learn How to Do Data Mining Analyses
 
A. How to Use Data Miner Recipe
B. Data Mining for Aviation Safety
C. Predicting Movie Box-Office Receipts
D. Detecting Unsatisfied Customers: A Case Study
E. Credit Scoring
F. Churn Analysis
G. Text Mining: Automobile Brand Review
H. Predictive Process Control: QC-Data Mining
I. Business Administration in a Medical Industry
J. Clinical Psychology: Making Decisions About
Best Therapy for a Client
K. Education-Leadership Training for Business
and Education
L. Dentistry: Facial Pain Study
M. Profit Analysis of the German Credit Data
N. Predicting Self-Reported Health Status Using
Artificial Neural Networks
 
IV. Measuring True Complexity, The "Right Model
for the Right Use", Top Mistakes, and the Future
of Analytics
 
18. Model Complexity (and How Ensembles Help)
19. The Right Model for the Right Purpose:
When Less is Good Enough
20. Top 10 Data Mining Mistakes
21. Prospects for the Future of Data Mining
and Text Mining as Part of Our Everyday Lives
22. Summary: Our Design

Data Mining: Practical Machine Learning Tools and Techniques


I. Machine learning tools and techniques
 
1. What's it all about?
2. Input: Concepts, instances, and attributes
3. Output: Knowledge representation
4. Algorithms: The basic methods
    - Inferring rudimentary rules
    - Statistical modeling
    - Divide-and-conquer: Constructing decision trees
    - Covering algorithms: Constructing rules
    - Mining association rules
    - Linear models
    - Instance-based learning
    - Clustering
5. Credibility: Evaluating what's been learned
6. Implementations: Real machine learning schemes
7. Transformations: Engineering the input and output
8. Moving on: Extensions and applications

Programming Collective Intelligence


1. Introduction to Collective Intelligence
    - What is Collective Intelligence?
    - What is Machine Learning?
    - Limits of Machine Learning
    - Real-Life Examples
    - Other Uses for Learning Algorithms
2. Making Recommendations
    - Collaborative Filtering
    - Collecting Preferences
    - Finding Similar Users
    - Recommending Items
    - Matching Products
    - Building a del.icio.us Link Recommender
    - Item-Based Filtering
    - Using the MovieLens Dataset
    - User-Based or Item-Based Filtering?
3. Discovering Groups
4. Searching and Ranking
5. Optimization
6. Document Filtering
7. Modeling with Decision Trees
8. Building Price Models
    - Building a Sample Dataset
    - k-Nearest Neighbors
    - Weighted Neighbors
    - Cross-Validation
    - Heterogeneous Variables
    - Optimizing the Scale
    - Uneven Distributions
    - Using Real Data - the eBay API
    - When to Use k-Nearest Neighbors
9. Advanced Classifications: Kernel Methods and SVMs
    - Matchmaker Dataset
    - Difficulties with the Data
    - Basic Linear Classification
    - Categorical Features
    - Scaling the Data
    - Understanding Kernel Methods
    - Support-Vector Machines
    - Using LIBSVM
    - Matching on Facebook
10. Finding Independent Features
    - A Corpus of News
    - Previous Approaches
    - Non-Negative Matrix Factorization
    - Displaying the Results
    - Using Stock Market Data
11. Evolving Intelligence
12. Algorithm Summary
    - Bayesian Classifier
    - Decision Tree Classifier
    - Neural Networks
    - Support-Vector Machines
    - k-Nearest Neighbors
    - Clustering
    - Multidimensional Scaling
    - Non-Negative Matrix Factorization
    - Optimization

A. Third-Party Libraries
B. Mathematical Formulas

Efficient Parallel Set-Similarity Joins Using MapReduce


Abstract
1. Introduction
2. Preliminaries
    2.1 MapReduce
    2.2 Parallel Set-Similarity Joins
    2.3 Set-Similarity Filtering
3. Self-Join Case
    3.1 Stage 1: Token Ordering
        3.1.1 Basic Token Ordering (BTO)
        3.1.2 Using One Phase to Order Tokens (OPTO)
    3.2 Stage 2: RID-Pair Generation
        3.2.1 Basic Kernel (BK)
        3.2.2 Indexed Kernel (PK)
    3.3 Stage 3: Record Join
        3.3.1 Basic Record Join (BRJ)
        3.3.2 One-Phase Record Join (OPRJ)
4. R-S Join Case
5. Handling Insufficient Memory
    - Map-Based Block Processing
    - Reduce-Based Block Processing
    - Handling R-S Joins
6. Experimental Evaluation
    6.1 Self-Join Performance
        6.1.1 Self-Join Speedup
        6.1.2 Self-Join Scaleup
        6.1.3 Self-Join Summary
    6.2 R-S Join Performance
        6.2.1 R-S Join Speedup
        6.2.2 R-S Join Scaleup
7. Related Work
8. Conclusions
9. References

Appendix

A. Self-Join Algorithms
B. Experimental Results
    - Self-Join Performance
    - R-S Join Performance

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs


Abstract
1. Introduction
2. Related Work
3. Why not Traditional Methods?
    - Spectral Clustering
    - Graph Partitioning Methods
4. EigenSpokes
    4.1 A Surprise: Spokes
    4.2 Justification and Proofs
    4.3 Ubiquity of Spokes
    4.4 Recreating Spokes
5. SpokEn: Exploiting EigenSpokes
    5.1 Designing SpokEn
    5.2 Discussion
    5.3 Empirical Results
6. Successes with Real-World Graphs
7. Conclusions
References

Design Patterns for Efficient Graph Algorithms in MapReduce


Abstract
1. Introduction
2. MapReduce
3. Graph Algorithms
4. Basic Implementation
    4.1 Message Passing
    4.2 Local Aggregation
5. Algorithm Optimizations
    5.1 In-Mapper Combining
    5.2 Schimmy
    5.3 Range Partitioning
6. Results
7. Future Work and Conclusions
8. Acknowledgements
9. References

Hadoop Summit 2010 - Presentation Slides & Videos

==========================================
AGENDA
==========================================
(1)Big Data and the Power of Hadoop
Blake Irving, Executive Vice President
and Chief Products Officer, Yahoo!
- Article: Yahoo announces SOX compliance coming for Hadoop
- VIDEO
 
(2)Hadoop and The Future of Internet Scale Cloud Computing
Shelton Shugar, Senior Vice President, Cloud Computing, Yahoo!
- VIDEO
 
(3)Scaling Hadoop
Eric Baldeschwieler, Vice President,
Hadoop Software Development, Yahoo!
- VIDEO
 
(4)Making Hadoop Enterprise Ready with Amazon Elastic MapReduce
Peter Sirota, General Manager, Elastic MapReduce
- VIDEO
 
(5)Hadoop Grows Up
Doug Cutting, Cloudera
- VIDEO
 
(6)Inside Large-Scale Analytics at Facebook
Mike Schroepfer, VP of Engineering, Facebook
- VIDEO
 
==========================================
DEVELOPERS TRACK
==========================================
(1)Hadoop Security in Detail
Owen O'Malley, Yahoo!
- PRESENTATION SLIDE
- Hadoop Security Preview
- VIDEO
 
(2)Hive integration: HBase and RCFile
John Sichi and Yongqiang He, Facebook
- PRESENTATION SLIDE
- List of presentations mainly focused on Hive
- HBase Presentations
- PoweredBy Hive (some)
- PoweredBy HBase (some)
- Blog: Integrating Hive and HBase
- VIDEO
 
(3)Hadoop and Pig at Twitter
Kevin Weil, Twitter
- PRESENTATION SLIDE
- VIDEO
 
(4)Developer's Most Frequent Hadoop Headaches & How to Address Them
Shevek Mankin, Karmasphere
- PRESENTATION SLIDE
- VIDEO
 
(5)Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!
- PRESENTATION SLIDE
- VIDEO
 
(6)Cascalog: an Interactive Query Language for Hadoop
Nathan Marz, BackType
- PRESENTATION SLIDE
- VIDEO
 
(7)Honu - A Large Scale Streaming Data Collection
and Processing Pipeline
Jerome Boulon, Netflix
- PRESENTATION SLIDE
- VIDEO
 
(8)Hadoop Frameworks Panel: Pig, Hive, Cascading,
Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird
Moderator: Sanjay Radia, Yahoo!
- Twitter ElephantBird - updated slide
- PRESENTATION SLIDE
- VIDEO
  
==========================================
APPLICATIONS TRACK
==========================================
(1)Disruptive Applications with Hadoop
Rod Smith, VP, IBM Emerging Internet Technologies
- PRESENTATION SLIDE
- VIDEO
 
(2)ZettaVox: Content Mining and Analysis Across
Heterogeneous Compute Clouds
Mark Davis, Kitenga
- PRESENTATION SLIDE
- VIDEO
 
(3)Biometric Databases and Hadoop
Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton
- PRESENTATION SLIDE
- VIDEO 
 
(4)Hadoop - Integration Patterns and Practices
Eric Sammer, Cloudera
- PRESENTATION SLIDE
- VIDEO
 
(5)Winning the Big Data SPAM Challenge
Stefan Groschupf, Datameer; Florian Leibert, Erich Nachbar
- PRESENTATION SLIDE
- VIDEO
 
(6)Data Applications and Infrastructure at LinkedIn
Jay Kreps, LinkedIn
- PRESENTATION SLIDE
- VIDEO
 
(7)Online Content Optimization with Hadoop
Amit Phadke, Yahoo!
- PRESENTATION SLIDE
- VIDEO
 
(8)Hadoop Customer Panel: Amazon Elastic MapReduce
Moderator: Deepak Singh, Amazon Web Services
- VIDEO
 
==========================================
RESEARCH TRACK
==========================================
(1)Design Patterns for Efficient Graph Algorithms in MapReduce
Jimmy Lin, Michael Schatz, University of Maryland
- PRESENTATION SLIDE
- RESEARCH PAPER
- BOOK
- VIDEO
 
(2)Mining Billion-node Graphs: Patterns, Generators and Tools
Christos Faloutsos, Carnegie Mellon University
- PRESENTATION SLIDE
- RESEARCH PAPER
 
(3)XXL Graph Algorithms
Sergei Vassilvitskii, Yahoo! Labs
- PRESENTATION SLIDE
 
(4)Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li, University of California, Irvine
- PRESENTATION SLIDE
- RESEARCH PAPER
- VIDEO
 
(5)Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov, Cloudera
- PRESENTATION SLIDE
- VIDEO
 
(6)Hadoop for Scientific Workloads
Lavanya Ramakrishnan, Lawrence Berkeley National Lab
- PRESENTATION SLIDE
- VIDEO
 
(7)Hadoop for Genomics
Jeremy Bruestle, Spiral Genetics
- PRESENTATION SLIDE
- VIDEO
 
(8)Parallel Distributed Image Stacking and Mosaicing with Hadoop
Keith Wiley, University of Washington
- PRESENTATION SLIDE
 
Related:
- Massive Data
- List of presentations about Hadoop
- Past, 2008 Hadoop Summit slides and videos
- Apache Hadoop Wiki
- Cloudera training videos on Hadoop
- Yahoo! Hadoop Tutorial
- PoweredBy Hadoop
- Google Code University: Distributed Systems
- University of Washington: Scalable Systems Course
- Mapreduce & Hadoop Algorithms in Academic Papers
- Machine Learning on Hadoop
- Reference: Graph Theory and Complex Networks, Maarten van Steen
  (via @Werner)
- Mathematics of Batch Processing
- Pig at LinkedIn, Open Source and Understanding Systems
 
- Yahoo's Commitment to Hadoop and Open Source
- Hadoop Trends, Opportunities, and Challenges
- Multiple Sequence Alignment Using Hadoop
- Key Challenges in Cloud Computing and Yahoo!'s Approach
- Hadoop @ Yahoo! - Internet Scale Data Processing
- Hadoop, Pig, HBase at Twitter
- CDH3 Installation and Configuration Guide
 
- Testing Hadoop
- Atlassian Clover - code coverage
- Challenges And Uniqueness Of QE And RE Processes In Hadoop
- Data Management On Grid
- Benchmarking and Optimizing Hadoop
- Data Management on Hadoop @ Yahoo!
- Tuning Hadoop To Deliver Performance To Your Application

Saturday, June 26, 2010

Some activities in a data science team


From @hackingdata:

- author a multistage processing pipeline in Python
- design a hypothesis test
- perform a regression analysis over data samples with R
- design and implement an algorithm for some data-intensive product
  or service in Hadoop
- communicate the results of analyses to other members of the org
  in a clear and concise fashion
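
As a rough illustration of the first activity in the list above (a multistage
processing pipeline in Python), here is a minimal sketch. The stage names
(read_records, tokenize, count_tokens) and the token-counting task are
assumptions made up for this example; they do not come from @hackingdata.

```python
from collections import Counter
from typing import Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[str]:
    """Stage 1 (hypothetical): strip whitespace and drop empty lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def tokenize(records: Iterable[str]) -> Iterator[str]:
    """Stage 2 (hypothetical): split each record into lowercase tokens."""
    for record in records:
        for token in record.lower().split():
            yield token


def count_tokens(tokens: Iterable[str]) -> Counter:
    """Stage 3 (hypothetical): aggregate token frequencies."""
    return Counter(tokens)


def run_pipeline(lines: Iterable[str]) -> Counter:
    """Chain the stages; the generators avoid building intermediate lists."""
    return count_tokens(tokenize(read_records(lines)))


if __name__ == "__main__":
    sample = ["Hadoop Pig Hive", "", "  hadoop HBase  "]
    print(run_pipeline(sample).most_common(1))  # [('hadoop', 2)]
```

Each stage only consumes an iterable and yields results, so a stage can be
swapped out or tested on its own without touching the others.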

MapReduce: Simplified Data Processing on Large Clusters


Abstract
1. Introduction
2. Programming Model
    2.1 Example
    2.2 Types
    2.3 More Examples
3. Implementation
    3.1 Execution Overview
    3.2 Master Data Structures
    3.3 Fault Tolerance
        3.3.1 Worker Failure
        3.3.2 Master Failure
        3.3.3 Semantics in the Presence of Failures
    3.4 Locality
    3.5 Task Granularity
    3.6 Backup Tasks
4. Refinements
    4.1 Partitioning Function
    4.2 Ordering Guarantees
    4.3 Combiner Function
    4.4 Input and Output Types
    4.5 Side-effects
    4.6 Skipping Bad Records
    4.7 Local Execution
    4.8 Status Information
    4.9 Counters
5. Performance
    5.1 Cluster Configuration
    5.2 Grep
    5.3 Sort
    5.4 Effect of Backup Tasks
    5.5 Machine Failures
6. Experience
    6.1 Large-Scale Indexing
7. Related Work
8. Conclusions

Acknowledgements
References

Thursday, June 24, 2010

Pregel: A System for Large-Scale Graph Processing


Abstract
1. Introduction
2. Model of Computation
3. The C++ API
    3.1 Message Passing
    3.2 Combiners
    3.3 Aggregators
    3.4 Topology Mutations
    3.5 Input and output
4. Implementation
    4.1 Basic architecture
    4.2 Fault tolerance
    4.3 Worker implementation
    4.4 Master implementation
    4.5 Aggregators
5. Applications
    5.1 PageRank
    5.2 Shortest Paths
    5.3 Bipartite Matching
    5.4 Semi-Clustering
6. Experiments
7. Related Work
8. Conclusions and Future Work
9. Acknowledgements
10. References

FlumeJava: Easy, Efficient Data-Parallel Pipelines


Abstract
1. Introduction
2. Background on MapReduce
3. The FlumeJava Library
  3.1 Core Abstractions
  3.2 Derived Operations
  3.3 Deferred Evaluation
  3.4 PObjects
4. Optimizer
  4.1 ParallelDo Fusion
  4.2 The MapShuffleCombineReduce (MSCR) Operation
  4.3 MSCR Fusion
  4.4 Overall Optimizer Strategy
  4.5 Example: SiteData
  4.6 Optimizer Limitations and Future Work
5. Executor
6. Evaluation
  6.1 User Adoption and Experience
  6.2 Optimizer Effectiveness
  6.3 Execution Performance
7. Related Work
8. Conclusion

Hadoop: The Definitive Guide

TOC
1. Meet Hadoop
2. MapReduce
3. The Hadoop Distributed Filesystem
4. Hadoop I/O
5. Developing MapReduce Applications
6. How MapReduce Works
7. MapReduce Types and Formats
8. MapReduce Features
9. Setting Up a Hadoop Cluster
10. Administering Hadoop
11. Pig
12. HBase
13. ZooKeeper
14. Case Studies

A. Installing Apache Hadoop
B. Cloudera's Distribution of Hadoop
C. Preparing the NCDC Weather Data


Friday, June 25, 2010

The Unreasonable Effectiveness of Data

Experiences Evolving a New Analytical Platform: What Works and What's Missing
