Practicals - Data Science

Wednesday, June 30, 2010

Graph Theory and Complex Networks

6. Network analysis
    6.1 Vertex degrees
        Degree distribution
        Degree correlations
    6.2 Distance statistics
    6.3 Clustering coefficient
        Some effects of clustering
        Local view
        Global view
    6.4 Centrality
 
7. Random networks
    7.1 Introduction
    7.2 Classical random networks
        Degree distribution
        Other metrics for random graphs
    7.3 Small worlds
    7.4 Scale-free networks
        Fundamentals
        Properties of scale-free networks
        Related networks
 
9. Social networks
    9.1 Social network analysis: introduction
    9.2 Some basic concepts
        Centrality and prestige
        Structural balance
        Cohesive subgroups
        Affiliation networks
    9.3 Equivalence
        Structural equivalence
        Automorphic equivalence
        Regular equivalence

The Fourth Paradigm: Data-Intensive Scientific Discovery

Head First Data Analysis


1. Introduction to Data Analysis: Break It Down
2. Experiments: Test Your Theories
3. Optimization: Take It to the Max
4. Data Visualization: Pictures Make You Smarter
5. Hypothesis Testing: Say It Ain't So
6. Bayesian Statistics: Get Past First Base
7. Subjective Probabilities: Numerical Belief
8. Heuristics: Analyze Like a Human
9. Histograms: The Shape of Numbers
10. Regression: Prediction
11. Error: Err Well
12. Relational Database: Can You Relate?
13. Cleaning Data: Impose Order
 
i. Leftovers: Top Ten Things (We Didn't Cover)
ii. Install R: Start R Up!
iii. Install Excel Analysis Tools: The ToolPak

Tuesday, June 29, 2010

Handbook of Statistical Analysis and Data Mining


I. History of Phases of Data Analysis, 
Basic Theory, and Data Mining Process
 
1. The Background for Data Mining Practice
2. Theoretical Considerations for Data Mining
3. The Data Mining Process
4. Data Understanding and Preparation
5. Feature Selection
6. Accessory Tools for Doing Data Mining
 
II. The Algorithms in Data Mining and Text Mining,
The Organization of the Three Most Common Data 
Mining Tools, and Selected Specialized Areas Using
Data Mining
 
7. Basic Algorithms for Data Mining: A Brief Overview
8. Advanced Algorithms for Data Mining
9. Text Mining and Natural Language Processing
10. The Three Most Common Data Mining Software Tools
11. Classification
12. Numerical Prediction
13. Model Evaluation and Enhancement
14. Medical Informatics
15. Bioinformatics
16. Customer Response Modeling
17. Fraud Detection
 
III. Tutorials - Step-by-Step Case Studies as a Starting
Point to Learn How to Do Data Mining Analyses
 
A. How to Use Data Miner Recipe
B. Data Mining for Aviation Safety
C. Predicting Movie Box-Office Receipts
D. Detecting Unsatisfied Customers: A Case Study
E. Credit Scoring
F. Churn Analysis
G. Text Mining: Automobile Brand Review
H. Predictive Process Control: QC-Data Mining
I. Business Administration in a Medical Industry
J. Clinical Psychology: Making Decisions About
Best Therapy for a Client
K. Education-Leadership Training for Business
and Education
L. Dentistry: Facial Pain Study
M. Profit Analysis of the German Credit Data
N. Predicting Self-Reported Health Status Using
Artificial Neural Networks
 
IV. Measuring True Complexity, The "Right Model
for the Right Use", Top Mistakes, and the Future
of Analytics
 
18. Model Complexity (and How Ensembles Help)
19. The Right Model for the Right Purpose:
When Less is Good Enough
20. Top 10 Data Mining Mistakes
21. Prospects for the Future of Data Mining
and Text Mining as Part of Our Everyday Lives
22. Summary: Our Design

Data Mining: Practical Machine Learning Tools and Techniques


I. Machine learning tools and techniques
 
1. What's it all about?
2. Input: Concepts, instances, and attributes
3. Output: Knowledge representation
4. Algorithms: The basic methods
    - Inferring rudimentary rules
    - Statistical modeling
    - Divide-and-conquer: Constructing decision trees
    - Covering algorithms: Constructing rules
    - Mining association rules
    - Linear models
    - Instance-based learning
    - Clustering
5. Credibility: Evaluating what's been learned
6. Implementations: Real machine learning schemes
7. Transformations: Engineering the input and output
8. Moving on: Extensions and applications

Programming Collective Intelligence


1. Introduction to Collective Intelligence
    - What is Collective Intelligence?
    - What is Machine Learning?
    - Limits of Machine Learning
    - Real-Life Examples
    - Other Uses for Learning Algorithms
2. Making Recommendations
    - Collaborative Filtering
    - Collecting Preferences
    - Finding Similar Users
    - Recommending Items
    - Matching Products
    - Building a del.icio.us Link Recommender
    - Item-Based Filtering
    - Using the MovieLens Dataset
    - User-Based or Item-Based Filtering?
3. Discovering Groups
4. Searching and Ranking
5. Optimization
6. Document Filtering
7. Modeling with Decision Trees
8. Building Price Models
    - Building a Sample Dataset
    - k-Nearest Neighbors
    - Weighted Neighbors
    - Cross-Validation
    - Heterogeneous Variables
    - Optimizing the Scale
    - Uneven Distributions
    - Using Real Data - the eBay API
    - When to Use k-Nearest Neighbors
9. Advanced Classification: Kernel Methods and SVMs
    - Matchmaker Dataset
    - Difficulties with the Data
    - Basic Linear Classification
    - Categorical Features
    - Scaling the Data
    - Understanding Kernel Methods
    - Support-Vector Machines
    - Using LIBSVM
    - Matching on Facebook
10. Finding Independent Features
    - A Corpus of News
    - Previous Approaches
    - Non-Negative Matrix Factorization
    - Displaying the Results
    - Using Stock Market Data
11. Evolving Intelligence
12. Algorithm Summary
    - Bayesian Classifier
    - Decision Tree Classifier
    - Neural Networks
    - Support-Vector Machines
    - k-Nearest Neighbors
    - Clustering
    - Multidimensional Scaling
    - Non-Negative Matrix Factorization
    - Optimization

A. Third-Party Libraries
B. Mathematical Formulas

Efficient Parallel Set-Similarity Joins Using MapReduce


Abstract
1. Introduction
2. Preliminaries
    2.1 MapReduce
    2.2 Parallel Set-Similarity Joins
    2.3 Set-Similarity Filtering
3. Self-Join Case
    3.1 Stage 1: Token Ordering
        3.1.1 Basic Token Ordering (BTO)
        3.1.2 Using One Phase to Order Tokens (OPTO)
    3.2 Stage 2: RID-Pair Generation
        3.2.1 Basic Kernel (BK)
        3.2.2 Indexed Kernel (PK)
    3.3 Stage 3: Record Join
        3.3.1 Basic Record Join (BRJ)
        3.3.2 One-Phase Record Join (OPRJ)
4. R-S Join Case
5. Handling Insufficient Memory
    - Map-Based Block Processing
    - Reduce-Based Block Processing
    - Handling R-S Joins
6. Experimental Evaluation
    6.1 Self-Join Performance
        6.1.1 Self-Join Speedup
        6.1.2 Self-Join Scaleup
        6.1.3 Self-Join Summary
    6.2 R-S Join Performance
        6.2.1 R-S Join Speedup
        6.2.2 R-S Join Scaleup
7. Related Work
8. Conclusions
9. References

Appendix

A. Self-Join Algorithms
B. Experimental Results
    - Self-Join Performance
    - R-S Join Performance