Data Mining:
Practical Machine Learning Tools and Techniques

We have written a companion book for the Weka software that describes the machine learning techniques it implements and how to use them. The book, now in its third edition, is structured into three parts: the first is an introduction to data mining using basic machine learning techniques, the second describes more advanced machine learning methods, and the third is a user guide for Weka. The third edition was published in January 2011 by Morgan Kaufmann Publishers (ISBN: 978-0-12-374856-0). Mark Hall has joined Ian Witten and Eibe Frank as co-author for this edition, which has expanded to 629 pages.

"If you have data that you want to analyze and understand, this book and the associated Weka toolkit are an excellent way to start."

-Jim Gray, Microsoft Research

"The authors provide enough theory to enable practical application, and it is this practical focus that separates this book from most, if not all, other books on this subject."

-Dorian Pyle, Director of Modeling at Numetrics

"This book would be a strong contender for a technical data mining course. It is one of the best of its kind."

-Herb Edelstein, Principal, Data Mining Consultant, Two Crows Consulting

"It is certainly one of my favourite data mining books in my library."

-Tom Breur, Principal, XLNT Consulting, Tilburg, Netherlands

Features
  • Explains how data mining algorithms work.
  • Helps you select appropriate approaches to particular problems and to compare and evaluate the results of different techniques.
  • Covers performance improvement techniques, including input preprocessing and combining output from different methods.
  • Shows you how to use the Weka machine learning workbench.

Translations

The book has been translated into German (first edition), Chinese (second and third editions), and Korean (third edition).

Errata

Click here to get to a list of errata.

Teaching material

Slides for Chapters 1-5 of the 3rd edition can be found here.

Slides for Chapters 6-8 of the 3rd edition can be found here.

These archives contain .pdf files as well as .odp files in Open Document Format that were generated using OpenOffice 2.0. Several free office programs can now read .odp files, and Sun has also made a plug-in for Word that reads this format. More information is available on this Wikipedia page.

Reviews of the first edition

Review by J. Geller (SIGMOD Record, Vol. 32:2, March 2002).
Review by E. Davis (AI Journal, Vol. 131:1-2, September 2001).
Review by P.A. Flach (AI Journal, Vol. 131:1-2, September 2001).


Table of Contents for the 3rd Edition:
Sections and chapters with new material are marked in red.

Preface

Part I: Practical Machine Learning Tools and Techniques

1. What’s it all about?
1.1 Data Mining and Machine Learning
1.2 Simple Examples: The Weather Problem and Others
1.3 Fielded Applications
1.4 Machine Learning and Statistics
1.5 Generalization as Search
1.6 Data Mining and Ethics
1.7 Further Reading

2. Input: Concepts, instances, attributes
2.1 What’s a Concept?
2.2 What’s in an Example?
2.3 What’s in an Attribute?
2.4 Preparing the Input
2.5 Further Reading

3. Output: Knowledge representation
3.1 Tables
3.2 Linear Models
3.3 Trees
3.4 Rules
3.5 Instance-Based Representation
3.6 Clusters
3.7 Further Reading

4. Algorithms: The basic methods
4.1 Inferring Rudimentary Rules
4.2 Statistical Modeling
4.3 Divide-and-Conquer: Constructing Decision Trees
4.4 Covering Algorithms: Constructing Rules
4.5 Mining Association Rules
4.6 Linear Models
4.7 Instance-Based Learning
4.8 Clustering
4.9 Multi-Instance Learning
4.10 Further Reading
4.11 Weka Implementations

5. Credibility: Evaluating what’s been learned
5.1 Training and Testing
5.2 Predicting Performance
5.3 Cross-Validation
5.4 Other Estimates
5.5 Comparing Data Mining Schemes
5.6 Predicting Probabilities
5.7 Counting the Cost
5.8 Evaluating Numeric Prediction
5.9 The Minimum Description Length Principle
5.10 Applying MDL to Clustering
5.11 Further Reading

Part II: Advanced Data Mining

6. Implementations: Real machine learning schemes
6.1 Decision Trees
6.2 Classification Rules
6.3 Association Rules
6.4 Extending Linear Models
6.5 Instance-Based Learning
6.6 Numeric Prediction with Local Linear Models
6.7 Bayesian Networks
6.8 Clustering
6.9 Semisupervised Learning
6.10 Multi-Instance Learning
6.11 Weka Implementations

7. Data Transformations
7.1 Attribute Selection
7.2 Discretizing Numeric Attributes
7.3 Projections
7.4 Sampling
7.5 Cleansing
7.6 Transforming Multiple Classes to Binary Ones
7.7 Calibrating Class Probabilities
7.8 Further Reading
7.9 Weka Implementations

8. Ensemble Learning
8.1 Combining Multiple Models
8.2 Bagging
8.3 Randomization
8.4 Boosting
8.5 Additive Regression
8.6 Interpretable Ensembles
8.7 Stacking
8.8 Further Reading
8.9 Weka Implementations

9. Moving on: Applications and Beyond
9.1 Applying Data Mining
9.2 Learning from Massive Datasets
9.3 Data Stream Learning
9.4 Incorporating Domain Knowledge
9.5 Text Mining
9.6 Web Mining
9.7 Adversarial Situations
9.8 Ubiquitous Data Mining
9.9 Further Reading

Part III: The Weka Data Mining Workbench

10. Introduction to Weka
10.1 What’s in Weka?
10.2 How Do You Use It?
10.3 What Else Can You Do?

11. The Explorer
11.1 Getting Started
11.2 Exploring the Explorer
11.3 Filtering Algorithms
11.4 Learning Algorithms
11.5 Meta-Learning Algorithms
11.6 Clustering Algorithms
11.7 Association-Rule Learners
11.8 Attribute Selection

12. The Knowledge Flow Interface
12.1 Getting Started
12.2 Knowledge Flow Components
12.3 Configuring and Connecting the Components
12.4 Incremental Learning

13. The Experimenter
13.1 Getting Started
13.2 Simple Setup
13.3 Advanced Setup
13.4 The Analyze Panel
13.5 Distributing Processing over Several Machines

14. The Command-Line Interface
14.1 Getting Started
14.2 The Structure of Weka
14.3 Command-Line Options

15. Embedded Machine Learning
15.1 A Simple Data Mining Application

16. Writing New Learning Schemes
16.1 An Example Classifier
16.2 Conventions for Implementing Classifiers

17. Tutorial Exercises for the Weka Explorer
17.1 Introduction to the Explorer Interface
17.2 Nearest-Neighbor Learning and Decision Trees
17.3 Classification Boundaries
17.4 Preprocessing and Parameter Tuning
17.5 Document Classification
17.6 Mining Association Rules

References
Index