Statistical Pattern Recognition

Overview

Statistical pattern recognition relates to the use of statistical techniques for analysing data measurements in order to extract information and make justified decisions. It is a very active area of study and research, which has seen many advances in recent years. Applications such as data mining, web searching, multimedia data retrieval, face recognition, and cursive handwriting recognition all require robust and efficient pattern recognition techniques.

This third edition provides an introduction to statistical pattern recognition theory and techniques, with material drawn from a wide range of fields, including engineering, statistics, computer science and the social sciences. The book has been updated to cover new methods and applications, and includes techniques such as Bayesian methods, neural networks, support vector machines, feature selection and feature reduction. Technical descriptions and motivations are provided, and the techniques are illustrated using real examples.

Statistical Pattern Recognition, 3rd Edition:

  • Provides a self-contained introduction to statistical pattern recognition.
  • Includes new material presenting the analysis of complex networks.
  • Introduces readers to methods for Bayesian density estimation.
  • Presents descriptions of new applications in biometrics, security, finance and condition monitoring.
  • Provides descriptions and guidance for implementing techniques, which will be invaluable to software engineers and developers seeking to develop real applications.
  • Describes mathematically the range of statistical pattern recognition techniques.
  • Presents a variety of exercises including more extensive computer projects.

The in-depth technical descriptions make the book suitable for senior undergraduate and graduate students in statistics, computer science and engineering.  Statistical Pattern Recognition is also an excellent reference source for technical professionals.  Chapters have been arranged to facilitate implementation of the techniques by software engineers and developers in non-statistical engineering fields.

www.wiley.com/go/statistical_pattern_recognition


Product Details

ISBN-13: 9781119961406
Publisher: Wiley
Publication date: 10/13/2011
Sold by: JOHN WILEY & SONS
Format: eBook
Pages: 672
File size: 13 MB

About the Author

Dr Andrew Robert Webb, Senior Researcher, QinetiQ Ltd, Malvern, UK.

Dr Keith Derek Copsey, Senior Researcher, QinetiQ Ltd, Malvern, UK.

Read an Excerpt

Statistical Pattern Recognition


By Andrew R. Webb

John Wiley & Sons

ISBN: 0-470-84513-9


Chapter One

Introduction to statistical pattern recognition

Overview

Statistical pattern recognition is a term used to cover all stages of an investigation from problem formulation and data collection through to discrimination and classification, assessment of results and interpretation. Some of the basic terminology is introduced and two complementary approaches to discrimination are described.

1.1 Statistical pattern recognition

1.1.1 Introduction

This book describes basic pattern recognition procedures, together with practical applications of the techniques on real-world problems. A strong emphasis is placed on the statistical theory of discrimination, but clustering also receives some attention. Thus, the subject matter of this book can be summed up in a single word: 'classification', both supervised (using class information to design a classifier - i.e. discrimination) and unsupervised (allocating to groups without class information - i.e. clustering).

Pattern recognition as a field of study developed significantly in the 1960s. It was very much an interdisciplinary subject, covering developments in the areas of statistics, engineering, artificial intelligence, computer science, psychology and physiology, among others. Some people entered the field with a real problem to solve. The large numbers of applications, ranging from the classical ones such as automatic character recognition and medical diagnosis to the more recent ones in data mining (such as credit scoring, consumer sales analysis and credit card transaction analysis), have attracted considerable research effort, with many methods developed and advances made. Other researchers were motivated by the development of machines with 'brain-like' performance that in some way could emulate human performance. There were many over-optimistic and unrealistic claims made, and to some extent there exist strong parallels with the growth of research on knowledge-based systems in the 1970s and neural networks in the 1980s.

Nevertheless, within these areas significant progress has been made, particularly where the domain overlaps with probability and statistics, and within recent years there have been many exciting new developments, both in methodology and applications. These build on the solid foundations of earlier research and take advantage of increased computational resources readily available nowadays. These developments include, for example, kernel-based methods and Bayesian computational methods.

The topics in this book could easily have been described under the term machine learning, which describes the study of machines that can adapt to their environment and learn from example. The emphasis in machine learning is perhaps more on computationally intensive methods and less on a statistical approach, but there is strong overlap between the research areas of statistical pattern recognition and machine learning.

1.1.2 The basic model

Since many of the techniques we shall describe have been developed over a range of diverse disciplines, there is naturally a variety of sometimes contradictory terminology. We shall use the term 'pattern' to denote the p-dimensional data vector x = (x_1, ..., x_p)^T of measurements (^T denotes vector transpose), whose components x_i are measurements of the features of an object. Thus the features are the variables specified by the investigator and thought to be important for classification. In discrimination, we assume that there exist C groups or classes, denoted ω_1, ..., ω_C, and associated with each pattern x is a categorical variable z that denotes the class or group membership; that is, if z = i, then the pattern belongs to ω_i, i ∈ {1, ..., C}.

Examples of patterns are measurements of an acoustic waveform in a speech recognition problem; measurements on a patient made in order to identify a disease (diagnosis); measurements on patients in order to predict the likely outcome (prognosis); measurements on weather variables (for forecasting or prediction); and a digitised image for character recognition. Therefore, we see that the term 'pattern', in its technical meaning, does not necessarily refer to structure within images.

The main topic in this book may be described by a number of terms such as pattern classifier design or discrimination or allocation rule design. By this we mean specifying the parameters of a pattern classifier, represented schematically in Figure 1.1, so that it yields the optimal (in some sense) response for a given pattern. This response is usually an estimate of the class to which the pattern belongs. We assume that we have a set of patterns of known class {(x_i, z_i), i = 1, ..., n} (the training or design set) that we use to design the classifier (to set up its internal parameters). Once this has been done, we may estimate class membership for an unknown pattern x.
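
To make the notation concrete, the following Python sketch (not from the book) builds a small design set of labelled pattern vectors and fits the simplest imaginable allocation rule, a nearest-class-mean classifier; the data values and the choice of rule are purely illustrative assumptions.

    import numpy as np

    # Design (training) set: n patterns x_i (p = 2 here) with class labels z_i.
    # The values are made up purely for illustration.
    X_train = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 4.5], [4.2, 4.0]])
    z_train = np.array([1, 1, 2, 2])          # class labels (omega_1, omega_2)

    # "Design" a very simple classifier: estimate one mean vector per class.
    classes = np.unique(z_train)
    means = {c: X_train[z_train == c].mean(axis=0) for c in classes}

    def classify(x):
        """Assign pattern x to the class whose mean is nearest (Euclidean distance)."""
        return min(classes, key=lambda c: np.linalg.norm(x - means[c]))

    # Estimate class membership for an unknown pattern x.
    print(classify(np.array([1.1, 2.1])))     # -> 1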

The form derived for the pattern classifier depends on a number of different factors. It depends on the distribution of the training data, and the assumptions made concerning its distribution. Another important factor is the misclassification cost - the cost of making an incorrect decision. In many applications misclassification costs are hard to quantify, being combinations of several contributions such as monetary costs, time and other more subjective costs. For example, in a medical diagnosis problem, each treatment has different costs associated with it. These relate to the expense of different types of drugs, the suffering the patient is subjected to by each course of action and the risk of further complications.
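
As a small worked illustration of how misclassification costs can enter the decision (a sketch with made-up cost and probability values, not taken from the text), one can choose the class that minimises the expected cost given posterior class probabilities:

    import numpy as np

    # Illustrative misclassification-cost matrix: cost[i, j] is the cost of
    # deciding class j when the true class is i (all values are assumptions).
    cost = np.array([[0.0, 1.0, 10.0],
                     [1.0, 0.0,  2.0],
                     [5.0, 1.0,  0.0]])

    # Assumed posterior probabilities of the three classes for one pattern.
    posterior = np.array([0.2, 0.5, 0.3])

    # Expected cost of each possible decision; choose the minimum-cost decision.
    expected_cost = posterior @ cost
    decision = np.argmin(expected_cost)
    print(expected_cost, decision)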

Figure 1.1 grossly oversimplifies the pattern classification procedure. Data may undergo several separate transformation stages before a final outcome is reached. These transformations (sometimes termed preprocessing, feature selection or feature extraction) operate on the data in a way that usually reduces its dimension (reduces the number of features), removing redundant or irrelevant information, and transforms it to a form more appropriate for subsequent classification. The term intrinsic dimensionality refers to the minimum number of variables required to capture the structure within the data. In the speech recognition example mentioned above, a preprocessing stage may be to transform the waveform to a frequency representation. This may be processed further to find formants (peaks in the spectrum). This is a feature extraction process (taking a possible nonlinear combination of the original variables to form new variables). Feature selection is the process of selecting a subset of a given set of variables.
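
The distinction between feature extraction (transforming the measurements) and feature selection (choosing a subset of them) can be sketched in a few lines of Python; the waveform, sampling rate and chosen features below are illustrative assumptions, loosely echoing the speech example:

    import numpy as np

    # Hypothetical raw pattern: a sampled waveform (as in the speech example).
    fs = 8000                                   # assumed sampling rate (Hz)
    t = np.arange(0, 0.1, 1.0 / fs)
    waveform = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(t.size)

    # Feature extraction: a transformation of the raw measurements, here a
    # magnitude spectrum followed by the location of its largest peak.
    spectrum = np.abs(np.fft.rfft(waveform))
    freqs = np.fft.rfftfreq(waveform.size, d=1.0 / fs)
    dominant_freq = freqs[np.argmax(spectrum)]  # crude stand-in for a formant-like feature

    # Feature selection: simply choosing a subset of an existing feature vector.
    full_features = np.array([dominant_freq, spectrum.mean(), spectrum.std(), waveform.var()])
    selected = full_features[[0, 2]]            # keep features 0 and 2 only
    print(selected)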

Terminology varies between authors. Sometimes the term 'representation pattern' is used for the vector of measurements made on a sensor (for example, optical imager, radar) with the term 'feature pattern' being reserved for the small set of variables obtained by transformation (by a feature selection or feature extraction process) of the original vector of measurements. In some problems, measurements may be made directly on the feature vector itself. In these situations there is no automatic feature selection stage, with the feature selection being performed by the investigator who 'knows' (through experience, knowledge of previous studies and the problem domain) those variables that are important for classification. In many cases, however, it will be necessary to perform one or more transformations of the measured data.

In some pattern classifiers, each of the above stages may be present and identifiable as separate operations, while in others they may not be. Also, in some classifiers, the preliminary stages will tend to be problem-specific, as in the speech example. In this book, we consider feature selection and extraction transformations that are not application specific. That is not to say all will be suitable for any given application, however, but application-specific preprocessing must be left to the investigator.

1.2 Stages in a pattern recognition problem

A pattern recognition investigation may consist of several stages, enumerated below (a minimal sketch of how some of these stages might fit together in code follows the list). Further details are given in Appendix D. Not all stages may be present; some may be merged together so that the distinction between two operations may not be clear, even if both are carried out; also, there may be some application-specific data processing that may not be regarded as one of the stages listed. However, the points below are fairly typical.

1. Formulation of the problem: gaining a clear understanding of the aims of the investigation and planning the remaining stages.

2. Data collection: making measurements on appropriate variables and recording details of the data collection procedure (ground truth).

3. Initial examination of the data: checking the data, calculating summary statistics and producing plots in order to get a feel for the structure.

4. Feature selection or feature extraction: selecting variables from the measured set that are appropriate for the task. These new variables may be obtained by a linear or nonlinear transformation of the original set (feature extraction). To some extent, the division of feature extraction and classification is artificial.

5. Unsupervised pattern classification or clustering. This may be viewed as exploratory data analysis and it may provide a successful conclusion to a study. On the other hand, it may be a means of preprocessing the data for a supervised classification procedure.

6. Apply discrimination or regression procedures as appropriate. The classifier is designed using a training set of exemplar patterns.

7. Assessment of results. This may involve applying the trained classifier to an independent test set of labelled patterns.

8. Interpretation.
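
The sketch below shows one way stages 4, 6 and 7 might be arranged in code, using scikit-learn as an assumed toolchain and a stock dataset as a stand-in for stages 1-3; stage 5 (clustering) is omitted, and none of this is prescribed by the book.

    # A minimal, illustrative arrangement of stages 4, 6 and 7.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import accuracy_score

    X, z = load_iris(return_X_y=True)                       # stand-in for stages 1-3
    X_design, X_test, z_design, z_test = train_test_split(   # independent test set for stage 7
        X, z, test_size=0.3, random_state=0)

    pipeline = Pipeline([
        ('extract', PCA(n_components=2)),                    # stage 4: feature extraction
        ('classify', LogisticRegression(max_iter=200)),      # stage 6: discrimination rule
    ])
    pipeline.fit(X_design, z_design)                         # design the classifier
    print(accuracy_score(z_test, pipeline.predict(X_test)))  # stage 7: assessment of results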

The above is necessarily an iterative process: the analysis of the results may pose further hypotheses that require further data collection. Also, the cycle may be terminated at different stages: the questions posed may be answered by an initial examination of the data or it may be discovered that the data cannot answer the initial question and the problem must be reformulated.

The emphasis of this book is on techniques for performing steps 4, 5 and 6.

1.3 Issues

The main topic that we address in this book concerns classifier design: given a training set of patterns of known class, we seek to design a classifier that is optimal for the expected operating conditions (the test conditions).

There are a number of very important points to make about the sentence above, straightforward as it seems. The first is that we are given a finite design set. If the classifier is too complex (there are too many free parameters) it may model noise in the design set. This is an example of over-fitting. If the classifier is not complex enough, then it may fail to capture structure in the data. An example of this is the fitting of a set of data points by a polynomial curve. If the degree of the polynomial is too high, then, although the curve may pass through or close to the data points, thus achieving a low fitting error, the fitting curve is very variable and models every fluctuation in the data (due to noise). If the degree of the polynomial is too low, the fitting error is large and the underlying variability of the curve is not modelled.

Thus, achieving optimal performance on the design set (in terms of minimising some error criterion perhaps) is not required: it may be possible, in a classification problem, to achieve 100% classification accuracy on the design set but the generalisation performance - the expected performance on data representative of the true operating conditions (equivalently, the performance on an infinite test set of which the design set is a sample) - is poorer than could be achieved by careful design. Choosing the 'right' model is an exercise in model selection.
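
The polynomial example can be made concrete with a short sketch (illustrative data and degrees, not taken from the book): noisy samples of a smooth curve are fitted by polynomials of increasing degree, and the design-set fitting error is compared with the error on independent data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Noisy samples of an underlying smooth curve (illustrative data only).
    x = np.linspace(0, 1, 15)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
    x_test = np.linspace(0, 1, 200)
    y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(x_test.size)

    for degree in (1, 3, 10):
        coeffs = np.polyfit(x, y, degree)              # fit to the design set
        fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        gen_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(fit_err, 3), round(gen_err, 3))
    # Typically: degree 1 underfits (both errors large); degree 10 overfits
    # (tiny design-set error, larger error on independent data); degree 3 balances.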

In practice we usually do not know what is structure and what is noise in the data. Also, training a classifier (the procedure of determining its parameters) should not be considered as a separate issue from model selection, but it often is.

A second point about the design of optimal classifiers concerns the word 'optimal'. There are several ways of measuring classifier performance, the most common being error rate, although this has severe limitations. Other measures, based on the closeness of the estimates of the probabilities of class membership to the true probabilities, may be more appropriate in many cases. However, many classifier design methods usually optimise alternative criteria since the desired ones are difficult to optimise directly. For example, a classifier may be trained by optimising a squared error measure and assessed using error rate.
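
A minimal sketch of this mismatch between training and assessment criteria (the synthetic two-class data and the plain least-squares fit are both assumptions for illustration): the discriminant is trained by minimising squared error against ±1 targets, but judged by its error rate.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic two-class data with targets coded +1 / -1 (illustrative).
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2.5, 1, (50, 2))])
    t = np.hstack([np.full(50, -1.0), np.full(50, +1.0)])

    # Train by minimising squared error: least-squares fit on augmented inputs.
    A = np.hstack([X, np.ones((X.shape[0], 1))])     # add a bias term
    w, *_ = np.linalg.lstsq(A, t, rcond=None)

    # Assess with a different criterion: error rate of the thresholded output.
    predictions = np.sign(A @ w)
    error_rate = np.mean(predictions != t)
    print(error_rate)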

Finally, we assume that the training data are representative of the test conditions. If this is not so, perhaps because the test conditions may be subject to noise not present in the training data, or there are changes in the population from which the data are drawn (population drift), then these differences must be taken into account in classifier design.

1.4 Supervised versus unsupervised

There are two main divisions of classification: supervised classification (or discrimination) and unsupervised classification (sometimes in the statistics literature simply referred to as classification or clustering).
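
The two divisions can be contrasted on the same data in a short sketch (the synthetic data, the nearest-mean rule and scikit-learn's k-means are illustrative assumptions): with the labels we design a classifier; discarding the labels, we instead look for groups.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)

    # Synthetic two-class data (purely illustrative).
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
    z = np.hstack([np.zeros(30, dtype=int), np.ones(30, dtype=int)])

    # Supervised (discrimination): use the labels z to design a nearest-mean rule.
    means = np.array([X[z == c].mean(axis=0) for c in (0, 1)])
    supervised = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2), axis=1)

    # Unsupervised (clustering): ignore z and let k-means find two groups.
    unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print((supervised == z).mean())   # agreement with the known labels
    print(unsupervised[:5])           # cluster indices, with arbitrary labelling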

In supervised classification we have a set of data samples (each consisting of measurements on a set of variables) with associated labels, the class types. These are used as exemplars in the classifier design.

Why do we wish to design an automatic means of classifying future data? Cannot the same method that was used to label the design set be used on the test data? In some cases this may be possible. However, even if it were possible, in practice we may wish to develop an automatic method to reduce labour-intensive procedures. In other cases, it may not be possible for a human to be part of the classification process. An example of the former is in industrial inspection. A classifier can be trained using images of components on a production line, each image labelled carefully by an operator. However, in the practical application we would wish to save a human operator from the tedious job, and hopefully make it more reliable. An example of the latter reason for performing a classification automatically is in radar target recognition of objects. For vehicle recognition, the data may be gathered by positioning vehicles on a turntable and making measurements from all aspect angles.

Continues...


Excerpted from Statistical Pattern Recognition by Andrew R. Webb, by permission.

Table of Contents

Preface xix

Notation xxiii

1 Introduction to Statistical Pattern Recognition 1

1.1 Statistical Pattern Recognition 1

1.1.1 Introduction 1

1.1.2 The Basic Model 2

1.2 Stages in a Pattern Recognition Problem 4

1.3 Issues 6

1.4 Approaches to Statistical Pattern Recognition 7

1.5 Elementary Decision Theory 8

1.5.1 Bayes’ Decision Rule for Minimum Error 8

1.5.2 Bayes’ Decision Rule for Minimum Error – Reject Option 12

1.5.3 Bayes’ Decision Rule for Minimum Risk 13

1.5.4 Bayes’ Decision Rule for Minimum Risk – Reject Option 15

1.5.5 Neyman–Pearson Decision Rule 15

1.5.6 Minimax Criterion 18

1.5.7 Discussion 19

1.6 Discriminant Functions 20

1.6.1 Introduction 20

1.6.2 Linear Discriminant Functions 21

1.6.3 Piecewise Linear Discriminant Functions 23

1.6.4 Generalised Linear Discriminant Function 24

1.6.5 Summary 26

1.7 Multiple Regression 27

1.8 Outline of Book 29

1.9 Notes and References 29

Exercises 31

2 Density Estimation – Parametric 33

2.1 Introduction 33

2.2 Estimating the Parameters of the Distributions 34

2.2.1 Estimative Approach 34

2.2.2 Predictive Approach 35

2.3 The Gaussian Classifier 35

2.3.1 Specification 35

2.3.2 Derivation of the Gaussian Classifier Plug-In Estimates 37

2.3.3 Example Application Study 39

2.4 Dealing with Singularities in the Gaussian Classifier 40

2.4.1 Introduction 40

2.4.2 Naïve Bayes 40

2.4.3 Projection onto a Subspace 41

2.4.4 Linear Discriminant Function 41

2.4.5 Regularised Discriminant Analysis 42

2.4.6 Example Application Study 44

2.4.7 Further Developments 45

2.4.8 Summary 46

2.5 Finite Mixture Models 46

2.5.1 Introduction 46

2.5.2 Mixture Models for Discrimination 48

2.5.3 Parameter Estimation for Normal Mixture Models 49

2.5.4 Normal Mixture Model Covariance Matrix Constraints 51

2.5.5 How Many Components? 52

2.5.6 Maximum Likelihood Estimation via EM 55

2.5.7 Example Application Study 60

2.5.8 Further Developments 62

2.5.9 Summary 63

2.6 Application Studies 63

2.7 Summary and Discussion 66

2.8 Recommendations 66

2.9 Notes and References 67

Exercises 67

3 Density Estimation – Bayesian 70

3.1 Introduction 70

3.1.1 Basics 72

3.1.2 Recursive Calculation 72

3.1.3 Proportionality 73

3.2 Analytic Solutions 73

3.2.1 Conjugate Priors 73

3.2.2 Estimating the Mean of a Normal Distribution with Known Variance 75

3.2.3 Estimating the Mean and the Covariance Matrix of a Multivariate Normal Distribution 79

3.2.4 Unknown Prior Class Probabilities 85

3.2.5 Summary 87

3.3 Bayesian Sampling Schemes 87

3.3.1 Introduction 87

3.3.2 Summarisation 87

3.3.3 Sampling Version of the Bayesian Classifier 89

3.3.4 Rejection Sampling 89

3.3.5 Ratio of Uniforms 90

3.3.6 Importance Sampling 92

3.4 Markov Chain Monte Carlo Methods 95

3.4.1 Introduction 95

3.4.2 The Gibbs Sampler 95

3.4.3 Metropolis–Hastings Algorithm 103

3.4.4 Data Augmentation 107

3.4.5 Reversible Jump Markov Chain Monte Carlo 108

3.4.6 Slice Sampling 109

3.4.7 MCMC Example – Estimation of Noisy Sinusoids 111

3.4.8 Summary 115

3.4.9 Notes and References 116

3.5 Bayesian Approaches to Discrimination 116

3.5.1 Labelled Training Data 116

3.5.2 Unlabelled Training Data 117

3.6 Sequential Monte Carlo Samplers 119

3.6.1 Introduction 119

3.6.2 Basic Methodology 121

3.6.3 Summary 125

3.7 Variational Bayes 126

3.7.1 Introduction 126

3.7.2 Description 126

3.7.3 Factorised Variational Approximation 129

3.7.4 Simple Example 131

3.7.5 Use of the Procedure for Model Selection 135

3.7.6 Further Developments and Applications 136

3.7.7 Summary 137

3.8 Approximate Bayesian Computation 137

3.8.1 Introduction 137

3.8.2 ABC Rejection Sampling 138

3.8.3 ABC MCMC Sampling 140

3.8.4 ABC Population Monte Carlo Sampling 141

3.8.5 Model Selection 142

3.8.6 Summary 143

3.9 Example Application Study 144

3.10 Application Studies 145

3.11 Summary and Discussion 146

3.12 Recommendations 147

3.13 Notes and References 147

Exercises 148

4 Density Estimation – Nonparametric 150

4.1 Introduction 150

4.1.1 Basic Properties of Density Estimators 150

4.2 k-Nearest-Neighbour Method 152

4.2.1 k-Nearest-Neighbour Classifier 152

4.2.2 Derivation 154

4.2.3 Choice of Distance Metric 157

4.2.4 Properties of the Nearest-Neighbour Rule 159

4.2.5 Linear Approximating and Eliminating Search Algorithm 159

4.2.6 Branch and Bound Search Algorithms: kd-Trees 163

4.2.7 Branch and Bound Search Algorithms: Ball-Trees 170

4.2.8 Editing Techniques 174

4.2.9 Example Application Study 177

4.2.10 Further Developments 178

4.2.11 Summary 179

4.3 Histogram Method 180

4.3.1 Data Adaptive Histograms 181

4.3.2 Independence Assumption (Naïve Bayes) 181

4.3.3 Lancaster Models 182

4.3.4 Maximum Weight Dependence Trees 183

4.3.5 Bayesian Networks 186

4.3.6 Example Application Study – Naïve Bayes Text Classification 190

4.3.7 Summary 193

4.4 Kernel Methods 194

4.4.1 Biasedness 197

4.4.2 Multivariate Extension 198

4.4.3 Choice of Smoothing Parameter 199

4.4.4 Choice of Kernel 201

4.4.5 Example Application Study 202

4.4.6 Further Developments 203

4.4.7 Summary 203

4.5 Expansion by Basis Functions 204

4.6 Copulas 207

4.6.1 Introduction 207

4.6.2 Mathematical Basis 207

4.6.3 Copula Functions 208

4.6.4 Estimating Copula Probability Density Functions 209

4.6.5 Simple Example 211

4.6.6 Summary 212

4.7 Application Studies 213

4.7.1 Comparative Studies 216

4.8 Summary and Discussion 216

4.9 Recommendations 217

4.10 Notes and References 217

Exercises 218

5 Linear Discriminant Analysis 221

5.1 Introduction 221

5.2 Two-Class Algorithms 222

5.2.1 General Ideas 222

5.2.2 Perceptron Criterion 223

5.2.3 Fisher’s Criterion 227

5.2.4 Least Mean-Squared-Error Procedures 228

5.2.5 Further Developments 235

5.2.6 Summary 235

5.3 Multiclass Algorithms 236

5.3.1 General Ideas 236

5.3.2 Error-Correction Procedure 237

5.3.3 Fisher’s Criterion – Linear Discriminant Analysis 238

5.3.4 Least Mean-Squared-Error Procedures 241

5.3.5 Regularisation 246

5.3.6 Example Application Study 246

5.3.7 Further Developments 247

5.3.8 Summary 248

5.4 Support Vector Machines 249

5.4.1 Introduction 249

5.4.2 Linearly Separable Two-Class Data 249

5.4.3 Linearly Nonseparable Two-Class Data 253

5.4.4 Multiclass SVMs 256

5.4.5 SVMs for Regression 257

5.4.6 Implementation 259

5.4.7 Example Application Study 262

5.4.8 Summary 263

5.5 Logistic Discrimination 263

5.5.1 Two-Class Case 263

5.5.2 Maximum Likelihood Estimation 264

5.5.3 Multiclass Logistic Discrimination 266

5.5.4 Example Application Study 267

5.5.5 Further Developments 267

5.5.6 Summary 268

5.6 Application Studies 268

5.7 Summary and Discussion 268

5.8 Recommendations 269

5.9 Notes and References 270

Exercises 270

6 Nonlinear Discriminant Analysis – Kernel and Projection Methods 274

6.1 Introduction 274

6.2 Radial Basis Functions 276

6.2.1 Introduction 276

6.2.2 Specifying the Model 278

6.2.3 Specifying the Functional Form 278

6.2.4 The Positions of the Centres 279

6.2.5 Smoothing Parameters 281

6.2.6 Calculation of the Weights 282

6.2.7 Model Order Selection 284

6.2.8 Simple RBF 285

6.2.9 Motivation 286

6.2.10 RBF Properties 288

6.2.11 Example Application Study 288

6.2.12 Further Developments 289

6.2.13 Summary 290

6.3 Nonlinear Support Vector Machines 291

6.3.1 Introduction 291

6.3.2 Binary Classification 291

6.3.3 Types of Kernel 292

6.3.4 Model Selection 293

6.3.5 Multiclass SVMs 294

6.3.6 Probability Estimates 294

6.3.7 Nonlinear Regression 296

6.3.8 Example Application Study 296

6.3.9 Further Developments 297

6.3.10 Summary 298

6.4 The Multilayer Perceptron 298

6.4.1 Introduction 298

6.4.2 Specifying the MLP Structure 299

6.4.3 Determining the MLP Weights 300

6.4.4 Modelling Capacity of the MLP 307

6.4.5 Logistic Classification 307

6.4.6 Example Application Study 310

6.4.7 Bayesian MLP Networks 311

6.4.8 Projection Pursuit 313

6.4.9 Summary 313

6.5 Application Studies 314

6.6 Summary and Discussion 316

6.7 Recommendations 317

6.8 Notes and References 318

Exercises 318

7 Rule and Decision Tree Induction 322

7.1 Introduction 322

7.2 Decision Trees 323

7.2.1 Introduction 323

7.2.2 Decision Tree Construction 326

7.2.3 Selection of the Splitting Rule 327

7.2.4 Terminating the Splitting Procedure 330

7.2.5 Assigning Class Labels to Terminal Nodes 332

7.2.6 Decision Tree Pruning – Worked Example 332

7.2.7 Decision Tree Construction Methods 337

7.2.8 Other Issues 339

7.2.9 Example Application Study 340

7.2.10 Further Developments 341

7.2.11 Summary 342

7.3 Rule Induction 342

7.3.1 Introduction 342

7.3.2 Generating Rules from a Decision Tree 345

7.3.3 Rule Induction Using a Sequential Covering Algorithm 345

7.3.4 Example Application Study 350

7.3.5 Further Developments 351

7.3.6 Summary 351

7.4 Multivariate Adaptive Regression Splines 351

7.4.1 Introduction 351

7.4.2 Recursive Partitioning Model 351

7.4.3 Example Application Study 355

7.4.4 Further Developments 355

7.4.5 Summary 356

7.5 Application Studies 356

7.6 Summary and Discussion 358

7.7 Recommendations 358

7.8 Notes and References 359

Exercises 359

8 Ensemble Methods 361

8.1 Introduction 361

8.2 Characterising a Classifier Combination Scheme 362

8.2.1 Feature Space 363

8.2.2 Level 366

8.2.3 Degree of Training 368

8.2.4 Form of Component Classifiers 368

8.2.5 Structure 369

8.2.6 Optimisation 369

8.3 Data Fusion 370

8.3.1 Architectures 370

8.3.2 Bayesian Approaches 371

8.3.3 Neyman–Pearson Formulation 373

8.3.4 Trainable Rules 374

8.3.5 Fixed Rules 375

8.4 Classifier Combination Methods 376

8.4.1 Product Rule 376

8.4.2 Sum Rule 377

8.4.3 Min, Max and Median Combiners 378

8.4.4 Majority Vote 379

8.4.5 Borda Count 379

8.4.6 Combiners Trained on Class Predictions 380

8.4.7 Stacked Generalisation 382

8.4.8 Mixture of Experts 382

8.4.9 Bagging 385

8.4.10 Boosting 387

8.4.11 Random Forests 389

8.4.12 Model Averaging 390

8.4.13 Summary of Methods 396

8.4.14 Example Application Study 398

8.4.15 Further Developments 399

8.5 Application Studies 399

8.6 Summary and Discussion 400

8.7 Recommendations 401

8.8 Notes and References 401

Exercises 402

9 Performance Assessment 404

9.1 Introduction 404

9.2 Performance Assessment 405

9.2.1 Performance Measures 405

9.2.2 Discriminability 406

9.2.3 Reliability 413

9.2.4 ROC Curves for Performance Assessment 415

9.2.5 Population and Sensor Drift 419

9.2.6 Example Application Study 421

9.2.7 Further Developments 422

9.2.8 Summary 423

9.3 Comparing Classifier Performance 424

9.3.1 Which Technique is Best? 424

9.3.2 Statistical Tests 425

9.3.3 Comparing Rules When Misclassification Costs are Uncertain 426

9.3.4 Example Application Study 428

9.3.5 Further Developments 429

9.3.6 Summary 429

9.4 Application Studies 429

9.5 Summary and Discussion 430

9.6 Recommendations 430

9.7 Notes and References 430

Exercises 431

10 Feature Selection and Extraction 433

10.1 Introduction 433

10.2 Feature Selection 435

10.2.1 Introduction 435

10.2.2 Characterisation of Feature Selection Approaches 439

10.2.3 Evaluation Measures 440

10.2.4 Search Algorithms for Feature Subset Selection 449

10.2.5 Complete Search – Branch and Bound 450

10.2.6 Sequential Search 454

10.2.7 Random Search 458

10.2.8 Markov Blanket 459

10.2.9 Stability of Feature Selection 460

10.2.10 Example Application Study 462

10.2.11 Further Developments 462

10.2.12 Summary 463

10.3 Linear Feature Extraction 463

10.3.1 Principal Components Analysis 464

10.3.2 Karhunen–Loève Transformation 475

10.3.3 Example Application Study 481

10.3.4 Further Developments 482

10.3.5 Summary 483

10.4 Multidimensional Scaling 484

10.4.1 Classical Scaling 484

10.4.2 Metric MDS 486

10.4.3 Ordinal Scaling 487

10.4.4 Algorithms 490

10.4.5 MDS for Feature Extraction 491

10.4.6 Example Application Study 492

10.4.7 Further Developments 493

10.4.8 Summary 493

10.5 Application Studies 493

10.6 Summary and Discussion 495

10.7 Recommendations 495

10.8 Notes and References 496

Exercises 497

11 Clustering 501

11.1 Introduction 501

11.2 Hierarchical Methods 502

11.2.1 Single-Link Method 503

11.2.2 Complete-Link Method 506

11.2.3 Sum-of-Squares Method 507

11.2.4 General Agglomerative Algorithm 508

11.2.5 Properties of a Hierarchical Classification 508

11.2.6 Example Application Study 509

11.2.7 Summary 509

11.3 Quick Partitions 510

11.4 Mixture Models 511

11.4.1 Model Description 511

11.4.2 Example Application Study 512

11.5 Sum-of-Squares Methods 513

11.5.1 Clustering Criteria 514

11.5.2 Clustering Algorithms 515

11.5.3 Vector Quantisation 520

11.5.4 Example Application Study 530

11.5.5 Further Developments 530

11.5.6 Summary 531

11.6 Spectral Clustering 531

11.6.1 Elementary Graph Theory 531

11.6.2 Similarity Matrices 534

11.6.3 Application to Clustering 534

11.6.4 Spectral Clustering Algorithm 535

11.6.5 Forms of Graph Laplacian 535

11.6.6 Example Application Study 536

11.6.7 Further Developments 538

11.6.8 Summary 538

11.7 Cluster Validity 538

11.7.1 Introduction 538

11.7.2 Statistical Tests 539

11.7.3 Absence of Class Structure 540

11.7.4 Validity of Individual Clusters 541

11.7.5 Hierarchical Clustering 542

11.7.6 Validation of Individual Clusterings 542

11.7.7 Partitions 543

11.7.8 Relative Criteria 543

11.7.9 Choosing the Number of Clusters 545

11.8 Application Studies 546

11.9 Summary and Discussion 549

11.10 Recommendations 551

11.11 Notes and References 552

Exercises 553

12 Complex Networks 555

12.1 Introduction 555

12.1.1 Characteristics 557

12.1.2 Properties 557

12.1.3 Questions to Address 559

12.1.4 Descriptive Features 560

12.1.5 Outline 560

12.2 Mathematics of Networks 561

12.2.1 Graph Matrices 561

12.2.2 Connectivity 562

12.2.3 Distance Measures 562

12.2.4 Weighted Networks 563

12.2.5 Centrality Measures 563

12.2.6 Random Graphs 564

12.3 Community Detection 565

12.3.1 Clustering Methods 565

12.3.2 Girvan–Newman Algorithm 568

12.3.3 Modularity Approaches 570

12.3.4 Local Modularity 571

12.3.5 Clique Percolation 573

12.3.6 Example Application Study 574

12.3.7 Further Developments 575

12.3.8 Summary 575

12.4 Link Prediction 575

12.4.1 Approaches to Link Prediction 576

12.4.2 Example Application Study 578

12.4.3 Further Developments 578

12.5 Application Studies 579

12.6 Summary and Discussion 579

12.7 Recommendations 580

12.8 Notes and References 580

Exercises 580

13 Additional Topics 581

13.1 Model Selection 581

13.1.1 Separate Training and Test Sets 582

13.1.2 Cross-Validation 582

13.1.3 The Bayesian Viewpoint 583

13.1.4 Akaike’s Information Criterion 583

13.1.5 Minimum Description Length 584

13.2 Missing Data 585

13.3 Outlier Detection and Robust Procedures 586

13.4 Mixed Continuous and Discrete Variables 587

13.5 Structural Risk Minimisation and the Vapnik–Chervonenkis Dimension 588

13.5.1 Bounds on the Expected Risk 588

13.5.2 The VC Dimension 589

References 591

Index 637

What People are Saying About This

From the Publisher

"In the end I must add that this book is so appealing that I often found myself lost in the reading, pausing the overview of the manuscript in order to look more into some presented subject, and not being able to continue until I had finished seeing all about it.” (Zentralblatt MATH, 1 December 2012)

From the B&N Reads Blog

Customer Reviews