
Analytica Chimica Acta

Volume 760, 14 January 2013, Pages 25-33

Sample size planning for classification models

https://doi.org/10.1016/j.aca.2012.11.007

Abstract

In biospectroscopy, suitably annotated and statistically independent samples (e.g. patients, batches, etc.) for classifier training and testing are scarce and costly. Learning curves show the model performance as function of the training sample size and can help to determine the sample size needed to train good classifiers. However, building a good model is actually not enough: the performance must also be proven. We discuss learning curves for typical small sample size situations with 5–25 independent samples per class. Although the classification models achieve acceptable performance, the learning curve can be completely masked by the random testing uncertainty due to the equally limited test sample size. In consequence, we determine test sample sizes necessary to achieve reasonable precision in the validation and find that 75–100 samples will usually be needed to test a good but not perfect classifier. Such a data set will then allow refined sample size planning on the basis of the achieved performance. We also demonstrate how to calculate necessary sample sizes in order to show the superiority of one classifier over another: this often requires hundreds of statistically independent test samples or is even theoretically impossible. We demonstrate our findings with a data set of ca. 2550 Raman spectra of single cells (five classes: erythrocytes, leukocytes and three tumour cell lines BT-20, MCF-7 and OCI-AML3) as well as by an extensive simulation that allows precise determination of the actual performance of the models in question.

Highlights

  • We compare sample size requirements for classifier training and testing.
  • Number of training samples: determine from learning curve.
  • Test sample size: specify confidence interval width or model to compare to.
  • Classifier testing needs far more samples than training.
  • Start with at least 75 cases per class, then refine sample size planning.

Introduction

Sample size planning is an important aspect of the design of experiments. While this study explicitly targets sample size planning in the context of biospectroscopic classification, the ideas and conclusions apply to a much wider range of applications. Biospectroscopy suffers from an extreme scarcity of statistically independent samples, but small sample size problems are also common in many other fields of application.

In the context of biospectroscopic studies, suitably annotated and statistically independent samples for classifier training and validation are frequently rare and costly. Moreover, the classification problems are often rather ill-posed (e.g. diseased vs. non-diseased). In these situations, particular classes are extremely rare, and/or large sample sizes are necessary to cover classes that are rather ill-defined, like “not this disease” or “out of specification”. In addition, ethical considerations often restrict the number of patients or animals that can be studied.

Even though the data sets often consist of thousands of spectra, the statistically relevant number of independent cases is often extremely small due to the “hierarchical” structure of biospectroscopic data sets: many spectra are taken of the same specimen, and possibly multiple specimens of the same patient are available; or many spectra are taken of each cell, and a number of cells are measured for each cultivation batch, etc. In these situations, the number of statistically independent cases is given by the sample size at the highest level of the data hierarchy, i.e. patients or cell culture batches. All these reasons together lead to sample sizes that are typically on the order of 5–25 statistically independent cases per class.

Learning curves describe the development of the performance of chemometric models as a function of the training sample size. The true performance depends on the difficulty of the task at hand and must therefore be measured in preliminary experiments. Estimation of necessary sample sizes for medical classification has been carried out based on learning curves [1], [2] as well as on model-based considerations [3], [4]. In pattern recognition, necessary training sample sizes have been discussed for a long time (e.g. [5], [6], [7]).

However, building a good model is not enough: the quality of the model needs to be demonstrated.

One may think of training a classifier as the process of measuring the model parameters (coefficients, etc.). Likewise, testing a classifier can be described as a measurement of the model performance. Like other measured values, both the parameters of the model and the observed performance are subject to systematic (bias) and random (variance) uncertainty.

Classifier performance is often expressed in fractions of test cases, counted from different parts of the confusion matrix, see Fig. 1. These ratios summarize characteristic aspects of performance like sensitivity (SensA: “How well does the model recognize truly diseased samples?”, Fig. 1(b)), specificity (SpecA: “How well does the classifier recognize the absence of the disease?”, Fig. 1(c)), positive and negative predictive values (PPVA/NPVA: “Given the classifier diagnoses disease/non-disease, what is the probability that this is true?”, Fig. 1(d) and (e)). Sometimes further ratios, e.g. the overall fraction of correct predictions or misclassifications, are used.
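As a concrete illustration, the following minimal R sketch computes these ratios from a hypothetical two-class confusion matrix; the counts are invented for illustration only, and class A plays the role of the “diseased” class.

    ## Hypothetical 2 x 2 confusion matrix for class A vs. "not A";
    ## rows = true class, columns = predicted class (counts are invented).
    cm <- matrix(c(40,  5,     # truly A:     40 predicted A,  5 predicted not A
                    8, 47),    # truly not A:  8 predicted A, 47 predicted not A
                 nrow = 2, byrow = TRUE,
                 dimnames = list(truth = c("A", "notA"), prediction = c("A", "notA")))

    sens_A  <- cm["A", "A"]       / sum(cm["A", ])     # Fig. 1(b): correct A / all truly A
    spec_A  <- cm["notA", "notA"] / sum(cm["notA", ])  # Fig. 1(c): correct notA / all truly notA
    ppv_A   <- cm["A", "A"]       / sum(cm[, "A"])     # Fig. 1(d): correct A / all predicted A
    npv_A   <- cm["notA", "notA"] / sum(cm[, "notA"])  # Fig. 1(e): correct notA / all predicted notA
    overall <- sum(diag(cm))      / sum(cm)            # overall fraction of correct predictions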

The predictive values, while obviously of more interest to the user of a classifier than sensitivity and specificity, cannot be calculated without knowing the relative frequencies (prior probabilities) of the classes.
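If the relative class frequencies are known from elsewhere, the predictive values can be obtained from sensitivity and specificity via Bayes' theorem; the sketch below uses assumed (hypothetical) values purely for illustration.

    ## Positive predictive value from sensitivity, specificity and prevalence
    ## (Bayes' theorem); the numerical values are assumptions for illustration only.
    ppv <- function(sens, spec, prevalence)
      sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))

    ppv(sens = 0.90, spec = 0.90, prevalence = 0.05)  # ~0.32: even a "good" classifier
                                                      # has a low PPV for a rare class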

From the sample size point of view, one important difference between these ratios is the number of test cases n_test that appears in the denominator. This test sample size plays a crucial role in determining the random uncertainty of the observed performance p̂ (see below). Particularly in multi-class problems, this test sample size varies widely: the number of test cases truly belonging to the different classes may differ, leading to different and rather small test sample sizes for determining the sensitivities of the different classes. In contrast, the overall fraction of correct or misclassified samples uses all tested samples in the denominator.

The specificity is calculated from all samples that truly do not belong to the particular class (Fig. 1(c)). Compared to the sensitivities, the test sample size in the denominator of the specificities is therefore usually larger and the performance estimate more precise (with the exception of binary classification, where the specificity of one class is the sensitivity of the other). Thus small sample size problems in the context of measuring classifier performance are better illustrated with sensitivities. It should also be kept in mind that the specificity often corresponds to an ill-posed question: “not class A” may be anything, yet not all possibilities of a sample truly not belonging to class A are of the same interest. In multi-class set-ups, the specificity will often pool easy distinctions with more difficult differential diagnoses. In our application [8], [9], the specificity for recognizing that a cell does not come from the BT-20 cell line pools, e.g., the fact that it is not an erythrocyte (which can easily be determined by eye without any need for chemometric analysis) with the fact that it does not come from the MCF-7 cell line, which is far more similar (yet from a clinical point of view possibly of low interest, as both are breast cancer cell lines), and with the clinically important fact that it does not belong to the OCI-AML3 leukaemia.

This pooling of all other classes has important consequences. Increasing the number of test cases in easily distinguished classes (erythrocytes) will lead to improved specificities without any improvement for the clinically relevant differential diagnoses. Also, it must be kept in mind that random predictions (guessing) already lead to specificities that seem to be very good. For our real data set with five different classes, guessing yields specificities between 0.77 and 0.85. Reported sensitivities should also be read in relation to guessing performance, but neglecting to do so does not cause the same intuitive overestimation of the prediction quality: guessing sensitivities are only around 0.20 in our five-class problem.
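These guessing levels can be checked with a small simulation. The R sketch below assigns purely random predictions to five hypothetically equally frequent classes, so the guessing specificities come out near 0.8 and the guessing sensitivities near 0.2; the 0.77–0.85 range of the real data set reflects its unequal class sizes.

    ## Sensitivity and specificity of pure guessing in a five-class problem
    ## with (hypothetically) equal class frequencies.
    set.seed(1)
    classes <- c("erythrocyte", "leukocyte", "BT-20", "MCF-7", "OCI-AML3")
    truth <- sample(classes, 10000, replace = TRUE)
    guess <- sample(classes, 10000, replace = TRUE)
    cm <- table(truth, guess)

    sens <- sapply(classes, function(cl)          # ~0.2 for every class
      cm[cl, cl] / sum(cm[cl, ]))
    spec <- sapply(classes, function(cl)          # ~0.8 for every class
      sum(cm[rownames(cm) != cl, colnames(cm) != cl]) / sum(cm[rownames(cm) != cl, ]))
    round(rbind(sensitivity = sens, specificity = spec), 2)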

Examining the off-diagonal elements of the confusion table instead of the specificities avoids these problems. If reported as fractions of the test cases truly belonging to a class, all elements of the confusion table behave like the sensitivities on the diagonal; if reported as fractions of the cases predicted to belong to a class, the entries behave like the positive predictive values (again on the diagonal).

Literature guidance on how to obtain low total uncertainty and how to validate different aspects of model performance is available [10], [11], [12], [13], [14]. In classifier testing, usually several assumptions are implicitly made which are closely related to the behaviour of the performance measurements in terms of systematic and random uncertainty.

Classification tests are usually described as a Bernoulli process (repeated coin throwing, following a binomial distribution): n_test samples are tested, and thereof k successes (or errors) are observed. The true performance of the model is p, and its point estimate is

  p̂ = k / n_test,  with variance  Var(k / n_test) = p (1 − p) / n_test.
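The practical consequence of this variance is easiest to see via confidence intervals. The sketch below assumes, hypothetically, that 90 % of the test cases are classified correctly and computes exact Clopper-Pearson 95 % intervals for a few test sample sizes using base R's binom.test.

    ## Random testing uncertainty of the observed proportion p-hat = k / n_test:
    ## exact (Clopper-Pearson) 95 % confidence intervals, assuming an observed
    ## hit rate of 0.9 (hypothetical value chosen for illustration).
    for (n_test in c(10, 20, 100)) {
      k  <- round(0.9 * n_test)
      ci <- binom.test(k, n_test)$conf.int
      cat(sprintf("n_test = %3d: p-hat = %.2f, 95%% CI %.2f to %.2f\n",
                  n_test, k / n_test, ci[1], ci[2]))
    }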

In small sample size situations, resampling strategies like the bootstrap or repeated/iterated k-fold cross validation are most appropriate. These strategies estimate the performance by setting aside a (small) part of the samples for independent testing and building a model without these samples, the surrogate model. The surrogate model is then tested with the remaining samples. The test results are refined by repeating/iterating this procedure a number of times. Usually, the average performance over all surrogate models is reported. This is an unbiased estimate of the performance of models with the same training sample size as the surrogate models [11], [1]. Note that the observed variance over the surrogate models possibly underestimates the true variance of the performance of models trained with ntrain training cases [1]. This is intuitively clear if one thinks of a situation where the surrogate models are perfectly stable, i.e. different surrogate models yield the same prediction for any given case. No variance is observed between different iterations of a k-fold cross validation. Yet, the observed performance is still subject to the random uncertainty due to the finite test sample size of the underlying Bernoulli process.
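A minimal sketch of such an iterated k-fold cross-validation is shown below; the simulated two-class data and the logistic-regression surrogate models are stand-ins chosen for brevity and are not the spectra or models of this study.

    ## Iterated k-fold cross-validation with surrogate models (simulated data).
    set.seed(2)
    n   <- 50
    dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
    dat$y <- factor(ifelse(dat$x1 + dat$x2 + rnorm(n, sd = 0.5) > 0, "A", "B"))

    k <- 5; iterations <- 20
    hit_rates <- replicate(iterations, {
      folds <- sample(rep(seq_len(k), length.out = n))    # random fold assignment
      mean(sapply(seq_len(k), function(i) {
        train     <- dat[folds != i, ]
        test      <- dat[folds == i, ]
        surrogate <- glm(y ~ x1 + x2, family = binomial, data = train)
        pred <- ifelse(predict(surrogate, test, type = "response") > 0.5, "B", "A")
        mean(pred == test$y)                              # hit rate of this surrogate model
      }))
    })
    mean(hit_rates)  # pooled estimate for models trained with ~(k-1)/k of the samples
    sd(hit_rates)    # spread over iterations; may understate the true model-to-model variance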

Usually, the performance measured with the surrogate models is used as an approximation of the performance of a model trained with all samples, the final model. The underlying assumption is that setting aside the surrogate test data does not affect the model performance. In other words, the learning curve is assumed to be flat between the training sample size of the surrogate models and the training sample size of the final model. Violation of this assumption causes the well-known pessimistic bias of resampling-based validation schemes.

The results of testing many surrogate models are usually pooled. Strictly speaking, pooling is allowed only if the distributions of the pooled variables are equal. In the description of the testing procedure as a Bernoulli process, this means that pooling is allowed if the surrogate models have equal true performance p; in other words, if the predictions of the models are stable with respect to perturbations of the training set, i.e. if exchanging a few samples does not change the predictions. Consequently, model instability causes additional variance in the measured performance.

Here, we discuss the implications of these two aspects of sample size planning using a Raman-spectroscopic five-class classification problem: the recognition of five different cell types that can be present in blood. In addition to the measured data set, the results are complemented by a simulation that allows arbitrarily precise testing.

Section snippets

Raman spectra of single cells

Raman spectra of five different types of cells that could be present in blood are used in this study. Details of the preparation, measurements and the application have been published previously [8], [9]. The data were measured in a stratified manner, specifying roughly equal numbers of cells per class beforehand, and do not reflect relative frequencies of the different cells in a target patient population. Thus, we cannot calculate predictive values for our classifiers.

For this study, the

Learning curves

The learning curve describes the performance of a given classifier for a problem as a function of the training sample size [10]. The prediction errors a classifier makes may be divided into four categories:

  1. the irreducible or Bayes error,
  2. the bias due to the model setup,
  3. additional systematic deviations (bias), and
  4. random deviations (variance).

The best possible performance that can be achieved with a given model setup consists of the Bayes error, i.e. the best possible performance for the best
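As an illustration of how such a learning curve can be estimated empirically, the following R sketch trains a deliberately simple nearest-mean classifier on increasing numbers of simulated training cases per class and records the sensitivity for one class on a large, independent simulated test set, so that the testing uncertainty is negligible. Neither the data nor the classifier correspond to those of the present study.

    ## Empirical learning curve for a toy nearest-mean classifier: one informative
    ## variable plus 10 pure-noise variables, everything simulated.
    set.seed(3)
    d <- 11                                    # 1 informative + 10 noise variables
    simulate <- function(n_per_class) {
      X <- matrix(rnorm(2 * n_per_class * d), ncol = d)
      X[seq_len(n_per_class), 1] <- X[seq_len(n_per_class), 1] + 1.5  # shift class A
      list(X = X, y = rep(c("A", "B"), each = n_per_class))
    }

    test    <- simulate(1000)                  # large test set: test uncertainty is small
    n_train <- c(5, 10, 15, 20, 25)            # training cases per class

    curve <- sapply(n_train, function(n) mean(replicate(20, {
      train     <- simulate(n)
      centroids <- rbind(A = colMeans(train$X[train$y == "A", , drop = FALSE]),
                         B = colMeans(train$X[train$y == "B", , drop = FALSE]))
      pred <- apply(test$X, 1, function(z)     # assign each test case to the nearest class mean
        rownames(centroids)[which.min(rowSums((centroids - rep(z, each = 2))^2))])
      mean(pred[test$y == "A"] == "A")         # sensitivity for class A
    })))
    round(setNames(curve, n_train), 3)         # mean sensitivity vs. training sample size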

Sample size requirements for classifier testing

Thus, the precise measurement of the classifier performance turns out to be more complicated in such small sample size situations. Sample size planning for classification therefore also needs to take into account the sample size requirements for testing the classifier. We discuss here two important scenarios that allow estimating required test sample sizes: firstly, specifying an acceptable width for a confidence interval of the performance measure, and secondly, the number of test
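A rough impression of the resulting test sample sizes can be obtained with base R using normal approximations; the target values below (sensitivities of 0.9 vs. 0.8, 95 % confidence level, 80 % power) are assumptions chosen for illustration and do not reproduce the exact procedure of this paper.

    ## (a) Test sample size so that a 95 % CI around an expected sensitivity of 0.9
    ##     has a half-width of roughly 0.05 (normal approximation):
    p <- 0.9; halfwidth <- 0.05
    ceiling(qnorm(0.975)^2 * p * (1 - p) / halfwidth^2)      # on the order of 10^2 test cases

    ## (b) Test cases per classifier needed to show that a sensitivity of 0.9 is
    ##     superior to 0.8 (unpaired two-proportion approximation; a paired
    ##     McNemar-type comparison on a shared test set needs a different calculation):
    power.prop.test(p1 = 0.8, p2 = 0.9, sig.level = 0.05, power = 0.8)$n  # several hundred in total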

Summary

Using a Raman-spectroscopic five-class classification problem as well as simulated data based on the real data set, we compared the sample sizes needed to train good classifiers with the sample sizes needed to demonstrate that the obtained classifiers work well. Due to the smaller test sample size, sensitivities are more difficult to determine precisely than specificities or overall hit rates.

Using typical small sample sizes of up to 25 samples per class, we calculated learning curves (sensitivity

Acknowledgements

Graphics were generated using ggplot2 [31].

Financial support by the European Union via the Europäischer Fonds für Regionale Entwicklung (EFRE) and the Thüringer Ministerium für Bildung, Wissenschaft und Kultur (project B714-07037), as well as funding by the BMBF (FKZ 01EO1002), is gratefully acknowledged.

References (31)

  • A. Jain et al., Dimensionality and sample size considerations in pattern recognition practice.
  • C. Beleites et al., Variance reduction in estimating classification error using sparse datasets, Chemom. Intell. Lab. Syst. (2005).
  • S. Mukherjee et al., Estimating dataset size requirements for classifying DNA microarray data, J. Comput. Biol. (2003).
  • R.L. Figueroa et al., Predicting sample size required for classification performance, BMC Med. Inform. Decis. Mak. (2012).
  • K.K. Dobbin et al., Sample size planning for developing classifiers using high-dimensional DNA microarray data, Biostatistics (2007).
  • K.K. Dobbin et al., How large a training set is needed to develop a classifier for microarray data?, Clin. Cancer Res. (2008).
  • S. Raudys et al., Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Anal. Mach. Intell. (1991).
  • H.M. Kalayeh et al., Predicting the required number of training samples, IEEE Trans. Pattern Anal. Mach. Intell. (1983).
  • U. Neugebauer et al., Towards detection and identification of circulating tumour cells using Raman spectroscopy, Analyst (2010).
  • U. Neugebauer et al., Identification and differentiation of single cells from peripheral blood by Raman spectroscopic imaging, J. Biophotonics (2010).
  • T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference and Prediction (2009).
  • E.R. Dougherty et al., Performance of error estimators for classification, Curr. Bioinform. (2010).
  • R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection.
  • K.H. Esbensen et al., Principles of proper validation: use and abuse of re-sampling for validation, J. Chemometr. (2010).
  • R Development Core Team, R: A Language and Environment for Statistical Computing (2011).

    Paper presented at the XIII Conference on Chemometrics in Analytical Chemistry (CAC 2012), Budapest, Hungary, 25–29 June 2012.
