Sample size planning for classification models☆
Highlights
- We compare sample size requirements for classifier training and testing.
- Number of training samples: determine from learning curve.
- Test sample size: specify confidence interval width or model to compare to.
- Classifier testing needs far more samples than training.
- Start with at least 75 cases per class, then refine sample size planning.
Introduction
Sample size planning is an important aspect of the design of experiments. While this study explicitly targets sample size planning in the context of biospectroscopic classification, the ideas and conclusions apply to a much wider range of applications. Biospectroscopy suffers from extreme scarcity of statistically independent samples, but small sample size problems are also common in many other fields of application.
In the context of biospectroscopic studies, suitably annotated and statistically independent samples for classifier training and validation are frequently rare and costly. Moreover, the classification problems are often rather ill-posed (e.g. diseased vs. non-diseased). In these situations, particular classes are extremely rare, and/or large sample sizes are necessary to cover classes that are rather ill-defined, like “not this disease” or “out of specification”. In addition, ethical considerations often restrict the number of patients or animals studied.
Even though the data sets often consist of thousands of spectra, the statistically relevant number of independent cases is often extremely small due to the “hierarchical” structure of biospectroscopic data sets: many spectra are taken of the same specimen, and possibly multiple specimens of the same patient are available. Or, many spectra are taken of each cell, and a number of cells are measured for each cultivation batch, etc. In these situations, the number of statistically independent cases is given by the sample size on the highest level of the data hierarchy, i.e. patients or cell culture batches. All these reasons together lead to sample sizes that are typically on the order of 5 to 25 statistically independent cases per class.
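This counting rule can be made concrete with a short sketch (Python used purely for illustration; the patient and specimen labels are invented):

```python
# Hypothetical hierarchical data set: each spectrum is tagged with the
# patient and specimen it came from. The statistically independent sample
# size is the number of distinct units at the highest hierarchy level
# (here: patients), not the number of spectra.
spectra = [("P1", "s1"), ("P1", "s1"), ("P1", "s2"),
           ("P2", "s1"), ("P2", "s1"), ("P2", "s2")]

n_spectra = len(spectra)                                  # 6 measured spectra
n_independent = len({patient for patient, _ in spectra})  # only 2 patients
```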
Learning curves describe the development of the performance of chemometric models as function of the training sample size. The true performance depends on the difficulty of the task at hand and must therefore be measured by preliminary experiments. Estimation of necessary sample sizes for medical classification has been done based on learning curves [1], [2] as well as on model based considerations [3], [4]. In pattern recognition, necessary training sample sizes have been discussed for a long time (e.g. [5], [6], [7]).
However, building a good model is not enough: the quality of the model needs to be demonstrated.
One may think of training a classifier as the process of measuring the model parameters (coefficients, etc.). Likewise, testing a classifier can be described as a measurement of the model performance. Like other measured values, both the parameters of the model and the observed performance are subject to systematic (bias) and random (variance) uncertainty.
Classifier performance is often expressed in fractions of test cases, counted from different parts of the confusion matrix, see Fig. 1. These ratios summarize characteristic aspects of performance like sensitivity (SensA: “How well does the model recognize truly diseased samples?”, Fig. 1(b)), specificity (SpecA: “How well does the classifier recognize the absence of the disease?”, Fig. 1(c)), positive and negative predictive values (PPVA/NPVA: “Given the classifier diagnoses disease/non-disease, what is the probability that this is true?”, Fig. 1(d) and (e)). Sometimes further ratios, e.g. the overall fraction of correct predictions or misclassifications, are used.
The predictive values, while obviously of more interest to the user of a classifier than sensitivity and specificity, cannot be calculated without knowing the relative frequencies (prior probabilities) of the classes.
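As a minimal illustration of how these ratios differ only in which part of the confusion matrix supplies the denominator, consider the following sketch (the two-class confusion matrix is invented; Python is used as a neutral notation):

```python
def class_ratios(C, a):
    """Performance ratios for class a from a confusion matrix C, where
    C[i][j] counts test cases truly in class i and predicted as class j."""
    total = sum(sum(row) for row in C)
    tp = C[a][a]                          # truly a, predicted a
    fn = sum(C[a]) - tp                   # truly a, predicted otherwise
    fp = sum(row[a] for row in C) - tp    # predicted a, truly another class
    tn = total - tp - fn - fp             # correctly rejected
    return {"sensitivity": tp / (tp + fn),   # denominator: truly class a
            "specificity": tn / (tn + fp),   # denominator: truly not a
            "ppv":         tp / (tp + fp),   # denominator: predicted a
            "npv":         tn / (tn + fn)}   # denominator: predicted not a

# hypothetical test results: 40 diseased and 60 non-diseased cases
conf = [[36, 4],
        [6, 54]]
ratios = class_ratios(conf, 0)
```

Note that the PPV and NPV computed this way are only meaningful if the test set reflects the true relative class frequencies; for a stratified test set they must instead be derived from sensitivity, specificity and the prior probabilities.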
From the sample size point of view, one important difference between these ratios is the number of test cases ntest that appears in the denominator. This test sample size plays a crucial role in determining the random uncertainty of the observed performance (see below). Particularly in multi-class problems, this test sample size varies widely: the number of test cases truly belonging to the different classes may differ, leading to different and rather small test sample sizes for determining the sensitivity p of each class. In contrast, the overall fractions of correct or misclassified samples use all tested samples in the denominator.
The specificity is calculated from all samples that truly do not belong to the particular class (Fig. 1(c)). Compared to the sensitivities, the test sample size in the denominator of the specificities is therefore usually larger and the performance estimate more precise (with the exception of binary classification, where the specificity of one class is the sensitivity of the other). Thus, small sample size problems in the context of measuring classifier performance are better illustrated with sensitivities.

It should also be kept in mind that the specificity often corresponds to an ill-posed question: “not class A” may be anything, yet not all ways of truly not belonging to class A are of the same interest. In multi-class set-ups, the specificity will often pool easy distinctions with more difficult differential diagnoses. In our application [8], [9], the specificity for recognizing that a cell does not come from the BT-20 cell line pools, e.g., the fact that it is not an erythrocyte (which can easily be determined by eye without any need for chemometric analysis) with the fact that it does not come from the MCF-7 cell line, which is far more similar (yet from a clinical point of view possibly of low interest, as both are breast cancer cell lines), and the clinically important fact that it does not belong to the OCI-AML3 leukaemia. This pooling of all other classes has important consequences: increasing the number of test cases in easily distinguished classes (erythrocytes) will improve the specificities without any improvement for the clinically relevant differential diagnoses. It must also be kept in mind that random predictions (guessing) already lead to specificities that seem to be very good: for our real data set with five different classes, guessing yields specificities between 0.77 and 0.85.
Reported sensitivities should also be read in relation to guessing performance, but here neglecting to do so is less misleading: guessing sensitivities are only around 0.20 in our five-class problem, so they do not intuitively suggest good prediction quality.
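These guessing baselines follow from a simple observation: if a classifier assigns class c at random with probability q_c (independent of the true class), its expected sensitivity for c is q_c and its expected specificity is 1 − q_c. A sketch (the five guessing probabilities below are invented for illustration, not the real class frequencies):

```python
def guessing_performance(q):
    """Expected sensitivities and specificities of a classifier that assigns
    class c at random with probability q[c], independent of the true class."""
    assert abs(sum(q.values()) - 1.0) < 1e-9
    sens = dict(q)                                # P(predict c | truly c) = q[c]
    spec = {c: 1.0 - qc for c, qc in q.items()}   # P(predict not c | truly not c)
    return sens, spec

# invented guessing probabilities for a five-class problem
q = {"A": 0.23, "B": 0.21, "C": 0.20, "D": 0.19, "E": 0.17}
sens, spec = guessing_performance(q)
# specificities range from 1 - 0.23 = 0.77 up to 1 - 0.17 = 0.83,
# while no sensitivity exceeds 0.23
```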
Examining the non-diagonal parts of the confusion table instead of specificities avoids these problems. If reported as fractions of test cases truly belonging to that class, then all elements of the confusion table behave like the sensitivities on the diagonal, if reported as fractions of cases predicted to belong to that class, the entries behave like the positive predictive values (again on the diagonal).
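The two normalizations can be sketched directly (rows hold the true class, columns the predicted class; the numbers are invented):

```python
C = [[36, 4],
     [6, 54]]                     # C[i][j]: truly class i, predicted class j

# Normalize each row by the number of cases truly in that class:
# diagonal entries then behave like sensitivities.
row_norm = [[v / sum(row) for v in row] for row in C]

# Normalize each column by the number of cases predicted as that class:
# diagonal entries then behave like positive predictive values.
col_sums = [sum(col) for col in zip(*C)]
col_norm = [[C[i][j] / col_sums[j] for j in range(len(col_sums))]
            for i in range(len(C))]
```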
Literature guidance on how to obtain low total uncertainty and how to validate different aspects of model performance is available [10], [11], [12], [13], [14]. In classifier testing, usually several assumptions are implicitly made which are closely related to the behaviour of the performance measurements in terms of systematic and random uncertainty.
Classification tests are usually described as a Bernoulli process (repeated coin throwing, following a binomial distribution): ntest samples are tested, and thereof k successes (or errors) are observed. The true performance of the model is p; its point estimate is p̂ = k / ntest, with variance σ²(p̂) = p (1 − p) / ntest.
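Under this binomial model, the precision attainable with small test sample sizes can be checked directly. A sketch using the Wilson score interval (one common choice for binomial confidence intervals; other interval methods give similar widths):

```python
from math import sqrt

def binomial_test_result(k, n_test, z=1.96):
    """Point estimate p_hat = k / n_test and approximate 95 % Wilson score
    confidence interval for the true performance p of the Bernoulli process."""
    p_hat = k / n_test
    denom = 1 + z ** 2 / n_test
    center = (p_hat + z ** 2 / (2 * n_test)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n_test
                              + z ** 2 / (4 * n_test ** 2))
    return p_hat, (center - half, center + half)

# 9 of 10 test cases correct: p_hat = 0.9, but the interval is very wide
p_hat, (lo, hi) = binomial_test_result(9, 10)
```

With only 10 test cases, an observed sensitivity of 0.9 is still compatible with true performances from roughly 0.6 to 0.98.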
In small sample size situations, resampling strategies like the bootstrap or repeated/iterated k-fold cross validation are most appropriate. These strategies estimate the performance by setting aside a (small) part of the samples for independent testing and building a model without these samples, the surrogate model. The surrogate model is then tested with the remaining samples. The test results are refined by repeating/iterating this procedure a number of times. Usually, the average performance over all surrogate models is reported. This is an unbiased estimate of the performance of models with the same training sample size as the surrogate models [11], [1]. Note that the observed variance over the surrogate models possibly underestimates the true variance of the performance of models trained with ntrain training cases [1]. This is intuitively clear if one thinks of a situation where the surrogate models are perfectly stable, i.e. different surrogate models yield the same prediction for any given case. No variance is observed between different iterations of a k-fold cross validation. Yet, the observed performance is still subject to the random uncertainty due to the finite test sample size of the underlying Bernoulli process.
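Such an iterated k-fold procedure can be sketched as follows (pure-Python illustration; the nearest-class-mean toy classifier and the 1-D data are invented stand-ins for the actual chemometric model):

```python
import random
from statistics import mean

def iterated_kfold(x, y, train_fn, predict_fn, k=5, iterations=10, seed=0):
    """Iterated k-fold cross validation: each iteration re-partitions the
    statistically independent cases into k folds; each surrogate model is
    trained on k-1 folds and tested on the held-out fold. The average hit
    rate over all surrogate models is reported."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    hit_rates = []
    for _ in range(iterations):
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for test_fold in folds:
            train_idx = [i for i in idx if i not in test_fold]
            model = train_fn([x[i] for i in train_idx],
                             [y[i] for i in train_idx])
            hits = [predict_fn(model, x[i]) == y[i] for i in test_fold]
            hit_rates.append(mean(hits))
    return mean(hit_rates)

# invented toy example: 1-D data, nearest-class-mean classifier
def train(x, y):
    return {c: mean(v for v, l in zip(x, y) if l == c) for c in set(y)}

def predict(model, v):
    return min(model, key=lambda c: abs(model[c] - v))

x = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1] * 3
y = ["a", "a", "a", "b", "b", "b"] * 3
acc = iterated_kfold(x, y, train, predict, k=3, iterations=5)
```

Note that the shuffling happens at the level of the independent cases; with hierarchical data, all spectra of one patient would have to stay together in one fold.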
Usually, the performance measured with the surrogate models is used as an approximation of the performance of a model trained with all samples, the final model. The underlying assumption is that setting aside the surrogate test data does not affect the model performance. In other words, the learning curve is assumed to be flat between the training sample size of the surrogate models and the training sample size of the final model. Violation of this assumption causes the well-known pessimistic bias of resampling based validation schemes.
The results of testing many surrogate models are usually pooled. Strictly speaking, pooling is allowed only if the distributions of the pooled variables are equal. In the description of the testing procedure as a Bernoulli process, this means the surrogate models must have equal true performance p — in other words, the predictions of the models must be stable with respect to perturbed training sets, i.e. exchanging a few training samples must not change the predictions. Consequently, model instability causes additional variance in the measured performance.
Here, we discuss the implications of these two aspects of sample size planning with a Raman-spectroscopic five-class classification problem: the recognition of five different cell types that can be present in blood. In addition to the measured data set, the results are complemented by a simulation which allows arbitrary test precision.
Section snippets
Raman spectra of single cells
Raman spectra of five different types of cells that could be present in blood are used in this study. Details of the preparation, measurements and the application have been published previously [8], [9]. The data were measured in a stratified manner, specifying roughly equal numbers of cells per class beforehand, and do not reflect relative frequencies of the different cells in a target patient population. Thus, we cannot calculate predictive values for our classifiers.
For this study, the
Learning curves
The learning curve describes the performance of a given classifier for a problem as function of the training sample size [10]. The prediction errors a classifier makes may be divided into four categories:
1. the irreducible or Bayes error,
2. the bias due to the model setup,
3. additional systematic deviations (bias), and
4. random deviations (variance).
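A learning curve can be estimated empirically by training on random subsets of increasing size and testing on the remaining cases. A pure-Python sketch (the nearest-class-mean toy classifier and 1-D data are invented; a real study would use the actual chemometric model and spectra):

```python
import random
from statistics import mean

def nearest_mean_train(x, y):
    return {c: mean(v for v, l in zip(x, y) if l == c) for c in set(y)}

def nearest_mean_predict(model, v):
    return min(model, key=lambda c: abs(model[c] - v))

def learning_curve(x, y, sizes, repeats=20, seed=0):
    """Mean hit rate as a function of training sample size n: repeatedly
    draw n training cases at random and test on the remaining cases."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    curve = {}
    for n in sizes:
        accs = []
        for _ in range(repeats):
            rng.shuffle(idx)
            tr, te = idx[:n], idx[n:]
            model = nearest_mean_train([x[i] for i in tr], [y[i] for i in tr])
            accs.append(mean(nearest_mean_predict(model, x[i]) == y[i]
                             for i in te))
        curve[n] = mean(accs)
    return curve

# toy 1-D two-class data, well separated around 0 and 1
x = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1] * 3
y = ["a", "a", "a", "b", "b", "b"] * 3
curve = learning_curve(x, y, sizes=[2, 6, 12])
# the hit rate typically rises with n and then flattens
```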
Sample size requirements for classifier testing
Thus, the precise measurement of the classifier performance turns out to be more complicated in such small sample size situations. Sample size planning for classification therefore needs to take into account also the sample size requirements for the testing of the classifier. We will discuss here two important scenarios that allow estimating required test sample sizes: firstly, specifying an acceptable width for a confidence interval of the performance measure and secondly the number of test
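The first scenario translates directly into a sample size formula. A sketch using the normal approximation for the confidence interval width (the exact interval method used in a given study may differ slightly):

```python
from math import ceil

def n_test_for_ci_width(p_expected, max_width, z=1.96):
    """Smallest n_test for which the normal-approximation 95 % confidence
    interval around an observed proportion near p_expected is no wider than
    max_width; the width is approximately 2 * z * sqrt(p * (1 - p) / n)."""
    p = p_expected
    return ceil((2 * z / max_width) ** 2 * p * (1 - p))

# pinning a sensitivity expected near 0.9 to a total interval width of 0.1
n = n_test_for_ci_width(0.9, 0.10)
```

Even for a good classifier (p around 0.9), an interval of total width 0.1 already requires on the order of 140 test cases per class — far more than the typical training sample sizes discussed above.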
Summary
Using a Raman-spectroscopic five-class classification problem as well as simulated data based on the real data set, we compared the sample sizes needed to train good classifiers with the sample sizes needed to demonstrate that the obtained classifiers work well. Due to the smaller test sample size, sensitivities are more difficult to determine precisely than specificities or overall hit rates.
Using typical small sample sizes of up to 25 samples per class, we calculated learning curves (sensitivity
Acknowledgements
Graphics were generated using ggplot2 [31].
Financial support by the European Union via the Europäischer Fonds für Regionale Entwicklung (EFRE) and the Thüringer Ministerium für Bildung, Wissenschaft und Kultur (project B714-07037), as well as funding by the BMBF (FKZ 01EO1002), is gratefully acknowledged.
References (31)
- et al., Dimensionality and sample size considerations in pattern recognition practice.
- et al., Variance reduction in estimating classification error using sparse datasets, Chemom. Intell. Lab. Syst. (2005).
- et al., Estimating dataset size requirements for classifying DNA microarray data, J. Comput. Biol. (2003).
- et al., Predicting sample size required for classification performance, BMC Med. Inform. Decis. Mak. (2012).
- et al., Sample size planning for developing classifiers using high-dimensional DNA microarray data, Biostatistics (2007).
- et al., How large a training set is needed to develop a classifier for microarray data?, Clin. Cancer Res. (2008).
- et al., Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Anal. Mach. Intell. (1991).
- et al., Predicting the required number of training samples, IEEE Trans. Pattern Anal. Mach. Intell. (1983).
- et al., Towards detection and identification of circulating tumour cells using Raman spectroscopy, Analyst (2010).
- et al., Identification and differentiation of single cells from peripheral blood by Raman spectroscopic imaging, J. Biophotonics (2010).
- The Elements of Statistical Learning: Data Mining, Inference and Prediction.
- Performance of error estimators for classification, Curr. Bioinform.
- A study of cross-validation and bootstrap for accuracy estimation and model selection.
- Principles of proper validation: use and abuse of re-sampling for validation, J. Chemometr.
- R: A Language and Environment for Statistical Computing.
☆ Paper presented at the XIII Conference on Chemometrics in Analytical Chemistry (CAC 2012), Budapest, Hungary, 25–29 June 2012.