Kirstine Smith, Statistical Pioneer
Figure 1. Kirstine Smith in 1896 or 1897. Public domain.
One of her papers created a new area of statistics. Another precipitated such a large disagreement between Ronald Fisher and Karl Pearson that they ended up feuding for the rest of their lives. Why is Kirstine Smith, whom Pearson described in 1916 as “one of the most brilliant of the younger Danish statisticians,” not better known as one of the pioneers of statistics?
Smith’s First Statistics Paper
Kirstine Smith (Figure 1), born in Denmark on April 12, 1878, worked on problems of physical oceanography at the International Bureau for Marine Exploration for eleven years after receiving her master of arts degree from the University of Copenhagen. In 1916, she entered the doctoral program at University College London, where Karl Pearson had founded the world’s first university statistics department in 1911.
Within a few months she published her first statistics paper. Smith (1916) looked at the problem of fitting a statistical distribution to frequency data and argued that the “best” estimates of the distribution’s parameters (for example, the mean and variance from a normal distribution) result from minimizing the chi-squared goodness of fit statistic.*
The paper appeared in Biometrika, the statistics journal that Karl Pearson had co-founded in 1901 and edited until shortly before his death in 1936. At the end of the paper Smith thanked Professor Pearson “for his aid throughout the work.”
Ronald Fisher, then a 26-year-old secondary school teacher** who had published a couple of papers in statistics, sent Pearson a one-page note criticizing Smith’s paper. Fisher wanted the note to be published in Biometrika, but perhaps he should have been more politic in attacking research that had clearly been done under Pearson’s aegis. The note repeatedly referred to Smith’s ‘improvements’ (each time putting the word in quotes), and concluded that there is “something exceedingly arbitrary in a criterion which depends entirely upon the manner in which the data happen to be grouped” (Pearson, 1968).
Fisher’s criticisms were unfair, because many datasets consist of frequency counts (the individual “exact” measurements are not available) and Fisher provided no argument that his method worked better. Pearson pointed out in his professional response that Smith was not using arbitrary groupings but those in the data available to her. He concluded by inviting Fisher to submit for publication a rigorous defense of the method Fisher deemed superior but added “if I were to publish your note, it would have to be followed by another note that it missed the point.”
Stigler (2005) writes that no reply from Fisher to this letter survives, but this rejection from Pearson, as well as other perceived slights, clearly festered. When Fisher did develop and publish his theory for maximum likelihood, he began Section 1 with a swipe at Pearson: “This anomalous state of statistical science is strikingly exemplified by a recent paper … in which one of the most eminent of modern statisticians [Pearson] presents what purports to be a general proof of Bayes’ postulate, a proof which, in the opinion of a second statistician of equal eminence, ‘seems to rest upon a very peculiar—not to say hardly supposable—relation’” (Fisher, 1922, pp. 310-311). Fisher returned to Kirstine Smith’s paper in Section 12, where — without admitting any error on his part or any contribution on hers — he acknowledged that the method of minimum chi-squared distance was essentially equivalent to maximum likelihood for frequency data, thereby annulling the criticism in the note he had sent to Pearson.
The story does not end there, however. Fisher feuded with Karl Pearson until Karl’s death, and then continued feuding with Karl’s son Egon. Salsburg (2001) tells the stories of the often juvenile sniping among these statistical giants, who are responsible for many of the fundamental methods used in statistics. One wonders how much more they might have accomplished had they joined forces.
And what of Kirstine Smith? Berkson (1980) revisited the issue of minimum chi-squared versus maximum likelihood and concluded that minimum chi-squared is the “basic principle of estimation.”*** Berkson cited Fisher’s papers, reminiscences by Egon Pearson, and 16 of his own previous publications. Who is NOT cited by Berkson or any of the eminent discussants of his paper? Kirstine Smith, whose 1916 article launched the debate. By 1980, Smith’s contribution to the discussion had been largely forgotten.
Optimal Design of Experiments
Suppose you want to study the relationship between the amount of a new fertilizer you apply and crop yield. Your experimental budget allows you to apply fertilizer to n fields and measure the crop yield for each field. Let x_i denote the amount of fertilizer to be applied to field i, and let y_i be the crop yield from field i at the end of the growing season. What amount of fertilizer should you apply to each field?
Smith’s doctoral dissertation research, published in Biometrika in 1918, studied the problem of where to place experimental observations for x if one wants to fit a polynomial function

$$y = b_0 + b_1 x + b_2 x^2 + \cdots + b_k x^k + \text{error}.$$

After the crop yields are measured for the values of x used in the experiment, the results can be used to estimate the parameters b_0, b_1, …, b_k and then to predict the crop yield that is expected when one applies x amount of fertilizer.
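As a concrete illustration of this estimate-then-predict step, here is a minimal Python sketch using NumPy’s built-in polynomial fitting. The fertilizer amounts and yields below are entirely made up for illustration:

```python
import numpy as np

# Hypothetical data: fertilizer amounts x (scaled to [-1, 1]) and crop yields y
x = np.array([-1.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.0])
y = np.array([2.0, 2.2, 2.9, 3.1, 3.4, 3.6, 3.5])

k = 2                         # degree of the polynomial to fit
b = np.polyfit(x, y, k)       # least-squares estimates, highest power first
yhat = np.polyval(b, 0.25)    # predicted yield at x = 0.25
print(b, yhat)
```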
She expressed her goal as follows:
In all sorts of experiments which are not simple repetitions but have at least one varying essential circumstance or indefinite variate the experimentalist is confronted with a choice in regard to the values of that variate. If the experiments be quite simple the question may be without great importance; but when their requirements as to time or expenditure come into account the problem arises, how the observations should be chosen in order that a limited number of them may give the maximum amount of knowledge (Smith, 1918, p. 1).
Figure 2 shows sample polynomial functions with k = 1, 2, and 3. Where should one place the values of x for each of the n fields to obtain the “maximum amount of knowledge” about each of these functions?
Figure 2. Examples of polynomial functions for k=1 (straight line), k=2 (quadratic), and k=3 (cubic).
Some possible designs, for n = 12, are shown in Figure 3. For illustration, we assume that the experimental region consists of the interval between –1 and 1, and we can perform an experiment and measure y at any x value in that range. Figure 3(a) shows a design in which all of the points are at the extremes of the possible values of x — half of the experiments will be conducted at the smallest possible value of x and the other half will be conducted at the largest possible value of x. Figure 3(b) shows a design with one third of the experiments conducted at each of the points x = –1, x = 0, and x = 1. The design points in Figure 3(e) are uniformly spaced across the region, and those in Figure 3(f) are randomly generated from a uniform distribution.
Figure 3. Candidate designs for estimating a polynomial function.
Before reading on, look at the designs in Figure 3. Which design would you choose, and why?
Now let’s look at what Smith did. First, she had to define what it meant for a design to “give the maximum amount of knowledge.” Smith (1918) decided that the knowledge is maximized when you get the most precision for predicting y from the function (in other words, minimize the standard deviation of the predicted value of y). Since that standard deviation varies with x, she chose designs for each polynomial function that minimize the largest possible standard deviation of the predicted value of y across the range of x. This criterion is known today as G-optimality (G stands for global).
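In the matrix notation used today (not the notation available to Smith in 1918), the criterion can be stated compactly. Writing f(x) = (1, x, x², …, x^k)ᵀ and letting X be the design matrix whose rows are f(x_i)ᵀ, the least-squares prediction at x satisfies

$$\operatorname{Var}\bigl(\hat{y}(x)\bigr) \;=\; \sigma^{2}\, f(x)^{\mathsf T}\left(X^{\mathsf T}X\right)^{-1} f(x),$$

and a G-optimal design minimizes the maximum of this variance over the experimental region, here −1 ≤ x ≤ 1.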
She then set up the system of equations needed to find those G-optimal designs for a general polynomial of degree k, and, after 14 pages of linear algebra, derived the optimal designs for k = 1 to k = 6. The optimal design is different for each k, and Smith discovered that the optimal design for a polynomial of degree k always has (k+1) distinct values of x.
When fitting a straight line, Smith found that the design in Figure 3(a), with half of the observations at each endpoint of the experimental region, was optimal. This makes sense, since if you are trying to draw a straight line on a wall by connecting two points, you want to have those points as far apart as possible. Figure 3(b) shows the optimal design for a quadratic function, and Figure 3(c) shows the optimal design if you know the polynomial is a cubic function. Her work in deriving these designs is an impressive piece of mathematics.
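To make the comparison concrete, here is a small numerical sketch in Python (assuming NumPy; the design labels follow Figure 3, and σ² is set to 1 so the variances are on a relative scale). It computes the maximum scaled prediction variance over [−1, 1] for several of the candidate designs; the printed values are approximate:

```python
import numpy as np

def max_pred_var(design, degree, grid=np.linspace(-1, 1, 401)):
    """Maximum over [-1, 1] of the scaled prediction variance
    f(x)' (X'X)^{-1} f(x) for a fitted polynomial of the given degree."""
    X = np.vander(design, degree + 1)      # rows are (x^k, ..., x, 1)
    M_inv = np.linalg.inv(X.T @ X)
    F = np.vander(grid, degree + 1)
    return np.einsum('ij,jk,ik->i', F, M_inv, F).max()  # diag of F M^{-1} F'

# Candidate designs with n = 12 runs, labeled as in Figure 3
design_a = np.repeat([-1.0, 1.0], 6)       # half at each endpoint
design_b = np.repeat([-1.0, 0.0, 1.0], 4)  # thirds at -1, 0, 1
design_e = np.linspace(-1.0, 1.0, 12)      # uniformly spaced

# Straight line (k = 1): the endpoint design has the smallest maximum
print(max_pred_var(design_a, 1))           # about 0.167
print(max_pred_var(design_b, 1))           # about 0.208
print(max_pred_var(design_e, 1))           # about 0.295

# Quadratic (k = 2): the three-point design wins; design (a) cannot
# even be evaluated here, since X'X is singular with only two x values
print(max_pred_var(design_b, 2))           # about 0.250
print(max_pred_var(design_e, 2))           # about 0.546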
But Smith’s contribution went beyond this. She recognized that the designs she derived are optimal only if the function actually has the form of the specified polynomial. The design in Figure 3(a) is optimal for estimating a straight line, but with no design points in the middle you would not be able to tell whether the actual function is a straight line, a quadratic, or a higher-degree polynomial; indeed, you cannot even fit a quadratic function to the data if all of the x values are at –1 and 1. Similarly, the design in Figure 3(b) is optimal for fitting a quadratic function, but it does not allow you to fit a cubic polynomial. If you truly know nothing about the shape of the function, the optimal design is actually a uniform distribution across the region (Figure 3(e) or (f); see Müller, 1984).
Smith recognized this problem, and wrote:
It appears that the distribution of observations which fulfils this demand consists of specially placed groups in number just sufficient to determine the constants of the function. We shall accordingly pay attention also to the desirability usually present of ascertaining the form of function by means of the observations. As might be expected we find that the standard deviations obtained from a uniform continuous distribution of observations increase towards the ends of the range. By choosing a uniform continuous distribution with additional clusters at the ends of the range we shall try to find a compromise between the two desiderata of a low maximum of standard deviation and of a uniform distribution (Smith, 1918, p. 2).
Smith derived the optimal designs to guide experimenters, but she also recommended that they include design points that could help ascertain the form of the function. She thus also launched the research area of robust experimental design — developing designs that are efficient if the assumptions are met but can also be used to check those assumptions (see Herzberg, 1982 for an illuminating description of robust design).
I would like to be able to report that the statistical community immediately recognized Smith’s 1918 paper for the groundbreaking work it was, but unfortunately that does not seem to be the case. Her paper went largely unnoticed for the next 40 years, until Hoel (1958) noted that little research had been done on the problem of choosing x values at which to take observations — except, he wrote, for the work of K. Smith many years ago.
Smith never saw the fruits of her design research. After earning her doctorate, she returned to Denmark and worked in fisheries research at the Carlsberg Laboratory in Copenhagen. In 1925, she left Carlsberg and became a secondary school teacher (Guttorp and Lindgren, 2009). She died in 1939, about 20 years before her design work began to be cited in the burgeoning research area of optimal experimental design.
Recognizing Smith’s Contributions
When I learned about optimal design in graduate school, I was taught about the fundamental contributions of Wald (1943), Chernoff (1953), and Kiefer and Wolfowitz (1959), but no one mentioned Smith’s work. I had seen Smith (1918) cited in Kiefer and Wolfowitz (1959) and Silvey (1980), but had not looked it up since the results I needed to conduct my research were all found in later papers.
I delved into Smith’s 1918 paper in December 2024, while I was preparing a talk for the International Association of Survey Statisticians on “Multiple Frame Methods and Designs for Combining Data Sources.” My talk emphasized the need for optimal and robust design research for data collection systems that rely on multiple sources, and I mentioned that Kirstine Smith had inaugurated this discussion on design more than a century ago.
Many papers and websites now credit Smith with founding the area of optimal design. St. John and Draper (1975), reviewing the state of optimal design research, cited Smith as the first person to define a criterion and obtain optimal designs. That acknowledgement has been echoed in many subsequent reviews (see, for example, Steinberg and Hunter, 1984; Rady et al., 2009; Rainforth et al., 2024). The Wikipedia article on optimal experimental design mentions her as an “early contributor” to the field and the Wikipedia article on Smith says: “She is credited with the creation of the field of optimal design of experiments.”
Her work on statistical inference has also been recognized. Stigler (2005), discussing Fisher’s development of maximum likelihood theory, devoted a large section of the paper to Kirstine Smith. Stigler (2006) suggested that the paper by Smith (1916) and Karl Pearson’s response to Fisher’s note were the catalysts for the rigorous development of maximum likelihood: “it was only when Pearson challenged him in the highest tradition of scientific editing in 1916 that Fisher paused to consider the case to be made for his methods, and saw that it was deficient…. Without Pearson’s question, it seems extremely unlikely Fisher would have seen how his surprising discovery of sufficiency was the key to a whole theory of estimation, and a broad framework for addressing statistical questions more broadly.”
In Chance magazine, Penny Reynolds and Chaitra Nagaraja are currently running a series on three women who revolutionized statistical practice. The first installment, published online on March 3, 2025, features Kirstine Smith (Reynolds, 2025).
And just as I was about to publish this post, the April issue of Amstat News arrived in my mailbox. Page 12 includes Kirstine Smith in its list of “famous statisticians born in April.”
Copyright (2025) Sharon L. Lohr
Footnotes and References
*Here’s an example. Suppose you have measurements of height in cm for 200 men, but each height is rounded to the nearest multiple of 5. The frequency table is given below:
| Rounded Height (cm) | Frequency |
|---|---|
| 160 | 5 |
| 165 | 15 |
| 170 | 41 |
| 175 | 56 |
| 180 | 53 |
| 185 | 24 |
| 190 | 4 |
| 195 | 1 |
| 200 | 1 |
We would like to find the estimates of the mean μ and standard deviation σ of the normal distribution that best describe the data from the table. If we had exact (or approximately exact, since we can never measure anything to infinite precision) values for the measurements, we would simply estimate μ and σ using the mean and standard deviation from the sample. But these data are in categories, and the sample mean (175.9) and standard deviation (6.7764) of the rounded values may be a little off.
Here is how the method of minimum chi-squared works. A rounded value of 170 cm means that the height of each person in that category is between 167.5 cm and 172.5 cm. For any posited values of μ and σ, we can calculate the probability that a normally distributed random variable is in that interval, and thus calculate the number of men we would expect to have heights in that interval as the probability multiplied by the sample size of 200. The chi-squared statistic is the sum of (observed frequency − expected frequency)²/(expected frequency) for the partition defined by the rounding.
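In symbols (modern notation rather than Smith’s original), with l_j and u_j denoting the lower and upper boundaries of category j and sample size n = 200,

$$\chi^2(\mu,\sigma) \;=\; \sum_{j} \frac{\left(O_j - E_j\right)^2}{E_j},
\qquad
E_j \;=\; n\left[\Phi\!\left(\frac{u_j-\mu}{\sigma}\right)-\Phi\!\left(\frac{l_j-\mu}{\sigma}\right)\right],$$

where O_j is the observed frequency in category j and Φ is the standard normal cumulative distribution function.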
For example, using our sample values of 175.9 and 6.7764, and allowing the first and last categories to extend to infinity, we get a chi-squared statistic of 7.32. The chi-squared statistic measures the distance between the observed frequencies and the frequencies expected from a normal distribution with the specified mean and standard deviation, so the smaller the chi-squared statistic, the better the hypothesized distribution fits the data. If we use different parameters, say μ = 173 and σ = 8, we get a chi-squared statistic of 37.4, so these parameters do not provide as good a fit as 175.9 and 6.7764.
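Here is a minimal Python sketch of that calculation (assuming NumPy and SciPy are available); it should reproduce, up to rounding, the two chi-squared values just quoted:

```python
import numpy as np
from scipy.stats import norm

obs = np.array([5, 15, 41, 56, 53, 24, 4, 1, 1])   # frequencies from the table
edges = np.arange(162.5, 198.0, 5.0)               # boundaries between rounded heights

def chi_squared(mu, sigma, n=200):
    cdf = norm.cdf(edges, mu, sigma)
    # cell probabilities, with open-ended first and last categories
    probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
    expected = n * probs
    return np.sum((obs - expected) ** 2 / expected)

print(chi_squared(175.9, 6.7764))   # about 7.3
print(chi_squared(173.0, 8.0))      # about 37.4
```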
But how to find the values of μ and σ that minimize the chi-squared statistic? One could calculate the chi-squared statistic for lots of different values of {μ, σ} and choose the parameters that give the smallest value. That is easily done on today's computers, but would have taken Kirstine Smith a long, long time. Instead, she used calculus to minimize the chi-squared statistic.
Here's an exercise for students. Find the derivatives of the chi-squared statistic with respect to μ and σ and give the system of equations whose solution will be the values of μ and σ that minimize the statistic. (When I solved these numerically, I obtained mean 176.13627 and standard deviation 7.00911, which give a chi-squared statistic of 6.24.)
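For checking the exercise numerically, a general-purpose optimizer will do. This sketch (again assuming SciPy, and defining the same chi-squared function as above) should land near the values just reported:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

obs = np.array([5, 15, 41, 56, 53, 24, 4, 1, 1])
edges = np.arange(162.5, 198.0, 5.0)    # boundaries between rounded heights

def chi_squared(params, n=200):
    mu, sigma = params
    cdf = norm.cdf(edges, mu, sigma)
    probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
    return np.sum((obs - n * probs) ** 2 / (n * probs))

result = minimize(chi_squared, x0=[175.9, 6.8], method='Nelder-Mead')
print(result.x, result.fun)   # about (176.136, 7.009), chi-squared near 6.24
```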
**Stigler (2005, p. 33) wrote: “By all accounts he [Fisher] was a poor teacher; he did not like his duties and the students did not understand him.”
***Part of the reason for Berkson’s conclusion is that maximum likelihood estimators do not always exist. And when they exist, they are not always consistent (Neyman and Scott, 1948, presented a simple example of an inconsistent maximum likelihood estimator that shocked the statistical community at the time). But these are unusual cases. In most situations, maximum likelihood estimators are consistent and asymptotically achieve the minimum possible variance — in other words, when the regularity conditions are met, maximum likelihood estimators are the best estimators you can find. Rao (1980, p. 484) wrote in the discussion that “in view of the general applicability of ML, its large sample properties and its superiority in small samples in a variety of situations, a better title to Berkson’s paper might be ‘ML, sometimes MC’ and not ‘MC, not ML.’”
Berkson, J. (1980). Minimum chi-square, not maximum likelihood!. The Annals of Statistics, 8(3), 457-469.
Chernoff, H. (1953). Locally optimum designs for estimating parameters. The Annals of Mathematical Statistics, 24, 586-602.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, 222(594-604), 309-368.
Guttorp, P., & Lindgren, G. (2009). Karl Pearson and the Scandinavian school of statistics. International Statistical Review, 77(1), 64-71.
Herzberg, A.M. (1982). The robust design of experiments: A review. Serdica, 8, 223-228.
Hoel, P.G. (1958). Efficiency problems in polynomial estimation. The Annals of Mathematical Statistics, 29, 1134-1145.
Kiefer, J., & Wolfowitz, J. (1959). Optimum designs in regression problems. The Annals of Mathematical Statistics, 30(2), 271-294.
Müller, H. G. (1984). Optimal designs for nonparametric kernel regression. Statistics & Probability Letters, 2(5), 285-290.
Neyman, J., & Scott, E.L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.
Pearson, E. S. (1968). Studies in the history of probability and statistics. XX: Some early correspondence between W. S. Gosset, R. A. Fisher and Karl Pearson, with notes and comments. Biometrika, 55(3), 445-457.
Rady, E. A., Abd El-Monsef, M. M. E., & Seyam, M. M. (2009). Relationships among several optimality criteria. Interstat, 15(6), 1-11.
Rainforth, T., Foster, A., Ivanova, D. R., & Bickford Smith, F. (2024). Modern Bayesian experimental design. Statistical Science, 39(1), 100-114.
Rao, C.R. (1980). Discussion of ‘Minimum chi-square, not maximum likelihood!.’ The Annals of Statistics, 8(3), 482-484.
Reynolds, P. S. (2025). Best of three—Three women who revolutionized statistical practice. Chance, 38(1), 59-62.
Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York: W. H. Freeman.
Silvey, S.D. (1980). Optimal Design. London: Chapman & Hall.
Smith, K. (1916). On the ‘best’ values of the constants in frequency distributions. Biometrika, 11(3), 262-276.
Smith, K. (1918). On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika 12(1/2), 1–85.
Steinberg, D.M. & Hunter, W.G. (1984). Experimental design: Review and comment. Technometrics, 26, 71-97.
Stigler, S.M. (2005). Fisher in 1921. Statistical Science, 20(1), 32-49.
Stigler, S.M. (2006). How Ronald Fisher became a mathematical statistician. Mathématiques et sciences humaines, 176, 23-30.
Wald, A. (1943). On the efficient design of statistical investigations. The Annals of Mathematical Statistics, 14, 134-140.