After Charles Darwin published his theory of evolution, many scientists took up the mathematical study of racial differences in mankind. Anthropologists had compiled large datasets of measurements taken from a wide variety of peoples, and the question was whether there were objective metrics by which people could be classified into races. Such investigations were not purely intellectual; in many cases they were driven by notions of racial superiority and eugenics.

The measurements themselves were copious and exhaustive, incorporating - for the head alone, for example - such characteristics as cephalic index, head length, head breadth, nasal length, nasal breadth, and nasal index. Simple statistics such as the average and the standard deviation were clearly not sufficient to distinguish between one group and another, particularly when the measurement error alone resulted in overlapping values for two classes. In a brilliant series of papers published in a new journal called Biometrika, Karl Pearson and associates showed there were multidimensional measures whereby anthropological data could be analysed. More specifically, they could be used to 'assess similarity or dissimilarity between two populations' [1].

An example of a prevalent question at the time was whether Ancient Egyptians had any social affinity with Hindus. Some researchers had claimed that the physical resemblance between the two peoples implied a partial colonisation from one country to the other. Karl Pearson introduced the coefficient of racial likeness, or CRL, to analyse this issue, 'one of the first quantitative procedures to measure the admixture proportions, or the proportions which "hybrid" populations derive from their various ancestors' [2].

Given two populations of size n and n', with m_i the mean of the ith measurement in the first population, m'_i that in the second population, and s_i^2 the pooled variance of the ith measurement, the CRL C is given by:
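In the form in which Pearson's coefficient is usually quoted (a reconstruction; p, the number of characteristics compared, is a symbol I am assuming here, as it is not defined above):

```latex
C = \frac{1}{p} \sum_{i=1}^{p} \frac{n\,n'}{n + n'} \cdot \frac{(m_i - m'_i)^2}{s_i^2} \; - \; 1
```

Each term is the squared difference of the two means for one characteristic, scaled by its pooled variance and by a factor that depends only on the two sample sizes.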

Almost immediately, the CRL was used by M. L. Tildesley in her analysis of Burmese crania in 1921 [3]. Shortly thereafter, critiques of the methodology began to appear in the literature - not only by anthropologists but also by statisticians. A chief criticism was that the method assumed that the various metrics (e.g. head length, head weight, nose length) were independent of each other, a rather generous assumption. Another complaint was that the CRL depended on the sample sizes, and that it provided only a degree of certainty that the groups diverged, without quantifying how much divergence there was.
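The sample-size complaint is easy to see numerically. The following is only a sketch, assuming the commonly quoted form of the CRL (mean over the characteristics of n n'/(n + n') times the squared, variance-scaled difference in means, minus 1). The function name `crl` and all the numbers are made up for illustration and come from no real survey.

```python
def crl(means_a, means_b, pooled_vars, n_a, n_b):
    """Coefficient of racial likeness, in its commonly quoted form (an assumption here).

    means_a, means_b : per-characteristic means of the two samples
    pooled_vars      : pooled variance of each characteristic
    n_a, n_b         : the two sample sizes
    """
    p = len(means_a)
    weight = n_a * n_b / (n_a + n_b)  # depends only on the sample sizes
    total = sum(
        weight * (ma - mb) ** 2 / v
        for ma, mb, v in zip(means_a, means_b, pooled_vars)
    )
    return total / p - 1


# Hypothetical means (mm) and pooled variances for three characteristics.
means_a = [185.0, 140.0, 50.0]
means_b = [187.0, 141.0, 51.0]
pooled_vars = [36.0, 25.0, 9.0]

small = crl(means_a, means_b, pooled_vars, 50, 50)
large = crl(means_a, means_b, pooled_vars, 200, 200)
print(small, large)
```

The means and variances are identical in both calls, yet the coefficient grows with the number of skulls measured. It behaves more like a significance test than like a measure of how far apart the two groups are.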

In 1925, an Indian statistician named P. C. Mahalanobis began an investigation into the question of racial differentiation in his native state of Bengal. He looked at the results of an anthropometric survey of Anglo-Indians (people of mixed European and Indian ancestry) conducted in 1891 to answer (in his words) the following questions:
'How are these 200 Anglo-Indians in Calcutta related to the different caste-groups in Bengal? Are they more closely allied with the Hindus or the Mohammedans? Do they show a greater affinity with the higher castes of Bengal or with the lower castes? ...' [4]
In order not to be swayed by size differences in the various characteristics he used, Mahalanobis computed standardised values for them. As Kritzman and Li put it:
'The characteristics differed by scale and variability. That is, Mahalanobis might have considered a half-inch difference in nasal length between two groups of skulls a significant difference whereas he considered the same difference in head length to be insignificant. Mahalanobis normalized differences in each characteristic by the characteristic’s standard deviation and then squared and summed the normalized differences, thus generating one composite distance measure that was invariant to the variability of each dimension.' [5]
This first Mahalanobis metric suffered from the same defect as Pearson's, because it did not take into account the correlation between the various characteristics. In 1936, he introduced the measure of statistical differentiation that came to bear his name, the Mahalanobis distance.
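As a rough illustration of that first, standardised measure, here is a minimal Python sketch. The function name and the figures (in inches) are invented for the example and are not taken from Mahalanobis's data.

```python
def standardised_distance_sq(mean_a, mean_b, std):
    """Sum of squared differences in group means, each difference scaled by
    that characteristic's standard deviation. Correlations between the
    characteristics are ignored, which is the defect noted above."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(mean_a, mean_b, std))


# Hypothetical group means in inches: (nasal length, head length).
group_a = [2.0, 7.0]
group_b = [2.5, 7.5]
# Hypothetical standard deviations: nasal length varies little, head length a lot.
std = [0.25, 1.0]

d2 = standardised_distance_sq(group_a, group_b, std)
print(d2)  # nasal length contributes 4.0, head length only 0.25
```

The same half-inch gap counts sixteen times more for the characteristic that varies less, which is the point of standardising.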

If the human skull can be described by n characteristics that are measurable, an individual's skull can be represented as an n-dimensional vector. Now take a set of measurements belonging to one anthropological unit (say, Calcutta Brahmins), and compute its centre, or mean vector m, and its covariance matrix S. Then, if we want to classify a hitherto unclassified skull measurement y, what we do is compute its Mahalanobis distance D from the Calcutta Brahmin set's centre:
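In its standard textbook form, with m, S and y as defined above and y and m treated as column vectors, the squared distance is:

```latex
D^2 = (y - m)^{\top} S^{-1} (y - m), \qquad D = \sqrt{(y - m)^{\top} S^{-1} (y - m)}
```

When S is diagonal, that is, when the characteristics are uncorrelated, this reduces to the sum of squared standardised differences of the earlier measure.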

We then compute the same distance from the centres of the other classes, say, Calcutta Muslims or lower castes, each with its own mean vector and covariance matrix, and assign the unknown skull to the class for which D is smallest.

To simplify, assume that a skull can be characterised by two metrics, skull length and skull breadth. Then each skull can be represented by a point in two-dimensional space. If we plot our data of Calcutta Brahmin skulls and Calcutta Muslim skulls, we might find (if these are indeed two distinct anthropological classes) that our graph has two distinct clusters in it (Figure 1 from Kritzman and Li (reference below)):

In Kritzman and Li's words, then:
'Suppose we compare a skull of unknown origin, represented by the square in Figure 1, with the two groups and categorize it. In terms of Euclidean distance, it lies closer to the center of Group 2 than to the center of Group 1. The Mahalanobis distance, however, would consider this skull more similar to Group 1 because its characteristics are less unusual in light of the more inclusive scatter plot of Group 1’s characteristics.'
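The effect described in that passage can be reproduced with a toy two-measurement example. This is only a sketch under stated assumptions: the group centres, covariance matrices and the unknown skull are numbers I have made up (they are not Kritzman and Li's data), and the helper `mahalanobis_2d` is a hypothetical name that handles only the two-dimensional case by inverting the 2x2 covariance matrix in closed form.

```python
import math


def mahalanobis_2d(y, centre, cov):
    """Mahalanobis distance of point y from centre for a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx, dy = y[0] - centre[0], y[1] - centre[1]
    # (y - m)^T S^{-1} (y - m), with S^{-1} = [[s22, -s12], [-s21, s11]] / det
    d2 = (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
    return math.sqrt(d2)


# Hypothetical groups: (centre, covariance) for skull length and breadth in mm.
groups = {
    "Group 1": ((180.0, 140.0), ((100.0, 60.0), (60.0, 64.0))),  # wide, correlated scatter
    "Group 2": ((200.0, 150.0), ((9.0, 0.0), (0.0, 9.0))),       # tight scatter
}
unknown = (194.0, 148.0)  # the skull to be classified

euclid = {name: math.dist(unknown, c) for name, (c, _) in groups.items()}
mahal = {name: mahalanobis_2d(unknown, c, s) for name, (c, s) in groups.items()}

nearest_euclid = min(euclid, key=euclid.get)
nearest_mahal = min(mahal, key=mahal.get)
print("Euclidean:  ", euclid, "->", nearest_euclid)
print("Mahalanobis:", mahal, "->", nearest_mahal)
```

With these numbers the unknown skull is nearer to the centre of Group 2 in plain Euclidean terms, but its Mahalanobis distance from Group 1 is smaller, because a point that far out along Group 1's long, correlated axis is not unusual for that group.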
So what did Mahalanobis conclude from his investigation? First of all, he said, the Anglo-Indians in his sample derived (on the Indian side) from Biharis, Lepchas (of Sikkim), possibly from the Punjab, and none at all from the Northwest Frontier or the Chotanagpur tribals. He also noted that they seemed to derive from unions of higher-caste Indians and Europeans, adding that 'cultural status evidently played a large part in determining Indo-European Union.'

From a broader investigation into the anthropological classes of Bengal, Mahalanobis was able to conclude:
'Summing up we find that intermixture within Bengal, i.e. intra-provincial intermixture has varied with the degree of cultural proximity, so that for Brahmins the amount of intermixture with other castes has been in proportion to the social standing of the caste concerned. Influence from outside Bengal, i.e., inter-provincial intermixture has followed two well-defined and clearly distinguished streams, one from the castes of Northern India (chiefly from Bihar and the Punjab) and the other from the aboriginal tribes of Chotanagpur. The influence of the Northern Indian castes decreases and that of the aboriginal tribes of Chotanagpur increases as we go down to the social scale... . None of the castes analysed here show much resemblance with any of the aboriginal tribes of the east... . Mohammedans (also) show a highly mixed character. They appear to be originally largely derived from Bihar but have intermixed extensively in Bengal; they do not show any resemblance with the Punjab Pathans.' [6]
As it happens, not all the anthropological conclusions of that 1925 paper are held valid today. Mahalanobis was correct in his assertion that Bengal Brahmins resemble other Bengal castes more than Brahmins elsewhere in India. However, later datasets have invalidated his claim that only the Brahmins among the people of Bengal have admixtures from the Punjab. 'Moreover, as far as the Anglo-Indian community is concerned, it is now believed that Mahalanobis had probably confined his study to a sample from the upper stratum of the community, and hence his conclusion of resemblance to upper caste Hindus is applicable to the upper class Anglo-Indians only' [7].

These days, Mahalanobis is venerated not for his anthropological research or its conclusions; rather, it is the methodology he developed that is considered his greatest contribution to the sum of human knowledge. Even today, the Mahalanobis distance is part of the armoury of every scientist who needs to classify multidimensional data. As you can probably infer from the reference list, this includes not just statisticians, but also anthropologists, social scientists, and financial engineers.

Quite a lot of good, in short, has come out of what once was eugenics research.


1. S. Dasgupta, "Evolution of the D²-Statistic of Mahalanobis", Sankhyā: The Indian Journal of Statistics, 1993, Special Volume 55, Series A, Pt 3, p 442.
2. M. Tapper, In the Blood: Sickle Cell Anemia and the Politics of Race, University of Pennsylvania Press, 1999.
3. M. L. Tildesley, "A First Study of the Burmese Skull", Biometrika, 13, 1921, 247-251.
4. P. C. Mahalanobis, "Analysis of Race-mixture in Bengal", Journal of the Asiatic Society of Bengal, 23, 301-333.
5. M. Kritzman and Yuanzhen Li, "Skulls, Financial Turbulence, and Risk Management", Financial Analysts Journal, 66(5), 2010.
6. S. Dasgupta, as above, p 447.
7. J. K. Ghosh, "Mahalanobis and the Art and Science of Statistics: The Early Days", Indian Journal of History of Science, 29(1), 1994.

