History of quantitative linguistics

The Beginning


The first scientific counts of units of language or text were published as early as the 19th century. In Germany, Förstemann (1846, 1852) and Drobisch (1866), in Russia, Bunjakovskij (1847), in France, Bourdon (1892), in Italy, Mariotti (1880), and in the USA, probably Sherman (1888), performed frequency studies as a means of linguistic description. The first theoretical insight, after many years of merely descriptive counts of various kinds, is due to the Russian mathematician A. A. Markov, who laid the foundations of the theory of Markov chains in 1913. This mathematical model, which describes the sequential (syntagmatic) dependence among units in linear concatenation by means of transition probabilities, was of little use for linguistics despite its mathematical significance. Some areas of linguistics, such as syntax, were considered unsuitable in principle for this type of model because they require recursive models to represent self-embedding structures. Hence, applications of Markov chains were restricted to some work on textual units and on phonology. (In modern computational linguistics and natural language processing, however, Markov chains are a central component of language technology methods, notably “Hidden Markov Models”; cf. e.g. Brants 1999.)
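The notion of transition probabilities between successive units can be illustrated with a minimal sketch. It estimates first-order transition probabilities from a sequence of symbols by counting adjacent pairs. The function name `transition_probabilities` and the toy input string are illustrative assumptions, not taken from Markov's work or from any cited study.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate first-order transition probabilities P(next | current).

    Hypothetical helper for illustration: counts each adjacent pair in
    `sequence` and normalises the counts per preceding unit.
    """
    pair_counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        pair_counts[current][following] += 1
    return {
        current: {
            following: count / sum(followers.values())
            for following, count in followers.items()
        }
        for current, followers in pair_counts.items()
    }


# Toy sequence (an assumption for illustration, not real linguistic data).
probs = transition_probabilities("abababac")
print(probs["a"])  # probabilities of the units that follow "a"
```

Such a model conditions each unit only on its immediate predecessor, which is why it cannot represent the self-embedding structures mentioned above.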

Development
Later, quantitative studies of linguistic material were primarily a consequence of practical demands, for example efforts to improve second-language training and to optimise stenographic systems. Early quantitative observations and corresponding mathematical models in the field of vocabulary originate from works by J. B. Estoup (1916), G. U. Yule (1924) and E. U. Condon (1928).

The relations thus uncovered between the frequency of words and the rank of the frequency class (alternatively: between frequency and the number of words in the given frequency class) were systematically investigated by George Kingsley Zipf, who is regarded as the founder of quantitative linguistics (QL). He was the first to set up a theoretical model to explain the observations and to find a mathematical formula for the corresponding function, the famous “Zipf’s Law”. Zipf and others observed the same kind of dependence between rank and frequency (or size) in data from a number of scientific and everyday phenomena. Among his publications, the books “The Psycho-Biology of Language. An Introduction to Dynamic Philology” (1935) and “Human Behavior and the Principle of Least Effort” (1949) are considered the most important. Zipf formulated (in different terms) innovative thoughts on self-organisation, the principle of language economy and fundamental properties of linguistic laws long before modern systems theory arose, and without any preparatory work by others. His ideas, such as the “principle of least effort” and the “forces of unification and diversification”, remain important today (even if they suffer from certain defects and mistakes) and are among the few things contemporary linguists know about QL. Later, Zipf’s model was conceptually and mathematically improved and extended by Benoît Mandelbrot (1953, 1959, 1961a, 1961b), the celebrated originator of fractal geometry. Zipf’s body of thought has stimulated various scientific disciplines and is again attracting increasing attention.
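For orientation, the rank-frequency relation can be written in the form in which it is commonly cited. The notation is an assumption made here, not that of the works cited above: f(r) denotes the frequency of the word of rank r, and C, a and b are constants estimated from the data.

```latex
% Zipf's law, rank-frequency form; the exponent a is often close to 1
f(r) = \frac{C}{r^{a}}

% Zipf-Mandelbrot form with an additional shift parameter b \ge 0
f(r) = \frac{C}{(r + b)^{a}}
```

Setting b = 0 in the second formula gives back the first.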

Shannon and Weaver (1949) applied information theory to linguistics and set off a storm of calculations on diverse language phenomena. Many linguists responded well to this novel approach, in particular Gustav Herdan (e.g., 1954, 1956, 1960, 1962, 1964, 1966, 1969), Rajmund G. Piotrowski (also spelled Piotrovskij) (1959, 1968, 1979) and Walter Meyer-Eppler (1959). The corresponding experiments and calculations, however, contributed little to deeper insight, as the technical concept of information does not take linguistic semantics into account. Essentially, not much more could be achieved than the measurement of entropy and redundancy, and thus the storm waned.
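The two quantities named above can be made concrete with a minimal sketch. It computes the Shannon entropy of the observed symbol distribution in bits and the redundancy relative to the maximum entropy of an alphabet of the same size. The function names and the toy input are illustrative assumptions and do not reproduce any of the cited studies.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy (in bits) of the relative frequencies in `symbols`."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(symbols):
    """Redundancy 1 - H / H_max, with H_max = log2(number of distinct symbols).

    Convention assumed here: with fewer than two distinct symbols H_max is 0,
    and the redundancy is reported as 0.0.
    """
    distinct = len(set(symbols))
    if distinct < 2:
        return 0.0
    return 1 - entropy_bits(symbols) / math.log2(distinct)


# Toy input (an assumption for illustration, not real linguistic data).
text = "aaab"
print(entropy_bits(text), redundancy(text))
```

Both measures depend only on the frequencies of the symbols and not on what they mean, which illustrates why this approach yielded little insight into semantics.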

The physicist Wilhelm Fucks (1955) was responsible for a turn towards theoretical considerations in German QL. Among other things, he set up a mathematical model of word length distributions and carried out various investigations into language, literature, and music. In France, Charles Muller (1973, 1979) created a novel approach to studying the vocabulary of a text, which is still popular today. In Russia, Zipfian linguistics was pursued in particular by Michail V. Arapov (1988; Arapov / Cherc 1983) in Moscow, who based his models of the dynamics of texts and of language development on the analysis of rank order. In Georgia, a group around Jurij K. Orlov (cf. 1982a, 1982b) established a tradition of studies of the statistical structure of texts based on the Zipf-Mandelbrot Law. The Estonian researcher Juhan Tuldava (1995, 1998) is known for his mathematical methods for analysing numerous text phenomena.

In the course of the history of QL, scientists from other disciplines, mainly mathematicians, have repeatedly worked on applying mathematical models to linguistic problems and on developing methods for them. Those results that rested on linguistically substantiated foundations have endured, such as Mandelbrot’s work (1959), which prevailed over competing models that lacked such a linguistic justification, whereas, e.g., the works of Yule or Williams (1946, 1956, 1964) are hardly mentioned any more.

Today, quantitative linguistics is a well-developed scientific discipline with a broad range of applications.