Jerzy Woronczak

Jerzy Woronczak (1923-2003) is one of the most important Polish scholars in the field of quantitative linguistics, and, in fact, one of the founders of this discipline in Poland.



Contribution to quantitative linguistics
There are three areas in which one must examine Jerzy Woronczak’s work in the field of quantitative linguistics (cf. Pawłowski, Sambor 2004, Kamińska-Szmaj 2004): the first pertains to the scientific merits of his own studies, the second to the knowledge he imparted to his students (master’s and doctoral candidates), and the third to the value of the popularisation of his achievements against the background of the rather traditionalist currents predominant in Polish linguistics.

Here it should be sufficient to state that during his active scientific career, J. Woronczak (as a participant, patron, reviewer, or adviser) took part in practically all the scientific initiatives in Poland connected with mathematical applications in linguistic research. With regard to the third area (popularisation), we must admit that the studies on the application of mathematics to problems of history and literary theory, which he undertook in the 1960s, testified not only to his vast knowledge, which defied precise labelling, but also to his civil courage. One should not forget that for the traditional representatives of the humanities of the time, combining the poetics and aesthetics of literature with mathematics was a sign of unaccountable intellectual bravado, with a dash of humbug. Such studies were treated as peculiarly eccentric, based on the appropriation of a methodology foreign to the discipline, leading in the best case to reductionism, and therefore to a simplification of complex linguistic material. Referring to the issue of numerical methods in prosody and versification (one of Woronczak’s favourite topics), M.R. Mayenowa remarked that “Statistical methods of versification analysis often arouse hostility even where there are no such basic reservations as to the principle; what is more, they arouse objections also there where, in their simplest form, they are always applied. [...] It seems that the basis of the protest is sheer psychological, and one should not wonder. The professional who has devoted many years to mastering the traditional language of his discipline can only with difficulty and great humility accept a situation in which someone discusses his discipline in another tongue.” (Mayenowa 1965: 170)

One of the reasons for the success of statistical linguistics in Poland was the fact that Woronczak was able to speak “in another tongue” about the traditional issues of philology and linguistics and set an example for the younger generation of scholars. Below we shall discuss the most important quantitative works of Jerzy Woronczak according to thematic groups. It is worth adding that in the 1970s he gradually returned to his original interests, namely early mediaeval history, antiquity, biblical studies and, most of all, Hebrew studies and the history of Polish Jews. From this time on, quantitative themes appeared mostly in the subjects of the theses and dissertations of his students, predominantly involving the Bible and/or ancient texts.

WORONCZAK’S STUDIES ON STYLOMETRY

It is worth recalling that the tradition of stylometric research in Poland goes back to the end of the 19th century. One of the fathers of stylometry was the Polish Hellenist W. Lutosławski, who coined the term “stylometry” and defined its general rules (Lutosławski 1897 [1983]). In the 1950s, W. Kuraszkiewicz, an expert in Slavonic studies, suggested using numerical measures of lexical richness (Kuraszkiewicz, Łukaszewicz 1951). His coefficient, like that of Guiraud, to which it is similar, has no practical significance today, but it played an important role in promoting mathematical methods among Polish linguists. Woronczak turned to the problems of stylometry at the beginning of the 1960s. In contrast to his predecessors, though, he applied significantly more refined and effective mathematical tools. It should be emphasized that his work had both a theoretical aspect (the analytical derivation of estimators) and a practical one (applications in solving real problems in linguistics). His goal was to find unbiased estimators of indices of lexical richness which were sensitive to lexical variety but independent of the length of the text fragment under investigation (Woronczak 1965b). Starting from the so-called Good’s measures (Good 1953), which express the probability of randomly selecting m elements belonging to one and the same class in m independent samplings from a general population,

$$c_m = \sum_{i} p_i^{m}, \quad p_i = \text{probability of the } i\text{-th class} \qquad (1)$$

Woronczak derived equations for the estimators $$c_m$$ for m = 2 and m = 3:

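$$c_2 = \frac{\sum_{i} f_i (f_i - 1)}{N (N - 1)} \qquad (2)$$

$$c_3 = \frac{\sum_{i} f_i (f_i - 1)(f_i - 2)}{N (N - 1)(N - 2)} \qquad (3)$$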

where $$f_i$$ is the frequency of the i-th word-form, and N the length of the sample.¹

Equation (4) is a generalisation of equations (2) and (3). Its author, though, did not recommend calculating its value for m > 3 (Woronczak 1976).

$$c_m = \frac{\sum_{i} f_i (f_i - 1) \cdots (f_i - m + 1)}{N (N - 1) \cdots (N - m + 1)} \qquad (4)$$
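The estimators of Good's measures can be illustrated with a short sketch. It assumes the standard unbiased falling-factorial form of the estimator (word-form frequencies in the numerator, sample length in the denominator); the function names `falling_factorial` and `repeat_rate_estimate` and the toy sample are illustrative and do not come from Woronczak's papers.

```python
from collections import Counter
from fractions import Fraction


def falling_factorial(x: int, m: int) -> int:
    """Return x * (x - 1) * ... * (x - m + 1); zero whenever 0 <= x < m."""
    result = 1
    for k in range(m):
        result *= x - k
    return result


def repeat_rate_estimate(tokens, m=2):
    """Estimate Good's measure of order m from a list of word-forms.

    Assumed form: sum of falling factorials of the word-form frequencies,
    divided by the falling factorial of the sample length N.
    """
    n = len(tokens)
    if n < m:
        raise ValueError("sample must contain at least m tokens")
    frequencies = Counter(tokens).values()
    numerator = sum(falling_factorial(f, m) for f in frequencies)
    return Fraction(numerator, falling_factorial(n, m))


# Toy sample of 11 word-forms (hypothetical data).
sample = "the cat saw the dog and the dog saw the cat".split()
c2 = repeat_rate_estimate(sample, m=2)
c3 = repeat_rate_estimate(sample, m=3)
print(float(c2), float(c3))
```

Exact fractions are used so that small samples can be checked by hand; for long texts the result can simply be converted to a float.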

¹ Equation (2) was also derived by G. Herdan. Both scholars demonstrated the similarity of $$c_2$$ and Yule’s K-characteristic.

Applying the parameter B and a further parameter from Mandelbrot’s equation, Woronczak then derived equations for the expected size of a given text’s vocabulary and the expected size of the class of words of a given frequency (ibid.). The estimators (2) and (3) were initially verified by their author (Woronczak 1965b), but he admitted in another article that the set-up of the test was not entirely satisfactory (Woronczak 1976: 167). That is also why they were submitted to further verification on an extensive corpus (Pawłowski 1994). The dynamics of change in the values of several indices of lexical richness were compared in a corpus of French literary texts (prose by Romain Gary), the length of which was gradually increased from 20,000 to 600,000 words. The measure for evaluating an index was the dispersion² of its value with increasing length of text. One already observes a significant improvement in the stability of the log TTR index, though the Dugast and Yule indices, as well as those of Woronczak, proved to be the most stable (Tab. 1).

² Standard deviation divided by the mean.
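The dispersion criterion used to compare the indices (standard deviation divided by the mean of an index's values as the text grows) can be sketched as follows. This is only an illustration: the use of the population standard deviation, the fixed `step` between text lengths, and the names `index_dispersion` and `type_token_ratio` are assumptions, not details taken from the study cited.

```python
from statistics import mean, pstdev


def type_token_ratio(tokens):
    """Plain type-token ratio, used here only as an example index."""
    return len(set(tokens)) / len(tokens)


def index_dispersion(tokens, index, step):
    """Coefficient of variation of `index` over prefixes of growing length.

    The index is evaluated on the first step, 2*step, 3*step, ... tokens;
    the result is the standard deviation of those values over their mean.
    """
    values = [index(tokens[:n]) for n in range(step, len(tokens) + 1, step)]
    return pstdev(values) / mean(values)


# Hypothetical toy text; a real comparison would use a long corpus.
toy_text = ["a", "b", "a", "b", "a", "c", "a", "b"]
ttr_dispersion = index_dispersion(toy_text, type_token_ratio, step=2)
```

An index that does not depend on text length at all would give a dispersion of zero, so lower values indicate a more stable index.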

Table 1

Indices of lexical richness and their dispersions



Woronczak (1976) also showed that there is a connection between the values of the estimators $$c_2$$ and $$c_3$$ and the lexical cohesion of a text. Analyzing the dynamics of the mean variations of these estimators with ever-increasing sample length N of a continuous text, he noticed that the estimators first increased in value with increasing N, but then stabilized, despite the geometric progression of N. The value of N at which the relative stabilization of the indices $$c_2$$ and $$c_3$$ takes place (or at which they reach their maximum values) marks precisely the limit of the lexical cohesion of the text, indicating at the same time the average length of the fragments which are closed to a certain degree with respect to vocabulary and theme. The test which Woronczak conducted on texts by St. Fulgentius and St. Augustine confirmed this hypothesis. The Augustinian text, which was addressed to an uneducated social class and therefore written in a simple manner, produced an N limit of ca. 45 words, while that for the more difficult and literary Fulgentius text was ca. 128 words.
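One possible way to trace such a stabilisation is sketched below. It is a loose illustration rather than Woronczak's procedure: it assumes that the text is cut into consecutive, non-overlapping segments, that N doubles at each step, and that the second-order estimator has the usual unbiased form; `c2` and `mean_c2_by_length` are hypothetical names.

```python
from collections import Counter


def c2(tokens):
    """Second-order repeat-rate estimate (assumed unbiased form)."""
    n = len(tokens)
    return sum(f * (f - 1) for f in Counter(tokens).values()) / (n * (n - 1))


def mean_c2_by_length(tokens, max_power):
    """Map N = 2, 4, ..., 2**max_power to the mean c2 of consecutive N-token segments."""
    profile = {}
    for k in range(1, max_power + 1):
        n = 2 ** k
        segments = [tokens[i:i + n] for i in range(0, len(tokens) - n + 1, n)]
        if not segments:  # the text is shorter than N
            break
        profile[n] = sum(c2(s) for s in segments) / len(segments)
    return profile


# Hypothetical toy text of 16 tokens.
profile = mean_c2_by_length(["a", "b"] * 8, max_power=5)
```

Reading the resulting profile from small to large N, the segment length at which the mean stops growing would correspond to the cohesion limit described above.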

CORPUS RESEARCH

The beginnings of research using corpora in Poland must be associated with the preparation of the Frequency Dictionary of Modern Polish (Słownik Frekwencyjny Polszczyzny Współczesnej, hereinafter SFPW) in the 1960s and 1970s, modelled on the Juilland dictionaries (Kurcz et al. 1990). Woronczak was, next to J. Sambor, one of the chief initiators and authors of this undertaking, which lasted several years (Lewicki, Sambor 1969). The SFPW was compiled on the basis of a sample of 500,000 words encompassing five functional styles (genres): scientific texts, small press items, commentary on current affairs, literary prose, and drama. The fundamental indicators describing the frequency distribution of a lexeme across the stylistic categories were the Juilland measures F, D, and U. The empirical data contained in the SFPW became the basis for several analyses of Polish (see: Kamińska-Szmaj 1988, 1989, 1990; Sambor 1971; Hammerl 1989; Pawłowski 1999a, 1999b). It is worth mentioning that the SFPW corpus has since been converted to digital form and is available on the Internet (http://www.mimuw.edu.pl/polszczyzna/, cf. Ogrodniczuk 2003).
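For readers unfamiliar with the Juilland measures, a minimal sketch follows. It assumes the commonly cited definitions (F as total frequency, D as one minus the coefficient of variation divided by the square root of the number of subcorpora minus one, U as the product of D and F) and equal-sized subcorpora; whether the SFPW applied exactly this variant is not stated here, and `juilland_measures` is an illustrative name.

```python
from math import sqrt
from statistics import mean, pstdev


def juilland_measures(subfrequencies):
    """Return (F, D, U) for a lexeme's frequencies in equal-sized subcorpora."""
    n = len(subfrequencies)
    if n < 2:
        raise ValueError("need at least two subcorpora")
    total = sum(subfrequencies)                  # F: overall frequency
    if total == 0:
        return 0, 0.0, 0.0
    variation = pstdev(subfrequencies) / mean(subfrequencies)
    dispersion = 1 - variation / sqrt(n - 1)     # D: 1 means a perfectly even spread
    usage = dispersion * total                   # U: frequency weighted by dispersion
    return total, dispersion, usage


# Hypothetical counts of one lexeme in five stylistic subcorpora.
even = juilland_measures([10, 10, 10, 10, 10])
concentrated = juilland_measures([50, 0, 0, 0, 0])
```

A lexeme spread evenly over all five styles keeps its full frequency as U, while one confined to a single style receives D = 0 and therefore U = 0.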

MULTIDIMENSIONAL ANALYSIS

Although Woronczak did not extensively apply this type of methodology, he was fully aware of the possibilities which multidimensional analysis had to offer in the taxonomy of textual objects. He knew the works of J. Czekanowski³, whom he met in Wrocław on several occasions during seminars on applied mathematics organized by H. Steinhaus. In 1962 he published a study in which multidimensional scaling in a rudimentary form was applied to establish the origin and filiations of Bogurodzica (the oldest Polish literary text). Using 56 text features, he classified all the surviving versions of Bogurodzica (dating from the 15th to the 17th century). This helped him conclusively settle the perennial dispute over the originality and chronology of Bogurodzica’s stanzas (Woronczak 1962 [1993]). Woronczak also mentioned his discussions with A. Kolmogorow⁴ on the topic of spatial representations of "linguistic objects" and encouraged the author of these lines to conduct a taxonomy of Polish poetic texts.

³ In the 40s Jan Czekanowski introduced multivariate methods in anthropology and linguistics (for further information see: Adam Pawłowski, Jan Czekanowski (1882–1965) – a pioneer of multidimensional taxonomy. To be published in one of the forthcoming issues of Glottometrics).

⁴ Most likely during a conference on the versification of Slavic languages organised in Warsaw by the Institute of Literary Research of the Polish Academy of Sciences in August of 1964 (see Mayenowa 1965).

THE STATISTICAL LAWS OF LANGUAGE IN WORONCZAK’S RESEARCH

The American linguist G. K. Zipf is recognized as the initiator of studies on the statistical laws of language. Other scholars, such as J.N. Baudouin de Courtenay (Baudouin de Courtenay 1927 [1990]: 549), also anticipated their appearance. The dependencies Zipf discovered between the frequencies of expressions, their lengths, number of meanings, and rank are generally known as "Zipf’s laws". They stimulated the search for other linguistic laws within the framework of a broad paradigm of systems theory or cognitive science (Hammerl, Sambor 1993).

Woronczak studied Zipf’s fundamental law, which describes the relationship between the rank of a word in a list and its frequency (Woronczak 1967). Starting with the equations of Estoup, Joos, and Mandelbrot, he developed an analytical description of the quantitative structure of the vocabulary of a complete text, treating it as a sample from the general population of the language, and derived equations for the expected size of the vocabulary of a text with a length of N word-forms and for the expected number of words with an assigned frequency (ibid.: 2259). He also considered generalising the equations he obtained for an infinite text of length $$N\longrightarrow\infty $$ and rank $$N\longrightarrow\infty $$. It must be added that Woronczak’s above-mentioned generalisations have never been the subject of empirical verification and are of a deductive-theoretical nature.
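The idea of treating a text as a sample from a population can be illustrated numerically. The sketch below is not Woronczak's derivation: it assumes a finite population of word types whose probabilities follow a Zipf-Mandelbrot form, with N word-forms drawn independently, and uses the textbook expectations for such sampling. The parameter names `b` and `rho`, the truncation at `ranks` types, and all function names are assumptions made for illustration.

```python
from math import comb


def zipf_mandelbrot(ranks, b, rho):
    """Normalised probabilities proportional to (r + rho) ** -b for r = 1..ranks."""
    weights = [(r + rho) ** -b for r in range(1, ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_vocabulary(probabilities, n):
    """Expected number of distinct words in n independent draws."""
    return sum(1 - (1 - p) ** n for p in probabilities)


def expected_words_with_frequency(probabilities, n, k):
    """Expected number of words occurring exactly k times in n draws."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for p in probabilities)


# Hypothetical parameter values, chosen only to make the example run.
population = zipf_mandelbrot(ranks=1000, b=1.1, rho=2.7)
vocabulary_500 = expected_vocabulary(population, 500)
hapaxes_500 = expected_words_with_frequency(population, 500, 1)
```

Summing the expected class sizes over all frequencies k = 1, ..., N recovers the expected vocabulary size, which gives a quick consistency check on any such model.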

STUDIES ON VERSIFICATION AND POETICS

As an expert on the literature, versification, and musical notation of the Middle Ages, Woronczak devoted many of his studies to research into texts in Old Polish (1958 [1993], 1960, 1965a, 1993) and in Old Czech (1963 [1993]). He approached this topic in his typical manner, i.e. both from a philological and a quantitative perspective. The statistical models he elaborated and the tests he employed were never goals in themselves, nor, consequently, were the linguistic materials he used merely a pretext for the abstract solutions often encountered in formalistic approaches. It is certainly this balance between philological-linguistic content and mathematical formalism which resulted in this aspect of Woronczak’s work becoming an especially valuable element of his scientific legacy. We will discuss here just some of his most representative works devoted to versification.

In 1960 his analysis of the distributions of the verse lengths of asyllabic Slavonic poetry of the 15th–16th centuries appeared (Woronczak 1960). For the sake of comparison he described the numerical distribution of the length of sentences in Polish prose; this proved to be a gamma distribution with a strong right-sided asymmetry. He then found that the variance in length of asyllabic verses was smaller than that of sentences in prose and decreased with time, which was an indication of the gradual formation of the Polish syllabic system. This gradual formation of Polish syllabic verse was the leitmotif of Woronczak’s studies of the writings of Biernat of Lublin (1958 [1993]).

While in controversy with the theses of Czech mediaevalists over the structure of Old Czech versification in the Dalimil Chronicles (orig. Staročeská Kronika tak Řečeneho Dalimila), Woronczak put forward the hypothesis that if one proceeded from the opening chapters of the chronicles towards their end, one would be able to observe the text developing into prose, in that the structures of its versification and rhythm would gradually become less rigid (Woronczak 1963 [1993]). He maintained that the opening fragments of the chronicles, which speak of the pre-Christian era, temporally remote and unknown to the annalist, would be versified in a more orderly manner. He attributed this phenomenon to two causes. First, the opening chapters may have contained quotations from a surviving oral literary tradition introduced into the text. One must remember that the majority of medieval texts were originally transmitted orally, being easy to remember thanks to their regular, formulaic structure, which served not only an aesthetic but also, and perhaps foremost, a mnemonic function. Secondly, one could imagine that the author, writing of events remote in time and not familiar to the contemporary audience, might, as the need arose, alter the content to fit the linguistic form rather than the form to fit the content, making it in this way more splendid. The opposite situation would prevail in the last fragments, presenting contemporary events which had not yet been consolidated into an oral tradition and which demanded adherence to facts, the rules of correct versification being of secondary importance.

Woronczak verified this hypothesis using the runs test, a technique for assessing the degree of randomness of a numerical series. The data he used were the lengths of successive verses. The tests confirmed that the hypothesis agreed with the structure of the Dalimil Chronicles.
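A runs test on a series of verse lengths might look like the sketch below. The exact variant Woronczak used is not specified here; this version assumes the Wald-Wolfowitz test with the series dichotomised about its median and the usual normal approximation. The name `runs_test` and the syllable counts are hypothetical.

```python
from math import sqrt
from statistics import median


def runs_test(values):
    """Wald-Wolfowitz runs test about the median; returns (runs, z)."""
    centre = median(values)
    signs = [v > centre for v in values if v != centre]  # drop ties with the median
    n_above = sum(signs)
    n_below = len(signs) - n_above
    if n_above == 0 or n_below == 0 or len(signs) < 3:
        raise ValueError("need at least three values, on both sides of the median")
    # A new run starts wherever two neighbouring values fall on different sides.
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    total = n_above + n_below
    expected = 2 * n_above * n_below / total + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - total)
                / (total ** 2 * (total - 1)))
    return runs, (runs - expected) / sqrt(variance)


# Hypothetical verse lengths in syllables.
alternating = runs_test([8, 12, 8, 12, 8, 12, 8, 12])
clustered = runs_test([8, 8, 8, 8, 12, 12, 12, 12])
```

A strongly negative z indicates fewer runs than chance would produce (long stretches of similar lengths), a strongly positive z indicates regular alternation, and values near zero are compatible with a random sequence of verse lengths.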

CONCLUSIONS

If one were to consider the number of his publications as the only criterion in evaluating Jerzy Woronczak’s achievements in the field of quantitative linguistics, the result would be modest. His determination to promote his achievements in international journals was also, by today’s standards, too slight and not proportional to their scientific value. But do these strictly utilitarian measures embrace the totality of scientific output? Time has shown that the main distinguishing features of Woronczak’s work are its depth, quality and originality. For in the overwhelming majority of cases, the Professor was able to find the optimal point of balance at which philological and linguistic issues do not disappear in a thicket of mathematical formalism, but preserve their cognitive value and freshness even for demanding specialists in the given discipline. And that is perhaps the last lesson which he taught his students.

Source
Adam Pawłowski: Glottometrics 8, 2004, 79-89