is a probabilistic model of the occurrence of linguistic units in text passages.
The Russian linguist Reveka M. Frumkina was the first to systematically investigate the distribution of words in text blocks of fixed length. Later, the occurrence of syntactic structures and syntactic functions was analysed as well.
Data are obtained by counting the number of occurrences of the unit under study in each of the passages of a text. The length of the passages should be chosen according to the overall probability of the unit, e.g. 100 words for the analysis of frequent words. The number of passages with x occurrences of the given unit is considered as a random variable. The probability of the unit is denoted by p; the probability of occurrence of any other unit is q = 1 - p. The probability p is itself a random variable, since the use of a word is not independent of its co-text. Under the assumption that p is distributed according to the Beta distribution, the beta-binomial distribution (also called the negative hypergeometric distribution) is obtained as the distribution of x.
This model has been applied to:
- Determination of the class of the unit (e.g., part of speech of a word)
- Identification of text passages with respect to terminological or semantic criteria
- Determination of keywords
- Measurement of stylistic parameters
- Diagnosis of mental disorders
- Construction of learning automata
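
The counting procedure and the Beta assumption described above can be illustrated with a minimal Python sketch. It assumes that, for a fixed p, the number of occurrences in a block of n units is binomial, so that a Beta-distributed p yields the beta-binomial distribution. The function names, the toy sentence, and the Beta parameters a and b are hypothetical and chosen only for illustration; they are not taken from the cited literature.

```python
from collections import Counter
from math import comb, exp, lgamma


def block_counts(tokens, unit, block_len):
    """Count occurrences of `unit` in consecutive blocks of `block_len` tokens."""
    n_blocks = len(tokens) // block_len  # an incomplete final block is dropped
    return [
        tokens[i * block_len:(i + 1) * block_len].count(unit)
        for i in range(n_blocks)
    ]


def frequency_table(counts):
    """Map x to the number of blocks containing exactly x occurrences."""
    return dict(sorted(Counter(counts).items()))


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(x, n, a, b):
    """P(X = x) when X | p ~ Binomial(n, p) and p ~ Beta(a, b) (assumed model)."""
    return comb(n, x) * exp(log_beta(x + a, n - x + b) - log_beta(a, b))


# Toy example: blocks of 5 tokens, unit under study is "the".
tokens = "the cat saw the dog and the dog saw a cat near the old tree".split()
counts = block_counts(tokens, "the", block_len=5)
table = frequency_table(counts)
print(counts, table)

# Theoretical probabilities for a block of 5 units with hypothetical a, b.
probs = [beta_binomial_pmf(x, 5, a=1.0, b=3.0) for x in range(6)]
print(probs)
```

In practice the blocks would be much longer (e.g. 100 words, as noted above), and a and b would be estimated from the observed frequency table rather than fixed in advance.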
Altmann, G. (1988). Wiederholungen in Texten. Bochum: Brockmeyer.
Best, K.-H. (2005). Sprachliche Einheiten in Textblöcken. Glottometrics 9, 1-12.
Köhler, R. (2001). The distribution of some syntactic construction types in text blocks. In Uhlířova, L., Wimmer, G., Altmann, G., Köhler, R. (Eds.), Text as a linguistic paradigm: levels, constituents, constructs. Festschrift in honour of Ludek Hřebíček: 136-148. Trier: WVT.
Piotrowski, R.G. (1984). Text, Computer, Mensch. Bochum: Brockmeyer.
Paškovskij, V.E., Srebrjanskaja, I.I. (1971). Statističeskie ocenki pis´mennoj reči bol´nych šizofreniej. In: Inženernaja lingvistika. Leningrad: Nauka.