Thursday, March 05, 2026

kangaroo

We all know the story of how the kangaroo got its name: one of the English sailors (or even Cook himself) asked a native what that strange creature was called and the native replied "gangarru" (or some such) which was then adopted as the name for the creature and which later turned out to mean "I don't understand". We all know that this story is, of course, a myth.

So earlier today I was reviewing the data for the publication of a paper Stefano and I presented at the 2024 conference of the Association Internationale de Dialectologie Arabe. The paper is a pilot study of sorts, attempting a phylogenetic analysis of Arabic varieties. The idea - adopted from biology - is to treat the varieties as taxa and selected linguistic features as nucleotide sequences or some such. Then you just annotate the data, plug it into one of the standard applications and presto! you got yourself a nice visualization of the relationship between the varieties, maybe even some Bayesian modelling of the same. It has been done before, e.g. for creole languages, so it makes perfect sense to try to do it for Arabic. Doubly so if there has recently been a survey of linguistic features of Arabic varieties and the authors have provided you with the raw data. A bit of Python jiggery-pokery and William's your mother's male sibling.

Famous last words.

I do not mean to disparage Manfred Woidich and Peter Behnstedt (may he rest in peace) in any way; their four-volume Wortatlas der arabischen Dialekte is a monumental achievement in Arabic dialectology. Plus let's face it, data management is not easy under the best of conditions, let alone with linguistic data of this type. It is true though that even with the raw data underlying the Wortatlas at one's fingertips, wrangling the data into shape is a complicated endeavor. For one, there were issues of a technical nature, where all the data were stored in one of them God-forsaken ante-diluvian formats. The export worked, but since the Arabic text was stored in Odin-only-knows what encoding, the thousands of entries painstakingly recorded by Behnstedt are now lost to the ___???? hell, to say nothing of their attestations consisting of dead links to long-defunct internet fora painstakingly collected and annotated by Behnstedt. It is a giant loss and I hope the data is recoverable.

Then there's the nature of the data which, to be fair, reflects the practices of modern Arabic dialectology: the field is simultaneously concerned with both the description of the current state of the variety in question AND its history and relationship to other varieties; as a result, it often fails to do both properly. Take the concept 'tomorrow' (item 460 in Wortatlas IV) where we learn that the corresponding words in modern varieties fall into three groups based on their root/etymology: root BKR (A), word ġudwah 'early part of the day' (B) and aṣ-ṣubḥ '(early) morning' (C). We then get lists of which derivations occur where, but without any indication as to what is the prevalent form. And so for example for Antiochia, the data set has both bikra (A) and ġadik (B). Which is it, which do I pick as THE form? And this is a simple example, it gets a whole lot more difficult with other lexical items.

And speaking of Antiochia, what do we even take as taxa? All of you have surely noticed that while in the description of the entire endeavor I referred to Arabic varieties, in the previous paragraph, I discussed the practices of Arabic dialectology. As I once (before nuking my Twitter account) tried to explain to an idiot, in Arabic linguistics, we use the term 'dialect' for historical, political and sociolinguistic reasons for what in Romance, Slavic or Germanic linguistics we would call separate languages belonging to the Romance (Slavic, Germanic...) branch of Indo-European. So yes, it would be more appropriate to speak of Egyptian, Tunisian or Yemeni, except... Well, all of them co-exist in a special relationship with Modern Standard Arabic and, more importantly, there is no one/dominant/prevalent/okfineiwillsayit standard variety of, say, Moroccan or Iraqi in the same way that there is a dominant/prevalent/cmonpeopledontmakemesayitagainFINE standard German, Polish or French. And so in, say, Egypt, Cairo Arabic may be what people think of when they think of Egyptian, but there are other subvarieties of the Egyptian variety which are just as important and who is to say which is THE Egyptian, surely not the assholes in Cairo! So yes, Lebanese, why not, but then not just one.

Which brings me to the actual question: how do you divide the varieties? The prototypical way to do it in any dialectology is to go by location, except those range from entire regions through towns and villages, all the way to neighborhoods. But this is Arabic dialectology, so we have tribal varieties, confessional varieties and the specter of bedouin varieties hanging over the whole thing. And on top of that, there is the fact that despite literally centuries of Arabic dialectology, there are only a handful of Arabic varieties that are actually well described. Sure, you can throw a metaphorical stone and metaphorically hit a book on this or that Arabic variety that bills itself as a grammar. Upon closer inspection, however, you will find that they contain a detailed description of phonology and morphology, but little to nothing on syntax. And then there's of course the aforementioned issue with the raw data underlying Wortatlas where the "place" (roughly our "taxon") column sometimes has the present-day country name, like "Sudan" (or "Usbekistna"), sometimes a region like "Ḥaḍramawt" or "Sinai" or even "Trucial Coast", sometimes a country with a placename, e.g. "Sudan/Šukriyya".

So in the end, we made a decision to focus on

  1. A selection of function words from Wortatlas IV (= features), 35 in total. These contain such items as 'who', 'what', 'when', 'never', 'yes' and the existential predicate, each of them bearing a three-digit code that starts with 4.
  2. 63 varieties where we could get the data for at least 15 of the features (= taxa).

The taxa were arranged by - mostly - country: AF = Afghanistan, AL = Algeria, BH = Bahrain, CH = Chad, EG = Egypt, IL = Israel, IN = Iran, IR = Iraq, JO = Jordan, KN = Kinubi (that's the 'mostly' part), KU = Kuwait, LB = Lebanon, LY = Libya, MK = Morocco, MT = Malta, NG = Nigeria, OM = Oman, PL = Palestine, SA = Saudi Arabia, SU = Sudan, SY = Syria, TR = Turkey, TU = Tunisia, UZ = Uzbekistan and YE = Yemen. Where data was available for a particular village/region/tribe, the taxon was named XX-<village/region/tribe>; where we only had data for the country, we labelled it XX-general. Then for each taxon, we gathered all the features, decided which one to pick as the representative whenever necessary (always going with the more common option) and assigned a code to each unique feature. Sounds simple, but oh boy, was it not.
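For the curious, the bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under assumptions of my choosing, not our actual pipeline: the function names, the shape of the `raw` dictionary and the idea that "more common" can be read off from repeated attestations in a list are all hypothetical. Only the XX-general naming scheme and the threshold of 15 features come from the text.

```python
from collections import Counter

MIN_FEATURES = 15  # threshold from the text: at least 15 attested features per taxon


def taxon_label(country_code, place=None):
    """Build a label such as 'SU-Šukriyya', or 'SU-general' if only the country is known."""
    return f"{country_code}-{place if place else 'general'}"


def pick_representative(forms):
    """Pick the most frequently attested form (assumption: frequency = repeats in the list).

    Ties go to the form encountered first.
    """
    return Counter(forms).most_common(1)[0][0]


def encode(raw):
    """Turn {taxon: {feature_id: [forms]}} into per-taxon integer codes.

    Returns (coded, codebook), where codebook maps each feature to its
    form -> code table. Taxa with fewer than MIN_FEATURES features are dropped.
    """
    codebook = {}
    coded = {}
    for taxon, features in raw.items():
        if len(features) < MIN_FEATURES:
            continue
        coded[taxon] = {}
        for feature_id, forms in features.items():
            form = pick_representative(forms)
            table = codebook.setdefault(feature_id, {})
            # Each new form for a feature receives the next free integer code.
            coded[taxon][feature_id] = table.setdefault(form, len(table) + 1)
    return coded, codebook
```

The codes are assigned per feature, so code 1 for item 460 has nothing to do with code 1 for item 473. The hard part, deciding what counts as "the same form", is of course exactly what the sketch leaves out.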

And so earlier today I was reviewing that data, now imported into EDICTOR, when I came across this entry (click to embiggen):

 

'Tis weird, I thought, for item 473 is the existential predicate which is in most Arabic varieties derived from the preposition 'in'. The taxon here is Kinubi, a creolized variety which is admittedly very different from all the others, but surely not that different. So I checked the data and sure enough, there it was: 

es gibt
dé kalám táki má
Ki-Nubi Ki-Nubi
QUEST

The "QUEST" note means this piece of data came from a questionnaire. The lack of spaces in the EDICTOR entry is simply the result of normalization - in some entries, multiple options were given, so that is fine. Or, well, not actually, because normally the data would not have this many options/spaces and anyway this looks like a full sentence and besides, that last word looks like negation... So I went in and checked a bunch of sources and it turns out it is a sentence:

dé kalám tá-ki má

this word/thing POSS-2SG NEG

'this is not your word/thing'

I have no idea how this ended up in the data considering the existential predicate in Kinubi is, unsurprisingly, . Perhaps this particular sequence of words has an idiomatic meaning in Kinubi, something like 'we don't say that'. Perhaps someone wasn't paying attention. In any case, this, ladies and gentlemen, is how we almost ended up with a version of kangaroo in our phylogenetic analysis. You know what, maybe I should keep it in, as a trap street of sorts, to keep reviewers on their toes.

In case you're wondering what we came up with, below is a preview consisting of only 28 taxa (click to embiggen). The main take-away is that by Jove, it works, or at least that the basic NeighborNet algorithm matches what we would expect. For example, there are clear groupings of North African (6-7 o'clock), Levantine (9-10 o'clock) and Iraqi and Gulf (3 o'clock) varieties. Kinubi and Juba, another creolized variety, go together (and are not that far from Baggara), as do Egyptian and Sudanese varieties. Maltese, of course, stands on its own.
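NeighborNet is a distance-based method, so somewhere along the way the coded features have to become a matrix of pairwise distances between taxa. Here is a minimal sketch of one way to get there, assuming a simple normalized mismatch count over the features attested for both taxa. That measure and the function names are my choice for illustration only; this is not necessarily what the software we used computes.

```python
def distance(a, b):
    """Share of mismatching codes among the features attested for both taxa."""
    shared = a.keys() & b.keys()
    if not shared:
        return 1.0  # assumption: taxa with no overlapping features are maximally distant
    return sum(a[f] != b[f] for f in shared) / len(shared)


def distance_matrix(coded):
    """Return (taxa, matrix) for a {taxon: {feature_id: code}} mapping."""
    taxa = sorted(coded)
    matrix = [[distance(coded[x], coded[y]) for y in taxa] for x in taxa]
    return taxa, matrix
```

Ignoring features that are missing for one of the two taxa keeps sparsely documented varieties in the game, at the price of distances that rest on different numbers of features.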

For more, watch this space. But now, back to data wrangling. 

 

Thursday, February 26, 2026

melhor

The best I can tell, on or around May 12th 2011, the Brazilian government agency Ação Educativa released for general use a new textbook in their series Viver, Aprender ("Live, Learn"). Numbered vol. 6 and authored by Heloíse Ramos, the textbook was published under the title Por uma vida melhor ("For a better life"), paid for by the Brazilian Ministry of Education (MEC) and distributed to both children and adult students all over the Lusofonia. It consists of 6 units which cover various topics related to (in order) Portuguese language, English, art and literature, history, geography, natural sciences and math. The first chapter of the first unit, sensibly titled Escrever é diferente de falar ("Writing is different than speaking"), addresses a number of language-related issues, from sociolinguistics (the difference between the spoken and the written norm), through orthography and phonology (stress), all the way to syntax. When the contents of the chapter - particularly this last part, a short section on agreement - became generally known, merda hit the ventilador.

"A book used by the Ministry of Education teaches students to speak incorrectly," [moved link] bellowed one headline. Another decried "the pedagogy of ignorance." "Brazil decides to criminalize those who speak correctly and want to teach others to do so as well" [moved link] [1], warned an op-ed by the former president of Brazil José Sarney"it's a crime, A CRIME, to preserve incorrect Portuguese" [2], insisted senator (and former minister of education) Cristóvam Buarque. And Janice Ascari, Regional Attorney General, accused (albeit only on her blog) all responsible for distributing the book of "comitting a crime against our youth." [3] And with such leaders, you can very well imagine what the rank-and-file members of the quickly assembled posse of self-appointed protectors of the Portuguese language had to say about Por uma vida melhor and its author.

By now you're wondering what were the heinous crimes and unspeakable evils the book aimed to corrupt the Portuguese-speaking youth with. For the original version of the first chapter, try here (pdf), the offending passages on agreement can be found on p. 14-16. For those of you who are not yet fluent in Portuguese (seriously, what are you doing with your lives), I have prepared an English translation of the section on agreement (below and here). As always, forgive the poor quality and disregard some of the terminological choices. Predictably, I had some difficulty with the terms norma/variedade culta and norma/variedade popular and in the end, I chose to translate them as "standard Portuguese" and "vernacular" respectively.


 

In summary, we have learned that:
1. There are (at least) two varieties of Brazilian Portuguese - standard Brazilian Portuguese and vernacular Brazilian Portuguese, and
2. one of the differences between them is how they handle agreement.
3. It is ok to use either variety, as long as it's appropriate for the occasion.
4. There are people who will judge you based on how you speak.

That the first is true should be evident to anyone with even a passing familiarity with Brazilian Portuguese. As for the second, Azevedo's Portuguese: A Linguistic Introduction (CUP 2005, p. 226-227) sums up the situation as follows (emphasis mine):

7.3.2.1 Non-agreement in the noun phrase
Standard nominal agreement (4.1.1) requires pluralization of adjectives and determiners accompanying a plural noun. In the vernacular, however, pluralization is more erratic; in the extreme case, the plural marker is moved to the left-most determiner and the noun and other accompanying formants remain in the singular ...
Although lack of agreement is strongly condemned by prescriptive grammars, examples from educated speakers ... show that application of the pluralization rule tends to vary according to the level of formality...

7.3.2.2 Non-agreement in the verb phrase
Standard verbal agreement (4.1.1) requires a conjugated verb to match its subject in person and number. Non-agreement in V(ernacular)B(razilian)P(ortuguese) is related to the reduction of verb paradigms to three, two or even a single form ...
Although cooccurrence of verbal non-agreement and nominal non-agreement is strongly condemned by prescriptive grammars, it occurs in the colloquial speech of educated informants ...

Even those critics who insisted - with almost superhuman inability to perceive irony - that "There's only one Portuguese language" [wayback link] [4] are very well aware that there are differences, even profound ones, in the way people use Portuguese in Brazil. In fact, few of the enraged voices denied that those who say "os livro" speak a different Portuguese. It's just that they don't call it "Portuguese without agreement in number" or "non-standard Portuguese", they call it "incorrect Portuguese" or (like the irony-proof superhero cited above) refer to the "butchering/murdering" of Portuguese [5]. We all know this song; the chorus singing it - and the speed with which they picked it up - does nothing but prove the validity of the third and fourth lessons drawn from the passage.

To be fair, some of the criticism of Por uma vida melhor raised a different issue, a more legitimate one, that of language and social stratification. The introduction to the first chapter addresses this directly:
Contudo, é importante saber o seguinte: as duas variantes são eficientes como meios de comunicação. A classe dominante utiliza a norma culta principalmente por ter maior acesso à escolaridade e por seu uso ser um sinal de prestígio. Nesse sentido, é comum que se atribua um preconceito social em relação à variante popular, usada pela maioria dos brasileiros. Esse preconceito não é de razão linguística, mas social. Por isso, um falante deve dominar as diversas variantes porque cada uma tem seu lugar na comunicação cotidiana.
It is, however, important to note the following: both varieties are (equally) efficient as modes of communication. The dominant class uses the high register primarily because it has greater access to education and because its use is a sign of prestige. In this respect, it is common to approach the vernacular with a certain social prejudice, even though the vernacular is used by the majority of Brazilians. This prejudice has nothing to do with linguistics and everything to do with social stratification. Consequently, a speaker needs to be in command of both varieties since each has its place in everyday communication.


"The dominant class" naturally objected to being characterized as such. The venerable Brazilian linguist Evanildo Bechara in an interview with the Brazilian magazine Veja titled "Em defesa da gramática" ("In defense of grammar") decried the use of "sociolinguistic theories outside of the confines of academia" [6]. We have all encountered this type of thinking about the relationship between academia and the public, especially recently, but it is still shocking to hear an academic say that out loud. Bechara also described the observation that the standard language is a tool of domination used by the elites as "political orthodoxy" and "an obstacle for the country" [7]; make of that what you will. The aforementioned former senator Cristóvam Buarque insisted that people like Heloíse Ramos who point out the differences in how various groups of people speak AND say that it is ok (depending on situation), actually create two Portugueses: "the Portuguese of condos and shopping malls" and "the Portuguese of the streets and the fields." [8] This biportuguesism, concludes Buarque, strengthens the Brazilian apartheid [9]. You know this song, too. Mr. Buarque seems to be - or have been, he is old and rich now - a fellow leftist, but that does not matter. The song he is singing goes like this: "there have never been any divisions / until you started talking about them." And we know those who sing it and why.

All this happened 15 years ago and the last I checked, Ação Educativa assembled a file summarizing the debate. I have not been following the developments; I don't know what happened to Por uma vida melhor or what and how Heloíse Ramos is doing. There is a Novo Viver, Aprender, with a new chapter dedicated to language and digital literacy, which sounds great. The entire story remains a striking example of the kind of public response you get when conservativism, classism, power and ignorance of all matters language clash, which is why I document it here for posterity.

A coda: while fixing all the broken links, I came across this article from 2019. It is titled "All Portuguese spoken in Brazil is correct" and the title is a quote by Marcos Bagno, a bona fide linguist. His Wikipedia page includes a reference to Por uma vida melhor in the context of his work on linguistic discrimination. It also turns out that he took part in a televised debate on the subject mentioned in the file and that he is the author of the excellent Gramática de bolso do português brasileiro I picked up in Paris in the Librairie Portugaise et Brésilienne (the one next to Emily's apartment) a while back. This grammar contains such wonderful sections as "Orthography is not a part of language" and "Lexicogrammar" which convincingly argues against the dumb idea of syntax as a separate entity from the lexicon. So maybe all this ado about Por uma vida melhor was the last pangs of the old way of understanding language and things are looking up for Brazilian Portuguese.


Notes:
[1] "... o Brasil resolve criminalizar quem fala corretamente e quer ensinar a que os outros também o façam."
[2] The video seems to be privated, alas.
[3] "Vocês estão cometendo um crime contra os nossos jovens ..."
[4] "Só existe um português, que é o certo".
[5] "Tive muitos ... que assassinam a língua portuguesa cotidianamente." = "There were many ... who butchered/murdered the Portuguese language on a daily basis." 
[6] "As teorias da sociolinguística jamais deveriam ter deixado as fronteiras da academia". 
[7] "Dizer que la lengua culta é um instrumento de dominação de elites é uma ortodoxia política e um obstáculo para o país". 
[8] "Português dos condomínios e dos shoppings e o Português das ruas e dos campos." Italics in the original.
[9] "Permitir duas línguas é fortalecer o apartheid brasileiro." = "To allow two languages is to strengthen the Brazilian apartheid." Italics in the original. 

Thursday, February 19, 2026

warda

I would like to correct one of my previous statements: it is not entirely the case that I only like Columbo; I also like other things, for example Maltese and just about everything Umberto Eco has ever written. And so when I found out that a Maltese translation of The Name of the Rose had been published, all I could say was tace et cape pecuniam meam. A few weeks back I had the opportunity to escape the Central European cold and go to Malta. Naturally, one of my priorities was to obtain a copy of Isem il-Warda. And now that I hold it in my hands (after having catalogued it and provided it with a protective cover), I am ... not really confident in the quality of the translation.

The first signs are right there on the cover, more specifically the back of it which contains a brief bio of the author. It informs us that 

Umberto Eco (1932-2016) twieled ġewwa Alessandria fil-Piemonte.
Umberto Eco (1932-2016) was born in Alessandria in Piedmont. 

I am of course not a native speaker of Maltese, but over the last sev... twel... ohmygodreally twenty-two years of my engagement with the language, I have developed a good feeling for it, and this phrase strikes me as strange. More specifically, it is the use of the preposition that is strange. As Stolz et al. (2017: 457) point out, Maltese exhibits toponymic zero marking, which is just a fancy way of saying that if you want to say something happened at a named place, you typically do not use a preposition. For example:

Dun Joe Caruana twieled il-Mellieħa nhar it-2 ta' Awissu, 1960.

Dun Joe Caruana was born in Mellieħa on the 2nd of August, 1960.

That this sentence also contains an adverbial of time without a preposition is just a happy coincidence I will look into later. For now, note that this being a human language, there is at least some variation, and so this is also perfectly good Maltese:

Patri Serafin twieled fix-Xewkija fit-23 ta' Awwissu tas-sena 1932.

Father Serafin was born in Xewkija on the 23rd of August of the year 1932.

There is a third option available, the preposition ġo also meaning 'in'. And it is a well-established option, as evident from the fact that it is featured in No. 69 from Ilg and Stumme's Maltesische Volkslieder (Leipzig 1909), p. 27. I am reproducing the verse in question in modern standard orthography.

ara x'ġara ġo Ħal-Qormi

look what happened in Ħal-Qormi 

Interestingly, ġo does not seem to be used with the verb twieled (or its feminine form twieldet). When I ran a corpus search on the two verb forms and extracted 300 random examples, only three options cropped up:

Preposition Count
ø 37
f' 42
ġewwa 3
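A tally like the one above can be approximated automatically, at least as a first pass before checking the hits by hand. The sketch below is a crude heuristic of my own devising for illustration, not the corpus tool's query language: it only looks at the token right after twieled/twieldet and does not check whether what follows is actually a place name, so a date like fl-1932 would be miscounted.

```python
import re
from collections import Counter

# The verb form followed by the next whitespace-delimited token.
VERB = re.compile(r"\b(?:twieled|twieldet)\s+(\S+)", re.IGNORECASE)
# f' before a vowel, or fi/f plus the (assimilated) article: fil-, fix-, fit-, fl- ...
FI = re.compile(r"^(?:f['’]|fi?\w{1,2}-)")


def classify(sentence):
    """Return 'ġewwa', "f'" or 'ø' for the token after the verb, or None if no verb."""
    match = VERB.search(sentence)
    if match is None:
        return None
    following = match.group(1).lower()
    if following == "ġewwa":
        return "ġewwa"
    if FI.match(following):
        return "f'"
    return "ø"


examples = [
    "Umberto Eco (1932-2016) twieled ġewwa Alessandria fil-Piemonte.",
    "Dun Joe Caruana twieled il-Mellieħa nhar it-2 ta' Awissu, 1960.",
    "Patri Serafin twieled fix-Xewkija fit-23 ta' Awwissu tas-sena 1932.",
]
tally = Counter(classify(s) for s in examples)
```

With the three sentences quoted in this post, each option gets exactly one hit.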

This low frequency of ġewwa + NOUN_PROP supports my feeling and to check, I went to the local digital watercooler and asked native speakers. The vast majority of them shared my suspicion of it, some describing it as an Ingliżata, i.e. a calque from English (with a hint of negative sentiment conveyed by the suffix -ata). Many had never seen it and cried apage satanas itlaq ja xitan; some have pointed out that this is a feature of (presumably bad) journalistic style - after all, the preposition actually means 'inside (of)' a 3-dimensional enclosed object. Corpus data partially bear these observations out: even if we just consider the two prepositions and the two most frequent toponyms - which, unsurprisingly, turn out to be the Maltese names for Malta and Gozo - ġewwa is by far the minority option:

Toponym   f'        ġewwa
Malta     105.214   700
Għawdex   43.065    1.273

So far so good. The text type (genre) analysis of the use of ġewwa with Malta, however, identifies a different culprit for this Ingliżata. As a shock to no one, it's the politicians who are responsible for this crime against the Maltese language.

Rank Text type Absolute frequency %
1 parliament debates 548 78.06%
2 newspaper 147 20.94%
3 non-fiction 6 0.85%
4 fiction 1 0.14%
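For those who like to check the arithmetic, here is a small sketch that recomputes the shares from the two tables above. It assumes that the period in figures like 105.214 is a thousands separator; the variable names are mine. Note that the four text-type counts add up to 702, slightly more than the 700 in the previous table, and the percentages are relative to that 702.

```python
# Counts from the f' vs. ġewwa table (period read as a thousands separator).
counts = {
    "Malta": {"f'": 105_214, "ġewwa": 700},
    "Għawdex": {"f'": 43_065, "ġewwa": 1_273},
}

# Share of ġewwa among the two prepositions, in percent.
shares = {
    place: round(100 * row["ġewwa"] / (row["f'"] + row["ġewwa"]), 2)
    for place, row in counts.items()
}

# Counts of ġewwa Malta by text type.
by_text_type = {
    "parliament debates": 548,
    "newspaper": 147,
    "non-fiction": 6,
    "fiction": 1,
}
total = sum(by_text_type.values())
percentages = {
    text_type: round(100 * n / total, 2) for text_type, n in by_text_type.items()
}
```

In other words, ġewwa accounts for well under one percent of the cases with Malta and just under three percent with Għawdex.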

Excellent, that all makes sense, so it's bad Maltese spread by politicians. And as with all such things, the origin of this abomination is in the influence of English where the function of the English 'in' was calqued onto the Maltese ġewwa. Done, dusted, all explained. Except...

You see, the corpus I have been using so far is one designed to cover the language of the first two decades of the 21st century (plus or minus). As such, it does not contain older works of literature, such as those written by Ġużè Muscat Azzopardi and Anton Manwel Caruana, where especially the latter is noted for his purism (i.e. exclusion of words of non-Arabic origin). I do have a corpus that includes these works and a few clicks and key presses later, I can confirm that ġewwa is used with toponyms in works from the late 19th century as well. Like this one from Muscat Azzopardi's 1881 Viku Mason:

Wara jumejn, kien magħluq ġewwa l-Imdina...

After two days, he was locked in Mdina...

Or this one from his 1909 Nazju Ellul (where the term il-Belt 'the city' refers to the location that only laws and tourist guides call 'Valletta').

Imma ġewwa l-Belt ġriet ix-xniegħa bejn in-nies tagħna...

But in Valletta the rumour spread among our people...

Recall that the original meaning of ġewwa is 'inside (of)', i.e. within a 3-dimensional enclosed object; in fact, ġewwa also doubles as an adverb with that very meaning. The prototypical noun to be used with ġewwa is 'house', 'building' or 'school'. And its use makes perfect sense in both examples above when you consider the physical nature of the locations: Mdina is a walled city located on a hill, while Valletta is on a peninsula with a single point of entry. Both can thus be viewed as 3-dimensional enclosed objects.

The same is not the case for the following example from Caruana's 1889 Ineż Farruġ. Here ġewwa is used with a name of a locality that is not surrounded by walls or the sea:

... kien ġej ma' missieru minn ġewwa r-Rabat...

... he was coming with his father from inside of Rabat...

And of course, in this case, we are not dealing with location, but rather movement from. The use of ġewwa is still at the very least redundant - minn itself would suffice - but maybe it serves to indicate that the character came from the center of the village and not, say, from some farmhouse on its outskirts.

Be that as it may, I am now much less confident that the use of ġewwa with toponyms can solely be blamed on bad Maltese spoken by politicians or the influence of English. The current use we see might very well be just the extension of use that was first limited to specific contexts (as with Rabat) or even specific locations (as with Valletta or Mdina). So maybe my scepticism regarding the quality of the translation of Eco's The Name of the Rose into Maltese is misplaced.

But then I opened the book to the first page. As I'm sure you remember, it begins with the author's description of how he came across the absolutely 100% totes very real manuscript, naturally, that he then somehow lost and now translates for us from his notes and memory - incidentally, a very popular trope in modern literature that turns out to have a long history. This narrative is anchored by two dates: the Warsaw Pact invasion of Czechoslovakia and the date of the book in question. This is what we read in the Maltese translation:

Fis-16 t'Awwissu 1969 xi ħadd għaddieli ktieb ta' awtur jismu abate Vallet, Le manuscrit de Dom Adson de Melk, traduit en français d'après l'édition de Dom J. Mabillon (Aux Presses de l'Abbaye de la Source, Paris 1982).

The Prague Spring and the subsequent invasion took place in 1968. Eco's book translated here came out in 1980.

Here is how Weaver's English translation renders this passage (emphasis mine):

ON AUGUST 16, 1968, I WAS HANDED A BOOK WRITTEN by a certain Abbé Vallet, Le Manuscrit de Dom Adson de Melk, traduit en français d’après l’édition de Dom J. Mabillon (Aux Presses de l’Abbaye de la Source, Paris, 1842). 

Considering that this is the first page, this does not bode well...

Thursday, February 12, 2026

ishoyahb

In the history of native Syriac linguistic tradition [1] Išoʕyahḇ Bar Malkōn (d. early 13th century) is the odd man out. It is not that he is unknown or forgotten: his grammatical works are preserved in a not insignificant number of manuscript copies and his name is listed with other grammarians in overviews of Syriac literature compiled by modern scholars, as well as his contemporaries. Of the latter, the testimony of ʕAbdīšōʕ Bar Brīḵā's (d. 1318) Catalogue of Books is particularly telling: where Eliya of Ṭirhan (d. 1049) and Yōḥannan Bar Zoʕbī (d. 13th century) are described as having composed grammars or grammatical treatises, of Išoʕyahḇ Bar Malkōn and his grammatical works we only learn the following:

ܡܳܪܝ ܝܶܫܘܽܥܰܝܲܗܒ ܒܰܪ ܡܰܠܟܳܘܢ ܕܰܨܘܒܳܐ ܐܝܺܬ ܠܶܗ ܫ̈ܘܽܐܳܠܐ ܓܪܰܡܡܰܛܝܺܩܳܝܶܐ

"Mār Išoʕyahḇ bar Malkōn of Ṣōḇā [Nisibis]: he has some grammatical questions..."

Whether this refers to a specific genre, is meant to be read generally or anything else, that's it as far as grammar is concerned. This lack of specificity with regard to Bar Malkōn’s work as a grammarian is also typical for modern sources. When consulting one, the reader typically learns no more than that he authored at least one treatise on points and one grammar (both unedited) [2], and that in his grammatical analysis, he followed the Arabic model [3]. One prominent example is Baumstark who describes Bar Malkōn’s grammar as “sachlich ganz die Methode der arabischen Grammatik befolgend” (“in terms of content, it entirely follows the methodology of Arabic grammar”) [4]. Over time, this simple observation - repeated uncritically - morphed into a judgment and finally into a condemnation: Talmon notes of Išoʕyahḇ bar Malkōn – and his contemporaries (or fellow travelers) like Yōḥannan bar Zoʕbī and Eliya of Ṭirhan – that they “exhibit either a servile attitude to Arabic grammar or poor coverage of grammatical issues.” [5]

Talmon's "poor coverage" remark is particularly silly. For one, the comparison made here is to Jacob of Edessa's grammar of Syriac which is notorious for - not to put too fine a point on it - BEING ALMOST ENTIRELY LOST. Secondly, "poor coverage" is a relative term, even in this day and age, doubly so in the 13th century. But most importantly, none of Išoʕyahḇ Bar Malkōn's works have been edited or analyzed in any detail, so there is simply no way for Talmon to know.

In fact, that Talmon's assessment of Bar Malkōn (and, by extension, that of those whose judgment he relies on) is wholly wrong can be gleaned from even the most cursory of interactions with the latter's grammatical works. This applies especially to Bar Malkōn's magnum opus, a grammar of Syriac titled Ktābā d-manhrānūṯā ba-gramaṭīqī sūryāytā/Kitāb al-ʔīḍāḥ fī naḥw as-suryānī (“Book of elucidation in Syriac grammar”, henceforth: Kitāb al-ʔīḍāḥ), extant in at least four manuscripts:

  1. Paris BnF Syr. 262 (1v-112r; 16th century) 
  2. Paris BnF Syr. 370 (2r-96r; 1569) (olim Seert 101)
  3. Berlin SBB Ms. or. quart. 1050 (2v-106v; 17th century)
  4. Florence Laur. Or. 419 (1r-96r; 1589) 

Four notes on this list:

Firstly, Stadel's entry on Bar Malkōn in his recent edition of bar Brīḵā's Catalogue (Stadel 2025: 213) lists the Berlin manuscript as located in Tübingen (as does Van Rompay). This is consistent with Assfalg's catalogue, but not with the online catalogue of the Tübingen collections (which, however, contains a work called Bülbüliye, huh). I am reliably informed the manuscript is indeed in Berlin at the Stabi; in fact, this is where I consulted it a few hours ago.


Secondly, Stadel does not list BnF Syr. 262, [Added note X] which is understandable: this manuscript does not give any author and its title is also different, namely

 ܟܬܐܒ ܐܠܢܚܘ ܡܦܣܪ ܡܢ ܐܠܣܪܝܐܢܝ ܐܠܝ ܐܠܥܪܒܝ ܐܠܡܥܪܘܦ ܥܢܕ ܐܠܣܪܝܐܢ  ܓܪܰܡܰܛܺܝܩܺܝ ܬܘܳܪܰܣ ܡܰܡܳܠܐ ܝܥܢܝ ܬܨܚܝܚ ܐܠܟܠܐܡ 
"The  book of grammar translated from Syriac into Arabic known as 'Gramaṭīqī - Tūraṣ mamlō' which means 'Grammar - Correction of speech'". 

The term ܬܳܘܪܽܨ ܡܰܡܠ̱ܠܳܐ (note the correct Syriac spelling here) tūraṣ mamllō lit. 'correction of speech' is generally used to mean 'grammar' and so one finds it in titles of grammatical works modern and medieval; the lost grammar by Jacob of Edessa is reported to have borne it. The phrase shows up even in the Syriac version of the title given by BnF Syr. 370 and SBB Ms. or. quart. 1050, although in those two, the first word is given as ܬܪܝܨܘܬܐ. The BnF catalogue refers to the work contained in BnF Syr. 262 as "Grammaire de la langue syriaque, divisée en quarante-cinq chapitres, par un auteur maronite". Why a Maronite is a mystery; it could be because it is written entirely in garšūnī or just because it uses Serto. Regardless, even a cursory comparison of BnF Syr. 262 to BnF Syr. 370 makes it clear that they are the same work containing 46 (BnF Syr. 370 and SBB Ms. or. quart. 1050) or 45 (BnF Syr. 262) chapters. Also, 46 chapters on some 100 folios of 18-20 lines each? So much for "poor coverage".

Thirdly, Stadel adds Vat. Syr. 150 (200r-215v, 1709). This identification is clearly not correct - as should be evident from the number of folios - and likely the result of undue reliance on Baumstark (a common affliction in Syriac scholarship). Assemani's catalogue describes the manuscript as "Jesujabi Episcopi Nisibeni ... Quaestiones Grammaticae & aenigmaticae" and sure enough, this is our Išoʕyahḇ. Baumstark incorrectly assumes that this is the same work as the previous ones he lists, i.e. Seert 99 (now lost), Seert 100 (also lost) and Seert 101 (our BnF Syr. 370) [6]. This work, however, is in Syriac only; moreover, it is indeed a list of questions - maybe even the one Bar Brīḵā refers to - and not Kitāb al-ʔīḍāḥ.

And finally, I was made aware of the existence of the Florence manuscript by Margherita Farina (see also her article), to whom I hereby extend my gratitude. The text seems to be identical with that of BnF Syr. 262, although interestingly, a colophon ascribes the authorship of the work to George (Gewārgīs) ʕAmīra, a Maronite scholar and bishop, the author of Grammatica Syriaca.

In summary, it may well be the case that there are not two, but three grammatical works written by Bar Malkōn:

  1. A treatise on points (BnF Syr. 369, 114v-125v; BnF Syr. 370, 174r-187v; London BL Add. 25,876, 276v-290v and likely many more, on which later). 
  2. Kitāb al-ʔīḍāḥ (see above)
  3. Grammatical questions (Vat. Syr. 150, 200r-215v)  

Turning back to the contents of the manuscripts of Kitāb al-ʔīḍāḥ, it is not the case – as Baumstark’s description (which most likely goes back to Scher's catalogue and which Van Rompay copies in his GEDSH entry on Bar Malkōn) would have it – that Kitāb al-ʔīḍāḥ is originally written in Syriac with a translation in Arabic in two columns ("... das syrische Original in einer Parallelkolumne mit einer arabischen Üb[er]s[etzung] ...") [7]. That is not true of any of the surviving mss Baumstark was aware of, i.e. the two Paris mss and the Berlin one.  Rather, the primary language of Kitāb al-ʔīḍāḥ is Arabic, but Syriac is employed throughout, in both examples and definitions of grammatical phenomena. Such Syriac text rarely constitutes a direct translation of any of the Arabic parts. As a rather straightforward example, consider this section from chapter 2 on parts of speech (BnF Syr. 370, fol. 9r-9v) with the Syriac portions highlighted in red (translation and numbered subsection division mine, underlined text is colored in the manuscript):


1

Chapter 2: On the division of parts of speech. Division of speech.

الباب الثابي في اقسام الكلام ܗ̄. ܦܘܲܠܓܲܐ ܕܡܡܠܠܵܐ


2

Among the Syrians, as well as the Arabs, speech is divided into three things: noun, verb and particle. That is, noun, verb and particle.

الكلام عند السريانيين والعرب. ينتظم من ثلثه اشيآ ܫܡܐ. ܘܡܸܠܬ݂ܐ ܗ̄ ܥܒ̣ܵܕܐ ܘܐܣܵܪܐ. ܗ̄ اسمٌ. وفعلٌ. وحرفٌ.

3

Some examples of nouns include: person, man, horse, mountain, command and similar.

فالسم نحو قولك ܒܪܢܫܐ. ܓܒܪܐ. ܣܘܣܝܵܐ. ܬܘܪܐ. ܐܸܡܪܐ. وما شاكل ذلك ܀

4

And know that everything that ends in an alif in the Syriac language is, for the most part, a noun. And a (word) that takes one of the four particles (lit. 'additions') BDWL BDWL is a noun.

واعلم ان كل ما اخره الف في لغه السريانيون فهو اسم علي الامر الاكثر وما يدخل عليه احدي الزوايد الاربع وهي بدول ܒܕܘܠ فهو اسم ܀

5

The definition of a noun among them [= Syriac grammarians]: sound with meaning that (is) without tense.

و حد الاسم عندهم ܀ ܩܠܐ ܡܫܘܕܥܵܢܐ ܒܫܠܡܘ̣ܬܐ ܕܠܵܐ ܙܲܒܢܵܐ. ...

6

And others define it as follows: the first part of speech designating a thing or an action.

ܐܚܪ̈ܢܐ. ܕܝܢ ܬܲܚܡܘ̣ܗܝ ܗܟܢܐ ܡܢܬܐ ܩܕܡܵܝܬܐ ܕܡܡܠܠܐ ܕܡܫܵܘܕܥܵܐ ܨܒ̣ܘ̣ܬ̣ܐ ܡܕܡ ܐܵܘ ܣܘܥܪܢܐ ܀

To be fair, the Arabic influence is indeed undeniable: it is clear, for example, from the division of the parts of speech into three classes (section 2), where the native Syriac linguistic tradition typically works with seven, i.e. the eight of Technē Grammatikē minus the definite article. I guess it is ok to be servile to the Greek model, although on the other hand, Bar Hebraeus divides his grammar into four treatises on, respectively, nouns, verbs, particles and orthography, so maybe he is servile to Arabic models as well... In any case, the influence of Arabic on Bar Malkōn's analysis is also evident from the choice of his examples: 'man' and 'horse', for example, are also given as examples of nouns in Sībawayh’s Kitāb.

The rest of the section, however, is anything but a servile copy of the Arabic method without any connection to the Syriac linguistic tradition. One such connection is the terminology: melṯā, his term for 'verb', is one that is well-established in the Syriac scientific terminology, though originally used as a translation for ῥῆμα in philosophical works. The term for 'particles', esārā, is also in common use in native Syriac linguistic tradition, although typically meaning 'conjunction', translating the Greek σύνδεσμος, both as a philosophical term, as well as the linguistic one (ch. 20 of Technē Grammatikē). Interestingly, Bar Hebraeus uses both terms in the same way Bar Malkōn does. [8] These and other items of Syriac linguistic terminology occur all over Kitāb al-ʔīḍāḥ, both as a result of dealing with matters specific to Syriac (and not only such obvious things as vowel points), but especially due to the bilingual nature of the work. This of course requires Bar Malkōn not only to engage with the Syriac tradition, but also to attempt to harmonize it with the Arabic linguistic framework and even make attempts at comparative linguistics.

The major way in which Kitāb al-ʔīḍāḥ is undoubtedly a part of the native Syriac linguistic tradition - as opposed to a mindless copy of the Arabic one - is Bar Malkōn’s constant references to the same and his insistence on working within it. The introduction (BnF Syr. 170, ff. 2v-4v) contains a brief overview of the previous work by Syriac grammarians and scholars of language, including Jacob of Edessa (d. 708), Eliya of Ṭirhan and Yawsep̱ Hūzāyā (6th cent.), the purported translator of Technē Grammatikē into Syriac. The text of Kitāb al-ʔīḍāḥ then repeatedly refers to their work (the "among them" in section 5 above and "others" in section 6) and cites them by name regularly. The chapter on parts of speech cited above also contains one very telling example in section 5, i.e. the absence of time as a major criterion for the definition of a noun. This line of reasoning is unique to Syriac linguistic tradition and can be traced to Aristotle, e.g. De Interpretatione. In contrast, Technē Grammatikē opts for a morphological/semantic definition (English translation). Now Arabic tradition is complicated, but it involves morphological and syntactic criteria; a simplified contemporary grammar uses a definition that is heavy on the morphology. True, so does Bar Malkōn’s own definition of a noun in section 4, treating the particles BDWL as morphological properties. But then again, this is a fact of Syriac, obvious to anyone with even a passing familiarity with the language. So servile attitude towards Arabic models or sensible analysis of one’s language? The latter definitely applies to the entirety of what BnF Syr. 370 calls chapter 47 (96v-173v), missing in BnF Syr. 262 and SBB Ms. or. quart. 1050. [9] This chapter is sometimes treated as a separate work - or even genre - called De vocibus aequivocis, i.e. "On ambiguous words", and contains a Syriac-Arabic glossary of homographs.
None of this slavishly follows the Arabic model; in fact, the more I think about it, the more I am convinced that those who argue so have only ever read the section on parts of speech. The relationship of Bar Malkōn's analysis to the Arabic linguistic tradition reminds me of the way grammars of modern languages follow the Latin model: there is some, even a lot of inspiration, that may even be slavish now and then - just think of the concept of parts of speech and the terminological fustercluck that are Wolof conjugated pronouns. The Latin method, however, is not all there is.

As noted above, Bar Malkōn's work remains unedited and unpublished - hell, this hastily put-together post might be the most comprehensive study of his work to date. If anyone wishes to change it, for example as an MA thesis (his short treatise on points would be perfect) or even a PhD dissertation, hit me up.


Friday, February 06, 2026

dobra

Hans Stumme (1864-1936) was a German linguist whose work is probably known to anyone interested in Berber and North-African varieties of Arabic. Stumme travelled a lot and collected huge amounts of spoken data from - inter alia - Tunisians, Išelḥiyen and the Maltese. As far as I can tell, this is the more or less full list of his works containing such data:

Arabic

  1. Albert Socin and Hans Stumme. 1894. Der arabische Dialekt der Ho̮uwāra des Wād Sūs in Marokko. Hirzel, Leipzig. Text
  2. Albert Socin and Hans Stumme, editors. 1901. Diwan aus Centralarabien. B.G. Teubner, Leipzig. Text
  3. Hans Stumme, editor. 1893. Tunisische Märchen und Gedichte: Eine Sammlung prosaischer und poetischer Stücke im arabischen Dialecte der Stadt Tunis; nebst Einleitung und Übersetzung. Hinrichs, Leipzig. Vol. 1; Vol. 2
  4. Hans Stumme. 1894. Tripolitanisch-Tunisische Beduinenlieder. J. C. Hinrichs’sche Buchhandlung, Leipzig. Text
  5. Hans Stumme. 1896. Grammatik des tunisischen Arabisch nebst Glossar. Hinrichs, Leipzig. Text
  6. Hans Stumme, editor. 1898. Märchen und Gedichte aus der Stadt Tripolis in Nordafrika: Eine Sammlung transkribierter prosaischer und poetischer Stücke im arabischen Dialekte der Stadt Tripolis nebst Übersetzung, Skizze des Dialekts und Glossar. Hinrichs, Leipzig.
  7. Hans Stumme. 1915. Fünf arabische Kriegslieder des berühmten deutschen Kriegsfreiwilligen Fritz Klopfer: Tunisische Melodien mit arabischem und deutschen Text. Hinrichs, Leipzig. Text

Berber

  1. Hans Stumme. 1895a. Dichtkunst und Gedichte der Schluh. PhD Thesis, Zugl.: Leipzig, Univ., Habil.-Schr., 1895, Leipzig.
  2. Hans Stumme, editor. 1895b. Märchen der Schluḥ von Tázerwalt. Hinrichs, Leipzig. Text
  3. Hans Stumme. 1899. Handbuch des Schilhischen von Tazerwalt. Grammatik - Lesestücke - Gespräche - Glossar. Hinrichs, Leipzig. Text
  4. Hans Stumme, editor. 1900. Märchen der Berbern von Tamazratt in Südtunisien. Hinrichs, Leipzig. Text
  5. Hans Stumme. 1914. Eine Sammlung über den berberischen Dialekt der Oase Sîwe: Sitzung vom 12. September 1914. Teubner, Leipzig.

Maltese

  1. Bertha Kössler-Ilg and Hans Stumme, editors. 1909. Maltesische Volkslieder im Urtext mit deutscher Übersetzung. Hinrichs, Leipzig.
  2. Hans Stumme, editor. 1904a. Maltesische Märchen, Gedichte und Rätsel in deutscher Übersetzung. Hinrichs, Leipzig. Text
  3. Hans Stumme. 1904b. Maltesische Studien: Eine Sammlung prosaischer und poetischer Texte in maltesischer Sprache, nebst Erläuterungen. Hinrichs, Leipzig.

It is quite clear that Stumme was particularly interested in collecting folk literature, such as fairytales and songs, where his books remain an invaluable source of data for folklorists. At the same time, Stumme's work is extremely valuable for the study of the languages involved, since Stumme expended an enormous amount of effort on meticulously capturing the phonology of the varieties he studied. As a result, his work is regularly used by those studying the varieties he covered, in some cases being an object of study in itself.
 
This applies doubly to Maltese where there have been at least two major studies of the fairytales (1, 2). As far as I can tell, there is little focus on reevaluating Stumme's dialectological work (but that might change soon), which is a shame, because there is so much fascinating stuff in there. Like for example song no. 70 from the collection of Maltese songs (Kössler-Ilg and Stumme 1909, p. 27). I am reproducing the text below in standard Maltese orthography and Stumme's original German translation accompanied by my English one based on the Maltese text.

Ta' dobra sejrin jsiefru

kemm iħallu qlub miksura!

Kif ħarġu mill-port 'il barra,

tathom qalbhom, "erġgħu lura!"


Die Slawen wollen abreisen,

wie viele gebrochene Herzen lassen sie hier zurück!

Als sie aus dem Hafen hinausgefahren waren, 

gab ihnen ihr Herz ein: "Kehrt wieder um!"


The Slavs are about to leave,

how many broken hearts they leave behind!

As they left the port,

their hearts gave out, "Come on back!" 

A note here: the phrase tathom qalbhom is a bit of a mystery. It does bring to mind the idiom qata' + IO qalb + POSS 'be discouraged, lose faith', but the morphology does not make sense: the verb is PAST.3SGF - which works, since qalb 'heart' is feminine - and the noun bears the 3PL possessive marker. The -hom in tathom looks like the direct object (P argument) marker, but semantically it designates the recipient (R argument), so we have the IO component here. But then again, the form tat definitely looks like PAST.3SGF of ta 'to give' which recalls the idiom ta + DO ras + POSS 'to panic'. So maybe there is an entire class of such idioms to which ta + DO qalb + POSS belongs, I will need to look into that.

But that is not why we are here. We are here for the multi-word expression in bold that Stumme translates as the ethnonym "Slavs". The composition of the expression is clear: the element ta' is what Arabic dialectology refers to as the genitive exponent, i.e. possession marker, the equivalent of 'of'. In North African varieties, it usually takes the form mtāʕ/ntāʕ etc.; the apostrophe at the end of ta' is what remained of ʕ in Maltese. ta' (or tal- with a definite article) + NOUN is how Maltese creates group names: ta' Lejber 'Labourists', tal-PN 'nationalists (lit. of Partit Nazzjonalista)' are perhaps the most prominent examples. Similarly, in a version of the Maltese translation of Bandiera rossa, the first verse goes Tal-pinna o ħutna, ukoll tal-mazza where pinna is 'pen' and mazza is 'sledgehammer', the two expressions meaning 'intellectuals' and 'workers'.

What then of the dobra? That is quite simple; as Stumme himself puts it on p. 11, we're dealing with "die Leute, die immer dobra 'gut!' sagen" ("the people who always say dobra 'good!'"). That we do so and that we are perceived as such I can attest to from personal experience, recalling for example an Albanian lady in a B&B in Italy who upon learning that I am Slovak went "Oh you are one of the dobre dobre people!" That this is also how the Maltese thought of us back in the late 19th century is fascinating. Now the question remains which Slavs are these, since the general adverb of agreement usually takes the form dobre/dobro. The only language I can think of where people use a form with an [a] at the end is Czech, but there the vowel is long and considering the geography of the region, it is more likely that Maltese would encounter South Slavs. So probably not Czechs and definitely not the Polish or Slovaks, otherwise it would either be ta' dopxe or, of course, ta' kurva.

Monday, February 02, 2026

work

So, anyway, been a while, right? How have y'all been the last *checks notes* few years? Yeah, I know,  interesting times... How about instead of focusing on that shit, I show you what I have been up to since 2015 or so. Let's start with some of the projects I have been working on that you might find interesting.

 

HunaynNET

Named after Ḥunayn ibn Isḥāq (807-873), a physician and prolific translator of classical philosophical and scientific literature, this ERC-supported project collected all texts of classical science that were translated into both Syriac and Arabic. The translations were then re-edited and aligned on the level of semantic and syntactic units that... Well, they are not quite sentences, but we tried to keep them as small as possible. The text is also tokenized and links to dictionaries and corpora are provided; and in some cases, we also provided aligned text of translations into modern languages. Although I did contribute to the editions, my main job was processing the data and building the interface. I know, I know, it now needs some updates, especially when it comes to the aforementioned links to various dictionaries, chief among them the Glossarium Graeco-Arabicum which is thankfully now back online, completely rebuilt. This project is a wonderful resource not just for those interested in the philosophical and scientific exchange between the East and the West, but also those learning/studying any/all of the languages involved.

 

Simtho

Despite its tagline "The Syriac Thesaurus" (it's a, um, user-friendliness thing), this is an electronic corpus - the only one worthy of the name - of the Syriac language. It contains ~25 million words and represents roughly 95% of all literature in Syriac. It is largely based on printed editions, although a few manuscripts snuck in. Simtho (a Syriac word meaning "treasure" pronounced according to West Syriac conventions) is the product of thousands of hours of work by hundreds of people assembling metadata, scanning books and checking OCRd texts. Simtho is the largest project run at Beth Mardutho led by the indomitable George Kiraz and it is done without any major grants or other financial support from research agencies or governments; a bootleg operation is what George calls it.  My job is being the last link in the chain, i.e. setting up and managing the entire processing pipeline, as well as the server(s) and all the software on it. This includes the installation (and its customization, for George has many ideas) of NoSketch Engine on which the whole corpus runs. In addition to that, I have been doing some language modelling and annotation, on which perhaps later. One by-product of the work on Simtho is a set of OCR models for the recognition of Syriac printed text. These models are trained using the open source Kraken platform and available on Zenodo.

 

Zoroastrian Middle-Persian Corpus and Dictionary

This DFG-supported ongoing project seeks to collect, annotate and analyze all available Zoroastrian texts written in Middle Persian to create a searchable corpus (in transcription) and finally an updated dictionary of Middle Persian. I was largely responsible for data processing, conversion and import, so none of what you see online is my work. The web application is still very much a work in progress, but once finished, it will be a one-stop shop for all your Zoroastrian Middle Persian needs, including manuscript images and comprehensive lexical resources.


Wednesday, January 28, 2026

depowedlajo

Weinreich's "History of the Yiddish Language. Vol. 1" (Yale 2008), p. A 43-44 (the notes at the end) contains the following intriguing section (emphasis mine):
 
Late survival of Aramaic [§2.8]: ... Scholarly Maskilim have used Targumic for humorous effect until modern times. A humorist from the former Hungarian regions, very likely eastern Slovakia, has one of his characters say: “Krapple depowedlajo, tawjo Mikrapplo [!] dezworechanjo, dos heiß of [!] gallchisch: Powedelkröpplech [!] senn besser als wie Zworachkröpplich [!].” Cf. P. Schwarz, Reb Simmel Andrichau (Vienna, 1878), 47. To achieve a superclimax (§1.6, note), Targumic is called galkhish ‘Latin’ here (§3.3).

For clarity, I'm providing the Targumic Aramaic with glossing below:

 

As Weinreich says, the source of this passage is a Purim play published by P. Schwarz in Vienna in 1878 titled "Reb Simmel Andrichau". Luckily, it is now available on Google Books, so let's have a look. The dramatis personae page of the booklet says "Die Handlung spielt am Purim in einer kleinen jüdischen Gemeinde Ungarns" ("The events take place on Purim in a small Jewish community in Hungary") and I wonder what made Weinreich think it was in Eastern Slovakia, aka my homeland.
 
Sure enough, we do make a pastry we call krepľe (Krapple) and fill them with both plum jam (Powedl) and cheese curds (Zworach, cf. Slovak tvaroh) and yes, obviously the former are far superior to the latter. However, we do not refer to plum/damson jam as Powidl or povidla, that is - as far as I can tell - a very distinctly Czech and Austrian thing; to us in Eastern Slovakia, it's simply šľivkovi ľekvar.
 
So I checked the text (printed in Fraktur, mind you) for some more clues, and sure enough, there they were. Like this on p. 6:
Mein Masel ober, kummt n' Orel mit e Sack auf'n Rucken
und thut sich tomid in die Stub umgucken;
...
Auf einmol sogt er: "No Schmerlitschko! Zo mie date
"Sa koßek Stschiebro lebbo Slate,
"Hrube wellize - namo duscha - hrube wellize
"Jak wasche dwje Tschewitze?"

And that is nothing but good and proper, if a bit mangled, Czech. In modern orthography:
[My luck now, there comes an Orel with a sack on his back
and starts looking around in the room;]
...
"No Šmerličko, co mi dáte,
"za kousek stříbra, nebo zlaté,
"hrubé velice - na mou duši - hrubé velice
"jak vaše dvě střevíce?"

"Well, Šmerličko, what will you give me,
"for a piece of silver, or gold coins,
"very thick - I swear - very thick
"like two of your shoes?"
It should be noted that Schwarz provides translations of many Hebrew words used in the play - there are 263 endnotes for 49 pages of text, translating even common ones, like "Jeschiewe", "Emmes" and "Megille". And so "Orel" is glossed as "uncircumcised, gentile" and the entire passage I cited is translated, as are many others that feature dialogues in Czech.
 
Now the question is, what kind of Czech. The use of the word "hode" < "hody" by the aforementioned Orel (p. 8, translated as "Fasching" = carnival, religious feast) suggests that at least the Orel is from Moravia (cf. Český jazykový atlas vol. 2, map 222). That the events take place somewhere in or around Moravia is further supported by a reference to Proßnitz, today's Prostějov, as a place one of the characters went to study in.
 
So alas, not Eastern Slovakia, but rather Western Slovakia, if we are to believe the author. And I don't see why we shouldn’t, when according to the 1851 census, Western Slovakia (bordering Moravia) was the part of the country (which back then was a part of the Kingdom of Hungary) with the highest concentration of Jewish communities, on par even with Eastern Slovakia. The image below is taken from the Historical atlas of the population of Slovakia (p. 166).
 

And that's before considering the language of the play which seems to blend Yiddish with Standard German and South German varieties - the auxiliary "thut" in the passage cited above strikes me as very Austrian - and maybe even various registers of Yiddish, presumably for humorous effect. But that's another story.
 
 

Monday, January 26, 2026

mentalist

I subscribe to an embarrassingly high number of streaming services. I don't have much time to watch any of them (and in any case, I am an old man and I dislike everything except Matlock Columbo and look, it's on right now!) and so the only two justifications I can give myself for the expense are a) I get some of them for free or at a low cost b) it's all for research and learning. I am talking here of course about the dubs and subs various streamers provide, especially here in Europe.

In addition to the usual major languages, Netflix is doing a lot for Catalan, Galician and even Basque, as well as keeping Syrian voice actors who live in Turkey employed by dubbing all their Turkish production into Syrian Arabic. The European equivalent of Paramount+, Skyshowtime, provides subtitles in - among others - Albanian and Bosnian and if you are the kind of person who enjoys the slop that is now churned out under the name "Star Trek", you can watch it dubbed into multiple languages, including Norwegian, Slovenian and Romanian.

My favorite streamer is HBO Max. For one, the only way to watch Rick and Morty is in the original Polish and 90s nostalgia in Friends is so much better in Bulgarian which I am currently trying to learn. I also like to rewatch procedurals like The Mentalist and lucky for me, HBO Max also subtitles it in Bulgarian. I am working my way through it while on airplanes and trains and so I am currently on season 4. Yesterday I watched episode 19 where at one point, Jane's friend the magician addresses Jane as "dude". Bulgarian subtitles render this as пич [piʧ]. I chuckled - the word sounds like a Slovak vulgar term for female genitals and I am mentally 14 - and then today I looked it up.

It turns out the semantic range of the word is quite interesting and when it refers to human beings, it has four different senses in Bulgarian.

  1. (archaic, dialectal, vulgar) a male child born out of wedlock, bastard
  2. lazy and incompetent person
  3. (slang) a man who can be counted on, especially when it comes to something bad
  4. (youth slang) a stand-up guy

The dictionary gives the etymology as Persian pič پیچ "a plant shoot", in Bulgarian via Turkish. Now I seem to have misplaced my copy of Junker-Alavi's dictionary, but all the other resources agree or give a more specific sense, "young wine". And then I checked a Turkish dictionary and it all got interesting: the Persian word appears in the Codex Cumanicus, a 14th century manual of the Turkic language spoken by the Cumans known as Cuman, Kipchaq or Tatar. The book is arranged in three columns with Latin terms and their Persian and Cuman equivalents.

Our word appears on fol. 50r of the Venice 1597 copy (Marciana Lat. Z 549) in the second part of the Codex where lexical items are arranged in semantic fields. Fols. 49v-50r contain a section titled "Defecta hominum" and this is where we find Persian pič as a translation of the Latin bastardus.

 

This of course complicates the etymology and history of the Bulgarian пич, since the fifth meaning the Bulgarian dictionary gives is that of "sprout, plant shoot", albeit only dialectally. Persian dictionaries give this word a secondary sense "turn, complication, intricacy" and this is the main sense also given by Turkish dictionaries. The same dictionary also gives the sense "bastard" (in the purest Turkic as "veledi zina"), but only with a question mark, so possibly only because of Codex Cumanicus. And, as the entry in the Turkish dictionary points out, the sense "bastard" is not found in any Persian dictionaries. Redhouse's Ottoman Turkish dictionary, however, gives all three senses - "bastard", "sprout", and "complication" (and one derivation that sounds extra funny to me).

 

This is the most likely source of the Bulgarian meaning, where the "shoot" sense was preserved only dialectally. Whence the Persian entry in Codex Cumanicus still remains a mystery, one that Jane, Lisbon and the rest of the CBI team probably can't help with.

UPDATE: A colleague informs me that the Persian pič is derived from pičidan "to twist" which is also attested in Middle Persian as pēčīdan, see e.g. the following entry from MacKenzie's A Concise Pahlavi Dictionary (OUP 1971, p. 68).