Thursday, March 26, 2026

order

A few weeks back, an article titled "Interactions Among Morphology, Word Order, and Syntactic Directionality: Evidence from 55 Languages" (henceforth: Li & Liu 2025) came across my link aggregators. It checked a few boxes for me: it had "word order" and "directionality" in the title, it promised to examine the interaction between word order/directionality and morphology using quantitative methods, and it used data from Universal Dependencies (henceforth: UD) treebanks. There's three of my major research interests right there, so I was very much hooked, but also suspicious; such studies tend to go one of two ways and only one of them is good. It only took 8 pages for me to realize that this one would not fall into the good group. More specifically, it was this part that gave me pause (Li & Liu 2025: 8):

"The Afro‑Asiatic Semitic family exhibits distinct word order preferences. Arabic and Maltese strongly favor verb‑initial (VSO) patterns (about 70%), reflecting a consistent syntactic profile." 

The data underlying this observation comes from the Maltese UD treebank, i.e. MUDT, as Table 1 of the paper (reproduced partially below) confirms.

Language/Data Type     SVO(%)  SOV(%)  VSO(%)  VOS(%)  OVS(%)  OSV(%)
Arabic (PUD + NYUAD)   18.1    0       72.8    NA      3.9     0.2
Maltese (MUDT)         24.1    0       69.9    NA      4.1     1.9

Now I am somewhat familiar with MUDT, seeing as I manually annotated the whole damn thing and then used it to get my PhD. In fact, a significant portion of my dissertation was dedicated to doing exactly this kind of analysis - except I speak of "constituent order" - and my results were completely different. To compare, consider Fig. 7.5 and Tab. 7.5 from p. 190 (click to embiggen).

Even if we focus just on the dominant order - or linearization, as Li & Liu (2025: 4) call it in line with common practice - the difference is striking: Li & Liu 2025 have 24% for SVO and 70% for VSO. In my analysis, VSO is nearly non-existent: only 3 sentences show that order (see p. 190); the vast majority of sentences (93.86%) exhibit SVO order.

So what giveth? It is, of course, entirely possible that I was wrong, but surely I wasn't THAT wrong. Let's try to find out what happened and we might as well start with how Li & Liu got their numbers. And this is where we come across the first problem: they don't say. All we learn in the paper is that

"All metric computations and statistical analyses, including correlation tests between variables, were implemented using the Python programming language (version 3.11)." 
(Li & Liu 2025: 3)

and 

"Our Python‑based tool extracts these linearizations directly from dependency structures, allowing the analysis to accommodate varied syntactic patterns and reflect actual language use." 

(Li & Liu 2025: 4)

and finally

"All metrics and values reported in this study were computed independently using custom scripts developed by the authors."

(Li & Liu 2025: 12)

The code, however, is nowhere to be seen. Red flag number one.

OK, fine, ORANGE flag, because to be fair, my data retention practices are not perfect either. The data I used is available online along with the text of the dissertation, but the instance of Annis I used to run queries on the treebank has gone to the eternal filesystem beyond. Then again, the queries I used are fully available online as Appendix C, so once I get Annis running again - or you do - all is good and my work can be fully reproduced.

Except here's an issue: the treebank I used in my dissertation was UD version 1 and while it was current when I started annotating, by the time I finished, version 2.1 had come out.  Li & Liu (2025) write in 2025, by which time versions 2.15 and 2.16 were out. By that point, I had also updated MUDT to the current version, so we just need to find out which one they used... 

And here we have orange flag no. 2: they do not say. All we learn is that 

"The raw data used in this study are openly available from the Universal Dependencies (UD) treebanks at https://universaldependencies.org."

(Li & Liu 2025: 12)

At the beginning of their paper, Li & Liu inform us that

"To ensure consistency and cross‑linguistic comparability, all treebanks in this study employed the Universal Dependencies (UD) annotation framework [15]. "

(Li & Liu 2025: 3)

Maybe the reference behind [15] will tell us. Well, no, it does not, since it leads to J.A. Hawkins's Efficiency and Complexity in Grammars. Reference [17], which is supposed to point to a work by Xanthos and Gillis, points instead to De Marneffe et al. (2021), which is a paper on UD, so maybe stuff just got mixed up. De Marneffe et al. 2021, however, is a general description of the framework, so alas, no info on the version here either. Also, nobody noticed this mix-up? Orange flag no. 3.

Let us therefore assume that Li & Liu (2025) used the most recent version available to them, 2.16. As of this writing, the current version is 2.17, but I am reliably informed by the portly gentleman who brushes my teeth that there were no relevant changes between the versions, so let's go with 2.17. And since we are dealing with a single treebank, we don't have to write any custom code or even deal with Annis; the fastest way to access any UD treebank in 2026 is to use grew-match.

Y'all know how dependency grammars work: an utterance/sentence consists of tokens (≈ words) where each token is a dependent of another token, called the governor. There is one exception: in every utterance/sentence, there is a token - prototypically a verb - that does not depend on any other token; we call it the root. The relationship between the governor and a dependent is referred to as a dependency and it can be unlabeled or labeled. In UD, all dependencies are labeled according to a specific framework and as a shorthand, the dependencies are referred to by their label.
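For those who prefer to see this as a data structure, here is a minimal Python sketch of the idea. The `Token` class, the helper names and the toy English sentence are all invented for illustration; none of this comes from MUDT or from any UD tooling.

```python
from dataclasses import dataclass

@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # the word itself
    head: int     # position of the governor; 0 marks the root
    deprel: str   # dependency label, e.g. "nsubj" or "obj"

# Invented toy sentence, not taken from any treebank.
sentence = [
    Token(1, "Marija", 2, "nsubj"),
    Token(2, "reads", 0, "root"),
    Token(3, "books", 2, "obj"),
]

def find_root(tokens):
    """Return the one token that does not depend on any other token."""
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]

def dependents(tokens, governor):
    """Return all tokens whose governor is the given token."""
    return [t for t in tokens if t.head == governor.idx]

root = find_root(sentence)
labels = {t.deprel: t.form for t in dependents(sentence, root)}
print(root.form, labels)
```

Every token carries a pointer to its governor plus a label, and that is all a dependency tree is.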

grew-match is a graph-based search engine for treebanks which allows the user to search for all kinds of things. It works with nodes (= tokens) and edges (= dependencies). It is powerful, but the query language can be a bit tricky, so let's look at an example:  

pattern { V -[nsubj]-> S;
          V -[obj]-> O }

Here we have three nodes, V, S, and O. They are connected to each other as follows: S depends on V and the dependency label is nsubj (= nominal subject), O depends on V through the dependency obj (= object). In other words, this looks for clauses where we have a Subject, Verb (or any other predicate) and Object. We could further specify what kind of part-of-speech V should be, but in UD, if it has an object, it is a verb, so let's keep it at that.
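If you would rather not take grew-match's word for it, the same search can be approximated offline. The following is a rough Python sketch, assuming a treebank in the standard ten-column CoNLL-U format; the function names and the two toy sentences are made up for illustration, and this is in no way a reconstruction of what Li & Liu did.

```python
def read_conllu(text):
    """Yield sentences as lists of (id, form, head, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Keep plain tokens only; skip multiword ranges (1-2) and empty nodes (1.1).
            if cols[0].isdigit():
                sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence

def svo_clauses(sentence, subj="nsubj", obj="obj"):
    """Return (V, S, O) id triples where V governs S via subj and O via obj."""
    subjects = [(head, tid) for tid, _, head, rel in sentence if rel == subj]
    objects = [(head, tid) for tid, _, head, rel in sentence if rel == obj]
    return [(v, s, o) for v, s in subjects for v2, o in objects if v == v2]

def row(tid, form, head, deprel):
    """Build one CoNLL-U line, leaving the columns we do not need blank."""
    return "\t".join([str(tid), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])

# Two invented sentences: one with subject and object, one without an object.
SAMPLE = "\n".join([
    "# text = Marija reads books",
    row(1, "Marija", 2, "nsubj"), row(2, "reads", 0, "root"), row(3, "books", 2, "obj"),
    "",
    "# text = Rain falls",
    row(1, "Rain", 2, "nsubj"), row(2, "falls", 0, "root"),
])

hits = [svo_clauses(s) for s in read_conllu(SAMPLE)]
count = sum(len(h) for h in hits)
print(hits, count)
```

Like the grew-match pattern, this says nothing about the order of the three nodes; it only checks that they are connected by the right labels.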

Now if you ran the query like I just did, grew-match would report "395 occurrences". This means that there are 395 sentences total where we have a transitive verb with both an overt subject and an overt object. This is consistent with the findings in my dissertation (p. 190, see above), with one caveat: MUDT annotation contains one more dependency that I treat as a direct object for purposes of constituent order analysis, namely obl:arg. This covers prepositional objects and it exists as a separate category for compatibility purposes (don't get me started). If we include it in our query like so

pattern { V -[nsubj]-> S;
          V -[obj|obl:arg]-> O }

we will get 473 results which matches the data above perfectly (modulo fixes between v1 and v2). 

The queries we have used so far do not take into account the order of the constituents and so for example, the third hit is an OVS sentence (sentence id 02_02J01:10). To consider the order, we adjust the query as follows:

pattern { V -[nsubj]-> S;
          V -[obj|obl:arg]-> O;
          V << S;
          S << O }

The double "less than" sign specifies the order of nodes; in this case, V should come before S and S should come before O. This query will thus retrieve all VSO sentences in the treebank. Recall that according to Li & Liu (2025: 8), this should be 70% of all sentences with V, S and O; according to my analysis, we should get 3.
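Running six separate queries is one way to go through the permutations; another is to label each clause by the positions of its three nodes and count. Here is a small Python sketch of that idea. The function name and the clause positions are invented for illustration and are not real MUDT counts.

```python
from collections import Counter

def constituent_order(v, s, o):
    """Label a clause by the linear order of its V, S and O token positions."""
    ranked = sorted([(v, "V"), (s, "S"), (o, "O")])
    return "".join(label for _, label in ranked)

# Invented (V, S, O) token positions for four clauses.
clauses = [(2, 1, 3), (2, 1, 4), (1, 2, 3), (3, 4, 1)]

counts = Counter(constituent_order(*c) for c in clauses)
shares = {order: 100 * n / len(clauses) for order, n in counts.items()}
print(counts, shares)
```

Comparing positions this way corresponds to the `<<` constraint: it only asks which node comes first, not whether the nodes are adjacent.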

And lo and behold, 3 sentences is what we get. Go ahead, try all the permutations, it will more or less be the same as in my dissertation.

So, I repeat, what giveth? Well, we have already seen three orange flags that point to a certain degree of sloppiness on the authors' part, so let's... Hang on just a minute, what is the name of the journal this was published in? Entropy? What kind of a name is that for a ling...

"Entropy is an international and interdisciplinary peer-reviewed open access journal of entropy and information studies, published monthly online by MDPI." 

And now everything clicks into place and you will excuse me while I recover from the dizziness induced by the huge amount of red flags suddenly unfurled in front of my eyes.

For those of you not in the know, MDPI is a paper mill and predatory publisher. I will not dwell on the predatory part, just point out the following: every journal published by MDPI levies an article processing charge of 2600 CHF (2850 EUR).

The paper mill part is much more interesting. As of this writing, MDPI publishes 511 open access journals. This is on par with Oxford University Press, but much less than Springer, so that might not be suspicious. But let's take a closer look, for example at Entropy: it is the end of March 2026 and they have already published three issues of volume 27 containing 362 articles. That's four PEER REVIEWED articles a day, including holidays. One analysis of MDPI's production refers to "concerns about how peer review can be conducted effectively at this scale". That's putting it mildly.
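The back-of-the-envelope arithmetic behind "four a day" goes like this, as a short Python sketch. It assumes the count starts on January 1 and ends on the date of this post; both endpoints are my assumption, the article count is the one quoted above.

```python
from datetime import date

articles = 362                      # figure quoted above
start = date(2026, 1, 1)            # assumed start of the counting period
end = date(2026, 3, 26)             # date of this post
days = (end - start).days + 1       # count both endpoints
rate = articles / days
print(days, round(rate, 2))
```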

To answer my question above, what giveth is that a) the authors of the paper are incompetent and b) there was no peer review to catch their incompetence. No one at MDPI cares as long as they pocket their 2600 CHF.

Incidentally, the aforementioned analysis focuses on the proliferation of "special issues" which allows the publisher to expand the number of articles published and thus their revenue. It turns out that Entropy published a special issue titled "Complexity Characteristics of Natural Languages". This special issue was edited by the same people who appear as "Academic Editors" (what the hell is even that) for the bullshit word order paper: Stanisław Drożdż, Jarosław Kwapień and Tomasz Stanisz. None of them is a linguist, they are all physicists and work at the Institute of Nuclear Physics of the Polish Academy of Sciences. They appear to be interested in what they call "quantitative linguistics" but what would better be described as "applying mathematical methods to texts". And don't get me wrong, this is a legitimate field of research, but maybe the gentlemen in question should devote more time to studying linguistics proper, sparing us insights such as (emphasis mine)

"During their development languages have been gradually reshaping and adopting themselves under influence of other languages and dialects they have encountered. Especially English and Spanish experienced many of such encounters. Whether this made them as they are or they became global because they happen to possess some spontaneously acquired favorable intrinsic syntactic organization emerges an exciting issue for further interdisciplinary explorations."

or (emphasis mine)

"We found that although the global behaviour of words is described approximately by the Zipf-Mandelbrot law, the words from different parts of speech do not necessarily follow this global picture and one can show significant differences in functional dependence of frequency on rank between nouns, verbs and other types of words. This can be a manifestation of the existence of a non-trivial internal organization of language that cannot be reproduced in full detail by a simple power-law relation of the Zipf-Mandelbrot type."

Whether such a conclusion is worth the effort put into it, I will leave to the reader, accompanied by the appropriate XKCD comic (h/t a colleague who shall remain unnamed).

 Physicists

In any case, Messrs. Drożdż, Kwapień and Stanisz should be ashamed of their role in the predatory practices of MDPI in general and of this bullshit paper in particular.

Thursday, March 19, 2026

ová

I regularly read - God help me - Denník N, one of the major Slovak dailies. I realize I should have stopped years ago, considering the diminishing returns: for every interesting article or sharp analysis (almost always by a non-Slovak media outlet or expert), there are twenty incredibly stupid pieces, invariably authored by one of the intellectual smallholders that are employed by the newspaper or by one of their circle of equally dimwitted and/or morally corrupt friends. Doyens of the Slovak intellectual scene make idiotic assessments of Slovak history only to be corrected by actual historians, high-school teachers - who also engage (along with failed politicians and shitty translators) in the grift of "coaching critical thinking" - furnish us with their opinions on everything from appropriate mode of dress for chess players to how we have no records of people's personal lives from before 1800, and there are, of course, countless examples of the time-tested genre "here is how this thing happening to important people also affects my circle of friends". Add to that venerable intellectuals who produce incomprehensible drivel in defense of genocide and slightly less venerable - but equally lacking in coherence and sense - writers who went on a trip where they learned that you cannot become an expert on a country by going on a trip there and consider it absolutely necessary to share this insight while also painting a picture of said country in terms that racist British travel writers from two hundred years ago would consider too much, referring to the "ever-smiling Turkmens" and "the contrast between clay villages and mosaic-emblazoned palaces and mosques". In short, the entire newspaper is filled with the sort of loud faux-intellectual provincial mediocrity that is emblematic of Slovak society and I really should stop reading it, if only for the sake of my blood pressure.

But while I am, I have a few words on a recent piece on matters relating to language. In this piece, we are informed about the editorial board's decision to no longer add the suffix -ová to last names of non-Slovak (lit. "foreign", which sounds weird...) female persons. This process, referred to in Slovak as prechyľovanie (henceforth: modification), is described in the normative Orthographic Rules of Slovak (henceforth: PSP) and it is implied that it is mandatory: the final note in the chapter on modification (accompanying section 2.8) informs us that "foreign last names of known female artists" can be left in the original form; the assumption here is that this is the exception. The piece in question explains that the decision was taken and implemented a year ago and lays out the reasons for it, or, as they put it:

"Prechyľovanie či neprechyľovanie je pomerne zložitá téma. Preto náš korektorský tím pripravil na túto tému analýzu, ktorá pomenovala mnohé pravopisné a štylistické dilemy a pomohla nám zvážiť argumenty za a proti."

"Modification of feminine last names or lack of it is a somewhat difficult subject. This is why our proofreader team has prepared an analysis that pointed out many orthographic and stylistic dilemmas and helped us to weigh the pros and cons."

Let's ignore the shudder induced by the words "our proofreader team" and "analysis". Let us also ignore the fact that the rules for modification laid out in PSP are somewhat clear, at least for now, and that despite the name of the book, modification is not an orthographic issue, it is a morphological one. Instead, let us consider the arguments the authors present. The very first one is, shall we say, suspect:

"Vďaka nej sme si uvedomili napríklad aj to, že ženské priezviská sa neprechyľujú len v slovanských krajinách a že majú rôzne podoby, čo má vplyv na ich skloňovanie a rôzne pravopisné nuansy."

"The analysis helped us realize that, for example, feminine last names are modified in non-Slavic countries as well and that this modification comes in different forms which influences their declensions and has various orthographic nuances."

I mean, yes, that is certainly true. So?

"Bývalá litovská prezidentka Dalia Grybauskaitė má na konci priezviska príponu -itė, typickú pre ženský rod, pričom mužská podoba je Grybauskas. Takže po prechýlení podľa slovenských pravidiel dostaneme Daliu Grybauskasovú – čo je oproti pôvodnému priezvisku dosť veľký rozdiel.

Alebo taká bývalá islandská premiérka Katrín Jakobsdóttir. Na konci má príponu -dóttir, teda dcéra. Správne prechýlená do slovenčiny by mala byť Katrín Jakobssonová. Ktovie, či by tieto ženy svoje mená spoznali, ak by ich uvideli takto poslovenčené." 

"The last name of the former Lithuanian president Dalia Grybauskaitė features the suffix -itė typical for the feminine gender; the masculine form is Grybauskas. Modifying her last name according to Slovak rules, we get Dalia Grybauskasová, which is quite different from the original.

Or take the former prime minister of Iceland, Katrín Jakobsdóttir. Her last name features the suffix -dóttir meaning daughter. Her last name modified according to the rules of Slovak should then be Katrín Jakobssonová. Who knows if these women would recognize their names if they saw these Slovak forms." 

Well, that is certainly a line of reasoning. Let's take it from the top: the official rules as written in PSP do not lay out any general principles of modification. Sections 2.1-2.3 do discuss the morphophonology of the suffix -ová as combined with various roots, but only in terms of euphonics and orthography (hence the inclusion of this phenomenon in a guide on orthography), not in terms of morphological analysis. And so section 2.1 discusses what to do with masculine last names ending in a consonant, 2.2 addresses masculine names ending in the vowel a or o and 2.3 covers masculine last names ending in a vowel. Note 1 to 2.3 then addresses Romanian, Albanian and Turkish last names ending in u and Note 2 discusses English and French last names ending in e which is "silent" in English and indicates a specific pronunciation of the preceding consonant in French. Nothing on morphology, though, let alone on removing any suffixes. So whence this interpretation of the rules?

I have a guess: section 2.3 specifically says that the suffix is to be added to the "unaltered* masculine last name". Ergo, to execute such a derivation, we must first derive said "unaltered masculine last name". It does make sense from a certain standpoint. From a different one, it lays bare the inadequacy of PSP rules and so the ultimate futility of trying to come up with a hard and fast rule. In any case, I can see the logic behind the decision in these and similar cases and do not have a problem with it per se. I mean, the other option, to simply attach the suffix to the already feminine name and end up with Grybauskaitėová and Jakobsdóttirová, does strike me as weird. So yeah, to say that such names should not be modified makes sense, especially considering PSP rules already in force, specifically the aforementioned note to section 2.8 which covers such cases as Grace Kelly or Greta Garbo.

What I do have a problem with is the appeal to the Slovak modified feminine name being "different" and "unrecognizable". For one, well, yes, at the very least we are slapping an entire morpheme on the name, so yeah, it is gonna look different, even according to PSP rules (Sharon Stonová, anyone?). But also, why should we care? Even with unaltered masculine names spelled according to conventions of the language in question we add declension suffixes, just consider Shakespeare and let me count the w... you know what, I will leave the counting as an exercise to the reader. 

And speaking of bogus arguments, here is one:

"Do debaty o prechyľovaní teda vstupujú nielen pravopisné dilemy, ale aj to, aké predstavy sa nám s prechyľovaním spájajú. Vnímame ho ako zbytočný relikt patriarchálnej histórie? Ako funkčnú súčasť jazyka? Alebo ako oboje?" 

"The discussion on modification thus involves not only orthographic arguments, but also ideas associated with it. Do we perceive it as a useless relict of patriarchal history? A productive part of the language? Or both?" 

Now don't get me wrong, my position has always been "fuck the patriarchy and all its manifestations". But we are talking about non-Slovak names here, in the context of editorial decisions of a particular newspaper (or three, as it turns out), not a general language reform. The piece is titled (emphasis mine) "A year ago we stopped modifying the names of non-Slovak** women. Have you noticed?", after all.

The justifications for the decision continue and so do bogus arguments:

"Ak by sme žili vo svete ideálneho pravopisu, všetky Slovenky by mali prechýlené priezviská a všetky cudzinky by mali mená pripravené na prechýlenie." 

"If we lived in the world of ideal orthography, all Slovak women would have modified last names and all foreign*** women would have names ready for modification."

I, for one, would love to know more about this "ideal orthography". And also what this has to do with orthography, since this is - as noted above - a morphological issue. Ignoring this bit of nonsense, let's follow the argument:

"Aj napriek tomu, že od čias vrcholného stredoveku prevažovali prechýlené ženské priezviská v rôznej podobe a začiatkom 20. storočia sa ustálila prípona -ová, slovenčina zvládla Margitu Figuli aj Elenu Maróthy-Šoltésovú a určite zvládne aj Simonu Petrík a Miriam Lexmann. A taktiež novinárky Janu Shemesh, Vitaliu Bellu, Kristinu Böhmer a Martinu Koník."

"Despite the fact that modified feminine last names in various shapes have been the norm since the High Middle Ages and the early 20th century saw the codification of the suffix -ová, the Slovak language has managed to cope with Margita Figuli, as well as Elena Maróthy-Šoltésová and will certainly cope with Simona Petrík and Miriam Lexmann. And also with journalists Jana Shemesh, Vitalia Bella, Kristina Böhmer and Martina Koník."

That is an excellent point and yes, I agree, the Slovak language can deal. But did you notice that one of these names is not like the others? In case of Vitalia Bella (the executive editor of the newspaper), the last name is declined while the remaining unmodified last names are not, even though all appear as direct objects which requires the accusative. So the issue here is not only the lack of modification, but also the place of the unmodified last names in the morphosyntactic system of Slovak. To be specific, they all - with some exceptions, apparently - are indeclinable now. What are those exceptions and why? Why is the accusative "Vitaliu Bellu" (where, nota bene, the last name should be analyzed as M.SG.ACC if this is the unmodified form of the last name of her husband or father, so there go the antipatriarchal arguments) and not "Vitaliu Bella"? We are not told.

And to their credit, the authors of the piece are vaguely aware of the issues involved (italics indicate original Slovak in translation):

"Prečo sa teda mená cudziniek prechyľujú? Z viacerých dôvodov. Ľahšie ich vieme skloňovať. Ľahšie v texte rozoznáme, kedy hovoríme o prezidentovi Obamovi a kedy o prvej dáme Obamovej. Je zrozumiteľnejšie v titulku napísať, že Williamsová porazila Osakovú, než že Williams porazila Osaka."

"So why do we modify non-Slovak (last) names? For multiple reasons. They are easier to decline. It is easier to recognize in a text when a reference is made to president Obama and when to the First Lady Obamová. It is much easier to understand when we write Williamsová [F.NOM] porazila [defeated.PAST.F.SG] Osakovú [F.ACC] than Williams [INDCL] porazila [defeated.PAST.F.SG] Osaka [INDCL]."

OK, but then considering the flexible Slovak constituent order, if you get rid of the declension suffixes...

"Urobiť z textu zrozumiteľný celok môže byť bez prechyľovania ťažšie. Vety treba pozornejšie štylizovať, pomôcť si skloňovaním krstných mien či okolitých slov, prípadne použitím všeobecných podstatných mien. Williams zvíťazila, Osaka sa prepadla v rebríčku."

"It might be more difficult to turn the text into an intelligible unit without modification. The sentences need to be formulated more carefully, one can also decline first names or surrounding words or use generic nouns. Williams [INDCL] zvíťazila [won.PAST.F.SG], Osaka [INDCL] sa [REFL] prepadla [drop.PAST.F.SG] v [in] rebríčku [ranking.SG.LOC]."

But the meaning here is diff...

"Rozhodovanie o tom, či pri zahraničných ženských priezviskách zrušiť prechyľovanie, nebolo pre redakciu jednoduché." 

"The decision whether to stop modifying non-Slovak feminine last names was not simple for us on the editorial team." 

Yeah, sure, but ... 

"Dá sa to, len to niekedy chce trocha viac času a tiež šikovných ľudí, ktorí text napíšu, zeditujú a skorigujú."

"It [= dealing with stylistic issues resulting from the decision] can be done, it just needs a little more time sometimes and smart people who will write, edit and proofread the text."

OK, fine, but let me ask you this: what problem did you solve? Aside from the bogus argument regarding the patriarchy, what actual issue has been addressed here? We are assured that the editorial team contains people capable of dealing with the stylistic problems resulting from - not to put too fine a point on it - creating an entire class of indeclinable nouns. I have my doubts about that and in any case, you yourself admit it takes more time and effort, and for what? For dealing with Grybauskaitė and Jakobsdóttir? All that so you end up with clumsy and unnatural sounding headlines like this?

"Prečo ľudia vidia iné veci, keď sa pozerajú na rovnaké videá zabitia Renée Good a Alexa Prettiho" 

"Why do people see different things when they watch the same videos of the killing of Renée [INDCL] Good [INDCL] and Alex-a [Alex-M.GEN] Pretti-ho [Pretti-M.GEN]?"

How is that better, in what world is it better?

It is not. My theory is that the faux-intellectual provincial Slovak mind is simply incapable of dealing with variation. That is why they worship standard Slovak and its "rules" or "grammar", by which they mean orthography, because that is all they know of language. That is why they come up with this sort of shit.

"S rozhodovaním nám pomohlo, že naše spriatelené redakcie v týždenníku Respekt a v českom Deníku N zrušili prechyľovanie ešte skôr ako my a vyjadrili sa, že to funguje pomerne bez problémov. Každá redakcia však má nejaké svoje usmernenia či výnimky." 

"What helped with our decision was that our allied editorial departments in the weekly Respekt and the Czech Deník N did away with modification before we did and were of the opinion that it works more or less without problems. Each editorial office, however, has their own guidelines and exceptions."

So it's not just our intellectual smallholders, but also our neighboring intellectual smallholders. Splendid. But hang on a second, exceptions? Why?

"My sme si povedali, že s prechyľovaním skončíme, ale Merkelovej vládu si ponecháme. To znamená, že príponu -ová uvidíte menej často, a keď už, bude to najmä pri privlastňovaní."

"We decided to do away with modification, but will keep Merkel-ovej [Merkel-ADJ.F.SG.ACC] vládu [government.SG.ACC]. This means that you will see the suffix -ová less often and when you do, it will mostly be in the context of possession."

And here we have it. These people introduced a hard rule that will complicate their work, but then will have exceptions because the new rule complicates their work too much. I guess going by PSP rules and adding a completely reasonable exception for Lithuanian and Icelandic (and Latvian etc.) last names while applying good judgment and maybe allowing for some variation is way too complicated for them.

I swear to God, I really should stop reading this fucking rag. 

---------------- 

* The word used here is "celé" = "whole, entire". I believe it is to be interpreted in the context of the discussion of names ending in -ec, -ek and -ok (section 1.2). These suffixes are contracted in cases other than the nominative and so is the root used in modification, e.g. Strýček > Strýčková.

** The actual phrasing here is "women from abroad" where the Slovak word I translated as "abroad" is "zahraničie", lit. "beyond border land".

*** This just sounds... more racist than it is.  

Thursday, March 05, 2026

kangaroo

We all know the story of how the kangaroo got its name: one of the English sailors (or even Cook himself) asked a native what that strange creature was called and the native replied "gangarru" (or some such) which was then adopted as the name for the creature and which later turned out to mean "I don't understand". We all know that this story is, of course, a myth.

So earlier today I was reviewing the data for the publication of a paper Stefano and I presented at the 2024 conference of the Association Internationale de Dialectologie Arabe. The paper is a pilot study of sorts, attempting a phylogenetic analysis of Arabic varieties. The idea - adopted from biology - is to treat the varieties as taxa and selected linguistic features as nucleotide sequences or some such. Then you just annotate the data, plug it into one of the standard applications and presto! you got yourself a nice visualization of the relationship between the varieties, maybe even some Bayesian modelling of the same. It has been done before, e.g. for creole languages, so it makes perfect sense to try to do it for Arabic. Doubly so if there has recently been a survey of linguistic features of Arabic varieties and the authors have provided you with the raw data. A bit of Python jiggery-pokery and William's your mother's male sibling.

Famous last words.

I do not mean to disparage Manfred Woidich and Peter Behnstedt (may he rest in peace) in any way, their four-volume Wortatlas der arabischen Dialekte is a monumental achievement in Arabic dialectology. Plus let's face it, data management is not easy under the best of conditions, let alone with linguistic data of this type. It is true though that even with the raw data underlying the Wortatlas at one's fingertips, wrangling the data into shape is a complicated endeavor. For one, there were issues of technical nature, where all the data were stored in one of them God-forsaken ante-diluvian formats. The export worked, but since the Arabic text was stored in Odin-only-knows what encoding, the thousands of entries painstakingly recorded by Behnstedt are now lost to the ___???? hell, to say nothing of their attestations consisting of dead links to long-defunct internet fora painstakingly collected and annotated by Behnstedt. It is a giant loss and I hope the data is recoverable.

Then there's the nature of the data which, to be fair, reflects the practices of modern Arabic dialectology: the field is simultaneously concerned with both the description of the current state of the variety in question AND its history and relationship to other varieties; as a result, it often fails to do both properly. Take the concept 'tomorrow' (item 460 in Wortatlas IV) where we learn that the corresponding words in modern varieties fall into three groups based on their root/etymology: root BKR (A), word ġudwah 'early part of the day' (B) and aṣ-ṣubḥ '(early) morning' (C). We then get lists of which derivations occur where, but without any indication as to what is the prevalent form. And so for example for Antiochia, the data set has both bikra (A) and ġadik (B). Which is it, which do I pick as THE form? And this is a simple example, it gets a whole lot more difficult with other lexical items.

And speaking of Antiochia, what do we even take as taxa? All of you have surely noticed that while in the description of the entire endeavor I referred to Arabic varieties, in the previous paragraph, I discussed the practices of Arabic dialectology. As I once (before nuking my Twitter account) tried to explain to an idiot, in Arabic linguistics, we use the term 'dialect' for historical, political and sociolinguistic reasons for what in Romance, Slavic or Germanic linguistics we would call separate languages belonging to the Romance (Slavic, Germanic...) branch of Indo-European. So yes, it would be more appropriate to speak of Egyptian, Tunisian or Yemeni, except... Well, all of them co-exist in a special relationship with Modern Standard Arabic and, more importantly, there is no one/dominant/prevalent/okfineiwillsayit standard variety of, say, Moroccan or Iraqi in the same way that there is a dominant/prevalent/cmonpeopledontmakemesayitagainFINE standard German, Polish or French. And so in, say, Egypt, Cairo Arabic may be what people think of when they think of Egyptian, but there are other subvarieties of the Egyptian variety which are just as important and who is to say which is THE Egyptian, surely not the assholes in Cairo! So yes, Lebanese, why not, but then not just one.

Which brings me to the actual question: how do you divide the varieties? The prototypical way to do it in any dialectology is to go by location, except those range from entire regions through towns and villages, all the way to neighborhoods. But this is Arabic dialectology, so we have tribal varieties, confessional varieties and the specter of bedouin varieties hanging over the whole thing. And on top of that, there is the fact that despite literally centuries of Arabic dialectology, there are only a handful of Arabic varieties that are actually well described. Sure, you can throw a metaphorical stone and metaphorically hit a book on this or that Arabic variety that bills itself as a grammar. Upon closer inspection, however, you will find that they contain a detailed description of phonology and morphology, but little to nothing on syntax. And then there's of course the aforementioned issue with the raw data underlying Wortatlas where the "place" (roughly our "taxon") column sometimes has the present-day country name like "Sudan" (or "Usbekistna"), sometimes a region like "Ḥaḍramawt" or "Sinai" or even "Trucial Coast", sometimes a country with a placename, e.g. "Sudan/Šukriyya".

So in the end, we made a decision to focus on

  1. A selection of function words from Wortatlas IV (= features), 35 in total. These contain such items like 'who', 'what', 'when', 'never', 'yes' and the existential predicate, each of them bearing a three-digit code that starts with 4.
  2. 63 varieties where we could get the data for at least 15 of the features (= taxa).

The taxa were arranged by - mostly - country: AF = Afghanistan, AL = Algeria, BH = Bahrain, CH = Chad, EG = Egypt, IL = Israel, IN = Iran, IR = Iraq, JO = Jordan, KN = Kinubi (that's the 'mostly' part), KU = Kuwait, LB = Lebanon, LY = Libya, MK = Morocco, MT = Malta, NG = Nigeria, OM = Oman, PL = Palestine, SA = Saudi Arabia, SU = Sudan, SY = Syria, TR = Turkey, TU = Tunisia, UZ = Uzbekistan and YE = Yemen. Where data was available for a particular village/region/tribe, the taxon was named XX-<village/region/tribe>; where we only had data for the country, we labelled it XX-general. Then for each taxon, we gathered all the features, decided which one to pick as the representative whenever necessary (always going with the more common option) and to each unique feature, we assigned a code. Sounds simple, but oh boy, was it not.
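For the programmatically inclined, the bookkeeping described above can be sketched roughly as follows. This is a minimal illustration and not the actual pipeline: the record layout, the function names (`taxon_label`, `code_features`) and the placeholder forms are all assumptions made up for this example, and the choice of the representative form is modelled simply as "whatever was put first in the list by hand".

```python
from collections import defaultdict

MIN_FEATURES = 15  # threshold from the text: at least 15 features per taxon


def taxon_label(country_code, place=None):
    """Build a label of the shape XX-<village/region/tribe> or XX-general."""
    return f"{country_code}-{place if place else 'general'}"


def code_features(records, min_features=MIN_FEATURES):
    """Turn raw records into a taxon x feature matrix of integer codes.

    records: iterable of (country_code, place_or_None, feature_id, forms),
    where `forms` is a list with the hand-picked representative form first
    (hypothetical layout, assumed for this sketch).
    """
    chosen = defaultdict(dict)  # taxon -> {feature_id: representative form}
    for country, place, feature, forms in records:
        if forms:  # skip features with no attested form
            chosen[taxon_label(country, place)][feature] = forms[0]

    # Keep only taxa with enough attested features.
    kept = {t: feats for t, feats in chosen.items() if len(feats) >= min_features}

    # Assign one integer code per distinct form, separately for each feature.
    codes = defaultdict(dict)  # feature_id -> {form: code}
    matrix = {}
    for taxon in sorted(kept):
        row = {}
        for feature, form in sorted(kept[taxon].items()):
            table = codes[feature]
            row[feature] = table.setdefault(form, len(table))
        matrix[taxon] = row
    return matrix, dict(codes)


# Toy input: only the Antiochia forms come from the text, the rest is placeholder.
toy_records = [
    ("TR", "Antiochia", "460", ["bikra", "ġadik"]),
    ("EG", None, "460", ["placeholder_form"]),
    ("LB", None, "460", ["placeholder_form"]),
]
toy_matrix, toy_codes = code_features(toy_records, min_features=1)
print(toy_matrix)
```

The hard part, of course, is not in the code at all: it is deciding what goes first in that `forms` list.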

And so earlier today I was reviewing that data, now imported into EDICTOR, when I came across this entry (click to embiggen):

 

'Tis weird, I thought, for item 473 is the existential predicate which is in most Arabic varieties derived from the preposition 'in'. The taxon here is Kinubi, a creolized variety which is admittedly very different from all the others, but surely not that different. So I checked the data and sure enough, there it was: 

es gibt
dé kalám táki má
Ki-Nubi Ki-Nubi
QUEST

The "QUEST" note means this piece of data came from a questionnaire. The lack of spaces in the EDICTOR entry is simply the result of normalization - in some entries, multiple options were given, so that is fine. Or, well, not actually, because normally the data would not have this many options/spaces and anyway this looks like a full sentence and besides, that last word looks like negation... So I went in and checked a bunch of sources and it turns out it is a sentence:

dé kalám tá-ki má

this word/thing POSS-2SG NEG

'this is not your word/thing'

I have no idea how this ended up in the data considering the existential predicate in Kinubi is, unsurprisingly, . Perhaps this particular sequence of words has an idiomatic meaning in Kinubi, something like 'we don't say that'. Perhaps someone wasn't paying attention. In any case, this, ladies and gentlemen, is how we almost ended up with a version of kangaroo in our phylogenetic analysis. You know what, maybe I should keep it in, as a trap street of sorts, to keep reviewers on their toes.

In case you're wondering what we came up with, below is a preview consisting of only 28 taxa (click to embiggen). The main take-away is that by Jove, it works, or at least that the basic NeighborNet algorithm matches what we would expect. For example, there are clear groupings of North African (6-7 o'clock), Levantine (9-10 o'clock) and Iraqi and Gulf (3 o'clock) varieties. Kinubi and Juba, another creolized variety, go together (and are not that far from Baggara), as do Egyptian and Sudanese varieties. Maltese, of course, stands on its own.
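NeighborNet is a distance-based method, so somewhere along the way the coded features have to be turned into pairwise distances between taxa. The sketch below shows one common way of doing that, a normalized Hamming distance that ignores features missing in either taxon. To be clear, this is an assumption for illustration only: the text does not say which distance measure was actually used, and the taxon names and codes in the toy matrix are invented.

```python
from itertools import combinations


def hamming_distance(row_a, row_b):
    """Share of differing codes among features attested in both taxa.

    Returns None when the two taxa have no feature in common.
    """
    shared = [feature for feature in row_a if feature in row_b]
    if not shared:
        return None
    differing = sum(1 for feature in shared if row_a[feature] != row_b[feature])
    return differing / len(shared)


def distance_matrix(matrix):
    """Symmetric pairwise distances for a {taxon: {feature: code}} mapping."""
    taxa = sorted(matrix)
    dist = {(taxon, taxon): 0.0 for taxon in taxa}
    for a, b in combinations(taxa, 2):
        d = hamming_distance(matrix[a], matrix[b])
        dist[(a, b)] = dist[(b, a)] = d
    return dist


# Invented toy data: three taxa, three features, with gaps.
toy = {
    "XX-one": {"460": 0, "473": 1, "401": 2},
    "XX-two": {"460": 0, "473": 2},
    "XX-three": {"401": 2},
}
toy_dist = distance_matrix(toy)
print(toy_dist[("XX-one", "XX-two")])
```

Note how much weight the gaps carry: two taxa that share only one attested feature end up either identical or maximally different, which is one reason for the 15-feature minimum.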

For more, watch this space. But now, back to data wrangling. 

 

Thursday, February 26, 2026

melhor

As best I can tell, on or around May 12th 2011, the Brazilian government agency Ação Educativa released for general use a new textbook in their series Viver, Aprender ("Live, Learn"). Numbered vol. 6 and authored by Heloíse Ramos, the textbook was published under the title Por uma vida melhor ("For a better life"), paid for by the Brazilian Ministry of Education (MEC) and distributed to both children and adult students all over the Lusofonia. It consists of 6 units which cover various topics related to (in order) Portuguese language, English, art and literature, history, geography, natural sciences and math. The first chapter of the first unit, sensibly titled Escrever é diferente de falar ("Writing is different from speaking"), addresses a number of language-related issues, from sociolinguistics (the difference between the spoken and the written norm), through orthography and phonology (stress), all the way to syntax. When the contents of the chapter - particularly this last part, a short section on agreement - became generally known, merda hit the ventilador.

"A book used by the Ministry of Education teaches students to speak incorrectly," [moved link] bellowed one headline. Another decried "the pedagogy of ignorance." "Brazil decides to criminalize those who speak correctly and want to teach others to do so as well" [moved link] [1], warned an op-ed by the former president of Brazil José Sarney. "It's a crime, A CRIME, to preserve incorrect Portuguese" [2], insisted senator (and former minister of education) Cristóvam Buarque. And Janice Ascari, Regional Attorney General, accused (albeit only on her blog) all responsible for distributing the book of "committing a crime against our youth." [3] And with such leaders, you can very well imagine what the rank-and-file members of the quickly assembled posse of self-appointed protectors of the Portuguese language had to say about Por uma vida melhor and its author.

By now you're wondering what were the heinous crimes and unspeakable evils the book aimed to corrupt the Portuguese-speaking youth with. For the original version of the first chapter, try here (pdf), the offending passages on agreement can be found on p. 14-16. For those of you who are not yet fluent in Portuguese (seriously, what are you doing with your lives), I have prepared an English translation of the section on agreement (below and here). As always, forgive the poor quality and disregard some of the terminological choices. Predictably, I had some difficulty with the terms norma/variedade culta and norma/variedade popular and in the end, I chose to translate them as "standard Portuguese" and "vernacular" respectively.


 

In summary, we have learned that:
1. There are (at least) two varieties of Brazilian Portuguese - standard Brazilian Portuguese and  vernacular Brazilian Portuguese, and
2. one of the differences between them is how they handle agreement.
3. It is ok to use either variety, as long as it's appropriate for the occasion.
4. There are people who will judge you based on how you speak.

That the first is true should be evident to anyone with even a passing familiarity with Brazilian Portuguese. As for the second, Azevedo's Portuguese: A Linguistic Introduction (CUP 2005, p. 226-227) sums up the situation as follows (emphasis mine):

7.3.2.1 Non-agreement in the noun phrase
Standard nominal agreement (4.1.1) requires pluralization of adjectives and determiners accompanying a plural noun. In the vernacular, however, pluralization is more erratic; in the extreme case, the plural marker is moved to the left-most determiner and the noun and other accompanying formants remain in the singular ...
Although lack of agreement is strongly condemned by prescriptive grammars, examples from educated speakers ... show that application of the pluralization rule tends to vary according to the level of formality...

7.3.2.2 Non-agreement in the verb phrase
Standard verbal agreement (4.1.1) requires a conjugated verb to match its subject in person and number. Non-agreement in V(ernacular)B(razilian)P(ortuguese) is related to the reduction of verb paradigms to three, two or even a single form ...
Although cooccurrence of verbal non-agreement and nominal non-agreement is strongly condemned by prescriptive grammars, it occurs in the colloquial speech of educated informants ...

Even those critics who insisted - with almost superhuman inability to perceive irony - that "There's only one Portuguese language" [wayback link] [4] are very well aware that there are differences, even profound ones, in the way people use Portuguese in Brazil. In fact, few of the enraged voices denied that those who say "os livro" speak a different Portuguese. It's just that they don't call it "Portuguese without agreement in number" or "non-standard Portuguese", they call it "incorrect Portuguese" or (like the irony-proof superhero cited above) refer to the "butchering/murdering" of Portuguese [5]. We all know this song; the chorus singing it - and the speed with which they picked it up - does nothing but prove the validity of the third and fourth lessons drawn from the passage.

To be fair, some of the criticism of Por uma vida melhor raised a different issue, a more legitimate one, that of language and social stratification. The introduction to the first chapter addresses this directly:
Contudo, é importante saber o seguinte: as duas variantes são eficientes como meios de comunicação. A classe dominante utiliza a norma culta principalmente por ter maior acesso à escolaridade e por seu uso ser um sinal de prestígio. Nesse sentido, é comum que se atribua um preconceito social em relação à variante popular, usada pela maioria dos brasileiros. Esse preconceito não é de razão linguística, mas social. Por isso, um falante deve dominar as diversas variantes porque cada uma tem seu lugar na comunicação cotidiana.
It is, however, important to note the following: both varieties are (equally) efficient as modes of communication. The dominant class uses standard Portuguese primarily because it has greater access to education and because its use is a sign of prestige. In this respect, it is common to approach the vernacular with a certain social prejudice, even though the vernacular is used by the majority of Brazilians. This prejudice has nothing to do with linguistics and everything to do with social stratification. Consequently, a speaker needs to be in command of both varieties since each has its place in everyday communication.


"The dominant class" naturally objected to being characterized as such. The venerable Brazilian linguist Evanildo Bechara in an interview with the Brazilian magazine Veja titled "Em defesa da gramática" ("In defense of grammar") decried the use of "sociolinguistic theories outside of the confines of academia" [6]. We have all encountered this type of thinking about the relationship between academia and the public, especially recently, but it is still shocking to hear an academic say that out loud. Bechara also described the observation that the standard language is a tool of domination used by the elites as "political orthodoxy" and "an obstacle for the country" [7]; make of that what you will. The aforementioned former senator Cristóvam Buarque insisted that people like Heloíse Ramos who point out the differences in how various groups of people speak AND say that it is ok (depending on the situation), actually create two Portugueses: "the Portuguese of condos and shopping malls" and "the Portuguese of the streets and the fields." [8] This biportuguesism, concludes Buarque, strengthens the Brazilian apartheid [9]. You know this song, too. Mr. Buarque seems to be - or have been, he is old and rich now - a fellow leftist, but that does not matter. The song he is singing goes like this: "there have never been any divisions / until you started talking about them." And we know those who sing it and why.

All this happened 15 years ago and the last I checked, Ação Educativa had assembled a file summarizing the debate. I have not been following the developments; I don't know what happened to Por uma vida melhor or what and how Heloíse Ramos is doing. There is a Novo Viver, Aprender, with a new chapter dedicated to language and digital literacy, which sounds great. The entire story remains a striking example of the kind of public response you get when conservatism, classism, power and ignorance of all matters language clash, which is why I document it here for posterity.

A coda: while fixing all the broken links, I came across this article from 2019. It is titled "All Portuguese spoken in Brazil is correct" and the title is a quote by Marcos Bagno, a bona fide linguist. His Wikipedia page includes a reference to Por uma vida melhor in the context of his work on linguistic discrimination. It also turns out that he took part in a televised debate on the subject mentioned in the file and that he is the author of the excellent Gramática de bolso do português brasileiro I picked up in Paris in the Librairie Portugaise et Brésilienne (the one next to Emily's apartment) a while back. This grammar contains such wonderful sections as "Orthography is not a part of language" and "Lexicogrammar" which convincingly argues against the dumb idea of syntax as a separate entity from the lexicon. So maybe all this ado about Por uma vida melhor was just the last pangs of the old way of understanding language and things are looking up for Brazilian Portuguese.


Notes:
[1] "... o Brasil resolve criminalizar quem fala corretamente e quer ensinar a que os outros também o façam."
[2] The video seems to be privated, alas.
[3] "Vocês estão cometendo um crime contra os nossos jovens ..."
[4] "Só existe um português, que é o certo".
[5] "Tive muitos ... que assassinam a língua portuguesa cotidianamente." = "There were many ... who butchered/murdered the Portuguese language on a daily basis." 
[6] "As teorias da sociolinguística jamais deveriam ter deixado as fronteiras da academia". 
[7] "Dizer que a língua culta é um instrumento de dominação de elites é uma ortodoxia política e um obstáculo para o país". 
[8] "Português dos condomínios e dos shoppings e o Português das ruas e dos campos." Italics in the original.
[9] "Permitir duas línguas é fortalecer o apartheid brasileiro." = "To allow two languages is to strengthen the Brazilian apartheid." Italics in the original. 

Thursday, February 19, 2026

warda

I would like to correct one of my previous statements: it is not entirely the case that I only like Columbo; I also like other things, for example Maltese and just about everything Umberto Eco has ever written. And so when I found out that a Maltese translation of The Name of the Rose had been published, all I could say was tace et cape pecuniam meam. A few weeks back I had the opportunity to escape the Central European cold and go to Malta. Naturally, one of my priorities was to obtain a copy of Isem il-Warda. And now that I hold it in my hands (after having catalogued it and provided it with a protective cover), I am ... not really confident in the quality of the translation.

The first signs are right there on the cover, more specifically the back of it which contains a brief bio of the author. It informs us that 

Umberto Eco (1932-2016) twieled ġewwa Alessandria fil-Piemonte.
Umberto Eco (1932-2016) was born in Alessandria in Piedmont. 

I am of course not a native speaker of Maltese, but over the last sev... twel... ohmygodreally twenty-two years of my engagement with the language, I have developed a good feeling for it, and this phrase strikes me as strange. More specifically, it is the use of the preposition that is strange. As Stolz et al. (2017: 457) point out, Maltese exhibits toponymic zero marking, which is just a fancy way of saying that if you want to say something happened at a named place, you typically do not use a preposition. For example:

Dun Joe Caruana twieled il-Mellieħa nhar it-2 ta' Awissu, 1960.

Dun Joe Caruana was born in Mellieħa on the 2nd of August, 1960. 

That this sentence also contains an adverbial of time without a preposition is just a happy coincidence I will look into later. For now, note that this being a human language, there is at least some variation, and so this is also perfectly good Maltese:

Patri Serafin twieled fix-Xewkija fit-23 ta' Awwissu tas-sena 1932.

Father Serafin was born in Xewkija on the 23rd of August of the year 1932.

There is a third option available, the preposition ġo also meaning 'in'. And it is a well-established option, as evident from the fact that it is featured in No. 69 from Ilg and Stumme's Maltesische Volkslieder (Leipzig 1909), p. 27. I am reproducing the verse in question in modern standard orthography.

ara x'ġara ġo Ħal-Qormi

look what happened in Ħal-Qormi 

Interestingly, ġo does not seem to be used with the verb twieled (or its feminine form twieldet). When I ran a corpus search on the two verb forms and extracted 300 random examples, only three options cropped up:

Preposition Count
ø 37
f' 42
ġewwa 3

This low frequency of ġewwa + NOUN_PROP supports my feeling and to check, I went to the local digital watercooler and asked native speakers. The vast majority of them shared my suspicion of it, some describing it as an Ingliżata, i.e. a calque from English (with a hint of negative sentiment conveyed by the suffix -ata). Many had never seen it and cried apage satanas itlaq ja xitan; some have pointed out that this is a feature of (presumably bad) journalistic style - after all, the preposition actually means 'inside (of)' a 3-dimensional enclosed object. Corpus data partially bear these observations out: even if we just consider the two prepositions and the two most frequent toponyms - which, unsurprisingly, turn out to be the Maltese names for Malta and Gozo - ġewwa is by far the minority option:

Toponym f' ġewwa
Malta 105,214 700
Għawdex 43,065 1,273
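To put a number on "by far the minority option", the share of ġewwa for each toponym can be worked out from the four counts in the table above. The snippet is just that arithmetic spelled out; the names `counts` and `share` are made up for this illustration.

```python
# Corpus hits per toponym and preposition, copied from the table above.
counts = {
    "Malta": {"f'": 105214, "ġewwa": 700},
    "Għawdex": {"f'": 43065, "ġewwa": 1273},
}


def share(counts_for_toponym, preposition):
    """Proportion of hits for one preposition out of all hits for a toponym."""
    total = sum(counts_for_toponym.values())
    return counts_for_toponym[preposition] / total


for toponym, by_preposition in counts.items():
    print(f"{toponym}: ġewwa = {share(by_preposition, 'ġewwa'):.2%}")
```

That comes out at roughly 0.7% for Malta and 2.9% for Għawdex, so ġewwa is rare with both, though noticeably less rare with Gozo.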

So far so good. The text type (genre) analysis of the use of ġewwa with Malta, however, identifies a different culprit for this Ingliżata. As a shock to no one, it's the politicians who are responsible for this crime against the Maltese language.

Rank Text type Absolute frequency %
1 parliament debates 548 78.06%
2 newspaper 147 20.94%
3 non-fiction 6 0.85%
4 fiction 1 0.14%

Excellent, that all makes sense, so it's bad Maltese spread by politicians. And as with all such things, the origin of this abomination is in the influence of English where the function of the English 'in' was calqued onto the Maltese ġewwa. Done, dusted, all explained. Except...

You see, the corpus I have been using so far is one designed to cover the language of the first two decades of the 21st century (plus or minus). As such, it does not contain older works of literature, such as those written by Ġużè Muscat Azzopardi and Anton Manwel Caruana, where especially the latter is noted for his purism (i.e. exclusion of words of non-Arabic origin). I do have a corpus that includes these works and a few clicks and key presses later, I can confirm that ġewwa is used with toponyms in works from the late 19th century as well. Like this one from Muscat Azzopardi's 1881 Viku Mason:

Wara jumejn, kien magħluq ġewwa l-Imdina...

After two days, he was locked in Mdina...

Or this one from his 1909 Nazju Ellul (where the term il-Belt 'the city' refers to the location that only laws and tourist guides call 'Valletta').

Imma ġewwa l-Belt ġriet ix-xniegħa bejn in-nies tagħna...

But in Valletta the rumour spread among our people...

Recall that the original meaning of ġewwa is 'inside (of)', i.e. within a 3-dimensional enclosed object; in fact, ġewwa also doubles as an adverb with that very meaning. The prototypical noun to be used with ġewwa is 'house', 'building' or 'school'. And its use makes perfect sense in both examples above when you consider the physical nature of the locations: Mdina is a walled city located on a hill, while Valletta is on a peninsula with a single point of entry. Both can thus be viewed as 3-dimensional enclosed objects.

The same is not the case for the following example from Caruana's 1889 Ineż Farruġ. Here ġewwa is used with a name of a locality that is not surrounded by walls or the sea:

... kien ġej ma' missieru minn ġewwa r-Rabat...

... he was coming with his father from inside of Rabat...

And of course, in this case, we are not dealing with location, but rather movement from. The use of ġewwa is still at the very least redundant - minn itself would suffice - but maybe it serves to indicate that the character came from the center of the village and not, say, from some farmhouse on its outskirts.

Be that as it may, I am now much less confident that the use of ġewwa with toponyms can solely be blamed on bad Maltese spoken by politicians or the influence of English. The current use we see might very well be just the extension of use that was first limited to specific contexts (as with Rabat) or even specific locations (as with Valletta or Mdina). So maybe my scepticism regarding the quality of the translation of Eco's The Name of the Rose into Maltese is misplaced.

But then I opened the book to the first page. As I'm sure you remember, it begins with the author's description of how he came across the absolutely 100% totes very real manuscript, naturally, that he then somehow lost and now translates for us from his notes and memory - incidentally, a very popular trope in modern literature that turns out to have a long history. This narrative is anchored by two dates: the Warsaw Pact invasion of Czechoslovakia and the date of the book in question. This is what we read in the Maltese translation:

Fis-16 t'Awwissu 1969 xi ħadd għaddieli ktieb ta' awtur jismu abate Vallet, Le manuscrit de Dom Adson de Melk, traduit en français d'après l'édition de Dom J. Mabillon (Aux Presses de l'Abbaye de la Source, Paris 1982).

The Prague Spring and the subsequent invasion took place in 1968. Eco's book translated here came out in 1980.

Here is how Weaver's English translation renders this passage (emphasis mine):

ON AUGUST 16, 1968, I WAS HANDED A BOOK WRITTEN by a certain Abbé Vallet, Le Manuscrit de Dom Adson de Melk, traduit en français d’après l’édition de Dom J. Mabillon (Aux Presses de l’Abbaye de la Source, Paris, 1842). 

Considering that this is the first page, this does not bode well...

Thursday, February 12, 2026

ishoyahb

In the history of native Syriac linguistic tradition [1] Išoʕyahḇ Bar Malkōn (d. early 13th century) is the odd man out. It is not that he is unknown or forgotten: his grammatical works are preserved in a not insignificant number of manuscript copies and his name is listed with other grammarians in overviews of Syriac literature compiled by modern scholars, as well as his contemporaries. Of the latter, the testimony of ʕAbdīšōʕ Bar Brīḵā's (d. 1318) Catalogue of Books is particularly telling: where Eliya of Ṭirhan (d. 1049) and Yōḥannan Bar Zoʕbī (d. 13th century) are described as having composed grammars or grammatical treatises, of Išoʕyahḇ Bar Malkōn and his grammatical works we only learn the following:

ܡܳܪܝ ܝܶܫܘܽܥܰܝܲܗܒ ܒܰܪ ܡܰܠܟܳܘܢ ܕܰܨܘܒܳܐ ܐܝܺܬ ܠܶܗ ܫ̈ܘܽܐܳܠܐ ܓܪܰܡܡܰܛܝܺܩܳܝܶܐ

"Mār Išoʕyahḇ bar Malkōn of Ṣōḇā [Nisibis]: he has some grammatical questions..."

Whether this refers to a specific genre, is meant to be read generally or anything else, that's it as far as grammar is concerned. This lack of specificity with regard to Bar Malkōn’s work as a grammarian is also typical for modern sources. When consulting one, the reader typically learns no more than that he authored at least one treatise on points and one grammar (both unedited) [2], and that in his grammatical analysis, he followed the Arabic model [3]. One prominent example is Baumstark who describes Bar Malkōn’s grammar as “sachlich ganz die Methode der arabischen Grammatik befolgend” (“in terms of content, it entirely follows the methodology of Arabic grammar”) [4]. Over time, this simple observation - repeated uncritically - morphed into a judgment and finally into a condemnation: Talmon notes of Išoʕyahḇ bar Malkōn – and his contemporaries (or fellow travelers) like Yōḥannan bar Zoʕbī and Eliya of Ṭirhan – that they “exhibit either a servile attitude to Arabic grammar or poor coverage of grammatical issues.” [5]

Talmon's "poor coverage" remark is particularly silly. For one, the comparison made here is to Jacob of Edessa's grammar of Syriac which is notorious for - not to put too fine a point on it - BEING ALMOST ENTIRELY LOST. Secondly, "poor coverage" is a relative term, even in this day and age, doubly so in the 13th century. But most importantly, none of Išoʕyahḇ Bar Malkōn's works have been edited or analyzed in any detail, so there is simply no way for Talmon to know.

In fact, that Talmon's assessment of Bar Malkōn (and, by extension, that of those whose judgment he relies on) is wholly wrong can be gleaned from even the most cursory of interactions with the latter's grammatical works. This applies especially to Bar Malkōn's magnum opus, a grammar of Syriac titled Ktābā d-manhrānūṯā ba-gramaṭīqī sūryāytā/Kitāb al-ʔīḍāḥ fī naḥw as-suryānī (“Book of elucidation in Syriac grammar”, henceforth: Kitāb al-ʔīḍāḥ), extant in at least four manuscripts: 

  1. Paris BnF Syr. 262 (1v-112r; 16th century) 
  2. Paris BnF Syr. 370 (2r-96r; 1569) (olim Seert 101)
  3. Berlin SBB Ms. or. quart. 1050 (2v-106v; 17th century)
  4. Florence Laur. Or. 419 (1r-96r; 1589) 

Four notes on this list:

Firstly, Stadel's entry on Bar Malkōn in his recent edition of bar Brīḵā's Catalogue (Stadel 2025: 213) lists the Berlin manuscript as located in Tübingen (as does Van Rompay). This is consistent with Assfalg's catalogue, but not with the online catalogue of the Tübingen collections (which, however, contains a work called Bülbüliye, huh). I am reliably informed the manuscript is indeed in Berlin at the Stabi; in fact, this is where I consulted it a few hours ago.


Secondly, Stadel does not list BnF Syr. 262, [Added note X] which is understandable: this manuscript does not give any author and its title is also different, namely

 ܟܬܐܒ ܐܠܢܚܘ ܡܦܣܪ ܡܢ ܐܠܣܪܝܐܢܝ ܐܠܝ ܐܠܥܪܒܝ ܐܠܡܥܪܘܦ ܥܢܕ ܐܠܣܪܝܐܢ  ܓܪܰܡܰܛܺܝܩܺܝ ܬܘܳܪܰܣ ܡܰܡܳܠܐ ܝܥܢܝ ܬܨܚܝܚ ܐܠܟܠܐܡ 
"The  book of grammar translated from Syriac into Arabic known as 'Gramaṭīqī - Tūraṣ mamlō' which means 'Grammar - Correction of speech'". 

The term ܬܳܘܪܽܨ ܡܰܡܠ̱ܠܳܐ (note the correct Syriac spelling here) tūraṣ mamllō lit. 'correction of speech' is generally used to mean 'grammar' and so one finds it in titles of grammatical works modern and medieval; the lost grammar by Jacob of Edessa is reported to have borne it. The phrase shows up even in the Syriac version of the title given by BnF Syr. 370 and SBB Ms. or. quart. 1050, although in those two, the first word is given as ܬܪܝܨܘܬܐ. The BnF catalogue refers to the work contained in BnF Syr. 262 as "Grammaire de la langue syriaque, divisée en quarante-cinq chapitres, par un auteur maronite". Why a Maronite is a mystery; it could be because it is written entirely in garšūnī or just because it uses Serto. Regardless, even a cursory comparison of BnF Syr. 262 to BnF Syr. 370 makes it clear that they are the same work containing 46 (BnF Syr. 370 and SBB Ms. or. quart. 1050) or 45 (BnF Syr. 262) chapters. Also, 46 chapters on some 100 folios of 18-20 lines each? So much for "poor coverage".

Thirdly, Stadel adds Vat. Syr. 150 (200r-215v, 1709). This identification is clearly not correct - as should be evident from the number of folios - and likely the result of undue reliance on Baumstark (a common affliction in Syriac scholarship). Assemani's catalogue describes the manuscript as "Jesujabi Episcopi Nisibeni ... Quaestiones Grammaticae & aenigmaticae" and sure enough, this is our Išoʕyahḇ. Baumstark incorrectly assumes that this is the same work as the previous ones he lists, i.e. Seert 99 (now lost), Seert 100 (also lost) and Seert 101 (our BnF Syr. 370) [6]. This work, however, is in Syriac only; moreover, it is indeed a list of questions - maybe even the one Bar Brīḵā refers to - and not Kitāb al-ʔīḍāḥ.

And finally, I was made aware of the existence of the Florence manuscript by Margherita Farina (see also her article), to whom I hereby extend my gratitude. The text seems to be identical with that of BnF Syr. 262, although interestingly, a colophon ascribes the authorship of the work to George (Gewārgīs) ʕAmīra, a Maronite scholar and bishop, the author of Grammatica Syriaca.

In summary, it may well be the case that there are not two, but three grammatical works written by Bar Malkōn:

  1. A treatise on points (BnF Syr. 369, 114v-125v; BnF Syr. 370, 174r-187v; London BL Add. 25,876, 276v-290v and likely many more, on which later). 
  2. Kitāb al-ʔīḍāḥ (see above)
  3. Grammatical questions (Vat. Syr. 150, 200r-215v)  

Turning back to the contents of the manuscripts of Kitāb al-ʔīḍāḥ, it is not the case – as Baumstark’s description (which most likely goes back to Scher's catalogue and which Van Rompay copies in his GEDSH entry on Bar Malkōn) would have it – that Kitāb al-ʔīḍāḥ is originally written in Syriac with a translation in Arabic in two columns ("... das syrische Original in einer Parallelkolumne mit einer arabischen Üb[er]s[etzung] ...") [7]. That is not true of any of the surviving mss Baumstark was aware of, i.e. the two Paris mss and the Berlin one.  Rather, the primary language of Kitāb al-ʔīḍāḥ is Arabic, but Syriac is employed throughout, in both examples and definitions of grammatical phenomena. Such Syriac text rarely constitutes a direct translation of any of the Arabic parts. As a rather straightforward example, consider this section from chapter 2 on parts of speech (BnF Syr. 370, fol. 9r-9v) with the Syriac portions highlighted in red (translation and numbered subsection division mine, underlined text is colored in the manuscript):


1

Chapter 2: On the division of parts of speech. Division of speech.

الباب الثابي في اقسام الكلام ܗ̄. ܦܘܲܠܓܲܐ ܕܡܡܠܠܵܐ


2

Among the Syrians, as well as the Arabs, speech is divided into three things: noun, verb and particle. That is, noun, verb and particle.

الكلام عند السريانيين والعرب. ينتظم من ثلثه اشيآ ܫܡܐ. ܘܡܸܠܬ݂ܐ ܗ̄ ܥܒ̣ܵܕܐ ܘܐܣܵܪܐ. ܗ̄ اسمٌ. وفعلٌ. وحرفٌ.

3

Some examples of nouns include: person, man, horse, mountain, command and similar.

فالسم نحو قولك ܒܪܢܫܐ. ܓܒܪܐ. ܣܘܣܝܵܐ. ܬܘܪܐ. ܐܸܡܪܐ. وما شاكل ذلك ܀

4

And know that everything that ends in an alif in the Syriac language is, for the most part, a noun. And a (word) that takes one of the four particles (lit. 'additions') BDWL BDWL is a noun.

واعلم ان كل ما اخره الف في لغه السريانيون فهو اسم علي الامر الاكثر وما يدخل عليه احدي الزوايد الاربع وهي بدول ܒܕܘܠ فهو اسم ܀

5

The definition of a noun among them [= Syriac grammarians]: sound with meaning that (is) without tense.

و حد الاسم عندهم ܀ ܩܠܐ ܡܫܘܕܥܵܢܐ ܒܫܠܡܘ̣ܬܐ ܕܠܵܐ ܙܲܒܢܵܐ. ...

6

And others define it as follows: the first part of speech designating a thing or an action.

ܐܚܪ̈ܢܐ. ܕܝܢ ܬܲܚܡܘ̣ܗܝ ܗܟܢܐ ܡܢܬܐ ܩܕܡܵܝܬܐ ܕܡܡܠܠܐ ܕܡܫܵܘܕܥܵܐ ܨܒ̣ܘ̣ܬ̣ܐ ܡܕܡ ܐܵܘ ܣܘܥܪܢܐ ܀

To be fair, the Arabic influence is indeed undeniable: it is clear, for example, from the division of the parts of speech into three classes (section 2), where the native Syriac linguistic tradition typically works with seven, i.e. the eight of Technē Grammatikē minus the definite article. I guess it is ok to be servile to the Greek model, although on the other hand, Bar Hebraeus divides his grammar into four treatises on, respectively, nouns, verbs, particles and orthography, so maybe he is servile to Arabic models as well... In any case, the influence of Arabic on Bar Malkōn's analysis is also evident from the choice of his examples: 'man' and 'horse', for example, are also given as examples of nouns in Sībawayh’s Kitāb.

The rest of the section, however, is anything but a servile copy of the Arabic method without any connection to the Syriac linguistic tradition. One such connection is the terminology: melṯā, his term for 'verb', is one that is well-established in the Syriac scientific terminology, though originally used as a translation for ῥῆμα in philosophical works. The term for 'particles', esārā, is also in common use in native Syriac linguistic tradition, although typically meaning 'conjunction', translating the Greek σύνδεσμος, both as a philosophical term and as a linguistic one (ch. 20 of Technē Grammatikē). Interestingly, Bar Hebraeus uses both terms in the same way Bar Malkōn does. [8] These and other items of Syriac linguistic terminology occur all over Kitāb al-ʔīḍāḥ, partly as a result of dealing with matters specific to Syriac (and not only such obvious things as vowel points), but especially due to the bilingual nature of the work. This of course requires Bar Malkōn not only to engage with the Syriac tradition, but also to attempt to harmonize it with the Arabic linguistic framework and even make attempts at comparative linguistics.

The major way in which Kitāb al-ʔīḍāḥ is undoubtedly a part of the native Syriac linguistic tradition - as opposed to a mindless copy of the Arabic one - is Bar Malkōn’s constant references to the same and his insistence on working within it. The introduction (BnF Syr. 370, ff. 2v-4v) contains a brief overview of the previous work by Syriac grammarians and scholars of language, including Jacob of Edessa (d. 708), Eliya of Ṭirhan and Yawsep̱ Hūzāyā (6th cent.), the purported translator of Technē Grammatikē into Syriac. The text of Kitāb al-ʔīḍāḥ then repeatedly refers to their work (the "among them" in section 5 above and "others" in section 6) and cites them by name regularly. The chapter on parts of speech cited above also contains one very telling example in section 5, i.e. the absence of time as a major criterion for the definition of a noun. This line of reasoning is unique to Syriac linguistic tradition and can be traced to Aristotle, e.g. De Interpretatione. In contrast, Technē Grammatikē opts for a morphological/semantic definition (English translation). Now Arabic tradition is complicated, but it involves morphological and syntactic criteria; a simplified contemporary grammar uses a definition that is heavy on the morphology. True, so does Bar Malkōn’s own definition of a noun in section 4, treating the particles BDWL as morphological properties. But then again, this is a fact of Syriac, obvious to anyone with even a passing familiarity with the language. So servile attitude towards Arabic models or sensible analysis of one’s language? The latter definitely applies to the entirety of what BnF Syr. 370 calls chapter 47 (96v-173v), missing in BnF Syr. 262 and SBB Ms. or. quart. 1050. [9] This chapter is sometimes treated as a separate work - or even genre - called De vocibus aequivocis, i.e. "On ambiguous words", and contains a Syriac-Arabic glossary of homographs.

None of this slavishly follows the Arabic model; in fact, the more I think about it, the more I am convinced that those who argue so have only ever read the section on parts of speech. The relationship of Bar Malkōn's analysis to the Arabic linguistic tradition reminds me of the way grammars of modern languages follow the Latin model: there is some, even a lot of, inspiration that may even be slavish now and then - just think of the concept of parts of speech and the terminological fustercluck that are Wolof conjugated pronouns. The Latin method, however, is not all there is.

As noted above, Bar Malkōn's work remains unedited and unpublished - hell, this hastily put-together post might be the most comprehensive study of his work to date. If anyone wishes to change that, for example as an MA thesis (his short treatise on points would be perfect) or even a PhD dissertation, hit me up.