bulbulistan: tahime

Unless you've been living under a rock these past weeks, you know about the Gospel of Jesus' Wife (GJW), so no intro, just the basic overview: this is the draft PDF of the paper by Karen King, Alin Suciu has some great comments, Mark Goodacre links to a detailed textual analysis by Francis Watson (but see Timo Paananen for a rebuttal) and a video by Christian Askeland; and Jim Davila is, as always, your go-to guy for the complete picture of the debate. Before I venture any further, let me offer a disclaimer: I have no dog in the fight over the historical Jesus and my faith (such as it is) is not threatened by the very idea of Him being married. To me, this whole debate is first and foremost a wonderful example of how scholarship works and it's a great thing to watch it more or less live. Well, mostly watch, because (disclaimer continues) I am no expert on either Coptic or any aspect of early Christianity and thus I cannot and will not offer any opinion as to the authenticity of the GJW fragment. I did, however, leave one comment on Mark Goodacre's blog where I essentially wondered aloud about some usage in GJW I found peculiar. As it so happened, ~~yesterday~~ earlier this week on Charles Halton's blog, Gesine Robinson offered her view of the whole affair and her remark #8 addresses the same point, except of course in a much more detailed and better articulated manner. I'm reproducing it below with one modification - I've changed the transliteration scheme[1] to something that I find a little easier on the eyes:

Therefore, the rather rare phrase PEDžE IS* (though frequently used in the Gospel of Thomas since we have to do there with a collection of Jesus’ sayings) is used even in both instances of speaking, instead of the form PEDžAF (+ pronominal/nominal object) + NQI + subject that is more common in dialogues or other literary texts. Here in the first instance one would expect something like PEDžAU NIS* NQI NMAThÉTÉS, and in the second instance PEDžAF NAU NQI IS*, or since Jesus answers the disciples, even AFOUÓŠB[2] NQI IS* PEDžAF NAU DžE. It seems a cautious and perhaps unsure modern Coptologist was at work here.

To actually understand what's going on, first, PEDžE. Translated as "(pronoun) said", PEDžE belongs to a funny little class of words Layton (2000:297-314) refers to as verboids. Semantically, they are like verbs, and they can even take some of the verbal affixes, but there are a few important aspects in which they differ from actual verbs. In case of PEDžE, they are as follows:

1. PEDžE cannot be negated or converted, i.e. it cannot take the relative, circumstantial, preterit or focalizing prefix (Layton 2000:321-322).
2. PEDžE only expresses the past tense.
3. PEDžE can only be conjugated suffixally.
4. PEDžE can appear in two forms:

independently (PEDžE), in which case it must be immediately followed by the subject noun or pronoun (Layton's 'prenominal state').
suffixed (conventionally written as PEDžA=) where the suffix marks the subject of the action of speaking (Layton's 'prepersonal state'). In this case, if the 3rd person subject is also expressed by a noun, the noun is preceded by the preposition NQI.

In GJW, PEDžE (i.e. the prenominal state) appears twice: first on line 2 (PEDžE MMAThÉTÉS NIS* DžE ... = "The apostles said to Jesus: ...") and then of course on line 4 (PEDžE IS* NAU TAHIME = "Jesus said to them 'My wife...'"). The objection Dr. Robinson raises is that this is unlikely: PEDžE in its prenominal state is rare, and seeing it twice in such a short text is rarer still, especially when there are other (presumably more frequent) constructions that could have been used. We have very little reason to doubt Dr. Robinson's intuition and experience. But what we also have is a way to actually check whether she's right. Distribution and probability, that's all we're dealing with here, and that is familiar NLP territory where a corpus and some math are all you need. The questions to be asked can be reformulated as follows:

1. Is the prenominal state of PEDžE indeed rare?
2. What is the probability of one prenominal PEDžE following another?
3. What about the frequency of other constructions?

First, the data: I decided to use the gospels for both theoretical (we are looking at Jesus' words after all) and practical reasons (Coptic translation of the canonical gospels is readily available from a number of sources online). So I cobbled together a little Perl script to retrieve the text of the canonical ones from The Unbound Bible website. I only used the version they refer to as "Coptic: Sahidic NT" which, according to their information, ultimately comes from Sahidica. To the canonical gospels I added the Gospel of Thomas (GThom) which gets mentioned a lot in this context and which I retrieved from metalog. After some minor cleanup, I ended up with 4 UTF-8-encoded plain text files (plus one for GThom) which I then fed into antconc. One of the cool features of antconc is the ability to define custom fonts which is particularly handy for Coptic. For best results, use New Athena Unicode and for replicability, my original settings.

And now the procedure: for the first question, let's go with something simple. The null hypothesis is that the distribution of the prenominal state on one hand and all the forms of the prepersonal state on the other is roughly equal. In other words, there is no particular reason why an author or a translator should prefer one to the other. So when searching for all possible forms of PEDžE (PEDž.* in regex terms) - of which there are only a handful - we would expect to find PEDžE about 50% of the time. The table below sums up the actual findings for the corpus consisting of the four canonical gospels:

Total wordcount	59100
Total PEDž.*	776

Form	Count	% of PEDž.*	Observed relative frequency (per 1000)
PEDžE	140	18,04%	2,37
PEDžAF	501	64,56%	8,48
PEDžAS	20	2,58%	0,34
PEDžAU	113	14,56%	1,91
PEDžÉTN	2	0,26%	0,03
Total	776	100,00%	13,13

Prenominal state	140	18,04%	2,37
Prepersonal state	636	81,96%	10,76
Total	776	100,00%	13,13
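For anyone who wants to check my arithmetic, here is a minimal Python sketch that recomputes the percentage and per-1000 columns from the raw counts in the table above. The counts and the total word count are copied from the table; the names (counts, TOTAL_WORDS, share_and_rate) are made up for this illustration and have nothing to do with antconc, which is where the counts actually came from. Note that Python prints decimal points where my tables use decimal commas.

```python
# Raw counts of each form of the verboid, copied from the table above.
counts = {"PEDžE": 140, "PEDžAF": 501, "PEDžAS": 20, "PEDžAU": 113, "PEDžÉTN": 2}
TOTAL_WORDS = 59100  # total word count of the four canonical gospels


def share_and_rate(count, total_forms, total_words):
    """Return (percentage of all PEDž.* forms, occurrences per 1000 words)."""
    return 100 * count / total_forms, 1000 * count / total_words


total_forms = sum(counts.values())
rows = {
    form: share_and_rate(n, total_forms, TOTAL_WORDS)
    for form, n in counts.items()
}

# Prenominal state is PEDžE alone; everything else is the prepersonal state.
prenominal = counts["PEDžE"]
prepersonal = total_forms - prenominal

for form, (pct, rate) in rows.items():
    print(f"{form}\t{counts[form]}\t{pct:.2f}%\t{rate:.2f}")
print(f"Prenominal\t{prenominal}\tPrepersonal\t{prepersonal}")
```

The same function can be pointed at the per-gospel counts further down by swapping in the relevant counts and word totals.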

Though the definition of 'rare' may vary, these figures clearly show that in the canonical gospels, the prepersonal state PEDžA= is preferred to the prenominal state. This becomes even clearer when one looks at the synoptic gospels only:

Matthew		% of PEDž.*	Observed relative frequency (per 1000)
Prenominal state	2	1,20%	0,12
Prepersonal state	165	98,80%	9,72

Mark
Prenominal state	12	10,62%	1,19
Prepersonal state	101	89,38%	10,02

Luke
Prenominal state	55	20,37%	3,19
Prepersonal state	215	79,63%	12,47

With John, the prenominal form makes up almost a third of all forms of PEDžE. Moreover, more than a half of all instances of the prenominal form in the four canonical gospels can be found in the Gospel of John:

John		% of PEDž.*	Observed relative frequency (per 1000)
Prenominal state	72	31,72%	4,86
Prepersonal state	155	68,28%	10,47

Gospel of Thomas, however, is another story altogether:

Gospel of Thomas		% of PEDž.*	Observed relative frequency (per 1000)
Prenominal state	101	71,13%	25,63
Prepersonal state	41	28,87%	10,41

Here Dr. Robinson's intuition is proven correct once again: GThom clearly prefers the prenominal state of PEDžE. Moreover, the relative frequency (per 1000 words) of this state is much higher than even the relative frequency of all forms of the verboid in any of the canonical gospels or all of them combined.

So much for the first question, now on to the second one. The problem is a trivial one: calculate the probability of one prenominal PEDžE following another prenominal PEDžE. In other words, given the probability of two complementary events (i.e. the probability of either state occurring), we need to calculate the probability of one of those events occurring twice in a row. Let P(N) be the probability of the prenominal state being selected as determined above - i.e. in a situation where the author has already decided to use a form of PEDžE, P(N) expresses the probability of this form being the prenominal state. Assuming the two choices are independent of each other, the probability P(M) of event M (the prenominal state occurring twice in a row) is calculated as follows:

Canonical gospels: P(M) = P(N) * P(N) = 0,18 * 0,18 = 0,032
Gospel of Thomas: P(M) = P(N) * P(N) = 0,71 * 0,71 = 0,504
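The same calculation as a small Python sketch, in case you want to plug in other counts. It treats the two choices of form as independent, which is the simplification the multiplication above relies on. The function name prob_twice_in_a_row is invented for this example. Because it works with the unrounded proportions (140/776 and 101/142) rather than 0,18 and 0,71, the results come out a hair higher than the figures above.

```python
def prob_twice_in_a_row(prenominal, total_forms):
    """P(M) = P(N) * P(N), assuming the two choices are independent."""
    p_n = prenominal / total_forms
    return p_n * p_n


# Counts taken from the tables above.
canon = prob_twice_in_a_row(140, 776)   # four canonical gospels
gthom = prob_twice_in_a_row(101, 142)   # Gospel of Thomas

print(f"Canonical gospels: {canon:.4f}")
print(f"Gospel of Thomas:  {gthom:.4f}")
```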

The probability of PEDžE occurring twice in a row is therefore 3,2% for the canonical gospels and 50,4% for the Gospel of Thomas. If GJW is a narrative similar to the canonical gospels rather than a sayings gospel like GThom, then one would be fully justified in raising a brow over the two prenominal PEDžE in a row - doubly so when one takes into account the relative frequency of that state. For the canonical gospels, it's 2,37 (from 0,12 for Matthew to 4,86 for John), for GThom, the figure is 25,63. Compare that to the figures for GJW (calculated assuming a total word count of 31 words):

GJW
Total wordcount	31
Total PEDž.*	2

Form	Count	% of PEDž.*	Observed relative frequency (per 1000)
PEDžE	2	100,00%	64,52

And finally, question no. 3. Here the issue is a little more complicated (it involves variations in word order and information structure) and as such, the answer is hard to arrive at without a decently tagged corpus. What I can do, however, is throw a few regular expressions around looking at what structures are used to refer to Jesus speaking (in absolute numbers):

Jesus speaks	Canon	GThom
PEDžE IÉSOUS / IS*	53	85
PEDžAF * NQI IÉSOUS / IS*	20	0
AFOUÓŠB NQI IÉSOUS / IS*	18	0

Interestingly enough, the second structure is used four times in GThom, but only for other people speaking (Peter, Matthew and twice Thomas). Based on this, it would not be that unreasonable to expect PEDžE with Jesus as the subject to occur in any text similar to the canonical gospels, let alone to GThom.
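To give an idea of what "throwing a few regular expressions around" looks like, here is a rough Python sketch of the three searches. Big caveat: my actual searches were run in antconc against text in Coptic script, whereas this sketch works on the transliteration used in this post, and SAMPLE is a toy input stitched together from the forms quoted above (plus one constructed variant with IÉSOUS), not real corpus text. The pattern names and the count_constructions helper are likewise invented for the illustration.

```python
import re

# Jesus referred to either by name or by the nomen sacrum IS*.
JESUS = r"(?:IÉSOUS|IS\*)"

PATTERNS = {
    # prenominal state immediately followed by the subject
    "PEDžE + Jesus": re.compile(rf"PEDžE\s+{JESUS}"),
    # prepersonal state, at most one intervening word (e.g. NAU), then NQI + subject
    "PEDžAF (+ object) NQI + Jesus": re.compile(rf"PEDžAF\s+(?:\S+\s+)?NQI\s+{JESUS}"),
    # "he answered" followed by NQI + subject
    "AFOUÓŠB NQI + Jesus": re.compile(rf"AFOUÓŠB\s+NQI\s+{JESUS}"),
}

# Toy input only: NOT actual text from the gospels.
SAMPLE = "\n".join([
    "PEDžE IS* NAU DžE",
    "PEDžAF NAU NQI IS* DžE",
    "AFOUÓŠB NQI IS* PEDžAF NAU DžE",
    "PEDžE IÉSOUS NAU DžE",
])


def count_constructions(text):
    """Count non-overlapping matches of each pattern in the text."""
    return {name: len(p.findall(text)) for name, p in PATTERNS.items()}


result = count_constructions(SAMPLE)
for name, n in result.items():
    print(f"{name}\t{n}")
```

Patterns this crude will happily miss anything with more than one word between the verboid and NQI, which is part of why a properly tagged corpus would be so much better.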

When it's the apostles' turn to speak to Jesus, the picture is even more complicated: the apostles can be referred to as MMATÉTÉS ("the apostles") or NEFMATÉTÉS ("his apostles"), Jesus can be referred to by his name, by the nomen sacrum IS* (I counted those together) or by NAF ("to him"). A few more quick regular expressions and voilà:

Apostles speak	Canon	GThom
PEDžE MMATÉTÉS	0	4
PEDžE NEFMATÉTÉS	0	0
PEDžAU NIÉSOUS / NIS* NQI MMATÉTÉS	0	0
PEDžAU NAF NQI MMATÉTÉS	1	0
PEDžAU NAF NQI NEFMATÉTÉS	2	5

These are by no means all possible constructions, just the ones that seem to be relevant for this discussion. So the first is the one that occurs in GJW and, no surprise there, it crops up in GThom as well, but not in the canonical gospels. The second one is just a check - it struck me that NEFMATÉTÉS crops up roughly twice as much as MMATÉTÉS in both the canonical gospels and GThom - but as you can see, it really doesn't matter since this structure cannot be found in either. The third construction is the one Dr. Robinson would expect instead of the first one. As it turns out, this would be an unreasonable expectation, as it doesn't appear in either the canonical gospels or GThom. Two of its variations do - in both cases the target of speaking (=Jesus) is expressed by means of NAF and in one of them, the subject is NEFMATÉTÉS which is not surprising considering the relative frequency of the two forms of this noun.

Of course, all these figures mean very little. GJW is a small fragment, the corpus of material I used is limited in both size and scope and chances are some of my math is wrong (check for yourself), just to give a few objections that might legitimately be raised. Nevertheless, with a decently sized and properly tagged corpus of Coptic, this is an example of what Coptologists could do to check whether their intuition regarding the distribution of certain morphological or syntactic forms is correct, not to mention all the other cool stuff.

[1] I borrowed the table from Wikipedia; the asterisk marks a nomen sacrum.
[2] At least I think that's what Dr. Robinson meant.