Appendix 2: Self-segregating morphemes

Introduction

One feature that all loglangs and many other engelangs have in common is 'self-segregating morphemes'; that is, in any string of letters (or string of sounds), the morpheme boundaries are quite unambiguous. Loglangs do this because one of their aims is to avoid ambiguity as far as possible.

How do other loglangs do this?

In both Rex May's Ceqli and Jim Carter's Gua\spi the morphemes must begin with one or more contoids (obstruents) and end in one or more vocoids (sonorants & vowels). This is neat but we cannot, of course, apply similar rules to a strictly CV structure; indeed, our language would have only 16 morphemes!

Lojban attempts to implement self-segregation of morphemes, but its rules are complicated and it requires the use of pauses in certain places in order to fully implement this feature. Enforced pauses are unnatural and not, in my opinion, likely to be properly implemented in normal continuous speech. For our language some other approach is needed.

How does Plan B do it?

Jeff Prothero refers to the morphemes of Plan B as 'affixes'; the morphemes (or affixes) can be uniquely segregated, using

"a Huffman-style expanding-opcode sort of scheme. A simple and effective one is simply to have the number of leading 1 bits in the affix give the number of trailing letters in the affix.This gives us the following infinite set of affixes, where '?' may be any single char:

Length 1 (      8 affixes): b   d   g   j   l   n   s   v
Length 2:(     64 affixes):   c?      h?      m?      t?
Length 3:(    512 affixes):       f??             p??
Length 4:(   4096 affixes):               k???
Length 5:( 32768 affixes): zb??? zd??? zg??? zj???
                            zl??? zn??? zs??? zv???
Length 6:( 262144 affixes): zc???? zh???? zm???? zt????

... with infinitely more to follow, eight times more for each length."

If you compare this with the Plan B alphabet given on the 'Phonology & Orthography' page, you will see that the even-numbered quartets are all 'Length 1', quartets 1, 5, 9 & 13 begin 'Length 2' morphemes, quartets 3 & 11 begin 'Length 3' morphemes, quartet 7 begins a 'Length 4' morpheme, and quartet 15 begins morphemes of five or more sounds. It will be found that, beginning with the least significant bit and reading towards the most significant bit, we count the number of 1s (if any) before the first 0 and add one to the total. Now this is all very well for a computer, but is it practical for humans?
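The counting rule can be sketched in Python. The letter-to-quartet mapping is an assumption: it takes the sixteen Plan B letters in alphabetical order as the values 0 to 15. That reproduces the quoted table, but it is not copied from the 'Phonology & Orthography' page. The function names are illustrative only.

```python
# Assumed mapping: the sixteen Plan B letters in alphabetical order, values 0..15.
ALPHABET = "bcdfghjklmnpstvz"
QUARTET = {letter: value for value, letter in enumerate(ALPHABET)}


def trailing_ones(value):
    """Count the 1 bits of a 4-bit value, from the least significant bit up to the first 0."""
    count = 0
    while count < 4 and value & 1:
        count += 1
        value >>= 1
    return count


def affix_length(text, start=0):
    """Length of the affix beginning at text[start]: one more than the run of 1 bits.

    A letter whose quartet is all 1s (z) passes the count on to the next letter.
    """
    total = 0
    pos = start
    while pos < len(text):
        ones = trailing_ones(QUARTET[text[pos]])
        total += ones
        if ones < 4:
            break
        pos += 1
    return total + 1


def split_affixes(text):
    """Cut a string of Plan B letters into affixes, left to right."""
    affixes = []
    pos = 0
    while pos < len(text):
        length = affix_length(text, pos)
        affixes.append(text[pos:pos + length])
        pos += length
    return affixes
```

Under that assumed mapping, split_affixes("bcdfghkbbbzbddd") gives b, cd, fgh, kbbb and zbddd, i.e. one affix of each of the first five lengths.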

In effect, at the human level in Plan B, the first sound (which may be a consonant or a vowel, depending upon what has come before) determines the number of 'sounds' the morpheme has. At this level the system for unambiguous segmentation of the spoken chain into discrete morphemes appears quite arbitrary, as the following table shows:

First sound	Number of 'sounds' *	Shape of morpheme
[b] [d] [g] [ʒ] [ð] [n] [s] [v] or [ɛ] [ɪ] [aɪ] [oʊ] [ɹɛ] [ɹɪ] [ɹaɪ] [ɹoʊ]	one sound	C or V
[ʃ] [θ] [m] [t] or [eɪ] [u:] [ɹeɪ] [ɹu:]	two sounds	CV or VC
[f] [p] or [ɑ] [ɹɑ]	three sounds	CVC or VCV
[k] or [i:]	four sounds	CVCV or VCVC
[z] or [ɹi:]	at least five sounds; i.e. four plus the number denoted by the second sound.	CVCV... or VCVC...

* Remember that [ɹ] does not count as a separate sound for this purpose, but as the onset of a diphthong or triphthong.

This may be 'near optimal' for a computer, but I fail to see how it is near-optimal for human use.

How might we do it?

It is true that if the same Huffman-style scheme were applied to our experimental loglang, there would be less arbitrariness (and it is easier to count the number of syllables than it is to count the 'sounds') in that:

First syllable	Number of syllables	Shape of morpheme
all syllables ending in /o/ (All series #0 & #2)	one syllable	CV
/je/ /le/ /ne/ /me/ (All series #1)	two syllables	CVCV
/ke/ /te/ (series #3, grades #0 & #2)	three syllables	CVCVCV
/se/ (series #3, grade #1)	four syllables	CVCVCVCV
/pe/ (series #3, grade #3)	at least five syllables, i.e. four plus the number denoted by the 2nd syllable.	CVCVCVCV...

It would, I guess, not be difficult to learn those rules even though, at a human level surely, they appear a tad arbitrary. Let us call this 'Scheme A'.
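A minimal Python sketch of Scheme A, taking the lengths straight from the table above. The function and variable names are illustrative, and the input is assumed to be a strictly CV string written with one letter per phoneme.

```python
# Length signalled by the first syllable, as in the table above.
# Syllables ending in /o/ and the syllable /pe/ are handled separately below.
FIRST_SYLLABLE_LENGTH = {"je": 2, "le": 2, "ne": 2, "me": 2, "ke": 3, "te": 3, "se": 4}


def syllables(text):
    """Split a strictly CV string such as 'mesoto' into ['me', 'so', 'to']."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def denoted_length(sylls, pos):
    """Number of syllables denoted by the syllable at sylls[pos]."""
    syll = sylls[pos]
    if syll.endswith("o"):
        return 1
    if syll == "pe":
        # Four plus the number denoted by the following syllable.
        if pos + 1 < len(sylls):
            return 4 + denoted_length(sylls, pos + 1)
        # A final /pe/ with nothing after it is incomplete; count the minimum.
        return 5
    return FIRST_SYLLABLE_LENGTH[syll]


def segment_scheme_a(text):
    """Cut a CV string into Scheme A morphemes by reading a length and counting off."""
    sylls = syllables(text)
    morphemes = []
    pos = 0
    while pos < len(sylls):
        length = denoted_length(sylls, pos)
        morphemes.append("".join(sylls[pos:pos + length]))
        pos += length
    return morphemes
```

For example, segment_scheme_a("pelejewokoso") returns a single six-syllable morpheme, because /pe/ contributes four and the following /le/ denotes two.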

Gary Shannon and John Cowan have suggested other methods to achieve the self-segregation of morphemes in a language where all syllables are (C)V. Gary's suggestion is that we normally have sequences CV but that CVV marks a morpheme boundary. In the restricted syllabary of our loglang, it would mean that w /wo/ and y /je/ serve to mark the end of morphemes.

Gary's system also allowed morphemes beginning V, thus in a sequence CVVV the last V is the initial syllable of the next morpheme; similarly we could allow w and y to behave the same way. However, in a sequence CVVVV we would have first two Vs marking the end of a morpheme, the third V being a monosyllabic morpheme and the next V beginning another morpheme. If this sequence occurred finally, then the last two Vs would be monosyllabic morphemes. Also, of course, in an initial sequence of VV the first V would be a monosyllabic morpheme while the second V would be the initial syllable of another morpheme.

However, it seems to me that it would make things a little easier if we did not have to count the number of Vs when they occur in a string. If this scheme is adopted, we shall restrict w and y to marking morpheme endings only. If no CV syllable occurs before w or y, then the w or y in question is both the beginning and end of a morpheme, i.e. it is a monosyllabic morpheme. Thus VCVV must be V-CVV and CVVVV must be CVV-V-V only. Let us call the modified version of Gary's suggestion 'Scheme B'.
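As a rough Python sketch, Scheme B amounts to cutting the string after every w or y. The function name and the `enders` parameter are illustrative, and the input is assumed to be a string of the loglang's syllabograms, one letter per syllable.

```python
def segment_scheme_b(letters, enders="wy"):
    """Cut a string of syllabograms after every w or y."""
    morphemes = []
    current = ""
    for letter in letters:
        current += letter
        if letter in enders:
            # w and y only ever close a morpheme; alone, they form a monosyllable.
            morphemes.append(current)
            current = ""
    if current:
        # Trailing letters with no closing w or y: an unfinished morpheme.
        morphemes.append(current)
    return morphemes
```

With this rule a string such as "wmy" can only be w-my, and "mwyy" can only be mw-y-y, so no counting of consecutive w or y is needed.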

John gave his conlang xuxuxi, which also has only V or CV syllables, a rule of vowel harmony/disharmony to determine morpheme boundaries. The first syllable determines the vowel harmony which operates on all the following syllables (if any) except the final one; that is marked by vowel disharmony. Xuxuxi has five vowels, so there is room for variation of vowels within the morpheme. If we adopted a similar system for our loglang, the rule would be much simpler since we have only two vowels, thus:

If the vowel of the first syllable is /o/, then all subsequent syllables also have /o/ except the final syllable, which has /e/.
If the vowel of the first syllable is /e/, then all subsequent syllables also have /e/ except the final syllable, which has /o/.

(In a bisyllabic morpheme there will be no medial syllables; the two syllables will simply have different vowels.)
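The two rules above can be sketched in Python as a single test: a morpheme ends at the first syllable whose vowel differs from that of its opening syllable. The function names are illustrative, and the input is assumed to be a strictly CV string written with one letter per phoneme.

```python
def syllables(text):
    """Split a strictly CV string such as 'mesoto' into ['me', 'so', 'to']."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def segment_scheme_c(text):
    """Cut a CV string wherever the vowel first differs from the morpheme's opening vowel."""
    morphemes = []
    current = []
    for syll in syllables(text):
        current.append(syll)
        # Compare this syllable's vowel with the vowel of the morpheme's first syllable.
        if syll[-1] != current[0][-1]:
            morphemes.append("".join(current))
            current = []
    if current:
        # No disharmonic syllable was reached: the last morpheme is incomplete.
        morphemes.append("".join(current))
    return morphemes
```

For instance, segment_scheme_c("towopeleko") yields towope and leko: the first morpheme runs on /o/ until /pe/ closes it, and the second opens on /e/ and is closed at once by /ko/.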

In John's scheme, as applied to our loglang, monosyllables are not possible*. We shall call our version of John's vowel harmony/disharmony rule 'Scheme C'.

* In xuxuxi, words of two or more syllables are marked by stressing the first syllable. Any syllables which follow the end-of-word vowel-disharmony syllable are monosyllabic words until the next stressed syllable is met. We could, of course, apply a similar rule to Scheme C, but this would mean that we have to introduce phonemic stress. If this is to be shown in the bit representation, it will mean modifying the syllabary. It does not seem worth doing this for the sake of 15 or 16 extra morphemes.

Evaluating the three possible schemes

I will consider this under three heads:

i. How many morphemes can each scheme generate?
ii. How easy is it for humans to determine morpheme boundaries?
iii. How easy is it for machines to determine morpheme boundaries?

Consideration (iii) is, of course, not strictly relevant to a loglang. But as it was clearly important to the author of Plan B, I have added it for consideration.

i. How many morphemes can each scheme generate?

Scheme A is the same as Plan B, except that the surface representation of alphabet and pronunciation is different. The number of morphemes that may be generated remains the same as in the table given above in the section 'How does Plan B do it?'

In Scheme B we may have only two one-syllable morphemes, namely w and y. Two-syllable morphemes may begin with any syllabogram except w and y (otherwise we'd have two one-syllable morphemes), while the second syllable must be either w or y; thus we may have 28 (14 * 2) two-syllable morphemes. With three syllables and more, each additional syllable may also be any syllabogram except w and y, i.e. the number of possible morphemes increases fourteenfold for each additional syllable.

In Scheme C, we cannot have any single-syllable morphemes. For two-syllable morphemes, we may have any of the eight syllables ending in /e/ followed by any of the eight syllables ending in /o/, and vice versa; in other words, we may have 128 (8 * 8 * 2) different two-syllable morphemes. For three-syllable morphemes, we have any of the eight /e/ syllables, followed by any of the eight /e/ syllables, followed by any of the eight /o/ syllables; or any of the eight /o/ syllables, followed by any of the eight /o/ syllables, followed by any of the eight /e/ syllables, i.e. 1024 (8 * 8 * 8 * 2) morphemes. Thus it will be seen that the total number of possible morphemes is increased eightfold for each additional syllable.

We may summarize all this, thus:

	Scheme A	Scheme B	Scheme C
One syllable morphemes	8	2	0
Two syllable morphemes	64	28	128
Three syllable morphemes	512	392	1 024
Four syllable morphemes	4 096	5 488	8 192
Five syllable morphemes	32 768	76 832	65 536
Morphemes of more than five syllables	eight times more for each length	fourteen times more for each length	eight times more for each length
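The figures in the table follow from three simple formulas, sketched here in Python as a check on the arithmetic. The function names are illustrative; n is the number of syllables in the morpheme.

```python
def scheme_a_count(n):
    """Eight one-syllable morphemes, and eight times more for each extra syllable."""
    return 8 ** n


def scheme_b_count(n):
    """n - 1 syllables drawn from the 14 that are not w or y, closed by w or y."""
    return 14 ** (n - 1) * 2


def scheme_c_count(n):
    """No monosyllables; otherwise 8 choices per syllable, doubled for the opening vowel."""
    return 0 if n < 2 else 8 ** n * 2


# Print one row per morpheme length: n, Scheme A, Scheme B, Scheme C.
for n in range(1, 6):
    print(n, scheme_a_count(n), scheme_b_count(n), scheme_c_count(n))
```

The formulas also show where Scheme B overtakes the others: it first exceeds Scheme A at four syllables, and Scheme C at five.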

It will thus be seen that while Scheme A does better than Scheme B as regards one-, two- and three-syllable morphemes, it does noticeably less well with morphemes of four or more syllables. Moreover, it consistently does less well when compared with Scheme C (exactly half the number, in fact) for all except one-syllable morphemes. It is not until we come to five-syllable morphemes that Scheme B does better than Scheme C; with morphemes of five syllables or more, Scheme B will always generate considerably more morphemes than the other two schemes. However, we shall probably want to use longer morphemes less often than the shorter ones, and Scheme B's poor performance as regards one-, two- and three-syllable morphemes is a weakness.

That Scheme C has no monosyllabic morphemes may seem a weakness, but even Scheme A has only eight of them, which is not a large number. If we take the total of monosyllabic and bisyllabic morphemes (which may well serve as 'particles') we find that Scheme A has 72 possible forms, Scheme B only 30, while Scheme C has 128 possibilities. It would seem that on balance Scheme C is going to be best for our purposes as regards morpheme generation.

ii. How easy is it for humans to determine morpheme boundaries?

Let us take for consideration the string: mzdwplywgzykñw

If we apply the rules for each scheme, we shall find the following morphemes:
Scheme A: mz-d-w-plywgz-yk-ñw
Scheme B: mzdw-ply-w-gzy-kñw
Scheme C: mz-dwp-lyw-gzy-kñw

There is no doubt that just by looking at the strings, the Scheme B division is the easiest: we just look for w and y, whereas in Scheme C we have to know how each letter is pronounced and in Scheme A we have to learn how many syllables are denoted by each letter which occurs in morpheme-initial position (or know its bit representation) and then count off the syllables as we speak or read. But, of course, when we read we do normally think of the sound of the words we read.

Let us try the same exercise again, but taking the string of syllables: /mesotowopelejewokosojekenewo/

Now if we apply the rules we shall find:
Scheme A: /meso/ /to/ /wo/ /pelejewokoso/ /jeke/ /newo/
Scheme B: /mesotowo/ /peleje/ /wo/ /kosoje/ /kenewo/
Scheme C: /meso/ /towope/ /lejewo/ /kosoje/ /kenewo/

I think we can forget Scheme A as far as normal human intercourse is concerned. The idea of ascertaining the number of syllables and then counting off is not exactly the way humans work - though it may well be suited to a robot. Both Scheme B and Scheme C, however, are straightforward; in the former we look for /wo/ and /je/, and in the latter we look for a change in the syllable rhyme from /e/ to /o/ or vice versa.

iii. How easy is it for machines to determine morpheme boundaries?

Scheme A has to:

ascertain from the first character how long the morpheme will be or, if the first character is p, to move on to the second character and ascertain the length from there (or from the next character after a string of initial ps)
then to count off the requisite number of characters.

Scheme B just has to check each character read against w or y; if the character is w or y then we are at the end of a morpheme. Scheme C, after having read the first character, has to check subsequent characters until it finds a non-matching LSB. As programming languages allow characters to be read as an integer number, this becomes simply a method of checking for odd and even numbers.
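The odd/even check for Scheme C might look like this in Python. It assumes each syllable has already been read as an integer whose least significant bit is 0 for syllables ending in /o/ and 1 for those ending in /e/. The function name and the sample codes are illustrative, not the loglang's actual values.

```python
def scheme_c_boundaries(codes):
    """Return the indices at which a morpheme ends, given one integer code per syllable."""
    ends = []
    first_parity = None
    for index, code in enumerate(codes):
        parity = code & 1  # 0 for an even code, 1 for an odd code
        if first_parity is None:
            # This syllable opens a new morpheme and fixes its parity.
            first_parity = parity
        elif parity != first_parity:
            # First syllable of the other parity: the morpheme ends here.
            ends.append(index)
            first_parity = None
    return ends
```

With the made-up codes [1, 0, 2, 4, 3, 5, 7, 6], the morphemes end at indices 1, 4 and 7.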

So even for the machine, Scheme A seems slightly more complicated than either Scheme B or C. There is little to choose between Schemes B and C.

Conclusion

It will be seen from the preceding sections that Scheme A performs worse on all three considerations: number of morphemes that can be generated, ease for humans, and ease for machines. Indeed, from a human point of view it would appear to be somewhat arbitrary, unless a person was aware of the bit representation of each character.

Scheme B is undoubtedly easier for someone just looking at a stream of characters without knowing the language or, indeed, how the characters are pronounced. But when one actually reads the characters or hears them spoken, there is nothing to choose between Schemes B and C. Likewise, there is no real gain for a machine in one or the other of these two.

The telling point, I think, is the number of possible morphemes generated. Although Scheme B will certainly be able to generate more possibilities if morphemes are of five syllables or more, it performs significantly less well on shorter morphemes: Scheme B can generate only 28 two-syllable morphemes whereas Scheme C may generate 128 (this more than makes up for Scheme C's lack of monosyllabic morphemes as opposed to Scheme B's two); and Scheme B can generate only 392 three-syllable morphemes as opposed to Scheme C's 1024. This I consider to be a significantly weak point in Scheme B, and it is caused by the restricted use of two of its syllables, namely /wo/ and /je/.

Therefore, if I had continued to develop the 16-syllable language I would have adopted Scheme C: the vowel harmony/disharmony rule, i.e.
- if the initial syllable ends in /o/ (LSB is 0), then all medial syllables (if any) also end in /o/ (LSB = 0), but the final syllable ends in /e/ (LSB = 1)
- if the initial syllable ends in /e/ (LSB is 1), then all medial syllables (if any) also end in /e/ (LSB = 1), but the final syllable ends in /o/ (LSB = 0)
