If you have ever wondered what it might be like to read minds, then consider reading James
Joyce’s Ulysses.
But do not expect to understand what the minds are thinking or
whatever they are doing, unless you are versed in the classics, as well
as ancient and modern languages, and your vocabulary consists of
about a half million words.
According to Wikipedia, Ulysses contains more than 30,000 distinct
words, but what exactly is a word?
To be sure, writers invent new words; Shakespeare invented a
great many.
It also goes without saying
(but I am saying it) that words should not be expected to appear in the
dictionary before
they are invented. Nor should every newly made up word be lookupable
in a pocket dictionary. However, it seems only fair
to expect legitimate candidate words to find their way into a
moderately
large lexicon within, say, a hundred years after their
first appearance in a significant book.
A few years ago, in order to play with word puzzles and such, I
compiled a Lexicon containing more than a quarter million English
language
words, about as many as the total number of words counting
repetitions in Ulysses.
Words in this lexical database were gleaned from multiple
on-line sources, but not from the book Ulysses itself.
So the question occurred to me: “What are the odds that a
random ‘word’ from Ulysses exists in this database?”
In my student days this kind of puzzle was called an
empirical question. It should only be necessary to check all
unique words in the book against the contents of the selected reference
Lexicon. Such a comparison would ignore case, of course, and
perhaps should include a few additional tweaks.
For a meaningful calculation it might be necessary to eliminate
proper names and non-English language words (Ulysses has more than a
few of these). In the end it may not be practical to give a precise
answer to the odds question, either because the question is inherently
ambiguous or because subtleties were overlooked, or purposely ignored,
in preparing the Ulysses lexicon.
The text of Ulysses is readily
available for analysis—or for that matter, for reading—thanks to Project Gutenberg.
To the computer programmer, words are more or less equivalent to
space-delimited substrings, after superfluous punctuation has been
removed. Internal punctuation like hyphens and apostrophes might
need to be preserved. I know of no perfect formula for
extracting words from text, but space-delimited substrings are a fair
approximation or starting point. A more rigorous approach would be
needed for real research.
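For readers who do not use MUMPS, the approximation can be sketched in a few lines of Python. This is only an illustration, not the code behind the numbers in this post; the function name `quasi_words` and the exact set of characters trimmed are my own assumptions.

```python
import string

# Characters trimmed from the ends of each space-delimited piece.
# Assumption: ASCII punctuation plus curly quotes and the long dash.
EDGE_PUNCT = string.punctuation + "“”‘’—"

def quasi_words(line):
    """Split a line on whitespace and trim punctuation from both ends.

    Hyphens and apostrophes inside a word are preserved because only
    the edges of each piece are trimmed.
    """
    words = []
    for piece in line.split():
        word = piece.strip(EDGE_PUNCT).lower()
        if word:  # skip pieces that were nothing but punctuation
            words.append(word)
    return words
```

One known weakness of trimming the edges this way: a leading apostrophe that belongs to the word (as in 'tis) is lost.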
In any case, it should be interesting to
compare the size of the Ulysses
lexicon, as computed by this approximation method, to the Wikipedia
article’s 30,000 number. Such a comparison would serve as a rough
validity check
on the method.
And the number is 30,800.
On quick inspection of the imported Ulysses
lexicon, more than a few
words have a
long dash stuck to them (a double hyphen ‘--’ in the text rendering of
the book). Re-do the import, removing long
dashes. Good, now the number is 30,200. I did not
screen
out proper names or non-English words, so 30,200 will do—it is close enough.
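The effect of the double hyphen can be shown with a small Python sketch. The function name `unique_words`, the punctuation set, and the two sample lines are assumptions made up for the illustration; they are not taken from the MUMPS import.

```python
def unique_words(lines, split_dashes):
    """Collect distinct lowercase quasi-words from an iterable of lines.

    When split_dashes is True, every '--' is replaced by a space before
    splitting, so the long dash no longer glues words together.
    """
    seen = set()
    for line in lines:
        if split_dashes:
            line = line.replace("--", " ")
        for piece in line.split():
            # Hyphens are deliberately not trimmed here, to mimic the
            # first import where the dash stayed stuck to the word.
            word = piece.strip(".,;:!?\"'()").lower()
            if word:
                seen.add(word)
    return seen

# Made-up sample lines in the style of the plain-text rendering.
sample = ["--Come up, Kinch!", "Come up--come."]
before = unique_words(sample, split_dashes=False)
after = unique_words(sample, split_dashes=True)
```

On this sample the first pass keeps "--come" and "up--come" as separate entries, while the second pass folds them into "come" and "up", which is the same kind of shrinkage seen in the real counts.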
The beauty of the MUMPS programming language is that one can
do this sort of quick study very quickly. Creating a Lexicon
database and the supporting code to read in the book, clean up the
words, file them, and count the occurrences of each took only about an
hour. I will summarize the method below.
But before doing that, let’s look at a few imported words. Ha! The
74th “H” word (in alphabetic order) is “hairynostrilled.”
I do not remember it, but
am nevertheless confident of two things: 1) it is in Ulysses
and 2) it will not be in the other-sourced Lexicon. True
enough, it is here, “Ben Jumbo Dollard, Rubicund, musclebound,
hairynostrilled,...” and it is not in the comparison lexicon.
Let’s try an on-line source http://www.oxforddictionaries.com/
—

Perhaps intentionally selecting a compound word that looked funny was not fair...
The 541st “M” word (in alphabetic order) is one I don’t know, but it looks
like a word and I probably should know it. It is maugre:
“But sir Leopold was passing grave maugre his word by cause
he still had...” Yes, it is in the big Lexicon and it is also in
the dictionary. The on-line dictionary gives “bad pleasure” (the sense
of its Old French root; in English the word means “in spite of”).
James Joyce knew something about that subject.
Let’s try a “T” word. Well, I
have just learned another word, tatterdemalion:
“feeble goosefat whore in a tatterdemalion gown of mildewed
strawberry...” It is in the big lexicon and apparently means
something like ragamuffin.
Skimming the Ulysses lexicon is revealing. The majority of words
that I do not recognize fall into four groups: proper names such as
Poulaphouca (a place in County Wicklow), foreign words (generally part
of a quoted phrase), compound words like the first example above, and
sounds rendered as words.
I wish there were a convenient
way to filter proper names and foreign words. My revised
expectation, though, based on a quick scan of the Ulysses lexicon is
that a smaller proportion of terms will fail lookup in the larger
lexicon than I had originally thought.
Having started this exercise, I may as well finish it.
Setting aside many valid objections (cases that should be excluded and
so forth), we compare the Ulysses lexicon to the considerably larger one
that is based on other sources. The answer: approximately 1/4 of the
unique terms in Ulysses (including foreign words, proper nouns, sounds,
run-together words, and so forth) are not in the
large lexicon, which contains only English language words and not very
many proper nouns.
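The comparison itself amounts to a case-insensitive set difference. Here is a minimal Python sketch of the idea; the function name `missing_fraction` and the tiny word lists are hypothetical stand-ins, not the real lexicons.

```python
def missing_fraction(book_words, reference_words):
    """Return the book words absent from the reference, and their share.

    Both inputs are lowercased first, so the comparison ignores case.
    The share is measured against the number of unique book words.
    """
    book = {w.lower() for w in book_words}
    reference = {w.lower() for w in reference_words}
    missing = sorted(book - reference)
    share = len(missing) / len(book) if book else 0.0
    return missing, share

# Toy data only: four terms from the book against a three-word "lexicon".
toy_book = ["maugre", "tatterdemalion", "hairynostrilled", "Poulaphouca"]
toy_reference = ["Maugre", "tatterdemalion", "ragamuffin"]
missing, share = missing_fraction(toy_book, toy_reference)
```

With the toy data, half of the terms fail lookup. The same calculation over the full lexicons is what produces the figure of roughly one quarter reported above.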
In conclusion, no conclusion is
possible, except that I should probably stay away from lexical analysis.
For anyone who programs in MUMPS and is familiar with the MUMPS
File Manager, the data dictionary for the book lexicon file is:
A.) FILE NAME:------------- BOOK LEXICON
B.) FILE NUMBER:----------- 29340.5
C.) NUM OF FLDS:----------- 4
D.) DATA GLOBAL:----------- ^SIS(29340.5,
E.) TOTAL GLOBAL ENTRIES:-- 30220
F.) FILE ACCESS:
      DD______ @
      Read____ @
      Write___ @
      Delete__ @
      Laygo___ @
G.) PRINTING STATUS:-- Off
================================================================================
.01    WORD [0;1] [RF]
1      BOOK [1;0] [29340.51PA] <-Mult
  .01    BOOK [0;1] [P1360105']
  .02    COUNT [0;2] [NJ8,0]
The programming steps to populate the
database were approximately as follows:
- Read the text file into a scratch global.
- Inspect the global to determine where the book
begins and ends (i.e., eliminating publisher data, etc.).
- For each line in the book, remove extraneous
punctuation, convert case, split into space pieces (quasi-words).
- For each word either add it to the Lexicon or
increment the count if it is already there.
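The steps above can be sketched in Python for readers without a MUMPS system at hand. This is a rough equivalent under stated assumptions, not the routine I actually ran: the marker strings `START` and `END` are guesses at how the publisher header and footer are delimited (check them against the file), and the names `book_lines` and `build_lexicon` are made up for the example.

```python
from collections import Counter
import string

START = "*** START OF"  # assumed marker for the end of the publisher header
END = "*** END OF"      # assumed marker for the start of the publisher footer

def book_lines(lines):
    """Yield only the lines between the START and END marker lines."""
    inside = False
    for line in lines:
        if line.startswith(START):
            inside = True
        elif line.startswith(END):
            break
        elif inside:
            yield line

def build_lexicon(lines):
    """Map each lowercase quasi-word to its number of occurrences."""
    counts = Counter()
    for line in book_lines(lines):
        # Treat the double hyphen as a separator, then split on spaces.
        for piece in line.replace("--", " ").split():
            word = piece.strip(string.punctuation).lower()
            if word:
                counts[word] += 1  # adds the word or increments its count
    return counts
```

An open text file can be passed straight to `build_lexicon`, since iterating a file yields its lines. The Counter plays the role of the word file with its COUNT field, and `len()` of the result gives the number of unique terms.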
The time required to parse and file all
the words in Ulysses
on my now retired quad-core AMD was about 2 seconds. The time
required to test words against the larger lexicon was negligible.