31 Jul 2012 01:00

language-model dimensionality reduction with Markov-chain distribution medians

In my kragen-tol post earlier this month about [predictive text input
methods][0], I wrote:

> You could do *much better* than SwiftKey with dimensionality reduction
> techniques, which could allow much more context to be brought to bear
> on the prediction problem using a much smaller model.  SwiftKey often
> makes predictions that yield clearly ungrammatical 3-grams, which even
> a simple 3-gram model over part-of-speech tags ought to be able to
> reject.

[0]: http://lists.canonical.org/pipermail/kragen-tol/2012-July/000961.html

Part-of-speech tagging is sort of hard, though, especially on incomplete and
possibly erroneous sentences containing words that aren't in the dictionary.
But it seems like you ought to be able to do a good enough job with simple
statistics for it to be useful.

Now, the rest of this post may be a case of "a few hours of hacking can save
you several minutes in the library", so take that under advisement.  I haven't
checked out the literature enough to know if this idea is already known, let
alone an improvement over what already exists.

One simple approach is to consider a Markov-chain model of text, and cluster
together words whose transition probabilities are similar, either for
in-transitions (what they tend to follow) or out-transitions (what they tend to
precede).
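In a first-order model, each word's out-transitions form one row of the
transition matrix, with one dimension per vocabulary word.  A minimal sketch
of collecting those counts (the tiny corpus here is made up):

```python
from collections import Counter, defaultdict

def out_transitions(words):
    # Follower counts for each word: its out-transition distribution.
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

corpus = "the cat sat on the mat the cat ran".split()
t = out_transitions(corpus)
# t["the"] counts the followers of "the": "cat" twice, "mat" once.
```

Each Counter here is one of the thousands-dimensional vectors the next
paragraph worries about; nearly all of its entries are implicitly zero.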

However, this runs into a practical problem immediately: even with a
first-order Markov model, you're trying to perform unsupervised classification
on thousands of vectors in a space with thousands of dimensions.  This is, as
far as I know, a totally intractable problem, because in high-dimensional
spaces, almost all points are closer to boundaries of the space than they are
to any other point.  So you need to reduce the dimensionality of the term space
before you can reduce the dimensionality of the term space.

Rather than eat my tail through that rathole, I thought maybe I'd pick some
kind of arbitrary characteristic of those vectors to cluster them.  The vectors
are finite discrete probability distributions of preceding or following words,
so given a mapping from the words onto the real line, you can use the usual set
of characteristics of finite discrete probability distributions, such as mean,
median, mode, standard deviation, skewness, kurtosis, and so on.  Of these,
median is dependent only on the ordering of the words, and mode isn't even
dependent on the ordering.
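Concretely, taking alphabetical order as the mapping onto the real line, the
median of a word's follower distribution is just the middle element of its
sorted follower list.  A sketch, with an invented follower multiset:

```python
def follower_median(followers):
    # Median of a finite discrete distribution over words, using
    # alphabetical order as the mapping onto the real line; repeated
    # followers weight the distribution toward themselves.
    ordered = sorted(followers)
    return ordered[len(ordered) // 2]

# An invented follower multiset for the word "shalt":
followers = ["not", "be", "not", "thou", "not", "eat", "be"]
# follower_median(followers) == "not"
```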

So if we cluster these distributions by their median, an idea due to a friend
of mine who should get credit if he wants it, we should be able to easily find
words that are used in similar contexts.

So, grouping the words into clusters that tend to precede the same word, using
the KJV bible for my example text, I got reasonable-looking clusters like the
following.  First, a cluster of mostly verbs, mostly naughty ones, that precede
"not", although interestingly "Is", "am", and "ye" made it in there too:

not: Accuse, Be, Boast, Cast, Cease, Could, Count, Curse, Deceive,
Defile, Desire, Despise, Destroy, Devise, Did, Didst, Distress,
Do, Doth, Dread, Eat, Enter, Fear, Forget, Fret, Give, Grudge,
Harden, Hide, Hurt, Is, Judge, Labour, Lay, Lodge, Lust, Marvel,
Meddle, Mind, Murmur, Neglect, Quench, Regard, Rejoice, Rejoiceth,
Remove, Reprove, Respect, Rob, Slack, Think, Told, Touch, Trust,
Uncover, Was, Weep, Went, Withhold, agreed, altereth, am,
appertaineth, approveth, are, attained, backbiteth, bearest,
believed, believeth, bewray, brawler, bridleth, buildedst,
burdened, canst, casteth, ceased, circumspectly, confesseth,
consent, considerest, consisteth, continueth, could, couldest,
defer, deferred, defile, defileth, delightest, descendeth, dieth,
diggedst, dishonesty, displease, do, does, doeth, doubletongued,
doubt, durst, edify, envieth, epistle, extendeth, falleth, feared,
filledst, followedst, forbid, fret, gathereth, gnaw, grieve,
harden, heareth, hearkenedst, hide, housetop, kept, knew, knewest,
knowest, lacketh, lady, laid, layedst, layeth, let, lingereth,
needed, needest, needeth, obey, obeyed, obeyedst, obeyeth,
observest, oppress, patient, payeth, perceivest, perfection,
prefer, profane, purifieth, rained, receive, regard, regarded,
regardest, regardeth, remained, repented, respected, respecteth,
return, returneth, reviled, sacrificeth, savourest, saw, seek,
servedst, shalt, shone, slack, slumbereth, sobriety, sowest,
spared, spin, staggered, striker, strive, stumbleth, suffereth,
swelled, tarrieth, tarry, they, toil, tortured, touch, travailest,
trow, understand, understood, upbraideth, vaunteth, wasted, wax,
wipe, wist, withdraweth, withheldest, wot, wotteth, wouldest,
wrestle, ye

And second, a cluster of mostly proper names that precede capitalized "And":

And: Ahlai, Ahoah, Alamoth, Allonbachuth, Ammonitess, Anaharath, Aniam,
Anim, Antothijah, Aphekah, Ara, Ardon, Aridatha, Armageddon,
Asareel, Ashbea, Ashima, Asiel, Aspatha, Athach, Athlai, Attalia,
Avith, Azem, Azzan, Bealoth, Beera, Berith, Bethbaalmeon,
Bethphelet, Birzavith, Boscath, Camon, Canaanitess, Caphthorim,
Caphtorim, Chelubai, Chislon, Chronicles, Dalmanutha, Diklah,
Dinhabah, Dothan, EARTH, Edar, Eker, EleloheIsrael, Elpalet,
Enhazor, Ephesdammim, Ephod, Eshbaal, Eshean, Ethnan, Euroclydon,
Ezel, Gaash, Gabbatha, Galeed, Galilaean, Gazer, Gennesaret,
Gilboa, Girgashite, Girgasite, Goath, Hamongog, Hamul, Hararite,
Harum, Hasenuah, Hathath, Hazarsusah, Hazelelponi, Hazezontamar,
Helekites, Hepherites, Hormah, Huphamites, Ibnijah, Imla,
Irshemesh, Ishmeelite, Ivah, Jaasau, Jagur, Japhia, Japho,
...
Tyrannus, Urias, Uzzensherah, Zacher, Zaham, Zarthan, Zered,
Zithri, Zorathites, Zorites, abolish, alloweth, amen, amethyst,
badness, bakers, bat, begging, casement, chamois, clearness,
conquer, consist, crew, cupbearer, cymbal, delicacies, dropsy,
dryshod, effected, excused, fishhooks, flatteries, foals,
foreheads, hungered, inclosings, inheriteth, manger, marketplaces,
meadow, meant, mightiest, mouldy, mowings, murrain, occurrent,
pollution, pounds, pricks, recorder, repenting, rigour, rushes,
shrubs, size, socket, sorrowing, stanched, target, tens,
tentmakers, therefrom, traitor, vails, woods, wretchedness

Unsurprisingly, the "begat" cluster is entirely proper names:

begat: Abia, Abishua, Abiud, Achaz, Achim, Amariah, Aminadab,
Amminadab, Attai, Azor, Bukki, Canaan, Coz, Eliud, Ephlal, Eshton,
Jekamiah, Joatham, Joiada, Josaphat, Josias, Kish, Manasses,
Matthan, Mehujael, Meonothai, Meraioth, Meribbaal, Methusael,
Mikloth, Mizraim, Ner, Obed, Ozias, Phares, Pharez, Ram, Roboam,
Zerahiah, Zorobabel

All in all, the 14000 or so distinct words get clustered by this approach into
1793 clusters, of which 1185 have only one word.  Many of the smaller clusters
very clearly express some kind of affinity between words:

support: feebleminded, financial, volunteer
falsely: science, swearing, you
Where: Puteoli, Telassar, Thelasar
brother: weak, younger, youngest
appear: flowers, menchildren, plainly
honour: husbands, retaineth, without
hither: Hasten, Reach, description
Duke: Jetheth, Mibzar, Pinon
branches: Took, fruitful, natural
She: Tirhanah, distaff, tributary
besides: Stephanas, loins, self
knoweth: Man, certainly, unjust
Till: intermission, purity, stocks
slept: Jehoiakim, Menahem, Omri
died: Achbor, Chilion, Jether, Pirathonite, Seled, Terah
When: Hareth, discretion, guile, patrimony, thereat, unequal
F: DAMAGE, PARAGRAPH, PURPOSE, below, problem, provisions
such: Defects, lettest, marvels, perhaps, pigeons, saveth
year: eighteenth, eightieth, fiftieth, fortieth, nineteenth, thirtieth
cities: Berothai, Chun, defenced, fenced, ruined, thirteen
children: Little, Zebedees, backsliding, beget, impudent, sottish
two: Asuppim, Telaim, Teresh, calamus, containing, spearmen
So: corpses, debt, eater, freewoman, peacocks, saying
none: So, burned, finding, put, quarter, smiths
might: I, mortality, mother, purpose, scripture, whereto
bare: Adah, Aramitess, Bashemath, Hammoleketh, Jehudijah, Milcah
OR: DISTRIBUTE, EXPRESS, MERCHANTIBILITY, PUNITIVE, REPLACEMENT,
WARRANTY
some: Arm, oppressed, save, sixtyfold, sowed, thirtyfold

It seems like this kind of clustering might be a useful dimensionality-
reduction technique for natural-language processing, in general; and perhaps it
could be iterated to reduce the numbers further.  For example, when applying
Katz smoothing to a language model, as an alternative to backing off to shorter
N-grams when you don't have enough N-grams for a given value of N, you could
back off to N-grams computed over a vocabulary of these clusters.
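A sketch of that backoff structure (this is stupid-backoff-style scoring
rather than Katz smoothing proper; every name and the 0.4 penalty constant
are illustrative):

```python
def bigram_prob(w1, w2, word_big, cluster_big, cluster_of, unigrams, total):
    # word_big / cluster_big: bigram counts over words / over cluster labels;
    # cluster_of maps a word to its cluster label; unigrams: word counts.
    if (w1, w2) in word_big:
        return word_big[w1, w2] / unigrams[w1]
    c1, c2 = cluster_of.get(w1), cluster_of.get(w2)
    if c1 is not None and (c1, c2) in cluster_big:
        # Back off to the cluster vocabulary instead of to shorter N-grams.
        c_total = sum(v for (a, _), v in cluster_big.items() if a == c1)
        return 0.4 * cluster_big[c1, c2] / c_total
    # Final fallback: penalized unigram estimate.
    return 0.4 * 0.4 * unigrams.get(w2, 0) / total
```

The point of the middle branch is that a bigram never seen over words may
still be well attested over clusters, so the estimate keeps more context
than a plain unigram backoff would.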

Efficient computation
---------------------

The distribution medians can be computed efficiently from a suffix array.
Although I'm not familiar with the
performance characteristics of modern suffix-array construction algorithms (I
only know that they're linear-time), I suspect that computing the suffix array
is, in practice, more expensive than computing the pdfs of the contexts using a
hash table, then computing their cdfs in order to find the median, which I seem
to recall is what you were doing; but if you're already paying the cost of
building a suffix array for some reason, this may be useful.  For example, I'm
thinking about using a suffix array for Katz smoothing of a language model.

To find the distribution median of followers for a single context:

1. Find the beginning B and end E of that context's span in the suffix array
--- that is, the first index B such that text[SA[B]:].startswith(context),
and the last index E meeting the same criterion.  This is O(lg len(text)).
2. Find the median index M = (B+E)/2 (or, to avoid integer overflow,
B+(E-B)/2).
3. Find text[SA[M]+len(context)], and that's your median follower character.

Step 3 is slightly more complicated if you want something more than a single
character, but not much; this is an advantage of the suffix-array structure.
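A sketch of those three steps in Python, with a naive suffix-array
construction and character-level contexts (all names are illustrative):

```python
def build_suffix_array(text):
    # Naive O(n^2 log n) construction; fine for a sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])

def context_span(text, sa, context):
    # Step 1: binary-search the first and last suffixes starting
    # with `context`, O(lg len(text)) comparisons each.
    def first(pred):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + len(context)]):
                hi = mid
            else:
                lo = mid + 1
        return lo
    b = first(lambda s: s >= context)     # beginning B
    e = first(lambda s: s > context) - 1  # (inclusive) end E
    return b, e

def median_follower(text, sa, context):
    b, e = context_span(text, sa, context)
    if b > e:
        return None                       # context never occurs
    m = (b + e) // 2                      # step 2: median index M
    pos = sa[m] + len(context)            # step 3: the median follower
    return text[pos] if pos < len(text) else ''
```

Because the suffixes in the span are sorted, the followers within it are
sorted too, which is why the middle entry of the span is the median.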

Finding the distribution median for all M contexts this way is
O(M lg len(text)), which might be less than O(len(text)).  It's very likely
faster than computing the cdfs, which is 256*M work, albeit sequential.

There's a much simpler linear-time algorithm to get them all at once, of
course: read through the suffix array in order, remembering the last boundary
between contexts.
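That linear scan might look like this (again a sketch, character-level, with
a naive suffix-array construction):

```python
def build_suffix_array(text):
    # Naive construction, good enough for a sketch.
    return sorted(range(len(text)), key=lambda i: text[i:])

def all_median_followers(text, k):
    # One left-to-right pass over the suffix array: whenever the
    # k-character context changes, the run just ended was one context's
    # span, and its middle entry holds that context's median follower.
    sa = build_suffix_array(text)
    medians, start = {}, 0
    for i in range(1, len(sa) + 1):
        ctx = text[sa[start]:sa[start] + k]
        if i == len(sa) or text[sa[i]:sa[i] + k] != ctx:
            m = (start + i - 1) // 2
            pos = sa[m] + k
            medians[ctx] = text[pos] if pos < len(text) else ''
            start = i
    return medians
```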

Simple implementation
---------------------

This is the code I used to get the results above.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import sys
    import re
    import textwrap

    def words(file, pattern='[a-zA-Z]+'):
        return (word for line in file for word in re.findall(pattern, line))

    def ngrams(n, words):
        queue = []
        for word in words:
            queue.append(word)
            if len(queue) == n:
                yield tuple(queue)          # Lists aren't hashable.
                queue.pop(0)

    def multimap(kvpairs):
        rv = {}
        for key, value in kvpairs:
            if key not in rv:
                rv[key] = []
            rv[key].append(value)
        return rv

    following = multimap(ngrams(2, words(open(sys.argv[1]))))
    clusters = multimap((sorted(next_words)[len(next_words)//2], prev_word)
                        for prev_word, next_words in following.iteritems())
    items = sorted(clusters.iteritems(),
                   key=lambda (name, contents): len(contents))

    for next_word, prev_words in items:
        for line in textwrap.wrap(', '.join(sorted(prev_words)),
                                  initial_indent=next_word+': ',
                                  subsequent_indent='    '):
            print line
