I spent last night with Python’s nltk—my initial focus was seeing whether I could make AndyBot smarter/more coherent. The results weren’t great—a trained Hidden Markov Model produced sentences just as incoherent as the current random algorithm (“would friend do noodle using melting sibelius?”). Instead, I used some of the analysis tools in nltk to look at trends in my own tweets.
Organizing The Corpus
The first step was creating a corpus of words for analysis—iterating through my tweets and using nltk.pos_tag to tag the parts of speech in each one.
The result of one of these tagged tweets looks something like this:
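A minimal sketch of that step, assuming the tweets are already loaded as a list of strings; the sample tweet and the tags in the comments are stand-ins rather than real output from my archive:

```python
import nltk

# One-time model downloads for the tokenizer and tagger
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Assume `tweets` is a list of tweet strings pulled from the archive;
# this sample tweet is a stand-in
tweets = ['all my friends are getting married and having children']

tagged_corpus = []
for tweet in tweets:
    tokens = nltk.word_tokenize(tweet.lower())
    tagged_corpus.append(nltk.pos_tag(tokens))

print(tagged_corpus[0])
# e.g. [('all', 'DT'), ('my', 'PRP$'), ('friends', 'NNS'), ('are', 'VBP'),
#       ('getting', 'VBG'), ('married', 'VBN'), ('and', 'CC'),
#       ('having', 'VBG'), ('children', 'NNS')]
```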
What I Talk About
I analyzed the corpus for the most common things I say—nouns, verbs, and n-grams.
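Counting the most common nouns and verbs falls out of nltk’s FreqDist over the tagged corpus; a rough sketch, assuming the tagged_corpus built above and the Penn Treebank NN/VB tag prefixes:

```python
from nltk import FreqDist

# Flatten the per-tweet tag lists into one stream of (word, tag) pairs
tagged_words = [pair for tweet in tagged_corpus for pair in tweet]

# Penn Treebank tags: NN* for nouns, VB* for verbs
nouns = FreqDist(word for word, tag in tagged_words if tag.startswith('NN'))
verbs = FreqDist(word for word, tag in tagged_words if tag.startswith('VB'))

print(nouns.most_common(10))
print(verbs.most_common(10))
```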
Nouns:
There are clearly some issues with parsing tweets—‘@’, ‘[’, and ‘]’ are tagged as proper nouns, and ‘i’ is tagged as a plural noun in 76 cases.
The nouns I use are also clearly skewed by gender. Common contexts shared by ‘guy’ and ‘girl’ are ‘the guy/girl who’ and ‘a guy/girl with’.
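Those shared contexts come out of nltk’s Text.common_contexts; a sketch, assuming a flat token list built from the tagged corpus above:

```python
from nltk.text import Text

# Flat list of word tokens across every tweet
all_tokens = [word for tweet in tagged_corpus for word, tag in tweet]

text = Text(all_tokens)
text.common_contexts(['guy', 'girl'])
# prints contexts both words share, e.g. the_who a_with
```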
Verbs:
No surprises here—this stacks up pretty evenly against English as a whole; eight of the most common English verbs appear in the list.
n-grams:
n-grams are groups of n words that appear sequentially within the corpus; AndyBot uses them (specifically 3-grams) when generating tweets.
nltk has the BigramCollocationFinder and TrigramCollocationFinder classes, which I used to find the most frequent bigram and trigram collocations.
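A sketch of that step, ranking by raw frequency and reusing the flat all_tokens list from above:

```python
from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                               TrigramAssocMeasures, TrigramCollocationFinder)

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

# all_tokens is the flat list of word tokens built earlier
bigram_finder = BigramCollocationFinder.from_words(all_tokens)
trigram_finder = TrigramCollocationFinder.from_words(all_tokens)

# Rank n-grams by how often they occur in the corpus
print(bigram_finder.nbest(bigram_measures.raw_freq, 20))
print(trigram_finder.nbest(trigram_measures.raw_freq, 20))
```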
I can trace most of the following 3-grams to specific tweets/subjects: parodies of the “Now That’s What I Call Music” genre, parodies of “it’s not delivery, it’s Digiorno”, and parodies of “all my friends are getting married and having children and I’m…”.
The following bigrams are mostly two-word nouns, proper or otherwise.
Summary
Overall, I’m not surprised to see the most common n-grams/nouns in the corpus of my tweets—they’re all pretty mundane.
The lexical diversity of my tweets was pretty low—23% of words in the corpus were unique.
The calculation was the following:
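It’s the standard lexical diversity ratio, unique words over total words, sketched here with the flat all_tokens list from above:

```python
# Lexical diversity: unique word count over total word count
lexical_diversity = len(set(all_tokens)) / float(len(all_tokens))
print('{:.0%}'.format(lexical_diversity))  # roughly 23% for this corpus
```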
It’s somewhat disheartening, but I’ve been able to make >4000 tweets out of only 4600 different words. I’ll be interested to see whether diversity grows as the corpus does or whether I’ve reached the limit of dumb words to say.