Reddit. A place of unrelenting procrastination and unexpected inspiration. At first glance, a confusing concoction of trite memes, current affairs, cute animals and rampant shitposts. But amidst the apparent chaos one discovers an underlying order: the endless stream of links is neatly segmented into subreddits, each boasting its own sub-community and distinct personality.

I thought it might be fun to investigate the subcultures that exist within these subreddits. With the help of some basic natural language processing, we can gain insight into the distinct lexicon of each subreddit. Most likely, the results will simply reinforce our preconceptions, but we might also learn something new.

A quick note: all of the results you see below were generated using a Python app that I’ve made available as open source. You can read more about it here or type git clone https://github.com/jaijuneja/reddit-nlp.git into your terminal. To avoid boring you with details of the implementation, let’s jump straight to the results.

I grabbed a list of the 25 top subreddits from here. For each one, I processed approximately 10,000 recent comments, tokenising them and keeping a running count of every word that appeared. What you see above are the most common words in each subreddit, ranked by their term frequency-inverse document frequency (tf-idf) score. This score reflects the importance of a word within a specific document relative to the rest of the corpus — in our case, within one subreddit relative to all of the others. For example, the word "and" might appear very frequently in a subreddit, but since it also appears frequently in every other subreddit it is down-weighted. This approach filters out a lot of uninformative words, and I supplemented it with a stop-word list. I also toyed with stemming (using the Porter stemmer algorithm), but found that it wasn't particularly effective. This explains why you can see both singular and plural forms of certain words appearing above.
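To make the pipeline concrete, here is a minimal sketch of the tokenise-count-score steps described above. The sample comments, the stop-word list and the helper names are my own illustrative stand-ins, not taken from the reddit-nlp code; the tf-idf formula is the standard one (term frequency times the log of inverse document frequency), where each subreddit's pool of comments is treated as one "document":

```python
import math
import re
from collections import Counter

# Hypothetical sample data: a couple of comments per subreddit, standing in
# for the ~10,000 real comments processed per subreddit.
subreddit_comments = {
    "aww": ["look at this cute puppy", "such a cute kitten"],
    "science": ["the study results look promising", "a new study on gene editing"],
}

# Tiny illustrative stop-word list (a real one would be much longer).
STOP_WORDS = {"a", "the", "at", "on", "this", "such"}

def tokenise(text):
    """Lowercase a comment and split it into word tokens, dropping stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

# Running count of tokens per subreddit — each subreddit is one "document".
term_counts = {
    sub: Counter(tok for comment in comments for tok in tokenise(comment))
    for sub, comments in subreddit_comments.items()
}

def tf_idf(word, sub):
    """tf-idf of a word within one subreddit, relative to all subreddits."""
    tf = term_counts[sub][word] / sum(term_counts[sub].values())
    # idf down-weights words that appear across many subreddits.
    docs_containing = sum(1 for counts in term_counts.values() if counts[word] > 0)
    idf = math.log(len(term_counts) / docs_containing)
    return tf * idf

# Rank a subreddit's words by tf-idf score.
top_words = sorted(term_counts["aww"], key=lambda w: tf_idf(w, "aww"), reverse=True)
```

Note how "look", which occurs in both subreddits, gets an idf of log(2/2) = 0 and so scores zero — the same mechanism that suppresses words like "and" in the real data.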
