Reddit. A place of unrelenting procrastination and unexpected inspiration. At first glance, a confusing concoction of trite memes, current affairs, cute animals and rampant shitposts. But amidst the apparent chaos one discovers an underlying order: the endless stream of links is neatly segmented into subreddits, each boasting its own sub-community and distinct personality.

I thought it might be fun to investigate the subcultures that exist within these subreddits. With the help of some basic natural language processing, we gain insight into the distinct lexicon of a subreddit. Most likely, the results will simply reinforce our preconceptions, but hopefully we might learn something new.

A quick note: all of the results you see below were generated using a Python app that I’ve made available as open source. You can read more about it here or type git clone https://github.com/jaijuneja/reddit-nlp.git into your terminal. To avoid boring you with details of the implementation, let’s jump straight to the results.

I grabbed a list of the 25 top subreddits from here. For each one, I processed approximately 10,000 recent comments, tokenising them and keeping a running count of every word that appeared. What you see above are the most common words in each subreddit, ranked by their term frequency-inverse document frequency (tf-idf) score. This score reflects how important a word is to a specific document relative to the rest of the corpus. For example, the word “and” might appear very frequently in a subreddit, but since it also appears frequently in all other subreddits it is down-weighted. This approach filters out a lot of uninformative words, and I supplemented it with a stop-word list. I also toyed with stemming (using the Porter stemming algorithm), but found that it wasn’t particularly effective. This explains why you can see both singular and plural forms of certain words appearing above.
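To make the weighting scheme concrete, here is a minimal self-contained sketch of tf-idf in plain Python. It is illustrative only (the function name and toy corpus are my own, not the app’s actual API): each “document” is a tokenised list of comments from one subreddit, and a word appearing in every document gets an idf of log(1) = 0, so it vanishes from the rankings.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each word in each tokenised document (a list of word lists).

    tf  = count of the word within the document
    idf = log(N / number of documents containing the word)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: c * math.log(n_docs / df[w]) for w, c in tf.items()})
    return scores

# Toy corpus of three "subreddits": "and" appears in all of them,
# so its idf is log(3/3) = 0 and it drops out of the rankings.
corpus = [
    ["cat", "and", "dog"],
    ["stock", "and", "market"],
    ["cat", "and", "cat"],
]
scores = tf_idf(corpus)
```

Here `scores[0]["and"]` comes out as exactly 0, while “cat” in the third document scores twice as high as in the first, since tf doubles while idf is shared.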


During my final year at Oxford I spent many nights ruminating on the topic of computer vision. Haggard, bearded, and jaded by the prospect of partying and gallivanting that once dominated my life, I was left with only my thesis. Over the course of a year, I transformed from a youthful Frodo into a schizophrenic Gollum, hunched over a laptop and typing furiously as freshers frolicked beyond the library window.

All jokes and prose aside, my time at Oxford was awesome and irreplaceable. But given that my master’s thesis did, in fact, consume a sizeable chunk of my lifespan, it would be unfortunate if it got lost in the colossal annals of academia. So I shall post it here for the world (i.e. 3 people tops) to read, where it will instead become lost in the deep web.


Paper

Poster
Abstract

This report develops a large-scale, offline localisation and mapping scheme for textured surfaces using the bag-of-visual-words model. The proposed system builds heavily on the work of Philbin et al. and Vedaldi and Fulkerson, taking as input a corpus of images that are rapidly indexed and correlated to construct a coherent map of the imaged scene. Thereafter, new images can be reliably queried against the map to obtain their corresponding 3D camera pose. A simple bundle adjustment algorithm is formulated to enforce global map consistency, exhibiting good performance on real datasets and large loop closures. Furthermore, a proposed submapping scheme provides dramatic computational improvements over the baseline without any deterioration in localisation performance. A comprehensive implementation written in MATLAB as proof of concept is pressure-tested against a variety of textures to examine the conditions for system failure. The application unambiguously outperforms the naked eye, demonstrating an ability to discriminate between very fine details in diverse settings. Tests validate its robustness against broad changes in lighting and perspective, as well as its notable resilience to high levels of noise. Areas for further development to augment the original work and achieve real-time performance are also suggested.

I conducted my research in the Visual Geometry Group under the supervision of Prof Andrea Vedaldi. The project focused on building a system to coherently reconstruct large-scale “textured” scenes using a single camera. The idea was to see whether a computer could reconstruct an environment which, to the human eye, contains very little physical or visual information. A complete architecture for offline localisation and mapping using the Bag-of-Words (BoW) model was proposed and subsequently implemented in MATLAB.

The poster above provides a good overview of the system, but to give you a better sense of what the algorithm does at a high level, consider the image below. First, “interesting” features are extracted from a corpus of training images using the Scale-Invariant Feature Transform (SIFT) – typically distinctive regions of high contrast such as corners and edges. These features are then quantised into visual words using k-means clustering. Once a “vocabulary” has been established, a large set of images can have their features extracted and geometrically matched, so that the scene can be stitched together and reconstructed. Thereafter, a new input image can be rapidly localised within the scene.
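The vocabulary-building and quantisation steps can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the thesis’ MATLAB code: random vectors stand in for real 128-dimensional SIFT descriptors, a tiny Lloyd’s k-means builds the visual vocabulary, and each image is then summarised as a histogram counting how often each visual word occurs.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny Lloyd's k-means: returns k cluster centres (the "visual words")."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the mean of its members
        # (keep the old centre if a cluster goes empty)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def bow_histogram(descriptors, centres):
    """Quantise descriptors to their nearest visual word and count occurrences."""
    dists = np.linalg.norm(descriptors[:, None, :] - centres[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(centres))

# Stand-in for SIFT output: 200 random 128-D descriptors from "training images"
rng = np.random.default_rng(1)
training = rng.normal(size=(200, 128))

centres = kmeans(training, k=10)         # the visual "vocabulary"
hist = bow_histogram(training, centres)  # one BoW vector per image
```

In the real system the vocabulary is far larger and each new image’s histogram is matched against the indexed map (with geometric verification) to localise the camera; the sketch only shows the quantisation idea.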

The MATLAB implementation can be downloaded from the link below.

Download Code

Alternatively, if you use Git, you can clone the repository from GitHub using the command:

git clone https://github.com/jaijuneja/texture-localisation-matlab.git