A large portion of my current research on consumer reviews of hotels, with coauthor Alex Chaudhry, revolves around processing online review data that includes a lot of text (200k+ reviews, or about 34m+ words). One of the ways we wanted to explore the data is through grammatical structure. For example, reviews with a lot of verb phrases might indicate a consumer detailing his or her personal experiences (we: stayed at, visited, checked in, called, complained), whereas reviews with relatively more adjective phrases may indicate more abstract sentiment (beautiful, extravagant, dingy, scary: hotel!). There are plenty of great packages in Python to help with our NLP needs (we use these: Pattern, NLTK, TextBlob, Gensim). The problem with NLP, and part-of-speech (POS) tagging in particular, is that it is relatively computationally intensive. For example, a single hotel with 10,000 reviews (OK, our setting, Vegas, isn’t typical…) may take 10 minutes to process. With any interesting number of hotels, this makes testing ideas quite a pain.
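To make the verb-versus-adjective idea concrete, here is a minimal sketch, not our actual pipeline. It assumes each review has already been turned into (word, tag) pairs with Penn Treebank tags, which is the format a tagger such as NLTK's `pos_tag` returns. It also counts individual tags rather than full phrases, which is a simpler proxy. The function name `pos_shares` and the hand-tagged sample are made up for illustration.

```python
from collections import Counter


def pos_shares(tagged_tokens):
    """Return the share of verb and adjective tokens in one tagged review.

    Assumes Penn Treebank tags: verbs start with "VB", adjectives with "JJ".
    """
    tagged_tokens = list(tagged_tokens)
    counts = Counter()
    for _word, tag in tagged_tokens:
        if tag.startswith("VB"):
            counts["verb"] += 1
        elif tag.startswith("JJ"):
            counts["adjective"] += 1
    total = len(tagged_tokens)
    if total == 0:
        # An empty review has no meaningful shares; report zeros.
        return {"verb": 0.0, "adjective": 0.0}
    return {
        "verb": counts["verb"] / total,
        "adjective": counts["adjective"] / total,
    }


# Hand-tagged sample (illustrative only, not real review data).
sample = [
    ("We", "PRP"), ("stayed", "VBD"), ("two", "CD"), ("nights", "NNS"),
    ("and", "CC"), ("complained", "VBD"), ("about", "IN"),
    ("the", "DT"), ("dingy", "JJ"), ("room", "NN"),
]
shares = pos_shares(sample)
print(shares)
```

A review that scores high on the verb share would, under this rough proxy, lean toward the "personal experience" style, and one that scores high on the adjective share toward the "abstract sentiment" style.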
One way to improve the processing time is to parallelize your code. For the uninitiated researcher (here are slides for a conceptual understanding of parallel computing), this may sound like a daunting task, but I promise it is far easier than you might imagine! The tagging of each review can be done independently of every other review, so on a multi-core computer we should be able to use all the cores simultaneously to improve our run times. There is a VERY easy way to declare parallel functions within the iPython Notebook environment. Basically, turn on your engines in the cluster tab of your notebook’s landing page, add a few lines of code to declare a view of the available engines, and add or change a few lines of your code to run in parallel (see our reference code). With these slight adjustments, we were able to cut compute time per hotel to about 2-4 minutes on a standard 4-core PC or laptop. This is a pretty good improvement for just 5 minutes of work, but what if we want to do better? Buy a powerful computer? Well… yes.
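Because each review is tagged independently, the work maps cleanly onto a pool of worker processes. The sketch below illustrates that pattern with the standard library's `concurrent.futures` rather than the iPython engines described above, so it is a stand-in for our reference code, not a copy of it. `tag_review` is a hypothetical placeholder for the expensive per-review tagging step, and `tag_all` and its `workers` parameter are names invented for this example.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def tag_review(text):
    """Placeholder for the expensive per-review step (hypothetical).

    A real version would run a POS tagger; this one just counts tokens.
    """
    return len(text.split())


def tag_all(reviews, workers=None):
    """Apply tag_review to every review, spreading the work across processes.

    Results come back in the same order as the input reviews.
    """
    reviews = list(reviews)
    if workers is None:
        workers = os.cpu_count() or 1
    if workers <= 1 or not reviews:
        # Serial fallback: handy for debugging and for single-core machines.
        return [tag_review(r) for r in reviews]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches reviews so each worker gets fewer, larger tasks.
        return list(pool.map(tag_review, reviews, chunksize=64))
```

When running this as a script with more than one worker, call `tag_all` from under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the main module. The serial fallback makes it easy to check that the parallel and single-process runs return the same results before scaling up.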
If you have access to a multicore server, awesome! Load your Python setup onto the server and run the code there. But even if you don’t, Amazon Web Services’ EC2 product has made it extremely easy and cheap to run code on clusters with up to 32 nodes and 60+ GB of RAM. Requesting a spot instance of such a setup can cost as little as 30 cents per hour. For notes on how to set up your iPython environment on EC2 (or a similar Linux server), see Randy Zwitch’s blog. After porting my code over to EC2, I was able to bring the run time for each hotel down to approximately 19 seconds, or about a 96% reduction in compute time! This reduction has been invaluable for my productivity, and I hope those new to parallel computing and Python can take advantage of this easy-to-implement optimization step.
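For readers who want to check the arithmetic, here is a small sketch that turns the round figures quoted in this post (about 10 minutes per hotel serially, 2-4 minutes on 4 cores, about 19 seconds on EC2) into percent reductions and speedup factors. The helper names are invented for this example, and the inputs are approximate timings, not careful benchmarks.

```python
def percent_reduction(baseline_seconds, new_seconds):
    """Percent of the baseline run time that was eliminated."""
    return 100.0 * (1.0 - new_seconds / baseline_seconds)


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


BASELINE = 10 * 60  # roughly 10 minutes per hotel, single process

# Approximate per-hotel timings quoted in the post, in seconds.
scenarios = {
    "4-core laptop (fast end)": 2 * 60,
    "4-core laptop (slow end)": 4 * 60,
    "EC2": 19,
}

for label, seconds in scenarios.items():
    print(
        f"{label}: {percent_reduction(BASELINE, seconds):.1f}% less time, "
        f"{speedup(BASELINE, seconds):.1f}x faster"
    )
```

With these inputs the EC2 figure lands between 96 and 97 percent, which matches the rounded number above.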