Stemming with PyPy and Python

Speed comparison

For a project on maliciousness detection that I am working on, I needed an unsupervised stemming method. We were examining the role that text cleanup plays in the classification task. This would become especially important as we investigated other feature extraction methods like dependency triples.

The main problem was that we needed a system that would work for both English and German data. In both cases, we were using social media data.

I'll make another post explaining why this text cleanup needed to be done and how YASS works, along with some qualitative examples of the performance achieved.

The main point of this post is that PyPy drastically sped up the stemming process.

What is PyPy?

PyPy is an alternative Python engine that compiles code to machine code instead of running it on top of the PVM (Python Virtual Machine). Similar to the JVM, Python's PVM is a layer that executes the bytecode generated from your source code. Even though the standard CPython interpreter is written in C, this bytecode-interpretation layer still carries a large amount of overhead.

PyPy uses just-in-time (JIT) compilation to get to machine code. With just-in-time compilation, only the functions that are actually executed are run through the compiler. Ordinary compiled languages like C or Go have separate compilation and execution stages. For example, in C you could run gcc myprogram.c -o myprogram to generate a binary that can then be run using ./myprogram. PyPy and other JIT-compiled languages don't have separate compile and execute steps: the program is compiled as it is being executed, and only the pieces that need to be compiled are compiled. Typically, the compiled version of a function is specialized to the types of the arguments it is called with. For example, calling this Python function:

def simple_mul(num1, num2):
    """
    This function simply multiplies two numbers together and
    is just meant to be an example
    """
    return num1 * num2

as simple_mul(3, 9) would compile a version where both arguments are integers, while simple_mul(3.2, 9) would compile a version where the first argument is a float and the second an integer.

This highly specific compilation is part of what makes Julia and PyPy so fast.
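To make the JIT idea concrete, here is a minimal, self-contained sketch (made up for this post, not taken from the stemming code) that calls simple_mul in a tight loop. Under CPython every call goes through the generic bytecode path, while PyPy compiles one specialized machine-code version for the repeated int/int call site, so the identical file runs noticeably faster under pypy; the exact numbers will vary by machine.

import time

def simple_mul(num1, num2):
    return num1 * num2

def hot_loop(n):
    # the argument types never change here, so PyPy's JIT can emit a
    # single specialized version of simple_mul for this call site
    total = 0
    for i in range(n):
        total += simple_mul(i, i + 1)
    return total

start = time.perf_counter()
hot_loop(10 ** 7)
print("elapsed: %.3f seconds" % (time.perf_counter() - start))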

An issue early on with PyPy was support for packages that rely on C extensions (Python bindings to C code). These kinds of packages are common in data science and statistics work, because C extensions are the usual way to get high performance out of Python. Libraries like numpy and scipy were not functional on the PyPy engine.

However, recent advances have made these packages work quite well.

Unfortunately, scikit-learn does not work at the moment. Once it does, I will switch over to PyPy completely for all projects. As it is right now, I use PyPy for data extraction and preprocessing and then use vanilla Python for the classification stage.

What kind of speed-up did it achieve?

The PyPy engine ran the code about twice as fast as the raw Python version. With the full dataset of 15616 unique tokens, the Python implementation took 1114.459 seconds, or about 18.5 minutes. In contrast, the PyPy engine took 548.275 seconds, or about 9 minutes. This required no changes to the original Python implementation.
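Reproducing that comparison needs nothing more than wrapping the run in a timer and executing the identical file under both interpreters. A rough sketch of that harness is below; stem_full_vocabulary is only a stand-in for the real stemming entry point, which is not shown here.

import platform
import time

def stem_full_vocabulary():
    # stand-in workload; in the real script this is the unmodified
    # stemming code, identical under CPython and PyPy
    return sum(i * i for i in range(5 * 10 ** 6))

start = time.perf_counter()
stem_full_vocabulary()
elapsed = time.perf_counter() - start
print("%s %s: %.3f seconds" % (platform.python_implementation(),
                               platform.python_version(), elapsed))

Running the file once with python and once with pypy (the file name is up to you) produces one timing per interpreter, labelled with the implementation that produced it.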

A number of different vocabulary sizes were also tried to see how the performance results changed.

Figure: scaling graph for PyPy and Python running the YASS stemming distance metric 2. As the problem size increases, PyPy's running time remains a fraction of Python's.
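For the vocabulary-size runs, the same harness extends to a sweep over subsampled vocabularies. The sketch below is illustrative only: run_stemmer is a cheap placeholder for the real pairwise-distance and clustering code, and the sample sizes are arbitrary.

import random
import time

def run_stemmer(tokens):
    # placeholder that just walks every word pair, so the quadratic
    # growth of the real distance computation shows up in the timings
    pairs = 0
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            pairs += 1
    return pairs

full_vocab = ["token%d" % i for i in range(15616)]
for size in (1000, 2000, 4000, 8000, len(full_vocab)):
    sample = random.sample(full_vocab, size)
    start = time.perf_counter()
    run_stemmer(sample)
    print("%d tokens: %.3f seconds" % (size, time.perf_counter() - start))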

During execution, I noticed that file IO consumed a much larger share of the total execution time under PyPy, while the actual stemming logic took a smaller share.

However, I did not quantify this. In the future, I would like to write a version of this program in Julia, with benchmarking of the initial file read, the calculation of the distances between every pair of words, the construction of the minimum spanning tree, and the writing of the results out to a file. Doing so would clarify where the performance profiles of Python, PyPy, and Julia differ.
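That breakdown is also easy to add to the current Python/PyPy version. Below is a sketch of what the stage-by-stage timing could look like; each with-block body is a stand-in for the real step (file read, YASS distance calculation, minimum spanning tree construction, and writing the results), not the actual code.

import time
from contextlib import contextmanager

@contextmanager
def timed(stage):
    # report how long one stage of the pipeline takes
    start = time.perf_counter()
    yield
    print("%s: %.3f seconds" % (stage, time.perf_counter() - start))

with timed("read tokens"):
    tokens = ["token%d" % i for i in range(1000)]        # stand-in for the file read
with timed("pairwise distances"):
    distances = {(a, b): abs(len(a) - len(b))            # stand-in distance metric
                 for a in tokens for b in tokens if a < b}
with timed("minimum spanning tree"):
    edges = sorted(distances.items(), key=lambda kv: kv[1])  # stand-in for building the MST
with timed("write results"):
    pass                                                 # stand-in for writing clusters to disk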

This Gitea repository holds the code used.

The plot was generated using the Gadfly package for Julia.