<h1>Scrapyard Deep Learning</h1>
<h1>Motivation</h1>
<p>I am a PhD student in Computational Linguistics. As such, I need to experiment with deep learning frameworks but have little money to build a powerful deep learning machine.</p>
<p>Unlike my last build, which was centered around an open source deep learning stack provided by AMD, this build is designed to be as cheap as possible.</p>
<p>There are some things that I included in this build that I just had lying around and some things that I had to buy to assemble it. I will break down the cost for both types of materials.</p>
<h2>Chassis</h2>
<p>Building a deep learning machine in an old gaming computer case would probably be ideal. However, those kinds of systems have held their value better than I initially expected. In addition, RAM was a bit of a concern for systems old enough to be within my price range: gaming systems from 2011 or so almost always seemed to be built with 8 GB of RAM on the high end. Of course, this can be expanded, but the highest memory capacity I could achieve with an $80 motherboard seemed to be about 32 GB, and I wanted the flexibility to work with large datasets in the future.</p>
<p>In addition, I am very familiar with old server hardware (due to the number of non-deep-learning-oriented servers I maintain), so I started seeking out used workstations like the HP z620, z820, Lenovo ThinkStation S30, or the Dell Precision T3600. These machines have several advantages:</p>
<ul>
<li>Registered ECC memory for these old systems is very easy to come by and incredibly cheap (I just upgraded the system described in this post with 32 GB of RAM for $35).</li>
<li>The LGA 2011 processors (the E5-2600 and E5-2600 v2 lineups) have AVX extensions, which enable faster floating point calculations. TensorFlow's prebuilt binaries started assuming AVX instructions sometime this past year, so any older system will not perform nearly as well (see the sketch after this list for a quick way to check).</li>
<li>The power supplies are generally overspecced in order to handle GPUs for CAD work.</li>
<li>They don't typically have any of the vendor lock-in that existed in many servers from these brands in the same years.</li>
</ul>
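<p>Here's a quick way to check whether a prospective machine's CPU has AVX. This is just a minimal sketch that reads the Linux CPU flags; it assumes you can get a shell on the machine first:</p>
<pre><code># Minimal AVX check on Linux. TensorFlow's prebuilt binaries assume AVX,
# so the "avx" flag needs to appear in /proc/cpuinfo.
with open("/proc/cpuinfo") as f:
    flags_line = next((line for line in f if line.startswith("flags")), "")

if "avx" in flags_line.split():
    print("AVX supported: prebuilt TensorFlow wheels should run")
else:
    print("No AVX: TensorFlow would need to be built from source")
</code></pre>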
<p>I ended up settling on the HP z620 in particular because:</p>
<ul>
<li>The motherboard provides a large number of SATA connections for attaching more hard drives.</li>
<li>The PSU is quite overspecced, with each of its 6-pin connectors capable of being split into an 8-pin connector.</li>
<li>A second processor can be added via a riser.</li>
<li>More memory can be added compared to the Dell.</li>
<li>z820s are too expensive if the second processor is not a necessity.</li>
<li>It was quiet.</li>
<li>It was cheap.</li>
</ul>
<h2>GPUs</h2>
<p>For the GPUs in this machine, I went with the GTX 1070, for two primary reasons. The first is that this is the cheapest GPU available from NVIDIA with 8 GB of VRAM. VRAM is where the models and data live on the GPU: having more VRAM directly relates to the size of the models you can run as well as the speed with which you can train, since training speed is highly dependent upon batch size and larger batches require more memory. The second reason is that the GTX 1070 typically requires only a single 8-pin power connector. Given that the z620 can supply only two 8-pin connectors, something like the 1080 Ti (which typically needs two power connectors) is only practical if I have no desire to expand to more than one GPU.</p>
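<p>To make the VRAM and batch size relationship concrete, here is a back-of-the-envelope sketch. Every constant in it (parameter count, activation footprint) is a made-up placeholder for a hypothetical model, not a measurement:</p>
<pre><code># Rough VRAM budget for training, in bytes. All constants are
# illustrative assumptions for a hypothetical model, not measurements.
PARAMS = 25_000_000            # hypothetical parameter count
BYTES_PER_FLOAT = 4            # fp32
ACTS_PER_EXAMPLE = 8_000_000   # hypothetical activation floats per example

def training_bytes(batch_size):
    # Weights + gradients + optimizer state come to roughly 3x the
    # parameters; activation memory scales linearly with batch size.
    static = 3 * PARAMS * BYTES_PER_FLOAT
    activations = batch_size * ACTS_PER_EXAMPLE * BYTES_PER_FLOAT
    return static + activations

for batch in (16, 32, 64):
    print(batch, round(training_bytes(batch) / 2**30, 2), "GiB")
</code></pre>
<p>With these made-up numbers, doubling the batch size nearly doubles the memory needed, which is why an 8 GB card lets you train with noticeably larger batches than a 4 GB one.</p>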
<p>The first GPU I selected was a cheap Gigabyte model that I bought off of an online forum.</p>
<p>The second GPU, which I bought after using the machine for a while, was a blower-design HP OEM card from eBay. It was slightly more expensive because I had to pay tax.</p>
<h2>Hard drives</h2>
<p>Data takes up a lot of space and I wanted to have no shortage of spinning metal to put it on. The z620 came with one 500 GB drive, and I had a number of 1 TB 2.5" hard drives left over from a failed attempt to use them in my ProLiant DL380 (the temperature sensors on those drives did not get along with that machine).</p>
<p>The z620 has two 5.25" bays, and only the top one is occupied by a CD drive. I bought a caddy from IcyDock that enabled me to put four 2.5" drives in the free 5.25" bay.</p>
<h1>Cost breakdown</h1>
<table><thead><tr><th> Item </th><th> Cost (USD) </th></tr></thead><tbody>
<tr><td> Chassis </td><td> 200 </td></tr>
<tr><td> GTX 1070 </td><td> 200 </td></tr>
<tr><td> GTX 1070 </td><td> 215 </td></tr>
<tr><td> IcyDock caddy </td><td> 20 </td></tr>
<tr><td> <strong>Total</strong> </td><td> <strong>635</strong> </td></tr>
</tbody></table>
<h1>Building an AMD Deep Learning Machine: Part 3</h1>
<p>The biggest question is: how does this perform? Yes, the stack is open source, uses a driver already integrated into the kernel, and can run TensorFlow, PyTorch, and Caffe, but how well does it do all of that?</p>
<p>Some results (training throughput in images per second) are provided from <a href="https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/" rel="noopener noreferrer">lambda labs</a> for comparison with the Vega 56.</p>
<table><thead><tr><th> Model / GPU </th><th> Vega 56 </th><th> 1080 Ti </th></tr></thead><tbody>
<tr><td> ResNet-50 </td><td> 145.19 </td><td> 203.99 </td></tr>
<tr><td> Inception v3 </td><td> 67.08 </td><td> 130.2 </td></tr>
<tr><td> VGG16 </td><td> 80.57 </td><td> 133.16 </td></tr>
</tbody></table>
<p>The Vega GPU is roughly half as fast as the 1080 Ti on the worst performing model (Inception v3). The gap is closest for ResNet-50, and that result was actually achieved by turning on ROCm Fusion. This fusion operation seems to rewrite the computation graph to combine multiple operations into a single convolution where possible.</p>
<p>To enable this, run <code>export TF_ROCM_FUSION_ENABLE=1</code> inside the docker container before starting a tensorflow workload. Perhaps the other models would have been closer in performance to the 1080 Ti with this setting. Unfortunately, I was not able to perform very rigorous testing, as I was building this machine for someone else. I would like to try out ROCm Fusion as well as <a href="https://github.com/RadeonOpenCompute/ROCm/issues/463" rel="noopener noreferrer">undervolting and overclocking the card</a>. Undervolting should reduce the amount of heat and fan noise, allowing the card to maintain higher boost frequencies.</p>
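<p>If you are driving the benchmark from a Python script instead of the shell, the flag can also be set from within Python. A minimal sketch, assuming the variable is read when TensorFlow initializes, so it must be set before the import:</p>
<pre><code>import os

# Assumption: TF_ROCM_FUSION_ENABLE is read during TensorFlow start-up,
# so it has to be in the environment before tensorflow is imported.
os.environ["TF_ROCM_FUSION_ENABLE"] = "1"

import tensorflow as tf  # imported deliberately after the environment tweak
print(tf.__version__)
</code></pre>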
<h1>Conclusion</h1>
<p>While this build wasn't for me, I would certainly put one together for myself if I had an extra thousand dollars to spend on it. After tax and everything, the entire build was $996.17. The value for performance with the Vega GPU is actually pretty decent: I got the Vega 56 for $320 after tax, and the cursory benchmark results above show it getting anywhere from 50-75% of the performance of a 1080 Ti at well under half the price (most new 1080 Tis I see are around $850 at the moment).</p>
<p>In the future, it would be better to compare the cost/performance of the Vega to a lower tier Nvidia GPU like a 1070.</p>
<h1>Building an AMD Deep Learning Machine: Part 2</h1>
<h1>Operating System</h1>
<p>I used Ubuntu 19.04, partially because I wanted to try out the April release of Ubuntu and partially because I knew that newer kernels are more compatible with Vega (the amdgpu driver is merged into kernels after 4.19, which reduces installation headaches) and with the Ryzen CPU.</p>
<h2>A note about docker</h2>
<p>If you are not a fan of docker, for security or whatever reason, I don't advise that you use Ubuntu 19.04. This release ships only Python 3.7, and there are, at the moment, <a href="https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/389#issuecomment-485057344" rel="noopener noreferrer">a few issues</a> with running ROCm 2.3 under Python 3.7. This doesn't seem to be a problem on Python 3.5 or 3.6, or with older versions of ROCm. However, the performance improvement with the newer version of ROCm is substantial enough that I would instead use a version of Ubuntu where you can downgrade your Python version.</p>
<h1>Initial Software Stack</h1>
<p>First, the ROCm debian repository has to be added:</p>
<pre><code>wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
</code></pre>
<p>Then, the appropriate packages are installed.</p>
<pre><code>sudo apt update
sudo apt install rocm-libs miopen-hip cxlactivitylogger
sudo apt install rocm-dev
</code></pre>
<p>Because we are using the amdgpu driver in kernel 5.0, which ships with Ubuntu 19.04, we need to add the following udev rule:</p>
<pre><code>echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules
</code></pre>
<h1>Docker install</h1>
<p>I added the following line to my <code>~/.bashrc</code> file to allow for quick launching of the container:</p>
<pre><code>alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx'
</code></pre>
<p>To launch the container, you then simply run <code>drun rocm/tensorflow</code> to drop into your container. The first time you run this, it will pull the image from Docker Hub; after that, it will use the cached image.</p>
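<p>Once inside the container, a quick sanity check confirms that the Vega card is actually visible to TensorFlow. A minimal sketch, assuming the TF 1.x API shipped in the <code>rocm/tensorflow</code> images of this era:</p>
<pre><code># Inside the container: list the devices TensorFlow can see.
# If the ROCm stack is working, a GPU device should appear alongside the CPU.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)
</code></pre>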
<h1>Building an AMD Deep Learning Machine: Part 1</h1>
<p>Deep learning has historically been dominated by NVIDIA GPUs. NVIDIA's CUDA API is a proprietary standard for writing code that runs on graphics processing hardware. CUDA is tightly integrated into all the major deep learning toolkits and provides a relatively intuitive programming interface (in comparison to OpenCL). For a more in-depth discussion of the history of GPGPU programming and the potential for an interoperable, open-source GPU programming future, check out <a href="https://www.youtube.com/watch?v=ZTq8wKnVUZ8" rel="noopener noreferrer">this youtube video</a>.</p>
<p>However, CUDA is proprietary, only works on NVIDIA GPUs, and requires proprietary Linux drivers. Many people, myself included, object to the monopolistic hold NVIDIA has established on the deep learning infrastructure market and to their non-open practices. In addition, using CUDA can be a flat-out pain on the administration side. In my experience, the CUDA utilities integrate poorly with package managers: I have had a number of issues removing CUDA or replacing it with a new version, where installation added a large number of additional packages but removal only uninstalled a couple of them.</p>
<h1>Hardware considerations</h1>
<p>AMD HIP/ROCm is slightly more picky than CUDA with regard to the hardware it will run on. RX 5x0 GPUs, RX 4x0 GPUs, and the R9 3x0 series are not able to run on older CPUs that lack PCIe 3 atomics support. Newer GPUs like the Vega 56, Vega 64, Vega Frontier Edition, and Radeon VII are able to run in a mode without PCIe 3 atomics, with a performance penalty.</p>
<p>CPUs with PCIe 3 atomics support include all Ryzen CPUs as well as all Intel CPUs from Haswell on (i.e., Core 4000-series processors and newer). For more information on supported hardware, check out <a href="https://rocm.github.io/hardware.html" rel="noopener noreferrer">this page</a>.</p>
<h1>Hardware used</h1>
<ul>
<li>32 GB (2 × 16 GB DIMMs) of 3000 MHz G.Skill Trident Z</li>
<li>Ryzen 7 1700</li>
<li>Vega 56 (ASRock blower)</li>
<li>Wraith Spire cooler</li>
<li>B450 Aorus M motherboard</li>
<li>128 GB SSD</li>
<li>2 TB hard drive</li>
<li>750 watt power supply</li>
<li>Rosewill SCM-01 case</li>
</ul>
<p>All the RGB was merely an accident of pricing; I just went with the best performance for the money. In addition, a blower-style Vega 56 was used instead of an open-air Vega 56 like those made by PowerColor, since the case has relatively poor airflow. Getting hot air out of the case was deemed much more important for prolonged workload performance.</p>
<p><img src="https://blog.ksteimel.duckdns.org/static/media/EF8395D9-980E-2AE4-1F7C-90CD191D1920.jpg" alt="ASRock Blower style Vega 56 GPU"></p>
<h1>Stemming with PyPy and Python</h1>
<p>For a project on maliciousness detection that I am working on, I needed an unsupervised stemming method. We were examining the role that text cleanup plays in the classification task. This would become especially important as we investigated other feature extraction methods like dependency triples.</p>
<p>The main problem was that we needed a system that would work for both English and German data. In both cases, we were using social media data.</p>
<p>I'll make another post explaining why this text cleanup needed to be done and how Yass works as well as some qualitative examples of the performance achieved.</p>
<p>The main point of this post is that PyPy drastically sped up the stemming process.</p>
<h2>What is pypy?</h2>
<p><a href="https://pypy.org/" rel="noopener noreferrer">PyPy</a> is an alternative Python engine that compiles code rather than only running it on top of the PVM (Python Virtual Machine). Similar to the JVM, Python's PVM is a layer that runs bytecode generated by the interpreter. Even though the interpreter is written in C, this arrangement still has a large amount of overhead.</p>
<p>PyPy uses just-in-time compilation to get to machine code. Just-in-time compilation is a method where only the functions that are actually used get run through the compiler. Ordinary compiled languages like C or Go have separate compilation and execution stages. For example, in C you could run <code>gcc myprogram.c -o myprogram</code> to generate a binary that can then be run using <code>./myprogram</code>. PyPy and other JIT-compiled languages don't have separate compile and execute steps: the program is compiled as it is being executed, and only the pieces that need to be compiled are. Typically, the compiled function is specialized to the argument types it is called with. For example, calling this python function:</p>
<pre><code>def simple_mul(num1, num2):
"""
This function simply multiplies two numbers together and
is just meant to be an example
"""
return num1 * num2
</code></pre>
<p>as <code>simple_mul(3, 9)</code> would compile a version where both arguments are integers, while <code>simple_mul(3.2, 9)</code> would compile a version where the first argument is a float and the second an integer.</p>
<p>This highly specific compilation is part of what makes languages like <em>julia</em> and <em>PyPy</em> so fast.</p>
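<p>You can watch this warm-up behavior with a tiny benchmark. The sketch below is illustrative only (absolute numbers will vary by machine); it times batches of calls to the toy <code>simple_mul</code> from above:</p>
<pre><code>import time

def simple_mul(num1, num2):
    return num1 * num2

# Time several batches of calls. Under PyPy the later batches should be
# noticeably faster than the first, because by then the JIT has compiled
# a specialized version of simple_mul. Under CPython the batches stay flat.
for batch in range(5):
    start = time.time()
    for i in range(1000000):
        simple_mul(i, i + 1)
    print("batch %d: %.3f seconds" % (batch, time.time() - start))
</code></pre>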
<p>An issue early on with PyPy was support for packages that utilize Python bindings to C programs. These types of programs are common in data science and statistics packages for Python, as this is the best way to get high performance out of the language. Libraries like <em>numpy</em> and <em>scipy</em> were not functional on the PyPy engine.</p>
<p>However, recent advances have made these packages work quite well.</p>
<p>Unfortunately, <em>scikit-learn</em> does not work at the moment. Once that package works, I will switch completely over to PyPy for all projects. As it is right now, I use PyPy for data extraction and preprocessing and then use vanilla Python for the classification stage.</p>
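<p>Concretely, the split looks something like the sketch below. The file name, feature format, and <code>extract_features</code> helper are hypothetical placeholders rather than the project's actual code: the idea is that this script runs under PyPy and writes its output to disk, and a separate vanilla-Python script loads the file and feeds it to scikit-learn.</p>
<pre><code>import json

def extract_features(tokens):
    # Hypothetical stand-in for the real preprocessing: counts character
    # trigrams per token, which is the kind of tight loop PyPy speeds up.
    feats = {}
    for tok in tokens:
        for i in range(len(tok) - 2):
            gram = tok[i:i + 3]
            feats[gram] = feats.get(gram, 0) + 1
    return feats

if __name__ == "__main__":
    features = extract_features(["stemming", "stemmed", "stems"])
    # Write to disk so the CPython/scikit-learn step can pick it up.
    with open("features.json", "w") as f:
        json.dump(features, f)
</code></pre>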
<h2>What kind of speed up did it achieve?</h2>
<p>The PyPy engine ran the code about twice as fast as the raw Python version. With the full dataset of 15616 unique tokens, the Python implementation took 1114.59 seconds, or about 18.5 minutes. In contrast, the PyPy engine took 548.275 seconds, or about 9 minutes. This required no changes to the original Python implementation.</p>
<p>A number of different vocabulary sizes were also tried to see how the performance results changed.</p>
<p><img src="https://blog.ksteimel.duckdns.org/static/media/192D86AA-1ADB-D296-716C-CC26944780D1.svg" alt="Scaling graph for pypy and python running yass stemming distance metric 2. The graph shows that, as the problem size increases, pypy's running time remains a fraction of the running time of python."></p>
<p>During execution, I noticed that file IO consumed a much larger proportion of the total execution time in the PyPy version, while the actual logic of the code took a smaller proportion.</p>
<p>However, I did not quantify this. In the future, I would like to write a version of this program in julia, benchmarking the initial file read, the calculation of the distances between every pair of words, the construction of the minimum spanning tree, and the writing of the results out to a file. Doing so will clarify where the performance profiles of python, PyPy and julia differ.</p>
<p><a href="https://git.ksteimel.duckdns.org/ksteimel/pypy_speed_test" rel="noopener noreferrer">This gitea repository</a> holds the code used.</p>
<p>The plot was generated using the Gadfly package for julia. </p>