<h1>Scrapyard Deep Learning</h1>
<h1>Motivation</h1>
<p>I am a PhD student in Computational Linguistics. As such, I need to experiment with deep learning frameworks but have little money to build a powerful deep learning machine.</p>
<p>Unlike my last build, which was centered around an open source deep learning stack provided by AMD, this build is designed to be as cheap as possible.</p>
<p>There are some things that I included in this build that I just had lying around and some things that I had to buy to assemble it. I will break down the cost for both types of materials.</p>
<h2>Chassis</h2>
<p>Building a deep learning machine in an old gaming computer case would probably be ideal. However, those kinds of systems have held their value better than I initially expected. In addition, RAM was a bit of a concern for systems old enough to be within my price range: gaming systems from 2011 or so almost always seemed to be built with 8 GB of RAM on the high end. Of course, this can be expanded, but the highest memory capacity I could achieve with an $80 motherboard seemed to be about 32 GB, and I wanted the flexibility to work with large datasets in the future.</p>
<p>In addition, I am very familiar with old server hardware (due to the number of non-deep-learning-oriented servers I maintain), so I started seeking out used workstations like the HP z620, z820, Lenovo ThinkStation S30, or the Dell Precision T3600. These machines have several advantages:</p>
<ul>
<li>Registered ECC memory for these old systems is very easy to come by and incredibly cheap (I just upgraded the system described in this post with 32 GB of RAM for $35).</li>
<li>The LGA 2011 processors (the E5-2600 and E5-2600 v2 lineups) have AVX extensions, which enable faster floating point calculations. TensorFlow's prebuilt binaries started assuming AVX instructions sometime this past year, so any older system will not perform nearly as well (see the sketch after this list for a quick way to check).</li>
<li>The power supplies are generally overspecced in order to handle GPUs for CAD work.</li>
<li>They don't typically have any of the vendor lock-in that existed in many servers from these brands in the same years.</li>
</ul>
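<p>Here's a quick way to check whether a prospective machine's CPU has AVX. This is just a minimal sketch that reads the Linux CPU flags; it assumes you can get a shell on the machine first:</p>
<pre><code># Minimal AVX check on Linux. TensorFlow's prebuilt binaries assume AVX,
# so the "avx" flag needs to appear in /proc/cpuinfo.
with open("/proc/cpuinfo") as f:
    flags_line = next((line for line in f if line.startswith("flags")), "")

if "avx" in flags_line.split():
    print("AVX supported: prebuilt TensorFlow wheels should run")
else:
    print("No AVX: TensorFlow would need to be built from source")
</code></pre>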
<p>I ended up settling on the HP z620 in particular because:</p>
<ul>
<li>The motherboard provides a large number of SATA connections for attaching more hard drives.</li>
<li>The PSU is quite overspecced, with each of its 6-pin connectors capable of being split into an 8-pin connector.</li>
<li>A second processor can be added via a riser.</li>
<li>More memory can be added compared to the Dell.</li>
<li>z820s are too expensive if the second processor is not a necessity.</li>
<li>It was quiet.</li>
<li>It was cheap.</li>
</ul>
<h2>GPUs</h2>
<p>For the GPUs in this machine, I went with the GTX 1070, for two primary reasons. The first is that this is the cheapest GPU available from NVIDIA with 8 GB of VRAM. VRAM is where the models and data live on the GPU: having more VRAM directly relates to the size of the models you can run as well as the speed with which you can train, since training speed is highly dependent upon batch size and larger batches require more memory. The second reason is that the GTX 1070 typically requires only a single 8-pin power connector. Given that the z620 can supply only two 8-pin connectors, something like the 1080 Ti (which typically needs two power connectors) is only practical if I have no desire to expand to more than one GPU.</p>
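<p>To make the VRAM and batch size relationship concrete, here is a back-of-the-envelope sketch. Every constant in it (parameter count, activation footprint) is a made-up placeholder for a hypothetical model, not a measurement:</p>
<pre><code># Rough VRAM budget for training, in bytes. All constants are
# illustrative assumptions for a hypothetical model, not measurements.
PARAMS = 25_000_000            # hypothetical parameter count
BYTES_PER_FLOAT = 4            # fp32
ACTS_PER_EXAMPLE = 8_000_000   # hypothetical activation floats per example

def training_bytes(batch_size):
    # Weights + gradients + optimizer state come to roughly 3x the
    # parameters; activation memory scales linearly with batch size.
    static = 3 * PARAMS * BYTES_PER_FLOAT
    activations = batch_size * ACTS_PER_EXAMPLE * BYTES_PER_FLOAT
    return static + activations

for batch in (16, 32, 64):
    print(batch, round(training_bytes(batch) / 2**30, 2), "GiB")
</code></pre>
<p>With these made-up numbers, doubling the batch size nearly doubles the memory needed, which is why an 8 GB card lets you train with noticeably larger batches than a 4 GB one.</p>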
<p>The first GPU I selected was a cheap Gigabyte model that I bought off of an online forum.</p>
<p>The second GPU, which I bought after using the machine for a while, was a blower-design HP OEM card from eBay. It was slightly more expensive because I had to pay tax.</p>
<h2>Hard drives</h2>
<p>Data takes up a lot of space and I wanted to have no shortage of spinning metal to put it on. The z620 came with one 500 GB drive, and I had a number of 1 TB 2.5" hard drives left over from a failed attempt to use them in my ProLiant DL380 (the temperature sensors on those drives did not get along with that machine).</p>
<p>The z620 has two 5.25" bays, and only the top one is occupied by a CD drive. I bought a caddy from IcyDock that enabled me to put four 2.5" drives in the free 5.25" bay.</p>
<h1>Cost breakdown</h1>
<table><thead><tr><th> Item </th><th> Cost (USD) </th></tr></thead><tbody>
<tr><td> Chassis </td><td> 200 </td></tr>
<tr><td> GTX 1070 </td><td> 200 </td></tr>
<tr><td> GTX 1070 </td><td> 215 </td></tr>
<tr><td> IcyDock caddy </td><td> 20 </td></tr>
<tr><td> <strong>Total</strong> </td><td> <strong>635</strong> </td></tr>
</tbody></table>
<h1>Building an AMD Deep Learning Machine: Part 3</h1>
<p>The biggest question is: how does this perform? Yes, the stack is open source, uses a driver already integrated into the kernel, and can run TensorFlow, PyTorch, and Caffe, but how well does it do all of that?</p>
<p>Some results (training throughput in images per second) are provided from <a href="https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-titan-v-vs-1080-ti-benchmark/" rel="noopener noreferrer">lambda labs</a> for comparison with the Vega 56.</p>
<table><thead><tr><th> Model / GPU </th><th> Vega 56 </th><th> 1080 Ti </th></tr></thead><tbody>
<tr><td> ResNet-50 </td><td> 145.19 </td><td> 203.99 </td></tr>
<tr><td> Inception v3 </td><td> 67.08 </td><td> 130.2 </td></tr>
<tr><td> VGG16 </td><td> 80.57 </td><td> 133.16 </td></tr>
</tbody></table>
<p>The Vega GPU is roughly half as fast as the 1080 Ti on the worst performing model (Inception v3). The gap is closest for ResNet-50, and that result was actually achieved by turning on ROCm Fusion. This fusion operation seems to rewrite the computation graph to combine multiple operations into a single convolution where possible.</p>
<p>To enable this, run <code>export TF_ROCM_FUSION_ENABLE=1</code> inside the docker container before starting a tensorflow workload. Perhaps the other models would have been closer in performance to the 1080 Ti with this setting. Unfortunately, I was not able to perform very rigorous testing, as I was building this machine for someone else. I would like to try out ROCm Fusion as well as <a href="https://github.com/RadeonOpenCompute/ROCm/issues/463" rel="noopener noreferrer">undervolting and overclocking the card</a>. Undervolting should reduce the amount of heat and fan noise, allowing the card to maintain higher boost frequencies.</p>
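<p>If you are driving the benchmark from a Python script instead of the shell, the flag can also be set from within Python. A minimal sketch, assuming the variable is read when TensorFlow initializes, so it must be set before the import:</p>
<pre><code>import os

# Assumption: TF_ROCM_FUSION_ENABLE is read during TensorFlow start-up,
# so it has to be in the environment before tensorflow is imported.
os.environ["TF_ROCM_FUSION_ENABLE"] = "1"

import tensorflow as tf  # imported deliberately after the environment tweak
print(tf.__version__)
</code></pre>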
<h1>Conclusion</h1>
<p>While this build wasn't for me, I would certainly put one together for myself if I had an extra thousand dollars to spend on it. After tax and everything, the entire build was $996.17. The value for performance with the Vega GPU is actually pretty decent: I got the Vega 56 for $320 after tax, and the cursory benchmark results above show it getting anywhere from 50-75% of the performance of a 1080 Ti at well under half the price (most new 1080 Tis I see are around $850 at the moment).</p>
<p>In the future, it would be better to compare the cost/performance of the Vega to a lower tier Nvidia GPU like a 1070.</p>
<h1>Building an AMD Deep Learning Machine: Part 2</h1>
<h1>Operating System</h1>
<p>I used Ubuntu 19.04, partially because I wanted to try out the April release of Ubuntu and partially because I knew that newer kernels are more compatible with Vega (the amdgpu driver is merged into kernels after 4.19, which reduces installation headaches) and with the Ryzen CPU.</p>
<h2>A note about docker</h2>
<p>If you are not a fan of docker, for security or whatever reason, I don't advise that you use Ubuntu 19.04. This release ships only Python 3.7, and there are, at the moment, <a href="https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/389#issuecomment-485057344" rel="noopener noreferrer">a few issues</a> with running ROCm 2.3 under Python 3.7. This doesn't seem to be a problem on Python 3.5 or 3.6, or with older versions of ROCm. However, the performance improvement with the newer version of ROCm is substantial enough that I would instead use a version of Ubuntu where you can downgrade your Python version.</p>
<h1>Initial Software Stack</h1>
<p>First, the ROCm debian repository has to be added:</p>
<pre><code>wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add -
echo 'deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main' | sudo tee /etc/apt/sources.list.d/rocm.list
</code></pre>
<p>Then, the appropriate packages are installed.</p>
<pre><code>sudo apt update
sudo apt install rocm-libs miopen-hip cxlactivitylogger
sudo apt install rocm-dev
</code></pre>
<p>Because we are using the amdgpu driver in kernel 5.0, which ships with Ubuntu 19.04, we need to add the following udev rule:</p>
<pre><code>echo 'SUBSYSTEM=="kfd", KERNEL=="kfd", TAG+="uaccess", GROUP="video"' | sudo tee /etc/udev/rules.d/70-kfd.rules
</code></pre>
<h1>Docker install</h1>
<p>I added the following line to my <code>~/.bashrc</code> file to allow for quick launching of the container:</p>
<pre><code>alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx'
</code></pre>
<p>To launch the container, you then simply run <code>drun rocm/tensorflow</code> to drop into your container. The first time you run this, it will pull the image from Docker Hub; after that, it will use the cached image.</p>
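<p>Once inside the container, a quick sanity check confirms that the Vega card is actually visible to TensorFlow. A minimal sketch, assuming the TF 1.x API shipped in the <code>rocm/tensorflow</code> images of this era:</p>
<pre><code># Inside the container: list the devices TensorFlow can see.
# If the ROCm stack is working, a GPU device should appear alongside the CPU.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.name, device.device_type)
</code></pre>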
<h1>Building an AMD Deep Learning Machine: Part 1</h1>
<p>Deep learning has historically been dominated by NVIDIA GPUs. NVIDIA's CUDA API is a proprietary standard for writing code that runs on graphics processing hardware. CUDA is tightly integrated into all the major deep learning toolkits and provides a relatively intuitive programming interface (in comparison to OpenCL). For a more in-depth discussion of the history of GPGPU programming and the potential for an interoperable, open-source GPU programming future, check out <a href="https://www.youtube.com/watch?v=ZTq8wKnVUZ8" rel="noopener noreferrer">this youtube video</a>.</p>
<p>However, CUDA is proprietary, only works on NVIDIA GPUs, and requires proprietary Linux drivers. Many people, myself included, object to the monopolistic hold NVIDIA has established on the deep learning infrastructure market and to their non-open practices. In addition, using CUDA can be a flat-out pain on the administration side. In my experience, the CUDA utilities integrate poorly with package managers: I have had a number of issues removing CUDA or replacing it with a new version, where installation added a large number of additional packages but removal only uninstalled a couple of them.</p>
<h1>Hardware considerations</h1>
<p>AMD HIP/ROCm is slightly more picky than CUDA with regard to the hardware it will run on. RX 5x0 GPUs, RX 4x0 GPUs, and the R9 3x0 series are not able to run on older CPUs that lack PCIe 3 atomics support. Newer GPUs like the Vega 56, Vega 64, Vega Frontier Edition, and Radeon VII are able to run in a mode without PCIe 3 atomics, with a performance penalty.</p>
<p>CPUs with PCIe 3 atomics support include all Ryzen CPUs as well as all Intel CPUs from Haswell on (i.e., Core 4000-series processors and newer). For more information on supported hardware, check out <a href="https://rocm.github.io/hardware.html" rel="noopener noreferrer">this page</a>.</p>
<h1>Hardware used</h1>
<ul>
<li>32 GB (2 × 16 GB DIMMs) of 3000 MHz G.Skill Trident Z</li>
<li>Ryzen 7 1700</li>
<li>Vega 56 (ASRock blower)</li>
<li>Wraith Spire cooler</li>
<li>B450 Aorus M motherboard</li>
<li>128 GB SSD</li>
<li>2 TB hard drive</li>
<li>750 watt power supply</li>
<li>Rosewill SCM-01 case</li>
</ul>
<p>All the RGB was merely an accident of pricing; I just went with the best performance for the money. In addition, a blower-style Vega 56 was used instead of an open-air Vega 56 like those made by PowerColor, since the case has relatively poor airflow. Getting hot air out of the case was deemed much more important for prolonged workload performance.</p>
<p><img src="https://blog.ksteimel.duckdns.org/static/media/EF8395D9-980E-2AE4-1F7C-90CD191D1920.jpg" alt="ASRock Blower style Vega 56 GPU"></p>
<h1>Stemming with PyPy and Python</h1>
<p>For a project on maliciousness detection that I am working on, I needed an unsupervised stemming method. We were examining the role that text cleanup plays in the classification task. This would become especially important as we investigated other feature extraction methods like dependency triples.</p>
<p>The main problem was that we needed a system that would work for both English and German data. In both cases, we were using social media data.</p>
<p>I'll make another post explaining why this text cleanup needed to be done and how Yass works as well as some qualitative examples of the performance achieved.</p>
<p>The main point of this post is that PyPy drastically sped up the stemming process.</p>
<h2>What is pypy?</h2>
<p><a href="https://pypy.org/" rel="noopener noreferrer">PyPy</a> is an alternative Python engine that compiles code rather than only running it on top of the PVM (Python Virtual Machine). Similar to the JVM, Python's PVM is a layer that runs bytecode generated by the interpreter. Even though the interpreter is written in C, this arrangement still has a large amount of overhead.</p>
<p>PyPy uses just-in-time compilation to get to machine code. Just-in-time compilation is a method where only the functions that are actually used get run through the compiler. Ordinary compiled languages like C or Go have separate compilation and execution stages. For example, in C you could run <code>gcc myprogram.c -o myprogram</code> to generate a binary that can then be run using <code>./myprogram</code>. PyPy and other JIT-compiled languages don't have separate compile and execute steps: the program is compiled as it is being executed, and only the pieces that need to be compiled are. Typically, the compiled function is specialized to the argument types it is called with. For example, calling this python function:</p>
<pre><code>def simple_mul(num1, num2):
"""
This function simply multiplies two numbers together and
is just meant to be an example
"""
return num1 * num2
</code></pre>
<p>as <code>simple_mul(3, 9)</code> would compile a version where both arguments are integers, while <code>simple_mul(3.2, 9)</code> would compile a version where the first argument is a float and the second an integer.</p>
<p>This highly specific compilation is part of what makes languages like <em>julia</em> and <em>PyPy</em> so fast.</p>
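<p>You can watch this warm-up behavior with a tiny benchmark. The sketch below is illustrative only (absolute numbers will vary by machine); it times batches of calls to the toy <code>simple_mul</code> from above:</p>
<pre><code>import time

def simple_mul(num1, num2):
    return num1 * num2

# Time several batches of calls. Under PyPy the later batches should be
# noticeably faster than the first, because by then the JIT has compiled
# a specialized version of simple_mul. Under CPython the batches stay flat.
for batch in range(5):
    start = time.time()
    for i in range(1000000):
        simple_mul(i, i + 1)
    print("batch %d: %.3f seconds" % (batch, time.time() - start))
</code></pre>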
<p>An issue early on with PyPy was support for packages that utilize Python bindings to C programs. These types of programs are common in data science and statistics packages for Python, as this is the best way to get high performance out of the language. Libraries like <em>numpy</em> and <em>scipy</em> were not functional on the PyPy engine.</p>
<p>However, recent advances have made these packages work quite well.</p>
<p>Unfortunately, <em>scikit-learn</em> does not work at the moment. Once that package works, I will switch completely over to PyPy for all projects. As it is right now, I use PyPy for data extraction and preprocessing and then use vanilla Python for the classification stage.</p>
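<p>Concretely, the split looks something like the sketch below. The file name, feature format, and <code>extract_features</code> helper are hypothetical placeholders rather than the project's actual code: the idea is that this script runs under PyPy and writes its output to disk, and a separate vanilla-Python script loads the file and feeds it to scikit-learn.</p>
<pre><code>import json

def extract_features(tokens):
    # Hypothetical stand-in for the real preprocessing: counts character
    # trigrams per token, which is the kind of tight loop PyPy speeds up.
    feats = {}
    for tok in tokens:
        for i in range(len(tok) - 2):
            gram = tok[i:i + 3]
            feats[gram] = feats.get(gram, 0) + 1
    return feats

if __name__ == "__main__":
    features = extract_features(["stemming", "stemmed", "stems"])
    # Write to disk so the CPython/scikit-learn step can pick it up.
    with open("features.json", "w") as f:
        json.dump(features, f)
</code></pre>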
<h2>What kind of speed up did it achieve?</h2>
<p>The PyPy engine ran the code about twice as fast as the raw Python version. With the full dataset of 15616 unique tokens, the Python implementation took 1114.59 seconds, or about 18.5 minutes. In contrast, the PyPy engine took 548.275 seconds, or about 9 minutes. This required no changes to the original Python implementation.</p>
<p>A number of different vocabulary sizes were also tried to see how the performance results changed.</p>
<p><img src="https://blog.ksteimel.duckdns.org/static/media/192D86AA-1ADB-D296-716C-CC26944780D1.svg" alt="Scaling graph for pypy and python running yass stemming distance metric 2. The graph shows that, as the problem size increases, pypy's running time remains a fraction of the running time of python."></p>
<p>During execution, I noticed that file IO consumed a much larger proportion of the total execution time in the PyPy version, while the actual logic of the code took a smaller proportion.</p>
<p>However, I did not quantify this. In the future, I would like to write a version of this program in julia, benchmarking the initial file read, the calculation of the distances between every pair of words, the construction of the minimum spanning tree, and the writing of the results out to a file. Doing so will clarify where the performance profiles of python, PyPy and julia differ.</p>
<p><a href="https://git.ksteimel.duckdns.org/ksteimel/pypy_speed_test" rel="noopener noreferrer">This gitea repository</a> holds the code used.</p>
<p>The plot was generated using the Gadfly package for julia. </p>