Blog

How to set up Docker for Deep Learning with Ubuntu 16.04 (with GPU)

In this tutorial, I will guide you through setting up your computer as a workstation for serious deep learning while managing different development environments with Docker.

First, let's get the machine running without any Docker.

  1. Install Ubuntu 16.04 (the latest LTS version).
  2. Install the latest Nvidia drivers supported by your GPU.
  3. Install CUDA (which allows fast computation on your GPU).
  4. Install the CUDA Toolkit
  5. Install cuDNN (install it from the Debian packages).
  6. To verify that everything works as expected, follow the verification steps; I had to fix some bugs in the sample code, as described in this blog post. A sketch of the commands I used follows this list.
  7. Run it again; ‘all test passed!’ should be the last line of output.
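
For reference, here is roughly what the verification in step 6 looked like for me (assuming cuDNN 7 was installed from the Debian packages, which put the samples under /usr/src/cudnn_samples_v7 — adjust the version to yours):

# Check that the driver sees the GPU
nvidia-smi

# Build and run the cuDNN sample; its output should end with a "passed" message
cp -r /usr/src/cudnn_samples_v7/ ~/
cd ~/cudnn_samples_v7/mnistCUDNN
make clean && make
./mnistCUDNN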

Now let’s focus on Docker.

  1. Install Docker
  2. Install nvidia-docker to get access to the GPU inside Docker containers.
  3. Write a script to run a Docker container (see below). It pulls an image from FloydHub, removes the need for a Jupyter password, persists Jupyter notebooks in $HOME/jupyter and makes the notebook server accessible on port 8888.
    #!/usr/bin/env bash
    docker run -it -p 8888:8888 --runtime=nvidia --volume /home/filter/jupyter:/jupyter \
      --name pytorch floydhub/pytorch:0.3.1-gpu.cuda9cudnn7-py3.27 \
      /bin/sh -c "./run_jupyter.sh --NotebookApp.token='' --notebook-dir=/jupyter"
  4. Now verify that the container is running: `docker ps`.
  5. To stop the container, run `docker stop pytorch`; to restart it, run `docker restart pytorch`.
  6. To access the notebooks from your local machine, tunnel the traffic via SSH, e.g. `ssh -N -f -L localhost:8888:localhost:8888 @`

If you happen to have a special setup (like mine) where you don’t have an IPv6 address, you may need to tunnel over another server. In this case,

1) tunnel from the local machine to the server and

ssh -N -f -L localhost:8888:localhost:64149 jfilter@vis.one

2) tunnel from that server to your GPU workstation.

ssh -N -f -L localhost:64149:localhost:8888 filter@filter.dynv6.net

That’s it. I hope this post was helpful to you and that you are ready to train deep neural networks on large datasets. 😉

How to Import a Local React-Native NPM Package

I had to fix an unmaintained NPM package for my latest React-Native app, but I struggled to import it locally. Neither npm install $relative-path nor npm link worked (see also this issue). It looks like there is no simple way to import a local package that lives outside the root of your React-Native project. So I simply placed my local components in a folder local_components within the React-Native project root. Then I can just write import X from ‘package-X’; without having to touch the package.json of the React-Native project.
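
To illustrate (the names here are made up, not from my actual project): say the unmaintained package is called package-x and you copy its source into local_components/package-x, keeping its own package.json with "name": "package-x" — as far as I can tell, that is how the React-Native packager picks it up. The usual import then keeps working:

// App.js -- hypothetical component; the packager finds the package.json
// inside local_components/package-x, so the project's own package.json
// stays untouched.
import React from 'react';
import { View } from 'react-native';
import PackageX from 'package-x';

export default function App() {
  return (
    <View>
      <PackageX />
    </View>
  );
}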

How to Create Grouped Bar Charts with R and ggplot2

In a recent university project, I had to collect and analyze data via Google Forms. It was a survey about how people perceive the frequency and effectiveness of help-seeking requests on Facebook (with regard to nine pre-defined topics). The questionnaire looked like this:

Altogether, the participants (N=150) had to respond to 18 questions on an ordinal scale; in addition, age and gender were collected as independent variables. In the first step, I did not want to do any statistical analysis but only visualize the results. For every question, I wanted to look at the distribution of the data with regard to gender and age. This is not possible with Google Forms, so I had to resort to more sophisticated visualization tools: R and ggplot2. The following image is the end product of this tutorial.

You can find the whole source code on GitHub, and I will walk you through the essential steps of my code. In general, we want to create an R script that produces two graphs for every question: one where the responses are broken down by gender and one by age. This allows us to spot interesting patterns in the data and to present the images to an audience. With this script, you can also easily redraw all the graphs when the data gets updated, so you can write this kind of script for your own survey and re-run it every time you collect more responses.
I prefer to demonstrate the use of R and ggplot2 on a real-world example. It would be easier with mock data, but when you work in the real world, you also have real-world problems. So when you apply this to your specific survey, your data will probably need some cleaning as well. I will go over the steps I had to do for my survey.
If you are new to R, I advise you to find some resources to get an overview of the language. I especially liked this free course from Code School. You may also need to check the R documentation for further information about some of the functions used.
So now, let’s get started! Make sure you have R and RStudio installed. And install the R packages ggplot2 and dplyr via the Console in RStudio:

install.packages("ggplot2")
install.packages("dplyr")

Create an RStudio project and put the data as a .csv file into the same folder as the project. First, read in the data.

data = read.csv("./data.csv", header=TRUE, sep=",")

Then rename the columns to make it easier to work with them.

colnames(data) <- c("time","gender", "age", "f_1", "f_2", "f_3", "f_4", "f_5", "f_6", "f_7", "f_8", "f_9", "e_1", "e_2", "e_3", "e_4", "e_5", "e_6", "e_7", "e_8", "e_9")

In this case, I had ordinal data, but R cannot know that automatically, so I have to create a vector and specify the levels. Here they are for the frequency questions. We will save them in a variable and use it later.

f_levels <- c("never", "rarely", "sometimes", "often", "very often");

For the age, we chose to ask for ranges. This was a mistake: we did not get enough responses for the older age buckets. As a result, we have to merge some of the categories. This was more complicated than expected, but the following snippet works.

i2 <- levels(data$age) %in% c("35 - 44", "45 - 54", "55 - 64", "65 +")
levels(data$age)[i2] <- "35 +"

I encapsulate the drawing of the graph in a function to stay DRY. I will explain the most important function arguments along the way.

draw_chart <- function(df, prefix_column, x_title, new_levels, indep_col_name, question, topics)

Because we want to draw graphs for all the topics, we have to loop over all the topics.

for (i in 1:length(topics)) {

All of the following steps happen inside the loop. Each iteration works on one topic, which in essence means one question of the survey.
Now for the transformation. First, we have to construct the name of the column. Second, we select the needed columns from the data frame into a new one; this allows destructive transformations on the new data frame. Third, we rename the columns, and fourth, we filter out rows with empty values.

col_name <- paste(prefix_column, i, sep="")
cur_df <- df[,c(indep_col_name ,col_name)]
colnames(cur_df) <- c("indep_col" , "dep_col")
cur_df <- subset(cur_df, dep_col != "" & indep_col != "")

After that, we assign the levels we constructed earlier:

levels(cur_df$dep_col) <- new_levels

In the next step, we calculate the relative share of the responses with regard to the independent variable. In other words, we are not interested in the absolute number of responses but only in what percentage of, for example, the women chose a specific attribute. This snippet makes use of the dplyr package to transform the data. First, we group it, then summarize it, and finally calculate the relative values.

cur_df <- cur_df %>%
group_by(indep_col, dep_col) %>% # NB: the order of the grouping
summarise(count=n()) %>%
mutate(perc=count/sum(count))

We are ready to draw the graph. I skip the part regarding the labeling of the graph; it’s trivial and you can check it in the code example.
First, I have to tell ggplot which data frame to use and how its columns are mapped onto the graph.

ggplot(data=cur_df, aes(x=dep_col, y=perc, fill=indep_col)) +

Then, I specify further details regarding the representation of the bars. width is the width of the “groups”. The position_dodge argument is needed to get this kind of grouped diagram. stat="identity" is needed because, by default, ggplot counts the rows and plots those counts. We already calculated what to present (the percentages) and want to show exactly those values, hence "identity".

geom_bar(width=0.7, position=position_dodge(width=0.75), stat="identity") +

This line changes the labels of the y-axis to percentages and also specifies a limit. It ensures that every graph has the same y-axis range, which greatly improves comparability among all the graphs.

scale_y_continuous(labels=scales::percent, limits=c(0, 0.5)) +

This changes the labels.

labs(y = "Percentage", x=x_title, title=diagram_title, fill=indep_col_name) +

The following lines first set up the general theme and then change it to hide all vertical grid lines because they are not needed and only distract. In addition, the diagram title is centered and the default color palette is replaced.

theme_bw() +
theme(panel.grid.minor.x=element_blank(), panel.grid.major.x=element_blank(), plot.title = element_text(hjust = 0.5)) +
scale_fill_manual(values=cbPalette)

Phew, we are almost done. Now we have to save the image to disk. For this, we first create the folder if needed and then save the plot, which is referenced by last_plot().

dir_path <- paste(getwd(), "/graphs/", indep_col_name, sep="")
dir.create(dir_path, recursive=TRUE, showWarnings=FALSE) # ignore warning if folder already existent
file_name <- paste(x_title, topics[i], ".png", sep="")
ggsave(file_name, last_plot(), "png", path=dir_path)

That was everything that happens in the loop and thus in the function. Now we just call the function four times with different parameters.

draw_chart(data, "f_", "Frequency", f_levels, "gender", f_question, topics)
draw_chart(data, "e_", "Effectivity", e_levels, "gender", e_question, topics)
draw_chart(data, "f_", "Frequency", f_levels, "age", f_question, topics)
draw_chart(data, "e_", "Effectivity", e_levels, "age", e_question, topics)
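
These calls rely on a few objects the walkthrough skipped over (topics, the two question strings, e_levels and the colour palette cbPalette). They are defined near the top of the full script on GitHub; the values below are only placeholders to show their shape:

# Placeholder values -- the real ones are in the full script on GitHub
topics <- paste("topic", 1:9, sep = "_")           # the nine pre-defined topics
f_question <- "How frequently do you see such requests?"
e_question <- "How effective do you consider such requests?"
e_levels <- c("not effective", "slightly effective", "moderately effective",
              "very effective", "extremely effective")
# A commonly used colour-blind friendly palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")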

All the resulting images will be in the folder graphs. Here is the complete code once again and at the end of the post are some more resulting charts.
I hope this tutorial helped you a little bit. Please don’t hesitate to ask any questions in the comments below.

Scraping With Greasemonkey

In a recent project, I had to scrape web pages that were only accessible after authorization. Due to a complicated authorization process (“Estonian ID card”), I could not just fake the login with tools such as Selenium. So I had to somehow execute code in the browser environment. It took me some time to figure out what exactly I was looking for. I ended up with Greasemonkey for Firefox, which allowed me to automatically execute a JavaScript script after page load. The script consisted of three steps (a sketch follows the list):

  1. Scrape data from HTML
  2. Save data to LocalStorage
  3. Change Webpage
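
A minimal sketch of such a userscript could look like this (the @match pattern and the selectors are placeholders, not the ones from the actual project):

// ==UserScript==
// @name     scrape-after-login
// @match    https://example.org/records/*
// @grant    none
// ==/UserScript==

(function () {
  // 1. Scrape data from the HTML (placeholder selector)
  const rows = Array.from(document.querySelectorAll('table.results tr'))
    .map((tr) => tr.textContent.trim());

  // 2. Save the data to localStorage, keyed by the current URL
  localStorage.setItem(location.href, JSON.stringify(rows));

  // 3. Change the webpage, e.g. follow the "next" link so the script
  //    runs again on the following page
  const next = document.querySelector('a.next');
  if (next) window.location.href = next.href;
})();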

When the scraping process is done, you can copy the content of localStorage to the clipboard:

copy(JSON.stringify(JSON.stringify(localStorage)));

Paste and save it.

Backup With rsync over SSH

I needed a simple backup strategy for a folder on one of my servers. There is an abundance of articles on how to do this with rsync, but I could not find one that was both simple enough for me to understand and also secure. So here are my thoughts and links.
First, I created a new non-sudo user on the server where the backup should live. Then I followed the steps of this tutorial but ran into problems when testing the restrictions on the SSH commands. In the end, it turned out that I had placed the command in front of the wrong SSH key.
But the tutorial on its own results in an insecure system, as described in this post on Server Fault. So you have to change the owner of the validate_sync script to root and set the right permissions on the authorized_keys file, as described in the post.
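
In concrete terms, the hardening boiled down to something like the following (the paths are illustrative, and the forced command sits in the backup user’s authorized_keys):

# The restricted key in ~backup/.ssh/authorized_keys starts roughly like:
#   command="/usr/local/bin/validate_sync",no-pty,no-port-forwarding ssh-rsa AAAA...

# The restricted user must not be able to edit the validation script ...
sudo chown root:root /usr/local/bin/validate_sync
sudo chmod 755 /usr/local/bin/validate_sync

# ... nor loosen the restrictions in authorized_keys
sudo chown root:root /home/backup/.ssh/authorized_keys
sudo chmod 644 /home/backup/.ssh/authorized_keys
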
Finally, I added a crontab:

0 5 * * * rsync -az --delete -e ssh X@Y:Z /home/USER

Failed (111: Connection Refused) While Connecting To Upstream

After following a tutorial by Digital Ocean on how to get Flask running with NGINX, I experienced some problems. In the error.log, I just found

failed (111: Connection refused) while connecting to upstream

and it took me some time to figure the problem out. I read everywhere on the Web that most problems with uWSGI are related to file permissions. In my case, the .sock file had the right configuration, but I was trying to log the output of the uWSGI process to a file that didn’t exist (that part was not covered in the tutorial). After creating the file and setting adequate permissions, everything worked smoothly. In general, the tutorials by Digital Ocean are excellent. 👌
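
For reference, the fix was roughly the following (file names, paths and the service user are placeholders for my setup):

# Create the log file uWSGI is configured to write to ...
sudo mkdir -p /var/log/uwsgi
sudo touch /var/log/uwsgi/myproject.log

# ... make it writable by the user that runs the uWSGI process ...
sudo chown www-data:www-data /var/log/uwsgi/myproject.log

# ... and restart the service so uWSGI comes up cleanly
sudo systemctl restart myproject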

Minimal Local Development Environment for WordPress with Trellis

When you want to develop a theme or a plugin for WordPress, it is best to do it locally. To set up a local development environment, you can do it manually (web server, database etc.) or with tools such as MAMP (which is, in my opinion, cumbersome to use). A friend of mine recommended Trellis to me. It makes use of Vagrant to run WordPress in a virtual machine, and it promised to “Just Work”.
(Edit: I use this setup to develop general themes without having any control over the production environment. I got criticized because the production and development environments differ a lot, which can cause problems in certain situations due to the different approach of Bedrock. But this does not affect theme development. It can be different for plugins, so be warned (and read the provided link above) that you may run into problems.)
The provided documentation is okay, but I was very confused because the creators of Trellis also have some other tools for theme development (Bedrock). At some moments, I was not sure what to do when I just wanted a plain, simple WordPress installation. That is why I wrote this blog post to give you a minimal setup.

  1. Requirements: Make sure you have the specified versions of the software requirements (>= means you can have a newer version). Check the link for the latest requirements (last checked on 13.02.2017).
  2. Create a project folder and, inside it, run this command to download Trellis. The result is a folder called ‘trellis’.
    git clone --depth=1 git@github.com:roots/trellis.git && rm -rf trellis/.git
  3. In your project folder, run this command to download Bedrock. Bedrock is a tool that we will not use directly, but it is needed to easily run WordPress. The result is a folder called ‘site’. Your WordPress files for theme or plugin development will live here.
    $ git clone --depth=1 git@github.com:roots/bedrock.git site && rm -rf site/.git
  4. Then install the required Ansible roles.
    $ cd trellis && ansible-galaxy install -r requirements.yml
  5. Open the following file with your favorite text editor and replace all the occurrences of ‘example’ with your project name.
    trellis/group_vars/development/wordpress_sites.yml
  6. Open the following file with your favorite text editor and replace all the occurrences of ‘example’ with your project name (remember or change the WordPress admin credentials).
    trellis/group_vars/development/vault.yml
  7. Go to your ‘trellis’ folder and run
    vagrant up
  8. After Vagrant is ready (this can take a while), visit http://your-project-name.dev in your browser.
  9. Success. You can now e.g. develop your own theme in the folder site/web/app/themes and changes will appear in the browser.

How to Export Data from Google Ngram Viewer

Google Ngram Viewer gives information about the frequency of words in Google Books. You can query several words and the result is a graph, but they do not offer a way to export the data. To do so, follow these instructions (macOS 10.12.2, Chrome 55):

  1. Specify the query and select a smoothing of 0. This ensures that you get the raw data that was not subjected to smoothing.
  2. Open Developer Tools.
  3. Run the query.
  4. Select the Sources panel
  5. Select “search all files” (click on the three dots to see a menu where you can select this)
  6. Search for “var data”
  7. Look in the resulting list of files for something like “books.google.com/ngrams/graph?content=long&year_start=1800…..”
  8. Copy and clean the data