A simple template for using deeplearning4j with Spark and Jupyter

Recently I started to get my hands dirty with deeplearning4j. I find this library really great and I'm using it as a way to learn a bit more about the fantastic world of Deep Learning.

What I found interesting is their approach to scaling the learning phase. Scaling the training of a neural network is pretty tough; fortunately, practical approaches have recently emerged to accomplish this goal: exploiting a cluster of CPUs, with or without GPUs, to accelerate the training of complex neural networks on training sets that can be very big.

At CGnal we do data science on big data, so it's extremely important for us to be able to scale all the machine learning algorithms we apply to the data sets our customers provide us with.

Deeplearning4j provides a unified library with everything you need to create deep learning neural networks and to scale them on different platforms. I like the fact that it can also accelerate local computation by exploiting a GPU when one is present. So, by using Spark, for example, you can transparently scale the training of your neural network on a cluster of nodes that have a GPU next to the CPU.

Having said that, I just want to show what I put together to simplify the life of people who want to use deeplearning4j in a Spark-based project and, eventually, in a Jupyter notebook.

The Spark environment I'm using is Cloudera's, installed with the latest CDH version (5.8.2 at the time of writing). I created a simple SBT-based project that you can find here.

The most important piece is the build.sbt file. This is what this project shows:

  1. How to create an “uber” jar containing all the classes you need to run deeplearning4j on Spark. In particular, I had to resolve some version conflicts between the libraries used by deeplearning4j and those used by Spark itself (see the build.sbt sketch after this list).
  2. How to write a simple application that uses deeplearning4j with Spark.
  3. How to run a main from the IDE or from the SBT console without needing a local copy of Spark.
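To give an idea, here is a minimal sketch of what such a build.sbt can look like. It is not the real file from the repository: the artifact coordinates, the Spark version for CDH 5.8.2 and the merge strategy are assumptions for illustration; only the deeplearning4j/nd4j version (0.6.0) and the assembly jar name match what is used later in this post.

// Minimal build.sbt sketch (sbt-assembly plugin assumed); the real project
// resolves the actual dependency conflicts, which are omitted here.
name := "dl4j-spark-template"   // hypothetical project name
version := "0.6.0"
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // Spark is provided by the CDH cluster at runtime, so it stays out of the "uber" jar
  "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
  // deeplearning4j on Spark plus the native nd4j backend and the Kryo registrator
  "org.deeplearning4j" %% "dl4j-spark" % "0.6.0",
  "org.nd4j" % "nd4j-native-platform" % "0.6.0",
  "org.nd4j" %% "nd4j-kryo" % "0.6.0"
)

// Name of the assembly jar referenced later in the notebook configuration
assemblyJarName in assembly := "dl4j-assembly-0.6.0.jar"

// A typical way to resolve duplicate files when merging all dependencies into one jar
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}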

So, this project can be used as a starting point/template for building a deeplearning4j Spark-based project, but it can also be used to simplify the usage of deeplearning4j inside a Jupyter notebook. Let me show you how.

The prerequisite is to have Jupyter already set up. In this case the integration of Jupyter with Spark is based on Livy and sparkmagic; please consult the documentation of those projects to learn how to set up the environment.

Then follow these steps.
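First, build the “uber” jar. Assuming the project uses the sbt-assembly plugin (the output path below is an assumption; the jar name matches the one used later in this post), this boils down to:

# from the root of the project
sbt assembly
# the resulting jar, e.g. target/scala-2.10/dl4j-assembly-0.6.0.jar,
# is the one to ship to the cluster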

Now, supposing you are connected to your Hadoop cluster, you should copy the generated “uber” jar to a well-defined location on HDFS. For the sake of this example the location is /user/livy/repl_jars:
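For example (the jar name comes from the notebook configuration shown later; the local path of the jar is assumed to be the current directory):

hdfs dfs -mkdir -p /user/livy/repl_jars
hdfs dfs -put dl4j-assembly-0.6.0.jar /user/livy/repl_jars/
# the notebook configuration below also expects a local copy of the jar on the
# Livy server at /home/livy/dl4j-assembly-0.6.0.jar (spark.driver.extraClassPath)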

At this point you can write a notebook that uses deeplearning4j. Under the notebooks directory I put an example; it's the equivalent of the application, but in notebook form.
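To give an idea of what that code looks like, here is a minimal, self-contained sketch of training a network with deeplearning4j on Spark. It is not the actual example from the repository: the network, the toy data and all hyper-parameters are made up for illustration, and it follows the 0.6.0-era deeplearning4j-spark API (ParameterAveragingTrainingMaster plus SparkDl4jMultiLayer). In the notebook you would drop the SparkContext creation, since Livy already provides sc configured by the %%configure cell shown below.

import org.apache.spark.{SparkConf, SparkContext}
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.{DenseLayer, OutputLayer}
import org.deeplearning4j.nn.weights.WeightInit
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j
import org.nd4j.linalg.lossfunctions.LossFunctions

object SparkDl4jExample {
  def main(args: Array[String]): Unit = {
    // In a standalone application the Kryo settings go into the SparkConf;
    // in the notebook they come from the %%configure cell instead.
    val sparkConf = new SparkConf()
      .setAppName("dl4j-spark-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "org.nd4j.Nd4jRegistrator")
    val sc = new SparkContext(sparkConf)

    // Dummy training set, just to make the sketch self-contained:
    // 1000 DataSet objects of one example each (4 random features, 3-class one-hot label)
    val data = (1 to 1000).map { _ =>
      new DataSet(Nd4j.rand(1, 4), Nd4j.zeros(1, 3).putScalar(0, 1.0))
    }
    val trainingData = sc.parallelize(data).toJavaRDD()

    // A tiny feed-forward network: 4 inputs -> 10 hidden units -> 3 output classes
    val networkConf = new NeuralNetConfiguration.Builder()
      .seed(42)
      .iterations(1)
      .weightInit(WeightInit.XAVIER)
      .learningRate(0.1)
      .list()
      .layer(0, new DenseLayer.Builder().nIn(4).nOut(10).activation("relu").build())
      .layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .nIn(10).nOut(3).activation("softmax").build())
      .pretrain(false).backprop(true)
      .build()

    // Parameter averaging: each worker trains on its partition and the
    // parameters are periodically averaged across the cluster
    val trainingMaster = new ParameterAveragingTrainingMaster.Builder(1) // 1 example per DataSet object
      .batchSizePerWorker(32)
      .averagingFrequency(5)
      .build()

    val sparkNet = new SparkDl4jMultiLayer(sc, networkConf, trainingMaster)
    sparkNet.fit(trainingData)

    sc.stop()
  }
}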

The most important part of the notebook is where you configure the Spark context to load the “uber” jar from HDFS:

%%configure -f
{
  "jars": ["/user/livy/repl_jars/dl4j-assembly-0.6.0.jar"],
  "driverMemory": "3g",
  "executorMemory": "2g",
  "conf": {
    "spark.driver.extraClassPath": "/home/livy/dl4j-assembly-0.6.0.jar",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryo.registrator": "org.nd4j.Nd4jRegistrator"
  }
}

That’s all folks.