Ghetto Distributed Computing For Neural Networks
The idea for this post came, like many of my ideas, after watching a video.
I have been working on a web service that can host various artificially intelligent structures in a way that I can access them from wherever I need in third-party code projects.
Let me give you a simple example.
There are three database tables, neural_nets, neurons, and weights. These tables will have the appropriate fields and keys so that they link and work together to form any type of neural network you desire.
Every row in the neurons table has a layer field, for instance, so if you want to go deeper, you just add more layers.
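To make the idea concrete, here is a minimal sketch of that three-table layout using SQLite. The exact columns and names beyond the ones mentioned above (id, name, net_id, from_neuron, and so on) are assumptions for illustration, not the actual schema of my web service.

```python
# A sketch of the neural_nets / neurons / weights tables, assuming
# hypothetical column names beyond the "layer" field described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE neural_nets (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE neurons (
    id     INTEGER PRIMARY KEY,
    net_id INTEGER NOT NULL REFERENCES neural_nets(id),
    layer  INTEGER NOT NULL  -- deeper nets simply use higher layer values
);
CREATE TABLE weights (
    id          INTEGER PRIMARY KEY,
    from_neuron INTEGER NOT NULL REFERENCES neurons(id),
    to_neuron   INTEGER NOT NULL REFERENCES neurons(id),
    value       REAL NOT NULL
);
""")

# Insert a tiny two-layer net: one input neuron, one output neuron, one weight.
conn.execute("INSERT INTO neural_nets (id, name) VALUES (1, 'demo')")
conn.execute("INSERT INTO neurons (id, net_id, layer) VALUES (1, 1, 0)")
conn.execute("INSERT INTO neurons (id, net_id, layer) VALUES (2, 1, 1)")
conn.execute("INSERT INTO weights (from_neuron, to_neuron, value) VALUES (1, 2, 0.5)")

layers = conn.execute(
    "SELECT COUNT(DISTINCT layer) FROM neurons WHERE net_id = 1"
).fetchone()[0]
print(layers)  # how deep the demo net is
```

The point is that the rows, not the code, define the topology: adding depth is just inserting neurons with a higher layer value and wiring them up in weights.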
Are you with me so far? Good.
Once everything is trained up, the web service exposes a RESTful API that can be queried to retrieve a neural_net by some identifier field, and it will give you back all the functionality of that particular trained network.
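What "all the functionality" means in practice is that the client can run inputs through the retrieved network. Here is a sketch of what a client might do with the JSON a hypothetical GET /neural_nets/&lt;id&gt; endpoint could return; the payload shape below is my assumption, and a real client would receive it over HTTP rather than hard-code it.

```python
# A sketch of consuming a trained net fetched from the web service.
# `response` stands in for the body of a hypothetical GET /neural_nets/<id>
# call; its shape (a list of per-layer weight matrices) is an assumption.
import math

response = {
    "id": 1,
    "layers": [
        # each entry: weight matrix from the previous layer,
        # one row per neuron in this layer
        {"weights": [[0.5, -0.2], [0.8, 0.1]]},
        {"weights": [[1.0, 1.0]]},
    ],
}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(net, inputs):
    """Run a forward pass through the layers retrieved from the service."""
    activations = inputs
    for layer in net["layers"]:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in layer["weights"]
        ]
    return activations

out = predict(response, [1.0, 0.0])
print(out)  # a single sigmoid activation between 0 and 1
```

Once the weights live in the database, serializing them into a response like this is just a couple of joins, and any third-party project can call predict without knowing how the net was trained.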
But, as we all know, training a neural network can take a very long time, especially with all the quirks of backpropagation and the like, and so I started thinking about distribution.
My initial instinct was to write some software that compiles into a binary executable for various platforms that people can install to become a distributed node for calculation.
It is kind of like how the SETI@home program works: whenever your machine idles, it gives up some of its processing power to the cause, helping SETI perform the various calculations they need and sending the results back to them.
Then I saw the video in question; I will link it below.
So what can we use this for?
Training weights is a great candidate first and foremost.
While a new neural net is still in training, clients (or bots) could connect to the web service, query for a piece of training data and the model it belongs to, run the calculation, and return the result to the web service, which then places it in the right spot in the database.
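That poll-compute-report cycle can be sketched as a simple worker loop. In the sketch below a local queue stands in for the web service's task endpoint, and the "calculation" is a trivial placeholder; a real node would make HTTP calls and run the actual forward and backward pass for the model identified by net_id. All field names are assumptions.

```python
# A sketch of the distributed worker loop: pull a task, run the
# calculation, send the result back. The queue and the `results` list
# stand in for the web service's HTTP endpoints.
import queue

tasks = queue.Queue()
tasks.put({"net_id": 1, "sample": [0.2, 0.7], "expected": 1.0})

results = []

def run_worker():
    """Drain the task queue, computing and reporting one result per task."""
    while not tasks.empty():
        task = tasks.get()
        # Placeholder "calculation": a real node would evaluate the net
        # identified by net_id on this sample instead of averaging it.
        output = sum(task["sample"]) / len(task["sample"])
        error = task["expected"] - output
        # Reporting back to the service is simulated by appending here.
        results.append({"net_id": task["net_id"], "error": error})

run_worker()
print(results)
```

The nice property is that the worker is stateless: everything it needs arrives in the task, and everything it produces goes straight back, so any number of bots can drain the same queue.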
Synthesizing new training data is a relatively new concept in the field of neural network training, and does pretty much what it says on the tin.
The idea here is to find a way to generate more training data than you have originally, by either combining bits of data you do have, or coming up with new and clever ways to generate new data from scratch.
In an example I heard about, people were developing a handwriting classifier, and one of the methods they used to synthesize more training data was to have a script download random images from the internet, open them up in a word-processor document, and print a letter onto each image in a random font.
Because they knew which letter they were printing, they could now tell what the expected output of the training step that used this data should be, while still having a completely new piece of training data.
The results, as I am told, were incredibly solid.
So this would suit itself rather perfectly to distribution, especially because the processes on the Master/Control server would not need to know anything about the synthesizing process itself; all the server needs to receive is the newly generated training data and the expected output, both of which the distributed node can easily provide once it is ready to send them.
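The synthesis step itself can be anything, as long as the label comes out known for free, just like the printed letter above. Here is a small sketch in that spirit: it combines existing labeled samples with random noise to produce new ones, and the label simply carries over. The sample data and the jitter parameter are made up for illustration.

```python
# A sketch of synthesizing labeled training data: perturb existing
# samples, keep the label. Because we control the generation step,
# the expected output is known for free, like the printed letter above.
import random

random.seed(42)  # reproducible for the demo

originals = [([0.1, 0.9], "a"), ([0.8, 0.2], "b")]

def synthesize(samples, n, jitter=0.05):
    """Generate n new (features, label) pairs by perturbing random
    originals; the label carries over unchanged."""
    out = []
    for _ in range(n):
        features, label = random.choice(samples)
        noisy = [x + random.uniform(-jitter, jitter) for x in features]
        out.append((noisy, label))
    return out

new_data = synthesize(originals, 10)
print(len(new_data))  # 10 new labeled samples grown from 2 originals
```

A distributed node would run something like synthesize locally and ship only the (features, label) pairs back, which is exactly why the server never needs to understand the generation method.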
Of course we cannot use the same tactics described in the BlackHat video, because we do not want to do anything nefarious, but we can certainly set up specific pages on the Internet that people know they can visit to become part of the research, just by loading the page.
Another idea I am toying with is to open up the web service altogether, with some sort of API key structure, and allow anyone who signs up to create their own entities on the Master/Control server, while all the "bot" clients indiscriminately pick up tasks from the database and perform the necessary actions.
This would really democratize the potential computing power this can deliver to our research.
I would love to hear some ideas on this, so if you got 'em, show 'em.