Google playing Atari (how?)

Pretty excited today: I saw on the news that Google has made a cutting-edge breakthrough in building a ‘general agent’ which plays Atari games.

The bad news is, it is not breaking news. In fact it is over 2 years old.

The cool news is that I spent the summer working on improving some of its limitations for my MSc in Intelligent Systems and Robotics, and I am eager to talk about the project.

Background

The concept of playing Atari games with a general agent was originally started by DeepMind. DeepMind was an unusual business, perhaps a model for future entrepreneurs, in that they did not produce anything. They simply got the best machine learning people together and researched really interesting topics, like Deep Reinforcement Learning. I’m partly guessing here, but their funding is likely to have come from industrial partners wanting to get in on cutting-edge projects.

Anyway, Google bought DeepMind for a lot of money (see here) and now Google owns their work and has continued to develop it further.

Machine Learning

So the system works through the use of Q-Learning, which is a form of Reinforcement Learning (RL). RL is a semi-supervised machine learning technique; the semi-supervised aspect is that the system needs something to aim for, like a goal, or in this case a score. An increase in score means the system has done a good job, and losing a life means it has done a bad job.
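As a rough sketch of what that score-based reward signal could look like (the function and variable names here are purely illustrative, not anything from DeepMind’s actual code):

```python
# Illustrative only: a reward derived from the change in score,
# with a penalty when a life is lost. All names are made up for this example.
def compute_reward(prev_score, new_score, lost_life):
    reward = new_score - prev_score   # score went up = the system did a good job
    if lost_life:
        reward -= 1.0                 # losing a life = the system did a bad job
    return reward
```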

In addition to this, the system also takes pixel data from the screen. It processes this through a technique called a Convolutional Neural Network (CNN). CNNs are inspired by biological vision systems, specifically the cat’s (and also the monkey’s) visual cortex, and build on the work of LeCun. This process extracts the features of an image with the aim of reducing the image data size whilst maintaining the features. Below you can see an example of this working with the MNIST character database (taken from LeCun’s website).

LeNet5

As shown, this model has three layers; as the layers get smaller, less data is stored, but the features (which lose their recognisable pattern to us) are still present.
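To give a feel for how each layer shrinks the image while keeping the interesting responses, here is a toy numpy sketch. It is nothing like LeNet-5 or the DeepMind network, just the shape arithmetic of one convolution followed by pooling:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' convolution (strictly cross-correlation, as CNNs use)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    """2x2 max pooling: keep only the strongest response in each patch."""
    h, w = image.shape[0] // size * size, image.shape[1] // size * size
    patches = image[:h, :w].reshape(h // size, size, w // size, size)
    return patches.max(axis=(1, 3))

frame = np.random.rand(84, 84)                   # a fake greyscale game frame
edge_kernel = np.array([[1., -1.], [1., -1.]])   # a crude vertical-edge detector
features = max_pool(conv2d_valid(frame, edge_kernel))
print(frame.shape, "->", features.shape)         # (84, 84) -> (41, 41)
```

The data shrinks by roughly three quarters after one layer, yet the strong edge responses survive; a real network stacks several of these layers with learned kernels.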

So how does it all work?

So now we know the two underlying processes, how does it all fit together? The first thing to understand is that Q-Learning needs two things, the state and the action, which together are represented as Q(s,a). The output from the CNN is the state, and the actions are the viable control actions (fire, move left, move right, etc.). Each state/action pair has a value, starting at 0.0000. Imagine a spreadsheet: the left column is the possible states and the headings across the top are the actions.
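A minimal sketch of that ‘spreadsheet’ idea in Python (a plain dictionary, not anything DeepMind actually used):

```python
# Toy Q-table: each row is a state, each column an action, all values start at 0.0
ACTIONS = ["noop", "fire", "left", "right"]

q_table = {}

def get_q_values(state):
    """Look the state up; if it is new, add a row of zeros for every action."""
    if state not in q_table:
        q_table[state] = {action: 0.0 for action in ACTIONS}
    return q_table[state]
```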

Each time the system triggers/ticks/updates (tries to do something) it checks whether the state exists; if it doesn’t, it adds the state to the list and sets all of its actions to 0.000. It then tries a random action and updates the Q value for that state/action pair as discussed later. If the state does exist, it will perform the action with the highest Q value. I would also guess it occasionally tries a random action rather than the one with the highest Q value, which would avoid certain problems (not discussed here).
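Putting that per-tick behaviour together, with the occasional random action I am guessing at (usually called epsilon-greedy exploration), might look roughly like this, building on the toy Q-table above:

```python
import random

EPSILON = 0.1  # illustrative: chance of ignoring the best-known action

def choose_action(state):
    q_values = get_q_values(state)           # adds the state with zeros if unseen
    if random.random() < EPSILON:
        return random.choice(ACTIONS)        # explore: try something random
    return max(q_values, key=q_values.get)   # exploit: highest Q value wins
```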

The update rule for the Q values is based on a couple of things: firstly, was there a reward (score increase), how big was the reward, and which series of states/actions helped achieve it. This last part is handled through delayed rewards, as discussed in general Q-Learning theory. This can be seen in this equation from Wikipedia (sorry, fellow academics), and you can read how it works here.

Q(s,a) ← Q(s,a) + α [ r + γ · max_a' Q(s',a') − Q(s,a) ]
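Here α is the learning rate, γ is the discount factor that handles those delayed rewards, r is the reward and s' is the next state. A one-function sketch of that update, again building on the toy Q-table above rather than DeepMind’s implementation:

```python
ALPHA = 0.1   # learning rate (illustrative value)
GAMMA = 0.99  # discount factor: how much future reward counts

def update_q(state, action, reward, next_state):
    q_values = get_q_values(state)
    best_next = max(get_q_values(next_state).values())   # max_a' Q(s', a')
    target = reward + GAMMA * best_next                   # reward plus discounted future
    q_values[action] += ALPHA * (target - q_values[action])
```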

Summary

So the system works, we know it works and, in short, it is awesome. They have developed an agent which only needs to be given a few things (reward, controls, states). From this it can learn to play games, and trust me, from a machine learning point of view it really is awesome and it really does work.

The system has further applications; take the simple example of advertising promotions. The state is your browsing history/pattern, the actions are which type of ad to show you, and the reward is obvious: did you click it? But the system can go beyond this too; how about stock market trading?

Now the bad part they aren’t telling you. First, processing images through a CNN is computationally intensive; doing it on a CPU is too slow, although running it on a GPU is a lot better. Secondly, the system builds up its ‘knowledge’ through random actions, which means it makes a lot of mistakes during the training process. It starts with 0.000 for every action and state and has to build that knowledge up. Imagine training a system on live data, all those mistakes. Thirdly, training takes ages, and I mean ages. Think how many possible combinations of events can be happening on the screen; even when CNNs reduce the state space it is still massive.

Want to have a go?

If you want to have a go, I recommend a few things…

Read this paper by the people at DeepMind, titled ‘Playing Atari with Deep Reinforcement Learning’

Have a look at the Arcade Learning Environment (ALE) – it has papers and examples, and it is what DeepMind built their system with (Python and Java support)

You should also look at Theano for the convolutional neural network aspects. I recommend starting with the logistic regression tutorial first, though. And if you get as far as CNNs, enable CUDA if you have it.

Stella VCS is a great Atari 2600 emulator

This chap (Krist Jankorjus) had a go at recreating the DeepMind project for his MSc and put the code on GitHub. Warning: I don’t know how far he got, BUT you can see how he interacts with ALE; a minimal sketch of that interaction loop is below. A special big thanks to him for distributing the work via GitHub.
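If you just want to see the shape of that interaction, a bare-bones loop against ALE’s Python bindings looks roughly like this. The module and method names are from the ALE Python interface of that era and may differ in newer versions, and the ROM filename is just a placeholder:

```python
from random import choice
from ale_python_interface import ALEInterface  # ALE's Python bindings

ale = ALEInterface()
ale.loadROM(b"breakout.bin")                   # path to an Atari 2600 ROM
actions = ale.getLegalActionSet()

for episode in range(10):
    total_reward = 0
    while not ale.game_over():
        screen = ale.getScreenRGB()            # raw pixels: the CNN's input
        total_reward += ale.act(choice(actions))   # a purely random agent for now
        # a real agent would feed `screen` through the CNN and Q-learner here
    print("episode", episode, "score", total_reward)
    ale.reset_game()
```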

How do I know all of this?

Easy: I did my MSc project on trying to reduce the costly nature of the training process.