| { | |
| "cells": [ | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "# Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond\n", | |
| "\n", | |
| "In this iPython notebook I implement a Deep Q-Network using both Double DQN and Dueling DQN. The agent learn to solve a navigation task in a basic grid world. To learn more, read here: https://medium.com/p/8438a3e2b8df\n", | |
| "\n", | |
| "For more reinforcment learning tutorials, as well as required gridworld.py file, see:\n", | |
| "https://github.com/awjuliani/DeepRL-Agents" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "import gym\n", | |
| "import numpy as np\n", | |
| "import random\n", | |
| "import tensorflow as tf\n", | |
| "import matplotlib.pyplot as plt\n", | |
| "import scipy.misc\n", | |
| "import os\n", | |
| "%matplotlib inline" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Load the game environment" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Feel free to adjust the size of the gridworld. Making it smaller provides an easier task for our DQN agent, while making the world larger increases the challenge." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false, | |
| "scrolled": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "from gridworld import gameEnv\n", | |
| "\n", | |
| "env = gameEnv(partial=False,size=5)" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Above is an example of a starting environment in our simple game. The agent controls the blue square, and can move up, down, left, or right. The goal is to move to the green square (for +1 reward) and avoid the red square (for -1 reward). The position of the three blocks is randomized every episode." | |
| ] | |
| }, | |
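| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Below is a quick interaction sketch, added for illustration only. It assumes the gameEnv interface used throughout this notebook: env.reset() returns an 84x84x3 observation, env.step(a) returns the next state, reward, and done flag, and env.actions gives the number of available actions. It is not needed for training." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "#Illustration only: take a few random steps and print what comes back.\n", | |
| "s = env.reset() #84x84x3 RGB observation\n", | |
| "print(np.shape(s))\n", | |
| "for _ in range(5):\n", | |
| "    a = np.random.randint(0,env.actions) #pick one of the moves at random\n", | |
| "    s1,r,d = env.step(a)\n", | |
| "    print([a,r,d]) #action taken, reward received, episode done?" | |
| ] | |
| }, | |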
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Implementing the network itself" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "class Qnetwork():\n", | |
| " def __init__(self,h_size):\n", | |
| " #The network recieves a frame from the game, flattened into an array.\n", | |
| " #It then resizes it and processes it through four convolutional layers.\n", | |
| " self.scalarInput = tf.placeholder(shape=[None,21168],dtype=tf.float32)\n", | |
| " self.imageIn = tf.reshape(self.scalarInput,shape=[-1,84,84,3])\n", | |
| " self.conv1 = tf.contrib.layers.convolution2d( \\\n", | |
| " inputs=self.imageIn,num_outputs=32,kernel_size=[8,8],stride=[4,4],padding='VALID', biases_initializer=None)\n", | |
| " self.conv2 = tf.contrib.layers.convolution2d( \\\n", | |
| " inputs=self.conv1,num_outputs=64,kernel_size=[4,4],stride=[2,2],padding='VALID', biases_initializer=None)\n", | |
| " self.conv3 = tf.contrib.layers.convolution2d( \\\n", | |
| " inputs=self.conv2,num_outputs=64,kernel_size=[3,3],stride=[1,1],padding='VALID', biases_initializer=None)\n", | |
| " self.conv4 = tf.contrib.layers.convolution2d( \\\n", | |
| " inputs=self.conv3,num_outputs=512,kernel_size=[7,7],stride=[1,1],padding='VALID', biases_initializer=None)\n", | |
| " \n", | |
| " #We take the output from the final convolutional layer and split it into separate advantage and value streams.\n", | |
| " self.streamAC,self.streamVC = tf.split(3,2,self.conv4)\n", | |
| " self.streamA = tf.contrib.layers.flatten(self.streamAC)\n", | |
| " self.streamV = tf.contrib.layers.flatten(self.streamVC)\n", | |
| " self.AW = tf.Variable(tf.random_normal([h_size/2,env.actions]))\n", | |
| " self.VW = tf.Variable(tf.random_normal([h_size/2,1]))\n", | |
| " self.Advantage = tf.matmul(self.streamA,self.AW)\n", | |
| " self.Value = tf.matmul(self.streamV,self.VW)\n", | |
| " \n", | |
| " #Then combine them together to get our final Q-values.\n", | |
| " self.Qout = self.Value + tf.sub(self.Advantage,tf.reduce_mean(self.Advantage,reduction_indices=1,keep_dims=True))\n", | |
| " self.predict = tf.argmax(self.Qout,1)\n", | |
| " \n", | |
| " #Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.\n", | |
| " self.targetQ = tf.placeholder(shape=[None],dtype=tf.float32)\n", | |
| " self.actions = tf.placeholder(shape=[None],dtype=tf.int32)\n", | |
| " self.actions_onehot = tf.one_hot(self.actions,env.actions,dtype=tf.float32)\n", | |
| " \n", | |
| " self.Q = tf.reduce_sum(tf.mul(self.Qout, self.actions_onehot), reduction_indices=1)\n", | |
| " \n", | |
| " self.td_error = tf.square(self.targetQ - self.Q)\n", | |
| " self.loss = tf.reduce_mean(self.td_error)\n", | |
| " self.trainer = tf.train.AdamOptimizer(learning_rate=0.0001)\n", | |
| " self.updateModel = self.trainer.minimize(self.loss)" | |
| ] | |
| }, | |
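| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "A toy numpy illustration, for intuition only (it is not part of the TensorFlow graph), of how the value and advantage streams above are combined into Q-values: Q(s,a) = V(s) + (A(s,a) - mean over actions of A(s,a))." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "#Toy numbers standing in for a single state in a batch.\n", | |
| "V = np.array([[2.0]]) #value of the state\n", | |
| "A = np.array([[1.0,-1.0,0.5,-0.5]]) #advantages for the 4 actions\n", | |
| "Q = V + (A - np.mean(A,axis=1,keepdims=True))\n", | |
| "print(Q) #[[ 3.   1.   2.5  1.5]]\n", | |
| "print(np.argmax(Q,1)) #[0], the action self.predict would choose" | |
| ] | |
| }, | |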
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Experience Replay" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "This class allows us to store experies and sample then randomly to train the network." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "class experience_buffer():\n", | |
| " def __init__(self, buffer_size = 50000):\n", | |
| " self.buffer = []\n", | |
| " self.buffer_size = buffer_size\n", | |
| " \n", | |
| " def add(self,experience):\n", | |
| " if len(self.buffer) + len(experience) >= self.buffer_size:\n", | |
| " self.buffer[0:(len(experience)+len(self.buffer))-self.buffer_size] = []\n", | |
| " self.buffer.extend(experience)\n", | |
| " \n", | |
| " def sample(self,size):\n", | |
| " return np.reshape(np.array(random.sample(self.buffer,size)),[size,5])" | |
| ] | |
| }, | |
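| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "A quick usage sketch of the buffer with dummy data, added for illustration only. It mirrors the [s,a,r,s1,d] layout that the training loop below stores and samples." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "#Illustration only: fill a small buffer with dummy transitions and sample from it.\n", | |
| "demoBuffer = experience_buffer(buffer_size=100)\n", | |
| "for _ in range(10):\n", | |
| "    demo_s = np.zeros(21168) #a flattened 84x84x3 frame\n", | |
| "    demo_s1 = np.zeros(21168)\n", | |
| "    demoBuffer.add(np.reshape(np.array([demo_s,0,0.0,demo_s1,False]),[1,5]))\n", | |
| "print(len(demoBuffer.buffer)) #10 experiences stored\n", | |
| "print(demoBuffer.sample(4).shape) #(4, 5): a batch of [s,a,r,s1,d] rows" | |
| ] | |
| }, | |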
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "This is a simple function to resize our game frames." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "def processState(states):\n", | |
| " return np.reshape(states,[21168])" | |
| ] | |
| }, | |
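| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Sanity check, for illustration only: 84 * 84 * 3 = 21168, so each flattened frame has 21168 entries." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "print(84*84*3) #21168, the length processState produces\n", | |
| "print(np.shape(processState(env.reset()))) #(21168,)" | |
| ] | |
| }, | |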
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "These functions allow us to update the parameters of our target network with those of the primary network." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "def updateTargetGraph(tfVars,tau):\n", | |
| " total_vars = len(tfVars)\n", | |
| " op_holder = []\n", | |
| " for idx,var in enumerate(tfVars[0:total_vars/2]):\n", | |
| " op_holder.append(tfVars[idx+total_vars/2].assign((var.value()*tau) + ((1-tau)*tfVars[idx+total_vars/2].value())))\n", | |
| " return op_holder\n", | |
| "\n", | |
| "def updateTarget(op_holder,sess):\n", | |
| " for op in op_holder:\n", | |
| " sess.run(op)" | |
| ] | |
| }, | |
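| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "For intuition, here is a small numeric sketch (illustration only) of the soft update each op above performs: target = tau * primary + (1 - tau) * target. Note that updateTargetGraph assumes the first half of tf.trainable_variables() belongs to the primary network and the second half to the target network, i.e. it relies on the order in which the two networks are constructed." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "#Illustration only: one scalar weight nudged toward the primary value.\n", | |
| "demo_tau = 0.001\n", | |
| "primary_w = 1.0\n", | |
| "target_w = 0.0\n", | |
| "for _ in range(3):\n", | |
| "    target_w = demo_tau*primary_w + (1-demo_tau)*target_w\n", | |
| "print(target_w) #~0.003: the target network drifts slowly toward the primary network" | |
| ] | |
| }, | |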
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Training the network" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Setting all the training parameters" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "batch_size = 32 #How many experiences to use for each training step.\n", | |
| "update_freq = 4 #How often to perform a training step.\n", | |
| "y = .99 #Discount factor on the target Q-values\n", | |
| "startE = 1 #Starting chance of random action\n", | |
| "endE = 0.1 #Final chance of random action\n", | |
| "anneling_steps = 10000. #How many steps of training to reduce startE to endE.\n", | |
| "num_episodes = 10000 #How many episodes of game environment to train network with.\n", | |
| "pre_train_steps = 10000 #How many steps of random actions before training begins.\n", | |
| "max_epLength = 50 #The max allowed length of our episode.\n", | |
| "load_model = False #Whether to load a saved model.\n", | |
| "path = \"./dqn\" #The path to save our model to.\n", | |
| "h_size = 512 #The size of the final convolutional layer before splitting it into Advantage and Value streams.\n", | |
| "tau = 0.001 #Rate to update target network toward primary network" | |
| ] | |
| }, | |
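| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "A quick sanity check of the exploration schedule implied by these parameters (illustration only): e is reduced by a fixed amount on every environment step once pre_train_steps purely random steps have been taken, so it reaches endE roughly pre_train_steps + anneling_steps steps into training." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "#Illustration only: how fast does the random-action chance e decay?\n", | |
| "demo_stepDrop = (startE - endE)/anneling_steps\n", | |
| "print(demo_stepDrop) #9e-05 reduction per step once training begins\n", | |
| "print(pre_train_steps + anneling_steps) #~20000.0 total steps until e reaches endE" | |
| ] | |
| }, | |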
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false, | |
| "scrolled": true | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "tf.reset_default_graph()\n", | |
| "mainQN = Qnetwork(h_size)\n", | |
| "targetQN = Qnetwork(h_size)\n", | |
| "\n", | |
| "init = tf.initialize_all_variables()\n", | |
| "\n", | |
| "saver = tf.train.Saver()\n", | |
| "\n", | |
| "trainables = tf.trainable_variables()\n", | |
| "\n", | |
| "targetOps = updateTargetGraph(trainables,tau)\n", | |
| "\n", | |
| "myBuffer = experience_buffer()\n", | |
| "\n", | |
| "#Set the rate of random action decrease. \n", | |
| "e = startE\n", | |
| "stepDrop = (startE - endE)/anneling_steps\n", | |
| "\n", | |
| "#create lists to contain total rewards and steps per episode\n", | |
| "jList = []\n", | |
| "rList = []\n", | |
| "total_steps = 0\n", | |
| "\n", | |
| "#Make a path for our model to be saved in.\n", | |
| "if not os.path.exists(path):\n", | |
| " os.makedirs(path)\n", | |
| "\n", | |
| "with tf.Session() as sess:\n", | |
| " if load_model == True:\n", | |
| " print 'Loading Model...'\n", | |
| " ckpt = tf.train.get_checkpoint_state(path)\n", | |
| " saver.restore(sess,ckpt.model_checkpoint_path)\n", | |
| " sess.run(init)\n", | |
| " updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.\n", | |
| " for i in range(num_episodes):\n", | |
| " episodeBuffer = experience_buffer()\n", | |
| " #Reset environment and get first new observation\n", | |
| " s = env.reset()\n", | |
| " s = processState(s)\n", | |
| " d = False\n", | |
| " rAll = 0\n", | |
| " j = 0\n", | |
| " #The Q-Network\n", | |
| " while j < max_epLength: #If the agent takes longer than 200 moves to reach either of the blocks, end the trial.\n", | |
| " j+=1\n", | |
| " #Choose an action by greedily (with e chance of random action) from the Q-network\n", | |
| " if np.random.rand(1) < e or total_steps < pre_train_steps:\n", | |
| " a = np.random.randint(0,4)\n", | |
| " else:\n", | |
| " a = sess.run(mainQN.predict,feed_dict={mainQN.scalarInput:[s]})[0]\n", | |
| " s1,r,d = env.step(a)\n", | |
| " s1 = processState(s1)\n", | |
| " total_steps += 1\n", | |
| " episodeBuffer.add(np.reshape(np.array([s,a,r,s1,d]),[1,5])) #Save the experience to our episode buffer.\n", | |
| " \n", | |
| " if total_steps > pre_train_steps:\n", | |
| " if e > endE:\n", | |
| " e -= stepDrop\n", | |
| " \n", | |
| " if total_steps % (update_freq) == 0:\n", | |
| " trainBatch = myBuffer.sample(batch_size) #Get a random batch of experiences.\n", | |
| " #Below we perform the Double-DQN update to the target Q-values\n", | |
| " Q1 = sess.run(mainQN.predict,feed_dict={mainQN.scalarInput:np.vstack(trainBatch[:,3])})\n", | |
| " Q2 = sess.run(targetQN.Qout,feed_dict={targetQN.scalarInput:np.vstack(trainBatch[:,3])})\n", | |
| " end_multiplier = -(trainBatch[:,4] - 1)\n", | |
| " doubleQ = Q2[range(batch_size),Q1]\n", | |
| " targetQ = trainBatch[:,2] + (y*doubleQ * end_multiplier)\n", | |
| " #Update the network with our target values.\n", | |
| " _ = sess.run(mainQN.updateModel, \\\n", | |
| " feed_dict={mainQN.scalarInput:np.vstack(trainBatch[:,0]),mainQN.targetQ:targetQ, mainQN.actions:trainBatch[:,1]})\n", | |
| " \n", | |
| " updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.\n", | |
| " rAll += r\n", | |
| " s = s1\n", | |
| " \n", | |
| " if d == True:\n", | |
| "\n", | |
| " break\n", | |
| " \n", | |
| " myBuffer.add(episodeBuffer.buffer)\n", | |
| " jList.append(j)\n", | |
| " rList.append(rAll)\n", | |
| " #Periodically save the model. \n", | |
| " if i % 1000 == 0:\n", | |
| " saver.save(sess,path+'/model-'+str(i)+'.cptk')\n", | |
| " print \"Saved Model\"\n", | |
| " if len(rList) % 10 == 0:\n", | |
| " print total_steps,np.mean(rList[-10:]), e\n", | |
| " saver.save(sess,path+'/model-'+str(i)+'.cptk')\n", | |
| "print \"Percent of succesful episodes: \" + str(sum(rList)/num_episodes) + \"%\"" | |
| ] | |
| }, | |
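| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "A toy numpy walkthrough (illustration only) of the Double-DQN target computed inside the training loop above: the main network picks the argmax action for s1 (Q1), the target network supplies that action's value (Q2), and terminal transitions get no bootstrapped value." | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "#Toy batch of 2 transitions; the second one is terminal.\n", | |
| "demo_Q2 = np.array([[0.5,1.5,0.0,-0.2],[0.3,0.1,0.9,0.4]]) #target-network Q-values for s1\n", | |
| "demo_Q1 = np.array([1,2]) #main-network argmax actions for s1\n", | |
| "demo_r = np.array([0.0,1.0])\n", | |
| "demo_d = np.array([0,1])\n", | |
| "demo_end = -(demo_d - 1) #1 for non-terminal, 0 for terminal transitions\n", | |
| "demo_doubleQ = demo_Q2[range(2),demo_Q1] #[1.5, 0.9]\n", | |
| "print(demo_r + y*demo_doubleQ*demo_end) #[1.485, 1.0]" | |
| ] | |
| }, | |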
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "### Checking network learning" | |
| ] | |
| }, | |
| { | |
| "cell_type": "markdown", | |
| "metadata": {}, | |
| "source": [ | |
| "Mean reward over time" | |
| ] | |
| }, | |
| { | |
| "cell_type": "code", | |
| "execution_count": null, | |
| "metadata": { | |
| "collapsed": false | |
| }, | |
| "outputs": [], | |
| "source": [ | |
| "rMat = np.resize(np.array(rList),[len(rList)/100,100])\n", | |
| "rMean = np.average(rMat,1)\n", | |
| "plt.plot(rMean)" | |
| ] | |
| } | |
| ], | |
| "metadata": { | |
| "kernelspec": { | |
| "display_name": "Python 2", | |
| "language": "python", | |
| "name": "python2" | |
| }, | |
| "language_info": { | |
| "codemirror_mode": { | |
| "name": "ipython", | |
| "version": 2 | |
| }, | |
| "file_extension": ".py", | |
| "mimetype": "text/x-python", | |
| "name": "python", | |
| "nbconvert_exporter": "python", | |
| "pygments_lexer": "ipython2", | |
| "version": "2.7.12" | |
| } | |
| }, | |
| "nbformat": 4, | |
| "nbformat_minor": 0 | |
| } |
@mphielipp
I think you should check your version of TF first.
python -c 'import tensorflow as tf; print(tf.__version__)'
The version should be 0.12.x
@mphielipp Did it work for you? I installed the latest tensorflow 0.12.1, and pip show tensorflow says 0.12.1, but I still get the same error as you.
@mphielipp Replace that line with:
self.AW = tf.Variable(tf.random_normal([h_size // 2, env.actions]))
It expects an integer, not a float.
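(For anyone hitting this on Python 3, a minimal illustration of why the // matters: / produces a float, and the shape passed to tf.random_normal must be made of integers.)
h_size = 512
print(h_size / 2)    # 256.0 on Python 3: a float, which triggers the TypeError reported in this thread
print(h_size // 2)   # 256: an int, which tf.random_normal's shape accepts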
Hi, first off, thanks so much for your detailed write-ups and commented implementations. I have been working through them while developing my own RL environment outside of gym.
I have a few questions regarding the implementation of Double-DQN here:
- The Double-DQN paper (https://arxiv.org/pdf/1511.06581.pdf) algorithm mentions updating \theta at each step t. It looks like the implementation here updates \theta every update_freq steps, and updates \theta- immediately afterwards. Is there something I don't understand? I guess it ends up being a heuristic decision about when to perform these updates; I'm just wondering what your intuition is for the \theta, \theta- update cycle.
- Second is your nice tensorflow hack to update the targetQ weights. Does it rely on the order of initialization? Might there be a more verbose but explicit way to do it, maybe storing the targetQ ops by name in a dictionary? (See the sketch after this list.)
- Last, is there a reason for not using a nonlinearity/activation in the network?
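On that second question, here is a rough sketch of a more explicit, name-keyed pairing. This is not the notebook's code: it assumes the two networks are built inside tf.variable_scope('main') and tf.variable_scope('target'), which the notebook does not currently do, so that corresponding variables can be matched by name rather than by creation order.
def updateTargetGraph_by_name(tau):
    # Map each primary-network variable to its name with the scope prefix stripped.
    main_vars = {v.name.replace('main/', '', 1): v
                 for v in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='main')}
    op_holder = []
    for t_var in tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='target'):
        m_var = main_vars[t_var.name.replace('target/', '', 1)]
        # Same soft update as the notebook: target <- tau*primary + (1-tau)*target.
        op_holder.append(t_var.assign(tau*m_var + (1 - tau)*t_var))
    return op_holder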
I would like to ask a question: do we have to split the inputs in order to achieve dueling DQN?
Why can't I just feed all of the inputs into both the value layer and the advantage layer?
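For what it's worth, here is a sketch of the variant being asked about, using the same TF-0.12-era ops as the notebook (this is not the notebook's code, just an illustration): instead of tf.split-ing conv4, the full flattened conv output feeds both streams.
def dueling_head(conv_features, h_size, num_actions):
    # Shared flattened features go into both streams; no tf.split needed.
    stream = tf.contrib.layers.flatten(conv_features)
    AW = tf.Variable(tf.random_normal([h_size, num_actions]))
    VW = tf.Variable(tf.random_normal([h_size, 1]))
    Advantage = tf.matmul(stream, AW)
    Value = tf.matmul(stream, VW)
    # Combine exactly as the notebook does: Q = V + (A - mean(A)).
    return Value + tf.sub(Advantage,
        tf.reduce_mean(Advantage, reduction_indices=1, keep_dims=True))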
I'm getting this message (traceback excerpt):
----> 2 mainQN = Qnetwork(h_size)
---> 16 self.AW = tf.Variable(tf.random_normal([h_size/2,env.actions]))
---> 77 seed2=seed2)
--> 189 name=name)
--> 582 _Attr(op_def, input_arg.type_attr))
lib\site-packages\tensorflow\python\framework\op_def_library.py in _SatisfiesTypeConstraint(dtype, attr_def)
58 "DataType %s for attr '%s' not in list of allowed values: %s" %
59 (dtypes.as_dtype(dtype).name, attr_def.name,
---> 60 ", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: DataType float32 for attr 'T' not in list of allowed values: int32, int64