This blogpost introduces version 0.3 of the TensorForce reinforcement learning (RL) library and motivates its major design changes. Development was guided by the aim to provide a better interface and implementation for optimization modules and, subsequently, for execution. More generally, with version 0.3 we are taking a big step towards *RL models as pure TensorFlow objects*, including all control flow.

## Motivation

The key architectural problem in the first release of TensorForce was the optimizers. The implementation of natural gradients in the TRPO model was an awkward mix of TensorFlow and NumPy intertwined with RL control flow, and evolutionary methods (which were not available then) would have faced similar problems. Such entangled modules are troublesome because they interfere with features such as a generic distributed execution mode.

### How typical RL implementations work

The basic operations in RL are `act(state)` and `observe(reward)`: an agent encounters a state, reacts with an action and observes a subsequent reward. Every general-purpose RL framework provides this abstract interface for interaction with the available model implementations. In deep RL, the agent usually first collects a batch of experiences before infrequently updating its model parameters, consequently requiring a third fundamental API operation, `update(states, actions, rewards)`. These updates can be quite complex for some models, so implementations tend to get messy.
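This three-operation loop can be sketched in plain Python. The following toy agent is purely illustrative (the class and its internals are not the TensorForce API); it only shows how `act`/`observe` interact with batched calls to `update`:

```python
import random

class ToyAgent(object):
    """Illustrative skeleton of the act/observe/update interface."""

    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.states, self.actions, self.rewards = [], [], []
        self.num_updates = 0

    def act(self, state):
        # Stand-in for a policy network call.
        action = random.choice([0, 1])
        self.states.append(state)
        self.actions.append(action)
        return action

    def observe(self, reward):
        self.rewards.append(reward)
        # Update only once a full batch of experience has been collected.
        if len(self.rewards) >= self.batch_size:
            self.update(self.states, self.actions, self.rewards)
            self.states, self.actions, self.rewards = [], [], []

    def update(self, states, actions, rewards):
        # Stand-in for a (potentially complex) optimization step.
        self.num_updates += 1

agent = ToyAgent(batch_size=4)
for step in range(8):
    agent.act(state=step)
    agent.observe(reward=1.0)
assert agent.num_updates == 2  # two batches of four timesteps each
```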

Take the example of a (deep deterministic) policy gradient implementation in a popular deep RL library such as OpenAI Baselines, or TensorForce until recently: the `act` equivalent is usually a straightforward call to the underlying deep neural network, feeding the given state representation to retrieve the corresponding action, while the `observe` function just adds reward information to the internal memory. The `update` operation, however, involves various steps, for instance:

- Call to the TensorFlow baseline model to retrieve and optimize estimates for the state values.
- Calculation of the discounted cumulative reward in Python/NumPy.
- (Multiple) call(s) to the TensorFlow policy gradient model for optimization, e.g. in the case of multiple line search steps in TRPO.

Note that the TensorFlow computation graph created here is an entangled fusion of operations performed for the different steps. This means, for instance, that `act` fetches values from the same (partially) shared computation graph as `update`, which illustrates a fundamental lack of modularity.

### Issues with such entangled implementations

This mix of Python and TensorFlow within an `update` step, as well as the entangled computation graphs, can cause problems with respect to the following properties:

**Modularity.** We think that, while such mixed implementations may be preferable for research, they make it harder to design a modular RL framework, since functions are fundamentally limited to either domain: creating computation operations in TensorFlow, or implementing procedures in Python which themselves potentially interact with the TensorFlow computation graph. The way components are implemented can also vary between models. For example, DQN is typically implemented in pure TensorFlow via the native TensorFlow Adam optimizer, while TRPO's update (trust-region natural gradient optimization) is typically Python-based with various TensorFlow calls. This leads to a tendency to aggregate the entire model implementation in one large script, which is hard to understand, inflexible, and difficult to adapt to use cases other than the one it was written for (often OpenAI Gym or other simulations).

In contrast, modular implementations of `act` and `update` as pure TensorFlow functions enable a cleaner design, which is what we try to provide with TensorForce: version 0.3 provides interchangeable optimization modules (e.g. DQN can be used with an evolutionary optimizer) instead of optimization code baked into model implementations. The fact that (parts of) a model can be reused in other algorithms makes it easy to combine things, such as integrating the DQN logic as a baseline for a policy gradient model and creating a *single* combined computation graph for the `update` function of this new model.

**Distributed execution.** Apart from being conceptually undesirable, an obvious problem with a computation graph shared between `act` and `update` is that configuring distributed execution becomes harder. More generally, being able to specify the allocation and execution of operations can be desirable. Defining the entire `update` logic as modularly structured TensorFlow operations gives full control over the computation graph definition and execution. Hence it is possible, for instance, to specify multiple worker model instances which locally only implement the `act` logic and collect experience into a global memory, while the `update` logic is implemented in a global model instance which performs updates and periodically updates the parameters of its worker instances.

**Exporting models and deployment.** For TensorForce to become a truly useful applied RL library, models should be as independent from the Python script wrapper as possible, so that the TensorFlow computation graph can be imported with other language bindings or deployed with TensorFlow Serving. Unlike supervised deep learning models, popular RL model implementations are usually not purely defined in terms of TensorFlow operations, and consequently cannot be exported and used as an abstract RL computation graph.

**Performance.** We expect that many calls to the TensorFlow graph might slow down the optimization procedure, since each call involves switching between the Python-based and TensorFlow's C-based execution contexts. Implementing the entire model logic in TensorFlow avoids this and gives TensorFlow the opportunity to optimize graph execution internally.

### The role of optimizers for the computation graph

The `update` component of a model can be divided into the definition of a loss term based on the batch of states, actions and rewards, and the specification of how this loss expression is used to update model parameters. For the latter part, a simple gradient-descent-style optimizer calculates the gradients of the loss function and updates the parameters accordingly. The PPO model works similarly, but involves multiple updates per optimization step. The `update` procedure of the TRPO model with its natural-gradient-based optimizer is more complex and comprises many different steps; TRPO's update mechanism and its implementation in TensorForce 0.3 are outlined below.

The optimization method specifies where and how the loss term is integrated into the update procedure, and hence defines how the operations in the computation graph are assembled. As we illustrate in the next section, it makes sense to frame the model as providing the *callable function* of calculating the loss given an input, while the optimizer module uses it *to create* the actual TensorFlow graph. We have not seen this interpretation of optimizers realized in a library design before, and we hope it leads to much cleaner and more modular implementations of e.g. natural gradient models.
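The core idea — the model supplies a loss *callable*, and the optimizer decides how to use it — can be illustrated without any TensorFlow. In this sketch (all names hypothetical, not TensorForce code), a finite-difference gradient step stands in for `tf.gradients`, and a multi-step wrapper decides how often the loss function is evaluated:

```python
def make_quadratic_loss(target):
    # The model provides the loss as a *callable*; the optimizer,
    # not the model, decides when and how often to evaluate it.
    def fn_loss(w):
        return (w - target) ** 2
    return fn_loss

def gradient_descent_step(w, fn_loss, learning_rate=0.1, eps=1e-6):
    # Finite-difference gradient, standing in for tf.gradients.
    grad = (fn_loss(w + eps) - fn_loss(w - eps)) / (2.0 * eps)
    return w - learning_rate * grad

def multi_step(w, fn_loss, num_steps=50):
    # A MultiStep-style meta-optimizer: repeatedly applies an inner step.
    for _ in range(num_steps):
        w = gradient_descent_step(w, fn_loss)
    return w

w = multi_step(0.0, make_quadratic_loss(target=3.0))
assert abs(w - 3.0) < 0.01  # converges towards the minimum at w = 3
```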

## TensorForce 0.3 design details

In the following we are going to have a detailed look at the new implementation. Note that not all features have been fully pushed into the graph yet (e.g. replay memories), but the principal breaking changes for this have been integrated.

TensorForce version 0.3 makes heavy use of TensorFlow's `tf.make_template`, which can be understood as turning a Python function of TensorFlow operations into what we call a *'TensorFlow function'*. It guarantees that every variable introduced as part of the function is unique, and that repeatedly calling the function reuses these same variables, which are created on the first call. By modularly defining the RL models not as a computation graph, but as a *callable* graph-creator function which is passed to the optimizer module, the actual computation graph can be instantiated by the respective optimizer according to the underlying algorithm.
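A rough pure-Python analogy (this is not `tf.make_template` itself, just an illustration of its variable-sharing guarantee): variables created on the first call live in a closure and are reused by every later call.

```python
def make_template(func):
    """Rough analogy to tf.make_template: variables created on the
    first call are cached in a closure and reused on later calls."""
    variables = {}

    def templated(*args, **kwargs):
        return func(variables, *args, **kwargs)
    return templated

def tf_counter_add(variables, x):
    # The 'count' variable is created once, then shared across calls.
    variables.setdefault('count', 0)
    variables['count'] += 1
    return x + variables['count']

add = make_template(tf_counter_add)
assert add(10) == 11  # first call creates the variable
assert add(10) == 12  # second call reuses and updates it
```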

A simple example: as mentioned earlier, the optimization procedure in PPO repeatedly updates its weights based on the current loss. The optimizer step can be implemented as follows (a simplified and incomplete version of `tf_step` of the new `MultiStep` optimizer):

```
def tf_step(self, variables, fn_loss, **kwargs):
    deltas = ()
    for _ in xrange(self.num_steps):
        with tf.control_dependencies(control_inputs=deltas):
            deltas = self.optimizer.step(
                variables=variables,
                fn_loss=fn_loss,
                **kwargs
            )
```

This snippet repeatedly calls the optimization function `optimizer.step` of an internal optimizer, passing the loss callable `fn_loss` as an argument, which the internal optimizer can use to create the computation graph necessary for calculating the loss in one iteration *after* the application of the previous optimization step (ensured via `tf.control_dependencies`). The loss is a fairly commonly required optimizer argument, but it is possible to pass others via the additional arguments in `**kwargs`: the `NaturalGradient` optimizer, for instance, additionally uses a KL-divergence callable `fn_kl_divergence`, while the `Synchronization` optimizer (see next paragraph) uses neither of these, but instead receives a list of source variables to synchronize the weights with.

`MultiStep` is an instance of the new `MetaOptimizer` base class, which takes an optimizer and modifies its update in some way, in this case applying it multiple times. Another example is the `OptimizedStep` optimizer, which implements a line search along the update step proposed by its internal optimizer to find a potentially better fraction of this step (used by the TRPO model in combination with a natural gradient optimizer). A less common example is the aforementioned `Synchronization` optimizer. It defines an infrequent, *'lagging'* update step towards the weights of another optimized model, which corresponds, for instance, to the infrequent target-network parameter updates commonly found in Q-models.

The function prefix `tf_` is ubiquitous in version 0.3 and marks a function as a template which is turned into a proper TensorFlow function via `tf.make_template`. The `Distribution` class, for instance, implements functions like `Distribution.tf_sample`, `Distribution.tf_entropy` or `Distribution.tf_kl_divergence`, which are all turned into their TensorFlow function equivalents `Distribution.sample`, `Distribution.entropy` and `Distribution.kl_divergence`. The same happens with the `tf_step` function above in the `Optimizer` class, so that even optimization functions can be reused, as we saw in the `MultiStep` optimizer:

```
self.step = tf.make_template(
    name_='step',
    func_=self.tf_step,
    custom_getter_=custom_getter
)
```

We pushed this modularization and *'TensorFlow-ification'* to the point that our entire `Model` class basically only interacts via `Model.act` and `Model.observe` for step-to-step action/reward retrieval, and via `Model.update` for periodic parameter updates based on batches of experience (plus the helper function `Model.reset`). As outlined before, this is different from many other implementations of RL models, where TensorFlow and Python code are deeply intertwined and at least updating the model involves a mix of both.

The following diagram illustrates the computation flow for the loss function of a policy gradient model with its various calls to model-internal TensorFlow functions:

Note that we have not yet moved the functionality of the `Agent` class to pure TensorFlow. This includes preprocessing, exploration, internal state handling and experience collection, either based on a batch of preceding timesteps or on a replay memory of past experience.

## Optimization modules

The diagram below illustrates all TensorForce optimizers implemented so far. To explain the new mechanics, we focus on the optimizer behind TRPO, which combines the `NaturalGradient` and `OptimizedStep` optimizers.

### The optimization algorithm behind TRPO

At its core, the TRPO model is just a variant of a policy gradient model, in which reward signals (or advantage values) are weighted by the likelihood ratio of the updated stochastic policy with respect to the previous policy, instead of by the log likelihood of the updated policy, that is,

\[ \mathrm{L}(w + \Delta w) = \frac{1}{n} \sum_t \frac{ P_{w + \Delta w}(s_t) }{ P_w(s_t) } \cdot r_t. \]

The same loss expression is, for instance, used by the PPO model. The major difference between these two models lies in the way they are optimized. In the following we outline how the natural-gradient-based optimizer for TRPO works (feel free to skip the math).

### Background: Derivation of the natural gradient

TRPO’s optimization mechanism is comparatively complex and involves a sequence of steps. Starting with the following problem formulation,

\[ {\arg \max}_{\Delta w} \left( -\mathrm{L}(w + \Delta w) \right) \quad\mathrm{such\ that}\quad \mathrm{KL}(P_w || P_{w + \Delta w}) = c, \]

the loss is approximated by its first-order Taylor approximation around \(\Delta w = 0\),

\[ f(\Delta w) := -\mathrm{L}(w) \:- \nabla \mathrm{L}(w) \cdot \Delta w, \]

and the KL-divergence by its second-order Taylor approximation which, since the zeroth- and first-order terms vanish at \(\Delta w = 0\), reduces to the second-derivative term,

\[ g(\Delta w) := \mathrm{KL}(P_w || P_{w + \Delta w}) = \frac{1}{2} \Delta w^T \cdot \nabla^2 \mathrm{KL}(P_w || P_{w + \Delta w}) \cdot \Delta w = \frac{1}{2} \Delta w^T \cdot F(w) \cdot \Delta w. \]

\(F(w) := \nabla^2 \mathrm{KL}(P_w || P_{w + \Delta w})\) is the Fisher information matrix. This yields the following approximated optimization problem:

\[ {\arg \max}_{\Delta w} f(\Delta w) \quad\mathrm{such\ that}\quad g(\Delta w) \:- c = 0, \]

for which the corresponding *Lagrangian* looks as follows:

\[ \mathcal{L}(\Delta w, \lambda) = f(\Delta w) \:- \lambda \cdot (g(\Delta w) \:- c). \]

The stationary points are characterized by the gradient of the Lagrangian for \(\Delta w\) and \(\lambda\) being zero. The former gives us

\[ \nabla f(\Delta w) \:- \lambda \cdot \nabla g(\Delta w) = -\nabla \mathrm{L}(w) \:- \lambda \cdot F(w) \cdot \Delta w = 0, \]

which can be rewritten as a system of linear equations

\[ F(w) \cdot (\lambda \cdot \Delta w) = -\nabla \mathrm{L}(w). \]

We can calculate \(F(w)\) and \(\nabla \mathrm{L}(w)\), hence this system can be solved for \(\Delta \tilde{w} := \lambda \cdot \Delta w\), for instance using the iterative conjugate gradients method. Finally, the latter gradient condition given by

\[ g(\Delta w) \:- c = 0.5 \cdot \Delta w^T \cdot F(w) \cdot \Delta w \:- c = 0 \]

allows us to calculate \(\lambda\) via

\[ \lambda^2 = \frac{1}{2 c} \cdot \Delta \tilde{w}^T \cdot F(w) \cdot \Delta \tilde{w}. \]

We thus obtain the optimal update step \(\Delta w = \Delta \tilde{w} / \lambda\) and the estimated improvement \(\Delta \mathrm{L} := \nabla \mathrm{L}(w) \cdot \Delta w\).

Due to the Taylor approximations, this estimate does not in general correspond to the actual improvement. Since the above calculation is computationally expensive, while just computing the loss is comparatively cheap, TRPO adds a line search step to determine a better step size before performing the update. As part of the line search, the loss function \(\mathrm{L}(w + \Delta w)\) is repeatedly evaluated with exponentially decreasing step sizes \(m \cdot \Delta w\) (at least in John Schulman's implementation) with \(m \in \{1.0, 0.5, 0.25, \dots\}\), until the first step size is found with an actual improvement of at least the acceptance ratio \(\alpha\) (\(0.1\) in Schulman's implementation):

\[ \frac{ \mathrm{L}(w) \:- \mathrm{L}(w + m \cdot \Delta w) }{ m \cdot \Delta \mathrm{L} } \geq \alpha. \]
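This backtracking scheme is simple enough to sketch outside TensorFlow. The following illustrative Python (hypothetical names, not the TensorForce solver) tries fractions \(1, 0.5, 0.25, \dots\) and accepts the first one meeting the acceptance-ratio criterion:

```python
def line_search(loss, w, delta_w, estimated_improvement,
                accept_ratio=0.1, max_iterations=10):
    # Try step fractions 1, 0.5, 0.25, ... and accept the first whose
    # actual improvement reaches accept_ratio times the scaled estimate.
    base = loss(w)
    fraction = 1.0
    for _ in range(max_iterations):
        improvement = base - loss(w + fraction * delta_w)
        if improvement >= accept_ratio * fraction * estimated_improvement:
            return fraction
        fraction *= 0.5
    return 0.0  # no acceptable step found

# Toy quadratic loss with minimum at w = 1: the full step overshoots,
# so a smaller fraction of the proposed step is accepted.
loss = lambda w: (w - 1.0) ** 2
fraction = line_search(loss, w=0.0, delta_w=4.0, estimated_improvement=4.0)
assert fraction == 0.25  # 1.0 and 0.5 are rejected, 0.25 hits the minimum
```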

### Natural gradients and line search in TensorForce

Previous versions of TensorForce, and other implementations, define most of the calculations above as TensorFlow operations. However, the two iterative methods, i.e. the conjugate gradient algorithm and the line search, are implemented as Python loops using NumPy and repeated calls to the TensorFlow computation graph.

Our *pure TensorFlow* implementation of the TRPO optimizer is a combination of the `NaturalGradient` and `OptimizedStep` optimizers in the `tensorforce.core.optimizers` sub-package. The code snippets below are simplified and incomplete versions taken from the respective class files. We start with the `NaturalGradient` optimizer, which comprises essentially all of the math above apart from the last paragraph.

```
def tf_step(self, variables, fn_loss, fn_kl_divergence, **kwargs):
    kldiv_gradients = tf.gradients(ys=fn_kl_divergence(), xs=variables)

    # Fisher-vector product, to prevent having to actually
    # calculate the entire matrix explicitly.
    def fisher_matrix_product(x):
        x = [tf.stop_gradient(input=delta) for delta in x]
        x_kldiv_gradients = tf.add_n(inputs=[
            tf.reduce_sum(input_tensor=(delta * grad))
            for delta, grad in zip(x, kldiv_gradients)
        ])
        return tf.gradients(ys=x_kldiv_gradients, xs=variables)

    loss_gradients = tf.gradients(ys=fn_loss(), xs=variables)

    # Solve F(w) (lambda delta_w) = -grad L(w) via conjugate gradients.
    deltas = self.solver.solve(
        fn_x=fisher_matrix_product,
        x_init=None,
        b=[-grad for grad in loss_gradients]
    )

    delta_fisher_matrix_product = fisher_matrix_product(x=deltas)
    constant = 0.5 * tf.add_n(inputs=[
        tf.reduce_sum(input_tensor=(delta_F * delta))
        for delta_F, delta in zip(delta_fisher_matrix_product, deltas)
    ])
    lagrange_multiplier = tf.sqrt(x=(constant / self.learning_rate))
    estimated_deltas = [delta / lagrange_multiplier for delta in deltas]
    estimated_improvement = tf.add_n(inputs=[
        tf.reduce_sum(input_tensor=(grad * delta))
        for grad, delta in zip(loss_gradients, estimated_deltas)
    ])
    return estimated_deltas, estimated_improvement
```

The first difficulty is to solve the resulting system of linear equations \(F(w) \cdot (\lambda \cdot \Delta w) = -\nabla \mathrm{L}(w)\) with the iterative conjugate gradient method, which happens as part of `self.solver.solve(fn_x=..., b=...)` in the code. The `NaturalGradient` class here uses our new `ConjugateGradient` solver from the sub-package `tensorforce.core.optimizers.solvers`, which itself is an instance of the iterative solver super-class `Iterative(Solver)`. The `Solver` class specifies a template for a pure TensorFlow solver for any equation or optimization for \(x\) involving an expression \(f(x)\):

```
class Solver(object):

    def __init__(self):
        self.solve = tf.make_template(name_='solver', func_=self.tf_solve)

    def tf_solve(self, fn_x, *args):
        raise NotImplementedError
```

The `Iterative` solver class defines a still problem-agnostic template for iterative solver algorithms, with an initialization step, an iteration loop and a termination condition (by default a maximum number of iterations):

```
class Iterative(Solver):

    def __init__(self, max_iterations):
        super(Iterative, self).__init__()
        self.max_iterations = max_iterations
        self.initialize = tf.make_template(name_='initialize', func_=self.tf_initialize)
        self.step = tf.make_template(name_='step', func_=self.tf_step)
        self.next_step = tf.make_template(name_='next-step', func_=self.tf_next_step)

    def tf_solve(self, fn_x, x_init, *args):
        self.fn_x = fn_x
        initial_args = self.initialize(x_init, *args)
        final_args = tf.while_loop(
            cond=self.next_step,
            body=self.step,
            loop_vars=initial_args
        )
        return final_args[0]  # First argument contains the solution.

    def tf_initialize(self, x_init, *args):
        return x_init, 0

    def tf_step(self, x, iteration, *args):
        iteration += 1
        return (x, iteration) + args

    def tf_next_step(self, x, iteration, *args):
        return iteration < self.max_iterations
```

The `ConjugateGradient` class, finally, implements the conjugate gradient algorithm, solving a system of the form \(f(x) = A \cdot x \overset{!}{=} b\). For this, it defines the solver signature as `tf_solve(fn_x, x_init, b)`, with `fn_x` corresponding to \(A(x)\), `b` corresponding to \(b\), and `x_init` corresponding to the initial solution guess \(x_0\) (a zero vector if not specified):

```
deltas = self.solver.solve(
    fn_x=fisher_matrix_product,
    x_init=None,
    b=[-grad for grad in loss_gradients]
)
```

Not much is gained from showing the entire implementation here, except for the fact that the initialization `tf_initialize` as well as the loop body `tf_step` call the `fn_x` function once each, in `self.fn_x(x=x_init)` and `self.fn_x(x=conjugate)` respectively, which correspond to the two products \(A x_0\) and \(A p_k\) appearing in the Wikipedia pseudo-code for the algorithm.
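For reference, the algorithm itself is short. Here is a NumPy sketch (not the TensorForce implementation) in which `fn_x` plays the same role as above, so that only matrix-vector products are ever needed:

```python
import numpy as np

def conjugate_gradient(fn_x, b, x_init=None, max_iterations=20):
    # Solves fn_x(x) = b, where fn_x(x) = A @ x for a symmetric
    # positive-definite A, using only matrix-vector products.
    x = np.zeros_like(b) if x_init is None else x_init
    residual = b - fn_x(x)          # corresponds to b - A x_0
    conjugate = residual.copy()
    squared_residual = residual @ residual
    for _ in range(max_iterations):
        A_conjugate = fn_x(conjugate)  # corresponds to A p_k
        alpha = squared_residual / (conjugate @ A_conjugate)
        x = x + alpha * conjugate
        residual = residual - alpha * A_conjugate
        next_squared_residual = residual @ residual
        if next_squared_residual < 1e-10:
            break
        conjugate = residual + (next_squared_residual / squared_residual) * conjugate
        squared_residual = next_squared_residual
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
assert np.allclose(A @ x, b)  # exact solution is [1/11, 7/11]
```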

Next, the `OptimizedStep` optimizer takes the estimated optimal step and improvement from the `NaturalGradient` optimizer and applies the iterative line search solver, which can also be found in the `tensorforce.core.optimizers.solvers` sub-package. The basic principle of an iterative solver is the same, so we skip the concrete implementation code here.

### Optimizers as interchangeable configuration entries

Finally, the optimizer can be specified as part of the JSON agent configuration as follows; in the case of the TRPO agent class, this configuration is predefined:

```
"optimizer": {
"type": "optimized_step",
"optimizer": {
"type": "natural_gradient",
"learning_rate": 1e-3,
"cg_max_iterations": 20,
"cg_damping": 1e-3
},
"ls_max_iterations": 10,
"ls_accept_ratio": 0.9
}
```

To give another example, and to illustrate how the optimizer module can be swapped in a few lines, here is the corresponding JSON specification for the evolutionary optimizer:

```
"optimizer": {
"type": "evolutionary",
"learning_rate": 1e-2,
"num_samples": 5
}
```
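An evolutionary optimizer of this kind estimates an update direction from sampled parameter perturbations rather than from gradients. A rough NumPy sketch of that idea (illustrative only, not the exact TensorForce code; the weighting scheme here is one common variant):

```python
import numpy as np

def evolutionary_step(w, fn_loss, learning_rate=1e-2, num_samples=5,
                      noise_stddev=0.1, rng=None):
    # Perturb the parameters, weight each noise sample by the loss
    # change it causes, and step against the loss; no gradients needed.
    rng = np.random.default_rng() if rng is None else rng
    base_loss = fn_loss(w)
    update = np.zeros_like(w)
    for _ in range(num_samples):
        noise = rng.normal(scale=noise_stddev, size=w.shape)
        update += (fn_loss(w + noise) - base_loss) * noise
    return w - learning_rate * update / (num_samples * noise_stddev ** 2)

fn_loss = lambda w: float(np.sum((w - 1.0) ** 2))
w = np.zeros(3)
rng = np.random.default_rng(0)
for _ in range(200):
    w = evolutionary_step(w, fn_loss, learning_rate=0.05, rng=rng)

# The sampled update direction approximates the gradient on average,
# so the loss decreases from its starting value.
assert fn_loss(w) < fn_loss(np.zeros(3))
```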

## What’s next

The short-term architectural goal for TensorForce from here on is clear: push all the remaining pieces (memories, preprocessing) into the TensorFlow graph. The logical next step is to think about executing this graph more optimally according to the available hardware. We originally implemented A3C semantics via distributed TensorFlow and a shared-model multi-threaded runner.

However, in light of the observation (noted e.g. by OpenAI in their ACKTR/A2C post) that it is unclear whether the noise from asynchronous execution really improves training, version 0.4 will center on execution modes for GPU batching (inspired by GA3C/PAAC).