Completed
Pull Request — master (#1079)
by David
05:00
created

GradientDescent.process_batch()   A

Complexity

Conditions 2

Size

Total Lines 4

Duplication

Lines 0
Ratio 0 %
Metric Value
cc 2
dl 0
loc 4
rs 10
1
"""Training algorithms."""
0 ignored issues
show
Bug introduced by
There seems to be a cyclic import (blocks.bricks.base -> blocks.graph -> blocks.graph.bn -> blocks.filter).

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

Loading history...
Bug introduced by
There seems to be a cyclic import (blocks.bricks -> blocks.bricks.bn -> blocks.graph -> blocks.graph.bn).

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

Loading history...
Bug introduced by
There seems to be a cyclic import (blocks.bricks -> blocks.bricks.bn -> blocks.bricks.interfaces -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn).

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

Loading history...
Bug introduced by
There seems to be a cyclic import (blocks.bricks -> blocks.bricks.wrappers -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn).

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

Loading history...
Bug introduced by
There seems to be a cyclic import (blocks.bricks -> blocks.bricks.sequences -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn).

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

Loading history...
Bug introduced by
There seems to be a cyclic import (blocks.bricks -> blocks.bricks.sequences -> blocks.bricks.simple -> blocks.bricks.wrappers -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn).

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

Loading history...
Bug introduced by
There seems to be a cyclic import (blocks.bricks -> blocks.bricks.sequences -> blocks.bricks.simple -> blocks.bricks.interfaces -> blocks.bricks.base -> blocks.graph -> blocks.graph.bn).

Cyclic imports may cause partly loaded modules to be returned. This might lead to unexpected runtime behavior which is hard to debug.

Loading history...
2
import logging
3
import itertools
4
from abc import ABCMeta, abstractmethod
5
from collections import OrderedDict
6
from six.moves import reduce
7
8
from picklable_itertools.extras import equizip
9
10
import theano
11
from six import add_metaclass
12
from theano import tensor
13
14
from blocks.graph import ComputationGraph
15
from blocks.roles import add_role, ALGORITHM_HYPERPARAMETER, ALGORITHM_BUFFER
16
from blocks.theano_expressions import l2_norm
17
from blocks.utils import (dict_subset, pack, shared_floatx,
18
                          shared_floatx_zeros_matching)
19
20
logger = logging.getLogger(__name__)
21
22
23
def _create_algorithm_buffer_for(param, *args, **kwargs):
24
    buf = shared_floatx_zeros_matching(param, *args, **kwargs)
25
    buf.tag.for_parameter = param
26
    add_role(buf, ALGORITHM_BUFFER)
27
    return buf
28
29
30
@add_metaclass(ABCMeta)
31
class TrainingAlgorithm(object):
32
    """Base class for training algorithms.
33
34
    A training algorithm object has a simple life-cycle.
35
    First it is initialized by calling its :meth:`initialize` method.
36
    At this stage, for instance, Theano functions can be compiled.
37
    After that the :meth:`process_batch` method is repeatedly
38
    called with a batch of training data as a parameter.
39
40
    """
41
    @abstractmethod
42
    def initialize(self, **kwargs):
43
        """Initialize the training algorithm."""
44
        pass
45
46
    @abstractmethod
47
    def process_batch(self, batch):
48
        """Process a batch of training data.
49
50
        Attributes
51
        ----------
52
        batch : dict
53
            A dictionary of (source name, data) pairs.
54
55
        """
56
        pass
57
58
59
variable_mismatch_error = """
60
61
Blocks tried to match the sources ({sources}) of the training dataset to \
62
the names of the Theano variables ({variables}), but failed to do so. \
63
If you want to train on a subset of the sources that your dataset provides, \
64
pass the `sources` keyword argument to its constructor. Or pass \
65
on_unused_sources='warn' or on_unused_sources='ignore' to \
66
the GradientDescent algorithm."""
67
68
source_missing_error = """
69
70
Blocks didn't find all the sources ({sources}) of the training dataset \
71
that match the names of the Theano variables ({variables})."""
72
73
74
determinism_error = """Cannot infer parameter list in a fixed order.
0 ignored issues
show
Bug introduced by
A suspicious escape sequence \ was found. Did you maybe forget to add an r prefix?

Escape sequences in Python are generally interpreted according to rules similar to standard C. Only if strings are prefixed with r or R are they interpreted as regular expressions.

The escape sequence that was used indicates that you might have intended to write a regular expression.

Learn more about the available escape sequences. in the Python documentation.

Loading history...
75
76
Because dictionaries are unordered (and Python uses randomized hashing, \
77
which can change the iteration order over the same dictionary from one \
78
interpreter session to the next), Blocks cannot infer the parameters list \
79
from a plain dictionary of gradients in an order that is reproducible \
80
across interpreter sessions; please either specify the parameters \
81
explicitly or pass gradients as an OrderedDict (though exercise care in \
82
constructing that OrderedDict, as an OrderedDict \ created by iterating \
83
over an unordered iterable (e.g. a dict) will still \ have an arbitrary \
84
and unpredictable order that could cause problems with \
85
reproducibility)."""
86
87
88
class UpdatesAlgorithm(TrainingAlgorithm):
89
    """Base class for algorithms that use Theano functions with updates.
90
91
    Parameters
92
    ----------
93
    updates : list of tuples or :class:`~collections.OrderedDict`
94
        The updates that should be performed.
95
    theano_func_kwargs : dict, optional
96
        A passthrough to `theano.function` for additional arguments.
97
        Useful for passing `profile` or `mode` arguments to the theano
98
        function that will be compiled for the algorithm.
99
    on_unused_sources : str, one of 'raise' (default), 'ignore', 'warn'
100
        Controls behavior when not all sources in a batch are used
101
        (i.e. there is no variable with a matching name in the inputs
102
        of the computational graph of the updates).
103
104
    Attributes
105
    ----------
106
    updates : list of :class:`~tensor.TensorSharedVariable` updates
107
        Updates to be done for every batch. It is required that the
108
        updates are done using the old values of optimized parameters.
109
110
    """
111
    def __init__(self, updates=None, theano_func_kwargs=None,
112
                 on_unused_sources='raise', **kwargs):
113
        self.updates = [] if updates is None else updates
114
        self.theano_func_kwargs = (theano_func_kwargs if theano_func_kwargs
115
                                   is not None else dict())
116
        self.on_unused_sources = on_unused_sources
117
        super(UpdatesAlgorithm, self).__init__(**kwargs)
118
119
    def initialize(self):
120
        logger.info("Initializing the training algorithm")
121
        update_values = [new_value for _, new_value in self.updates]
122
        logger.debug("Inferring graph inputs...")
123
        self.inputs = ComputationGraph(update_values).inputs
124
        logger.debug("Compiling training function...")
125
        self._function = theano.function(
126
            self.inputs, [], updates=self.updates, **self.theano_func_kwargs)
127
        logger.info("The training algorithm is initialized")
128
129
    @property
130
    def updates(self):
131
        return self._updates
132
133
    @updates.setter
134
    def updates(self, value):
135
        self._updates = value
136
137
    def add_updates(self, updates):
138
        """Add updates to the training process.
139
140
        The updates will be done _before_ the parameters are changed.
141
142
        Parameters
143
        ----------
144
        updates : list of tuples or :class:`~collections.OrderedDict`
145
            The updates to add.
146
147
        """
148
        if isinstance(updates, OrderedDict):
149
            updates = list(updates.items())
150
        if not isinstance(updates, list):
151
            raise ValueError
152
        self.updates.extend(updates)
153
154
    def _validate_source_names(self, batch):
155
        in_names = [v.name for v in self.inputs]
156
157
        if not set(in_names).issubset(set(batch.keys())):
158
            raise ValueError("Didn't find all sources: " +
159
                             source_missing_error.format(
160
                                 sources=batch.keys(),
161
                                 variables=in_names))
162
        if not set(batch.keys()).issubset(set(in_names)):
163
            if self.on_unused_sources == 'ignore':
164
                pass
165
            elif self.on_unused_sources == 'warn':
166
                if not hasattr(self, '_unused_source_warned'):
167
                    logger.warn(variable_mismatch_error.format(
168
                        sources=batch.keys(),
169
                        variables=in_names))
170
                self._unused_source_warned = True
171
            elif self.on_unused_sources == 'raise':
172
                raise ValueError(
173
                    "mismatch of variable names and data sources" +
174
                    variable_mismatch_error.format(
175
                        sources=batch.keys(),
176
                        variables=in_names))
177
            else:
178
                raise ValueError("Wrong value of on_unused_sources: {}."
179
                                 .format(self.on_unused_sources))
180
181
    def process_batch(self, batch):
182
        self._validate_source_names(batch)
183
        ordered_batch = [batch[v.name] for v in self.inputs]
184
        self._function(*ordered_batch)
185
186
187
class GradientDescent(UpdatesAlgorithm):
188
    """A base class for all gradient descent algorithms.
189
190
    By "gradient descent" we mean a training algorithm of the following
191
    form:
192
193
    .. code-block::  python
194
195
        for batch in data:
196
            steps = step_rule.compute_steps(parameters,
197
                                            gradients_wr_parameters)
198
            for parameter in parameters:
199
                parameter -= steps[parameter]
200
201
    Note, that the step is *subtracted, not added*! This is done in order
202
    to make step rule chaining possible.
203
204
    Parameters
205
    ----------
206
    cost : :class:`~tensor.TensorVariable`, optional
207
        The objective to be minimized.
208
    parameters : list of :class:`~tensor.TensorSharedVariable`, optional
209
        The parameters to be tuned. If not provided, inferred from the
210
        keys of `gradients` (in which case `gradients` *must* be an
211
        `OrderedDict`).
212
    step_rule : instance of :class:`StepRule`, optional
213
        An object encapsulating most of the algorithm's logic. Its
214
        `compute_steps` method is called to get Theano expression for
215
        steps.  Note, that the step rule might have a state, e.g. to
216
        remember a weighted sum of gradients from previous steps like it is
217
        done in gradient descent with momentum. If ``None``, an instance of
218
        :class:`Scale` is created.
219
    gradients : OrderedDict, optional
220
        A dictionary mapping a parameter to an expression for the cost's
221
        gradient with respect to the parameter. If ``None``, the gradient
222
        are taken automatically using :func:`theano.gradient.grad`.
223
    known_grads : dict, optional
224
        A passthrough to `theano.tensor.grad`'s `known_grads` argument.
225
        Useful when you know the [approximate] gradients of some
226
        sub-expressions and would like Theano to use that information
227
        to compute parameter gradients. Only makes sense when `gradients`
228
        is `None`.
229
    consider_constant : list, optional
230
        A passthrough to `theano.tensor.grad`'s `consider_constant`
231
        argument.  A list of expressions through which gradients will not
232
        be backpropagated. Only makes sense when `gradients` is `None`.
233
234
    Attributes
235
    ----------
236
    gradients : OrderedDict
237
        The gradient dictionary.
238
    step_rule : instance of :class:`StepRule`
239
        The step rule.
240
241
    Notes
242
    -----
243
    Changing `updates` attribute or calling `add_updates` after
244
    the `initialize` method is called will have no effect.
245
246
    `gradients` must be an `OrderedDict` if `parameters` is unspecified
247
    because ordinary dictionaries have an unpredictable iteration
248
    order due to hash randomization (which is enabled by default since
249
    versions 2.7.3 and 3.2.3 of Python). This source of variability,
250
    when combined with Theano's heuristic graph optimizations, can cause
251
    serious reproducibility issues.
252
253
    .. todo::
254
255
       Some shared variables are not parameters (e.g. those created by
256
       random streams).
257
258
    .. todo::
259
260
       Due to a rather premature status of the :class:`ComputationGraph`
261
       class the parameter used only inside scans are not fetched
262
       currently.
263
264
    """
265
    def __init__(self, cost=None, parameters=None, step_rule=None,
266
                 gradients=None, known_grads=None, consider_constant=None,
267
                 **kwargs):
268
        # Set initial values for cost, parameters, gradients.
269
        self.cost = cost
270
        self.parameters = parameters
271
        self.gradients = gradients
272
273
        # If we don't have gradients, we'll need to infer them from the
274
        # cost and the parameters, both of which must not be None.
275
        if not self.gradients:
276
            if self.cost is None:
277
                raise ValueError("can't infer gradients; no cost specified")
278
            elif self.parameters is None or len(self.parameters) == 0:
279
                raise ValueError("can't infer gradients; "
280
                                 "no parameters specified")
281
282
            # While this strictly speaking could be a dict and not an
283
            # OrderedDict (because we iterate over it in the order of
284
            # self.parameters), this guards a little bit against
285
            # nondeterminism introduced by future refactoring.
286
            logger.info("Taking the cost gradient")
287
            self.gradients = OrderedDict(
288
                equizip(self.parameters, tensor.grad(
289
                    self.cost, self.parameters,
290
                    known_grads=known_grads,
291
                    consider_constant=consider_constant)))
292
            logger.info("The cost gradient computation graph is built")
293
        else:
294
            if self.parameters is None and isinstance(gradients, OrderedDict):
295
                # If the dictionary is ordered, it's safe to use the keys
296
                # as they have a deterministic order.
297
                self.parameters = list(self.gradients.keys())
298
            elif self.parameters is not None:
299
                # If parameters and gradients.keys() don't match we can
300
                # try to recover if gradients is ordered.
301
                if set(self.parameters) != set(self.gradients.keys()):
302
                    logger.warn("Specified parameters list does not match "
303
                                "keys in provided gradient dictionary; "
304
                                "using parameters inferred from gradients")
305
                    if not isinstance(self.gradients, OrderedDict):
306
                        raise ValueError(determinism_error)
307
                    self.parameters = list(self.gradients.keys())
308
            else:
309
                # self.parameters is not None, and gradients isn't
310
                # an OrderedDict. We can't do anything safe.
311
                raise ValueError(determinism_error)
312
            if known_grads:
313
                raise ValueError("known_grads has no effect when gradients "
314
                                 "are passed in")
315
            if consider_constant is not None:
316
                raise ValueError("consider_constant has no effect when "
317
                                 "gradients are passed in")
318
319
        # The order in which the different gradient terms appears
320
        # here matters, as floating point addition is non-commutative (and
321
        # Theano's graph optimizations are not order-independent).
322
        # This is why we do not use .values().
323
        gradient_values = [self.gradients[p] for p in self.parameters]
324
        self.total_gradient_norm = (l2_norm(gradient_values)
325
                                    .copy(name="total_gradient_norm"))
326
327
        self.step_rule = step_rule if step_rule else Scale()
328
        logger.debug("Computing parameter steps...")
329
        self.steps, self.step_rule_updates = (
330
            self.step_rule.compute_steps(self.gradients))
331
332
        # Same as gradient_values above: the order may influence a
333
        # bunch of things, so enforce a consistent one (don't use
334
        # .values()).
335
        step_values = [self.steps[p] for p in self.parameters]
336
        self.total_step_norm = (l2_norm(step_values)
337
                                .copy(name="total_step_norm"))
338
339
        # Once again, iterating on gradients may not be deterministically
340
        # ordered if it is not an OrderedDict. We add the updates here in
341
        # the order specified in self.parameters. Keep it this way to
342
        # maintain reproducibility.
343
        kwargs.setdefault('updates', []).extend(
344
            itertools.chain(((parameter, parameter - self.steps[parameter])
345
                             for parameter in self.parameters),
346
                            self.step_rule_updates)
347
        )
348
        # The required inputs are going to be the inputs of the graph
349
        # used in the actual step computation. This may be different from
350
        # those used in the gradient computation, if a step rule elects
351
        # to ignore certain parameters for some reason. This is usually
352
        # presumably an error, and basing the inputs list on the updates
353
        # ensures that the unused inputs check catches it.
354
        # Once again, we use step_values which is painstakingly ordered
355
        # in a reproducible manner.
356
        self.inputs = ComputationGraph(step_values).inputs
357
        super(GradientDescent, self).__init__(**kwargs)
358
359
360
@add_metaclass(ABCMeta)
361
class StepRule(object):
362
    """A rule to compute steps for a gradient descent algorithm."""
363
    def compute_step(self, parameter, previous_step):
364
        """Build a Theano expression for the step for a parameter.
365
366
        This method is called by default implementation of
367
        :meth:`compute_steps`, it relieves from writing a loop each time.
368
369
        Parameters
370
        ----------
371
        parameter : :class:`~tensor.TensorSharedVariable`
372
            The parameter.
373
        previous_step : :class:`~tensor.TensorVariable`
374
            Some quantity related to the gradient of the cost with respect
375
            to the parameter, either the gradient itself or a step in a
376
            related direction.
377
378
        Returns
379
        -------
380
        step : :class:`~theano.Variable`
381
            Theano variable for the step to take.
382
        updates : list
383
            A list of tuples representing updates to be performed. This
384
            is useful for stateful rules such as :class:`Momentum` which
385
            need to update shared variables after itetations.
386
387
        """
388
        raise NotImplementedError
389
390
    def compute_steps(self, previous_steps):
391
        """Build a Theano expression for steps for all parameters.
392
393
        Override this method if you want to process the steps
394
        with respect to all parameters as a whole, not parameter-wise.
395
396
        Parameters
397
        ----------
398
        previous_steps : OrderedDict
399
            An :class:`~OrderedDict` of
400
            (:class:`~tensor.TensorSharedVariable`
401
            :class:`~tensor.TensorVariable`) pairs. The keys are the
402
            parameters being trained, the values are the expressions for
403
            quantities related to gradients of the cost with respect to
404
            the parameters, either the gradients themselves or steps in
405
            related directions.
406
407
        Returns
408
        -------
409
        steps : OrderedDict
410
            A dictionary of the proposed steps in the same form as
411
            `previous_steps`.
412
        updates : list
413
            A list of tuples representing updates to be performed.
414
415
        """
416
        parameter_wise = [self.compute_step(parameter,
417
                                            previous_steps[parameter])
418
                          for parameter in previous_steps]
419
        steps, updates = equizip(*parameter_wise)
420
        steps = OrderedDict((parameter, step) for parameter, step
421
                            in equizip(previous_steps.keys(), steps))
422
        updates = list(itertools.chain(*updates))
423
        return steps, updates
424
425
426
class CompositeRule(StepRule):
427
    """Chains several step rules.
428
429
    Parameters
430
    ----------
431
    components : list of :class:`StepRule`
432
        The learning rules to be chained. The rules will be applied in the
433
        order as given.
434
435
    """
436
    def __init__(self, components):
437
        self.components = components
438
439
    def compute_steps(self, previous_steps):
440
        steps = previous_steps
441
        updates = []
442
        for rule in self.components:
443
            steps, more_updates = rule.compute_steps(steps)
444
            updates += more_updates
445
        return steps, updates
446
447
448
class Scale(StepRule):
449
    """A step in the direction proportional to the previous step.
450
451
    If used in :class:`GradientDescent` alone, this step rule implements
452
    steepest descent.
453
454
    Parameters
455
    ----------
456
    learning_rate : float
457
        The learning rate by which the previous step is multiplied to
458
        produce the step.
459
460
    Attributes
461
    ----------
462
    learning_rate : :class:`~tensor.TensorSharedVariable`
463
        The shared variable storing the learning rate used.
464
465
    """
466
    def __init__(self, learning_rate=1.0):
467
        self.learning_rate = shared_floatx(learning_rate, "learning_rate")
468
        add_role(self.learning_rate, ALGORITHM_HYPERPARAMETER)
469
470
    def compute_step(self, parameter, previous_step):
471
        return self.learning_rate * previous_step, []
472
473
474
class BasicMomentum(StepRule):
475
    """Accumulates step with exponential discount.
476
477
    Parameters
478
    ----------
479
    momentum : float, optional
480
        The momentum coefficient. Defaults to 0.
481
482
    Notes
483
    -----
484
    This step rule is intended to be used in conjunction with another
485
    step rule, _e.g._ :class:`Scale`. For an all-batteries-included
486
    experience, look at :class:`Momentum`.
487
488
    """
489
    def __init__(self, momentum=0.):
490
        self.momentum = shared_floatx(momentum, "momentum")
491
        add_role(self.momentum, ALGORITHM_HYPERPARAMETER)
492
493
    def compute_step(self, parameter, previous_step):
494
        velocity = _create_algorithm_buffer_for(parameter, "velocity")
495
        step = self.momentum * velocity + previous_step
496
        updates = [(velocity, step)]
497
        return step, updates
498
499
500
class Momentum(CompositeRule):
501
    """Accumulates step with exponential discount.
502
503
    Combines :class:`BasicMomentum` and :class:`Scale` to form the
504
    usual momentum step rule.
505
506
    Parameters
507
    ----------
508
    learning_rate : float, optional
509
        The learning rate by which the previous step scaled. Defaults to 1.
510
    momentum : float, optional
511
        The momentum coefficient. Defaults to 0.
512
513
    Attributes
514
    ----------
515
    learning_rate : :class:`~tensor.SharedVariable`
516
        A variable for learning rate.
517
    momentum : :class:`~tensor.SharedVariable`
518
        A variable for momentum.
519
520
    See Also
521
    --------
522 View Code Duplication
    :class:`SharedVariableModifier`
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
523
524
    """
525
    def __init__(self, learning_rate=1.0, momentum=0.):
526
        scale = Scale(learning_rate=learning_rate)
527
        basic_momentum = BasicMomentum(momentum=momentum)
528
        self.learning_rate = scale.learning_rate
529
        self.momentum = basic_momentum.momentum
530
        self.components = [scale, basic_momentum]
531
532
533
class AdaDelta(StepRule):
534
    """Adapts the step size over time using only first order information.
535
536
    Parameters
537
    ----------
538
    decay_rate : float, optional
539
        Decay rate in [0, 1]. Defaults to 0.95.
540
    epsilon : float, optional
541
        Stabilizing constant for RMS. Defaults to 1e-6.
542
543
    Notes
544
    -----
545
    For more information, see [ADADELTA]_.
546
547
    .. [ADADELTA] Matthew D. Zeiler, *ADADELTA: An Adaptive Learning
548
       Rate Method*, arXiv:1212.5701.
549
550
    """
551
    def __init__(self, decay_rate=0.95, epsilon=1e-6):
552
        if not 0.0 <= decay_rate <= 1.0:
553
            raise ValueError("decay rate needs to be in [0, 1]")
554
        self.decay_rate = shared_floatx(decay_rate, "decay_rate")
555
        add_role(self.decay_rate, ALGORITHM_HYPERPARAMETER)
556
        self.epsilon = shared_floatx(epsilon, "epsilon")
557
        add_role(self.epsilon, ALGORITHM_HYPERPARAMETER)
558
559
    def compute_step(self, parameter, previous_step):
560
        mean_square_step_tm1 = _create_algorithm_buffer_for(
561
            parameter, "mean_square_step_tm1")
562
        mean_square_delta_x_tm1 = _create_algorithm_buffer_for(
563
            parameter, "mean_square_delta_x_tm1")
564
565
        mean_square_step_t = (
566
            self.decay_rate * mean_square_step_tm1 +
567
            (1 - self.decay_rate) * tensor.sqr(previous_step)
568
        )
569
570
        rms_delta_x_tm1 = tensor.sqrt(mean_square_delta_x_tm1 + self.epsilon)
571
        rms_step_t = tensor.sqrt(mean_square_step_t + self.epsilon)
572
        delta_x_t = rms_delta_x_tm1 / rms_step_t * previous_step
573
574
        mean_square_delta_x_t = (
575
            self.decay_rate * mean_square_delta_x_tm1 +
576
            (1 - self.decay_rate) * tensor.sqr(delta_x_t)
577
        )
578
579
        step = delta_x_t
580
        updates = [(mean_square_step_tm1, mean_square_step_t),
581
                   (mean_square_delta_x_tm1, mean_square_delta_x_t)]
582
        return step, updates
583
584
585
class BasicRMSProp(StepRule):
586
    """Scales the step size by a running average of the recent step norms.
587
588
    Parameters
589
    ----------
590
    decay_rate : float, optional
591
        How fast the running average decays, value in [0, 1]
592
        (lower is faster).  Defaults to 0.9.
593
    max_scaling : float, optional
594
        Maximum scaling of the step size, in case the running average is
595
        really small. Needs to be greater than 0. Defaults to 1e5.
596
597
    Notes
598
    -----
599
    This step rule is intended to be used in conjunction with another
600
    step rule, _e.g._ :class:`Scale`. For an all-batteries-included
601
    experience, look at :class:`RMSProp`.
602
603
    In general, this step rule should be used _before_ other step rules,
604
    because it has normalization properties that may undo their work.
605
    For instance, it should be applied first when used in conjunction
606
    with :class:`Scale`.
607
608
    For more information, see [Hint2014]_.
609
610
    """
611
    def __init__(self, decay_rate=0.9, max_scaling=1e5):
612
        if not 0.0 <= decay_rate <= 1.0:
613
            raise ValueError("decay rate needs to be in [0, 1]")
614
        if max_scaling <= 0:
615
            raise ValueError("max. scaling needs to be greater than 0")
616
        self.decay_rate = shared_floatx(decay_rate, "decay_rate")
617
        add_role(self.decay_rate, ALGORITHM_HYPERPARAMETER)
618
        self.epsilon = 1. / max_scaling
619
620
    def compute_step(self, parameter, previous_step):
621
        mean_square_step_tm1 = _create_algorithm_buffer_for(
622
            parameter, "mean_square_step_tm1")
623
        mean_square_step_t = (
624
            self.decay_rate * mean_square_step_tm1 +
625
            (1 - self.decay_rate) * tensor.sqr(previous_step))
626
        rms_step_t = tensor.maximum(
627
            tensor.sqrt(mean_square_step_t), self.epsilon)
628
        step = previous_step / rms_step_t
629
        updates = [(mean_square_step_tm1, mean_square_step_t)]
630
        return step, updates
631
632
633
class RMSProp(CompositeRule):
634
    """Scales the step size by a running average of the recent step norms.
635
636
    Combines :class:`BasicRMSProp` and :class:`Scale` to form the step rule
637
    described in [Hint2014]_.
638
639
    .. [Hint2014] Geoff Hinton, *Neural Networks for Machine Learning*,
640
       lecture 6a,
641
       http://cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
642
643
    Parameters
644
    ----------
645
    learning_rate : float, optional
646
        The learning rate by which the previous step scaled. Defaults to 1.
647
    decay_rate : float, optional
648
        How fast the running average decays (lower is faster).
649
        Defaults to 0.9.
650
    max_scaling : float, optional
651
        Maximum scaling of the step size, in case the running average is
652
        really small. Defaults to 1e5.
653
654
    Attributes
655
    ----------
656
    learning_rate : :class:`~tensor.SharedVariable`
657
        A variable for learning rate.
658
    decay_rate : :class:`~tensor.SharedVariable`
659
        A variable for decay rate.
660
661
    See Also
662
    --------
663
    :class:`SharedVariableModifier`
664
665
    """
666
    def __init__(self, learning_rate=1.0, decay_rate=0.9, max_scaling=1e5):
667
        basic_rms_prop = BasicRMSProp(decay_rate=decay_rate,
668
                                      max_scaling=max_scaling)
669
        scale = Scale(learning_rate=learning_rate)
670
        self.learning_rate = scale.learning_rate
671
        self.decay_rate = basic_rms_prop.decay_rate
672
        self.components = [basic_rms_prop, scale]
673
674
675
class StepClipping(StepRule):
676
    """Rescales an entire step if its L2 norm exceeds a threshold.
677
678
    When the previous steps are the gradients, this step rule performs
679
    gradient clipping.
680
681
    Parameters
682
    ----------
683
    threshold : float, optional
684
        The maximum permitted L2 norm for the step. The step
685
        will be rescaled to be not higher than this quanity.
686
        If ``None``, no rescaling will be applied.
687
688
    Attributes
689
    ----------
690
    threshold : :class:`.tensor.TensorSharedVariable`
691
        The shared variable storing the clipping threshold used.
692
693
    """
694
    def __init__(self, threshold=None):
695
        if threshold:
696
            self.threshold = shared_floatx(threshold, "threshold")
697
            add_role(self.threshold, ALGORITHM_HYPERPARAMETER)
698
699
    def compute_steps(self, previous_steps):
700
        if not hasattr(self, 'threshold'):
701
            return previous_steps
702
        norm = l2_norm(previous_steps.values())
703
        multiplier = tensor.switch(norm < self.threshold,
704
                                   1, self.threshold / norm)
705
        steps = OrderedDict(
706
            (parameter, step * multiplier)
707
            for parameter, step in previous_steps.items())
708
        return steps, []
709
710
711
class VariableClipping(StepRule):
712
    """Clip the maximum norm of individual variables along certain axes.
713
714
    This :class:`StepRule` can be used to implement L2 norm constraints on
715
    e.g. the weight vectors of individual hidden units, convolutional
716
    filters or entire weight tensors. Combine with :class:`Restrict`
717
    (and possibly :class:`CompositeRule`), to apply such constraints only
718
    to certain variables and/or apply different norm constraints to
719
    different variables.
720
721
    Parameters
722
    ----------
723
    threshold : float
724
        Maximum norm for a given (portion of a) tensor.
725 View Code Duplication
    axis : int or iterable, optional
0 ignored issues
show
Duplication introduced by
This code seems to be duplicated in your project.
Loading history...
726
        An integer single axis, or an iterable collection of integer
727
        axes over which to sum in order to calculate the L2 norm. If
728
        `None` (the default), the norm is computed over all elements
729
        of the tensor.
730
731
    Notes
732
    -----
733
    Because of the way the :class:`StepRule` API works, this particular
734
    rule implements norm clipping of the value *after* update in the
735
    following way: it computes ``parameter - previous_step``, scales it
736
    to have (possibly axes-wise) norm(s) of at most `threshold`,
737
    then subtracts *that* value from `parameter` to yield an 'equivalent
738
    step' that respects the desired norm constraints. This procedure
739
    implicitly assumes one is doing simple (stochastic) gradient descent,
740
    and so steps computed by this step rule may not make sense for use
741
    in other contexts.
742
743
    Investigations into max-norm regularization date from [Srebro2005]_.
744
    The first appearance of this technique as a regularization method
745
    for the weight vectors of individual hidden units in feed-forward
746
    neural networks may be [Hinton2012]_.
747
748
    .. [Srebro2005] Nathan Srebro and Adi Shraibman.
749
       "Rank, Trace-Norm and Max-Norm". *18th Annual Conference
750
       on Learning Theory (COLT)*, June 2005.
751
752
    .. [Hinton2012] Geoffrey E. Hinton, Nitish Srivastava,
753
       Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov.
754
       "Improving neural networks by preventing co-adaptation of
755
       feature detectors". arXiv:1207.0580.
756
757
    """
758
    def __init__(self, threshold, axis=None):
759
        axis = pack(axis) if axis is not None else ()
760
        self.axis = set(axis)
761
        self.threshold = shared_floatx(threshold, "threshold")
762
        add_role(self.threshold, ALGORITHM_HYPERPARAMETER)
763
        if len(axis) != len(self.axis):
764
            raise ValueError("axis must be unique")
765
766
    def compute_step(self, parameter, previous_step):
767
        if any(ax >= previous_step.ndim for ax in self.axis):
768
            raise ValueError("Invalid axis {} for {}, ndim={}".format(
769
                self.axis, parameter, previous_step.ndim))
770
        if len(self.axis) == 0:
771
            norms = l2_norm([parameter - previous_step])
772
        else:
773
            squares = tensor.sqr(parameter - previous_step)
774
            norms = tensor.sqrt(
775
                reduce(lambda t, a: t.sum(axis=a, keepdims=True),
776
                       sorted(self.axis), squares))
777
        # We want a step s* that is the same as scaling
778
        # (parameter - previous_step) by threshold / norm
779
        # when threshold < norm.
780
        shrinking_step = (parameter -
781
                          (self.threshold / norms) *
782
                          (parameter - previous_step))
783
        return tensor.switch(norms > self.threshold,
784
                             shrinking_step,
785
                             previous_step), ()
786
787
788
class AdaGrad(StepRule):
789
    """Implements the AdaGrad learning rule.
790
791
    Parameters
792
    ----------
793
    learning_rate : float, optional
794
        Step size.
795
        Default value is set to 0.0002.
796
    epsilon : float, optional
797
        Stabilizing constant for one over root of sum of squares.
798
        Defaults to 1e-6.
799
800
    Notes
801
    -----
802
    For more information, see [ADAGRAD]_.
803
804
    .. [ADADGRAD] Duchi J, Hazan E, Singer Y.,
805
       *Adaptive subgradient methods for online learning and
806
        stochastic optimization*,
807
       http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
808
809
    """
810
    def __init__(self, learning_rate=0.002, epsilon=1e-6):
811
        self.learning_rate = shared_floatx(learning_rate, "learning_rate")
812
        self.epsilon = shared_floatx(epsilon, "epsilon")
813
        add_role(self.learning_rate, ALGORITHM_HYPERPARAMETER)
814
        add_role(self.epsilon, ALGORITHM_HYPERPARAMETER)
815
816
    def compute_step(self, parameter, previous_step):
817
        name = 'adagrad_sqs'
818
        if parameter.name:
819
            name += '_' + parameter.name
820
        ssq = _create_algorithm_buffer_for(parameter, name=name)
821
822
        ssq_t = (tensor.sqr(previous_step) + ssq)
823
        step = (self.learning_rate * previous_step /
824
                (tensor.sqrt(ssq_t) + self.epsilon))
825
826
        updates = [(ssq, ssq_t)]
827
828
        return step, updates
829
830
831
class Adam(StepRule):
832
    """Adam optimizer as described in [King2014]_.
833
834
    .. [King2014] Diederik Kingma, Jimmy Ba,
835
       *Adam: A Method for Stochastic Optimization*,
836
       http://arxiv.org/abs/1412.6980
837
838
    Parameters
839
    ----------
840
    learning_rate : float, optional
841
        Step size.
842
        Default value is set to 0.002.
843
    beta1 : float, optional
844
        Exponential decay rate for the first moment estimates.
845
        Default value is set to 0.1.
846
    beta2 : float, optional
847
        Exponential decay rate for the second moment estimates.
848
        Default value is set to 0.001.
849
    epsilon : float, optional
850
        Default value is set to 1e-8.
851
    decay_factor : float, optional
852
        Default value is set to 1 - 1e-8.
853
854
    """
855
    def __init__(self, learning_rate=0.002,
856
                 beta1=0.1, beta2=0.001, epsilon=1e-8,
857
                 decay_factor=(1 - 1e-8)):
858
        self.learning_rate = shared_floatx(learning_rate, "learning_rate")
859
        self.beta1 = shared_floatx(beta1, "beta1")
860
        self.beta2 = shared_floatx(beta2, "beta2")
861
        self.epsilon = shared_floatx(epsilon, "epsilon")
862
        self.decay_factor = shared_floatx(decay_factor, "decay_factor")
863
        for param in [self.learning_rate, self.beta1, self.beta2, self.epsilon,
864
                      self.decay_factor]:
865
            add_role(param, ALGORITHM_HYPERPARAMETER)
866
867
    def compute_step(self, parameter, previous_step):
868
        mean = _create_algorithm_buffer_for(parameter, 'mean')
869
        variance = _create_algorithm_buffer_for(parameter, 'variance')
870
        time = shared_floatx(0., 'time')
871
        add_role(time, ALGORITHM_BUFFER)
872
873
        t1 = time + 1
874
        learning_rate = (self.learning_rate *
875
                         tensor.sqrt((1. - (1. - self.beta2)**t1)) /
876
                         (1. - (1. - self.beta1)**t1))
877
        beta_1t = 1 - (1 - self.beta1) * self.decay_factor ** (t1 - 1)
878
        mean_t = beta_1t * previous_step + (1. - beta_1t) * mean
879
        variance_t = (self.beta2 * tensor.sqr(previous_step) +
880
                      (1. - self.beta2) * variance)
881
        step = (learning_rate * mean_t /
882
                (tensor.sqrt(variance_t) + self.epsilon))
883
884
        updates = [(mean, mean_t),
885
                   (variance, variance_t),
886
                   (time, t1)]
887
888
        return step, updates
889
890
891
class RemoveNotFinite(StepRule):
892
    """A step rule that skips steps with non-finite elements.
893
894
    Replaces a step (the parameter update of a single shared variable)
895
    which contains non-finite elements (such as ``inf`` or ``NaN``) with a
896
    step rescaling the parameters.
897
898
    Parameters
899
    ----------
900
    scaler : float, optional
901
        The scaling applied to the parameter in case the step contains
902
        non-finite elements. Defaults to 1, which means that parameters
903
        will not be changed.
904
905
    Notes
906
    -----
907
    This rule should be applied last!
908
909
    This trick was originally used in the GroundHog_ framework.
910
911
    .. _GroundHog: https://github.com/lisa-groundhog/GroundHog
912
913
    """
914
    def __init__(self, scaler=1):
915
        self.scaler = scaler
916
917
    def compute_step(self, parameter, previous_step):
918
        step_sum = tensor.sum(previous_step)
919
        not_finite = (tensor.isnan(step_sum) +
920
                      tensor.isinf(step_sum))
921
        step = tensor.switch(
922
            not_finite > 0, (1 - self.scaler) * parameter, previous_step)
923
        return step, []
924
925
926
class Restrict(StepRule):
927
    """Applies a given :class:`StepRule` only to certain variables.
928
929
    Example applications include clipping steps on only certain parameters,
930
    or scaling a certain kind of parameter's updates (e.g. adding an
931
    additional scalar multiplier to the steps taken on convolutional
932
    filters).
933
934
    Parameters
935
    ----------
936
    step_rule : :class:`StepRule`
937
        The :class:`StepRule` to be applied on the given variables.
938
    variables : iterable
939
        A collection of Theano variables on which to apply `step_rule`.
940
        Variables not appearing in this collection will not have
941
        `step_rule` applied to them.
942
943
    """
944
    def __init__(self, step_rule, variables):
945
        self.step_rule = step_rule
946
        self.variables = frozenset(variables)
947
948
    def compute_steps(self, previous_steps):
949
        filtered_previous_steps = dict_subset(previous_steps, self.variables)
950
        steps, updates = self.step_rule.compute_steps(filtered_previous_steps)
951
        actual = OrderedDict((parameter, steps[parameter])
952
                             if parameter in steps
953
                             else (parameter, previous_steps[parameter])
954
                             for parameter in previous_steps)
955
        return actual, updates
956