Neural Architecture Search: A Survey

[ deep-learning  nas  network-architecture-search  ]

Deep Learning has enabled remarkable progress over the last years on a variety of tasks, such as image recognition, speech recognition, and machine translation. One crucial aspect of this progress is the design of novel neural architectures. Currently employed architectures have mostly been developed manually by human experts, which is a time-consuming and error-prone process. Because of this, there is growing interest in automated neural architecture search methods. We provide an overview of existing work in this field of research and categorize these methods according to three dimensions: search space, search strategy, and performance estimation strategy.

This post is a summary of the paper Neural Architecture Search: A Survey.

A neural architecture search (NAS) method typically consists of three components: a search space, a search strategy, and a performance estimation strategy.
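Before looking at each component, here is a purely illustrative sketch of how the three pieces interact. The toy search space, the `propose` and `estimate_performance` functions, and the scoring rule are all hypothetical placeholders, not anything from the paper:

```python
import random

# Purely illustrative NAS loop. The dictionaries returned by `propose`
# form the search space, `propose` itself is the search strategy (plain
# random search here), and `estimate_performance` stands in for the
# performance estimation strategy.
def propose(history):
    return {"depth": random.randint(1, 10),
            "width": random.choice([16, 32, 64])}

def estimate_performance(arch):
    # Stand-in for "train the architecture, evaluate on validation data".
    return -abs(arch["depth"] - 6) - abs(arch["width"] - 32) / 16

history = []
for _ in range(20):
    arch = propose(history)
    history.append((arch, estimate_performance(arch)))

print(max(history, key=lambda pair: pair[1]))
```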

Search Space

The search space defines which architectures can be represented in principle. Incorporating prior knowledge about properties well-suited for a task can reduce the size of the search space and simplify the search. However, this also introduces human bias, which may prevent finding novel architectural building blocks.

A relatively simple search space is the space of chain-structured neural networks: a sequence of n layers, where layer i receives its input from layer i − 1 and feeds its output to layer i + 1. Such a space is parametrized by the number of layers, the type of operation each layer executes (e.g., convolution or pooling), and the operation's hyperparameters (e.g., number of filters or kernel size).
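As a concrete toy encoding (the operation names and hyperparameter choices below are my own illustration, not the paper's):

```python
import random

# Toy encoding of a chain-structured search space: an architecture is
# an ordered list of (operation, hyperparameters) pairs, one per layer.
OPS = ["conv3x3", "conv5x5", "maxpool3x3"]

def sample_chain(max_layers=10):
    n_layers = random.randint(1, max_layers)
    return [(random.choice(OPS), {"filters": random.choice([16, 32, 64])})
            for _ in range(n_layers)]

print(sample_chain())
```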

Later works allow multiple branches and skip connections in the network, so that the input of layer i combines the outputs of several earlier layers; chain-structured networks, residual networks (summing outputs), and DenseNets (concatenating outputs) are all special cases of such multi-branch spaces.
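A multi-branch architecture can be encoded as a small DAG instead of a flat list; again a purely illustrative sketch:

```python
# Toy multi-branch encoding: each layer records which earlier layers
# feed it and how their outputs are merged. Index 0 denotes the input.
# A chain is the special case inputs == [i - 1] with no merging;
# "add" gives residual-style connections, "concat" DenseNet-style ones.
architecture = [
    {"op": "conv3x3", "inputs": [0], "merge": None},
    {"op": "conv3x3", "inputs": [1], "merge": None},
    {"op": "conv3x3", "inputs": [1, 2], "merge": "add"},  # skip connection
]
print(architecture)
```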

Motivated by hand-crafted architectures consisting of repeated motifs, more recent works propose to search for such motifs, dubbed cells or blocks, rather than for the whole architecture.

For example, Zoph et al. (2018) optimize two different kinds of cells: a normal cell that preserves the dimensionality of the input and a reduction cell that reduces the spatial dimensions. The final architecture is then built by stacking these cells in a predefined manner. This search space has three major advantages compared to the ones discussed above:

  • The size of the search space is drastically reduced since cells usually consist of significantly fewer layers than whole architectures.
  • Architectures built from cells can more easily be transferred or adapted to other data sets by simply varying the number of cells and filters used within a model.
  • Creating architectures by repeating building blocks has proven a useful design principle in general, such as repeating an LSTM cell in RNNs or stacking residual blocks.

Zoph et al. (2018) build a sequential model from cells, in which each cell receives the outputs of the two preceding cells as input, whereas in the work of Cai et al. (2018b) cells can be combined arbitrarily, in the spirit of DenseNet.
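A minimal sketch of such a predefined macro-structure; the stacking pattern and cell placeholders are illustrative, loosely following the "N normal cells, then one reduction cell" pattern:

```python
# Hypothetical macro-structure in the style of Zoph et al. (2018):
# repeat N normal cells, insert a reduction cell between stages.
def build_macro_structure(normal_cell, reduction_cell,
                          n_normal=2, n_stages=3):
    layers = []
    for stage in range(n_stages):
        layers.extend([normal_cell] * n_normal)
        if stage < n_stages - 1:
            layers.append(reduction_cell)
    return layers

print(build_macro_structure("normal", "reduce"))
# ['normal', 'normal', 'reduce', 'normal', 'normal', 'reduce',
#  'normal', 'normal']
```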

Other works also study how the learned cells should be organized into the final neural network, i.e., the macro-architecture. One step in the direction of optimizing macro-architectures is the hierarchical search space introduced by Liu et al. (2018b), which consists of several levels of motifs:

  • The first level consists of the set of primitive operations,
  • the second level of different motifs that connect primitive operations via a directed acyclic graph,
  • the third level of motifs that encode how to connect second-level motifs, and so on.

The cell-based search space can be seen as a special case of this hierarchical search space where the number of levels is three, the second-level motifs correspond to the cells, and the third level is the hard-coded macro-architecture.
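A toy data layout for such a hierarchy (my own illustration; the paper represents motifs as DAGs whose edges are labeled with lower-level motifs):

```python
# Level 1: primitive operations.
level1 = ["conv1x1", "conv3x3", "maxpool3x3"]

# Level 2: motifs = small DAGs whose edges apply level-1 operations.
# Each motif maps edge (i, j) -> primitive op applied to node i's output.
level2 = [
    {(0, 1): "conv3x3", (0, 2): "conv1x1", (1, 2): "maxpool3x3"},
    {(0, 1): "conv1x1", (1, 2): "conv3x3"},
]

# Level 3: a motif whose edges are level-2 motifs, i.e. the
# macro-structure. In the cell-based special case, this level is
# hard-coded rather than searched.
level3 = {(0, 1): level2[0], (1, 2): level2[1], (0, 2): level2[0]}
```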

Search Strategy

The search strategy details how to explore the search space. It encompasses the classical exploration-exploitation trade-off since, on the one hand, it is desirable to find well-performing architectures quickly, while on the other hand, premature convergence to a region of suboptimal architectures should be avoided.

Many different search strategies can be used to explore the space of neural architectures, including random search, Bayesian optimization, evolutionary methods, reinforcement learning (RL), and gradient-based methods.

Reinforcement Learning

NAS became a mainstream research topic in the machine learning community after Zoph and Le (2017) obtained competitive performance on the CIFAR-10 and Penn Treebank benchmarks with a search strategy based on reinforcement learning.

Under the reinforcement learning framework, the generation of a neural architecture can be considered the agent's action, with the action space identical to the search space. The agent's reward is based on an estimate of the performance of the trained architecture on unseen data.

Different RL approaches differ in how they represent the agent’s policy and how they optimize it:

  • Zoph and Le (2017) use a recurrent neural network (RNN) policy to sequentially sample a string that in turn encodes the neural architecture. They initially trained this network with the REINFORCE policy gradient algorithm (Williams, 1992), but switched to Proximal Policy Optimization (Schulman et al., 2017) in their follow-up work (Zoph et al., 2018). A toy controller sketch follows this list.
  • Baker et al. (2017a) use Q-learning to train a policy which sequentially chooses a layer’s type and corresponding hyperparameters.
  • Cai et al. (2018a) frame NAS as a sequential decision process where the action corresponds to the application of function-preserving mutations, dubbed network morphisms (Chen et al., 2016; Wei et al., 2017). To deal with variable-length network architectures, they use a bi-directional LSTM to encode architectures into a fixed-length representation; based on this encoding, actor networks decide on the sampled action.
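A minimal sketch of the RNN-controller-plus-REINFORCE idea from the first bullet, using PyTorch. The search space (one of three operations for each of four layers) and the reward function are toy stand-ins for training and evaluating a child network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of an RNN controller trained with REINFORCE, in the
# spirit of Zoph and Le (2017). The reward is a toy stand-in for
# training a child network and measuring validation accuracy.
N_OPS, N_LAYERS, HIDDEN = 3, 4, 32

class Controller(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTMCell(N_OPS, HIDDEN)
        self.head = nn.Linear(HIDDEN, N_OPS)

    def sample(self):
        h, c = torch.zeros(1, HIDDEN), torch.zeros(1, HIDDEN)
        x = torch.zeros(1, N_OPS)          # empty "previous choice"
        arch, log_probs = [], []
        for _ in range(N_LAYERS):
            h, c = self.rnn(x, (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            op = dist.sample()
            arch.append(op.item())
            log_probs.append(dist.log_prob(op))
            x = F.one_hot(op, N_OPS).float()  # feed choice back in
        return arch, torch.stack(log_probs).sum()

def reward(arch):
    # Stand-in for "train the sampled architecture, return accuracy".
    return sum(arch) / (N_LAYERS * (N_OPS - 1))

controller = Controller()
opt = torch.optim.Adam(controller.parameters(), lr=0.01)
for _ in range(100):
    arch, log_prob = controller.sample()
    loss = -reward(arch) * log_prob        # REINFORCE estimator
    opt.zero_grad()
    loss.backward()
    opt.step()
```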

Neuro-Evolutionary Methods

Evolutionary algorithms evolve a population of models, i.e., a set of (possibly trained) networks; in every evolution step, at least one model from the population is sampled and serves as a parent to generate offspring by applying mutations to it. In the context of NAS, mutations are local operations, such as adding or removing a layer, altering the hyperparameters of a layer, adding skip connections, or altering training hyperparameters. After training the offspring, their fitness (e.g., performance on a validation set) is evaluated and they are added to the population.
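A toy version of this loop, combining tournament-selection parent sampling with remove-the-worst population updates (both discussed below). The architecture encoding and fitness function are stand-ins for real training and validation:

```python
import random

# Toy neuro-evolution loop. Architectures are flat lists of layer
# types, `fitness` stands in for "train, then measure validation
# accuracy", and mutations are the local operations described above.
OPS = ["conv3x3", "conv5x5", "maxpool3x3"]

def mutate(arch):
    arch = list(arch)
    kind = random.choice(["add", "remove", "alter"])
    if kind == "remove" and len(arch) > 1:
        arch.pop(random.randrange(len(arch)))
    elif kind == "alter":
        arch[random.randrange(len(arch))] = random.choice(OPS)
    else:  # "add" (also used instead of removing the last layer)
        arch.insert(random.randrange(len(arch) + 1), random.choice(OPS))
    return arch

def fitness(arch):
    return -abs(len(arch) - 6)   # toy stand-in for validation accuracy

population = [[random.choice(OPS)] for _ in range(10)]
for _ in range(100):
    parent = max(random.sample(population, 3), key=fitness)  # tournament
    population.append(mutate(parent))                        # offspring
    population.remove(min(population, key=fitness))          # drop worst
print(max(population, key=fitness))
```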

  • Miller et al. (1989) use genetic algorithms to propose architectures and use backpropagation to optimize their weights.
  • Many later works (Angeline et al., 1994; Stanley and Miikkulainen, 2002; Stanley et al., 2009) use genetic algorithms to optimize both the neural architecture and its weights.
  • However, gradient-based weight optimization outperforms evolution when scaling to networks with millions of weights, so more recent approaches (Real et al., 2017; Suganuma et al., 2017; Liu et al., 2018b; Real et al., 2019; Miikkulainen et al., 2017; Xie and Yuille, 2017; Elsken et al., 2019) use gradient-based methods for optimizing weights and evolutionary algorithms solely for optimizing the neural architecture itself.

Neuro-evolutionary methods differ in how they sample parents, update populations, and generate offspring.

How to sample parents

  • Real et al. (2017), Real et al. (2019), and Liu et al. (2018b) use tournament selection (Goldberg and Deb, 1991) to sample parents.
  • Elsken et al. (2019) sample parents from a multi-objective Pareto front using an inverse density.

How to update the population

  • Real et al. (2017) remove the worst individual from the population.
  • Real et al. (2019) found it beneficial to remove the oldest individual instead, which decreases greediness (a minimal sketch follows this list).
  • Liu et al. (2018b) do not remove individuals at all.
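As a contrast to the remove-the-worst update used in the sketch above, aging evolution (second bullet) can be sketched with a bounded FIFO queue; encoding, mutation, and fitness are again toy stand-ins:

```python
import random
from collections import deque

# Aging evolution, sketched: the population is a bounded FIFO queue,
# so appending a child automatically discards the *oldest* individual,
# regardless of its fitness.
def fitness(arch):
    return -abs(len(arch) - 6)   # toy stand-in

def mutate(arch):
    return arch + [random.choice("ABC")]   # toy stand-in

population = deque(([random.choice("ABC")] for _ in range(10)), maxlen=10)
for _ in range(50):
    parent = max(random.sample(list(population), 3), key=fitness)
    population.append(mutate(parent))   # oldest individual drops out
print(max(population, key=fitness))
```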

How to generate offspring

  • Most approaches initialize child networks randomly.
  • Elsken et al. (2019) employ Lamarckian inheritance, i.e., knowledge (in the form of learned weights) is passed on from a parent network to its children by using network morphisms (a toy widening example follows this list).
  • Real et al. (2017) also let an offspring inherit all parameters of its parent that are not affected by the applied mutation.
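To make the network-morphism idea concrete, here is a toy function-preserving widening in the spirit of Net2WiderNet (Chen et al., 2016) for a two-layer MLP; the shapes and the final assertion are my own illustration:

```python
import numpy as np

# Toy function-preserving widening: widen the hidden layer of a
# two-layer MLP by duplicating units, so the child starts with the
# parent's exact function rather than random weights.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # parent: 4 inputs -> 3 hidden units
W2 = rng.normal(size=(3, 2))   # parent: 3 hidden units -> 2 outputs

def widen(W1, W2, new_width):
    old_width = W1.shape[1]
    # Map each new hidden unit to an original one (originals first).
    mapping = np.concatenate([
        np.arange(old_width),
        rng.integers(0, old_width, new_width - old_width)])
    counts = np.bincount(mapping, minlength=old_width)
    W1_new = W1[:, mapping]                          # duplicate columns
    W2_new = W2[mapping, :] / counts[mapping, None]  # rescale to preserve sums
    return W1_new, W2_new

W1_new, W2_new = widen(W1, W2, new_width=5)
x = rng.normal(size=(1, 4))
relu = lambda z: np.maximum(z, 0)
# The widened child computes exactly the same function as the parent.
assert np.allclose(relu(x @ W1) @ W2, relu(x @ W1_new) @ W2_new)
```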

Bayesian Optimization

Bayesian Optimization (BO; see, e.g., Shahriari et al., 2016) is one of the most popular methods for hyperparameter optimization, but it has not been applied to NAS by many groups, since typical BO toolboxes are based on Gaussian processes and focus on low-dimensional continuous optimization problems.

  • Swersky et al. (2013) and Kandasamy et al. (2018) derive kernel functions for architecture search spaces in order to use classic GP-based BO methods.
  • Several works use tree-based models (in particular, tree Parzen estimators (Bergstra et al., 2011) or random forests (Hutter et al., 2011)) to effectively search high-dimensional conditional spaces.
  • Negrinho and Gordon (2017) and Wistuba (2017) exploit the tree-structure of their search space and use Monte Carlo Tree Search.
  • Elsken et al. (2017) propose a simple yet well-performing hill climbing algorithm that discovers high-quality architectures by greedily moving in the direction of better-performing architectures, without requiring more sophisticated exploration mechanisms (a toy version follows this list).
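A toy rendering of the hill-climbing idea from the last bullet; mutation and scoring are stand-ins for applying network morphisms and briefly training each child:

```python
import random

# Toy hill climbing in the spirit of Elsken et al. (2017): apply a few
# mutations to the current architecture, score each child, and greedily
# keep the best. `score` stands in for briefly training and validating.
OPS = ["conv3x3", "conv5x5", "maxpool3x3"]

def mutate(arch):
    child = list(arch)
    child.insert(random.randrange(len(child) + 1), random.choice(OPS))
    return child

def score(arch):
    return -abs(len(arch) - 6)   # toy stand-in for validation accuracy

current = [random.choice(OPS)]
while True:
    children = [mutate(current) for _ in range(8)]
    best = max(children, key=score)
    if score(best) <= score(current):
        break              # no neighbor improves: local optimum reached
    current = best
print(current)
```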

Performance Estimation Strategy

To guide their search process, these strategies need to estimate the performance of a given architecture A. The simplest way of doing this is to train A on training data and evaluate its performance on validation data. However, training each architecture to be evaluated from scratch frequently yields computational demands on the order of thousands of GPU days for NAS.

This naturally leads to developing methods for speeding up performance estimation:

  • Lower Fidelity Estimates: Training time is reduced by training for fewer epochs, on a subset of the data, with downscaled models, on downscaled data, …
  • Learning Curve Extrapolation: Training time is reduced because performance can be extrapolated after just a few epochs of training.
  • Weight Inheritance / Network Morphisms: Instead of training models from scratch, they are warm-started by inheriting the weights of, e.g., a parent model.
  • One-Shot Models / Weight Sharing: Only the one-shot model needs to be trained; its weights are then shared across different architectures, which are just subgraphs of the one-shot model (a toy supernet follows this list).
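A toy supernet illustrating weight sharing: each layer holds all candidate operations, and evaluating an architecture just selects one op per layer, reusing the shared weights (all dimensions and operations here are illustrative):

```python
import torch
import torch.nn as nn

# Toy one-shot model: every layer holds all candidate operations, and a
# sampled architecture picks one op per layer at forward time, so all
# architectures share the supernet's weights.
def candidate_ops(channels):
    return [nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.MaxPool2d(3, stride=1, padding=1)]

class OneShotModel(nn.Module):
    def __init__(self, channels=16, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.ModuleList(candidate_ops(channels)) for _ in range(n_layers))

    def forward(self, x, arch):
        # `arch` selects one candidate op per layer, i.e., a subgraph.
        for layer, op_idx in zip(self.layers, arch):
            x = layer[op_idx](x)
        return x

model = OneShotModel()
x = torch.randn(2, 16, 8, 8)
print(model(x, arch=[0, 2, 1, 0]).shape)   # evaluated with shared weights
```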
Written on May 30, 2019