Neural architecture search (NAS) is a popular area of machine learning, with the goal of automating the development of the best neural network for a given dataset. Since 2017, hundreds of NAS algorithms have been proposed, and with the recent release of two NAS benchmark datasets [1, 2], the computational cost of running NAS experiments has dropped from GPU-days to CPU-minutes. In this post, we summarize our recent paper, which suggests that existing NAS benchmarks may be too small to effectively evaluate NAS algorithms.

ArXiv paper: https://arxiv.org/abs/2005.02960

Source code: https://github.com/realityengines/local_search

## Local search for NAS

Local search is one of the simplest families of algorithms in combinatorial optimization, yet it yields strong approximation guarantees for canonical NP-hard problems such as the traveling salesman problem and vertex cover. While it is ubiquitous in theoretical computer science, local search is often neglected in hyperparameter optimization and neural architecture search. This may be due to the recent trend in the NAS literature toward increasingly complex techniques, such as reinforcement learning or training a neural network as a subroutine during the search. In contrast to prior work, we study a NAS algorithm that can be implemented in five lines of code.

In the simplest version of local search (often called the hill-climbing algorithm), we start by training a random neural architecture, then iteratively train all architectures in its neighborhood and choose the best one for the next iteration. The neighborhood is typically defined as all architectures that differ from the current one by a single operation or edge. Local search terminates when it reaches a (local or global) optimum, or when it exhausts its runtime budget.
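As a sketch, the hill-climbing loop above fits in a few lines. The bit-string search space, `neighbors` function, and `loss` below are hypothetical stand-ins for an architecture encoding and a train-and-evaluate routine:

```python
import random

def local_search(init, neighbors, loss):
    """Hill-climbing: repeatedly move to the best neighbor until none improves."""
    v = init
    while True:
        best = min(neighbors(v), key=loss)
        if loss(best) >= loss(v):   # local (or global) minimum reached
            return v
        v = best

# Toy search space: bit-strings of length 4; neighbors differ in one bit
# (standing in for "architectures that differ by one operation or edge").
def neighbors(v):
    return [v[:i] + (1 - v[i],) + v[i + 1:] for i in range(len(v))]

loss = lambda v: sum(v)  # stand-in for validation loss; minimum at all zeros
start = tuple(random.randint(0, 1) for _ in range(4))
print(local_search(start, neighbors, loss))  # -> (0, 0, 0, 0)
```

In a real NAS setting, each call to `loss` would train and evaluate an architecture, so the per-iteration cost is dominated by the neighborhood size rather than the search logic.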

We show that despite its simplicity, this algorithm achieves state-of-the-art results on all four datasets in the popular NAS-Bench-101 and NAS-Bench-201 benchmarks, each of which contains at most 10^{5} pretrained architectures. In contrast, the same local search algorithm performs worse than *random search* on the popular DARTS search space, which consists of 10^{18} architectures. This suggests that existing NAS benchmark datasets may be too small to effectively compare NAS algorithms.

In the above plots, we see that local search outperforms many popular algorithms on the NAS-Bench-101 and NAS-Bench-201 datasets.

## A theory of local search

In the last section, we showed experimentally that local search achieves state-of-the-art performance on some search spaces and subpar performance on other search spaces. Motivated by those findings, we conducted a theoretical analysis of local search, giving a full characterization of its performance.

First we define the local search function LS, where LS(*v*) denotes the result of running an iteration of local search (Algorithm 1 above) from an architecture *v*. In other words, LS(*v*) is the neighbor *u* of *v* with the minimum *l(u)*.

In the left figure, each point is an architecture from NAS-Bench-201 trained on CIFAR-10, and each edge denotes the LS function. We plot the trees of the nine architectures with the highest accuracies. The right figure is similar, but the architectures are assigned validation losses at random. We see that local search is much more likely to converge to an architecture with low loss on structured data (CIFAR-10) than on unstructured (random) data.

In the theory below, we also define LS^{*}(*v*) as the final output of local search when starting at *v*, and we define LS^{-*}(*v*) as the set of architectures whose local search output is *v*. For more details, see our arXiv paper.

Neural architecture search is an optimization problem that is a hybrid between discrete and continuous optimization. On the one hand, there is a finite, fixed search space of architectures. For example, NAS-Bench-201 consists of 5^{6}=15625 architectures. On the other hand, each architecture has a real-valued validation accuracy, which is a random variable depending on the random initialization of the neural network; that is, the accuracy will be slightly different each time you retrain a given architecture. To model this hybrid optimization problem, we assume that there is a fixed search space and that the validation losses are defined by a global probability density function pdf_{n} and a local probability density function pdf_{e} (note that just one of these distributions would not fully specify the problem). Given an architecture *v*, its loss *l(v)* is drawn from pdf_{n}, and the loss of a random neighbor of *v* is drawn from pdf_{e}(*l(v)*).
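As a minimal sketch of this generative model, sampling a loss for an architecture and then a correlated loss for one of its neighbors might look as follows. The Gaussian forms are chosen purely for illustration; in the paper, pdf_{n} and pdf_{e} are estimated from data:

```python
import random

# Illustrative (assumed) forms of the two distributions in the model:
# pdf_n is the global loss distribution over the whole search space;
# pdf_e(l_v) is the distribution of a neighbor's loss given the parent's loss l_v.
def sample_pdf_n():
    return random.gauss(0.5, 0.15)    # global: loss of a random architecture

def sample_pdf_e(l_v, sigma=0.05):
    return random.gauss(l_v, sigma)   # local: neighbor's loss, centred on l(v)

l_v = sample_pdf_n()
l_u = sample_pdf_e(l_v)
print(f"l(v) = {l_v:.3f}, neighbor l(u) = {l_u:.3f}")
```

A small `sigma` corresponds to strong locality (neighbors have similar losses); as `sigma` grows, the model approaches the unstructured case where losses are effectively independent.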

In this model, we are able to derive a function for the total number of local minima, as well as the probability that running local search from a random architecture will reach within epsilon of the global minimum. Let *n* denote the total number of architectures, and assume that each architecture has exactly *s* neighbors.

To use this theorem, we need an expression for **E**[|LS^{-*}(*v*)|], which is what we derive in the next lemma.

See our paper for all details and proofs. This theorem helps us understand the settings in which local search performs well. In particular, if pdf_{e} exhibits locality, that is, neighboring architectures have a high probability of having similar losses, then Theorem 5.1 tells us that local search is much more likely to output a near-optimal architecture. We also see that local search is more efficient when the neighborhood size *s* is smaller.
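The effect of locality can be seen even in a toy simulation. The ring-shaped search space below is our own illustrative stand-in, not the paper's setup: with a smooth, single-basin loss (strong locality), local search from a random start always reaches the global minimum, while with uniformly random losses it usually gets stuck in one of many local minima:

```python
import random

def local_search(v, nbrs, loss):
    """One run of hill-climbing over a precomputed neighbor table."""
    while True:
        u = min(nbrs[v], key=lambda w: loss[w])
        if loss[u] >= loss[v]:   # local (or global) minimum reached
            return v
        v = u

def success_rate(loss_fn, n=64, trials=2000):
    """Fraction of random starts that reach the global minimum on a ring
    of n architectures (so every architecture has s = 2 neighbors)."""
    nbrs = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
    hits = 0
    for _ in range(trials):
        loss = loss_fn(n)
        best = min(range(n), key=lambda i: loss[i])
        hits += local_search(random.randrange(n), nbrs, loss) == best
    return hits / trials

smooth = lambda n: [(i - n / 2) ** 2 for i in range(n)]  # strong locality: one basin
noisy = lambda n: [random.random() for _ in range(n)]    # no locality: many local minima
print(success_rate(smooth), success_rate(noisy))
```

The first rate is 1.0, while the second is far lower, mirroring the gap between the CIFAR-10 and random-loss figures above.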

Finally, if we have estimates of pdf_{e} and pdf_{n} for our dataset, we can plug them into Theorem 5.1 to predict the performance of local search. We did this for four datasets based on NAS-Bench-201: CIFAR-10, CIFAR-100, ImageNet16-120, and a synthetic version with losses assigned uniformly at random. We see that the theory matches the performance on real data fairly well.

In summary, we studied a very simple algorithm for NAS, local search, showing that it achieves state-of-the-art results on the most popular existing NAS benchmarks (NAS-Bench-101 and NAS-Bench-201) and subpar performance on the much larger DARTS search space. This suggests that existing NAS benchmarks may be too small and/or too simple to effectively compare NAS methods. Since local search is a simple technique that sometimes gives strong performance, we encourage its use as a baseline for NAS in the future. Finally, we defined a theoretical framework for studying local search for NAS and derived a characterization of its performance.

## References

[1] Chris Ying, Aaron Klein, Esteban Real, Eric Christiansen, Kevin Murphy, and Frank Hutter. NAS-Bench-101: Towards reproducible neural architecture search. In Proceedings of the International Conference on Machine Learning (ICML), 2019.

[2] Xuanyi Dong and Yi Yang. NAS-bench-201: Extending the scope of reproducible neural architecture search. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
