[TOC]

  1. 摘要里得出本文干了两件事:提出一个NAS baseline——random search,探索了已发表的NAS方法的可复现性。而提出这个random search是基于两个观察:1)NAS是特殊的超参优化问题; 2)random search在超参优化中是具有竞争力的baseline。这个科研思路非常好,具有借鉴意义——某种形式的等价+技术挪用。提出了两种random search,其效果是random search with early stopping与ENAS相当,不如random search with weight sharing。
  2. 这里random search with early stopping里的early stopping和DARTS++的early stopping什么区别和联系?
  3. 似乎本文的random search并不是其它NAS方法中的那种在搜索空间均匀采样的random search?考察一下各个方法里的random search分别是如何实现的
  4. 指出有两种reproducibility:1)"exact" reproducibility,能否重现所声称的实验结果; 2)"broad" reproducibility,所声称的实验结果是否robust和generalizable。robust可以理解,就是多次实验算方差,但generalizable是指什么
  5. 这篇文章宣称的两大贡献点:1) " We help ground existing NAS results by providing a new perspective on the gap between traditional hyperparameter optimization and leading NAS methods. Specifically, we evaluate a general hyperparameter optimization method combining random search with early-stopping "——将NAS和超参优化联系起来,搞出了random search with early stopping; 2) " We identify a small subset of NAS components that are sufficient for achieving good empirical results."——确认了NAS中有用的部分,搞出了random search with weight sharing。怎么联系的?怎么确认的?确认了哪些
  6. 这篇文章的related work值得反复读
  7. "DARTS is particularly commendable in acknowledging its dependence on random initialization, prompting the use multiple runs to select the best architecture. " ——DARTS的搜索结构受随机性影响还挺大,而这居然还是DARTS的一个优点[狗头].... 但不得不说,这种特性能够使得DARTS通过多次run来扩大local的范围从而缓解local最优的问题。
  8. 方法部分主要描述的是random search with weight sharing,这部分是这篇文章的核心(下面均将其简称为RandomNAS),random search with early stopping只在第9页寥寥几行提了一下。
  9. RandomNAS的做法与OneShot、SPOS类似,都是decouple supernet training and architecture search,在supernet training阶段与SPOS一样,都是用random sampling来削弱weights之间的co-adaptation(见本节末尾的示意代码),而OneShot则是用path dropout:" In order to combine random search with weight-sharing, we simply use randomly sampled architectures to train the shared weights. Shared weights are updated by selecting a single architecture for a given minibatch and updating the shared weights by backpropagating through the network with only the edges and operations as indicated by the architecture activated. Hence, the number of architectures used to update the shared weights is equivalent to the total number of minibatch training iterations. "
  10. "Our work is inspired by the result of Bender et al.[1] "——自称受《Understanding One-Shot NAS》的random search的启发,不像它那样要加path dropout等技巧来训练,简化了很多。
  11. 在把supernet训练完以后就和《Understanding One-shot NAS》一样,随机采样一堆然后利用这些训练好的shared weights评估,挑出最好的。
  12. 实验部分提到的3个stage,DARTS是这么做的吗?—— 是的。而且还强调了 stage 1 用的是proxy network,stage 2 用的是proxyless network——" Again, we will refer to the network used in the first stage as the proxy network and the network in the second stage the proxyless network. "
  13. 4.2节开始没咋看,下次看
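
上面第9、11条描述的"每个minibatch随机采样一个arch训练共享权重,训完后再随机采样一堆arch用共享权重评估"的流程,可以用下面这个极简sketch来示意(玩具搜索空间、假数据、超参均为我随意假设的,并非原文实现):

```python
import random
import torch
import torch.nn as nn

class TinySupernet(nn.Module):
    """每层若干candidate op(这里用不同kernel size的conv示意),权重全部共享存放。"""
    def __init__(self, num_layers=4, ch=16):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.ModuleList([nn.Conv2d(ch, ch, k, padding=k // 2) for k in (1, 3, 5)])
            for _ in range(num_layers)
        ])

    def forward(self, x, arch):
        # arch: 每层一个op下标;只有被采样到的op参与前向和反向
        for layer, idx in zip(self.layers, arch):
            x = layer[idx](x)
        return x

def sample_arch(num_layers=4, num_choices=3):
    return [random.randrange(num_choices) for _ in range(num_layers)]

net = TinySupernet()
opt = torch.optim.SGD(net.parameters(), lr=0.05, momentum=0.9)
for step in range(100):                          # supernet训练:每个minibatch换一个随机arch
    x = torch.randn(8, 16, 32, 32)               # 假数据,仅示意
    loss = net(x, sample_arch()).mean()          # 假损失,仅示意
    opt.zero_grad(); loss.backward(); opt.step()

candidates = [sample_arch() for _ in range(64)]  # 搜索:随机采样一堆arch,用共享权重评估后挑最好(评估略)
```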
  1. "We revisit the one-shot Neural Architecture Search (NAS) paradigm and analyze****its advantages over existing NAS approaches."——文章深入分析了one-shot NAS,这是这篇文章的一大重点,我要好好研究一下。
  2. "Existing one-shot method, however, is hard to train and not yet effective on large scale datasets like ImageNet"——现有的one-shot NAS两个缺点:1)难训练(难在哪?); 2)在ImageNet等大数据集上效果不好(比如说?ProxylessNAS是直接在ImageNet上搞的,总结一下还有哪些也是)。
  3. "Our central idea is to construct a simplified supernet, where all architectures are single paths so that weight co-adaption problem is alleviated. Training is performed by uniform path sampling. All architectures (and their weights) are trained fully and equally."——这篇文章的核心想法就是single-path + uniform path sampling。
  4. "It effortlessly supports complex search spaces (e.g., building blocks, channel, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency)."——SPOS的两大优点:1)支持复杂的搜索空间; 2)支持不同的计算成本约束。
  5. 以上部分皆出自SPOS的摘要,不得不说SPOS摘要写的很好,所有要素都出现了,而且简洁且清晰,值得学习。
  6. 这篇文章投过ICLR被拒稿了,OpenReview上的对线这里,还是蛮精彩滴,chair和reviewer主要对SPOS和RandomNAS的相似性导致的novelty不足、correlation弱提出了质疑。知乎上有讨论这次对线的,我觉得chair说的都有道理但是太苛刻了...
  7. "Most weight sharing approaches use a continuous relaxation to parameterize the search space [23,4,12,26,31]."——DARTS摘要中提到continuous relaxation的上下文是"continuous relaxation of the architecture",这里的意思就是把原本离散的结构表示(如one-hot向量)变为连续的结构参数。虽然Proxyless实现了可微分的采样且没有使用混合算子的方式进行联合训练,但其和DARTS一样引入了结构参数因此也称得上是continuous relaxation。
  8. "Formally, space A is relaxed to A(θ), where θ denotes the continuous parameters that represent the distribution of the architectures in the space. Note that the new space subsumes the original one, A⊆A(θ). An architecture sampled from A(θ) could be invalid in A."——这里解释了什么叫continous relaxation,和我上一点解释的一样,就是指通过引入了结构参数将原本离散的网络结构空间变为网络结构分布。
  9. 引入了结构参数的好处就是可以使用基于梯度的优化方法来实现搜索,要么如式(3)用single-level optimisation,要么如式(4)用bi-level optimisation。—— "An advantage of the continuous search space is that gradient-based methods [12,4,23,22,26,31] is feasible."
  10. "There are two issues in this formulation. First, the weights in the supernet are deeply coupled. It is unclear why inherited weights for a specific architecture are still effective. Second, joint optimization introduces further coupling between the architecture parameters and supernet weights. The greedy nature of the gradient-based methods inevitably introduces bias during optimization and could easily mislead the architecture search."——这种构建supernet然后对结构参数和权重参数进行联合优化的搜索方式存在两大问题:1)supernet中的权重参数deeply coupled (估计和co-adaption是一个意思),这使得PE时使用sharing weights预测给定arch的acc的rank的准确度存疑; 2)joint optimization会导致权重参数和结构参数的耦合(基于梯度的方法天然具有贪心的特性,因此会偏好前期表现良好的算子,或者说在exploitation和exploration二者之间偏向前者),其影响也会反映在不同局部的权重参数上,即不同局部的权重参数的成熟度不同,而这种耦合带来的马太效应进一步加剧了这种成熟度的不同。
  11. "The one-shot paradigm [3,2] (SMASH、One-Shot) alleviates the second issue. It defines the supernet and performs weight inheritance in a similar way. However, there is no architecture relaxation. The architecture search problem is decoupled from the supernet training and addressed in a separate step. Thus, it is sequential. "——OneShot、RandomNAS、SPOS、FairNAS、OFA都采用了这种decoupling search and training的模式。
  12. "It combines the merits of both nested and joint optimization approaches above."——这里说的nested究竟啥意思?为何说结合了这些方法的优点?
  13. "we hope this work will renew interest in the one-shot paradigm, which combines the merits of both via sequential optimization."——这里作者说希望能够重新唤起大家对decouple search and training模式的认可。他确实做到了,韩松组从ProxylessNAS到OFA确实体现了decouple模式的进一步发展。
  14. 讲了这么多现有的方法的缺点,SPOS方法是如何解决的?——我认为是:decouple search and training + single path search space + uniform sampling。前者是受OneShot启发,single path和sampling可视为延续ProxylessNAS的链式搜索空间和子网络采样,uniform sampling是decouple后的自然结果(没有可学习的分布可用了)。
  15. 第四页的开头两段从supernet和joint optimization两方面,具体分析了引入结构参数的one-shot NAS方法存在的问题。
  16. "This problem is in analogy to the 'dilemma of exploitation and exploration' problem in reinforcement learning. "——这里提到joint optimization带来的马太效应类似于强化学习中“探索-利用”困境。但是dilemma of exploitation and exploration应该是NAS中普遍存在的,这句话的意思似乎是SPOS和以往的one-shot NAS相比极大地削弱了这一困境吗?如果是,如何做到的?——先研究完EA再回答这个问题。——从理论上看的话,这个问题可以上升到对优化方式的对比——RL-、EA-、Gradient-based这三者,哪种方式的探索-利用困境最弱?我个人感觉是EA,因为其它两种都有用到梯度下降,而梯度下降自带贪心特性,偏向利用,而EA是可以通过控制超参来调整探索-利用权衡的。也就是说,对所谓的探索-利用困境应对得好不好,就在于能否给出调节权衡的方法。RL-和EA-偏向利用,EA则是给出了调节的超参,所以EA做得更好。——以上纯属瞎想,我得全面理解这三种算法和exploitation and exploration trade-off后再回答
  17. "Some works augment the loss function L train in Eq. (3) with soft loss terms that consider the architecture latency [4,23,26,22]. However, it is hard, if not impossible, to guarantee a hard constraint like Eq. (5)"——这里说ProxylessNAS、FBNet、SNAS等方法都是通过一种soft loss来实现计算成本约束的,而这样是无法实现hard constraint的,这就让人好奇SPOS在实现计算成本约束方面会有什么高明之处?——与DNAS在连续的search space中使用gradient descent不同,EA的优化过程是离散的,因此可以直接丢弃不满足约束的arch。
  18. SPOS和OneShot都实现了decouple training and search,二者区别在哪?—— 1)对于supernet training阶段,同是为了削弱co-adaptation,OneShot采用path dropout策略,SPOS则是uniform sampling + single path,SPOS把这称作stochastic supernet。我认为但凡使用了supernet + sampling subnet都可以称之为stochastic supernet
  19. 我认为stochastic supernet也可以形容所有使用了differential sampling的one-shot NAS,包括SNAS、ProxylessNAS、FBNet等。SPOS和它们的区别在于,SPOS用uniform distribution来sampling,而这些方法是要learn这个distribution。但是这篇文章里也说了,使用固定的分布比学习来的分布更好这一点目前并没有理论保证,前者表现比后者更好或许只是因为后者更难以优化——"Comprehensive experiments in Sec. 4 show that our approach achieves better results than the SOTA methods. Note that there is no such theoretical guarantee that using a fixed prior distribution is inherently better than optimizing the distribution during training. Our better result likely indicates that the joint optimization in Eq.(3) is too difficult for the existing optimization techniques."
  20. SPOS和RandomNAS如何对比?似乎第一阶段是一毛一样的?——"This paper [10] achieved competitive results to several SOTA NAS approaches on CIFAR-10, but didn't verify the method on large dataset ImageNet. It didn't prove the effectiveness of single path sampling compared to the 'path dropout' strategy and analyze the correlation of the supernet performance and the final evaluation performance."——我的理解是:1)搜索空间不同,一个是DAG,一个是single-path; 2)第二阶段不同,SPOS第二阶段用的是EA,RandomNAS似乎是采样一堆挑最好的; 3)SPOS可以直接在大规模数据集ImageNet上搞事情。
  21. "To reduce the co-adaptation between node weights, we propose a supernet structure that each architecture is a single path, as shown in Fig. 3 (a). "——这里说用single path缓解co-adpatation我能理解,supernet里没有DAG那种多edge相加,也就避免了不同edge的weights之间的co-adpatation。另外,uniform sampling也能进一步削弱co-adaptation,因为不用DARTS那种混合算子而是sampling path,这样能够避免了同一edge上不同operator的weights之间的co-adapatation。但是SPOS中所采用的single path搜索空间和ProxylessNAS、FBNet的layer-wise search space其实是一个思路。我觉得SPOS的可贵之处是在于十分明确地指出这种single-path search space的可以削弱co-adaptation between weights
  22. Fig.1比较了OneShot中所采用的path dropout策略和SPOS的single path策略,结论是后者强太多。此外,二者的联系是——"Our single path strategy corresponds to using drop rate 1."
  23. "The prior distribution Γ(A) is important. In this work, we empirically find that uniform sampling is good."——但最后还是用了比这更复杂一点的uniform constraint sampling——"We also experimented with a variant that samples the architectures uniformly according to their constraints, named uniform constraint sampling. Specifically, we randomly choose a range, and then sample the architecture repeatedly until the FLOPs of sampled architecture falls in the range ... In this work, we find the uniform constraint sampling method is slightly better. So we use it by default in this paper."
  24. search space似乎还是没写清楚?究竟何为“rich search space”?——Section 4 写清楚了,从单独搜choice block和单独搜channel number到联合搜索空间,总共有四种搜索空间。
  25. 还是不清楚怎么EA的。1)有无剔除种群中弱/旧的个体?——并没有,那是《Regularized Evolution》那篇的做法,这篇是每次迭代直接使用Topk里的网络结构的变异和杂交作为新的种群。2)crossover具体来说是个什么样的操作?——在Topk里随机挑选两个,然后像交换染色体那样交换各层block的index。
  26. "The evolutionary algorithm is flexible in dealing with different constraints inEq. (5), because the mutation and crossover processes can be directly controlled to generate proper candidates to satisfy the constraints."——为何说变异杂交可以直接控制产生满足计算成本约束的architecuture candidates?——不符合的直接pass掉。究其根本原因在于EA的优化方式就是离散地进行,这与DNAS使用gradient descent在连续化的搜索空间有着根本的差异。
  27. "Previous RL-based [21] and gradient-based [4,23,22] methods design tricky rewards or loss functions to deal with such constraints. For example, [23] uses a loss function CE(a,wa)·αlog(LAT(a))^β to balance the accuracy and the latency. It is hard to tune the hyperparameter β to satisfy a hard constraint like Eq. (5)."——如果不用EA只能搞Soft constraint,调参费劲儿。
  28. "Before the inference of an architecture, the statistics of all the Batch Normalization (BN)[9] operations are recalculated on a random subset of training data (20000 images on ImageNet). It takes a few seconds. This is because the BN statistics from the supernet are usually not applicable to the candidate nets."——是否所有one-shot NAS都需要这样变换BN?还是只有decoupling search and training的one-shot NAS需要?
  29. 好好研究Table 1:1)看看这里总结的是否有自己的知识盲区; 2)看看这里总结的是否有与自己认知不符的地方?
  30. "Table 1 performs a comprehensive comparison of our approach against previous weight sharing approaches on various aspects. Ours is the easiest to train, occupies the smallest memory, best satisfies the architecture (latency) constraint, and easily supports large datasets. Extensive results in Sec.4 verify that our approach is the state-of-the-art."——研究Table 1,从实验结果算法原理上分析这些SOTA结果。
  31. Fig.2似乎有两个问题:1)为何random search也会随着iter的进行而上升?这里random search如何操作的?2)random search和EA的表现似乎差异不大?如何评价这点?
  32. Table 3的倒数三行的结果也对SPOS搜索的意义构成了质疑。
  33. Table 5牛逼啊!把ProxylessNAS和FBNet都重新run了一下,并且在和这二者相同的搜索空间上分别执行SPOS,这可以说是十分公平的对比了!acc对比结果显示,SPOS稍好
  34. "In contrast, previous methods[23,4] have to train multiple supernets under various constraints. According toTable 7, searching is much cheaper than supernet training." ——一次训练,多种约束下的搜索,我甚至觉得SPOS是OFA的灵感来源!OFA比SPOS设计了更大的搜索空间以适应更多样化的约束。
  35. Table 7对比了SPOS与ProxylessNAS、FBNet的search cost,结果显示SPOS在memory cost、training time、search time方面优势很大,假如考虑不同约束下的多次搜索的话,SPOS的优势会大得更多——"Note Table 7 only compares a single run. In practice, our approach is more advantageous and more convenient to use when multiple searches are needed."
  36. Section 4的correlation analysis那一部分读不大懂,结合着《NAS evaluation is frustratingly hard》和《Evaluating the search phase of neural architecture search》一起读试试。
  37. 搜索空间是基于ShuffleNetV2的。
  38. 关于混合精度搜索这一段,需要先读这两篇文章:《PACT: Parameterized clipping activation for quantized neural networks》和《Mixed precision quantization of convnets via differentiable neural architecture search》
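
把第23条的uniform constraint sampling和第25、26条的"EA里直接丢弃不满足FLOPs约束的候选"放在一起,大致是下面这个意思(FLOPs查表、层数、choice数都是随手假设的,用supernet评估的部分省略,并非SPOS原实现):

```python
import random

NUM_LAYERS, NUM_CHOICES, MAX_FLOPS = 20, 4, 250.0
flops_table = [[random.uniform(5, 20) for _ in range(NUM_CHOICES)]
               for _ in range(NUM_LAYERS)]                  # 假设:每层每个choice block的FLOPs查表

def flops(arch):
    return sum(flops_table[i][c] for i, c in enumerate(arch))

def sample_valid(gen):
    # 硬约束的实现就是"拒绝采样":反复生成直到满足约束,不合格的直接丢弃
    while True:
        arch = gen()
        if flops(arch) <= MAX_FLOPS:
            return arch

def uniform_sample():
    return sample_valid(lambda: [random.randrange(NUM_CHOICES) for _ in range(NUM_LAYERS)])

def mutate(parent, p=0.1):
    return sample_valid(lambda: [random.randrange(NUM_CHOICES) if random.random() < p else c
                                 for c in parent])

def crossover(a, b):
    # 类似交换染色体:逐层从两个parent中随机取block下标
    return sample_valid(lambda: [random.choice(pair) for pair in zip(a, b)])

population = [uniform_sample() for _ in range(50)]   # 之后:用supernet权重评估、取topk、变异+杂交迭代(略)
```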
  1. "Despite being widely utilized due to searching efficiency, weight sharing approaches are roughly built on empirical experiments instead of solid theoretical ground. Several fundamental issues remain to be addressed. Namely, a) Why is there a large gap between the range of supernet predicted accuracies and that of 'ground-truth' ones by stand-alone training from scratch [2,1]? b) How to build a good evaluator that neither overestimates nor underestimates subnetworks? c) Why does the weight-sharing mechanism work, if under some conditions"——这三个问题可以帮我导读这篇文章。第二页所列的三个contribution分别回答了这三个问题
  2. "Generally speaking, the weight-sharing approaches all involve training a supernet that incorporates many candidate subnetworks. They can be roughly classified into two categories: those who couple searching and training within one stage [32,26,4,39,46] and others who decouple them into two stages, where the trained supernet is treated as an evaluator for final searching [2,1,12,29]."——把现有的one-shot NAS分为两类:耦合与解耦。值得注意的是,SMASH也被算作解耦类型,Single-Path NAS被算作耦合类型。考虑到这两个方法各自出现的时间,耦合 ---》解耦的的发展规律也变得没那么严格成立,我想这主要是因为SPOS明确指出了解耦的好处并应用到大规模数据集上以后才使得解耦的路线得到重视。
  3. "In this paper, we attempt to answer the above three questions for two-stage weight-sharing approaches."——这篇文章搞的是 decoupled one-shot NAS 。
  4. "Our analysis and experiments are conducted in a widely used search space as in [4,46,12,39]."——这篇文章使用的是layer-wise的搜索空间,自称和ProxylessNAS、FBNet、SPOS、Single-Path NAS用的是同一种搜索空间,我得去核对一下这些搜索空间是否相同
  5. "our fair single-path sampling is memory-friendly, and its GPU costs can also be linearly amortized to the number of different hardware settings (see Fig. 1)."——这就是decoupled (two-stage) one-shot NAS的好处,第一阶段supernet training的cost被均摊到了不同的taget hardware上了。
  6. "We then incorporate our supernet with an EA-based multi-objective searching framework."——又是个使用EA来做第二阶段的,和SPOS一样。
  7. "Those who have better initial performance are more likely to be sampled or to maintain higher coefficients, resulting in a suboptimal or an even worse solution. For instance, architectures from DARTS usually contain an excessive numberof skip connections [48,23], which damage the outcome performance. Therefore, the prior-learning DARTS is biased as per skip connections, while a random approach doesn't suffer [22]. DARTS overrated 'bad' models (jammed by skip connections), meantime many other good candidates are depreciated."——如此评价DARTS。
  8. 1)对SMASH的评价——"For a set of randomly sampled models, a correlation between predicted validation errors and ground-truth exists, but it has a large discrepancy between the ranges (40%-90% vs. 25%-30% on CIFAR-100 [21])."; 2) 对OneShot的评价——"there is also an evident performance gap of submodels with inherited weights compared with their ground-truth (30%-90% vs. 92%-94.5% on CIFAR-10 [21])."。但是我不觉得range差别大是个大问题,只要rank的correlation够好就行了啊。但是这个差别很大意味着对acc的predict能力其实很弱。
  9. SMASH和OneShot对acc绝对值预测能力弱,但是却能做到对acc的rank预测能力较好,这是一个值得深思的问题,说不定这里边就蕴含着提高rank correlation的奥秘。OneShot中关于KL散度的那个实验似乎对解释这个有所启发。因为OneShot能够自动优化有用的部分,因此采样出的subnets其实已经失去了很大一部分有用的(都是前期厉害的?),然后就会使得差别特别大。但是为何能够一定程度地保持rank?
  10. "SPOS [12] uniformly samples a single-path from the supernet during the training so that all architectures can be optimized simultaneously. However, we find it offers limited fairness and its supernet performance is somewhat restricted."——如此评价SPOS,之后再回头看看是啥意思。
  11. "However, Expectation Fairness is not enough. For example, we can randomly sample each model and keep it training for k times, then switch to another. This procedure also meets Definition 2, but it's very unstable to train."——这里举例说明为何expectation fairness为何不合理,但是我看不懂。
  12. expectation fairness有两大问题:1)随着采样次数n增大,方差n(m-1)/m^2增大,不同算子的采样次数差别越来越大。但即便差别越来越大,但是差别的比例相对于采样的总次数来说还是小数而且n越大比例越小(如采样10000次占1/40,采样100000次占1/80)。所以我还是不能理解为何这小比例的次数会加强rank correlation的降低。 2)ordering issue,先采样到的choices会影响到后采样的choices ——"Even in SPOS [12] with uniform sampling, there is a latent ordering issue. For a sequence of choices (M1,M2,M3), it implies an inherent training order M1→M2→M3. Since each model is usually trained by back-propagation, the trained weights of M1 are immediately updated to the supernet and those of M2 are renewed next while carrying the effect of the former update, so for M3. A simple permutation of (M1,M2,M3) does comply with Expectation Fairness but yields different results. Besides, if the learning rate lr is changed within the sequence, the situation thus becomes even more complicated"。说实话,我不理解第二点的ordering issue为何会加强rank correlation的降低
  13. 为了解决expectation fairness遗留的两个问题,FairNAS使用两个关键技术:1)uniform sampling without replacement; 2)一个step里不放回采样得到的数个subnet反向传播的梯度累积起来再更新。第一点保证strict fairness,第二点避免了ordering issue。
  14. "Here we adopt a searching algorithm from an NSGA-II [9] based method called MoreMNAS [7] with a small variation by using Proximal Policy Optimization[37] as the default reinforcing algorithm"——FairNAS的search stage用了MoreNAS的搜索算法—— Multi-Objective Reinforced Evolution in Mobile Neural Architecture Search ,有空读一下。
  15. "Apart from evaluation, a recent work indicates that one-shot supernet weights can also specialize for submodels [3]. We emphasize this is an important advantage that other methods couldn't provide."——似乎是说OFA中one-shot supernet的权重直接被继承的话效果也很好这件事——"Without retraining, OFA achieves 76.0% top1 accuracy on ImageNet, which is 0.8% higher than MobileNetV3-Large while maintaining similar latency."——**但是不知道FairNAS有没有做类似的实验——**从Fig.1.Left的右上图可以看出,one-shot acc和stand-alone acc之间的差距是几个百分点,而OFAzhong,直接继承是零点几个百分点,因此在这点上FairNAS和OFA完全没法比。为啥会这样呢
  16. 第八页第一段的搜索空间有点不清,搞清楚一下各个方法究竟是搜几层。FBNet写得很明确,是22层,但是ProxylessNAS有点含糊。FairNAS这里声称搜的层数和MBConv的层数对不上。
  17. "Figure 4 exhibits the resulting FairNAS-A, B and C models, which are sampled from our Pareto front (Section 3.2) with equal distance to meet different requirements. The result is shown in Table 2."——学习FairNAS的搜索算法,理解这里说的sampleequal distance
  18. Table 5 显示FairNAS的Kendall Tau 高达0.9487,很厉害,但是我有个困惑,为啥rank correlation强SPOS这么多但是FairNAS-C只比SPOS好0.3%这么一丢丢。可能是这样的,排列正确的比例P=(t+1)/2,看起来FairNAS和SPOS的t差很多,是0.9487和0.6153,但是算P的话对比是0.97和0.80,这下差别其实不大了。而在Sec 5.3中显示的实验则是好上0.7%,这样的话更说的通了。
  19. Table 5显示无论用不用BatchNorm recalibration得到的Kendall Tau都是一样的,这是OneShot和SPOS做不到的,因此可以省下recalibration的时间。我觉得这一点倒是文章中提出的fairness mechanism带来的实打实的好处。
  20. Section 5.2,搞定FairNAS的搜索算法后再回来看。
  21. Fig.6 显示,同一layer内不同choice block产生的feature map高度相似,这个特点可以帮助stabilize the whole training process 。
  22. 尝试总结一下FairNAS:1)提出了一个fairness视角来分析当前的one-shot NAS,要么是压根没有fairness(one-stage),要么是只有expectation fairness(two-stage); 2)提出了strict fairness的概念,并使用不放回的均匀采样来实现,同时使用累积同一step内采样的subnet的梯度来更新supernet这一策略解决ordering issue。3)FairNAS虽然把rank correlation提到了一个很高的程度,但是和SPOS相比搜索到的网络acc提高似乎不大,仅仅零点几个百分点,但是省下了BatchNorm recalibration这个好处是实打实的。
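
第13条的strict fairness(一个step内对每层的m个choice不放回采样出m个子网,梯度累积后一次更新)大致可以写成下面这样;这里假设手头有一个接口为 net(x, arch) 的supernet(比如前面RandomNAS一节示意的TinySupernet),并非FairNAS原实现:

```python
import random
import torch

def fairnas_step(net, opt, x, num_layers=4, num_choices=3):
    # 每层给出choice的一个随机排列 => 组成m个子网,每个choice在本step内恰好被训练一次(不放回采样)
    perms = [random.sample(range(num_choices), num_choices) for _ in range(num_layers)]
    opt.zero_grad()
    for i in range(num_choices):
        arch = [perm[i] for perm in perms]
        loss = net(x, arch).mean() / num_choices   # 假损失,仅示意;除以m做平均
        loss.backward()                            # 只累积梯度,不立刻step
    opt.step()                                     # m个子网的梯度一起更新,避免ordering issue
```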
  1. 从摘要中导读这篇文章:1)"In this work, we propose to train a once-for-all (OFA) network that supports diverse architectural settings by decoupling training and searching, to reduce the cost. We can quickly get a specialized sub-network by selecting from the OFA network without additional training."。2)"To efficiently train OFA networks, we also propose a novel progressive shrinking algorithm, a generalized pruning method that ..."。
  2. "Therefore, with 5 units, we have roughly ((3×3)^2+ (3×3)^3+ (3×3)^4)^5≈2×10^19 different neural network architectures and each of them can be used under 25 different input resolutions."——**1)**不考虑resolution仅考虑w、d、k的情况下搜索空间的大小是2×10^19(Section 3.2),而ProxylessNAS的是6^21=10^16。2)这种在一个unit内堆叠相同layer的搜索空间和MnasNet在一个block内堆叠layer的思路是一样的,但是ProxylessNAS则是每个layer都不同。
  3. "Following the common practice of many CNN models (He et al., 2016; Sandler et al., 2018; Huang et al., 2017), we divide a CNN model into a sequence of units with gradually reduced feature map size and increased channel numbers. Each unit consists of a sequence of layers where only the first layer has stride 2 if the feature map size decreases (Sandler et al., 2018). All the other layers in the units have stride 1."——这个common practice也正是MnasNet、ProxylessNAS起众多NAS方法所采取的MobileNetV2-based search space。具体总结是咋样的(几个stage等等,OFA的分辨率怎么降)。
  4. "It is computationally prohibitive to enumerate all sub-networks to get the exact gradient in each update step, while randomly sampling a few sub-networks in each step will lead to significant accuracy drops. The challenge is that different sub-networks are interfering with each other, making the training process of the whole once-for-all network inefficient. To address this challenge, we propose a progressive shrinking algorithm for training the once-for-all network. "——这里的challenge就是提出PS的motivation,但是我无法理解为何randomly sampling会导致significant accuracy drops?SPOS不都是这么做的吗?——像SPOS这样直接sample会带来weights co-adptation,SPOS以acc的rank correlation较低而闻名。事实上Table 1对比了使用和不使用PS,以及使用PS后retrain和不retrain,Figure 7也做了相关比较,结果显示PS取得了这里承诺的效果。
  5. "Another naive training approach is to sample a few sub-networks in each update step rather than enumerate all of them, which does not have the issue of prohibitive cost. However, with such a large number of sub-networks that share weights, thus interfere with each other, we find it suffers from significant accuracy drop."——这里的意思应该是SPOS这种sampling and training的方式会带来significant accuracy drop,我觉得应该是weights co-adapatation造成的。但问题是,为何PS可以缓解这个问题?后面几点回答了原因。
  6. "Instead of directly optimizing the once-for-all network from scratch, we propose to first train the largest neural network with maximum depth, width, and kernel size, then progressively fine-tune the once-for-all network to support smaller sub-networks that share weights with the larger ones."——后期并不是只训练smaller sub-networks,而是逐渐解封不同维度来纳入小网络,同时训练采样到的大网络和小网络。
  7. 附录B描述了PS的细节:1)全程保持不同分辨率的输入。2)固定D、W,采样K ——》固定W,采样D、K ——》采样W、D、K。
  8. PS的过程中搜索空间不断向下扩张,即在保留大网络存在的同时逐渐将小网络纳入搜索空间的过程。
  9. "The resolution is elastic throughout the whole training process, which is implemented by sampling different image sizes for each batch of training data."——对于PS的全程保持不同分辨率的输入这一点我认为不妥,但实验结果显示目前的weights的可继承性非常好。
  10. "Compared to the naive approach, PS prevents small sub-networks from interfering large sub-networks, since large sub-networks are already well-trained when the once-for-all network is fine-tuned to support small sub-networks."——这里说PS能够防止weights co-adaptation的原因是大网络已经训练好了所以不会受小网络的影响。我认为是因为progressive shirink提供了一个有序的progressive weights freezing机制。1)**有序的:**pruning的原则是遵循着pruning前后共享程度最大的原则(depth方面保留前D个,width方面选L1-norm最大的filter留下),相比于SPOS那种random sampling的随机选择filter,这样按照重要性保留下来的权重其被fine-tune时更改的程度会大大减小,这样就削弱了大网络和小网络之间的weights co-adaptation。2)**freezing:**在PS的过程中,被prune的weights其实就相当于被freeze起来了,这部分的weights就避免了co-adaptation。3)避免了同样大小的网络之间的weights co-adaptation:像SPOS那样random sampling的话,同样拓扑结构的网络可以有着不同的采样(如D=3时可采样前三个block也可以采样后三个block,对于W也是如此),很显然,训练这些同样结构的子网这会大大加剧weights co-adaptation。
  11. "As such, it provides better initialization by selecting the most important weights of larger sub-networks, and the opportunity to distill smaller sub-networks, which greatly improves the training efficiency."——PS的好处。但是这里的distill是啥意思?
  12. "We also use the knowledge distillation technique after training the largest neural network."——这里的知识蒸馏是如何操作的
  13. "The weights of centered sub-kernels may need to have different distribution or magnitude as different roles. Forcing them to be the same degrades the performance of some sub-networks. Therefore, we introduce kernel transformation matrices when sharing the kernel weights."——这里的kernel transformation是如何操作的?——1)就拿一个矩阵做个矩阵乘法,因此所需的额外参数量是25×25 + 9×9 = 706。2)同是superkernel,这个kernel transformation就是OFA胜过Single-Path NAS的地方。
  14. OFA的search阶段可以概括为:acc predictor + latency predictor + EA 。
  15. "On the ImageNet mobile setting (less than 600M MACs)"——原来mobile的标准就是<600M MACs。
  16. "OFA achieves a new SOTA 80.0% top1 accuracy with 595M MACs (Figure 2). To the best of our knowledge, this is the first time that the SOTA ImageNet top1 accuracy reaches 80% under the mobile setting."——关注一下后续方法与OFA的PK情况(如RegNet)。
  17. "Early NAS methods (Zoph et al., 2018; Real et al., 2019; Cai et al., 2018b) search for high-accuracy architectures without taking hardware efficiency into consideration."——Early NAS methods这个称呼可以。
  18. 这个progressive shrink到底是怎么搞得?这篇文章提出的progressive shrink基于一个假设:对small subnetwork的fine-tune不会影响已经训练好的大网络(Figure 7证实了这个假设的合理性)。基于这一假设,整个OFA网络就沿着K、D、W的顺序一路prune下去,同时全程都采用不同的Resolution的image作为输入(究竟是一个batch里不同的R还是不同batch不同R?)。
  19. OFA和MobileNetV2的对比说明:"using the same model for different deployment scenarios with only the width multiplier modified has a limited impact on efficiency improvement:the accuracy drops quickly as the latency constraint gets tighter"
  20. 附录A提到OFA采用的acc predictor在test set上的RMSE仅有0.21%。
  21. 摘要里说相同acc下比EfficientNet是2.6×faster,那和RegNet比如何?
  22. 为何MnasNet的arithmetic intensity低于OFA?——个人推测:1)OFA还搜索了分辨率。2)OFA没搞SE,SE费内存。
  23. Roofline Model,计算强度的概念很有用。 Figure 13中几个点没位于roofline上是因为——“Roofline 模型讲的是程序在计算平台的算力和带宽这两个指标限制下,所能达到的理论性能上界,而不是实际达到的性能,因为实际计算过程中还有除算力和带宽之外的其他重要因素,它们也会影响模型的实际性能,这是 Roofline Model 未考虑到的。例如矩阵乘法,会因为 cache 大小的限制、GEMM 实现的优劣等其他限制,导致你几乎无法达到 Roofline 模型所定义的边界(屋顶)。
  24. 如何实现的latency constraint?是给定target直接优化,还是直接利用predictor求出Pareto Optimal Solutions,然后按照需求挑选(e.g., DPPNet)? ——学完EA再回答。
  25. OFA即便不retrain也能有较高的acc,这一点或许和SNAS的后续工作DSNAS有联系,研究一下。
  26. 对比OFA和SPOS。看完二者源码后再搞这一点。
  27. "From this perspective, progressive shrinking can be viewed as a generalized network pruning method that shrinks multiple dimensions (depth, width, kernel size, and resolution) of the full network rather than only the width dimension. Besides, it targets on maintaining the accuracy of all sub-networks rather than a single pruned network."——这里说OFA还是一种广义的pruning方法。对比OFA和剪枝:1)许多pruning都集中在width,有的还会涉及depth,考虑resolution的已经很少了(Accelerate CNNs from 3 Dimensions),像OFA这样还考虑kernel size的就更少了。2)OFA的目标是保持所有子网络的acc(不是acc rank而是acc本身!),pruning的目标是恢复a single pruned network的acc。3)OFA具备once for all的能力,且是hardware-aware的。其实第2和3点应该来说是NAS项对于pruning的普遍优势。4)Figure 4对比得很形象。
  28. 尝试总结一下OFA:1)ProxylessNAS的github主页上称OFA是其下一代,但我认为把OFA视为SPOS的进阶反倒是更加符合逻辑的。或许从这也可以看出,decouple search and training是被这个团队更加认可的思路。2)
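
第13条的kernel transformation可以用下面的sketch来理解:7x7权重取中心5x5后乘一个可学习的25x25矩阵,5x5取中心3x3后再乘一个9x9矩阵,额外参数量正好是25*25+9*9=706。初始化方式、具体用法均为我个人理解的假设,并非OFA原实现:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticKernel(nn.Module):
    def __init__(self, out_ch=8, in_ch=8):
        super().__init__()
        self.w7 = nn.Parameter(torch.randn(out_ch, in_ch, 7, 7) * 0.01)
        self.trans5 = nn.Parameter(torch.eye(25))     # 7x7 -> 5x5 的变换矩阵
        self.trans3 = nn.Parameter(torch.eye(9))      # 5x5 -> 3x3 的变换矩阵

    def get_weight(self, k):
        w = self.w7
        if k == 7:
            return w
        w = w[:, :, 1:6, 1:6]                                     # 取中心5x5
        w = (w.reshape(-1, 25) @ self.trans5).reshape(w.shape)    # 线性变换,允许子kernel有不同的分布
        if k == 5:
            return w
        w = w[:, :, 1:4, 1:4]                                     # 取中心3x3
        return (w.reshape(-1, 9) @ self.trans3).reshape(w.shape)

ek = ElasticKernel()
x = torch.randn(2, 8, 16, 16)
outs = [F.conv2d(x, ek.get_weight(k), padding=k // 2) for k in (7, 5, 3)]
```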
  1. 核心思想就是把supernet影分身成多个sub-supernet,这样可以提高rank correlation,Figure 3说明了这一点。
  2. Section 4.3把Few-shot和OFA和ProxylessNAS结合的细节没说,需要去看看代码,然鹅并没有开源。
  1. "In our work, we focus on three factors of the problem: a) the search space. b) The loss function L(a,wa) that considers actual latency. c) An efficient search algorithm."——这段话可以帮我领读这篇文章。
  2. "For example, NasNet-A [31] has a similar FLOP count as MobileNetV1 [6], but its complicated and fragmented cell-level structure is not hardware friendly, so the actual latency is slower."——cell-based的搜索空间惨遭吐槽。
  3. "we can directly train the architecture distribution using gradient-based optimization such as SGD."——是否所有的NAS都可以归结为学习结构分布?探究一下这个问题。先搞清楚EA就能总结这个问题,因为RL-based和one-stage one-shot NAS都可以这么总结。
  4. "To estimate the latency of an architecture, we measure the latency of each operator in the search space and use a lookup table model to compute the overall latency by adding up the latency of each operator."——由此看来,整个网络的latency是可以由operator的latency线性叠加而来,因此为每个operator测量好latency以后弄成lookup table就行,ProxylessNAS也是这么做的。但似乎只有链式结构满足这种latency线性叠加的规律,而cell-based则不行,对吗为什么?——cell-based的结构同时会有多个支路并行,若把不同支路上并行的op的时延直接相加作为这一cell的总时延,对于GPU来说显然不对,但对于CPU来说似乎可行?但是一般来说,cell-based的NAS都是最后通过控制stack的数量来控制时延的,所以一般没必要对cell施加时延约束。
  5. "This assumes that on the target processor, the runtime of each operator is independent of other operators. The assumption is valid for many mobile CPUs and DSPs, where operators are computed sequentially one by one."——那么针对GPU实现latency约束的话,这个假设还适用吗?我感觉可以诶。。。
  6. "More importantly, as will be explained in section 3.3, using the lookup table model makes the latency term in the loss function (2) differentiable with respect to layer-wise block choices, and this allows us to use gradient-based optimization to solve problem (1)."——使用lookup table搞latency约束的另一个好处是可以使得latency loss相对于结构参数可微。SNAS的资源约束其实采用的也是类似于lookup table的方式(写进每个operator类的属性)。
  7. Mixed precision quantization of convnets via differentiable neural architecture search。听起来是结合了NAS和量化压缩,挺有趣的,可以学习一下啊。
  8. 第3.1节描述了搜索空间的设计,对比一下和MNasNet的搜索空间
  9. "In addition, we can choose to use group convolution for the first and the last 1x1 convolution to reduce the computation complexity."——丧心病狂,连1×1卷积都用上group conv了
  10. 这里的(2)式和MnasNet的(2)式是等价的,为何MnasNet中的T就能起到target的作用?我当时猜想MnasNet的(2)中起到target作用的是参数w,结合这点好好探究一下这个问题。
  11. "We first represent the search space by a stochastic super net. The super net has the same macro-architecture as described in Table 1, and each layer contains 9 parallel blocks as described in Table 2."——高度概括了这篇文章的search space。
  12. "Our search process is now equivalent to training the stochastic super net. During the training, we compute ∂L/∂w to train each operator's weight in the super net. This is no different from training an ordinary ConvNet. After operators get trained, different operators can have a different contribution to the accuracy and the efficiency of the overall network. Therefore, we compute∂L/∂θ to update the sampling probability P_θ for each operator. This step selects operators with better accuracy and lower latency and suppresses the opposite ones. After the super net training finishes, we can then obtain the optimal architectures by sampling from the architecture distribution P_θ."——以stochastic supernet的视角高度概括了FBNet的搜索过程:1)训练stochatic supernet与训练ordinary convnet没有区别; 2)ops得到训练后更新ops的分布。——这里说的第一点很扯淡的。首先这个Gumbel-Softmax的temperature参数一开始是比较高的,这时Gumbel-Softmax并无法近似Gumbel-Max。其次,Gumbel-Softmax毕竟还是softmax,无论前向还是反向,其计算图都与DARTS相同,因此计算成本和DARTS这种是一样的。值得一起比较的是GDAS和SNAS,其前向过程用Gumbel-Max,但是反向用Gumbel-Softmax,因此节约了很大的成本。SNAS看起来应该和FBNet一样无论前向还是反向都是涉及全部路径,但是其附录B的Figure 6显示其反向只涉及单路径,这值得我仔细检查
  13. 式(1)是针对search space而言,式(7)是针对supernet 而言——"represent the search space by a stochastic super net"。
  14. 在bi/single-level optimization、temperature schedule、整个supernet参与前向和反向方面,这篇文章和SNAS是否相同?看看代码。——1)与SNAS的single-level不同,DNAS搞的是bi-level:"w_a is trained on 80% of ImageNet training set using SGD with momentum. The architecture distribution parameter θ is trained on the rest 20% of ImageNet training set with Adam optimizer."——SNAS只用一个单独的训练集,前向一下再反向一下,同时得到结构参数和权重参数的梯度,然后更新,但是DNAS和DARTS一样分了训练集和验证集,分别负责权重参数和结构参数。2)temperature schedule : "we use an exponentially decaying temperature.",SNAS则是均匀衰减。3)前向和反向过程:应该和DARTS相同,涉及所有path,也就是说计算成本和DARTS一样是supernet级别的。但是不确定SNAS是否如此,其Figure 6尚存疑问
  15. DNAS在搜索阶段完成后derive architecture的方式与SNAS有较大的不同,SNAS是argmax,DNAS是采样——"After the super net training finishes, we can then obtain the optimal architectures by sampling from the architecture distribution P_θ.","After the search finishes, we sample several architectures from the trained distribution P_θ, and train them from scratch.","At the end of the super net training, we sample 6 architectures from the final distribution to be trained from scratch."——这样的话搜索成本岂不是还得考虑所有被sample的arch?实验里报告的search cost岂不是很有水分?
  16. "To reduce the training time, we randomly choose 100 classes from the original 1000 classes to train the stochastic super net." —— 其它NAS方法是否有如此设置proxy task
  17. "1.33x faster than DARTS"——DARTS的search cost从其原文看来明明就只有96个GPU hours,但DNAS是216,这里究竟怎么算出快1.33倍的
  18. "As shown in Figure 5, the upper three operators are faster on iPhone X, therefore they are automatically adopted in FBNet-iPhoneX. The lower three operators are significantly faster on Samsung S8, and they are also automaticallya dopted in FBNet-S8. "——不同算子的运行时间对比在不同的设备上可能是颠倒的,这就造成了target device不同时搜索得到的网络结构对算子的偏好不同,ProxylessNAS也展示了相关实验现象。
  19. "It also achieves better accuracyand lower latency than MnasNet."——再好好对比MnasNet和DNAS,假如二者的主要区别真的只在搜索策略上且搜索结果差异显著,或许可以empirically说明RL的搜索策略明显不如可微分的搜索策略。
  20. 没搞完的问题:3、4、5、7、8、10、12、14-3)、15、16、17、19、21-3)
  21. 尝试总结一下DNAS:利用Gumbel-Softmax trick把MnasNet的搜索策略换为可微分的策略即为DNAS。不足:1)宏结构——也就是输入分辨率和channel size要手工设定,这点不够自动化,其它layer-wise搜索空间的NAS方法在这点上可以对比一下,貌似OFA解决了这一点。2)其搜索过程训练supernet,无论前向还是反向都涉及所有路径,所以其搜索成本和DARTS一样都是supernet级别的。3)搜索空间上有一点创新,对MBConv中的1×1卷积引入了通道分组,并将group作为结构参数之一,但是效果如何并没有ablation。
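
第4、6、12条说的"Gumbel-Softmax加权所有block + lookup table的期望时延对θ可微",大致如下(层数、block、LUT数值均为随意假设,并非FBNet原实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

L, B, C = 4, 9, 16                                    # 层数、每层的block数、通道数(均为假设)
theta = nn.Parameter(torch.zeros(L, B))               # 结构参数
lat_lut = torch.rand(L, B) * 5                        # 假设的lookup table:每层每个block的时延(ms)
blocks = nn.ModuleList([nn.ModuleList([nn.Conv2d(C, C, 3, padding=1) for _ in range(B)])
                        for _ in range(L)])

def forward_supernet(x, tau=5.0):
    lat = 0.0
    for l in range(L):
        w = F.gumbel_softmax(theta[l], tau=tau)                # 可微分的"近似采样"
        x = sum(w[b] * blocks[l][b](x) for b in range(B))      # 所有path都参与前向,成本是supernet级的
        lat = lat + (w * lat_lut[l]).sum()                     # 期望时延按查表线性叠加,对theta可微
    return x, lat

out, lat = forward_supernet(torch.randn(2, C, 8, 8))
loss = out.mean() + 0.1 * torch.log(lat)                       # soft的latency项(原文是乘性的CE·α·log(LAT)^β,这里仅示意)
loss.backward()                                                # 梯度同时传到卷积权重和theta
```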
  1. 看Table 1,GDAS搜索速度似乎不比DARTS快到哪去(0.21-0.38 GPU days),但是节约|F|=8× GPU memory应该确实做到了。
  2. 按照论文中的描述是应该和DARTS一样的搜索空间,但是其搜索空间里有两个candidate ops有些奇怪,"3x3 depth-wise separable conv"和"5x5 depth-wise separable conv",代码里显示的却是"dua_sepc_3x3",即把单个separable conv堆叠了两次当作一个算子。
  3. GDAS用的是1st or 2nd order?
  4. "The searching costs listed in Tab.1 and Tab. 2 are not normalized across different GPU devices. Different algorithms might run on different machines, and we simply refer the searching costs reported in their papers."——这里说Tab.1和Tab.2里列出的search cost都是从原paper里抄的,但是至少这里报告的DARTS的GPU hours我在原paper里找不到,不知道作者怎么搞出来的。
  5. 像DARTS那样整个supernet拿去训练的搜索算法存在两个问题:1)计算成本大; 2)不同算子之间相互竞争。——"Directly optimizing this DAG [24] instead of sampling a subgraph leads to two disadvantages. First, it costs a lot of time to update numerous parameters in one training iteration, increasing the overall training time to more than one day [24]. Second, optimizing different operations together could make them compete with each other."——这里所说的相互竞争和SNAS所说的bias一样吗?
  6. "the typical RL-based method utilizes the validation accuracy as a reward to optimize the architecture generator [46]. An EA-based method leverages the validation accuracy to decide whether a model will be removed from the population or not [33]. "——这里似乎可以回答我FBNet笔记中的第3点。RL-based NAS和one-stage Gradient-based NAS都可以归结为对结构分布的学习,但是EA-based NAS似乎就不行
  7. "For RL-based and EA-based methods, feedback (reward) is obtained after a prolonged training trajectory, while feedback (loss) in our gradient-based methodis instant and is given in every iteration."——这个我觉得没道理,RL-based和EA-based不也可以改成每个iteration就更新一次的吗?——看看ProxylessNAS里的RL优化,搞清EA-based回答这个问题。
  8. "Most NAS approaches can be categorized in two modalities: macro search and micro search."——多大程度对应着layer-wise search space和 cell-based search space?cell-based search和micro search是对应的,但是marcro search和cell-based search似乎并不完全对应?前者以Zoph_16为代表,后者以MnasNet为代表。
  9. "To be noticed, we use the arg max function in Eq. (5) during the forward pass but the soft max function in Eq. (7) during the backward pass to allow gradient backpropagation."——前向用Gumbel Max,反向传播用Gumbel-Softmax。**1)**这看起来似乎就不用在训练中逐渐降低Tau了吧?——还是使用了退火,从10到0.1逐epoch下降; **2)**这样应该会导致梯度不正确吧,有啥负面影响吗?——随着Tau的逐渐降低,Gumbel-Softmax愈发接近Gumbel Max,梯度也会变得愈发接近; 3)这样前向传播时候就不是整个supernet了,对吗?能带来提速吗?——是的,可以的,代码中对这一点的实现可谓神来之笔:"hardwts = one_h - probs.detach() + probs",在数值上实现one-hot向量,保证前向传播为单路径,同时又讲编码op分布的结构参数纳入计算图中,保证了反向传播时梯度能够到达这些结构参数。这使得无论前向还是反向传播,supernet中都只涉及单个candidate op的权重参数,但是又涉及了全部的结构参数,总结起来就是实现了前向严格采样且反向梯度可传这一目标(SNAS和FBNet都是前向近似采样且反向梯度可传)。
  10. "hardwts = one_h - probs.detach() + probs"——这行代码利用pytorch的特性实现了一个厉害的技术——one-hot向量由ops分布采样而来,但forward时参与计算的为one-hot向量,backward时梯度可以传播到编码ops分布的结构参数。这意味着即使即便不使用Gumbel-Softmax也能实现可微分采样,事实上,在这个issue里就有网友不用gumbel-softmax实现了可微分采样,满足前向严格采样且反向梯度可传这一目标,且效果和使用gumbel-softmax没啥区别,但是作者回应缺乏理论保证。——我觉得作者的说法站不住脚,Gumbel-Max的所具备的理论保证是其能够产生符合给定分布的采样(样本以one-hot向量表示),而Gumbel-Softmax所具备的理论保证是其Tau越小就越逼近Gumbel-Max
  11. 这样看来GDAS的作者在这里应该是绕了点远路。SNAS是与GDAS同期的工作,其作者提出使用Gumbel-Softmax是为了实现可微分的近似采样以消除DARTS的偏置,并没有去考虑降低supernet training而导致的supernet级的search cost。而GDAS作者已经利用这行代码技巧性地实现了可微分的严格采样,那直接像issue里这样直接"index=torch.multinomial(probs, 1)" 即可,压根没有必要再绕道去借助Gumbel-Softmax。这样看来,就实现differential sampling这点来说,GDAS和ProxylessNAS其实是一样的,都是通过使用近似的梯度来实现可微分的严格采样。总而言之,Gumbel-Softmax是为了实现可微分的近似采样所须的技术,GDAS直接实现了可微分的严格采样,因此没必要用到Gumbel-Softmax
  12. 对比SNAS和GDAS的前向传播,看看二者相比于DARTS的搜索成本差距是否源于此。——1)作者在github上回应GDAS和SNAS的对比,指出二者最主要的区别就是GDAS多了个加速的技巧,在Gumbel-Softmax上加了个argmax,得到货真价实的one-hot向量,从而将训练成本从supernet级降低至single-net级。GDAS前向传播是严格的采样,SNAS是近似的采样。2)二者都能缓解DARTS的co-adaptation,但是相比于DARTS的search time(0.38),GDAS降低了(0.21),SNAS反而上升了(1.5)。3)GDAS和SNAS的实验协议与DARTS几乎都一样,但是搜索时的epoch不同,DARTS是50,GDAS是240,SNAS是150,而三者用的GPU分别为1080ti、V100、Titanxp,考虑到超参设置和硬件条件,GDAS的search time和DARTS几乎一致,SNAS反倒不止是1.5/0.38这么多。4)GDAS和SNAS所需的epoch更多可以归结为可微分采样操作导致的权重梯度稀疏(每次仅有采样到的op对应的权重被更新)以及结构参数梯度较小(越逼近one-hot向量梯度越小),而且我认为前者是主要原因
  13. 由此推广开来,可微分采样操作的优点可以归结为:1)消除DARTS式的co-adaptation; 2)ProxylessNAS和GDAS这种可微分的严格采样的可以减小GPU memory; 3)无论是严格采样还是近似采样,都无法降低GPU hour,甚至后者还会显著增加。
  14. "In Eq. (5),h_i,j is a one-hot vector. As a result, in the forward procedure, we only need to calculate the function F_argmax(h_i,j). During the backward procedure, we only back-propagate the gradient generated at the argmax( ̃h_i,j). In this way, we can save most computation time and also reduce the GPU memory cost by about |F| times."——前向传播时用Gumbel-Max,因此无须整个supernet,反向传播时只传被采样到的结构的相应梯度,因此可以节约搜索成本|F| 倍。但是反向传播时的梯度处理我还是不理解,看看代码
  15. "In the same time, without this acceleration step, it requires less training epochs to converge but still costs more time than applying the acceleration step."——这部分解答了我之前的疑惑,用采样结构而非整个supernet一起的训练方式确实会需要更多的epoch来训练,但是总体来看还是能减少训练时间。
  16. "If we do not applythe acceleration step introduced in Sec. 3.2, each iteration will cost |F|=8× more time and GPU memory than GDAS."——这里有问题。省下8倍GPU memory我信,但是省下8倍训练时间我不大相信。一是考虑到并行计算,二是考虑到整个supernet一起训练的话会需要更少的epoch(上一点说的)。
  17. derive architecture的方式和DARTS一样,这么看还是SNAS更高一筹——"Each node i connects with T previous nodes. Following the previous works, we use T = 2 for CNN [47, 24] and T= 1 for RNN [31, 24]."
  18. 从(3)式和GDAS的derive architecture方式可以看出,GDAS所提出的differential sampling仅针对ops分布,由于采用了和DARTS一样的derive architecture方式,其sampling并不针对拓扑结构。SNAS对每个连接是否存在都严格根据zero算子的概率来决定,因此其拓扑连接的也被包含在differential sampling之中。
  19. "One benefit of this acceleration trick is that it allows us to directly search on the large-scale dataset (e.g., ImageNet) due to the saved GPU memory. We did some experiments to directly search on ImageNet using the same hyperparameters as on the small datasets, however, failed to obtain a good performance"——尝试直接在ImageNet上search,但是结果并不好,文中把这归结为超参没调好。这其实值得探究,因为MnasNet和ProxylessNAS这种直接在ImageNet上search的,效果就好得很。说不定是搜索空间的问题?
  20. "with this human-designed reduction cell, GDAS finds a better architecture"——手工设计了reduction cell,效果还比search的更好。
  21. "This result implies that the reduction cell might have a negligible effect on the performance of networks and the handcrafted reduction cell could be on par with the automatically discovered one."——这里的逻辑不对。Tab. 5只说明了手工设计和自动搜索出的reduction cell一样,但不能说明reduction cell对性能影响小。要说明影响小,得random search出一个reduction cell对比一下才行。
  22. 尝试总结GDAS:1)和DARTS比用了Gumbel-Max和Gumbel-Softmax; 2)和SNAS比前向传播是Gumbel-Max,因而前向传播时无须supernet。3)指出了DARTS整个supernet一起训练的问题:计算成本大、不同算子相互竞争(见第5条)。4)GDAS比DARTS快的原因在于第二点,SNAS比DARTS慢的原因也在于此。5)GDAS和ProxylessNAS一样都是通过使用近似梯度(但是近似的方式不同)来实现可微分的严格采样,且GDAS引入Gumbel-Softmax其实是绕了远路,多余了。
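
围绕第9、10条讨论的那行 "hardwts = one_h - probs.detach() + probs",下面是一个极简sketch(Gumbel噪声的写法、op的构造等都是我随手假设的简化,并非GDAS原实现):前向在数值上是严格的one-hot单路径,反向梯度仍能传回结构参数。

```python
import torch
import torch.nn as nn

num_ops, C = 4, 16
arch_param = nn.Parameter(torch.zeros(num_ops))                     # 结构参数
ops = nn.ModuleList([nn.Conv2d(C, C, 3, padding=1) for _ in range(num_ops)])

def gdas_forward(x, tau=10.0):
    gumbel = -torch.empty(num_ops).exponential_().log()             # Gumbel(0,1)噪声
    probs = ((arch_param + gumbel) / tau).softmax(dim=-1)           # Gumbel-Softmax(只用于反向)
    index = int(probs.argmax())
    one_h = torch.zeros(num_ops)
    one_h[index] = 1.0
    hardwts = one_h - probs.detach() + probs                        # 数值上是one-hot,梯度却能传回probs/arch_param
    return hardwts[index] * ops[index](x)                           # 前向只算被采样到的那个op(单路径)

y = gdas_forward(torch.randn(2, C, 8, 8))
y.mean().backward()
print(arch_param.grad)                                              # 结构参数确实拿到了梯度
```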
  1. 这里Single-Path NAS的Single-Path和SPOS里single path不是一个概念。SPOS里supernet本身是multi-path的,但是搜索空间里的每个样本点(网络结构)是single-path的,训练supernet的时候是采样出单个single-path网络来训练。但在Single-Path NAS中,整个supernet就是single-path的,不同的operator被整合进一个superkernel中。——"By sharing the convolutional kernel weights, we encode all candidate NAS operations into a single 'superkernel', i.e., with a single path, for each layer of the one-shot NAS supernet. This novel encoding of the design space yields a drastic reduction to the number of trainable parameters/gradients, allowing our NAS method to use batch sizes of 1024, a four-fold increase compared to prior art's search efficiency."
  2. 概括地说,SPOS这种是network-level weights sharing,而Single-Path NAS这种是operator-level weights sharing 。个人感受,这篇可以作为NAS和pruning边界模糊的典型。——复习完pruning再看看这个观点
  3. SPOS虽然在训练supernet的时候只采样单条路径,因而在每个training step时所需投入的训练资源和stand-alone network training相同,但是因为有着更多的trainable weights,因此总体来说比Single-Path NAS这种在operator-level weights sharing的方法需要投入更多训练资源。
  4. 我感觉OFA就吸收了Single-Path NAS的这个superkernel思想,查证一下。
  5. 这篇文章还将之前的one-shot NAS总结为multi-path NAS——"Specifically, current NAS methods relax the combinatorial optimization problem of finding the optimal ConvNet architecture to an operation/path selection problem."——并且指出这会使得trainable parameters随着candidate operations线性增长——"As expected, naively branching out all paths is inefficient due to an intrinsic limitation: the number of trainable parameters that need to be maintained and updated during the search grows linearly with respect to the number of candidate operations per layer."
  6. Section 2讨论了FBNet和ProxylessNAS的缺点:1)FBNet为了减小训练成本,只能用proxy dataset(ImageNet的子集); 2)ProxylessNAS虽然相比于DARTS减小了memory cost,但是还是需要bi-level optimization。
  7. Section 3.2具体展开描述了如何实现superkernel,可归结为——weight sharing + group Lasso + trainable thresholds ,太像pruning了。似乎可以直接将Single-Path NAS归类为稀疏化训练的pruning方法?——复习完pruning再回答
  8. "To compute the gradients for thresholds, we relax the indicator function g(x,t) =1(x > t) to a sigmoid function, σ(·), when computing gradients, i.e., g'(x,t) =σ(x > t)."——为何要这样改变梯度?——这个问题其实和differtial architecture sampling(DAS)面临的问题是一样的,严格的指示函数的话梯度没法BP到编码这个指示函数的threhold上,因此需要将其soften。另外,这里也可以像DAS里那样,搞可微分的严格指示函数可微分的近似指示函数
  9. "However, solving Equation 7 gives rise to a challenging bi-level optimization problem [14]. Existing methods interchangeably update the α's while freezing the w's and vice versa, leading to more gradient steps."——指出了bi-level优化的缺点是交替训练造成的。交替训练意味着交替freezing,freezing意味着learnable参数的“旷课”,“旷课”意味着需要“补课”——即更多的gradient steps。
  10. Single-Path NAS快有两方面:1)没有结构参数,因此避免了bi-level优化的交替更新所导致更多的gradients step; 2)supernet小很多,trainable weights大大减少。——"Therefore, our formulation eliminates the need for separate gradient steps between the ConvNet weights and the NAS parameters. Moreover, the reduction of the trainable parameters w per se, further leads to a drastic reduction of the search cost down to just a few epochs."
  11. DMaskingNAS和Single-Path NAS(下面分别简称D和S)都实现了operator-level weight sharing,二者值得放在一起对比:1)mask和indicator:S搞的indicator起到的作用其实和D搞的mask作用相同,都是起到掩模的作用以实现选择,但是S的掩模是作用在kenerl上的,D的掩模似乎是作用在feature map上的。造成这种区别的原因可能是因为D需要同时选择各层feature map的分辨率和channel number,而S是同时选择kernel size和filter(channel) number,为了各自方便吧。2)结构参数:S里编码indicator的learnable threshold和D的Gumbel-Softmax mask里的可学习参数起到的作用其实就是结构参数。而且为了保证梯度能够BP到编码各自掩模的结构参数,二者都对掩模进行了软化。3)可微分结构采样(DAS):综合前两点其实不难看出,D和S都可以归为DAS类NAS方法,它们都是在引入了operator-level weight sharing后的延续。当然,二者在实现DAS上用了不同的思路。D去sample特征图,而S去sample卷积核。D利用结构参数编码掩模的方法与之前的DAS是一个思路,而S走出了不一样的路,在稀疏训练(似乎是)的基础上将threshold作为可学习的。这两种编码结构参数的方式孰优孰劣值得研究。4)single/bi-level optimization:D仍旧是bi-level的,S是却single-level的,这可能是跟S的结构参数较少有关。
  12. hardware-aware的实现是通过:1)构造lookup table; 2)将trainable threshold (文中称作NAS-related decision) 纳入hardware-aware loss中。
  13. 实验部分需要仔细思考的地方:1)这篇文章所宣称的search cost是否真的优势很大?2)搜索得到的网络acc效果好吗?3)Single-Path NAS的优势应该主要落在search cost的节省上,如果搜到的网络acc也高,那是什么原因?
  14. "while random search does not outperform NAS methods, the overall accuracy is comparable to MobileNetV2. This highlights that the effectiveness of NAS methods heavily relies upon the properties of the MobileNetV2-based design space."——这里也吐槽了现有大多数方法不比random search强多少,并将原因归于现有NAS方法的成功很大程度上依赖于MobileNetV2-based design space
  15. 关于random search。文中所采用的random search是random sample十个,然后每个训练5个epoch,再挑出最好的,这种方式的random search的cost也是蛮高的。——"Nonetheless, the search cost of random search is not representative: to avoid training all ten samples, we would follow a selection process similar to MnasNet, by training each sample for few epochs and picking the one with highest accuracy. Hence, the actual search cost for random search is not negligible, and for ≥10 samples it is in fact comparable to automated NAS methods."
  16. "Such finding is significant in the context of NAS, since choosing over subsets of kernels can effectively capture the accuracy-runtime trade-offs similar to their individually trained counterparts. We therefore conjecture that our efficient superkernel-based design search can be flexibly adapted and benefit the guided search space exploration in other RL-based NAS methods"——Section 4.3的实验很有意义。这里说的就是superkernel里直接继承不同kernel size的kernel网络进行对比,其acc的rank与stand-alone网络的acc的rank是相同的。因此这里说superkernel这个idea可以拓展到其它类型的NAS方法上。但是我觉得这里的关于rank对比的实验还是很粗糙的,只考虑了两种简单的网络(核大小全为3或5),这就得出了rank可信的结论我觉得还是不够的。
  17. "Beyond the NAS literature, our finding is closely related to Slimmable networks [23]. Slimmable Nets limit however their analysis across the channel dimension, and our work is the first to study trade-offs across the NAS kernel dimension."——关于上一点,Slimmable networks [23]也有着类似的结论,但是在channel维度上的。
  18. "Hence, Single-Path NAS could enable future work that builds upon the efficiency of our single-path, one-shot design space for RL- or evolutionary-based NAS methods."——这篇文章给NAS带来的一大启发就是 efficiency of our single-path, one-shot design space
  19. 这篇文章用TPU,估计难以复现。
  20. 尝试总结一下Single-Path NAS:1)所谓的superkernel可以概括为op-level weight sharing + group Lasso + trainable thresholds; 2)search space 似乎不够rich; 3)search strategy就是稀疏化pruning的那套,在pruning那虽然司空见惯了,但是放到NAS这和其它one-shot NAS一比倒是显得清新脱俗。
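
第1、7、8条说的superkernel + 软化的指示函数,大致可以写成下面这样(这里只示意3x3/5x5二选一;threshold与范数的具体形式、group Lasso、channel维度的选择等细节都是我假设的简化,并非原文实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperKernel5x5(nn.Module):
    """3x3核嵌在5x5核中心;外圈是否启用由"外圈权重范数是否超过可学习threshold"决定。"""
    def __init__(self, out_ch=8, in_ch=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 5, 5) * 0.01)
        self.threshold = nn.Parameter(torch.tensor(0.05))          # 可学习的threshold(NAS decision)
        mask = torch.zeros(1, 1, 5, 5)
        mask[:, :, 1:4, 1:4] = 1.0
        self.register_buffer("core_mask", mask)

    def forward(self, x):
        core = self.weight * self.core_mask                        # 中心3x3部分
        outer = self.weight * (1.0 - self.core_mask)               # 5x5外圈部分
        gate = torch.sigmoid(outer.norm() - self.threshold)        # 软化的指示函数 1(||outer|| > t)
        w = core + gate * outer                                    # gate->0退化为3x3,gate->1为完整5x5
        return F.conv2d(x, w, padding=2)

y = SuperKernel5x5()(torch.randn(2, 8, 16, 16))
```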
  1. 导读这篇文章:1)A memory and computationally efficient DNAS that optimizes both macro- (resolution, channels) and micro- (building blocks) architectures jointly in a 10^14× larger search space using differentiable search. 2) A masking mechanism and effective shape propagation for feature map reuse. This is applied to both the spatial and channel dimensions in DNAS.
  2. FBNet还是带latency约束的,到了FBNetV2咋开起了倒车,又搞起了FLOPS约束了。
  3. 就Table 1而言,OFA应该也是和DMaskingNAS一样全满。
  4. DMaskingNAS搜索空间中包含SE,OFA不知道有没有。
  5. 和OFA(230M-FLOPs,76.9-ACC,1200+75N-GPU hours)相比,FBNetV2(238M-FLOPs,76.0-ACC,200-GPU hours)仅仅在search hours上占优,但这也仅仅是针对单一应用场景而言,对于多设备部署的应用场景,这个优势也要被OFA追平。值得一提的是,以上对比的是在给定接近的FLOPs下的二者搜出的网络的acc,FBNetV2正是FLOPs-constraint,而OFA却是latency-constraint,因此这个对比其实是偏向FBNetV2的。总而言之,FBNetV2和OFA的PK结果就是较多设备部署时OFA,单一/较少设备部署时FBNetV2。目前我暂时把二者分别视为coupling和decoupling流派的巅峰。还需要去调研一下后续发展
  6. 总结一下各路方法search space的大小。
  7. "For example, ProxylessNAS tackles the memory constraint by training only one path in the supergraph each iteration. However, this means ProxylessNAS would take a prohibitively long time to converge on an order-of-magnitude larger search space."——1)对比一下ProxylessNAS和DARTS的搜索空间大小就可以知道,这里说前者的搜索空间大于后者是不对的,ProxylessNAS的是6^21≈2^16,DARTS的是*(∏[1~4] (k+1)k/2×(7^2) )^2≈10^18*。2)这里说prohibitively也是夸张了,把ProxylessNAS踩得太过了,只采样一个path缺少会比train整个supernet需要更多迭代次数,但也是可以承受的(200 V100 GPU hours)。
  8. one-shot NAS系列都研究完以后再琢磨一下Table 1
  9. 这篇文章听上去蛮有趣——Haq: Hardware-aware automated quantization with mixed precision.
  10. "However, with heuristics-based simplifications, pruning methods train potential architectures separately, one after another — in some cases, pruning methods consider only one architecture.","As described above, network pruning suffers from inefficient and sequential exploration of architectures, one-by-one"——如何理解这里说的pruning是one by one的?是把pruning和DNAS相比,相当一部分pruning methods是一个arch一个arch地fine-tune,而DNAS则是train整个supernet,是这个意思吗?
  11. 论文第三页左下角那段究竟在吐槽Single-Path NAS什么问题?——那个吐槽Single-Path NAS虽然实现了两个kernel之间的weight sharing,但为了BP还是得store两个kernel的卷积所产生的feature map。——我觉得这个吐槽错了,Single-Path NAS和DMaskingNAS搞weight sharing的思路是一样的,都是只用一个最大size的tensor装着kernel/feature map,然后用indicator/mask点乘这个tensor以实现选择。
  12. "Searching along spatial and channel dimensions has been studied both with and without NAS."——关于搜索resolution和channel number,NAS和pruning都有涉及。
  13. DMaskingNAS和Single-Path NAS一样把weight sharing搞到了op-level的程度,对比一下二者的区别。
  14. "Furthermore, the approximation falls short of equivalence only because weights are shared, which is shown to reduce train time and boost accuracy in a."——强调了一下,Figure 3中step E只能近似step E,而不能等价,这是因为从step C到step D,把几个不同的kernel替换为一个weight sharing的kernel。
  15. DMaskingNAS搜索不同层的resolution,OFA只搜索input resolution。
  16. 关于训练协议:1)"we randomly select 10% of classes from the original 1000 classes and train the supergraph for 90 epochs."——仅用原数据集10%的类别,search时supernet只训练90个epoch。2)"In each epoch, we train the network weights with 80% of training samples using SGD. We then train the Gumbel Softmax sampling parameter α with the remaining 20% using Adam [17]. "——是bi-level optimization,而且权重参数和结构参数的优化器不一样。3)"We set initial temperature τ to 5.0 and exponentially anneal by e^−0.045≈0.956 every epoch"——Gumbel-Softmax里的Tau逐epoch指数衰减。
  17. 文章没说如何搜索kernel size的,估计就是搞path没搞weight sharing了。再看看Single-Path NAS是怎么搜kernel size的,看看DMaskingNAS怎么弄比较好。
  18. FBNetV2是否就是DNAS的最新成果?其后还有什么工作吗?
  19. 尝试总结一下DMaskingNAS:1)创新点主要在于提出了针对macro-architecture (resolution和channel number)的可微分搜索方法。2)和OFA等方法在acc和search cost方面对比。3)不开源,感觉很飘。
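
第11、13条说的channel masking(把若干个0/1 channel mask用Gumbel权重聚合成一个mask,乘在"只算一次的最大宽度feature map"上,从而复用计算、共享权重)大致如下;候选channel数、层的形式均为假设,并非原文实现:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

cand = [8, 16, 24, 32]                             # 假设的候选channel数
conv = nn.Conv2d(16, max(cand), 3, padding=1)      # 只实例化一份最大宽度的卷积(weight sharing)
alpha = nn.Parameter(torch.zeros(len(cand)))       # 结构参数
masks = torch.zeros(len(cand), max(cand))          # 每个候选channel数对应一个0/1 channel mask
for i, c in enumerate(cand):
    masks[i, :c] = 1.0

def forward(x, tau=5.0):
    g = F.gumbel_softmax(alpha, tau=tau)            # Gumbel权重
    mask = (g.unsqueeze(1) * masks).sum(0)          # 把若干0/1 mask加权聚合成一个mask
    return conv(x) * mask.view(1, -1, 1, 1)         # 最大宽度的feature map只算一次,乘mask复用

forward(torch.randn(2, 16, 8, 8)).mean().backward() # 梯度能传到alpha和conv权重
```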
  1. one-shot NAS的概念就是SMASH最早提出的,其摘要中给出了对one-shot NAS的定义: "effectively search over a wide range of architectures at the cost of a single training" 。也就是说one-shot是针对training而言的。
  2. SMASH是基于这样一个假设而提出的:只要HyperNet训练充分,使用了HyperNet产生的weights的网络的error rate rank会和正常训练得到的rank一样。Fig.4 和Fig.5展示了这一假设的有效程度。基于这一假设,SMASH分为三个阶段:1)训练HyperNet; 2)利用HyperNet评估随机采样的网络; 3)选择SMASH score最高的网络进行训练。总而言之,SMASH就是random search with HyperNet
  3. 关于memory-bank视角下的搜索空间我实在看不懂,放弃....
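
第2条的流程可以抽象成下面这个极简sketch:HyperNet把结构编码映射成权重,直接拿生成的权重前向来近似评估该arch(结构编码、HyperNet的形式纯属假设,与SMASH真正的memory-bank表示无关):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 16
hypernet = nn.Linear(8, C * C * 3 * 3)             # 输入:8维结构编码;输出:一层3x3卷积的全部权重

def eval_with_generated_weights(arch_code, x):
    w = hypernet(arch_code).view(C, C, 3, 3)        # 由HyperNet按结构编码生成权重
    return F.conv2d(x, w, padding=1)                # 用生成的权重前向,作为该arch的近似评估

x = torch.randn(2, C, 8, 8)
scores = [eval_with_generated_weights(torch.rand(8), x).mean().item()   # 假的score,仅示意流程
          for _ in range(10)]                       # 随机采样一堆arch -> 评估 -> 挑最好的去正常训练
```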
  1. 这个pytorch实现的ENAS代码清晰、注释详细,虽然只有搜索RNN cell的部分,但作为参考还是非常好用。
  2. 用了DropPath。
  3. one-shot NAS的四步框架适用于ENAS吗
  4. ENAS和DARTS类NAS(SNAS、ProxylessNAS这些)有多像?——1)从整体上看都是同样的交替训练的思路:"In ENAS, there are two sets of learnable parameters: the parameters of the controller LSTM, denoted by θ, and the shared parameters of the child models, denoted by ω. The training procedure of ENAS consists of two interleaving phases....."。2)从update采样分布上看: "In our image classification experiments, the reward function is the accuracy on a minibatch of validation images."。reward信号都是一个minibatch上的acc**(从代码里看是只采样了1个arch,但是要更新2000 steps**,这和直接采样了2000个又有所不同,前者是在θ空间中连续走2000步且每一步都是由当前θ决定的,后者是在θ空间中一口气走2000步那么大的一大步且这一大步由初始的θ决定。前者随机性更大,更利于跳出局部最优,类似于SGD的效果**)**,而非NASNet那种整个测试集上的acc,加大了update的频率。但关于update采样分布的频率多大值得关注,因为MDENAS、以及某个用了Graph的NAS也把update的频率改成epoch-wise了,这个值得考察探究一下。3)同样都是交替训练,ENAS和DARTS有一点差别很大——DARTS每次更新参数只走1个steps就交替,但是ENAS的交替是更新weight走45000/128 steps,更新结构参数走2000 steps。
  5. ENAS的交替训练:1)将CIFAR-10的50000个training example分为45000和5000的train/val set 。2)交替更新ω和θ:更新ω的阶段batch size为128,每走一个batch采样一次arch,走完全部45000个样本才切换到更新θ; 更新θ的阶段要走2000个step,每走一个step采样一次arch,也就是执行2000次“采样arch-评估arch-acc作为reward更新θ”。看代码验证一下——确实如此。
  6. 搜索阶段的训练完成后推导出架构的方式与DARTS类NAS不同。这里就是采样一堆model,然后在一个minibatch上评估,把最好的一个拿去train from scratch。DARTS类NAS普遍做法是取个argmax。
  7. "Nevertheless – and this is perhaps surprising – we find that M= 1 works just fine,i.e.we can update ω using the gradient from any single model m sampled from π(m;θ) ." ENAS的实践表明,每次只采样一个模型就能很好地更新模型参数。这一点很重要,后续ProxylessNAS、SNAS等每次只采样一个network而非DARTS、《Understanding One-Shot》那样更新整个supernet,其底气就源于这里。
  8. 对于ENAS的搜索空间,其DAG中的每个node表示的是局部计算(对CNN的cell来说就是op)而非feature:" the nodes represent the local computations and the edges represent the flow of information ."这点和后来的DARTS类NAS正好相反。
  9. " ENAS's controller is an RNN that decides: 1) which edges are activated and 2) which computations are performed at each node in the DAG. " ENAS的controller就干这两件事。2)中需要decide的"computation",对于RNN cell来说是acticvation function,对于CNN cell来说是op。
  10. " In addition to ENAS's strong performance, we also find that the models found by ENAS are, in a sense, the local minimums in their search spaces. " 文章这里蛮骄傲地宣称自己的ENAS找到了局部最优。对于NAS而言,局部最优是无法避免的,问题就在于这是多大的局部以及有多少局部,局部越大越多这个NAS方法就越强。同时还利用这个“局部最优”的观点解释为啥比Zoph_16弱:" the performance gap between NAS and ENAS is due to the fact that we do not sample multiple architectures from our trained controller, train them, and then select the best architecture on the validation data. " ——采样的arch不够多,因此局部不够大。考察对比一下二者采样的arch数量
  11. ENAS的macro和micro两层search space。为啥macro里搜出的arch比micro里搜出的参数量大了那么多效果还更差?使用了macro搜到了很多5×5标准conv; 而搜micro的,清一色的3×3 sepconv,同时也因此能每个stage叠了N=6。但是为啥两种搜索空间会有这种不同的偏好?i.e., 为啥链式搜索空间偏好大核,cell-based搜索空间则不会?这是一个值得深入探究的问题。
  12. " We therefore conclude that the appropriate training of the ENAS controller is crucial for good performance." 其它地方random search是个强baseline,为何这里就不是?比如NASNet那篇,random search那篇。
  13. "We suspect this is the reason behind ENAS's superior empirical performance to SMASH. " 这里讨论了SMASH为何会不如ENAS,值得好好看看
  14. 核对和ENAS还有DARTS的实验协议
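
第4、5条描述的"固定ω、走2000个step、用minibatch val acc做reward更新θ"这个阶段,大致如下(这里把LSTM controller简化成一组独立的logits,reward也是假的,仅示意REINFORCE的更新方式,并非ENAS原实现):

```python
import torch
import torch.nn as nn

num_layers, num_choices = 4, 3
theta = nn.Parameter(torch.zeros(num_layers, num_choices))      # 简化版controller参数(原文是LSTM)
theta_opt = torch.optim.Adam([theta], lr=3e-4)

def sample_arch():
    dist = torch.distributions.Categorical(logits=theta)
    arch = dist.sample()                                         # 每层采样一个choice
    return arch, dist.log_prob(arch).sum()

def minibatch_val_acc(arch):                                     # 假设的reward,实际应为子网在val minibatch上的acc
    return float(arch.float().mean()) / num_choices

baseline = 0.0
for step in range(2000):                                         # 更新theta的阶段:2000个step,每步重新采样
    arch, log_prob = sample_arch()
    reward = minibatch_val_acc(arch)
    baseline = 0.95 * baseline + 0.05 * reward                   # 移动平均baseline,降低方差
    loss = -(reward - baseline) * log_prob                       # REINFORCE
    theta_opt.zero_grad(); loss.backward(); theta_opt.step()
# 与之交替的另一阶段(固定theta,逐minibatch采样arch更新共享权重ω)与RandomNAS一节的sketch类似,略
```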
  1. 文中举了个例子来说明one-shot model:“a single model containing all three operations”。这里对one-shot model的理解其实就是weight sharing model,事实上文章的摘要就声明了文章是在搞weight sharing——"We aim to understand weight sharing for one-shot architecture search"
  2. 关于one-shot和weight sharing这两个概念在应用上的重合范围。one-shot是针对训练而言,weight sharing则是针对model的概念。在我所了解的范围内,除了SMASH通过使用HyperNet生成weights来实现one-shot NAS,其它one-shot NAS都是利用weight sharing实现的。但事实上,SMASH使用HyperNet来生成给定arch的权重,这其实也可以视为一种特殊的weight sharing。总而言之,绝大多数情况下可以将one-shot NAS视为weight-sharing NAS(SMASH是特例)。
  3. 文中将自己与MorphNets进行比较,都视作pruning。但MorphNets是针对filter的pruning,而文中提出的one-shot NAS则是针对ops和skip-connections的pruning。
  4. 文中提出将one-shot NAS分为四步:1)设计一个能够代表足够多网络结构的搜索空间; 2)训练one-shot模型使得它对不同架构的验证集精度具有足够的预测性; 3)利用预训练好的模型在验证集上评估采样出的候选架构的精度; 4)重新训练上一步中表现最好的架构,并在测试集上评估。
  5. 用文中提出的one-shot四步框架看待SMASH,ENAS,ProxylessNAS,random search,DARTS等方法:1)SMASH把第2、3步中用到的one-shot模型改成HyperNet。2)DARTS是将第2、3步耦合,同时第3步中学到的分布不是用于采样而是用于加权求和。3)ProxylessNAS同DARTS一样将第2、3步耦合,但与DARTS不同的是第3步中依据分布采样架构。4)SPOS认为DARTS、ProxylessNAS等方法耦合architecture search和one-shot model training是不好的,赞同了OneShot的decouple search and training的策略。
  6. 为了削弱one-shot model中各个部分的co-adaptation效应,文章提出了path dropout(和NASNet使用的DropPath差不多)。Fig. 4讨论了这个path dropout 所须的唯一超参r对one-shot acc和stand-alone acc的相关性的影响,path dropout的程度无论过高还是过低都会影响相关度。
  7. Fig. 5展示了随机采样的20000个arch的one-shot acc和它们分组采样后的arch的stand-alone acc。从图中看来是高度相关的,这很有力地说明了文中提出的one-shot NAS的合理性(似乎path dropout功不可没)。
  8. Table 2展示出的Top、Small、Random三种模型的对比:1)Top+Small vs. Random,搜出来的确实胜过random sample的,无论是单挑参数量还是单挑acc; 2)Top vs. Small,在搜索模型时,若固定F,参数量大的模型acc更高; 但若是允许F可变,可以搜索出参数量相同但是acc更高的模型(如Small(F=24) vs. Top(F=16)),这说明了对于NAS而言粒度精细到filter的搜索是很有必要的
  9. 整理精细到filter的NAS方法。
  10. 文章的标题是simplify和unsderstand:1)simplify,相对于SMASH和ENAS来说,文中提出的方法既不需要一个HyperNet也不需要RL constroller,仅用SGD就实现了one-shot NAS。事实上,在这以后的one-shot NAS都沿袭着文中提出的四步的框架并做出相应的改进(如DARTS,Proxyless NAS); 2)understand,理解为何weight sharing有效,即为何使用了sharing的weights得到的one-shot acc能和stand-alone acc具备较强相关性。这里的理解就是提出了一个假设,然后再用比较KL散度的方式验证了这种假设的正确性。这个假设就是one-shot model在训练过程中会变得更依赖更有用的op,这一方面造成了不同架构的one-shot acc的差异变得较大,使得rank变得更方便,另一方面还可以通过one-shot acc的大小来筛选出最好的op的组合。个人观点:相关性如此强是因为用了path dropout来削弱co-adaptation,不用的话又要弱好多。
  11. 这里提一下path dropout和DropPath的区别。目前仅在这篇文章中见到path dropout,而DropPath在NASNet、ENAS、DARTS等方法中均有使用。path dropout是在搜索阶段使用,作用是削弱不同op的weight之间的co-adaptation。而DropPath是在搜索结束后的训练阶段使用,为cell-based的网络起到类似dropout的正则化作用。后续跟进查证一下。
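
第6条的path dropout大致如下(drop rate、被保留path的聚合方式均为我随手假设的,并非原文实现):

```python
import torch
import torch.nn as nn

class MixedNode(nn.Module):
    """汇入同一节点的多条候选path,训练时按概率drop_rate随机置零(至少保留一条)。"""
    def __init__(self, C=16, num_paths=4, drop_rate=0.3):
        super().__init__()
        self.paths = nn.ModuleList([nn.Conv2d(C, C, 3, padding=1) for _ in range(num_paths)])
        self.drop_rate = drop_rate

    def forward(self, x):
        if self.training:
            keep = torch.rand(len(self.paths)) > self.drop_rate
            if not keep.any():
                keep[torch.randint(len(self.paths), (1,))] = True    # 兜底:至少保留一条path
        else:
            keep = torch.ones(len(self.paths), dtype=torch.bool)
        outs = [p(x) for p, k in zip(self.paths, keep) if k]
        return sum(outs) / len(outs)          # 取均值只是这里的简化假设,原文的聚合细节未必如此

y = MixedNode()(torch.randn(2, 16, 8, 8))
```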
  1. 了解实验配置,超参设置,以便继续阅读RandomNAS的实验部分。" ... we run DARTS four times with different random seeds and pick the best cell based on its validation performance obtained by training from scratch for a short period (100 epochs on CIFAR-10) ... To evaluate the selected architecture, we randomly initialize its weights, train it from scratch, and report its performance on the test set. " ——1)搜索阶段:用DARTS搜索4次,四次搜出的arch都从头训练100个epoch,选出验证集表现最好的那个作为最终结果; 2)评估阶段——从头训练四个里挑选出的那个最好的arch,在测试集上评估。
  2. 搜索阶段的实验配置和超参设置(CIFAR-10)。8 cells,50 epochs (both for training and validation sets), batch size 64, initial channel number 16。优化weight: momentum SGD, initial lr 0.025, momentum 0.9, weight decay 3×10^-4。优化结构参数: lr 3×10^-4,momentum (0.5,0.999),weight decay 10^-3。1 GPU days。
  3. 评估阶段的实验配置和超参设置。1)CIFAR-10:20 cells, 600 epochs, batch size 96, initial channel number 36, path dropout 0.2, auxiliary towers with weight 0.4。1.5 GPU days。2)ImageNet(mobile setting):14 cells,250 epochs,batch size 128, weight decay 3×10^-5, initial lr 0.1(decay 0.97/epoch),other hyperparameters follow Zoph et al. (2018); Real et al. (2018); Liu et al. (2018a)。12 GPU days。
  4. 文章中跑了DARTS四次,然后挑了val acc最好的arch重新训练,因为—— "This is particularly important for recurrent cells, the optimization outcomes can be initialization-sensitive (Fig. 3) ",但后面又说"This practice is less important for convolutional cells however, because the performance of discovered architectures does not strongly depend on initialization (Fig. 3)."——这他娘的也是很精分了。
  5. 关于搜索空间的compelexity analysis有两种。1)discretized search space:对单个cell有 " ∏[1~4] (k+1)k/2×(7^2)≈10^9这里的2应该不能除,因为两个输入节点不可交换possible DAGs without considering graph isomorphism (recall we have 7 non-zero ops, 2 input nodes, 4 intermediate nodes with 2 predecessors each ) " ,再把normal cell和reduce cell一起考虑上,整体网络结构的数量就是10^18。如果考虑isomorphism的话,应该是多大?2)continuous search space: 对单个cell有" each relaxed cell (a fully connected graph) contains2 + 3 + 4 + 5 = 14 learnable edges, allowing(7 + 1)^14≈4×10^12 possible configurations (+1 to include the zero op indicating a lack of connection) ",考虑整个网络结构的搜索空间就是1.6×10^25。
  6. 这篇文章中random search的实现方式:随机采样24个arch,train 100个epoch,挑val acc最低的那个。
  7. 关于SNAS(还有哪些方法?)等方法引入Temperature参数来进行单模型采样思路,DARTS在conclusion部分就提到了:" ... the current method may suffer from discrepancies between the continuous architecture encoding and the derived discrete architecture. This could be alleviated, e.g., by annealing the softmax temperature (with a suitable schedule) to enforce one-hot selection."
  8. DARTS所声称的4 GPU hours的search cost是算了跑四次的吗?还是只算一次?
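A quick plain-Python check of the two search-space sizes quoted in point 5 (my own script, just reproducing the arithmetic):

```python
from math import comb, prod

OPS = 7  # non-zero candidate operations

# Discretized space: intermediate node k (k = 1..4) picks 2 of its k+1
# possible predecessors, and each chosen edge picks one of the 7 ops.
per_cell_discrete = prod(comb(k + 1, 2) * OPS**2 for k in range(1, 5))
print(per_cell_discrete)        # ~1.0e9 for one cell
print(per_cell_discrete ** 2)   # ~1.1e18 for normal + reduction cell

# Continuous (relaxed) space: 2+3+4+5 = 14 edges, each with 7 ops + the zero op.
per_cell_relaxed = (OPS + 1) ** 14
print(per_cell_relaxed)         # ~4.4e12 for one cell
print(per_cell_relaxed ** 2)    # ~1e25 for both cells
```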
  1. "We address the high memory consumption issue of differentiable NAS and reduce the computational cost (GPU hours and GPU memory) to the same level of regular training while still allowing a large candidate set."——我认为ProxylessNAS相比于DARTS减小GPU memory和减小GPU hours靠以下两点:1)子网络采样(避免了不同candidate ops 的相加); 2)layerwise搜索空间(避免了cell-based的multi-branch相加)。——这两点保证了search cost低到足以实现proxyless。
  2. "Therefore, directly applying NAS to a large-scale task (e.g. ImageNet) is computationally expensive or impossible, which makes it difficult for making practical industry impact. As a trade-off, Zoph et al. (2018) propose to search for building blocks on proxy tasks, such as training for fewer epochs, starting with a smaller dataset (e.g. CIFAR-10), or learning with fewer blocks. Then top-performing blocks are stacked and transferred to the large-scale target task. This paradigm has been widely adopted in subsequent NAS algorithms."——这里以proxy task的视角梳理了一下NAS的发展逻辑:1)最开始的Zoph_16是full-training,只能勉强搜个CIFAR-10; 2)后来NASNet开始用proxy task(搜cell,search阶段只堆叠少量cell,且只训练20个epoch,在CIFAR上train,search结束后再到ImageNet上full training); 3)NASNet开始到ProxylessNAS之前的工作都是采用这个proxy task的思路,包括有PNAS、RENAS、DARTS、MnasNet、NAO。MnasNet虽然没采用cell-based的搜索空间,但是其每个block内会堆叠相同的层进行堆叠,堆叠层数也是靠搜索,而且搜索阶段只训练5个epoch。
  3. "inspired by recent works (Liu et al., 2018c; Bender et al., 2018), we formulate NAS as a path-level pruning process."——这里把DARTS、OneShot等oneshot方法称之为path-level pruning,我觉得很贴切。
  4. 关于hardware-aware约束:1)文章中称Proxyless-G用的是(7)式,Hardware-aware NAS综述说(7)式中的scaling factor是learnable的; 2)文章中称Proxyless-R用的是和MnasNet一样的objective function。以上两点看代码验证一下。——1)看了代码,scaling factor并不是learnable的,这样的话调参这个factor似乎会比较艰难; 2)是的,但是 w 的取值不一样。
  5. "As shown in Eq. (1), the output feature maps of all N paths are calculated and stored in the memory, while training a compact model only involves one path. Therefore, One-Shot and DARTS roughly need N times GPU memory and GPU hours compared to training a compact model."——这个N倍GPU内存我信,但是这个N倍的GPU hours我不信:1)学习率一定的话,因为每次只采样一个path,那更新所需的step数肯定也会成倍地增加,因此若不考虑硬件特性,训练所需的FLOPs也会成倍增加; 2)如果考虑硬件特性就复杂得多了,但是总得来说,因为使用GPU训练supernet时是可以并行计算的,因此对于每一个training step,训练N个path和单path之间的时间成本之比应该小于N。
  6. ProxylessNAS和FBNet的对比:1)arxiv版本号都是1812; 2)两者都实现了结构采样,减小了显存占用,但ProxylessNAS对于结构参数的梯度是用了近似处理,而FBNet则是使用Gumbel-Max和Gumbel-Softmax技巧;
  7. "In BinaryConnect (Courbariaux et al., 2015), the real-valued weight is updated using the gradient w.r.t. its corresponding binary gate. In our case, analogously, the gradient w.r.t. architecture parameters can be approximately estimated using∂L/∂g_i in replace of ∂L/∂p_i"——为何可以用∂L/∂g_j替代∂L/∂p_j ?——BinaryConnect 中说道——"One way to picture all this is to hypothesize that what matters most at the end of training is the sign of the weights","BinaryConnect's noise is a binary sampling process.","we can view discretization into a small number of values as a form of noise"——SGD的目的就是沿着梯度指示的方向行走,以这样的视角来看的话,不难理解SGD本身对噪声的容忍程度较高,行走方向允许有较大的偏差。——对于BinaryConnect而言,其binary sampling相当于用对原向量进行sign近似,即各个维度只取正负号; 对于ProxylessNAS而言,其multinomial samping得到的one-hot向量指示了原向量的最大方向,这种one-hot近似似乎误差比sign近似大很多,但经过多次sampling,也能取得和后者接近的近似效果。——形象地说,sign近似好比沿着象限的中线进行近似,one-hot近似则是沿着坐标轴近似,但是多次沿着坐标轴移动也能到达象限的中线。
  8. "However, computing ∂L/∂g_j requires to calculate and store o_j(x)."——我不理解公式(4):1)为何要有求N个path的梯度之和?只计算被采样到的那条不就好了吗?这样也不必扯到后面那个采样两条的近似方法了。2)另外,GDAS似乎就是只计算采样到的那条path的梯度,看看代码是怎么搞的。——**1)**涉及到N条path是因为对于α_i不仅会出现在p_i的分子上,还会出现在其它p_j的分母上,因此会涉及所有path。**2)**对于权重参数,无论是forward还是backward,GDAS都只考虑采样到的那条path(参考maxpooling原理),对于结构参数,会涉及所有op对应的结构参数,原因与ProxylessNAS相同。
  9. 附录D说的不用采样两个path来近似的算法又是啥意思?——附录D的意思就是以时间换空间。本来激活值o_j(x)是要保存在计算图中,以待backward时计算梯度∂L/∂g_j用,但是附录D换了个方法,改成不保存o_j(x),只在计算梯度∂L/∂g_j的时候即时计算o_j(x),再根据式(9)计算出∂L/∂g_j,然后即时释放o_j(x)所占用的内存。论文的意思我是看懂了,但是这部分的代码实现我还是看不懂
  10. 关于ProxylessNAS的可微分采样的代码实现,我有几处看不懂:1)为何代码中出现torch.multinomial( )这个采样操作却没有出现不可导的问题?——在loss.backward()之后调用函数set_arch_param_grad( ),从而按照式(4)来修改梯度,即用∂L/∂g_j来替换∂L/∂p_j,这是ProxylessNAS实现可微分采样的核心技术2)为何two和full的实现中,不同path的结果是直接相加?这个可和论文中要求的不一样啊。——其实是和原文一样的,训练权重参数的时候,要求只采样一条path,这里虽然把sample到和没sample到的path都相加了,但是没sample到的path对应的g_i也就是AP_path_wb[_i]为0,因此在数值上相当于没相加,这里执行相加操作是为了计算各个路径的∂L/∂g_j。**3)**g是one-hot向量,因此大部分g_j=0,那么∂L/∂g_j怎么求得出来?——函数在某处为0但导数不为0很常见,比如y=2x。式(9)也给出了如何∂L/∂g_j,可以看出与g_j无关。4)附录D中所说的即时释放o_j(x)的内存是如何实现的?——代码中计算o_j(x)的代码为"out_k = candidate_opsk",其中".data"使得requires_grad属性变为False,这之后的计算不被autograd跟踪,也就是被剥离出了计算图。**5)**为何fullv2的实现当中,无论foward还是backward函数都要使用detach后的x作为输入?——是因为要即时计算得到的o_j(x)要即时释放?——应该不是,调用backward_func时以及backward_func内部都使用了.data来使得后续计算剥离出计算图,已经达到了即时释放o_j(x)的目的了,我搞不明白为何要使用detach操作来将输入x剥离出原计算图
  11. "λ_1||w||^2 is the weight decay term"——公式(7)是学习结构参数对应的loss,为何会有weight decay项?——若这里的w表示权重参数,则多余; 若w表示结构参数,则如代码所示λ_1=0。
  12. "The GPU latency is measured on V100 GPU with a batch size of 8 (single batch makes GPU severely under-utilized)."——在GPU上的latency是以batch size为8测量的。在CPU和mobile上则是1。
  13. 似乎只有MnasNet是用即时测量的方式得到网络的latency,其它的NAS方法都是靠查表+线性相加的方式。
  14. "We sampled 5k architectures from our candidate space, where 4k architectures are used to build the latency model and the rest are used for test. We measured the latency on Google Pixel 1 phone using TensorFlow-Lite. The features include (i) type of the operator (ii) input and output feature map size (iii) other attributes like kernel size, stride for convolution and expansion ratio."——用这5000个采样得到的结构求解latency model(线性模型)的参数,最后利用这个latency model制成lookup table,这个表中枚举了所有可能的block及其对应的lantency。
  15. 从代码中可以看出,对于模型在mobile上的latency是查表得到的,但是对于CPU和GPU上的则是即时测量的。
  16. ProxylessNAS在ImageNet分类任务上的搜索空间:1)"parser.add_argument('--n_cell_stages', type=str, default='4,4,4,4,4,1') parser.add_argument('--stride_stages', type=str, default='2,2,2,1,2,1')"——总共有21可searchable MBConv; 2)以ProxylessNAS-Mobile为例,最终有20个MBConv,但是第一个MBConv是固定配置的而非搜索得到的,而21个searched MBConv中有2个是ZeroLayer。
  17. 为何更新结构参数的时候采样两个而不是一个?是因为只有一个的话误差太大了吗?——应该是的,使用两个比一个更能近似原始的梯度。
  18. 有一说一,更新结构参数的时候采样两个path,有那么一点EA中使用tournament selection策略时将tournament size设置为2的感觉。
  19. ProxylessNAS在不同平台下搜索到的结构的各自特点。——参考这篇回答:1)GPU prefers shallow and wide model with early pooling; 2)CPU prefers deep and narrow model with late pooling; 3)Pooling layers prefer large and wide kernel; 4)Early layers prefer small kernel了;5)Late layers prefer large kernel
  20. "As illustrated in Eq. (3) and Figure 2, by using the binary gates rather than real-valued path weights(Liu et al., 2018c), only one path of activation is active in memory at runtime and the memory requirement of training the over-parameterized network is thus reduced to the same level of training a compact model. That’s more than an order of magnitude memory saving."——与DARTS的对比。但是这有个问题,每次只更新一个path,相比于DARTS多个path一起更新的方式,在学习率相同的情况下,所需要的step是不是会更多?假如是,相比DARTS这类方法,ProxylessNAS所需要的epoch会不会更多?学习率会不会更大?FBNet似乎有提到过这个问题,结合所有sampling的方法,观察一下它们训练supernet的实验协议,琢磨一下这个问题。似乎搜索空间不同没法比,用SNAS和DARTS比较似乎更合适。
  21. "T is the target latency and w is a hyperparameter for controlling the trade-off between accuracy and latency."——有w控制trade-off的话哪还需要T作为target?这个问题和我对MnasNet里的objective的问题是一样的。
  22. 还有哪些是每次仅采样一个arch进行权重更新的,调研整理一下,对比一下search time和acc 。
  23. "Additionally, we also present a REINFORCE-based (Williams, 1992) algorithm as an alternative strategy to handle hardware metrics."——这个REINFORCE-based的优化标记一下。可否通过Proxyless的两种解法从而对比RL-based和Gradient-based的优劣?
  24. 未搞完的问题:11-5)、20、21、22、23。
  25. 尝试着总结ProxylessNAS:1)可微分采样是ProxylessNAS提出的一项关键技术,这个技术使得其相对于DARTS大大地降低了GPU memory(GPU hours是否因这里的可微分采样而降低还待定,因为这里涉及更新次数变多的问题),5、6、7、8、9、10集中了我对这方面的思考。
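To make points 7-10 concrete, here is a minimal sketch (my own simplified code, not the ProxylessNAS repo) of a mixed edge with binary gates: one path is sampled with torch.multinomial, the loss is backpropagated to the gates, and the gate gradients are folded into the architecture-parameter gradients via the softmax Jacobian, i.e. ∂L/∂α_i = Σ_j (∂L/∂g_j)·p_j·(1{i=j} − p_i), as in Eq. (4). Unlike the real implementation, this sketch evaluates every candidate op and therefore saves no memory; it only shows the gradient bookkeeping.

```python
import torch
import torch.nn as nn

class MixedEdgeSketch(nn.Module):
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture parameters

    def forward(self, x):
        probs = torch.softmax(self.alpha, dim=0)
        idx = torch.multinomial(probs, 1).item()          # sample one path
        gates = torch.zeros(len(self.ops))
        gates[idx] = 1.0                                   # binary gates g
        self.gates = gates.requires_grad_(True)            # will receive dL/dg_j
        self.probs = probs.detach()
        # every path is summed, but unsampled gates are 0, so numerically only
        # the sampled path contributes (cf. the "full" mode discussed above)
        return sum(self.gates[j] * op(x) for j, op in enumerate(self.ops))

    def set_alpha_grad(self):
        # dL/dalpha_i = sum_j dL/dg_j * p_j * (delta_ij - p_i)   (Eq. (4))
        dg, p = self.gates.grad, self.probs
        n = len(p)
        self.alpha.grad = torch.stack(
            [torch.sum(dg * p * ((torch.arange(n) == i).float() - p[i])) for i in range(n)]
        )

ops = [nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 5, padding=2), nn.Identity()]
edge = MixedEdgeSketch(ops)
out = edge(torch.randn(2, 8, 16, 16))
out.pow(2).mean().backward()   # populates edge.gates.grad (and the sampled op's weight grads)
edge.set_alpha_grad()          # now an optimizer step on edge.alpha could be taken
```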
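And a sketch of the lookup-table-based latency term from points 13-14 (the op names and latency numbers below are made up for illustration; the real table is fitted from the 5k profiled architectures): the expected latency Σ_j p_j·latency_j is differentiable in α and can simply be added to the loss, which is the role λ_2·E[latency] plays in the Proxyless-G objective.

```python
import torch

# Hypothetical per-op latencies (ms) for one searchable block, as if read from
# the fitted latency model / lookup table.
LATENCY_MS = {"mbconv3_k3": 4.1, "mbconv6_k5": 9.7, "skip": 0.2}
OP_NAMES = list(LATENCY_MS)

def expected_latency(alpha):
    """E[latency] = sum_j softmax(alpha)_j * latency_j, differentiable w.r.t. alpha."""
    probs = torch.softmax(alpha, dim=0)
    lat = torch.tensor([LATENCY_MS[name] for name in OP_NAMES])
    return torch.sum(probs * lat)

alpha = torch.zeros(len(OP_NAMES), requires_grad=True)
task_loss = torch.tensor(1.0, requires_grad=True)   # stand-in for the CE loss
total = task_loss + 0.1 * expected_latency(alpha)   # 0.1 plays the role of lambda_2
total.backward()                                    # alpha.grad now reflects the latency term
```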
  1. Calling DARTS-style methods attention-based here is spot on. One-shot originates from SMASH, and weight sharing plus alternating training originate from ENAS; seen that way, DARTS's biggest contribution is introducing architecture parameters as attention.
  2. "Normally, a Neural Architecture Search (NAS) pipeline comprises architecture sampling, parameter learning, architecture validation, credit assignment and search direction update."——What are credit assignment and search direction update?
  3. How do these five steps map onto existing NAS algorithms?
  4. "maintaining the completeness and differentiability of the NAS pipeline"——It does seem hard to drop any one of the five steps listed above.
  5. "We prove that this search gradient optimizes the same objective as reinforcement-learning-based NAS, but assigns credits to structural decisions more efficiently."——Why can this be shown to optimize the same objective as RL-based NAS?——See the answer under point 7 (and the score-function identity after this list). What is the search gradient here?——The gradient of the loss with respect to the architecture parameters. What is credit assignment for?——The credits are the coefficients of the gradient terms used to update the policy in policy gradient (i.e. the advantage of each action); expanding the search gradient reveals the same mathematical structure as the policy gradient, and the coefficient of each gradient term is naturally the credit.
  6. "Due to the pervasive non-linearity in neural operations, it introduces untractable bias to the loss function. This bias causes inconsistency between the performance of derived child networks and converged parent networks, thus parameter retraining comes up as necessary."——How does the pervasive non-linearity in neural networks end up forcing retraining? Eqs. (18) and (19) in Appendix B explain it mathematically: once the NAS search space is expressed as a DAG, the NAS objective can be written as Eq. (17); taking the cell in Figure 6 as an example, its objective is Eq. (18), while DARTS's objective is Eq. (19). Because the operator O(x) is nonlinear, Eqs. (18) and (19) are not equivalent.
  7. On the one hand "We reformulate NAS with a new stochastic modeling to bypass the MDP assumption in reinforcement learning.", on the other hand "From a global view, we prove that SNAS optimizes the same objective as reinforcement-learning-based NAS, except the training loss is used as reward." How can one bypass the MDP while still optimizing the same objective as RL-based NAS? My feeling is that it is not literally the same objective but one evolved from the RL-based NAS objective: 1) dropping the MDP assumption factorizes the trajectory distribution into a product over the per-operator distributions; 2) using the loss as the reward couples the weight parameters and the architecture parameters in a single objective, though that part is really DARTS's innovation over ENAS.
  8. The very premise of RL-based NAS, modeling the network construction process as an MDP, may be questionable. "In terms of how to parameterize and factorize p(Z), SNAS is built upon the observation that NAS is a task with fully delayed rewards in a deterministic environment. That is, the feedback signal is only ready after the whole episode is done and all state transition distributions are delta functions. Therefore, a Markov Decision Process assumption as in ENAS may not be necessary."——This passage says that an acc is handed out as reward only after the whole network is built and evaluated, so the MDP assumption in ENAS is rather superfluous. My understanding: building the network step by step never actually involves receiving feedback from the environment and choosing the next action accordingly (the actions are mutually independent, which is why p(Z) factorizes into a product), so it is a poor fit for a sustained agent-environment interaction; instead, [sample a whole architecture - obtain its evaluation acc] is closer to a single step of interaction with the environment (the factorization of the whole trajectory's distribution into a product of independent per-operator distributions reflects exactly this). Which raises a question: ProxylessNAS has an RL variant for optimizing the architecture parameters; which parameterization of p(Z) does it use?
  9. The paper notes that RL-based NAS uses a constant reward from the environment (acc) as feedback, while SNAS reformulates the whole optimization to use the differentiable loss as the reward, so that the architecture parameters and weight parameters are folded into one optimization (which is really what DARTS does; SNAS refines it and phrases it in a way that looks more like RL NAS). For a **deterministic environment (the state resulting from an action is deterministic)**, the latter does seem the better fit, but this still does not settle which of the two flavors of NAS is superior. Reread the papers and code of ENAS, DARTS, and SNAS to deepen the understanding.
  10. "However, to avoid the sampling process and gradient back-propagation through discrete random variables, DARTS takes analytical expectation at the input of each node over operations at incoming edges and optimizes a relaxed loss with deterministic gradients."——Facing the same difficulty that sampling is hard to differentiate through, DARTS chooses continuous relaxation, which causes several problems (see point 12); SNAS instead uses the Gumbel-Softmax trick to make sampling from the discrete distribution differentiable. What about ProxylessNAS?
  11. So far RL-based NAS seems to differ from gradient-based NAS in two ways: 1) it adopts an MDP assumption that looks superfluous; 2) its reward is a constant handed back by the environment. These two differences may be the right axis along which to compare the two. To understand the impact of the second point I first need to understand the score function estimator. The paper seems to touch on why the MDP is superfluous, roughly that TD learning ends up amplifying error; study this properly.
  12. The author's Zhihu column post explains the two problems caused by DARTS's continuization and how they are handled: "If L were linear in every Z, (19) and (18) would be equivalent. But the designed ReLU-Conv-BN stacks introduce nonlinearity, so the two objectives are not equivalent. In other words, DARTS's continuous approximation introduces a large bias. On the one hand this means the final optimization result has no theoretical guarantee, which is why single-level (first-order) optimization performs underwhelmingly; on the other hand, because the continuous approximation has no constraint pushing it toward discreteness, the child network obtained by deleting low-weight edges and operations cannot preserve the accuracy of the whole parent network at training time. Liu et al. propose bi-level (second-order) optimization, a form of gradient-based meta learning, to address the first problem; for the second problem no automated solution is given, and instead hand-crafted rules are used to pick edges and operations, build the child network, and retrain it."
  13. "This renders SNAS a differentiable version of evolutionary-strategy-based NAS."为何SNAS还跟EA-based NAS有关?引的参考文献是这篇08年的paper——Natural evolution strategies之后专攻EA算法了可以回头看看这个。SNAS这篇文章真牛逼,同时联系了RL-based NAS、EA-based NAS和Gradient-based NAS三种。
  14. "Thus for each structural decision, no delayed reward exists, the credits assigned to it are valid from the beginning. This proves why SNAS is more efficient than ENAS." 似乎第2.3节可以解释为何SNAS不存在delayed reward?实验结果上的对比如何体现ENAS不如SNAS更efficient的明明ENAS的search cost低SNAS很多诶
  15. 第2.4节为何又扯上policy gradients了?——以latency约束为例,测得的latency其实相当于强化学习中环境给的一个reward,因此用policy gradient来优化这部分的目标是合理的。但是,其它方法用的是什么方法来优化latency目标?为何SNAS一定要用policy gradient?
  16. " In ENAS, proximalpolicy optimization (PPO) (Schulman et al., 2017) is used to optimize the architecture policy, whichdistributes credits with TD learning and generalized advantage estimator (GAE) (Schulman et al.,2015). However, as the reward of NAS task is only obtainable after the architecture is finalized andthe network is tested for accuracy, it is a task with delayed rewards. " ——要理解SNAS中所说的这段话,我需要先再仔细看看ENAS的paper和code来验证以上这段话。——关于credit assignment,credit就相当于policy gradient中的advatage; 关于delay reward,可以参考ENAS笔记的第5点。
  17. 博客中的动图图7,为何反向传播会涉及所有路径,不是应该只有被采样的路径吗?——因为SNAS和DARTS在前向和反向传播时的计算图类似,都涉及整个supernet(参考下一点和最后一点)。
  18. 似乎SNAS并不能像ProxylessNAS那样节省显存?——确实不能。Gumbel-Softmax终究还是Softmax,它相比于于普通的Softmax就是多个temperature和Gumbel random variable,因此训练的时候还是整个supernet一起训练。
  19. 似乎Gumbel-Softmax trick在训练过程中就应该逐渐地降低temperature:“一开始\tau大一点,可以帮助收敛,而当\tau小的时候,更近似离散分布。”SNAS论文或者代码中有提到这样训练吗?——代码里看确实是这样,刚开始是1,最后是0.33,随着训练逐epoch均匀减小。
  20. 公式(10)中的C(O_i,j)是和ProxylessNAS一样用lookup table实现的吗?——是的,直接写在*operation.py*文件里作为每个op对象的属性。
  21. "DARTS manually selects two inputs for each intermediate nodes, thus the topology is inconsistent with that in the training stage."——SNAS如何derive最终的结构?会人工强制要求每个node仅有两条输入边吗?——下一点解释了。
  22. "In the code from Liu et al. (2019),zerois omitted inchild graphderivation as empirically it tends to learnthe largest weight."——似乎SNAS是靠zero这个operator来确定丢哪条边的,而DARTS则是忽略zero op的weight,用强制选择两条输入边来确定丢边。——从代码来看确实如此。
  23. 看代码SNAS是如何实现single-level optimization的?DARTS的bi-level优化需要交替在training set 和 validation set 上分别优化权重参数结构参数,两个set上的loss轮流反向传播权重参数的梯度和结构参数的梯度并更新参数。SNAS的single-level优化就全在training set上更新参数,一次反向传播同时计算出权重参数的梯度和结构参数的梯度并更新参数。
  24. 对比DARTS,搜索、评估的实验协议。
  25. 为什么SNAS比ENAS和DARTS搜索需要更多的时间?——先对比一下实验协议
  26. "Bilevel optimization could be regarded as a data-drivenmeta-learning method to resolve the bias proved above, whose bias from the exact meta-learning objective is still unjustified due to the ignorance of separate child network derivation scheme."——如何理解?或许我应该再研究附录B和DARTS中推导出一阶和二阶优化部分来理解这段话。
  27. 为何其它的sampling arch式的NAS不同样采用single-level optimization?一般的强化学习(如打游戏)是bi-level还是single-level?GAN据说和强化学习相通,GAN是bi-level的,那RL-based NAS是不是bi-level的?
  28. SNAS如何设置resource-efficient constraint?"This global constraint could be linearly decomposed for structural decisions, hence the proof of SNAS's efficiency still applies." 这里说的linearly decomposed是啥
  29. 代码里'policy_gradient'和'discrete'这两种优化方式如何理解?
  30. 学习DSNAS。
  31. 似乎可以这样总结SNAS——在实现上把DARTS中的softmax用Gumbel-Softmax替换了就是SNAS(从代码来看确实就是这样),但在理论上SNAS有着更为深刻的基础:1)相较于ENAS,SNAS保留了与RL-based NAS相同形式的objective,但舍弃了多余的MDP假设,并引入包含了结构参数和权重参数的loss作为reward而非环境所返回的常数acc作为reward; 2)相较于DARTS采用continous relaxation的方式解决计算图中离散变量分布采样不可微分的问题,SNAS使用Gumbel-Softmax trick来实现可微分,由此绕开了continous relaxation导致的内在偏置(参考第5点)和必须重训练(原因参考paper第7页的第一段)的问题。
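A one-line identity that makes points 5 and 7 concrete (the standard score-function / REINFORCE derivation, my own paraphrase rather than a quote from the paper): writing the expected training loss over architectures Z sampled from p_α,

```latex
\nabla_{\alpha}\,\mathbb{E}_{Z\sim p_{\alpha}}\!\left[L(Z,w)\right]
  \;=\; \mathbb{E}_{Z\sim p_{\alpha}}\!\left[\,L(Z,w)\,\nabla_{\alpha}\log p_{\alpha}(Z)\,\right]
```

which has exactly the policy-gradient form, with the loss L playing the role of the (negative) reward; SNAS optimizes this same expected objective but estimates its gradient by differentiating through the Gumbel-Softmax relaxation of Z instead of using this estimator directly.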
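And a minimal Gumbel-Softmax sketch for points 18-19 (the 14×8 edge/op layout and the 1.0 → 0.33 linear schedule are assumptions based on the notes above, not taken from the SNAS code):

```python
import torch
import torch.nn.functional as F

alpha = torch.zeros(14, 8, requires_grad=True)   # 14 edges x 8 candidate ops per cell

def sample_arch_weights(alpha, tau):
    # One soft one-hot vector per edge; as tau -> 0 the sample approaches discrete.
    return F.gumbel_softmax(alpha, tau=tau, hard=False, dim=-1)

epochs = 150
for epoch in range(epochs):
    tau = 1.0 - (1.0 - 0.33) * epoch / (epochs - 1)   # anneal temperature 1.0 -> 0.33
    z = sample_arch_weights(alpha, tau)
    # z is used exactly like DARTS' softmax weights: each edge outputs
    # sum_j z[edge, j] * op_j(x), so the whole supernet is still evaluated,
    # which is why SNAS gets no ProxylessNAS-style memory savings.
```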