arXiv cs.AR Daily Update

In cs.AR, 16 papers were updated on April 10, 2026:

  • 8 new submissions: LLM Inference (SHIELD [3], [1], [2]), Edge Computing (FILCO [4], [1], [7]), Energy Efficiency (SHIELD [3], [6], [7]), Circuit Design ([2], [6]), EDA ([5], [8])
  • 4 cross-listed submissions: Edge Computing (PG-MDP [12], [9]), Energy Efficiency (Wattlytics [11], [9]), Optimization (PG-MDP [12], [10]), LLM Inference ([9]), DNN Deployment ([10])
  • 4 replaced submissions: Energy Efficiency (DHFP-PE [14], [13], [16]), LLM Inference (DeepStack [15], [13]), High Performance Computing (DeepStack [15], [13]), Model Compression (DHFP-PE [14], [16]), Circuit Design (DHFP-PE [14])

Overall trend: today's papers focus mainly on LLM Inference, Energy Efficiency, and Edge Computing.

Accepted papers: [14] (NEleX-2026)

Open-source papers: none


New Submissions (8)

[1] Position Paper: From Edge AI to Adaptive Edge AI

  • arXiv: 2604.07360
  • Authors: Fabrizio Pittorino, Manuel Roveri
  • Subjects: cs.AR; cs.AI; cs.LG
  • Tags: Edge Computing, Continual Learning, LLM Inference
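  • Summary: This paper argues that edge AI must be adaptive in real-world deployment to cope with continual changes in data and operating conditions. The authors introduce the Agent-System-Environment (ASE) framework to define adaptivity at the edge precisely, and pose ten research challenges for the next decade.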

[2] Self-Calibrating LLM-Based Analog Circuit Sizing with Interpretable Design Equations

  • arXiv: 2604.07387
  • Authors: Antonio J. Bujana, Aydin I. Karsilayan
  • Subjects: cs.AR; cs.AI
  • Tags: Circuit Design, LLM Inference
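  • Summary: This paper proposes a self-calibrating framework that uses a large language model to derive topology-specific analytical design equations from circuit netlists for analog circuit sizing. The method meets all specifications across six OTA topologies and two process nodes, converging in only 2-9 simulations.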

[3] SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs

  • arXiv: 2604.07396
  • Authors: Jintao Zhang, Xuanyao Fong
  • Subjects: cs.AR; cs.LG
  • Tags: LLM Inference, Memory Architecture, Energy Efficiency
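  • Summary: This paper proposes SHIELD, a segmented hierarchical eDRAM architecture that reduces refresh energy for LLM inference on edge NPUs by exploiting the temporal residency and bit-level sensitivity of BF16 activations. It achieves a 35% energy reduction while preserving accuracy.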
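
As background for the bit-level sensitivity mentioned in [3], the sketch below shows that flipping different bits of a BF16 value changes it by very different amounts. It only illustrates the BF16 format (1 sign bit, 8 exponent bits, 7 mantissa bits) and is not SHIELD's refresh policy; the helper names `to_bf16_bits`, `from_bf16_bits`, and `flip_error` are hypothetical.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Return the BF16 bit pattern of x by truncating its float32 encoding.

    Assumption: simple truncation (keep the upper 16 bits), not round-to-nearest.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def from_bf16_bits(b: int) -> float:
    """Decode a 16-bit BF16 pattern by padding it back to a float32."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


def flip_error(x: float, bit: int) -> float:
    """Absolute change in value when one bit (0 = mantissa LSB, 15 = sign) flips."""
    b = to_bf16_bits(x)
    return abs(from_bf16_bits(b ^ (1 << bit)) - from_bf16_bits(b))


if __name__ == "__main__":
    for bit in (0, 6, 7, 15):
        print(bit, flip_error(1.0, bit))
```

For the value 1.0, a flip of the mantissa LSB (bit 0) moves the result by 2^-7, while a flip of the lowest exponent bit (bit 7) halves it.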

[4] FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration

  • arXiv: 2604.07523
  • Authors: Xingzhen Chen, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, Peipei Zhou
  • Subjects: cs.AR
  • Tags: DNN Deployment, Heterogeneous Computing, Edge Computing
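  • Summary: This paper proposes FILCO, a flexible composing architecture with real-time reconfigurability that efficiently matches diverse DNN workloads for optimal memory and compute resource efficiency. On the AMD Versal VCK190 platform, the design improves throughput and hardware efficiency by 1.3x-5x.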

[5] From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference

  • arXiv: 2604.07526
  • Authors: Ravindra Ganti, Steve Xu
  • Subjects: cs.AR; cs.LG
  • Tags: LLM Inference, Reinforcement Learning, EDA
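  • Summary: This paper proposes a reinforcement-learning-based compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference across process nodes from 3nm to 28nm. It explores the design space using Soft Actor-Critic with a Mixture-of-Experts gating mechanism.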

[6] Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration

  • arXiv: 2604.07628
  • Authors: Md Zesun Ahmed Mia, Jiahui Duan, Kai Ni, Abhronil Sengupta
  • Subjects: cs.AR; cs.ET; cs.NE
  • Tags: LLM Inference, Energy Efficiency, Circuit Design
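  • Summary: This paper proposes TrilinearCIM, a compute-in-memory architecture based on dual-gate FeFETs that realizes a three-operand multiply-accumulate primitive through back-gate modulation, completing Transformer attention computation without dynamic ferroelectric reprogramming. The design achieves up to 46.6% lower energy and a 20.4% latency improvement.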

[7] The Hyperscale Lottery: How State-Space Models Have Sacrificed Edge Efficiency

  • arXiv: 2604.07935
  • Authors: Robin Geens, Jonas De Schouwer, Marian Verhelst, Thierry Tambe
  • Subjects: cs.AR
  • Tags: LLM Inference, Edge Computing, Energy Efficiency
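  • Summary: This paper identifies a "hyperscale lottery" phenomenon, in which state-space models such as Mamba sacrifice edge efficiency in pursuit of cloud throughput. The authors show that the architectural changes in Mamba-3 increase latency on edge devices by 28-48%.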

[8] A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators

  • arXiv: 2604.08044
  • Authors: Cong Li, Chenhao Xue, Yi Ren, Xiping Dong, Yu Cheng, Yinbo Hu, Fujun Bai, Yixin Guo, Xiping Jiang, Qiang Wu, Zhi Yang, Zhe Cheng, Yuan Xie, Guangyu Sun
  • Subjects: cs.AR
  • Tags: LLM Inference, High Performance Computing, EDA
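  • Summary: This paper proposes ATLAS, the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, providing unified abstractions for system architecture and programming primitives. Compared against real silicon, the framework achieves a simulation error of ≤8.57% and a correlation of 97.26-99.96%.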

Cross-Listed Submissions (4)

[9] Reduced-Mass Orbital AI Inference via Integrated Solar, Compute, and Radiator Panels

  • arXiv: 2604.07760 (cross-listed)
  • Authors: Stephen Gaalema, Samuel Indyk, Clinton Staley
  • Subjects: cs.DC; cs.AR
  • Tags: LLM Inference, Edge Computing, Energy Efficiency
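  • Summary: This paper describes a distributed computing architecture for orbital compute satellites that integrates solar cells, radiators, and compute into multiple small panels. The design can deliver >100 kW of compute power per launched ton, supporting large-scale LLM inference.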

[10] Optimization of 32-bit Unsigned Division by Constants on 64-bit Targets

  • arXiv: 2604.07902 (cross-listed)
  • Authors: Shigeo Mitsunari, Takashi Hoshino
  • Subjects: cs.PL; cs.AR
  • Tags: Optimization, DNN Deployment
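  • Summary: This paper proposes an optimization of 32-bit unsigned division by constants for 64-bit CPUs, improving on the Granlund-Montgomery method. The implementation has been merged into the LLVM main branch and achieves 1.67x-1.98x speedups on Intel Xeon and Apple M4.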
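
As background for [10], the multiply-and-shift idea behind division by constants can be sketched as follows. This is a minimal illustration of the textbook round-up construction in the Granlund-Montgomery style, not the paper's improved method and not LLVM's implementation; the function names `magic_for` and `div_by_const` are hypothetical.

```python
def magic_for(d: int, bits: int = 32) -> tuple:
    """Return (multiplier, shift) such that, for every unsigned n < 2**bits,
    n // d == (n * multiplier) >> shift.

    Textbook round-up construction: shift = bits + ceil(log2(d)),
    multiplier = ceil(2**shift / d).
    """
    if not 1 <= d < (1 << bits):
        raise ValueError("divisor out of range")
    log2_ceil = (d - 1).bit_length()       # ceil(log2(d)); 0 when d == 1
    shift = bits + log2_ceil
    multiplier = -(-(1 << shift) // d)     # ceiling division on integers
    return multiplier, shift


def div_by_const(n: int, multiplier: int, shift: int) -> int:
    """Divide using one multiplication and one shift instead of a division."""
    # Python integers are arbitrary precision, so the wide product never overflows here.
    return (n * multiplier) >> shift


if __name__ == "__main__":
    m, s = magic_for(7)
    print(m, s, div_by_const(1_000_000, m, s), 1_000_000 // 7)
```

For d = 7 the multiplier is 4908534053, which needs 33 bits, so its product with a 32-bit dividend can exceed 64 bits. Python hides that, but fixed-width hardware registers do not.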

[11] Wattlytics: A Web Platform for Co-Optimizing Performance, Energy, and TCO in HPC Clusters

  • arXiv: 2604.08182 (cross-listed)
  • Authors: Ayesha Afzal, Georg Hager, Gerhard Wellein
  • Subjects: cs.DC; cs.AR; cs.ET; cs.PF
  • Tags: High Performance Computing, Energy Efficiency, GPU Computing
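  • Summary: This paper proposes Wattlytics, an interactive browser-based platform for co-optimizing GPU performance, energy, and total cost of ownership (TCO) in HPC clusters. The tool integrates benchmark-driven performance scaling, DVFS-aware power modeling, and multi-year TCO analysis.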

[12] PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores

  • arXiv: 2604.08445 (cross-listed)
  • Authors: Luke Panayi, Johan Jino, Sebastian S. Kim, Alberto Ros, Alexandra Jimborean, Jim Whittaker, Martin Berger, Paul Kelly
  • Subjects: cs.PL; cs.AR
  • Tags: Edge Computing, Optimization
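  • Summary: This paper proposes PG-MDP, a profile-guided memory dependence prediction method for area-constrained cores that uses software co-design to label loads with no memory dependence, reducing MDP queries by 79%. It achieves a 1.47% IPC improvement with no area overhead.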

Replaced Submissions (4)

[13] Rethinking Compute Substrates for 3D-Stacked Near-Memory LLM Decoding: Microarchitecture-Scheduling Co-Design

  • arXiv: 2604.04253 (replaced)
  • Authors: Chenyang Ai, Yixing Zhang, Haoran Wu, Yudong Pan, Lechuan Zhao, Wenhui OU
  • Subjects: cs.AR
  • Tags: LLM Inference, High Performance Computing, Energy Efficiency
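  • Summary: This paper proposes a compute microarchitecture for 3D-stacked near-memory LLM decoding, using reconfigurable systolic arrays and a multi-core scheduling framework. Compared with Stratum, the design achieves a 2.91x speedup and a 2.40x energy-efficiency improvement.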

[14] DHFP-PE: Dual-Precision Hybrid Floating Point Processing Element for AI Acceleration

  • arXiv: 2604.04507 (replaced)
  • Authors: Shubham Kumar, Vijay Pratap Sharma, Vaibhav Neema, Santosh Kumar Vishvakarma
  • Subjects: cs.AR; cs.RO; eess.AS; eess.IV
  • Tags: Energy Efficiency, Circuit Design, Model Compression
  • Venue: NEleX-2026
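  • Summary: This paper proposes a dual-precision floating-point MAC processing element supporting the FP8 and FP4 formats, using a novel bit-partitioning technique for efficient hardware utilization. In a 28nm process, the design achieves a 60.4% area reduction and 86.6% power savings.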

[15] DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators

  • arXiv: 2604.04750 (replaced)
  • Authors: Zhiwen Mo, Guoyu Li, Hao Mark Chen, Yu Cheng, Zhengju Tang, Qianzhou Wang, Lei Wang, Shuang Liang, Lingxiao Ma, Xianqi Zhou, Yuxiao Guo, Wayne Luk, Jilong Xue, Hongxiang Fan
  • Subjects: cs.AR; cs.DC
  • Tags: LLM Inference, High Performance Computing, EDA
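  • Summary: This paper proposes DeepStack, a performance modeling tool for early-stage design space exploration of distributed 3D-stacked AI systems, with fine-grained memory semantics and comprehensive parallelization strategies. The framework is 100,000x faster than existing simulators at comparable accuracy.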

[16] Hardware Efficient Approximate Convolution with Tunable Error Tolerance for CNNs

  • arXiv: 2603.10100 (replaced)
  • Authors: Vishal Shashidhar, Anupam Kumari, Roy P Paily
  • Subjects: cs.LG; cs.AI; cs.AR
  • Tags: Edge Computing, Energy Efficiency, Model Compression
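  • Summary: This paper proposes a "soft sparsity" paradigm that uses an MSB proxy to skip negligible non-zero multiplications in CNNs, implemented as a RISC-V custom instruction. The method cuts MAC operations by 88.42% with zero accuracy loss and achieves 35.2% power savings.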
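
The summary of [16] does not spell out the exact skipping rule, so the sketch below assumes one plausible reading: the magnitude of each product is estimated from the most significant bit positions of its two integer operands, and the multiplication is skipped when that estimate falls below a tunable threshold. The function name `approx_dot`, the `threshold_bits` parameter, and the rule itself are assumptions for illustration, not the paper's instruction semantics.

```python
def approx_dot(acts, weights, threshold_bits):
    """Integer dot product that skips products estimated to be negligible.

    Assumed proxy: bit_length(|a|) + bit_length(|w|) upper-bounds the bit
    length of |a * w|, so a small sum means a small product.
    Returns (approximate_sum, multiplications_performed).
    """
    total = 0
    performed = 0
    for a, w in zip(acts, weights):
        if a == 0 or w == 0:
            continue                      # ordinary (hard) sparsity
        estimate = abs(a).bit_length() + abs(w).bit_length()
        if estimate < threshold_bits:
            continue                      # "soft" skip: non-zero but negligible
        total += a * w
        performed += 1
    return total, performed


if __name__ == "__main__":
    acts = [0, 1, 100, -3]
    weights = [5, 2, -7, 64]
    print(approx_dot(acts, weights, 0))   # no soft skipping
    print(approx_dot(acts, weights, 6))   # skips the small 1 * 2 product
```

Raising `threshold_bits` trades accuracy for fewer multiplications, which is the tunable error tolerance in this illustrative version.
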
This post is licensed under CC BY 4.0 by the author.
