arXiv cs.AR Daily Update
cs.AR on April 14, 2026: 27 paper updates in total:
- 15 new submissions: Fault Tolerance (LLM-PRISM [5], Strix [6], [2]), LLM Inference (EdgeCIM [14], CUTEv2 [15], [10]), Energy Efficiency ([1], [2], [10]), EDA (CapBench [13], [3], [9]), 3D Vision (L-PCN [8], [4])
- 4 cross-listed submissions: Graph Neural Network (FlexVector [16]), LLM Inference (WaveTune [17]), GPU Computing (WaveTune [17]), Optimization (WaveTune [17]), Photonic Computing ([18])
- 8 replacement submissions: AI Sustainability ([20], [27]), LLM Inference ([21], [27]), RTL Verification (FireBridge [23], CoverAssert [24]), DNN Deployment (FireBridge [23], FILCO [25]), Heterogeneous Computing (FireBridge [23], FILCO [25])
Overall trend: today's papers focus mainly on LLM Inference, Fault Tolerance, and Energy Efficiency.
Accepted papers: [2] (ISEDA 2026), [6] (DAC 2026), [7] (DAC 2026), [8] (ISCA 2026), [10] (ISPASS 2026), [11] (ISEDA 2026), [13] (DAC 2026), [15] (DAC 2026), [19] (DAC 2026), [26] (DAC 2026)
New Submissions (15)
[1] Sustainable Transformer Neural Network Acceleration with Stochastic Photonic Computing
- arXiv: 2604.09759
- Authors: S. Afifi, O. Alo, I. Thakkar, S. Pasricha
- Subjects: cs.AR; cs.LG
- Tags: Photonic Computing, Energy Efficiency
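- Summary: This paper presents ASTRA, the first photonic accelerator to use stochastic computing for Transformer inference. A novel optical stochastic multiplier and a crosstalk-minimizing organization enable efficient dynamic tensor computation, yielding a 7.6x speedup and 1.3x lower energy consumption than existing accelerators.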
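The core primitive of stochastic computing fits in a few lines: a value in [0, 1] is encoded as the probability of a 1 in a random bitstream, and multiplying two independent streams reduces to a bitwise AND. The sketch below is a generic electronic, unipolar illustration assumed for clarity; it does not model ASTRA's optical stochastic multiplier or its crosstalk-minimizing organization, and the function names are hypothetical.

```python
import random

def to_bitstream(p: float, length: int, rng: random.Random) -> list[int]:
    """Encode a probability p in [0, 1] as a unipolar stochastic bitstream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(a: float, b: float, length: int = 4096, seed: int = 0) -> float:
    """Approximate a * b by ANDing two independent unipolar bitstreams."""
    rng = random.Random(seed)
    sa = to_bitstream(a, length, rng)
    sb = to_bitstream(b, length, rng)
    # For independent streams, P(bit_a AND bit_b) = a * b.
    ones = sum(x & y for x, y in zip(sa, sb))
    return ones / length

estimate = sc_multiply(0.5, 0.25, length=100_000, seed=42)
print(f"{estimate:.4f}")  # close to 0.125; the exact digits depend on the seed
```

Accuracy grows only with stream length (the standard error shrinks as 1/sqrt(length)), which is the usual latency-versus-precision trade-off of stochastic arithmetic.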
[2] Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators
- arXiv: 2604.09994
- Authors: Tong Xie, Zuodong Zhang, Chao Yang, Yuan Wang, Runsheng Wang, Meng Li
- Subjects: cs.AR
- Tags: Fault Tolerance, Energy Efficiency
- Venue: ISEDA 2026
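- Summary: This paper develops an accurate aging-prediction framework and a fault-tolerant voltage-scaling strategy that exploits the inherent resilience of DNNs to defer voltage boosting. Experiments show that aging degradation is reduced by 30.6% for PMOS and 45.8% for NMOS, together with 14% power savings.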
[3] Late Breaking Results: CHESSY: Coupled Hybrid Emulation with SystemC-FPGA Synchronization
- arXiv: 2604.10093
- Authors: Lorenzo Ruotolo, Giovanni Pollo, Mohamed Amine Hamdi, Matteo Risso, Yukai Chen, Enrico Macii, Massimo Poncino, Sara Vinco, Alessio Burrello, Daniele Jahier Pagliari
- Subjects: cs.AR
- Tags: EDA
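- Summary: This paper introduces an open-source framework that couples a SystemC virtual platform with FPGA emulation, enabling full-system co-simulation of digital and non-digital components. It achieves up to a 2500x speedup over RTL simulation while keeping total simulation time within 2x of pure FPGA emulation.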
[4] A 129FPS Full HD Real-Time Accelerator for 3D Gaussian Splatting
- arXiv: 2604.10223
- Authors: Fang-Chi Chang, Tian-Sheuan Chang
- Subjects: cs.AR; cs.GR; eess.IV
- Tags: 3D Vision, Circuit Design
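- Summary: This paper proposes a low-power, low-cost hardware accelerator for 3D Gaussian Splatting, paired with a hardware-friendly compression pipeline that achieves 51.6x model compression. In TSMC 28nm, it renders 1080p at 129 FPS in real time with 5.98x smaller area than existing accelerators.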
[5] LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
- arXiv: 2604.10390
- Authors: Abhishek Tyagi, Saurabh Hukerikar, Nirmal Saxena, Yanxiang Huang, Philip Shirvani, Chung-Hsuan Tung, Yuhao Zhu
- Subjects: cs.AR
- Tags: Fault Tolerance, LLM Training
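- Summary: This paper proposes LLM-PRISM, a methodology that characterizes the resilience of LLM pretraining to hardware faults using RTL-level GPU fault simulation and a stochastic injection engine embedded in Megatron-LM. It finds that faults in critical datapaths and specific precision formats can cause catastrophic divergence even at moderate fault rates.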
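To make "permanent fault" concrete, a single stuck-at bit on a float32 datapath can be emulated in software. This is a simplified bit-level sketch assumed for illustration; LLM-PRISM itself relies on RTL-level GPU fault simulation and an injection engine inside Megatron-LM, which this does not reproduce. All helper names are hypothetical.

```python
import struct

def float_to_bits(x: float) -> int:
    """Reinterpret a float32 value as its 32-bit IEEE-754 pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def stuck_at(x: float, bit: int, value: int) -> float:
    """Model a permanent stuck-at-0 or stuck-at-1 fault on one bit of a float32."""
    b = float_to_bits(x)
    b = b | (1 << bit) if value else b & ~(1 << bit)
    return bits_to_float(b)

# A stuck-at-1 fault in the top exponent bit turns 1.0 into +inf,
# while the same fault in the lowest mantissa bit is almost harmless.
print(stuck_at(1.0, 30, 1))  # inf
print(stuck_at(1.0, 0, 1))   # 1.0000001192092896
```

The contrast between bit 30 and bit 0 is one simple way to see why the impact of a fault depends so strongly on which datapath bit and which number format it hits.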
[6] Strix: Re-thinking NPU Reliability from a System Perspective
- arXiv: 2604.10484
- Authors: Jiapeng Guan, Jie Zhang, Hao Zhou, Ran Wei, Dean You, Hui Wang, Yingquan Wang, Tinglue Wang, Xudong Zhao, Jing Li, Zhe Jiang
- Subjects: cs.AR
- Tags: Fault Tolerance
- Venue: DAC 2026
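- Summary: This paper presents Strix, a full-stack NPU reliability framework built on an open-source SoC that spans the microarchitecture, ISA, and programming methodology. By repartitioning the NPU and attaching targeted protection mechanisms, it achieves sub-microsecond fault localization plus error detection and correction, with only a 1.04x slowdown.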
[7] From Characterization to Microarchitecture: Designing an Elegant and Reliable BFP-Based NPU
- arXiv: 2604.10494
- Authors: Jie Zhang, Jiapeng Guan, Hao Zhou, Xiaomeng Han, Tinglue Wang, Ran Wei, Zhe Jiang
- Subjects: cs.AR
- Tags: Fault Tolerance
- Venue: DAC 2026
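- Summary: This paper presents the first in-depth reliability study of block floating point (BFP) based NPUs. RTL-level fault-injection analysis reveals significant heterogeneous vulnerability, and the resulting fault-tolerant microarchitecture achieves reliability close to dual modular redundancy with only 3.55% performance overhead.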
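For readers new to BFP: the format stores one shared exponent per block and a short integer mantissa per element. The sketch below assumes 8-bit signed mantissas and a shared exponent derived from the block's largest magnitude; it is a generic illustration, not the paper's actual format or NPU datapath, and the function names are hypothetical.

```python
import math

def bfp_quantize(block: list[float], mantissa_bits: int = 8) -> tuple[int, list[int]]:
    """Quantize a block to one shared exponent plus signed integer mantissas."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp returns (m, e) with max_abs = m * 2**e and 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round each element to the shared grid and clamp to the signed range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp: int, mantissas: list[int], mantissa_bits: int = 8) -> list[float]:
    """Rebuild approximate values from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]

print(bfp_quantize([0.5, -1.25, 3.0, 0.01]))  # (2, [16, -40, 96, 0])
```

Here 0.01 underflows to zero because it must share the exponent of 3.0. Since every element is scaled by `shared_exp`, a corrupted exponent would rescale the whole block, which offers one plausible intuition for uneven fault sensitivity across BFP fields. That intuition is an assumption, not a finding reported above.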
[8] L-PCN: A Point Cloud Accelerator Exploiting Spatial Locality through Octree-based Islandization
- arXiv: 2604.10716
- Authors: Yiming Gao, Jieming Yin, Yuxiang Wang, Xiangru Chen, Zhilei Chai, Bowen Jiang, Jiliang Zhang, Herman Lam
- Subjects: cs.AR
- Tags: 3D Vision
- Venue: ISCA 2026
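- Summary: This paper proposes L-PCN, a point cloud accelerator that exploits spatial locality through octree-based islandization and hub scheduling to reduce redundant operations. An FPGA prototype delivers an additional 1.2x to 3.2x speedup on top of existing PCN accelerators.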
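The summary does not give enough detail to reproduce L-PCN's islandization. As a hedged illustration of the underlying idea (grouping points that share an octree cell so that neighbors are processed together), the sketch below buckets points by a Morton (Z-order) code at a fixed depth. The function names and the unit-cube bound are assumptions.

```python
from collections import defaultdict

def morton_code(ix: int, iy: int, iz: int, depth: int) -> int:
    """Interleave the bits of integer cell coordinates (x in the lowest slot)."""
    code = 0
    for level in range(depth):
        code |= ((ix >> level) & 1) << (3 * level)
        code |= ((iy >> level) & 1) << (3 * level + 1)
        code |= ((iz >> level) & 1) << (3 * level + 2)
    return code

def group_by_octree_cell(points, depth: int, bound: float = 1.0):
    """Bucket points in [0, bound)^3 by their octree leaf cell at `depth`."""
    cells = 1 << depth  # cells per axis at this octree depth
    groups = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        ix, iy, iz = (min(int(c / bound * cells), cells - 1) for c in (x, y, z))
        groups[morton_code(ix, iy, iz, depth)].append(idx)
    # Sorting by Morton code visits spatially adjacent cells consecutively.
    return dict(sorted(groups.items()))

pts = [(0.1, 0.1, 0.1), (0.12, 0.11, 0.1), (0.9, 0.9, 0.9), (0.6, 0.1, 0.1)]
print(group_by_octree_cell(pts, depth=1))  # {0: [0, 1], 1: [3], 7: [2]}
```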
[9] EMSpice 3: Full-chip Temperature-Aware Multiphysics Electromigration and IR-Drop Analysis
- arXiv: 2604.10743
- Authors: Haotian Lu, Sheldon X.-D. Tan
- Subjects: cs.AR
- Tags: EDA
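- Summary: This paper presents EMSpice 3, a full-chip, temperature-aware multiphysics framework for coupled analysis of electromigration, thermomigration, and IR drop in power grids. It integrates an extended rational Krylov model-order-reduction method to accelerate large-scale simulation, reducing runtime by 1.18x-1.50x.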
[10] The xPU-athalon: Quantifying the Competition of AI Acceleration
- arXiv: 2604.10852
- Authors: Alicia Golden, Carole-Jean Wu, Gu-Yeon Wei, David Brooks
- Subjects: cs.AR
- Tags: LLM Inference, Energy Efficiency
- Venue: ISPASS 2026
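- Summary: This paper quantitatively compares AI accelerators (Cerebras, SambaNova, Groq, Gaudi, and TPU) against NVIDIA and AMD GPUs, analyzing trade-offs among latency, throughput, power, and energy efficiency. It finds that the optimal hardware platform varies with batch size, sequence length, and model size.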
[11] Automated SVA Generation with LLMs
- arXiv: 2604.11044
- Authors: Lik Tung Fu, Qihang Wang, Shaokai Ren, Mengli Zhang, Sichao Yang, Jun Liu, Xi Wang
- Subjects: cs.AR
- Tags: RTL Verification
- Venue: ISEDA 2026
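- Summary: This paper proposes SVA Generator, a framework that uses LLMs to translate natural-language descriptions into executable SystemVerilog Assertions. Using AST-constrained injection and an automated supervision pipeline, it raises the semantic equivalence rate by 22.7 percentage points over general-purpose LLMs at deep complexity levels.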
[12] Technology solutions targeting the performance of gen-AI inference in resource constrained platforms
- arXiv: 2604.11128
- Authors: Joyjit Kundu, Joshua Klein, Aakash Patel, Dwaipayan Biswas
- Subjects: cs.AR
- Tags: LLM Inference, Edge Computing
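- Summary: Using a hierarchical roofline-based performance model, this paper evaluates how two emerging technologies, high-bandwidth storage (HBS) and bonded global-buffer memory dies, affect generative AI inference performance on resource-constrained platforms, and analyzes the bandwidth and latency needed to reach acceptable interactive throughput.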
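A hierarchical roofline bounds runtime by whichever resource saturates first: the compute units or any level of the memory hierarchy. The sketch below assumes that simple max-of-bounds form and uses made-up numbers for a hypothetical 1B-parameter INT8 model in single-batch decode. It is not the paper's calibrated model and ignores latency effects.

```python
def roofline_time(flops: float, bytes_per_level: dict[str, float],
                  peak_flops: float, bw_per_level: dict[str, float]) -> float:
    """Lower-bound runtime: the slowest of compute and every memory level."""
    compute_t = flops / peak_flops
    memory_t = max(bytes_per_level[lvl] / bw_per_level[lvl] for lvl in bytes_per_level)
    return max(compute_t, memory_t)

# Hypothetical single-batch decode step for a 1B-parameter INT8 model.
t = roofline_time(
    flops=2e9,                                   # ~2 FLOPs per weight per token
    bytes_per_level={"dram": 1e9, "sram": 2e9},  # assumed traffic per level (bytes)
    peak_flops=10e12,                            # assumed 10 TFLOP/s
    bw_per_level={"dram": 100e9, "sram": 1e12},  # assumed bandwidths (B/s)
)
print(f"{1 / t:.0f} tokens/s")  # DRAM-bound: about 100 tokens/s
```

With these assumptions the DRAM term (0.01 s) dominates, so raising off-chip bandwidth would help far more than adding compute.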
[13] CapBench: A Multi-PDK Dataset for Machine-Learning-Based Post-Layout Capacitance Extraction
- arXiv: 2604.11202
- Authors: Hector R. Rodriguez, Jiechen Huang, Wenjian Yu
- Subjects: cs.AR; cs.LG
- Tags: EDA
- Venue: DAC 2026
- Code: code
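- Summary: This paper presents CapBench, a fully reproducible multi-PDK dataset for capacitance extraction. It is derived from open-source designs, covers three technology nodes, and contains 61,855 3D windows. Across 10 evaluated machine-learning architectures, CNNs achieve the lowest error (1.75%), while GNNs are 41.4x faster but less accurate (10.2% error).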
[14] EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models
- arXiv: 2604.11512
- Authors: Jinane Bazzi, Mariam Rakka, Fadi Kurdahi, Mohammed E. Fouda, Ahmed Eltawil
- Subjects: cs.AR; cs.AI
- Tags: LLM Inference, Compute-in-Memory, Edge Computing
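- Summary: This paper proposes EdgeCIM, a hardware-software co-design framework for compute-in-memory (CIM) acceleration of small language models on edge devices. With CIM macros implemented in 65nm, it achieves 7.3x higher throughput and 49.59x better energy efficiency than the NVIDIA Orin Nano on LLaMA3.2-1B.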
[15] CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead
- arXiv: 2604.11615
- Authors: Jinpeng Ye, Chongxi Wang, Wenqing Li, Bin Yuan, Shiyi Wang, Fenglu Zhang, Junyu Yue, Jianan Xie, Yunhao Ye, Haoyu Deng, Yingkun Zhou, Xin Cheng, Fuxin Zhang, Jian Wang
- Subjects: cs.AR; cs.AI; cs.DC; cs.LG
- Tags: Circuit Design, LLM Inference
- Venue: DAC 2026
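- Summary: This paper proposes a unified, configurable matrix extension for CPUs that decouples the matrix unit from the CPU pipeline for low-overhead integration. Evaluated on four open-source CPU RTL platforms, it sustains over 90% matrix-unit utilization on GEMM workloads and delivers significant speedups on ResNet, BERT, and Llama3.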
Cross-listed Submissions (4)
[16] FlexVector: A SpMM Vector Processor with Flexible VRF for GCNs on Varying-Sparsity Graphs
- arXiv: 2604.10113 (cross-listed)
- Authors: Bohan Li, Shengmin Li, Xinyu Shi, Enyi Yao, Francky Catthoor, Simei Yang
- Subjects: cs.DC; cs.AR
- Tags: Graph Neural Network
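- Summary: This paper proposes FlexVector, a vector processor architecture that accelerates sparse-dense matrix multiplication (SpMM) in GCN inference using a row-wise product dataflow and a flexible vector register file (VRF), combined with a graph-aware preprocessing strategy. On real-world GCN datasets, it achieves a 3.78x speedup and 40.5% lower energy consumption.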
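The row-wise product (Gustavson-style) dataflow mentioned above can be illustrated in plain Python: each output row is built by scaling and accumulating the dense rows selected by that row's nonzeros. This software sketch assumes CSR input and a hypothetical function name; it does not model FlexVector's flexible VRF or its graph-aware preprocessing.

```python
def spmm_rowwise(indptr, indices, values, dense):
    """C = A @ B with A in CSR form and B dense, using a row-wise product dataflow.

    Each output row C[i, :] accumulates sum_k A[i, k] * B[k, :],
    so only the nonzeros in row i of A are touched.
    """
    n_rows = len(indptr) - 1
    n_cols = len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        acc = out[i]
        for p in range(indptr[i], indptr[i + 1]):
            k, a_ik = indices[p], values[p]
            row_k = dense[k]
            for j in range(n_cols):
                acc[j] += a_ik * row_k[j]
    return out

# A = [[1, 0, 2], [0, 0, 0], [0, 3, 0]] in CSR form; B is 3x2 and dense.
C = spmm_rowwise([0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [[1, 2], [3, 4], [5, 6]])
print(C)  # [[11.0, 14.0], [0.0, 0.0], [9.0, 12.0]]
```

Empty rows (such as row 1 here) cost nothing, and the per-row work scales with that row's nonzero count, which is why varying graph sparsity makes register-file sizing a design concern.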
[17] WaveTune: Wave-aware Bilinear Modeling for Efficient GPU Kernel Auto-tuning
- arXiv: 2604.10187 (cross-listed)
- Authors: Kaixuan Zhang, Chutong Ding, Shiyou Qian, Luping Wang, Jian Cao, Guangtao Xue, Cheng Huang, Guodong Yang, Liping Zhang
- Subjects: cs.PF; cs.AR
- Tags: LLM Inference, GPU Computing, Optimization
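- Summary: This paper proposes WaveTune, a wave-aware framework for GPU kernel auto-tuning that predicts kernel latency using a unified mapping method and an analytical wave-aware bilinear model. In LLM inference, it achieves up to a 1.83x kernel-level speedup and a 1.33x reduction in end-to-end TTFT.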
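A "wave" here refers to one round of thread blocks that the GPU can run concurrently across its SMs. The sketch below shows the standard wave-count arithmetic and the resulting tail-wave underutilization. The 132-SM count and blocks-per-SM values are assumptions, and this is not WaveTune's bilinear latency model.

```python
def num_waves(grid_blocks: int, num_sms: int, blocks_per_sm: int) -> int:
    """Rounds of thread blocks needed when each SM holds `blocks_per_sm` at once."""
    slots_per_wave = num_sms * blocks_per_sm
    return -(-grid_blocks // slots_per_wave)  # integer ceiling division

def wave_efficiency(grid_blocks: int, num_sms: int, blocks_per_sm: int) -> float:
    """Fraction of block slots doing useful work across all waves."""
    slots = num_waves(grid_blocks, num_sms, blocks_per_sm) * num_sms * blocks_per_sm
    return grid_blocks / slots

for blocks in (132, 133, 264):
    print(blocks, num_waves(blocks, 132, 1), f"{wave_efficiency(blocks, 132, 1):.2f}")
# 132 1 1.00
# 133 2 0.50
# 264 2 1.00
```

A single extra block adds a nearly empty second wave, so latency can jump in steps as tile sizes change. Step effects of this kind are what make a purely linear latency model a poor fit for kernel tuning.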
[18] Harnessing Photonics for Machine Intelligence
- arXiv: 2604.10841 (cross-listed)
- Authors: Hanqing Zhu, Shupeng Ning, Hongjian Zhou, Ziang Yin, Ray T. Chen, Jiaqi Gu, David Z. Pan
- Subjects: cs.AI; cs.AR; cs.ET; cs.LG
- Tags: Photonic Computing, AI Sustainability
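- Summary: This paper surveys photonic computing for machine intelligence from a circuits-and-systems perspective and lays out a roadmap for cross-layer co-design and electronic-photonic design automation (EPDA) toward scalable photonic machine-intelligence systems.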
[19] Compiler Framework for Directional Transport in Zoned Neutral Atom Systems with AOD Assistance: A Hybrid Remote CZ Approach
- arXiv: 2604.11000 (cross-listed)
- Authors: Lingyi Kong, Chen Huang, Zhemin Zhang, Yidong Zhou, Xiangyu Ren, Shaochen Li, Zhiding Liang
- Subjects: cs.AR
- Tags: Quantum Computing, Quantum Compiler
- Venue: DAC 2026
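- Summary: This paper proposes a remote CZ gate based on directional transport, along with a compiler framework, for zoned neutral-atom quantum systems. Compared with an AOD-only baseline, it reduces entangling-stage duration by 50% to 90%.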
Replacement Submissions (8)
[20] Lifetime-Aware Design for Item-Level Intelligence at the Extreme Edge
- arXiv: 2509.08193 (replaced)
- Authors: Shvetank Prakash, Andrew Cheng, Olof Kindgren, Ashiq Ahamed, Graham Knight, Jed Kufel, Francisco Rodriguez, Arya Tschand, David Kong, Mariam Elgamal, Jerry Huang, Emma Chen, Gage Hills, Richard Price, Emre Ozer, Vijay Janapa Reddi
- Subjects: cs.AR; cs.AI; cs.ET
- Tags: Edge Computing, AI Sustainability, Low Power
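- Summary: This paper proposes FlexiFlow, a lifetime-aware design framework for item-level intelligence at the extreme edge. By weighing embodied against operational carbon footprint according to application-specific lifetimes, it reduces carbon footprint by up to 14.5x.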
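The embodied-versus-operational trade-off can be written as a simple cost model: total footprint equals manufacturing (embodied) carbon plus power × lifetime × grid carbon intensity. The sketch below assumes that linear model and entirely hypothetical design numbers to show how the preferred design flips as lifetime grows. It is not FlexiFlow's methodology, and the names are illustrative.

```python
def total_carbon_kg(embodied_kg: float, power_w: float, lifetime_h: float,
                    grid_kg_per_kwh: float) -> float:
    """Embodied carbon plus operational carbon accrued over the deployment lifetime."""
    operational_kg = power_w / 1000.0 * lifetime_h * grid_kg_per_kwh
    return embodied_kg + operational_kg

def best_design(designs: dict[str, tuple[float, float]], lifetime_h: float,
                grid_kg_per_kwh: float) -> str:
    """Pick the (embodied_kg, power_w) design with the lowest lifetime footprint."""
    return min(designs, key=lambda name: total_carbon_kg(*designs[name], lifetime_h, grid_kg_per_kwh))

# Hypothetical designs: (embodied carbon in kg, average power in W).
designs = {"low_embodied": (0.002, 0.010), "low_power": (0.020, 0.001)}
print(best_design(designs, lifetime_h=100, grid_kg_per_kwh=0.4))      # low_embodied
print(best_design(designs, lifetime_h=100_000, grid_kg_per_kwh=0.4))  # low_power
```

For short-lived items the embodied term dominates, so the cheaper-to-manufacture design wins. Over long lifetimes, operational carbon takes over and favors the lower-power design.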
[21] Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference
- arXiv: 2509.09505 (replaced)
- Authors: Haoran Wu, Can Xiao, Jiayi Nie, Xuan Guo, Binglei Lou, Jeffrey T. H. Wong, Zhiwen Mo, Cheng Zhang, Przemyslaw Forys, Chengyang Ai, Timi Adeniran, Wayne Luk, Hongxiang Fan, Jianyi Cheng, Timothy M. Jones, Rika Antonova, Robert Mullins, Aaron Zhao
- Subjects: cs.AR
- Tags: LLM Inference, Long Context, LLM Agent
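- Summary: This paper introduces PLENA, a hardware-software co-designed system that addresses memory bandwidth and capacity bottlenecks in agentic LLM inference through a flattened systolic array architecture, an asymmetric quantization scheme, and FlashAttention support.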
[22] VeriInteresting: An Empirical Study of Model Prompt Interactions in Verilog Code Generation
- arXiv: 2603.08715 (replaced)
- Authors: Luca Collini, Andrew Hennesee, Patrick Yubeaton, Siddharth Garg, Ramesh Karri
- Subjects: cs.AR; cs.CL
- Tags: Code Generation, Prompt Engineering, RTL Generation
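- Summary: This paper presents an empirical study of language models for Verilog code generation. It evaluates how different model classes respond to structured prompting and optimization strategies, and analyzes how model reasoning capability and degree of specialization interact with prompt-engineering strategies.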
[23] FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
- arXiv: 2603.25969 (replaced)
- Authors: G Abarajithan, Zhenghua Ma, Francesco Restuccia, Ryan Kastner
- Subjects: cs.AR
- Tags: RTL Verification, DNN Deployment, Heterogeneous Computing
- Code: code
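- Summary: This paper proposes FireBridge, a cycle-accurate hardware/firmware co-verification framework that compiles firmware to x86 and connects it to the simulated subsystem through a randomized memory bridge, enabling debug iterations 50x faster than conventional FPGA flows.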
[24] CoverAssert: Iterative LLM Assertion Generation Driven by Functional Coverage via Syntax-Semantic Representations
- arXiv: 2604.06607 (replaced)
- Authors: Yonghao Wang, Yang Yin, Hongqin Lyu, Jiaxin Zhou, Zhiteng Chao, Mingyu Shi, Wenchao Ding, Yunlin Du, Jing Ye, Tiancheng Wang, Huawei Li
- Subjects: cs.AR
- Tags: RTL Verification, Assertion Generation
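- Summary: This paper proposes CoverAssert, an iterative assertion-generation framework that clusters semantic and AST structural features and uses functional-coverage feedback to steer the LLM toward uncovered code points first, improving branch coverage by 9.57% and statement coverage by 9.64%.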
[25] FILCO: Flexible Composing Architecture with Real-Time Reconfigurability for DNN Acceleration
- arXiv: 2604.07523 (replaced)
- Authors: Xingzhen Chen, Jinming Zhuang, Zhuoping Yang, Shixin Ji, Sarah Schultz, Zheng Dong, Weisong Shi, Peipei Zhou
- Subjects: cs.AR
- Tags: DNN Deployment, Heterogeneous Computing, FPGA
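- Summary: This paper proposes FILCO, a flexible composing architecture with real-time reconfigurability that can either combine hardware resources into a single unified accelerator or partition them into multiple independent accelerators. Across diverse workloads, it improves throughput and hardware efficiency by 1.3x to 5x.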
[26] The Phantom of PCIe: Constraining Generative Artificial Intelligences for Practical Peripherals Trace Synthesizing
- arXiv: 2411.06376 (replaced)
- Authors: Zhibai Huang, Chen Chen, James Yen, Yihan Shen, Yongchen Xie, Zhixiang Wei, Kailiang Xu, Yun Wang, Fangxin Liu, Tao Song, Mingyuan Xia, Zhengwei Qi
- Subjects: cs.LG; cs.AI; cs.AR
- Tags: LLM Hallucination, Hardware Simulation
- Venue: DAC 2026
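- Summary: This paper proposes Phantom, a framework that pairs a generative AI backbone with a post-processing filter to enforce PCIe-specific constraints. This effectively eliminates hallucinations in AI-generated TLP sequences and yields up to 1000x improvement on task-specific metrics.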
[27] Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving
- arXiv: 2505.23970 (replaced)
- Authors: Yuyang Tian, Desen Sun, Yi Ding, Sihang Liu
- Subjects: cs.DC; cs.AR
- Tags: LLM Inference, AI Sustainability
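- Summary: This paper proposes GreenCache, a carbon-aware cache management framework that dynamically analyzes the correlation between carbon emissions and SLO attainment and reconfigures resources accordingly, cutting carbon emissions by 15.1% on average while meeting latency constraints.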
This post is licensed under CC BY 4.0 by the author.