"Those who gather firewood for everyone else should not be left to freeze in the wind and snow." (attributed to Lu Xun)
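Three years ago I put together an article, _Server BIOS Application Tuning: Low Latency, Virtualization, Databases, SDS, and CSP_, which mainly covered Dell PowerEdge 14G/15G servers. Now that the latest 16G generation, based on 4th Gen Intel Xeon Scalable processors, is out, what should we pay attention to when tuning it for performance?
This article draws mainly on the Dell technical document _BIOS Settings for Optimized Performance on Next-Generation Dell PowerEdge Servers_, with study notes added according to my own understanding and limited expertise.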
Table 1: BIOS setting recommendations: System profile settings
[1] Depends on how system was ordered. Other System Profile defaults are driven by this choice and may be different than the examples listed. Select Performance Profile first, and then select Custom to load optimal profile defaults for further modification
[2] SST Turbo Boost Technology is substantially better than previous generations for latency-sensitive environments, but specific Turbo residency cannot be guaranteed under all workload conditions. Evaluate Turbo Boost Technology in your own environment to choose which setting is most appropriate for your workload, and consider the Dell Controlled Turbo option in parallel.
[3] Monitor/Mwait should only be disabled in parallel with disabling Logical Processor. This will prevent the Linux intel\_idle driver from enforcing C-states.
[4] You can test your own environment to determine whether disabling Memory Patrol Scrub is helpful.
[5] Dynamic selection can provide more TDP headroom at the expense of dynamic uncore frequency. Optimal setting is workload dependent.
[6] Autonomous on Air Cooled system or Disabled on Liquid Cooled Systems
Note: the footnotes under the table matter a great deal, so I have reproduced them here.
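First, let's look at the three target scenarios (optimization directions):
1. Recommended setting for performance for HPC and SPECcpu speed environments
This targets high-performance computing and SPECcpu speed test environments, where a single copy of each benchmark runs. As I understand it, it also covers applications with relatively low thread counts.
2. Recommended setting for low latency, Stream, and MLC environments
This targets low-latency work, such as the Stream memory bandwidth test and MLC (Intel Memory Latency Checker) environments.
3. Recommended for general business/scientific throughput (for example, SPECcpu2017)
This targets general business and scientific-computing throughput (for example SPECcpu2017; presumably the throughput-oriented rate runs that load every core are meant here).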
Notes on selected settings
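First, leaving power savings and performance-per-watt aside, CPU Power Management and Memory Frequency are both recommended at Maximum Performance, Turbo Boost naturally stays enabled, and C1E is disabled.
As for C-States (with C1E off, my reading is that this setting here only covers the most basic C1 power-saving state), scenarios 1 and 2 both recommend Disabled. For scenario 3, which emphasizes multi-core performance, the recommendation is Disabled on liquid-cooled systems and Autonomous on air-cooled ones. Liquid cooling copes better with short Turbo excursions above the TDP; on air cooling, letting idle cores stay cooler should help the performance that follows (how long Turbo can be sustained).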
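Monitor/Mwait should only be disabled together with Logical Processor (Hyper-Threading), that is, when both are turned off at the same time. This applies to scenario 2, where memory bandwidth and latency matter. Per the footnote, disabling it prevents the Linux intel_idle driver from enforcing C-states.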
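To see from the operating system side whether intel_idle is actually in charge after changing Monitor/Mwait, one option is to read the Linux cpuidle sysfs tree. The sketch below is a minimal illustration, not part of the Dell document: the helper name `cpuidle_summary` is hypothetical, and it assumes the standard `/sys/devices/system/cpu` layout (`cpuidle/current_driver` and `cpu0/cpuidle/stateN/name`).

```python
from pathlib import Path


def cpuidle_summary(sysfs_root="/sys/devices/system/cpu"):
    """Return (driver, [idle state names]) from the Linux cpuidle sysfs tree.

    driver is None when no cpuidle driver file is present.
    """
    root = Path(sysfs_root)
    driver_file = root / "cpuidle" / "current_driver"
    driver = driver_file.read_text().strip() if driver_file.exists() else None

    states = []
    state_dir = root / "cpu0" / "cpuidle"
    if state_dir.is_dir():
        # Directories are named state0, state1, ...; sort them numerically.
        for d in sorted(state_dir.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):])):
            name_file = d / "name"
            if name_file.exists():
                states.append(name_file.read_text().strip())
    return driver, states


if __name__ == "__main__":
    driver, states = cpuidle_summary()
    print("cpuidle driver:", driver)
    print("idle states on cpu0:", states)
```

If the driver reported is not `intel_idle`, or the list of states is shorter than before the BIOS change, that is a hint the setting took effect; the exact output depends on kernel version and boot parameters.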
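Memory Patrol Scrub: like any memory-integrity enhancement, it can in theory affect performance. You can test whether disabling it helps in your environment.
Memory Refresh Rate stays at the baseline 1x; refreshing more often also costs performance.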
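Uncore Frequency: Maximum is recommended for scenarios 1 and 2, and Dynamic for scenario 3. Per the footnote, Dynamic can free up more TDP headroom, at the expense of an uncore frequency that now varies dynamically rather than staying at maximum; the best choice depends on the workload.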
CPU Interconnect Bus (UPI) Link Power Management and PCI ASPM L1 Link Power Management are both recommended Disabled.
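The recommendations described in the notes above can be collected into a small lookup, which is handy when auditing many servers. This is only a sketch of what the prose says: the dictionary keys and the function name `system_profile` are hypothetical labels chosen for readability, not literal BIOS attribute or racadm identifiers, and settings that need case-by-case evaluation (Monitor/Mwait, Memory Patrol Scrub) are deliberately left out.

```python
# Settings that the notes above recommend for all three scenarios.
# Key names are hypothetical, not real BIOS attribute identifiers.
COMMON = {
    "cpu_power_management": "Maximum Performance",
    "memory_frequency": "Maximum Performance",
    "turbo_boost": "Enabled",
    "c1e": "Disabled",
    "memory_refresh_rate": "1x",
    "upi_link_power_management": "Disabled",
    "pci_aspm_l1_link_power_management": "Disabled",
}


def system_profile(scenario, liquid_cooled=False):
    """Return the recommended settings for scenario 1, 2, or 3.

    1 = HPC / SPECcpu speed, 2 = low latency / Stream / MLC,
    3 = general business / scientific throughput.
    """
    if scenario not in (1, 2, 3):
        raise ValueError("scenario must be 1, 2, or 3")
    settings = dict(COMMON)
    if scenario in (1, 2):
        settings["c_states"] = "Disabled"
        settings["uncore_frequency"] = "Maximum"
    else:
        # Footnote [6]: Autonomous on air cooling, Disabled on liquid cooling.
        settings["c_states"] = "Disabled" if liquid_cooled else "Autonomous"
        settings["uncore_frequency"] = "Dynamic"
    return settings


if __name__ == "__main__":
    for key, value in system_profile(3, liquid_cooled=False).items():
        print(f"{key}: {value}")
```

Keeping the shared settings in one place makes the differences between scenarios easy to see: only C-States and Uncore Frequency change in this subset.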
Table 2: BIOS setting recommendations: Memory, processor, and iDRAC settings
[1] Use Optimizer Mode when Memory Bandwidth Sensitive, up to 33% BW reduction with Fault Resilient Mode.
[2] Only available when x4 DIMMS installed in the system.
[3] Logical Processor (Hyper Threading) tends to benefit throughput-oriented workloads such as SPEC CPU2017 INT and FP\_RATE. Many HPC workloads disable this option. This only benefits SPEC FP\_rate if the thread count scales to the total logical processor count.
[4] Dell Controlled Turbo helps to keep core frequency at the maximum all-cores Turbo frequency, which reduces jitter. Disable if Turbo disabled.
[5] Option is available on liquid cooled systems only.
[6] Depends on if your program is affected by Base and Turbo frequency. Will reduce CPU core count and give higher Base and Turbo frequencies.
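To get a feel for what footnote [1] means in absolute terms, the arithmetic can be worked through for one configuration. The numbers below are an illustrative assumption, not figures from the Dell document: eight DDR5-4800 channels per socket with a 64-bit (8-byte) data path each, and the full 33% reduction treated as the worst case.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, bus_bytes=8):
    """Theoretical peak memory bandwidth per socket in GB/s."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


def worst_case_fault_resilient_gbs(peak_gbs, max_reduction=0.33):
    """Bandwidth left if Fault Resilient Mode costs the full 'up to 33%'."""
    return peak_gbs * (1.0 - max_reduction)


if __name__ == "__main__":
    # Assumed configuration: 8 channels of DDR5-4800 per socket.
    peak = peak_bandwidth_gbs(4800, 8)
    floor = worst_case_fault_resilient_gbs(peak)
    print(f"theoretical peak: {peak:.1f} GB/s")
    print(f"worst case in Fault Resilient Mode: {floor:.1f} GB/s")
```

Measured Stream results sit below the theoretical peak in any mode, so this is best read as a rough upper bound on the gap rather than a prediction.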
Notes on selected settings
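Start with memory. Memory Node Interleave is recommended Disabled, which means NUMA is on; this assumes the application handles non-uniform memory access well. If there is a lot of cross-socket memory access (Oracle databases, for example), this setting is not appropriate.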
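With Node Interleave disabled, Linux exposes each NUMA node and its CPUs under sysfs, which is a quick way to confirm the BIOS change and to plan process pinning. The following is a minimal sketch under that assumption; the function names `parse_cpulist` and `numa_nodes` are hypothetical helpers, and the only interface relied on is the standard `/sys/devices/system/node/nodeN/cpulist` file.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} for every NUMA node found in sysfs."""
    root = Path(sysfs_root)
    nodes = {}
    if root.is_dir():
        for d in root.glob("node[0-9]*"):
            cpulist = d / "cpulist"
            if cpulist.exists():
                nodes[int(d.name[len("node"):])] = parse_cpulist(cpulist.read_text())
    return dict(sorted(nodes.items()))


if __name__ == "__main__":
    for node, cpus in numa_nodes().items():
        print(f"node{node}: {len(cpus)} CPUs")
```

A single node on a two-socket server would suggest that interleaving is still enabled.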
The advanced error-correction options, memory Self Healing and the ADDDC setting, are recommended Disabled, as is correctable-error logging; memory training time stays at Fast.
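Logical Processor (Hyper-Threading): it "tends to benefit throughput-oriented workloads such as SPEC CPU2017 INT and FP_RATE," and for FP_rate only when the thread count scales to the total logical processor count; "many HPC workloads disable this option." My sense is that few modern applications still need it disabled, although some virtualization hosts do turn HT off.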
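Because the benefit depends on the thread count matching the logical processor count, it helps to compute that target and to check whether SMT is really active. The sketch below assumes a Linux host that exposes `/sys/devices/system/cpu/smt/active`; the function names are hypothetical, and the 2-socket, 56-core figures in the usage line are only an example configuration.

```python
from pathlib import Path


def smt_active(sysfs_root="/sys/devices/system/cpu"):
    """Return True/False from the kernel's SMT flag, or None if unavailable."""
    flag = Path(sysfs_root) / "smt" / "active"
    if not flag.exists():
        return None
    return flag.read_text().strip() == "1"


def logical_processors(sockets, cores_per_socket, hyperthreading):
    """Thread count a throughput run must reach to use every logical CPU."""
    threads_per_core = 2 if hyperthreading else 1
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    print("SMT active:", smt_active())
    # Example configuration (an assumption): 2 sockets x 56 cores, HT on.
    print("target threads:", logical_processors(2, 56, True))
```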
The various prefetchers: most of the first four are recommended to stay enabled; only DCU Streamer Prefetcher is recommended Disabled in scenarios 2 and 3.
“DCU (Level 1 Data Cache) streamer prefetcher is an L1 data cache prefetcher. Lightly threaded applications and some benchmarks can benefit from having the DCU streamer prefetcher enabled. Default setting is Enabled.”
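XPT Prefetch and UPI Prefetch are recommended Enabled only in scenario 3 (the general-purpose one); LLC Prefetch defaults to Disabled and is recommended Enabled only in scenario 1.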
XPT prefetch is a mechanism that enables a read request that is being sent to the last level cache to speculatively issue a copy of that read to the memory controller prefetcher.
UPI prefetch is a mechanism to get the memory read started early on DDR bus. The UPI receive path will spawn a memory read to the memory controller prefetcher.
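Sub NUMA Cluster: the default is Disabled, meaning one NUMA domain per physical CPU socket, which keeps memory allocation from getting too complicated. When accesses between CPU cores and memory are well localized, though, enabling SNC can still improve performance.
As I mentioned in _AMD's Next-Generation EPYC Servers (Zen 2): From NUMA Back to SMP?_, Intel's memory controllers have formed two SNC domains since the Skylake architecture. For XCC parts, that is, the 4th Gen Xeon Scalable models built from four dies in a chiplet design, SNC 4 is recommended for scenarios 2 and 3. MCC parts can simply stay at SNC 2.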
Figure: the XCC vs. MCC comparison I listed in "Some Technical Thoughts on 4th Gen Intel Xeon Scalable".
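The effect of the SNC setting on what the operating system sees is simple arithmetic: each socket is split into as many NUMA domains as the SNC mode specifies. The sketch below assumes the modes are written as `"Disabled"`, `"SNC2"`, and `"SNC4"`; those strings and the function name are illustrative, not the exact BIOS option labels.

```python
# NUMA domains per socket for each mode (mode strings are assumed labels).
SNC_DOMAINS = {"Disabled": 1, "SNC2": 2, "SNC4": 4}


def expected_numa_nodes(sockets, snc_mode):
    """Number of NUMA nodes the OS should report for a given SNC mode."""
    if snc_mode not in SNC_DOMAINS:
        raise ValueError(f"unknown SNC mode: {snc_mode}")
    return sockets * SNC_DOMAINS[snc_mode]


if __name__ == "__main__":
    for mode in ("Disabled", "SNC2", "SNC4"):
        print(f"2 sockets, {mode}: {expected_numa_nodes(2, mode)} NUMA nodes")
```

Comparing this number with what `numactl --hardware` or the sysfs node directory reports is a quick check that the BIOS setting was applied.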
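Dell Controlled Turbo: recommended Enabled only in scenario 2. It helps hold core frequency at the maximum all-core Turbo frequency and reduces jitter, which suits low-latency applications. For more on Dell Processor Acceleration Technology, see _Low-Latency Applications and Server Turbo Boost: Can't Have Both?_
The new Dell Controlled Turbo Optimizer mode can only be enabled on liquid-cooled systems. So when cooling CPUs of the same wattage, liquid cooling does hold a small performance edge over air cooling in some scenarios.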
LLC Dead Line Allocation—In some Intel CPU caching schemes, mid-level cache (MLC) evictions are filled into the last level cache (LLC). If a line is evicted from the MLC to the LLC, the core can flag the evicted MLC lines as "dead." This means that the lines are not likely to be read again. This option allows dead lines to be dropped and never fill the LLC if the option is disabled. Values for this BIOS option can be:
- Disabled: Disabling this option can save space in the LLC by never filling MLC dead lines into the LLC.
- Enabled: Opportunistically fill MLC dead lines in LLC, if space is available.
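I have covered SST before; it stands for Intel Speed Select Technology.
Dynamic SST Perf Profile, as I understand it, lets you disable some CPU cores, depending on what the running program needs, in exchange for higher base and Turbo frequencies. That is also valuable for low-latency applications.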
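Finally, two more recommendations for the iDRAC settings:
1. In thermally challenging environments, raise the fan speed through the thermal options in iDRAC.
2. In performance-sensitive environments, remove every setting that caps power.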
I am hardly the expert here, but I hope these notes are useful to you.
Author: 企业存储技术
Source: 企业存储技术