Intel Memory: Key Technologies Explained

Shared by 天下 on 2025/2/15 17:43:25

Independent Channel Mode

In Independent Channel Mode, all four channels may be populated in any order and have no matching requirements. All channels must run at the same interface frequency, but individual channels may run at different DIMM timings (RAS latency, CAS latency, and so forth).

Lockstep Channel Mode

In Lockstep Channel Mode, each memory access is a 128-bit data access that spans a channel pair: Channel 0 with Channel 1, or Channel 2 with Channel 3. Lockstep Channel Mode is the only RAS mode that allows SDDC for x8 devices. It requires that Channel 0 and Channel 1, and likewise Channel 2 and Channel 3, be populated identically with regard to size and organization. DIMM slot populations within a channel do not have to be identical, but the same DIMM slot location across Channel 0 and Channel 1, and across Channel 2 and Channel 3, must be populated the same.

Mirrored Channel Mode

In Mirrored Channel Mode, the memory contents are mirrored between Channel 0 and Channel 2, and also between Channel 1 and Channel 3. As a result of the mirroring, the total physical memory available to the system is half of what is populated. Mirrored Channel Mode requires that Channel 0 and Channel 2, and likewise Channel 1 and Channel 3, be populated identically with regard to size and organization. DIMM slot populations within a channel do not have to be identical, but the same DIMM slot location across Channel 0 and Channel 2, and across Channel 1 and Channel 3, must be populated the same.
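The pairing rules for Lockstep and Mirrored modes can be checked mechanically. The following is a minimal sketch, not firmware logic: the `Dimm` tuple layout, the `Population` mapping, and all function names are assumptions made for illustration. It compares slot-by-slot population across the channel pairs named above and computes the usable capacity under mirroring.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical DIMM descriptor: (capacity in GiB, ranks, device width such as 4 or 8).
Dimm = Tuple[int, int, int]
# Channel number -> list of slots; None marks an empty slot.
Population = Dict[int, List[Optional[Dimm]]]

LOCKSTEP_PAIRS = [(0, 1), (2, 3)]  # channel pairs from the Lockstep section
MIRROR_PAIRS = [(0, 2), (1, 3)]    # channel pairs from the Mirrored section


def pairs_match(pop: Population, pairs: List[Tuple[int, int]]) -> bool:
    """True if each slot position holds the same DIMM in both channels of every pair."""
    return all(pop[a] == pop[b] for a, b in pairs)


def populated_gib(pop: Population) -> int:
    """Total installed capacity across all channels."""
    return sum(d[0] for slots in pop.values() for d in slots if d is not None)


def mirrored_usable_gib(pop: Population) -> int:
    """Capacity visible to the system in Mirrored mode: half of what is populated."""
    if not pairs_match(pop, MIRROR_PAIRS):
        raise ValueError("population does not satisfy the mirroring rule")
    return populated_gib(pop) // 2


# Example population: slots differ within a channel, which is allowed,
# but slot N of Channel 0 equals slot N of Channel 2 (and 1 equals 3).
example = {
    0: [(16, 2, 4), (8, 1, 8)],
    1: [(16, 2, 4), None],
    2: [(16, 2, 4), (8, 1, 8)],
    3: [(16, 2, 4), None],
}
```

With this example, `pairs_match(example, MIRROR_PAIRS)` holds while `pairs_match(example, LOCKSTEP_PAIRS)` does not, because Channel 0 and Channel 1 differ in their second slot.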

Rank Sparing Mode

In Rank Sparing Mode, one rank is a spare of the other ranks on the same channel. The spare rank is held in reserve and is not available as system memory. The spare rank must have identical or larger memory capacity than all the other ranks (sparing source ranks) on the same channel. After sparing, the sparing source rank will be lost.

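With memory sparing, the memory reserved as the hot spare is not used during normal operation, so the system does not see that capacity. One DIMM in each memory channel is left unused and reserved as the spare. The chipset holds a threshold for memory check errors, expressed as the number of errors per unit of time. When the error count of a working DIMM reaches this fault-tolerance threshold, the system starts writing every update twice, once to the primary memory and once to the spare. Once the system detects that the two copies hold the same data, the spare takes over from the primary memory and the faulty memory is disabled. This completes the handover from the failing memory to the spare and helps prevent data loss or a system crash caused by the memory fault. The capacity of the spare must be greater than or equal to that of the largest DIMM on its channel, so that it can hold the largest possible data migration.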
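The sparing sequence (count errors, reach a threshold, copy to the spare, verify, switch over) can be sketched as a toy model. Everything here is an assumption for illustration: the class name, the dictionary-based memory contents, and the use of a plain error count instead of an errors-per-unit-time rate. Real chipsets implement this in hardware.

```python
class SparingChannel:
    """Toy model of one channel with several active ranks and one spare rank."""

    def __init__(self, rank_sizes, spare_size, threshold):
        # The spare must be at least as large as every rank it may replace.
        if spare_size < max(rank_sizes):
            raise ValueError("spare must be at least as large as every source rank")
        self.visible_capacity = sum(rank_sizes)      # the spare is not counted
        self.data = [dict() for _ in rank_sizes]     # per-rank contents: address -> value
        self.errors = [0] * len(rank_sizes)          # corrected-error counters
        self.backing = [f"rank{i}" for i in range(len(rank_sizes))]
        self.spare_free = True
        self.threshold = threshold

    def write(self, rank, addr, value):
        self.data[rank][addr] = value

    def report_corrected_error(self, rank):
        """Count one corrected error; return True if this triggered a failover."""
        self.errors[rank] += 1
        if self.errors[rank] >= self.threshold and self.spare_free:
            self._fail_over(rank)
            return True
        return False

    def _fail_over(self, rank):
        spare_copy = dict(self.data[rank])           # copy contents to the spare
        if spare_copy != self.data[rank]:            # verify both copies agree
            raise RuntimeError("spare copy does not match the source rank")
        self.data[rank] = spare_copy                 # the spare now serves this rank
        self.backing[rank] = "spare"
        self.errors[rank] = 0
        self.spare_free = False                      # only one spare per channel


ch = SparingChannel([8, 8, 8], spare_size=8, threshold=3)
ch.write(1, 0x10, 0xAB)
events = [ch.report_corrected_error(1) for _ in range(3)]
```

In this run the third reported error reaches the threshold, so the data of rank 1 ends up served from the spare while the visible capacity stays unchanged.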

Memory Scrubbing

Each memory location must be checked periodically, and often enough that multiple bit errors within the same word are unlikely to have accumulated. With typical (as of 2008) ECC memory modules, single-bit errors can be corrected but multiple-bit errors cannot.

To avoid disturbing regular memory requests from the CPU and degrading performance, scrubbing is usually done only during idle periods. Because scrubbing consists of normal read and write operations, it may increase memory power consumption compared with non-scrubbing operation. For that reason, scrubbing is performed periodically rather than continuously. On many servers, the scrub period can be configured in the BIOS setup program.

The normal memory reads issued by the CPU or DMA devices are also checked for ECC errors, but because of data locality they may be confined to a small range of addresses, leaving other memory locations untouched for a very long time. Those locations can accumulate more than one soft error, whereas scrubbing guarantees that the whole memory is checked within a bounded time.
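A small simulation can illustrate why periodic scrubbing matters. This is a rough sketch under simplifying assumptions that are not from the source: soft errors hit uniformly random words, every hit is a distinct bit, and a scrub pass repairs any word holding exactly one error. The function name and parameters are made up for the example.

```python
import random


def uncorrectable_words(words, steps, flips_per_step, scrub_every=None, seed=0):
    """Count words that end with two or more pending bit errors.

    Such words are beyond what single-error correction can repair.
    """
    rng = random.Random(seed)      # fixed seed: both runs see the same soft errors
    pending = [0] * words          # uncorrected bit errors per word
    for t in range(1, steps + 1):
        for _ in range(flips_per_step):
            pending[rng.randrange(words)] += 1
        if scrub_every and t % scrub_every == 0:
            # A scrub pass reads every word; single-bit errors are
            # corrected and written back, multi-bit errors stay.
            pending = [0 if n == 1 else n for n in pending]
    return sum(1 for n in pending if n >= 2)


without_scrub = uncorrectable_words(1000, 200, 2)
with_scrub = uncorrectable_words(1000, 200, 2, scrub_every=10)
```

Because both runs draw the same error sequence, the scrubbed run can never end with more uncorrectable words than the unscrubbed one; shortening `scrub_every` narrows the window in which a second error can land on an already damaged word.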

Key Info: 1) Soft errors are an important reason for doing memory scrubbing.

2) Error detection and correction is the general theory behind memory scrubbing.

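ECC Technology

In the early 1990s, memory systems used parity checking. Parity memory adds one extra bit to every byte (8 bits) for error detection. The parity-checking logic sums the data bits being written to memory and stores the result in the parity bit. For example, if a byte holds the value 10011110, the sum of its bits is odd (1+0+0+1+1+1+1+0 = 5), so the parity bit is set to 1. When the CPU reads the stored data, the 8 stored data bits are summed again and the result is compared with the parity bit. If the two differ, the system raises an error. Parity checking can only detect memory errors coarsely; it cannot correct them.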
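The parity rule described above fits in a few lines. This is a minimal sketch with made-up function names; it follows the convention of the example, where the parity bit is 1 when the number of 1 bits in the byte is odd.

```python
def parity_bit(byte: int) -> int:
    """Return 1 if the byte contains an odd number of 1 bits, else 0."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte: int, stored_parity: int) -> bool:
    """Recompute the parity on read and compare it with the stored bit."""
    return parity_bit(byte) == stored_parity


value = 0b10011110            # the byte from the example: five 1 bits
stored = parity_bit(value)    # parity bit written alongside the data

one_flip = value ^ 0b00000100     # a single bit changed in memory
two_flips = value ^ 0b00000110    # two bits changed in memory
```

A single flipped bit changes the recomputed parity and is detected, while two flipped bits cancel out and pass the check, which is why parity alone can only flag some errors and cannot locate or repair them.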

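Another memory protection technique is ECC (Error Correcting Code). It also works by adding extra bits to the original data bits, and the added bits are used to reconstruct erroneous data. In an ECC scheme, if the data is N bytes wide, the number of added ECC bits is log2(N) + 5. For example, 64-bit data (N = 8 bytes) needs log2(8) + 5 = 8 ECC bits.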
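The check-bit rule log2(N) + 5 can be evaluated for a few common widths. The helper below is only a worked example of that formula; the function name is made up, and the restriction to power-of-two byte counts is an assumption added so that the logarithm is an integer.

```python
def ecc_bits(data_bytes: int) -> int:
    """ECC bits for a data width of `data_bytes` bytes, using log2(N) + 5."""
    if data_bytes < 1 or data_bytes & (data_bytes - 1):
        raise ValueError("data_bytes must be a power of two")
    log2_n = data_bytes.bit_length() - 1   # exact log2 for a power of two
    return log2_n + 5


# Width in bytes -> number of ECC bits.
table = {n: ecc_bits(n) for n in (4, 8, 16)}
```

For 8 bytes (64 bits) this gives 8 ECC bits, matching the 72-bit ECC word used later in the text.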

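When a single stored bit is in error, the ECC scheme corrects it automatically. When two data bits are in error, the error can be detected but not corrected. This behavior is usually called Single Error Correction/Double Error Detection (SEC/DED). When more than two data bits are wrong in a single access, the SEC/DED scheme can no longer reliably detect the error, and data integrity is compromised. With memory built this way, when a multi-bit error is detected the system reports a fatal fault and then crashes.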
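SEC/DED behavior can be demonstrated with an extended Hamming code. The sketch below protects 8 data bits with 4 Hamming check bits plus one overall parity bit (13 bits in total). It is a scaled-down illustration chosen for readability, not the 72-bit code used on real DIMMs, and the function names are made up.

```python
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]  # non-power-of-two positions in 1..12
CHECK_POSITIONS = [1, 2, 4, 8]                # Hamming check bits


def encode(data: int) -> list:
    """Encode 8 data bits into a 13-bit codeword; index 0 is the overall parity."""
    code = [0] * 13
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (data >> i) & 1
    for p in CHECK_POSITIONS:
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[pos] for pos in range(1, 13) if pos & p and pos != p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos                   # XOR of the positions of all 1 bits
    odd_overall = sum(code) % 2               # 1 means an odd number of flips
    if syndrome == 0 and not odd_overall:
        status = "ok"
    elif odd_overall and syndrome <= 12:
        code[syndrome] ^= 1                   # syndrome 0 is the overall parity bit
        status = "corrected"
    else:
        return "uncorrectable", None          # even number of flips, or bad syndrome
    data = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, data


word = encode(0b10011110)
single = list(word); single[6] ^= 1
double = list(word); double[3] ^= 1; double[10] ^= 1
```

Decoding `single` recovers the original byte with status `corrected`, while `double` is reported as `uncorrectable`. Three or more flips may be misread as a single error, which corresponds to the integrity loss described above.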

X4/X8 SDDC (Single Device Data Correction)

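As RAM chips become more highly integrated and memory capacities grow, the probability of memory errors rises as well. The SEC/DED memory architecture that was considered very reliable a few years ago is no longer adequate, and many vendors have been pursuing memory architectures that can correct multi-bit errors. The most severe RAM device failure is one in which all of the device's data bits are wrong. Correcting this kind of error has to be addressed in the hardware structure of the chips and the system; it cannot be achieved through a software upgrade.

Each byte in memory is paired with one additional ECC bit to form ECC words. If the data width of the memory system is 32 bytes (256 bits), the actual width of the stored data is 256 + 32 = 288 bits. In addition, each data bit of a given RAM device is placed in a separate ECC word.

Figure 1 illustrates how this method works. The memory system consists of 4 DIMM modules, and the 32 bytes (256 bits) of data are divided into 4 ECC words, each containing 8 bytes (64 bits) of data and 8 ECC bits. The actual length of one ECC word is therefore 64 + 8 = 72 bits, and the total stored length is 72 × 4 = 288 bits.

Figure 1. Chipkill memory error-correction principle

The memory controller splits each ECC word into 4 segments of 18 bits and stores them in the 4 DIMMs, one segment per DIMM. Each DIMM therefore also holds 4 segments, each from a different ECC word. The 18 bits of each segment are then stored in different RAM chips.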
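The bit scattering can be modeled to show why a whole-chip failure stays correctable. The mapping below is one possible layout that is consistent with the description, not the documented one: it assumes 18 chips per DIMM, sends segment s of every ECC word to DIMM s, and puts bit offset o of each segment on chip o. All names are made up for the sketch.

```python
WORDS = 4                         # ECC words per 288-bit access
SEGMENTS = 4                      # segments per ECC word, one per DIMM
SEG_BITS = 18                     # bits per segment, one per RAM chip
WORD_BITS = SEGMENTS * SEG_BITS   # 72 bits = 64 data bits + 8 ECC bits


def location(word: int, bit: int):
    """Map bit `bit` of ECC word `word` to a (dimm, chip) pair (assumed layout)."""
    segment, offset = divmod(bit, SEG_BITS)
    return segment, offset        # segment index picks the DIMM, offset picks the chip


def bits_on_chip(dimm: int, chip: int):
    """All (word, bit) pairs stored on one RAM chip."""
    return [(w, b) for w in range(WORDS) for b in range(WORD_BITS)
            if location(w, b) == (dimm, chip)]


def worst_case_errors_per_word() -> int:
    """Largest number of bits a single chip failure can corrupt in any one ECC word."""
    worst = 0
    for dimm in range(SEGMENTS):
        for chip in range(SEG_BITS):
            per_word = [0] * WORDS
            for w, _ in bits_on_chip(dimm, chip):
                per_word[w] += 1
            worst = max(worst, max(per_word))
    return worst
```

Under this layout each chip holds 4 bits, one from each ECC word, so losing an entire chip produces at most one bad bit per ECC word, which single-error correction can repair.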