INTEL MCE INJECTOR DRIVER
There is a notion of an “action optional” machine check. If background scrubbing detects something uncorrectable, it can, and it seems like it ought to, signal a machine check. Thus, the patch may proliferate on future Linux server distributions, allowing users of future Linux servers to enjoy increased fault tolerance. An instruction to load some data from memory didn’t get the data because it’s been destroyed. On a later page fault the associated application will be killed. System programming guide https:
Intel’s recent preview of its Xeon processor codenamed Nehalem-EX promises support for memory poisoning. Linux EDAC project on SourceForge.
First, hardware detects an uncorrectable error from memory transfers into the system cache or on the system bus. This simple harness uses debugfs to allow failures at an arbitrary page to be injected. Once the poisoned data is actually used (loaded into a processor register, etc.), a machine check is raised.
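The debugfs harness mentioned above can be driven from a few lines of script. The following is a minimal sketch, not the harness itself: the path `/sys/kernel/debug/hwpoison/corrupt-pfn` is an assumption based on the mainline `hwpoison-inject` module (the text only says "debugfs"), and the helper names `pfn_of` and `inject_poison` are hypothetical. Writing to the real file needs root and really does poison the page, so the `path` parameter lets the function be pointed at a scratch file first.

```python
# Assumed location of the injection file (hwpoison-inject module, debugfs
# mounted at /sys/kernel/debug). Not stated in the text; verify on your kernel.
CORRUPT_PFN = "/sys/kernel/debug/hwpoison/corrupt-pfn"


def pfn_of(phys_addr: int, page_size: int = 4096) -> int:
    """Return the page frame number containing a physical address."""
    return phys_addr // page_size


def inject_poison(pfn: int, path: str = CORRUPT_PFN) -> None:
    """Ask the kernel to treat page frame `pfn` as poisoned.

    The frame number is written in decimal followed by a newline.
    Against the real debugfs file this requires root privileges.
    """
    with open(path, "w") as f:
        f.write(f"{pfn}\n")
```

Calling `inject_poison(pfn_of(addr), path="/tmp/corrupt-pfn")` is a harmless dry run that shows exactly what would be written.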
Clean pages in either the swap or page cache can be easily recovered by invalidating the cache entry for these pages.
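The recovery rule can be written down as a small decision function. This is a toy model under stated assumptions, not kernel code: the `Page` type, its `kind` labels and the `Action` names are all hypothetical. It only encodes what the text says: clean page-cache or swap-cache pages are invalidated, huge pages cannot be handled, and in the remaining cases the owning application is killed on a later page fault. That last branch is a simplification.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    INVALIDATE_CACHE = "drop the cached copy; the backing copy on disk is intact"
    KILL_ON_FAULT = "unmap the page; kill the application on a later page fault"
    FAIL = "cannot recover"


@dataclass
class Page:
    kind: str            # hypothetical labels: "page_cache", "swap_cache", "anon", "huge"
    dirty: bool = False  # True if the in-memory copy differs from the disk copy


def recovery_action(page: Page) -> Action:
    """Pick a recovery action for a poisoned page (simplified model)."""
    if page.kind == "huge":
        # The text notes that huge pages fail (no reverse mapping support).
        return Action.FAIL
    if page.kind in ("page_cache", "swap_cache") and not page.dirty:
        # A clean cached page has a duplicate on disk, so just invalidate it.
        return Action.INVALIDATE_CACHE
    # Simplification: everything else is treated as "kill on later fault".
    return Action.KILL_ON_FAULT
```

For example, `recovery_action(Page("swap_cache"))` selects invalidation, while a dirty page-cache page falls through to the kill path because its only good copy was the one in memory.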
The OS can then take appropriate action, like killing the process with the corrupted data or logging the event properly to disk. These delays include asynchronous hardware reporting of the machine check event, and delayed execution of the handler via a workqueue.
Can it be any clearer? Background scrubbing gives a machine check. It is not recommended to use them for planning purposes.
In case you think this feature is old and was supplanted by something more recent, I urge you to flip back to and read along here at the intro to Section Since these pages have a duplicate backing copy on disk, the in-memory cache copy can be invalidated. See this LWN article for further details about this issue. And they go on to say that the poison handler runs some time after the time that the specific bad subset is used. Whether or not the CPU referenced the particular word that triggered the fault, the existing MCA may consider such faults catastrophic at the task level, and so does not bother to precisely track which instruction may have consumed the bogus data.
While the specifics of how hardware and the kernel might implement memory poisoning vary, the general concept is as follows. However, this is infeasible for two reasons.
If the faulting word is due to a prefetch, or is late in the cache line that was read due to a demand fetch, that data may arrive at the CPU quite long after the instruction that triggered that line fill.
The first two error types are the “an error was detected, but the CPU hasn’t consumed the errant data yet” error types. That’s not how I read this. A CPU read, or better yet, a data prefetch (either triggered explicitly by an instruction or implicitly by a prefetch engine), may have triggered the memory reference that triggered the MCA. I guess what you’re missing is who marks the memory as poisoned. Er, maybe I’m missing the thrust of your question, but I thought it was sort of straightforward: Unified error handling - A worthy goal?
If I’m not mistaken, that’s the processor family this article was referring to. Huge pages fail since reverse mapping is not supported to identify the process which owns the page. Or are you asking about something much more subtle? In the most recent Intel architectures, they support a notion of “recoverable machine check,” wherein the hardware tells the OS that no CPU state was corrupted when it noticed the problem.
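The distinction between "recoverable", "action optional" and "action required" can be illustrated by decoding a machine-check status word. The sketch below is an assumption-laden illustration, not something taken from this text: the bit positions (VAL 63, UC 61, PCC 57, S 56, AR 55) and the class names UCNA, SRAO and SRAR are as I recall them from the machine-check chapter of the Intel system programming guide, and should be checked against it. The function name `classify` is hypothetical.

```python
# Bit positions in IA32_MCi_STATUS as recalled from the Intel SDM (assumption).
VAL = 1 << 63  # register contains a valid error
UC = 1 << 61   # error was not corrected by hardware
PCC = 1 << 57  # processor context corrupt: no safe way to continue
S = 1 << 56    # error is signaled via a machine check exception
AR = 1 << 55   # software recovery action is required


def classify(status: int) -> str:
    """Classify a machine-check status word (simplified)."""
    if not status & VAL:
        return "invalid"
    if not status & UC:
        return "corrected"
    if status & PCC:
        return "fatal"  # CPU state was corrupted
    if not status & S:
        return "UCNA"   # uncorrected, no action required
    if status & AR:
        return "SRAR"   # action required: bad data is being consumed
    return "SRAO"       # action optional: e.g. found by background scrubbing
```

In this model an error found by the scrubber would show up as `classify(VAL | UC | S) == "SRAO"`: the hardware reports that no CPU state was corrupted, and the OS may act on the page at its leisure.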
The machine check is action optional and it can do just as you suggest. The CPU need not have referenced the particular word that triggered the fault. How can the CPU continue executing and generate a machine check at some arbitrarily later time?