Future microprocessors will be highly susceptible to transient errors as transistor sizes decrease with CMOS scaling. Prior techniques advocated full-scale structural or temporal redundancy to achieve fault tolerance. Although they can provide complete fault coverage, they incur significant area and/or performance overhead. It is desirable to have a mechanism that provides incomplete but still sufficiently high fault coverage at negligible area and/or performance cost. To achieve this goal, this paper examines exploiting speculative structures that already exist in modern processors to provide partial fault coverage. We start by quantifying how much a faulty execution deviates from the correct program execution in terms of control flow, address patterns, and store values. We find this classification useful for designing techniques that detect a particular form of deviation and thereby ultimately detect the transient fault. To detect transient faults, we propose augmenting branch predictors to detect control flow errors, using store sets and L2 cache misses to identify faults that might have resulted in incorrect address references, and using a value predictor to detect incorrect store values.
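As a rough illustration of the control-flow case only, the sketch below treats a misprediction on a branch the predictor had been getting right as a symptom worth checking. The abstract does not specify the mechanism, so everything here is an assumption made for illustration: the class name `ConfidentBranchChecker`, the 2-bit counter scheme, the streak-based confidence measure, and the `threshold` value are all hypothetical, and the sketch says nothing about what recovery action would follow a flag.

```python
from dataclasses import dataclass, field


@dataclass
class ConfidentBranchChecker:
    """Hypothetical predictor: 2-bit counters plus a per-branch confidence streak."""

    threshold: int = 4  # assumed: correct predictions in a row before we trust a branch
    counters: dict = field(default_factory=dict)  # pc -> 2-bit saturating counter (0..3)
    streaks: dict = field(default_factory=dict)   # pc -> consecutive correct predictions

    def observe(self, pc: int, taken: bool) -> bool:
        """Record a resolved branch; return True if the outcome looks suspicious."""
        counter = self.counters.get(pc, 1)  # new branches start weakly not-taken
        streak = self.streaks.get(pc, 0)

        predicted_taken = counter >= 2
        mispredicted = predicted_taken != taken

        # A misprediction on a high-confidence branch is treated as a possible
        # fault symptom. Ordinary events such as loop exits also trigger it,
        # so this is only a hint, not proof of a fault.
        suspicious = mispredicted and streak >= self.threshold

        # Train the saturating counter and the confidence streak.
        self.counters[pc] = min(3, counter + 1) if taken else max(0, counter - 1)
        self.streaks[pc] = 0 if mispredicted else streak + 1
        return suspicious


checker = ConfidentBranchChecker(threshold=4)
for _ in range(6):
    checker.observe(0x400, True)          # branch settles into "taken"
flagged = checker.observe(0x400, False)   # unexpected direction on a trusted branch
```

In this sketch `flagged` ends up true because the branch had built up a streak of correct predictions before changing direction. A real design would have to weigh such flags against legitimate mispredictions, which is the coverage-versus-cost trade-off the abstract describes.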
The authors of these documents have submitted their reports to this technical report series for the purpose of non-commercial dissemination of scientific work. The reports are copyrighted by the authors, and their existence in electronic format does not imply that the authors have relinquished any rights. You may copy a report for scholarly, non-commercial purposes, such as research or instruction, provided that you agree to respect the author's copyright. For information concerning the use of this document for other than research or instructional purposes, contact the authors. Other information concerning this technical report series can be obtained from the Computer Science and Engineering Department at the University of California at San Diego, firstname.lastname@example.org.