Dr. Jerry Cox, CEO & Founder, Blendics

In a perfect world of chip design, there would be no margin of error. However, as many engineers and only a few programmers know, digital signals do not instantly change from 0 to 1 or vice versa. It may take less than a nanosecond, but all the values of voltage in between the two valid ones representing 1 and 0 must be traversed during these transitions.

The circuit designer, aided by his or her EDA tool flow, strives to avoid catching a signal in between. Nothing good can ever come from having the next circuit say, “Perhaps it was a 1 or perhaps a 0.” This perhaps condition usually arises from a fundamental and unavoidable phenomenon called metastability. Unfortunately, guarding against perhaps is getting both harder and more important.

Reducing the risk of perhaps is getting harder because of Moore’s Law scaling (the empirical observation that the number of transistors that can be fit on a fixed chip area doubles about every two years). More transistors on a chip mean more power consumption.

Lower supply voltages and higher transistor threshold voltages are required to avoid crushing battery life in mobile devices and overheating in embedded systems. However, this drive to reduce power significantly increases the probability of encountering perhaps as a result of failure to properly synchronize certain circuits.

Moore’s Law scaling provides more transistors, leading designers to increase functionality and reuse silicon-based intellectual property. This requires either the challenging generation of a low-skew clock tree or an increase in the number of clock domain crossings. In either case, the risk of a synchronization failure leading to a perhaps will grow substantially.

Reducing the risk of perhaps has not only gotten harder but has also become more important, because more IC designs are planned for safety-critical applications. A perhaps once a year in a smartphone is not noticed among the dropped calls and fat-fingered errors.

The system-on-chip (SoC) technology that makes the smartphone possible has migrated to the power grid, to driverless cars and to implantable electronic devices. In these SoCs a perhaps has the possibility of fatal consequences. An error in synchronization could cause your future electronics platform on wheels to fail to brake or turn to avoid a collision.

How does an SoC design team know whether synchronization of signals in their new design will perform with high reliability? Most likely both the SoC designer and the verification engineer will rely on the rule of thumb that a two-stage synchronizer is safe. Because they don’t have convenient tools, they will skip the calculation of the reliability metric, Mean Time Between Failures (MTBF). This pragmatic approach has been safe in the past, but is hazardous for modern, high-performance SoCs in safety-critical applications.
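For context, the calculation that tends to get skipped is usually based on a widely used textbook model of a synchronizer, MTBF = exp(t_r / tau) / (T_w * f_c * f_d), where t_r is the time allowed for metastability to resolve, tau is the resolution time constant, T_w is the metastability window, and f_c and f_d are the clock and data-transition rates. The following is a minimal sketch of that model in Python, not the method used by any particular tool. The function name and every parameter value are hypothetical, chosen only to show how strongly the result depends on the resolution time.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def synchronizer_mtbf(t_resolve, tau, t_window, f_clock, f_data):
    """Textbook synchronizer MTBF in seconds.

    t_resolve: time available for metastability to resolve (s)
    tau:       resolution time constant of the flip-flop (s)
    t_window:  metastability window (s)
    f_clock:   receiving clock frequency (Hz)
    f_data:    average rate of asynchronous data transitions (Hz)
    """
    return math.exp(t_resolve / tau) / (t_window * f_clock * f_data)


# Hypothetical values for illustration only; real values come from
# characterizing the actual flip-flop at the actual operating corner.
TAU = 20e-12
T_WINDOW = 20e-12
F_CLOCK = 1e9
F_DATA = 1e8

if __name__ == "__main__":
    for t_resolve in (0.5e-9, 1.0e-9):
        mtbf_s = synchronizer_mtbf(t_resolve, TAU, T_WINDOW, F_CLOCK, F_DATA)
        print(f"t_resolve = {t_resolve:.1e} s -> "
              f"MTBF = {mtbf_s / SECONDS_PER_YEAR:.3e} years")
```

Because the resolution time sits in the exponent, a modest change in tau or in the available settling time moves the MTBF by orders of magnitude, which is why a fixed rule of thumb can be safe in one design and hazardous in another.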

In these applications, MTBF must be determined during design and before fabrication. Failure to do so risks the heavy costs of chip respins and product liability judgments. The design team must also recognize that, when failures occur at a constant rate, about 63 percent of the systems described by a particular value of MTBF will have failed before the end of the MTBF period.
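The 63 percent figure follows from the usual assumption that failures arrive at a constant rate, so that time to failure is exponentially distributed and the fraction failed by time t is 1 - exp(-t / MTBF). A minimal sketch of that arithmetic, with a hypothetical function name:

```python
import math


def fraction_failed_by(t, mtbf):
    """Fraction of units expected to have failed by time t.

    Assumes a constant failure rate (exponential time to failure);
    t and mtbf must be in the same units.
    """
    return 1.0 - math.exp(-t / mtbf)


if __name__ == "__main__":
    # Fraction failed after one full MTBF period, and after a tenth of it.
    print(f"after 1.0 x MTBF: {fraction_failed_by(1.0, 1.0):.1%}")
    print(f"after 0.1 x MTBF: {fraction_failed_by(0.1, 1.0):.1%}")
```

At t equal to one MTBF the expression reduces to 1 - 1/e, which is roughly 0.632.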

Viewing the system failures as occurring at the time of “one MTBF” is quite misleading. Also, a safety-critical product with 10,000,000 units in use over 10 years should have a thousand-fold longer MTBF than one with only 10,000 units in use over the same period.
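The thousand-fold figure can be checked with simple fleet arithmetic. Under a constant failure rate, the expected number of failures across a fleet is roughly units x years / MTBF, so holding that expectation fixed makes the required per-unit MTBF scale linearly with fleet size. The sketch below assumes a hypothetical target of one expected failure over the whole fleet lifetime; the function name and that target are illustrative, not from any standard.

```python
def required_unit_mtbf_years(units, service_years, max_expected_failures):
    """Per-unit MTBF (years) needed to keep the expected number of
    fleet-wide failures at or below max_expected_failures.

    Assumes a constant failure rate, so that
    expected failures ~= units * service_years / MTBF.
    """
    return units * service_years / max_expected_failures


if __name__ == "__main__":
    TARGET = 1.0  # hypothetical: at most one expected failure fleet-wide
    small = required_unit_mtbf_years(10_000, 10, TARGET)
    large = required_unit_mtbf_years(10_000_000, 10, TARGET)
    print(f"10,000 units:     MTBF >= {small:.0e} years")
    print(f"10,000,000 units: MTBF >= {large:.0e} years")
    print(f"ratio: {large / small:.0f}x")
```

Whatever failure target is chosen, the ratio between the two fleets stays the same, since it depends only on the ratio of unit counts.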

Because reliable performance of an SoC is getting harder to achieve and because significantly higher reliability is required in safety-critical SoCs, chip designers and verification engineers must increase their vigilance. New tools in the marketplace can help: MetaACE and a Public Synchronizer design.

Measuring MTBF in simulation is possible and easy with these new tools. Using the Public Synchronizer as a benchmark helps to determine a synchronizer’s quality. While no system can predict every failure, these new diagnostic tools go a long way toward predicting problems before they happen.