Tiny chips, but very big headaches: In vast computer networks, the smallest components are increasingly vulnerable.

In the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify.

The problem, they argued, was not in the software - it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not return requests for comment on its study.

Imagine for a moment that the millions of computer chips inside the servers that power the largest data centers in the world had rare, almost undetectable flaws. And the only way to find the flaws was to throw those chips at giant computing problems that would have been unthinkable just a decade ago.

As the tiny switches in computer chips have shrunk to the width of a few atoms, the reliability of chips has become another worry for the people who run the biggest networks in the world. 

Amazon, Facebook, Twitter and many other sites have experienced surprising failures over the past year.

The service failures have had several causes, like programming mistakes and congestion on networks. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they still depend, at the basic level, on computer chips that are now less reliable and, in some cases, less predictable.

"They're seeing these silent errors, essentially coming from the underlying hardware," said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware.

Increasingly, Dr. Mitra said, people believe that manufacturing defects are tied to these so-called silent errors that cannot be easily caught.

Researchers worry that they are finding rare defects because they are trying to solve bigger and bigger computing problems, which stresses their systems in unexpected ways.

Companies that run large data centers began reporting systematic problems more than a decade ago.

In 2015, in the engineering publication IEEE Spectrum, a group of computer scientists who study hardware reliability at the University of Toronto reported that each year as many as 4 percent of Google's millions of computers had encountered errors that couldn't be detected and that had caused them to shut down unexpectedly.

In a microprocessor that has billions of transistors - or a computer memory board composed of trillions of the tiny switches that can each store a 1 or 0 - even the smallest error can disrupt systems that now routinely perform billions of calculations each second.

At the beginning of the semiconductor era, engineers worried about the possibility of cosmic rays occasionally flipping a single transistor and changing the outcome of a computation.

Now they are worried that the switches themselves are becoming less reliable.

The Facebook researchers even argue that the switches have become more prone to wearing out and that the life span of computer memories or processors may be shorter than previously believed.

There is mounting evidence that the problem is becoming worse with each new generation of chips.

The Publishing continues into the future. The World Students Society thanks author John Markoff.

