Data, Noise, and the Missing Internet Epistemology

By chromatic
December 18, 2008 | Comments: 2

Steffen Mueller pointed me to an interview with Bjarne Stroustrup from August 2008. In particular, the interviewer asked if C++ is really declining, as some journalists and pundits have suggested. Bjarne's answer is appropriate:

Most of the popular measures basically measures noise and ought to report their findings in decibel rather than "popularity."

Consider TIOBE's index of programming languages, which trawls various search engines for +"language programming" and reports on the findings. You may note with some amusement that several months after the release of Google's Chrome browser, TIOBE has decided to remove "Chrome" (the previous name for Oxygene) from Delphi's score. (See also The Many Faces of Delphi.)
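A back-of-the-envelope sketch of how such an index behaves, assuming (as TIOBE describes) that raw search-engine hit counts are normalized into percentage ratings. The languages and numbers here are invented purely to illustrate how one unvetted search term can inflate a score:

```python
def ratings(hit_counts):
    """Normalize raw per-language hit counts into percentage shares."""
    total = sum(hit_counts.values())
    return {lang: 100.0 * hits / total for lang, hits in hit_counts.items()}

# Invented hit counts. In the "polluted" data, Delphi's count includes
# pages about Google's Chrome browser that match the old "Chrome" name
# for Oxygene; the "cleaned" data drops those false matches.
polluted = {"Java": 500_000, "C++": 300_000, "Delphi": 200_000}
cleaned  = {"Java": 500_000, "C++": 300_000, "Delphi": 120_000}

print(ratings(polluted)["Delphi"])            # 20.0
print(round(ratings(cleaned)["Delphi"], 1))   # 13.0
```

The measurement method never changed; only the vetting of the input did, yet the "popularity" of the language moved by a third.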

Perhaps it's unfair to pick on TIOBE for not vetting their data completely, but TIOBE's index relies on an assumption Bjarne skewers. Popularity, practicality, and use are very different -- yet they're easily mistaken. If that weren't the case, would astroturfing work? How about form letters and Internet petitions?

Does the presence of an identifiable point of view on a Wikipedia page mean that motivated subject matter experts have refined the phrasing and tone and approach of the article so that it reflects the most accurate consensus opinion of skilled practitioners in the field, or does it mean that an editor with free time reverted dissenting opinions so often that everyone else gave up?

Does the top result for a search in Google represent the best result for that query, or the most-linked result? Given that many linkers rely on Google to find relevant links, is Google distorting its own results by encouraging self-reinforcing behavior?

Can a widely-syndicated writer inject an idea or phrase into a headline and watch it spread throughout the Internet from nothing to effective ubiquity, giving the appearance that the idea is widely held? How can you tell the difference between syndicated-by-default and consciously-spread-on-purpose anyway?

Can you tell the difference between a grass-roots storm of complaints about DRM in Spore from angry, affected customers and a loosely-organized mob of complaints from people who'd never have bought the game anyway?

I don't know how to answer these questions. I worry that deriving data from collective intelligence may lead us to assume that we can -- or worse yet, that we'll never even think to ask the questions. Where is the new epistemology for the Internet?

These are very important questions. It might be where the long tail trips us up. We risk creating de facto knowledge simply by aggregating large numbers of readers prepared to accept nonsensical information from a vocal minority.

If knowledge is a "true, justified belief" how do we validate truth on the Web? We need to be able to differentiate between facts and mythology.

The answer lies not in collective approval but rather in the degree of authority commanded by any given source of information. For example, I have a reasonable level of confidence in a news story posted on the BBC website or in an article found in the Encyclopaedia Britannica at my local library, because those are trusted sources. I do not share the same confidence in Google or Wikipedia.

A good researcher looks to multiple sources of information and would be unwise to rely solely upon crowd-sourced content.

How long ago was it finally revealed that "everybody" knew Iraq held Weapons of Mass Destruction? That what "everybody" knew was repetition of fabricated data from a single source, the US government (Office of Special Plans)?

How long will it be until "everybody" figures out that XML, as a datastore, is a bad rehash of IMS?

How long will it be until "everybody" figures out that the web browser, as an application platform, is semantically identical to a 1970s-era 3270 terminal?

And the list goes on.
