on Jan 20, 2006
Mark normally posts on Sundays, but he seems to be on a roll this week. He has more on probabilistic systems, which I mentioned earlier. This led me to this bit from Nicholas Carr, which is one side of a debate on the merits of probabilistic systems.
Back already? Great.
As others have pointed out, one thing about these systems is that even if nobody is cheating, deciding what is “good enough” is a bit abstract: it depends on what you want to do with the emergent data, and what your standards are for usefulness. Everyone’s big problem seems to be with Wikipedia. It is often used as an example of a probabilistic system that doesn’t really deliver and (occasionally) used as an indictment of probabilistic systems in general. As far as probabilistic systems go, Wiki is really a poor example. I think it’s a stretch to lump it in with systems like Google and Technorati. So what makes Wikipedia so different?
Low fault tolerance
Let’s say I wrote some software that looks at common airplane approach vectors to major airports. Pilots can enter their current position, their destination, a few other variables, and my program will then come back with, “Based on what other pilots have done in similar circumstances, we suggest using the following approach…” Let’s assume I do a good job and my program makes the right choice nearly every time.
Well, we can stop right there. Nearly every time isn’t nearly good enough in this situation. I don’t care how much depth we give the dataset or how many variables we take into account. The whole system is useless.
On the other hand, let’s say you want a picture of Britney Spears for your desktop (humor me here) and Google comes back with a less-than-optimal result. Instead of giving you the “official” page run by some media company, it gives you a website maintained by a fan. Odds are, his site has what you want as well. Even if it doesn’t, he probably has a link that will point you to the goods.
The difference between these two situations is pretty stark. One is a waste of time, even with a 99% success rate, and the other works well enough even when it gets things “wrong”.
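To put a rough number on that, here is a minimal Python sketch of how a 99% success rate compounds over repeated use. The function name and the figure of 100 uses are assumptions made up for illustration, and it assumes each use succeeds or fails independently of the others.

```python
def chance_of_at_least_one_failure(success_rate: float, uses: int) -> float:
    """Probability that at least one of `uses` independent tries goes wrong."""
    # The chance that every single try succeeds is success_rate ** uses,
    # so the chance of one or more failures is whatever is left over.
    return 1.0 - success_rate ** uses


if __name__ == "__main__":
    # 99% comes from the example above; 100 uses is an arbitrary assumption.
    risk = chance_of_at_least_one_failure(0.99, 100)
    print(f"Chance of at least one bad answer in 100 uses: {risk:.1%}")
```

Under those assumptions the chance of at least one bad answer in 100 uses is about 63%. That is a shrug when the cost is a second click on a fan site, and a disaster when the cost is a botched landing.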
And this is Wikipedia’s problem: Most people have a pretty low tolerance for error in an encyclopedia. If the info is wrong (or even suspect) then they have to look it up elsewhere, so why bother with Wiki at all? More to the point, if you have a low error tolerance, should you really be using probabilistic systems? Probably not.
Lack of Darwinism
As I understand Wikipedia, each subject has one entry. If I think the guy who wrote the entry for Article 153 of the Constitution of Malaysia got something wrong, I edit the original article. The next person to visit the page will see my version, not the original. People can review new changes or revert to old versions according to various rules, but at any time there is only one page for Article 153 of the Constitution of Malaysia, and the average visitor isn’t going to want to take part in the courtship between new data and old data.
This isn’t a good way to foster, uh, probabilism. For a healthy probabilistic system, an edit would need to create a new article that exists parallel to the original. The two would be “ranked” according to (perhaps) the number of incoming references that favor one version over the other, and the number of times users clicked on “this item was helpful”. The two versions of the same subject would be allowed to compete for visitors, with better pages slowly knocking less useful pages down in the rankings. Thus, each visitor contributes to the system by helping to rank pages, often by simply using them and then going away. This means the data gets more useful even when nobody is editing the articles themselves.
(Note that I’m not suggesting it should work this way. There are many reasons why this might not be a good idea. I’m just saying this would give the system much stronger probabilistic properties.)
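For the curious, the competing-versions idea can be sketched in a few lines of Python. This is only an illustration of the hypothetical scheme described above, not how Wikipedia works. The class name, field names, and the choice to weight references and “helpful” clicks equally are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArticleVersion:
    """One of several parallel versions of the same subject (hypothetical)."""
    label: str
    incoming_references: int = 0  # links that favor this version
    helpful_clicks: int = 0       # "this item was helpful" votes

    def mark_helpful(self) -> None:
        # A visitor improves the ranking just by clicking, without editing.
        self.helpful_clicks += 1

    def score(self) -> int:
        # Assumption: both signals count equally. A real system would
        # almost certainly weight and normalize them differently.
        return self.incoming_references + self.helpful_clicks


def rank_versions(versions: List[ArticleVersion]) -> List[ArticleVersion]:
    """Return the versions ordered from most to least useful."""
    return sorted(versions, key=lambda v: v.score(), reverse=True)


if __name__ == "__main__":
    original = ArticleVersion("original", incoming_references=2)
    rewrite = ArticleVersion("rewrite", incoming_references=1)
    rewrite.mark_helpful()
    rewrite.mark_helpful()
    for version in rank_versions([original, rewrite]):
        print(version.label, version.score())
```

The point of the sketch is that the ranking moves whenever a visitor clicks or links, so the system keeps learning even when the article text never changes.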
Detecting bad data
As I mentioned before, often Google will give you a less-than-optimal result, but things still work out. Often the “wrong” site will contain a link to the “right” one. Finding a Britney Spears fan site leads me to the official one. The same is not true for poor Wiki. When I get to the wrong site, I don’t know it’s wrong. If I did, I wouldn’t need to look it up. Even worse, finding the wrong birthday for Napoleon doesn’t lead me to the right one. It leads me to propagate bad data.
Help from the user
It is very, very rare that I ever need to check out page 2 of Google search results. Usually what I want is right there on page 1. However, often my goal is not the #1 result. So, Google is great at narrowing a search down to 10 or so likely contenders, but it has a really hard time picking the right one out of those 10. Since it lists all 10, and lets me choose, it doesn’t have to. That last level of value judgments – the most difficult – is left for the user.
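Here is a tiny Python sketch of that distinction between being right at #1 and being right somewhere on page 1. The function name and the toy result list are invented for illustration and have nothing to do with how Google actually works.

```python
from typing import List


def hit_at_k(ranked_results: List[str], wanted: str, k: int) -> bool:
    """True if the page the user wants appears in the top k results."""
    return wanted in ranked_results[:k]


if __name__ == "__main__":
    # A made-up results page: the page I want is third, not first.
    page_one = ["fan-site", "news-story", "official-site", "lyrics-page"]
    print(hit_at_k(page_one, "official-site", 1))   # the engine alone
    print(hit_at_k(page_one, "official-site", 10))  # engine plus my judgment
```

Judged on the top result alone, this search failed. Judged on the whole first page with a human making the final pick, it succeeded.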
By contrast, there is no way the user can really “help” Wiki, unless they jump in and write an article.
I guess my point in all this is that Wiki, regardless of its usefulness, is a bit shabby when it comes to probabilistic properties.
Shamus Young is an old-school OpenGL programmer, author, and composer. He runs this site and if anything is broken you should probably blame him.