Wednesday, September 9, 2009

Abusing Terminology in the Name of the Abused

I first heard Dave Lee's story, now posted on the BBC NEWS Web site under the headline "Smart tech reconnects Colombians," last night on the BBC World Service's Digital Planet program. I do not listen to this program regularly, partly because I have to brace myself with a high level of skepticism when I do. Just because we are still in an economic crisis does not mean that attempts to sell snake oil have abated. If anything, quite the contrary.

Before getting carried away with this rant, let me present Lee's statement of the problem that needs to be solved:

Many of Colombia's displaced people have been caught up in the country's violent conflicts involving armed guerrilla groups or the drug cartels.

Many have lost all of their assets, belongings and land, ending up in slums outside the cities.

Often families are split up in the process. When this happens, they are told to register their details on a national database - known as the unique registry of displaced persons - set up by the Colombian government.

However, other registries have been set up by NGO groups - such as the Red Cross - meaning the displaced millions are spread over several databases.

Frustratingly for those who have lost connection with their families, these databases don't "talk" to each other or share information.

So, while one brother may be on one database, the other may be registered elsewhere, reducing their chances of being reunited.

This means for many Colombians, being displaced from their home can mean losing contact with friends and relatives for years, even if they live in the same city.

The problem of getting the information you need by synthesizing the results of querying multiple databases is far from a new one. Back in the early nineties, when I was doing multimedia research at the Institute of Systems Science in Singapore, the database group there was working on this very problem. It is only when we read beyond the statement of the problem, however, that we discover that the headline composed around this problem is misleading:

Researchers aim to solve this problem by creating a "semantic knowledge layer", that will link crucial information (such as names, addresses, age, etc.) across all the databases.

Semantic technology is seen by some as the next step for the world wide web, as it allows a much richer understanding of huge data sets.
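
Before responding, let me pin down what a "semantic knowledge layer" usually amounts to in practice: a shared vocabulary plus per-database mappings into it, so that a "nombre" column in one registry and a "full_name" column in another resolve to the same concept. Here is a minimal sketch of that idea in Python; every registry name, field name, and mapping is my own illustrative assumption, not anything taken from the project itself.

```python
# A minimal sketch of a "semantic knowledge layer": a shared vocabulary
# plus per-registry field mappings. All registry names, field names,
# and mappings here are hypothetical illustrations.

# Each registry declares how its own columns map onto shared concepts.
MAPPINGS = {
    "government": {"nombre": "person_name", "ciudad": "city", "edad": "age"},
    "ngo":        {"full_name": "person_name", "location": "city"},
}

def normalize(registry, record):
    """Rewrite a registry-specific record in shared-vocabulary terms."""
    mapping = MAPPINGS[registry]
    return {mapping[key]: value for key, value in record.items() if key in mapping}

print(normalize("government", {"nombre": "Luis Gomez", "ciudad": "Bogota", "edad": 34}))
print(normalize("ngo", {"full_name": "L. Gomez", "location": "Bogota"}))
# The two records now share keys; but nothing here decides whether they
# describe the same person. That judgment is the actual integration work.
```

Note that the mapping buys uniform field names and nothing more; whether two normalized records refer to the same displaced person is left entirely open.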

The fact is that, at the present time, no technology is dealing with the problem of reconnecting displaced Colombians. At best we have a rehash of that old joke about the mathematician who is content to reduce a new problem to one already known to be solved, without ever exerting the effort to actually carry out the solution. At worst we have weasel words, such as "seen by some," and the use of the verb "aim" with no evidence as to just how good those researchers' "aim" is!

Regular readers know that I love to be a gadfly about all things Google; but in this case the bottom line comes down to whether or not a "semantic knowledge layer" offers a substantive improvement over the kind of brute-force search that Google does so well. Google technology has been adapted to searching across multiple databases. It may not integrate the results, but it delivers results that let the user perform that integration, perhaps more reliably than any current "semantic knowledge layer" could. Again, there is nothing new about this approach. On my old blog I reported on an exchange over just this question that took place in 2006 between Tim Berners-Lee (self-appointed high priest of semantic knowledge layers) and Peter Norvig (Google Director of Search).
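
For contrast, the brute-force alternative needs no shared vocabulary at all: query each registry with the same string, pool whatever comes back, and let the human do the connecting. A sketch, again with entirely hypothetical registries and fields:

```python
# A sketch of the brute-force alternative: keyword search against each
# registry separately, pooling the hits for a human to connect.
# Registries and fields are hypothetical illustrations.

GOVERNMENT = [{"nombre": "Luis Alberto Gomez", "ciudad": "Bogota"}]
NGO = [{"full_name": "L. Gomez", "location": "Bogota"}]

def search(records, query):
    """Return records in which any field contains the query string."""
    query = query.lower()
    return [r for r in records
            if any(query in str(value).lower() for value in r.values())]

# No integration is attempted; the user sees all candidate matches
# side by side and judges which ones belong together.
for hit in search(GOVERNMENT, "gomez") + search(NGO, "gomez"):
    print(hit)
```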

Lee's article concludes with a single sentence from a member of the Colombian project, Juan Sequeda, which may well invalidate the rest of the report:

It's all about how you integrate data.

There are two ways to respond, neither of which serves the case of semantic knowledge layers particularly well. The Google response would be, "No, it isn't!" Google search results will be just as good whether or not the data have been integrated. On the other hand, anyone who is seriously interested in semantics recognizes that Sequeda's assertion means that a semantic knowledge layer is only as good as the quality of the data integration beneath it. If the integration effort leaves too many loose ends, semantic technology has nothing to add. Indeed, it may even detract, creating yet another scenario in which failure is attributed to an inability to "connect the dots."
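
Sequeda's point can be made painfully concrete. Whatever the layer looks like, somewhere a program has to decide whether two records name the same person, and a naive linking rule leaves exactly the loose ends I have in mind. A sketch using Python's standard difflib, with made-up records and a made-up threshold:

```python
# Why "it's all about how you integrate data": a naive linking rule
# leaves loose ends. The records and the 0.8 threshold are assumptions
# for illustration, not anything from the actual project.
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

government_name = "Luis Alberto Gomez"
ngo_name = "L. Gomez"

score = similarity(government_name, ngo_name)
print(f"match score: {score:.2f}")  # roughly 0.54

# At a typical threshold the records stay unlinked, even though both
# may well describe the same brother: a dot the layer failed to connect.
if score >= 0.8:
    print("records linked")
else:
    print("loose end: records remain unconnected")
```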

The one good thing about this unfortunate exercise is that it offers a clear example of the difference between "machine semantics" and "human semantics." As humans, we do not have to worry about whether we have been primed with representations that integrate the data we derive from our many different experiences in many different settings. We do not have to represent "integrating relations," because we can infer them when we need them. In other words, the mastery of semantics has more to do with refining our inferential behavior than with seeking out ever more powerful representations. My guess is that Sir Tim knows that such inferential behavior is part of the "semantic big picture" but assumes it will drop out as an "exercise left for the student" once all those questions of representation that he finds so appealing have been resolved. That assumption provides a convenient distraction from the higher-level question of whether it makes sense to resolve those problems of representation in the first place.
