Thirty-five years ago, if you wanted to find life sciences articles from Thomson
Science Publishing, you'd have to pore through stacks of printed journals or search
data on a computer tape to do it. Today, not only is its data online, but the
data itself understands its own meaning.
The company scanned decades of printed life sciences journals and categorized
the data using a new type of information science called semantic web technology.
"The text mining software was provided with a list of vocabulary terms: 2.7 million
organisms, chemical compounds, body parts, and different time periods," says Peter
Meehan, business development director for software firm Mondeca, which provided
the indexing tools.
Certain documents might have links with the Jurassic period, for example. Searches
that found those documents might also find others linked to the Jurassic in some
way, even if those documents didn't share any keywords.
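A minimal sketch in Python of how concept-based retrieval of this kind might work. The document titles, the tiny vocabulary and the function names are all invented for illustration; nothing here reflects Mondeca's actual tools.

```python
# Hypothetical controlled vocabulary: surface term -> concept it is linked to.
VOCABULARY = {
    "allosaurus": "Jurassic",
    "stegosaurus": "Jurassic",
    "archaeopteryx": "Jurassic",
    "trilobite": "Cambrian",
}

# Hypothetical documents; doc1 and doc2 have no words in common.
DOCUMENTS = {
    "doc1": "Bite mechanics of Allosaurus skulls",
    "doc2": "Feather structure in Archaeopteryx fossils",
    "doc3": "Trilobite eye morphology",
}


def tag_concepts(text):
    """Return the set of concepts whose vocabulary terms appear in the text."""
    words = text.lower().split()
    return {VOCABULARY[w] for w in words if w in VOCABULARY}


def build_index(documents):
    """Map each concept to the ids of the documents tagged with it."""
    index = {}
    for doc_id, text in documents.items():
        for concept in tag_concepts(text):
            index.setdefault(concept, set()).add(doc_id)
    return index


def search_by_concept(index, concept):
    """Return the sorted ids of every document linked to the concept."""
    return sorted(index.get(concept, set()))


index = build_index(DOCUMENTS)
print(search_by_concept(index, "Jurassic"))  # ['doc1', 'doc2']
```

Neither matching title contains the word "Jurassic"; the two documents are found only because the vocabulary ties their terms to the same concept.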
While dot-com startups rave about Web 2.0, semantic web researchers, including
original Web inventor Sir Tim Berners-Lee, are already using these concepts to
build what could be called Web 3.0. The semantic web understands something about
the information it carries, rather than simply holding documents and delivering
them blindly based on keyword search.
"It's about transforming the web of documents into a web of data," explains David
Zaccagnini, principal product manager at semantic web company Language and Computing.
His company uses semantic web technology to help clients in the healthcare industry
improve their retrieval of clinical data. Its semantic indexes can retrieve content
based not simply on keywords, but on the concepts behind them.
transforming search
When we search today's web, we have to guess at the potential keywords that documents
might contain. Someone interested in businesses with links to Bill Gates may search
on any or all of those terms, and retrieve a set of loosely related documents.
They'll pore laboriously over the results and repeatedly refine their terms, and
will still miss vital data.
But if a document included hidden data defining its concepts and relating them
to others, the search engine would be able to understand what documents meant,
qualifying the results before passing them back to the user. 'This document discusses
a person called Bill Gates', the hidden data might say, in a way that machines
rather than humans can understand. 'He chairs the following institutions, and
has shares in these others'.
Semantic searches could potentially deliver results containing information about
related items that the user hadn't thought of. Someone surfing Google may or may not
stumble on the fact that Bill Gates is a director of Warren Buffett's company Berkshire
Hathaway. If that fact were encoded explicitly in metadata, the user couldn't miss it
because a semantic search engine would point it out.
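One way to picture that hidden data is as a list of subject-predicate-object statements that a search engine can follow. The sketch below is in Python and is only illustrative: the document id, the predicate names and the helper functions are assumptions made for this example, not any real search engine's format.

```python
# Each fact is a (subject, predicate, object) triple. The document id and
# the predicate names are hypothetical, chosen only for this sketch.
TRIPLES = [
    ("doc42", "discusses", "Bill Gates"),
    ("Bill Gates", "is_a", "Person"),
    ("Bill Gates", "director_of", "Berkshire Hathaway"),
    ("Berkshire Hathaway", "company_of", "Warren Buffett"),
]


def facts_about(triples, subject):
    """Return (predicate, object) pairs for every triple with this subject."""
    return [(p, o) for s, p, o in triples if s == subject]


def enrich_result(triples, doc_id):
    """For one search result, collect the facts about each entity it discusses."""
    enriched = {}
    for predicate, entity in facts_about(triples, doc_id):
        if predicate == "discusses":
            enriched[entity] = facts_about(triples, entity)
    return enriched


print(enrich_result(TRIPLES, "doc42"))
# {'Bill Gates': [('is_a', 'Person'), ('director_of', 'Berkshire Hathaway')]}
```

The directorship surfaces alongside the document even though the searcher never asked about Berkshire Hathaway, because it is stated explicitly in the metadata.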
using ontologies
Codifying the hidden data describing those concepts is the hard part, and it
relies heavily on a tool called an ontology. "It defines what sorts of things we
are talking about and what their characteristics are," says Meehan. "The next
thing is what the valid relationships are that can exist between those systems."
An ontology model for cars might define the characteristics of a carburetor, and
describe the properties of fuel, and what air is. It would know that the carburetor
blends air and fuel, and would use that relationship to flesh out a computer's search for information
about any of those things. Index hundreds of components and relationships, and
it is easy to see how complex an ontology could become.
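The carburetor example can be sketched as a toy data structure in Python. This is an illustration under stated assumptions, not a real ontology language: the Ontology class, its method names and the characteristics assigned to each item are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy ontology: named classes plus the relationships between them."""
    classes: dict = field(default_factory=dict)    # name -> characteristics
    relations: list = field(default_factory=list)  # (subject, relation, object)

    def add_class(self, name, **characteristics):
        self.classes[name] = characteristics

    def relate(self, subject, relation, obj):
        # Only accept relationships between things the ontology defines.
        if subject not in self.classes or obj not in self.classes:
            raise ValueError(f"unknown class in {subject!r} -> {obj!r}")
        self.relations.append((subject, relation, obj))

    def expand(self, term):
        """Return the term plus every class directly related to it."""
        related = {term}
        for s, _, o in self.relations:
            if s == term:
                related.add(o)
            elif o == term:
                related.add(s)
        return sorted(related)


# The characteristics below are illustrative placeholders.
cars = Ontology()
cars.add_class("carburetor", part_of="engine")
cars.add_class("fuel", state="liquid")
cars.add_class("air", state="gas")
cars.relate("carburetor", "blends", "air")
cars.relate("carburetor", "blends", "fuel")

print(cars.expand("fuel"))        # ['carburetor', 'fuel']
print(cars.expand("carburetor"))  # ['air', 'carburetor', 'fuel']
```

A search for "fuel" can be widened to include the carburetor because the ontology records that one blends the other. With hundreds of components, the list of relations grows quickly.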
Could Google ever produce a semantic web search engine? The web's information
would need to be coded semantically first, and that won't happen overnight. But
companies are using semantic web technology to help them understand and use information
of their own, behind the firewall.
Eric Miller, president of semantic information management product vendor Zepheira,
says that the technology can be useful in tying together information sources across different
functions within a company. "It wraps together traditional data stores such as
your CRM system, your personnel directory and your project system," says Miller,
who headed up the World Wide Web Consortium's semantic web effort before starting
Zepheira. "It creates a set of interconnected data. You can begin to tie together
people, projects, and documents, all in a web of data."
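A rough Python sketch of what tying such stores together could look like. The three record layouts, the names in them and the link labels are all hypothetical; this is not Zepheira's product, only an illustration of turning separate records into one set of links.

```python
# Three hypothetical data stores, each with its own record layout.
CRM = [{"customer": "Acme", "account_manager": "alice"}]
DIRECTORY = [
    {"id": "alice", "department": "Sales"},
    {"id": "bob", "department": "Engineering"},
]
PROJECTS = [
    {"name": "Rollout", "customer": "Acme",
     "members": ["alice", "bob"], "documents": ["plan.doc"]},
]


def to_links(crm, directory, projects):
    """Flatten all three stores into one set of (subject, link, object) triples."""
    links = set()
    for row in crm:
        links.add((row["account_manager"], "manages_account", row["customer"]))
    for row in directory:
        links.add((row["id"], "works_in", row["department"]))
    for row in projects:
        links.add((row["name"], "for_customer", row["customer"]))
        for member in row["members"]:
            links.add((member, "works_on", row["name"]))
        for document in row["documents"]:
            links.add((document, "belongs_to", row["name"]))
    return links


def neighbours(links, node):
    """Return everything one link away from the node, in either direction."""
    found = set()
    for s, p, o in links:
        if s == node:
            found.add((p, o))
        elif o == node:
            found.add((p, s))
    return sorted(found)


links = to_links(CRM, DIRECTORY, PROJECTS)
print(neighbours(links, "Rollout"))
print(neighbours(links, "alice"))
```

Starting from a project, a single lookup reaches its customer, its people and its documents, even though those records originally sat in three separate systems.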
Codifying our huge information bases in this way will take time and effort.
But the rewards could be huge. Having information is one thing. Having information
that understands what it is and how it fits together is quite another.