Industry Watch: building the semantic web

December 2007
Thirty-five years ago, if you wanted to find life sciences articles from Thomson Science Publishing, you'd have to pore through stacks of printed journals or search data on a computer tape to do it. Today, not only is its data online, but the data itself understands its own meaning.
 
The company scanned decades of printed life sciences journals and categorized the data using a new type of information science called semantic web technology. "The text mining software was provided with a list of vocabulary terms: 2.7 million organisms, chemical compounds, body parts, and different time periods," says Peter Meehan, business development director for software firm Mondeca, which provided the indexing tools.
 
Certain documents might have links with the Jurassic period, for example. Searches that found those documents might also find others linked to the Jurassic in some way, even if those documents didn't share any key words.
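
The idea of finding documents by shared concept rather than shared keyword can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document titles, the concept tags and the function names are invented for the example and do not describe the Thomson or Mondeca systems.

```python
# Hypothetical documents. Each carries concept tags assigned by an indexer;
# none of the titles contains the word "Jurassic".
DOCS = {
    "doc1": {"text": "Allosaurus fossils found in the Morrison Formation",
             "concepts": {"Jurassic", "Allosaurus"}},
    "doc2": {"text": "Early cycads and conifer forests",
             "concepts": {"Jurassic", "Cycad"}},
    "doc3": {"text": "Mammoth remains recovered from permafrost",
             "concepts": {"Pleistocene", "Mammoth"}},
}


def keyword_search(term):
    """Plain keyword match against the visible text."""
    return sorted(d for d, doc in DOCS.items()
                  if term.lower() in doc["text"].lower())


def concept_search(concept):
    """Match against the concept tags instead of the text."""
    return sorted(d for d, doc in DOCS.items() if concept in doc["concepts"])


def related_documents(doc_id):
    """Other documents that share at least one concept with doc_id."""
    concepts = DOCS[doc_id]["concepts"]
    return sorted(d for d, doc in DOCS.items()
                  if d != doc_id and doc["concepts"] & concepts)
```

Here keyword_search("Jurassic") finds nothing, while concept_search("Jurassic") returns doc1 and doc2, and related_documents("doc1") leads to doc2 even though the two share no words.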
 
While dot-com start-ups rave about Web 2.0, semantic web researchers, including original Web inventor Sir Tim Berners-Lee, are already using these concepts to build what could be called Web 3.0. The semantic web understands something about the information it carries, rather than simply holding documents and delivering them blindly based on keyword search.
 
"It's about transforming the web of documents into a web of data," explains David Zaccagnini, principal product manager at semantic web company Language and Computing. His company uses semantic web technology to help clients in the healthcare industry improve their retrieval of clinical data. Its semantic indexes can retrieve content based not simply on keywords, but on the concepts behind them.
 
transforming search
 
When we search today's web, we have to guess at the keywords that documents might contain. Someone interested in businesses with links to Bill Gates may search on any or all of those words, and retrieve only a set of loosely related documents. They'll pore laboriously over the results, repeatedly refine their terms, and still miss vital data.
 
But if a document included hidden data defining its concepts and relating them to others, the search engine would be able to understand what documents meant, qualifying the results before passing them back to the user. 'This document discusses a person called Bill Gates', the hidden data might say, in a way that machines rather than humans can understand. 'He chairs the following institutions, and has shares in these others'.
 
Semantic searches could also deliver information about related items that the user hadn't thought of. Someone surfing Google may or may not stumble on the fact that Bill Gates is a director of Warren Buffett's company Berkshire Hathaway. If that fact were encoded explicitly in metadata, the user couldn't miss it, because a semantic search engine would point it out.
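
One common way to picture such hidden data is as a list of subject, predicate, object statements. The Python sketch below assumes that shape. Only the Berkshire Hathaway directorship comes from this article; the predicate names and the "Example" organisations are placeholders invented for illustration, and a real system would use a standard metadata format rather than a Python list.

```python
# Machine-readable statements as (subject, predicate, object) triples.
# "Example Institution" and "Example Holdings" are hypothetical placeholders.
TRIPLES = [
    ("Bill Gates", "is_a", "Person"),
    ("Bill Gates", "director_of", "Berkshire Hathaway"),
    ("Bill Gates", "chairs", "Example Institution"),
    ("Bill Gates", "shareholder_in", "Example Holdings"),
    ("Berkshire Hathaway", "company_of", "Warren Buffett"),
]

# Predicates that count as a business link (an assumption of this sketch).
BUSINESS_PREDICATES = {"director_of", "chairs", "shareholder_in"}


def linked_businesses(person):
    """Organisations the person is explicitly connected to."""
    return sorted(o for s, p, o in TRIPLES
                  if s == person and p in BUSINESS_PREDICATES)


def related_people(person):
    """People reachable through one of the person's organisations."""
    orgs = set(linked_businesses(person))
    return sorted({o for s, p, o in TRIPLES
                   if s in orgs and p == "company_of"})
```

A query for linked_businesses("Bill Gates") lists all three organisations without any keyword guessing, and related_people("Bill Gates") surfaces Warren Buffett, the kind of related item the user never asked for.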
 
using ontologies
 
Codifying the hidden data describing those concepts is the hard part, and it relies heavily on tools called ontologies. "It defines what sorts of things we are talking about and what their characteristics are," says Meehan. "The next thing is what the valid relationships are that can exist between those systems." An ontology model for cars might define the characteristics of a carburetor, and describe the properties of fuel, and what air is. It would know that the carburetor blends air and fuel, and would use that knowledge to flesh out a computer's search for information about any of those things. Index hundreds of components and relationships, and it is easy to see how complex an ontology could become.
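
The carburetor example can be reduced to a toy model: a set of concepts with characteristics, a list of relationship types the ontology allows, and a lookup that widens a search to related concepts. The Python below is a sketch under those assumptions. The characteristic values, the relation name "blends" and the function names are invented for illustration, and real ontologies are written in dedicated languages rather than in code like this.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A thing the ontology talks about, with its characteristics."""
    name: str
    characteristics: dict = field(default_factory=dict)


# Illustrative concepts; the characteristic values are placeholders.
CONCEPTS = {
    "carburetor": Concept("carburetor", {"kind": "engine component"}),
    "fuel": Concept("fuel", {"state": "liquid"}),
    "air": Concept("air", {"state": "gas"}),
}

# Relationship types the ontology accepts as valid.
VALID_RELATIONS = {"blends"}
RELATIONS = []


def add_relation(subject, relation, obj):
    """Record a relationship, rejecting anything the ontology does not define."""
    if relation not in VALID_RELATIONS:
        raise ValueError(f"unknown relation: {relation}")
    if subject not in CONCEPTS or obj not in CONCEPTS:
        raise KeyError("both ends of a relation must be known concepts")
    RELATIONS.append((subject, relation, obj))


add_relation("carburetor", "blends", "air")
add_relation("carburetor", "blends", "fuel")


def expand_query(term):
    """Return the term plus every concept directly related to it."""
    related = {term}
    for subject, _, obj in RELATIONS:
        if subject == term:
            related.add(obj)
        elif obj == term:
            related.add(subject)
    return sorted(related)
```

A search for "fuel" is widened by expand_query to include the carburetor, and a search for "carburetor" picks up both air and fuel. Every new component adds concepts and relations, which is where the complexity comes from.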
 
Could Google ever produce a semantic web search engine? The web's information would need to be coded semantically first, and that won't happen overnight. But companies are using semantic web technology to help them understand and use information of their own, behind the firewall.
 
Eric Miller, president of semantic information management product vendor Zepheira, says that it can be useful in tying together information sources across different functions within a company. "It wraps together traditional data stores such as your CRM system, your personnel directory and your project system," says Miller, who headed up the World Wide Web Consortium's semantic web effort before starting Zepheira. "It creates a set of interconnected data. You can begin to tie together people, projects, and documents, all in a web of data."
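
What Miller describes can be pictured as pulling records out of separate systems and restating them as links in one graph. The Python sketch below assumes three tiny in-memory data stores; every record, field name and relation name in it is hypothetical, and it says nothing about how Zepheira's products actually work.

```python
# Three separate, hypothetical data stores.
CRM = [{"customer": "Acme Ltd", "account_manager": "alice"}]
DIRECTORY = [{"id": "alice", "name": "Alice Example", "dept": "Sales"}]
PROJECTS = [{"project": "Website relaunch", "customer": "Acme Ltd",
             "members": ["alice"], "documents": ["proposal.doc"]}]


def build_graph():
    """Restate every record as (subject, relation, object) links."""
    graph = set()
    for row in CRM:
        graph.add((row["customer"], "managed_by", row["account_manager"]))
    for row in DIRECTORY:
        graph.add((row["id"], "has_name", row["name"]))
        graph.add((row["id"], "works_in", row["dept"]))
    for row in PROJECTS:
        graph.add((row["project"], "for_customer", row["customer"]))
        for member in row["members"]:
            graph.add((row["project"], "has_member", member))
        for doc in row["documents"]:
            graph.add((row["project"], "has_document", doc))
    return graph


def linked_to(graph, node):
    """Everything directly connected to a node, whichever system it came from."""
    linked = set()
    for subject, _, obj in graph:
        if subject == node:
            linked.add(obj)
        elif obj == node:
            linked.add(subject)
    return sorted(linked)


GRAPH = build_graph()
```

Asking linked_to(GRAPH, "alice") returns her customer, name, department and project in one answer, although each fact started out in a different system.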
 
Codifying our huge information bases in this way will take time and effort. But the rewards could be huge. Having information is one thing. Having information that understands what it is and how it fits together is quite another.