Researchers chart Web stats
KYLE ANDREWS
The World Wide Web, lauded for its pivotal role in communications, still remains, in part, a mystery. Although the number of documents on the Web is accepted to be around 800 million, the actual topography of the Web has not been seriously charted.
Two Notre Dame graduate students and one professor are working to change that.
"We are studying the properties of very large networks," said Associate Professor of Physics Albert-László Barabási, of his work with graduate students Réka Albert and Hawoong Jeong. "The scientific community has ignored this area of complexity. We don't have real data on these networks."
"We wanted to find examples of systems where we could find the topology," said Barabási. "The inhomogeneity of the Web has not been properly quantified yet."
"There are a finite number of Web pages," added Jeong. "If we know the topography, we can use that information to make a model of the entire Internet."
What they found was that the growth of the Web follows a power-law distribution, typical of social or biological systems. The power law is typically found in self organizing systems like magnetic fields or plant growth, not what one would expect from a random network.
A typical mathematical model of a random network would be a bell curve. If this were true, then it would follow that a majority of Web sites would have a moderate number of links, while there would be a few sites that would have either many or a few links. But this was found to be false.
"There is a very large number of sites with very many connections," said Barabási. Additionally, there are many sites that have a small number of links, thus making the bell curve inapplicable.
This allowed the researchers to develop a model of the Web that could be used to project the shape of the entire Web. What they found was that, although the Web is very large, any two randomly selected documents are separated by an average of 19 links.
This information regarding the "diameter" of the Web is important for many reasons. Most practically, it will help search engine developers make more efficient search engines.
Brute force, namely identifying information by matching strings, is inefficient in searching the entire Web. The search engine with the most coverage, Northern Light, is only able to search 38 percent of the published documents. A more discerning search engine would be able to make use of the interconnectedness of the Web, allowing for greater and quicker coverage.
In order to come to their findings, Jeong first developed a robot that analyzed web sites and their links. Using the Notre Dame Web site as an example, he found a relationship between the size of the web site and its connectedness.
Taking this information regarding link distribution, Albert was able to construct a model that reflected the actual Web.
"With the model, we can study the distance between two points in the Internet", says Albert. If the Web is taken as 800 million documents and plugged into the model, the shortest distance between two random points was found to be 18.59, or an average of 19 clicks.
Additionally, the determined logarithmic function of the Web increases slowly, which means that even the projected 1,000 percent increase in Web size over the next few years will not dramatically change the diameter. In fact, the number of clicks would only rise to 21 from 19.
All News Stories for Monday, September 20, 1999