Wednesday, November 17, 2004

Google's 8 Billion Page Index: How Unique Is It?

I noticed something interesting last week just before Google announced it had over 8 billion web pages in its index. I received a much higher than normal number of GoogleAlerts for my name "Jason Dowdell" and for a few other queries I monitor. When I looked at the alerts I noticed that several of my alias domains for globalpromoter.com were showing up in the results. These aliases were purchased well over a year ago and most of them don't 301 or 302 over to the main URL (http://www.globalpromoter.com/). Each of these aliases resolves to the same IP address and the same virtual web server root directory, so the sites are 100% identical. Additionally, since the main site uses absolute URLs rather than relative ones, there's no way Google could mistake these aliases for separate sites. My main issue here is that prior to Thursday of last week, when Google announced it had indexed over 8 billion web pages, these sites didn't show up in "site:" searches on Google. But after Wednesday many of them were returning results for "site:" searches.
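To make the missing 301 concrete, here is a minimal sketch in Python of the decision an alias host would need to make on each request. It is an illustration only, not how globalpromoter.com is actually configured: the function name `redirect_target` and the alias hostname in the usage note are hypothetical, and only the canonical hostname comes from the post.

```python
from typing import Optional

# The main URL named in the post; everything else here is an assumption.
CANONICAL_HOST = "www.globalpromoter.com"


def redirect_target(host: str, path: str = "/") -> Optional[str]:
    """Return the URL an alias should 301 to, or None if already canonical."""
    # Host headers are case-insensitive and may carry a port suffix.
    hostname = host.strip().lower().split(":", 1)[0]
    if hostname == CANONICAL_HOST:
        return None  # already on the main site, serve the page normally
    if not path.startswith("/"):
        path = "/" + path
    # Keep the requested path so deep links land on the matching page.
    return f"http://{CANONICAL_HOST}{path}"
```

A web server would send a "301 Moved Permanently" response with this value in the Location header whenever the function returns a URL. For example, a request to the hypothetical alias `alias-example.com` for `/about` would be sent to `http://www.globalpromoter.com/about`.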

Now, I don't have complete historical stats but I do have a compelling breakdown of the current state of affairs that Matt Cutts from Google has initially reviewed and promised to dig deeper into when he returns from the WebmasterWorld conference in Las Vegas. Matt was kind enough to give my data a quick look and offered up one possible explanation that makes perfect sense to me. However, I'm not going to hold him to it until he has more time to review my findings and data in greater detail.

Matt Cutts:
"...It can always take some time to establish pages or sites as duplicates of each other... I wouldn't be surprised if crawling more deeply recently led to finding these pages, but it can take time for duplication classification to happen..."

I have no doubt Google has indexed 8 billion plus web pages; that's not hard to believe at all. My main issue is whether or not the additional 3 billion pages that have been added to their index add any significantly unique content. As Matt said, they were recently "crawling more deeply", looking in every nook and cranny to find and index pages. But when you're drinking from the firehose it's hard to focus on quality, even if you're Google. As a result, I'm afraid that a significant percentage of these freshly crawled and indexed pages consist of the same or similar content as pages already found in their index. But because they reside on a different domain name, they're now included in the latest index. If that is indeed the case, then even though their index has increased by over 60%, the quality of the web pages included may have decreased.
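As a rough illustration of what "the same content on a different domain name" means in practice, the sketch below groups URLs whose page bodies are identical after whitespace normalization. It is a simplified assumption for illustration, not a description of how Google classifies duplicates: the function names and the example URLs are hypothetical, and it only catches exact copies, not pages that are merely similar.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def fingerprint(body: str) -> str:
    """Hash a page body, ignoring differences in whitespace."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: Dict[str, str]) -> List[List[str]]:
    """Return groups of URLs that share an identical fingerprint."""
    groups = defaultdict(list)
    for url, body in pages.items():
        groups[fingerprint(body)].append(url)
    # Only groups with more than one URL are duplicates of each other.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Given a main site and an alias serving the same bytes, both URLs end up in one group, while a page with different content is left out. Catching near-duplicates would need a fuzzier comparison than an exact hash.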

That's the biggest issue.

Now obviously Google wants to provide quality results to their users, otherwise the users will perform their searches elsewhere and advertisers will only follow the users. I'm just wondering if playing the "my index is bigger than your index" game won't ultimately have a negative impact on the overall quality of the search results.

I spoke with Nate Tyler, a PR person at Google, and he said that Google has no statement to issue at this time and won't be able to comment on the issue until it's looked into further. That could be as early as Friday of this week, but no promises were made. Until I hear back from Google I'll hold off on any more conclusions regarding the relevance of their increased index size. But I will give you the same data I gave Matt.

Here is the Google test data I've been working with for this blog entry. Although I limited the scope of the data represented here to a single site I have reviewed multiple sites and found the same results.

Google test case data used in my analysis

Summary of My Findings Thus Far
I believe Google is not at fault if my theory is proved true, because Google is only crawling sites it finds via links from other sites and from their submission form. But in order for Google, Yahoo, MSN and every other search engine to know which URLs are duplicates or aliases of a main URL and shouldn't be crawled, there needs to be a standard. The current limitations of robots.txt are numerous, so I'll not delve into them in this post except to say something needs to be done soon to remedy this issue. Until the search engines are told by web site owners which URLs are just aliases for a parent domain, things will remain the same. Search engines will have insanely large amounts of duplicate content to sift through and crawler bandwidth will continue to increase at exponential rates. Expect an initiative from me on this issue very soon.

If you have feedback and don't want to leave a comment you can email me directly at [jason.dowdell at gmail.com]

Posted By Jason Dowdell at 02:28 AM
