

January 2005 Marketing Archives

Friday, January 07, 2005

Link Reform To The Rescue

Okay, I can't wait any longer. I've had this in the works for a couple of months, and now that Andy has started a similar thread I'm forced to release my findings. There's a major issue with how search engine spiders crawl sites and which sites they crawl. The robots.txt file is supposed to address this issue but has yet to do so entirely, so in this post I'm proposing a new spec, or an amendment to the W3C robots.txt spec. I'll set up a site to garner support for the spec shortly.

The current robots.txt standard is failing search engines and visitors alike. Its limitations waste the bandwidth and resources of search engine crawlers and web hosting providers, and they result in an unacceptable level of duplicate content in the major search engines' indexes, despite the engines' best efforts to detect and eradicate it. I know this because it's the number one issue I see every day: clients suffering automatic spam penalties from the search engines for content that isn't spam.

It's common practice for a web site owner to have multiple domains (aliases) resolving to one main site (the primary site). It is even more common for these aliases not to employ an HTTP 301 or 302 redirect. If any redirect is used at all, it is more often than not one of the following:

  1. JavaScript redirect (often flagged as spam)
  2. Meta refresh (again, often flagged as spam). Update: as of mid-November 2004, Yahoo treats a refresh of 0 seconds as a 301 redirect and a refresh of 1 to 15 seconds as a 302 redirect.
  3. A redirect within a Flash movie
  4. A programmatic redirect employing JSP, CFM, PHP, ASP, etc.

Worse still, many times no redirect mechanism is used at all. This is especially true in shared hosting environments. Site owners just have their web host set up a DNS record for each alias domain, point those aliases at the same IP/machine/virtual directory as the main site, and think everything is just super-duper. Doh! That's when the automatic penalties begin and rankings drop.

Even when an HTTP 301 or HTTP 302 redirect is used, it is oftentimes merely pointed at another page that employs yet another redirect. Education on the proper use of redirects is anemic at best, and even most search engine marketers don't fully understand how to employ one correctly. Additionally, it simply isn't feasible for the majority of an SEM's clients to make the necessary server-side changes (301, 302, etc.) that tell a search engine spider what to index and what not to index.
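As a quick sanity check, a site owner (or their SEM) can see exactly what each alias returns before a spider does. Here's a minimal Python sketch; the host names are the same placeholders used in the examples later in this post, and error handling is omitted. http.client does not follow redirects, so a HEAD request shows exactly what each alias answers with.
=====================================
import http.client

def check_alias(host):
    # HEAD request to the root URL of the alias domain
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("HEAD", "/")
    resp = conn.getresponse()
    location = resp.getheader("Location") or "(none)"
    print("%s: HTTP %d, Location: %s" % (host, resp.status, location))
    conn.close()

for alias in ["mainurl.com", "www.mainurl.net", "aliasurl.com"]:
    check_alias(alias)
=====================================
A healthy alias answers with a 301 and a Location header pointing at the primary site; a 200 with no Location header means the alias is serving up duplicate content.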

Even if a web site owner changes the robots.txt file within the web root of their site, their problems aren't solved. The robots.txt exclusion standard doesn't allow them to designate alias domain names they don't want spidered. Additionally, the W3C states there can be only one robots.txt per site, and it must reside in the root directory or it will be ignored.

But wait, there is a META robots tag that can be implemented on individual pages; doesn't that solve the problem? In short, no.

Even if a web site owner can change the <meta name="ROBOTS"> tag and tell a spider not to index a particular page, that can't be applied to a specific domain, since each alias serves up the exact same content: every aliased domain uses the same files as every site hosted in that virtual directory. This can be changed programmatically, but again, most site owners don't have access to the necessary tools or resources to accomplish such a task.

The syntax is <meta name="ROBOTS" content="NOINDEX, NOFOLLOW">, and the allowed terms for content are ALL, INDEX, NOFOLLOW, and NOINDEX.
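For what it's worth, here's a rough Python sketch of that programmatic workaround, written as a WSGI app. The host name is a placeholder, and the approach (serving NOINDEX on any host other than the primary domain) is my own illustration, not part of any spec.
=====================================
# Emit a NOINDEX robots meta tag whenever the request arrives on an
# alias host instead of the primary domain.
PARENT_HOST = "www.mainurl.com"  # hypothetical primary domain

def application(environ, start_response):
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    # Primary domain gets indexed normally; every alias gets NOINDEX.
    robots = "ALL" if host == PARENT_HOST else "NOINDEX, NOFOLLOW"
    body = ('<html><head><meta name="ROBOTS" content="%s"></head>'
            '<body>Served for %s</body></html>' % (robots, host)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html"),
                              ("Content-Length", str(len(body)))])
    return [body]
=====================================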

The limitations of robots.txt aren't a new issue at all; they have been brought to the attention of the W3C several times before. But change is needed now. As the internet continues to grow and evolve, there are more and more web pages to be indexed. Some of them are duplicates, but even more are not. For the major search engines to find and index these new pages, they need to know which pages or domains they can skip, and the only way for them to know that is for us to tell them. If internet users are going to get the absolute best results for their searches, something must be done, and it must be done now.

In order to quantifiably determine the current impact of this duplicate content issue, we must have solid data to look at. We'll need to find out:
  • The percentage of domains that employ the robots.txt standard (a sampling sketch follows after this list)
  • The percentage of domains currently serving out similar content
  • The percentage of URLs (from each search engine) flagged as duplicate content that continue to get crawled despite the flag
Update: I've been told by an undisclosed Google source that less than 2% of Google's index consists of duplicate content. Even that seemingly small number equates to 160,000,000 pages of duplicate content in Google's 8,000,000,000-page index. I think that figure alone is enough to raise a few eyebrows.
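To make the first of those measurements concrete, here's one rough way the robots.txt adoption number could be sampled. It's a Python sketch; the domain list is a tiny stand-in for a real, much larger sample, and anything that errors out is counted as not having the file.
=====================================
import urllib.request

def has_robots_txt(domain):
    try:
        with urllib.request.urlopen("http://%s/robots.txt" % domain,
                                    timeout=10) as resp:
            return resp.status == 200
    except OSError:
        # DNS failures, timeouts, 404s, etc. all count as "no file"
        return False

domains = ["mainurl.com", "aliasurl.com"]  # hypothetical sample
adoption = sum(has_robots_txt(d) for d in domains) / len(domains)
print("robots.txt adoption in sample: %.0f%%" % (adoption * 100))
=====================================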

Once we have accurate data we can show, without a shadow of a doubt, that this change is a vital and necessary step in the evolution of the internet.

One Possible Solution
I propose an additional file that resides in the root directory of a web site. The file name is not important (linkreform.txt, anyone?) but its functionality is critical. This file should address the single biggest issue facing search relevance: duplicate content, and the effect duplicate content is having on the search industry as a whole.

Possible Syntax A
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01a 2004/11/15 01:33:07 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main URL: the preferred crawler starting point and the domain
# to be used in search results
Parent-Domain: www.mainurl.com

# First Alias - non www version of url
Alias-Domain: mainurl.com

# Second Alias - .net version of url
Alias-Domain: www.mainurl.net
Alias-Domain: mainurl.net

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl.com
Alias-Domain: aliasurl.com

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl.net
Alias-Domain: aliasurl.net

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl-a.com
Alias-Domain: aliasurl-a.com

# Additional Alias - completely different domain name
Alias-Domain: www.aliasurl-a.net
Alias-Domain: aliasurl-a.net
=====================================
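To show how trivial this format would be for a crawler to consume, here's a minimal Python sketch of a parser. The Parent-Domain and Alias-Domain field names come straight from Syntax A above (and apply equally to Syntax B below); the exact parsing rules (# comments stripped, case-insensitive field names) are my own assumptions, since no formal grammar has been written yet.
=====================================
def parse_linkreform(text):
    parent = None
    aliases = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments/whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip().lower()
        if field == "parent-domain":
            parent = value
        elif field == "alias-domain" and value:
            aliases.append(value)
    return parent, aliases

with open("linkreform.txt") as f:
    parent, aliases = parse_linkreform(f.read())
# A spider would start its crawl at `parent` and skip (or
# canonicalize) any URL whose host appears in `aliases`.
=====================================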


Here's another proposed format that simply states the main URL the spiders should crawl, and lets owners anonymously point several domains to the same place without giving their competitors any information about their aliases.
Possible Syntax B
=====================================
# linkreform.txt for http://www.mainurl.com/
#
# $Id: linkreform.txt,v 1.01b 2004/11/15 01:38:11 jdowdell
#
# Identify Main URLs That Should Be Crawled
#
# Main URL: the preferred crawler starting point and the domain
# to be used in search results
Parent-Domain: www.mainurl.com
=====================================

The easiest implementation would be for the W3C to amend the robots.txt specification and allow the following lines to be added to the file.

Possible Syntax C
=====================================
# robots.txt for http://www.mainurl.com/
#
# $Id: robots.txt,v 1.01b 2004/11/15 01:41:23 jdowdell
#
# Main URL: the preferred crawler starting point and the domain
# to be used in search results
User-agent: *
URL-to-crawl: www.mainurl.com
=====================================
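However the directive is spelled, the crawler-side behavior is the same: assuming the file has been parsed as in the earlier sketch, every URL discovered on an alias host gets canonicalized back to the parent host before indexing. A small Python illustration with placeholder host names:
=====================================
from urllib.parse import urlparse, urlunparse

def canonicalize(url, parent, aliases):
    parts = urlparse(url)
    if parts.netloc.lower() in aliases:
        parts = parts._replace(netloc=parent)  # swap alias for parent
    return urlunparse(parts)

print(canonicalize("http://aliasurl.com/products.html",
                   "www.mainurl.com",
                   {"aliasurl.com", "www.aliasurl.com", "mainurl.net"}))
# -> http://www.mainurl.com/products.html
=====================================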


By implementing this new standard we could...
  1. Reduce the bandwidth consumed by all major search crawlers
  2. Reduce the resources needed to power major crawlers
  3. Reduce the cost of hosting a web site and the demand on individual web site resources
  4. Reduce the number of pages appearing in a search engine index that carry the same content on different domains
  5. By doing no. 4, let search engines focus more finely tuned efforts on thwarting the practice of publishing duplicate content as an effort to rank higher
  6. Increase end-user satisfaction rates by decreasing the amount of noise in typical search results
  7. Facilitate more accurate results across all major search engines by reducing the number of duplicate-content pages from non-spammers in their indexes


Possible Side Effects: Financial and Sociological (both good and bad)
  1. A reduction in the amount of PPC revenue generated by search engines, since there would be more relevant results in the natural section.
  2. Conversely, it may increase PPC revenue, since results would be more accurate.
  3. Society isn't ready for the "less is more" approach just yet, since most internet users don't know the difference between natural results and paid listings.
  4. Search engines would save money on overhead by using fewer resources for crawling, and web hosting providers would save money on bandwidth, since fewer requests would be made.
  5. It could completely backfire, and engines that support it could lose face with visitors and advertisers.



Tim Bray has made some recommendations for changes previously as well.

Related articles:
  • Tag Issues
  • And Another

Some more previously proposed changes are here; this one deals with the problem of having multiple names for the same content. Most sites can be referenced by several names.

To avoid duplication, crawlers usually canonicalize these names by converting them to IP addresses. When presenting the results of a search, it is desirable to use the name instead of the IP address. Sometimes it is obvious which of several names to use (e.g., the one that starts with www), but in many cases it is not. The robots.txt file should have an entry that states the preferred name for the site.


Those recommendations were proposed by Mike Frumkin and Graham Spencer of Excite on 6/20/96, but nothing (that I know of) has been done as of yet.

Back in 1996, during a breakout session on additions to the robots.txt standard, Martijn Koster of AOL made some good points, the gist being that simpler is better and robots.txt is simple. That report reminds me of many meetings I've taken part in where everyone has great ideas and the issues are brought up, but nothing ever comes of it.

I'm proposing a call to action, and I will create linkreform.com as a hub of sorts to garner support and feedback from internet users and developers alike who want to see something happen [if there is enough interest in doing so]. Whether it's through the W3C or a group effort by a bunch of nobodies isn't my concern. My goal is to get something accomplished that helps web site owners who aren't as technically adept as the average Marketing Shift reader, and toward that goal I will work.

Suggestions, Thoughts, Comments
If you would like to provide feedback on these ideas, please just submit a comment on this post for the time being. If there is enough support, we'll create a site dedicated to fixing this issue. A few charter members of the W3C have pledged their support for this initiative. When I asked Tim whether or not I was nuts, he responded with this.
=================================================
Me: Am I nuts with this idea?
Tim: The idea is not nuts.
Tim: Also check out http://www.w3.org/2001/tag/issues.html#siteData-36
=================================================
Update: I had previously stated that Tim Bray had "pledged his support" and I misstated that. My apologies to Tim. Tim said that if he has any good ideas on how to promote this idea he will pass them on. I need to get some sleep.

Link Reform To The Rescue By Jason Dowdell at 12:30 PM
Comments (1)

Blog Ads On Rough Times? I Don't Think So!

MediaPost ran a story about one blogger [Evan Coyne Maloney] who's having a difficult time getting ads he feels are "meant for his conservative audience" served on his blog by AdSense. His complaint centered on ads that were more liberal in nature; more specifically, he cited an example that commented on the recent presidential election:
"How can 59,054,087 people be so dumb?"
I think MediaPost did a good job researching the issues with AdSense, but I don't agree with this blogger's complaints. After all, he has the power to turn off the domains whose advertisements he doesn't want to show. Yeah, it will take some time, but doesn't everything worth doing take a little time to perfect?

Blog Ads On Rough Times? I Don't Think So! By Jason Dowdell at 10:06 AM
Comments (0)

Thursday, January 06, 2005

Role Reversal: Blogs Start Advertising

In an interesting twist today, MarketingVox [an online blog covering the advertising and media-buying industry] ran an ad in the eMarketer newsletter. The interesting part is that MarketingVox is a blog, and blogs aren't known for advertising. Everyone's talking about ways blogs might generate revenue, but nobody's talking about blogs running their own advertising campaigns.

I think it's a pretty good sign blogs are maturing, even though I believe we're at least 12 months away from having a full-blown industry in blogs. By that I mean an industry in which advertisers understand how to spend, where to spend, and what to expect. Oh yeah, we'd need an industry watchdog as well (it sure as hell ain't gonna be me).



Update: Here are Tig's thoughts. He's the blogger at MarketingVox.

"I think you'll see more and more trade advertising done by blogs because some blogs have such well-groomed, focused audiences that they can charge higher CPMs than the traditional trade press."

"I also see blog-structured trade publications doing more and more innovative marketing arrangements with other blogs and trade publications. Perhaps it's a cultural thing. Where traditional trade pubs tend to view their competition from a distance, viewing these enemies through squinty eyes, blogs tend to view their competition as friends. MarketingVOX headlines run at the bottom of MarketingProfs' homepage with an ad placement that both companies sell jointly. This isn't the sort of thing you're going to see come out of a meeting with Crain and VNU executives."
Definitely some interesting points from Tig on a new type of advertiser / publisher relationship.

Role Reversal: Blogs Start Advertising By Jason Dowdell at 11:41 AM
Comments (0)

Tuesday, January 04, 2005

Folksonomies & Toddlers

Yes, that's right, I said toddlers. It seems like I learn more from my soon-to-be-2-year-old daughter every day. Just the other day I asked her what my name was, and she said "daddy." Then I said "I'm Jason," to which she responded "No, daddy." Then I asked her if daddy is a boy or a girl, and she said "daddy."

With all this talk about folksonomies and bottom-up classification, I figured I'd try to bring it down to a 2-year-old's perspective and see what comes of it.

From this little Q&A session with Piper I gleaned a fundamental truth: throughout our entire lives we tend to classify information, and during that classification process we go deeper and deeper until we forget where the classification and organization began. My 2-year-old knows I'm her daddy, and she knows my name is Jason, but she only calls me daddy. When I attempted to better understand her depth of categorization by asking whether I was a boy or a girl, I realized she just defaulted to daddy.

Piper can classify me logically in the following ways.
  1. Daddy
  2. Daddy = Jason
  3. Daddy = fun (usually)

Piper can classify me emotionally in the following ways.
  1. Love
  2. Security (I'm her shelter when she's scared)

Piper can classify the Wiggles in the following ways.
  1. Fun
  2. Sing
  3. Dance
  4. Dorothy : Captain Feathersword : Wags : Big Red Car : etc...

From what I can tell about Piper so far (and what I can remember as I type this post), she's already beginning to classify me and everything else in ways that make sense to her. That seems to be the most important part of classifying information: perspective. Without the proper perspective, we'd be placing white elephants in the "other white meats" category and Linux in the poultry category (since its logo is a penguin).

Folksonomies & Toddlers By Jason Dowdell at 10:29 AM
Comments (0)

Monday, January 03, 2005

Matt Cutts On Google Suggests

Today I used the new Google Suggest tool to perform a vanity check on my name. When typing in "Jason Dow" I soon realized that the tool's results weren't in alphabetical order, nor were they ordered by the number of search results associated with each term. I quickly read over the FAQs looking for an explanation and couldn't find one. Even the Google Groups thread about the tool didn't shed any light on the question.

So I went to the source and asked Matt Cutts how the sort order is generated. Here's his response, quoted with his permission.

"I don't have much extra information on Google Suggest, but I know that suggestions aren't based on personal search history, according to http://labs.google.com/suggest/faq.html . I wouldn't be surprised if the order of the suggestions is determined by how often searches were queried at some point in the past. I'm basing that off this quote from the FAQ: 'For example, Google Suggest uses data about the overall popularity of various searches to help rank the refinements it offers.'"


So there you have it, directly from the horse's mouth.

Matt Cutts On Google Suggests By Jason Dowdell at 10:31 PM
Comments (0)

Google Scandal On 60 Minutes

Jason Kottke is really hot under the collar after the 60 Minutes piece on Google last night. He's upset because 60 Minutes used the following quote from John Battelle...

"If anybody got a Porsche or a Ferrari right now at Google, they'd probably be drummed out of the company"

I agree that it's a completely false statement from John, but I don't know that it's that big of a deal. Yes, there are people exercising their stock options on expensive toys, but just because John Battelle says there isn't anyone doing that doesn't mean 60 Minutes is stupid. It just means John isn't as "in touch with Google" as 60 Minutes thinks he is. They probably chose him because of his popular blog and his smooth looks.

What I find much more important is that 60 Minutes asked Sergey what he's purchased, and all he said was a new t-shirt. Then they sensationalized that and combined it with John's statement about frowning on employees exercising their options. I'd love to swap notes with Kottke and get his input on exactly why John's quote has him so hot.

Update: Shellen informs me this is an inside joke among A-list bloggers and it's a 1999 thing. Ever have that Napoleon Dynamite "I feel so stupid, gosh!" feeling? Yeah, me too.

Google Scandal On 60 Minutes By Jason Dowdell at 06:28 PM
Comments (0)

Microsoft Will Never Catch Google

One reason Microsoft will never catch Google in the search race is that Microsoft isn't a verb. Even if you try to use "MSN" or "MSN Search" or "Microsoft" or even "Longhorn" as a verb, it doesn't work. When was the last time you "Microsofted" something? When was the last time you "Longhorned" someone? Geesh, that sounds like pornography. Now, on the other hand, when was the last time you Googled somebody? Or Googled a subject? See the difference?

When you use Google as a verb you are immediately excited and happy. When you use Microsoft or Longhorn as a verb you don't get the same joy.

Honestly, it all ties back to brand loyalty and I don't think Microsoft has brand loyalty. At least not the same kind of brand loyalty as Google. People want to use Google but they're forced to use Microsoft.

Happy Googling!

Microsoft Will Never Catch Google By Jason Dowdell at 04:55 PM
Comments (0)


