Wednesday, July 20, 2005
BlogPulse New & Improved Blog Search Engine
Now with more cowbell, Intelliseek released a new version of their BlogPulse blog search engine. In light of the recent heat a certain blog search engine has felt from the blogosphere, ahem [Technorati], I'm going to give you a more technical and functional breakdown of the new features in BlogPulse. I'll also talk a bit about what makes BlogPulse different from Technorati, Feedster, IceRocket and every other blog search engine out there. So, not only will you get a breakdown of the cool and useful new tools in BlogPulse, you're also going to get the truth about some of the facts behind blog search that most experts don't even understand. I'm sorry but I can't allow zip codes to get in the way of the truth when it comes to the story you're being fed by the media and every A-List blog search expert out there. Nuff said, let's dive in.
Some New Features In BlogPulse
(1) BlogPulse Profiles: Permits a user to plug in the URL of a blog and get detailed information about that blog, including rank [available only for the top 10,000 blogs], rank trends, post trends, citations, citation trends, sources, and similar blogs.
(2) Daily Analytics: Top Blog Posts, Top Blogs, Top News Stories, Top News Sources that are cited in the blogosphere daily, along with RSS feeds for each.
(3) Streamlined User Interface: Tightly integrates all features of BlogPulse cohesively [trend tool, conversation tracking, profile, etc.]
(4) Backend Enhancements: Discovers and indexes more blog content each day, increased performance of all end-user tools, and a few other enhancements.
BlogPulse Competitive Advantage
When I was testing out the beta release of the new BlogPulse search engine I found myself seriously doubting the accuracy of results it was returning. It didn't matter if I was searching for blogs that cited a specific post / url or if I was doing a general text search query, the results were off. I found that, on average, BlogPulse returned fewer results than the other major blog search engines like Technorati. It didn't make sense to me and I had to get a better fundamental understanding of the BlogPulse architecture and of the other major players' architecture.
Basically, sites like Technorati, IceRocket, and Feedster are all using the same type of backend search engine as Google, Yahoo, AskJeeves, etc. They've adapted the old faithful crawl model to include pings for a couple of reasons.
(a) It saves on the bandwidth their crawlers use and on the number of servers required to act as crawlers.
(b) By only crawling a blog when it has been cited in another blog, or when they've been notified in some form that it has been updated, they're able to save resources.
Since most blog search engines started up as more of a cool idea than anything else, the ping model makes sense. It saves money, time, resources, processing power, etc. However, the ping method is only one way in which sites like Technorati know when a blog has been updated. There's also a blogger.xml file, a weblogs.com XML file, and others that show the blogs updated by each service in the last x minutes. This saves even more time and resources for the blog search engines.
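To make the ping model concrete, here's a minimal sketch of a crawl scheduler that only fetches a blog after something signals an update. This is my own illustration, not how Technorati or anyone else actually implements it. The class name, the method names and the example.com URLs are all hypothetical.

```python
from collections import deque


class PingDrivenCrawler:
    """Toy crawl scheduler: a blog is fetched only after something signals an update.

    Hypothetical illustration; not the implementation of any real blog search engine.
    """

    def __init__(self):
        self.queue = deque()   # blogs waiting to be crawled, in arrival order
        self.queued = set()    # guards against queueing the same blog twice

    def notify(self, blog_url):
        """Record a ping, a citation, or an entry from a 'recently updated' list."""
        if blog_url not in self.queued:
            self.queued.add(blog_url)
            self.queue.append(blog_url)

    def crawl_pending(self, fetch):
        """Fetch only the blogs that signalled a change; return how many were fetched."""
        fetched = 0
        while self.queue:
            url = self.queue.popleft()
            self.queued.discard(url)
            fetch(url)
            fetched += 1
        return fetched


# 1,000 known blogs, but only two of them announced an update.
known_blogs = [f"http://blog{i}.example.com/" for i in range(1000)]
crawler = PingDrivenCrawler()
crawler.notify(known_blogs[3])
crawler.notify(known_blogs[42])
crawler.notify(known_blogs[3])  # duplicate ping, ignored

fetched_urls = []
fetched_count = crawler.crawl_pending(fetched_urls.append)
print(fetched_count, "fetches instead of", len(known_blogs))
```

The savings come from the gap between the number of known blogs and the number that actually announced a change. The flip side is that a blog that never pings, and never shows up in one of the update lists, may not get crawled at all.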
Back to the thought of using old search technology in a new way. It's kind of like a one size fits all approach to searching. It's one thing to adapt a standard search engine framework to accommodate niche verticals [tweaks can be applied] but it's a completely different animal to apply the same thought process to blog search. Blogs are fundamentally different from web sites, and it takes a real understanding of those differences to return the most accurate and reliable results.
Reason For Discrepancies
BlogPulse isn't using a standard search engine at the core, it's using a technology developed specifically for blogs from the ground up. Yes there are some overlaps but BlogPulse is built upon a consistent data model [focused on blogs] and can systematically analyze large volumes of unstructured data in a timely manner.
When a search engine returns a hit count for the total number of results for a search, it's typically unreliable [as I'm sure all of you using Google, Yahoo! etc. are aware]. In the area of blogs, each aggregator / search engine out there behaves a little differently in "estimating" the number of search results for a given query. BlogPulse doesn't "estimate" the result count. Rather, the count you see returned on BlogPulse is the "actual" total count of matching posts for the given query. It takes more processing time and power to compute this number than to return an "estimated" result count, but it's actionable data.
That's a fundamental difference in BlogPulse. BlogPulse returns data you can actually use in marketing efforts.
And Why Do I Care?
I'm glad you asked.
(a) You shouldn't always believe in the total hit count displayed by blog search engines (BlogPulse is an exception). This includes Technorati.
(b) If a service wanted to provide trend charting type of analysis [a fundamental time series analysis], accurate hit counts are an absolute necessity. There are no exceptions.
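Here's a rough sketch of why point (b) matters. A trend chart is basically a ratio of two counts per day, matching posts over total posts, so any error in either count goes straight into the chart. The function name and the sample posts below are made up for illustration. This isn't BlogPulse's code.

```python
from collections import Counter
from datetime import date


def daily_share(posts, term):
    """Return {day: fraction of that day's posts containing term}, from exact counts."""
    totals, matches = Counter(), Counter()
    for day, text in posts:
        totals[day] += 1
        if term.lower() in text.lower():
            matches[day] += 1
    return {day: matches[day] / totals[day] for day in sorted(totals)}


# Made-up sample data: (publication date, post text).
posts = [
    (date(2005, 7, 18), "Trying out a new blog search engine"),
    (date(2005, 7, 18), "Weekend photos"),
    (date(2005, 7, 18), "Notes on RSS feeds"),
    (date(2005, 7, 18), "More cowbell"),
    (date(2005, 7, 19), "BlogPulse beta impressions"),
    (date(2005, 7, 19), "Lunch"),
    (date(2005, 7, 20), "BlogPulse relaunch"),
    (date(2005, 7, 20), "Comparing BlogPulse and Technorati"),
]

trend = daily_share(posts, "blogpulse")
for day, share in trend.items():
    print(day.isoformat(), f"{share:.0%}")
```

If the per-day match counts were estimates rather than actual counts, every point on the chart would carry an unknown error, and a rise or dip might just be an artifact of the estimation.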
Beware Of Anecdotal Evidence
Bloggers out there who are now citing blog link counts from Technorati, Bloglines, Feedster, BlogDigger etc. [like Scoble has been] should be wary of issues like this, potentially lurking beneath the surface.
Let Me Give You An Analogy
I'd say BlogPulse is like running a SQL statement that fetches the exact number of blue widgets a company has in stock, whereas traditional search is like returning the number of inventory items that have %widg% in their description and then applying an inflationary factor to increase the count and the implied quality / size of the results returned.
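If you want to see the analogy run, here's a small sketch using Python's built-in sqlite3 module. The inventory table, its rows and the inflation factor of 2 are all invented for the example. No search engine has published a factor like that, and this is only meant to show the difference between the two kinds of count.

```python
import sqlite3

# In-memory toy inventory: one row per item in stock (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (description TEXT, color TEXT)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?)",
    [
        ("widget", "blue"),
        ("widget", "blue"),
        ("widget", "red"),
        ("widget holder", "blue"),
        ("widget bracket", "green"),
        ("gadget", "blue"),
    ],
)

# Exact count: only items that really are blue widgets.
(exact,) = conn.execute(
    "SELECT COUNT(*) FROM inventory WHERE description = 'widget' AND color = 'blue'"
).fetchone()

# Fuzzy count: anything with 'widg' in the description, whatever the color.
(fuzzy_rows,) = conn.execute(
    "SELECT COUNT(*) FROM inventory WHERE description LIKE '%widg%'"
).fetchone()

INFLATION_FACTOR = 2  # assumption, chosen only to illustrate the analogy
estimated = fuzzy_rows * INFLATION_FACTOR

print("exact blue widgets:", exact)
print("fuzzy matches:", fuzzy_rows, "reported as:", estimated)
```

The exact query is the only number you could take to the warehouse. The second number looks bigger, but it doesn't tell you how many blue widgets there are.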
Blog Search Engine Index Differences
BlogPulse has openly stated the types of posts they include in their index and the reasons behind them.
... BlogPulse creates a full-text search index of all of the blog entries it finds every day. You can search this index through the BlogPulse search engine.
Additionally, BlogPulse analyzes the blog data in a number of interesting ways. These methods reveal the most cited links and key people that are referred to daily in blog entries.
BlogPulse also performs a unique kind of text mining on blog data to help reveal topics and themes within blog entries every day...
How can you compare the value of BlogPulse's approach to blog data mining with that of the other key blog engines when they leave the following questions open...
(a) What links are they counting?
(b) Are blogroll links counted the same way as blog post links?
(c) Are they indexing the full text of the posts, partial text of the posts [from RSS], or what?
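Question (b) is easier to picture with an example. Below is a sketch that separates links inside a post body from links inside a blogroll, using Python's standard html.parser. It assumes the page marks those regions with div class="post" and div class="blogroll", which is purely my assumption for the demo. Real blog templates vary wildly, and I'm not claiming any engine does it this way.

```python
from html.parser import HTMLParser


class LinkClassifier(HTMLParser):
    """Sort links by the region they appear in: post body, blogroll, or other.

    Assumes (hypothetically) that regions are marked with
    <div class="post"> and <div class="blogroll">.
    """

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open <div>: its region label or None
        self.links = {"post": [], "blogroll": [], "other": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            cls = attrs.get("class", "")
            self.stack.append(cls if cls in ("post", "blogroll") else None)
        elif tag == "a" and attrs.get("href"):
            # The innermost enclosing labelled <div> decides the region.
            region = next((r for r in reversed(self.stack) if r), "other")
            self.links[region].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()


page = """
<div class="post"><p>See <a href="http://a.example/post1">this post</a></p></div>
<div class="blogroll"><a href="http://b.example/">B</a><a href="http://c.example/">C</a></div>
"""

parser = LinkClassifier()
parser.feed(page)
print(parser.links)
```

On this page an engine that counts every link reports three outbound links, while one that counts only post links reports one. Same blog, same day, two different "link counts".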
I did a bit of digging and got some closure on Technorati's process, but nothing that gives me the warm fuzzies, so I'll have to follow that up a bit here. Technorati's About Us page gives us an idea of what they index, but there are holes I'll have to poke through later.
Technorati is a real-time search engine that keeps track of what is going on in the blogosphere - the world of weblogs.
From their publishers help page and their FAQ page:
...Technorati is an automated search engine that employs robots known as "spiders" to discover content on your site and its feeds. Please ping Technorati with each addition and update to your site to prompt our spiders to index your new content...
...Technorati specializes in searching all blogs, not merely those with RSS feeds, and instead of only indexing the RSS feed (often the first few hundred words of an article), Technorati reads all of the HTML code in a blog posting, and also tracks all of the activity around a blog or post such as inbound and outbound links...
One important point: Technorati does define inbound and outbound links, but it doesn't limit them to blog posts. The definition appears to include blogrolls and the like, which may not be as important as links within the actual posts.
Inbound links refer to hyperlinks from other sources citing that weblog. Outbound links refer to hyperlinks from the weblog to outside sources.
These are some of the key questions that must be answered. BlogPulse has answered them in their FAQ section and I welcome similar explanations from the other blog search engines on this matter. Other issues I plan on addressing in the realm of blog search are timeliness of data, spam, performance, and a few others key to understanding the nuances of blog search based on the current market offerings. I may even put together a side-by-side blog search engine comparison chart if I'm really feelin' it.
Blog Search Wrapup - Where Now?
I think I've given you enough data for one blog post. I'll most certainly have more to say on this subject in the coming days and weeks. It was very important to me that the fundamental differences in the underlying technology of the blog search engines out there be uncovered. I'm not implying the major players are evil or anything of the sort, but that the blog search community [novices and experts alike] is in general ignorant on the subject. How else could someone like Robert Scoble even attempt to compare the link counts of the various engines when he's not even sure of the proper way to conduct the searches, much less how / where the data is being drawn from?
That's all for now. Feel free to email me directly if you want to engage in dialog other than a comment post on this entry.
I can be reached at jason.dowdell [at] gmail.com
By Jason Dowdell at 02:50 AM | Comments (0)