Site Updates!

As of about an hour ago, the most recent round of site updates is now live.  This update focused on bringing our updated messaging to the “static” pages (About, Technology, etc.), as well as some minor edits to the home page.  The net is, we’ve refined the way we talk about what we’re doing and the markets we’re focusing on, and our site has lagged a bit in this regard. No longer.

The next update (coming in a week or two) will add a demo for our automated content extraction technology, forcing our search demo to (eeek!) share its screen space on the home page.  The search demo will continue to be enhanced and will remain available for all to play with, but the URL will change a bit (from www.clueray.com to www.clueray.com/search, or something similar).

Time Flies …

It has been over a month since the last post, but a lot has been going on, on both the technology side and the commercial side of the business. We’ll post details soon, but that last post was starting to look lonely, so we wanted to put something new up now.

So, keep coming back!  You can expect a pretty major site update in the near future, which will (hopefully) go a long way towards explaining what we’ve been up to these past weeks!

Traffic Growth, Ups and Downs

Traffic has been steadily increasing over the past weeks, more or less doubling every four weeks  (though the way traffic has been the past week, the growth pattern may be switching from linear to geometric).  Thanks to all who have stopped by to check us out and use our beta system. Special thanks to those who have played with the system and left comments :-). 

With the increased traffic has come a couple of unexpected interruptions in service for the search beta (specifically sometime before ~7pm EST last night and before ~10pm tonight). We’re actively combing the logs to track these things down.  Sorry about the inconvenience!  If you do happen to experience something unexpected, don’t hesitate to contact us at feedback@clueray.com.

Once again, thanks for checking us out.  And, stay tuned, as more features are on their way soon!

Segmenting Web Documents

When we started Clueray, we knew that finding a robust solution to the segmentation problem was going to be key to accomplishing our goals.  The basic premise behind the first version of our IntentMatch technology is that web documents are designed to communicate specific kinds of information.  That design purpose, the Document Intent, is reflected in the manner in which information is presented in what we call the Content Display Area (CDA) of the document.  So, one key thing our Segmentation algorithm does is find the CDA of a web document: the part of the document where the unique content is presented.

Here’s a simple example of the segmentation algorithm finding the CDA of a blog (the segmented CDA is highlighted in yellow):

segmentation-example-620.jpg

Once we’ve found the “interesting” part of the page, we can do all sorts of interesting things, like categorize the content or extract key features (titles, dates, links, etc).  But I digress.  We’ll talk more about what we do with the segmented CDA in a later post.

Why is Segmentation a Difficult Problem?

Like a lot of problems that relate to huge, unstructured data sets (like the web), document segmentation is pretty easy for a human to do, but very difficult to automate with a high degree of precision.  Why is this? Well, there are a lot of reasons, but a key one is the “flexibility” of HTML as a layout specification.

Of course, HTML wasn’t intended to be a layout specification language, so that’s why you can have two completely different HTML files which look exactly the same when rendered in the browser:

html-examples.jpg

So, solving the segmentation problem isn’t just about using standard libraries to parse HTML into a DOM structure and then looking for tags.  Of course, that’s part of it, but it’s just the first step. In fact, philosophically, solving the segmentation problem is much more like an image analysis problem. That is, you want to analyze how the content is being presented: the rendered image, not the structure of the HTML.
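To make the point about tags concrete, here is a minimal sketch using Python's standard html.parser module. The two page snippets are hypothetical (one laid out with a table, one with divs), and this is not our segmentation algorithm, just an illustration that the same visible content can sit behind very different tag structures:

```python
from html.parser import HTMLParser


class StructureAndText(HTMLParser):
    """Collects the sequence of opening tags and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.text.append(data.strip())


def summarize(html):
    """Return (list of opening tags, list of visible text fragments)."""
    parser = StructureAndText()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.text


# Two hypothetical pages carrying the same content (assumed for illustration).
TABLE_LAYOUT = """
<table><tr>
  <td>Home | About</td>
  <td><b>Site Updates!</b><br>The latest round of updates is live.</td>
</tr></table>
"""

DIV_LAYOUT = """
<div class="nav"><span>Home | About</span></div>
<div class="post"><h2>Site Updates!</h2><p>The latest round of updates is live.</p></div>
"""

table_tags, table_text = summarize(TABLE_LAYOUT)
div_tags, div_text = summarize(DIV_LAYOUT)

print(table_tags)               # tag sequence of the table-based page
print(div_tags)                 # tag sequence of the div-based page
print(table_text == div_text)   # the visible text fragments are identical
```

A rule such as "the content lives in the second td" or "the content lives in div.post" works for one of these pages and fails for the other, which is why looking for tags can only be a first step.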

Of course, computationally, you can’t afford to do image analysis on each rendered document. So the challenge, like with all complex computational problems, is coming up with an appropriate data representation that facilitates the kind of analysis you want to do.  So, we’ve invested a lot of time and effort in creating a data structure (part of our patent-pending methodology) that works extremely efficiently for its targeted use. 

Web Intelligence: Monitoring and Measuring

Been spending a lot of time thinking and talking to people about monitoring and measuring technologies.  These are the guys that crawl the web to spot market trends, to measure the effectiveness of ad campaigns, to monitor brand reputation, to find information to drive financial transactions, etc.  One executive used the term “Web Intelligence” to describe these apps, quite consciously alluding to the familiar field of Business Intelligence (BI).

Some of these apps focus on social media, others on more traditional media sources.  Many have some degree of focus on both.  All share the challenge that you and I face when searching the internet: separating the “wheat from the chaff” (separating the signal from the noise, finding the needle in the haystack, or … pick your favorite metaphor :)) – finding quality results which meet a specific information need.

Of course, just the sheer size of the web makes the problem daunting.  Add the lack of structure, and the problem gets that much more interesting.  Some have chosen to use a high degree of human involvement, others have opted for complete automation.  Technologies like RSS can help address some of the challenges, but not all (the most obvious limitation of RSS being discovery: you can’t find it in the feed if there isn’t a feed or you don’t know the feed exists!).

It’s clear to us (and others) that there is an interesting opportunity for our technologies to be applied in this space.  Both our segmentation (finding interesting parts of a document) and categorization (identifying the type of document) technologies are useful in seeding the analyses these companies do.

Of course, we look forward to working with many of these companies to focus their analyses and improve the quality of results they deliver for their customers.  Independent of that, though, it’s an exciting space to watch.  Case in point, check out this growth in the social media monitoring space alone:  the last market report from SocialTarget, whose Guide to Social Media Analysis is “the worldwide reference to the companies who monitor, measure and analyze online social media,” identified 31 vendors in the space.  The new version of the report identifies over 100.

Feedback!

Thanks to all those who have provided feedback so far.

To date, we’ve received a wide range of feedback: from layout and accessibility issues relating to our SERP, to suggestions as to different verticals which we could align our business with.  We’ve also seen suggestions as to how to better manage our blog, and, of course, partnership inquiries (we especially love those partnership inquiries :-) ).  Great stuff! 

We really appreciate the insights and thoughtful suggestions.  Keep ‘em coming!

SearchMe launches private beta

If you track the search space at all, I’m sure you already have heard about SearchMe launching their private beta.  With $31MM from Sequoia and others, they certainly have the war chest to make an interesting go of it.

Understandably, a lot of attention is being given to the sexy UI (very similar to the new iPhone/iPod interface for browsing through album covers), but I think people are missing the key point.  To that end, here are some comments I posted over at John Battelle’s Searchblog:

Yes, there is lots of room for innovation in the manner in which search results are presented, and this interface does look compelling. But what’s most interesting here (to me anyway) is not the slick visual interface, but the deep integration of categorization.

A slick UI, on its own, isn’t going to get us beyond the limitations of keywords. Let’s face it: the current keyword-and-link-driven search delivers, at best, a mixed bag of results — “here are ten links which may or may not work for you, you figure it out from here.”

Scan through a typical set of search results from Google (or Yahoo! or Live, or …) and you see many different kinds of documents (articles, homepages, directories, …) with information about completely different domains and a wide variety of authority and quality as well. Unfortunately, this is as far as the “Bag of Words” approach and the inherent ambiguity of keywords can take us.

The addition of categorization (one can imagine multiple types of categorization that would be useful: domain, document type, authority, …) is interesting because it allows the user to express their intent with a far greater degree of precision. This greater degree of precision results in a much greater likelihood that the results delivered will meet the information need.

I haven’t played with SearchMe, so I don’t know how effective their solution is. But I like the direction.

For obvious reasons, we very much like the idea of categorization being key to search.  This will be fun to watch as it plays out.

Assume an Index

Clueray’s Recommendation Engine approach (obviously) assumes the existence of a keyword-based search engine to supply an unabridged list of seed results as the basis for the recommendations. This isn’t to say that keyword-based search isn’t an interesting problem, but it obviously is one which has already been addressed by lots of very smart people. So, when choosing where it makes most sense for us to allocate our resources, building yet another massively-scaled information utility (where efficiency is measured in Watts per billion queries) doesn’t fall very high on our list :).

At the risk of being repetitive, Clueray’s value-add lies in providing users and search-based apps the ability to declare their intent and focus results with much more precision than is possible by using keywords alone – thus our focus on segmentation (finding “interesting” regions in documents), categorization, and quality assessment.

This isn’t to say that we won’t ever implement our own keyword-based search engine (Lucene anyone?), but it seems to us that the future of keyword-driven technology looks a lot more like a few massively scaled “information utilities” than hundreds of upstarts with limited CPU cycles.

Blog Changes

Just in case you’re confused — we are now hosting our blog ourselves.  Previously, our blog was hosted at clueray.wordpress.com and appeared here via an RSS link.  Now, we’re doing everything locally. 

Unfortunately, it will take a little time for the broken (WordPress) links to work their way out of other search engines, but we figured the sooner we make the change, the better.

Sorry for the inconvenience.

What’s a Recommendation Engine?

Clueray’s Recommendation Engine is a tool that takes as input a set of search results and delivers the subset of those results, the recommendations, which match the user’s declared intent. With the Recommendation Engine, the user indicates their intent through keywords (as with conventional search) and by selecting focus elements which are used to screen and prioritize results for presentation.

As one might imagine, focus elements can be based on a wide variety of characteristics, from visual properties like layout and quality of presentation, to semantic features derived through sophisticated analysis of document content. Focus elements could also be derived from different user characteristics, for example as provided by the user through a Facebook or LinkedIn profile.

The current version of Clueray’s Recommendation Engine uses something we call Document Intent to focus the results. Additional focus elements will be added as the public beta progresses.
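As a rough illustration of the screen-and-prioritize idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the SearchResult fields, the intent labels ("article", "directory", "homepage"), the example URLs, and the simple exact-match filter are hypothetical stand-ins, not the engine's actual data model or logic.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    keyword_score: float   # relevance reported by the underlying keyword engine
    document_intent: str   # hypothetical label such as "article" or "directory"


def recommend(seed_results, focus_intent):
    """Screen seed results by the declared intent, then prioritize by score."""
    matching = [r for r in seed_results if r.document_intent == focus_intent]
    return sorted(matching, key=lambda r: r.keyword_score, reverse=True)


# Seed results as they might come back from a keyword-based search (made up).
seed_results = [
    SearchResult("http://example.com/widgets-directory", 0.91, "directory"),
    SearchResult("http://example.com/widget-review", 0.84, "article"),
    SearchResult("http://example.com/", 0.80, "homepage"),
    SearchResult("http://example.com/widget-history", 0.62, "article"),
]

# The user supplies keywords (already applied upstream) plus one focus element.
recommendations = recommend(seed_results, focus_intent="article")
for r in recommendations:
    print(r.url)
```

The keyword engine still decides what is in the seed list; the focus element only narrows and reorders it, so the recommendations are always a subset of the original results.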

Why a Recommendation Engine?

The basic premise behind the recommendation engine is simple: keywords alone are, at best, an incomplete representation of a user’s intent. A quick scan through pretty much any collection of search results finds a variety of document types containing information of widely varying quality (both presentation and substance). By allowing users to add focus to their search, our Recommendation Engine allows users to get directly to results which meet their specific information need (vs. having to slog through pages of conventional search results to find something interesting).

Why not just call it a “Search Engine”?

Google, Yahoo, Microsoft and Ask (and AltaVista, etc. before them) have trained us well. Today, “Search” means “keyword-driven search”. Yes, at one level, it’s just a semantic issue, but what we are focusing on is what we think is an important next step: refining the search to (much) better match the user’s (or search-based app’s) intent. To us, it’s an important distinction: our Recommendation Engine’s “here are some results which meet your stated information need” vs. “here are ten links which may or may not work for you, you figure it out from here.”