Body Text Segmentation

One of the things we’ve been spending a lot of time on over the past weeks is our body text segmentation capability. This capability is a critical first step for any application which does text processing. As one might infer, the purpose of this tool is to separate the body text of an article, blog post, message board entry, etc., from the headers, footers, sidebars, inset boxes, ads, captions, comments, and other unrelated content which may appear in a document.

bodytextexamplesmaller.png

This is a very hard problem to solve because there are no standards as to how content is represented in HTML.  Without going into detail on the perils of HTML (we probably said enough about that here), let’s just say we’re pretty sure you’re not going to solve the problem (at web scale) if you approach it in HTML space. 

Before we go into more detail, let’s take a quick aside and address one question we often hear:  RSS delivers the body in a structured data set, why not just use that?   Three quick responses (of course, depending on your application, your mileage may vary).  First of all, a feed has to exist (obviously) for RSS to be considered. Secondly, many of the “more interesting” sources’ feeds deliver only partial content, so you don’t get the full picture. And thirdly, completeness:  the International Center for Media and the Public Agenda at the University of Maryland published an interesting study on this subject.  One of the more interesting conclusions was “.. that the RSS feeds provided by most news outlets work very poorly for anyone who uses news as more than an entertainment medium.”

Our “secret sauce” starts with transforming the HTML representation of a document into a structured space where the document elements and their geometric relationships fall out naturally.  Once in this space, we can look at computed attributes like the “flow” of the text and very easily (in a few milliseconds) pull out the body text.
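To make the idea of a geometric space a little more concrete, here is a minimal Python sketch. It is an illustration only, not the actual Clueray representation or algorithm: the `Box` type, the overlap threshold, and the "column with the most text" rule are all assumptions chosen to show how "flow" can fall out of geometry once each element has a position and size.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A rendered page element (hypothetical type; widths are assumed positive)."""
    x: float
    y: float
    width: float
    height: float
    text: str


def h_overlap(a: Box, b: Box) -> float:
    """Horizontal overlap of two boxes as a fraction of the wider box's width."""
    shared = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    return max(shared, 0.0) / max(a.width, b.width)


def group_columns(boxes, min_overlap=0.8):
    """Walk the boxes top to bottom and group horizontally aligned ones into columns."""
    columns = []
    for box in sorted(boxes, key=lambda b: b.y):
        for col in columns:
            # Compare against the most recent (lowest) box in each column.
            if h_overlap(col[-1], box) >= min_overlap:
                col.append(box)
                break
        else:
            columns.append([box])
    return columns


def main_flow(boxes):
    """Return the column carrying the most text, used here as a stand-in for the body."""
    return max(group_columns(boxes), key=lambda col: sum(len(b.text) for b in col))


# Made-up page layout: header, two body paragraphs, an ad beside them, and a footer.
page = [
    Box(0, 0, 960, 80, "Site header"),
    Box(0, 100, 600, 200, "First paragraph of the article. " * 5),
    Box(620, 100, 300, 250, "Advertisement"),
    Box(0, 320, 600, 200, "Second paragraph of the article. " * 5),
    Box(0, 540, 960, 60, "Footer links"),
]
body = main_flow(page)
```

In this toy layout the header and footer line up with each other, the ad sits alone, and the two paragraphs form the column with the most text. Nothing in the grouping step looks at tags, which is the point of leaving HTML space.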

There are, of course, lots of nuances to the problem. One interesting issue relates to how text fragments are handled when they appear at the beginning or end of the body. In some cases, these fragments correspond to subheads (with or without corresponding font changes), in others, typos, and, in a few cases, egregious violations of Strunk and White! Depending on your application, you may or may not wish to include these fragments in your analysis. Because of this, our algorithms can be set to be “greedy” or “conservative” with respect to these elements and can flag them so they don’t confuse the downstream analyses.
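The greedy/conservative switch can be sketched roughly as follows. This is a hypothetical illustration, not our production code: the `looks_like_fragment` heuristic (short text with no terminal punctuation), the `min_words` threshold, and the `mode` parameter name are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    is_fragment: bool = False  # set when the block is an edge fragment


def looks_like_fragment(text: str, min_words: int = 8) -> bool:
    """Assumed heuristic: short text that does not end like a sentence."""
    stripped = text.strip()
    return len(stripped.split()) < min_words and not stripped.endswith((".", "!", "?"))


def select_body(texts, mode="conservative"):
    """Flag fragments at the start and end of the body, then keep or drop them by mode."""
    blocks = [Block(t) for t in texts]
    # Flag leading fragments, stopping at the first full paragraph.
    for block in blocks:
        if not looks_like_fragment(block.text):
            break
        block.is_fragment = True
    # Flag trailing fragments the same way, working backwards.
    for block in reversed(blocks):
        if not looks_like_fragment(block.text):
            break
        block.is_fragment = True
    if mode == "greedy":
        return blocks  # keep everything; fragments stay flagged for downstream code
    return [b for b in blocks if not b.is_fragment]


sample = [
    "Company news",
    "The company announced the changes in a statement to its employees.",
    "The changes will take place over three years, the company said.",
    "Related links",
]
greedy = select_body(sample, mode="greedy")
conservative = select_body(sample, mode="conservative")
```

In greedy mode all four blocks come back with the first and last flagged, so a downstream analysis can decide what to do with them. In conservative mode only the two full paragraphs are returned.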

Below are a couple of examples which illustrate some of the trickier aspects of the problem. In the first, from www.bbc.com, the body starts with a paragraph in a different font than the balance of the article. In the second, from www.infoworld.com, the inset box is an HTML sibling of the text above, below, and to the right of it. In HTML space, that makes it very difficult to differentiate between the ad text that states “Smart Ways to Grow Small Business IT …” and the body text that says “HP announces 24,600 layoffs …”.

 

bodytextexamplebbc1.png

Body text example where the first paragraph is in a different font than the rest of the body (text identified as body text is highlighted in green in the right image).

 

bodytextexampleiw.png

Body text segmentation example where the inset box (in this case, an ad) is an HTML sibling of the text above, below and to the right (text identified as body text is highlighted in green in the right image).

Site Updates (redux)

OK, the latest round of site changes is now live (actually, it went live around midnight last night). Hopefully, these changes will give you a better idea of what we have been up to over the past several months. The search demo is still there, but is now demoted to here. Well, demoted is probably the wrong term, but it’s clearly no longer on the home page. The content extraction demo is still a work-in-progress, and will hopefully go live soon (the pieces are all there, we just need to wire them to an appropriate UI).

We’re very excited about the progress we’ve made over the past several weeks on the content extraction bit, particularly as it relates to Body Text extraction.  More on that when the demo goes live.

Meanwhile, we continue to crank away on the technology and several partnerships that are currently in-process.  You’ll hear more soon!

Site Updates!

As of about an hour ago, the most recent round of site updates is now live. This update focused on bringing our updated messaging to the “static” pages (About, Technology, etc.), as well as some minor edits to the home page. The net is, we’ve refined the way we talk about what we’re doing and the markets we’re focusing on, and our site has lagged a bit in this regard. No longer.

The next update (coming in a week or two) will add a demo for our automated content extraction technology, forcing our search demo to (eeek!) share its screen space on the home page. The search demo will continue to be enhanced and will remain available for all to play with, but the URL will change a bit (from www.clueray.com to www.clueray.com/search, or something similar).

Time Flies …

It has been over a month since the last post, but lots has been happening on both the technology side and the commercial side of the business. We’ll post details soon, but that last post was starting to look lonely, so we wanted to put something new up now.

So, keep coming back!  You can expect a pretty major site update in the near future, which will (hopefully) go a long way towards explaining what we’ve been up to these past weeks!

Traffic Growth, Ups and Downs

Traffic has been steadily increasing over the past weeks, more or less doubling every four weeks (though, judging by the past week, growth may be picking up even faster than that). Thanks to all who have stopped by to check us out and use our beta system. Special thanks to those who have played with the system and left comments :-).

With the increased traffic has come a couple of unexpected interruptions in service for the search beta (specifically sometime before ~7pm EST last night and before ~10pm tonight). We’re actively combing the logs to track these things down.  Sorry about the inconvenience!  If you do happen to experience something unexpected, don’t hesitate to contact us at feedback@clueray.com.

Once again, thanks for checking us out.  And, stay tuned, as more features are on their way soon!

Segmenting Web Documents

When we started Clueray, we knew that finding a robust solution to the segmentation problem was going to be key to accomplishing our goals. The basic premise behind the first version of our IntentMatch technology is that web documents are designed to communicate specific kinds of information. That design purpose, the Document Intent, is reflected in the manner in which information is presented in what we call the Content Display Area (CDA) of the document. So, one key thing our Segmentation algorithm does is find the CDA of a web document: the part of the document where the unique content is presented.

Here’s a simple example of the segmentation algorithm finding the CDA of a blog (the segmented CDA is highlighted in yellow):

segmentation-example-620.jpg

Once we’ve found the “interesting” part of the page, we can do all sorts of interesting things, like categorize the content or extract key features (titles, dates, links, etc).  But I digress.  We’ll talk more about what we do with the segmented CDA in a later post.

Why is Segmentation a Difficult Problem?

Like a lot of problems that relate to huge, unstructured data sets (like the web), document segmentation is pretty easy for a human to do, but very difficult to automate with a high degree of precision. Why is this? Well, there are a lot of reasons, but a key one is the “flexibility” of HTML as a layout specification.

Of course, HTML wasn’t intended to be a layout specification language, so that’s why you can have two completely different HTML files which look exactly the same when rendered in the browser:

html-examples.jpg

So, solving the segmentation problem isn’t just about using standard libraries to parse HTML into a DOM structure and then looking for tags.  Of course, that’s part of it, but it’s just the first step. In fact, philosophically, solving the segmentation problem is much more like an image analysis problem. That is, you want to analyze how the content is being presented: the rendered image, not the structure of the HTML.
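As a small illustration of why looking for tags is not enough, the Python sketch below parses two made-up snippets with the standard library's `html.parser`. The snippets and the `TextAndTags` helper are invented for this example. They carry the same visible text, yet their tag structures have nothing in common, so a rule keyed to specific tags would work on one and fail on the other.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collect the sequence of start tags and the visible text of a snippet."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def summarize(html):
    """Return (list of start tags, visible text joined by spaces)."""
    parser = TextAndTags()
    parser.feed(html)
    parser.close()
    return parser.tags, " ".join(parser.chunks)


# Two hypothetical ways of marking up the same headline and paragraph.
TABLE_LAYOUT = (
    "<table><tr><td><b>Headline</b></td></tr>"
    "<tr><td>Body text here.</td></tr></table>"
)
DIV_LAYOUT = (
    '<div><h1>Headline</h1>'
    '<div class="story"><p>Body text here.</p></div></div>'
)

table_tags, table_text = summarize(TABLE_LAYOUT)
div_tags, div_text = summarize(DIV_LAYOUT)
```

Here `table_text` and `div_text` are identical, while `table_tags` and `div_tags` share no tag names at all.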

Of course, computationally, you can’t afford to do image analysis on each rendered document. So the challenge, like with all complex computational problems, is coming up with an appropriate data representation that facilitates the kind of analysis you want to do.  So, we’ve invested a lot of time and effort in creating a data structure (part of our patent-pending methodology) that works extremely efficiently for its targeted use. 

 

Web Intelligence: Monitoring and Measuring

We’ve been spending a lot of time thinking and talking to people about monitoring and measuring technologies. These are the guys that crawl the web to spot market trends, to measure the effectiveness of ad campaigns, to monitor brand reputation, to find information to drive financial transactions, etc. One executive used the term “Web Intelligence” to describe these apps, quite consciously alluding to the familiar field of Business Intelligence (BI).

Some of these apps focus on social media, others on more traditional media sources. Many have some degree of focus on both. All share the challenge that you and I face when searching the internet: separating the “wheat from the chaff” (separating the signal from the noise, finding the needle in the haystack, or … pick your favorite metaphor :)) to find quality results which meet a specific information need.

Of course, just the sheer size of the web makes the problem daunting. Add the lack of structure, and the problem gets that much more interesting. Some have chosen to use a high degree of human involvement, others have opted for complete automation. Technologies like RSS can help address some of the challenges, but not all (the most obvious limitation of RSS being discovery: you can’t find it in the feed if there isn’t a feed or you don’t know the feed exists!).

It’s clear to us (and others) that there is an interesting opportunity for our technologies to be applied in this space.  Both our segmentation (finding interesting parts of a document) and categorization (identifying the type of document) are useful in seeding the analyses these companies do.

Of course, we look forward to working with many of these companies to focus their analyses and improve the quality of results they deliver for their customers. Independent of that, though, it’s an exciting space to watch. Case in point, check out this growth in the social media monitoring space alone: the last market report from SocialTarget, whose Guide to Social Media Analysis is “the worldwide reference to the companies who monitor, measure and analyze online social media”, identified 31 vendors in the space. The new version of the report identifies over 100.

Feedback!

Thanks to all those who have provided feedback so far.

To date, we’ve received a wide range of feedback: from layout and accessibility issues relating to our SERP, to suggestions as to different verticals which we could align our business with.  We’ve also seen suggestions as to how to better manage our blog, and, of course, partnership inquiries (we especially love those partnership inquiries :-) ).  Great stuff! 

We really appreciate the insights and thoughtful suggestions.  Keep ‘em coming!

SearchMe launches private beta

If you track the search space at all, I’m sure you already have heard about SearchMe launching their private beta.  With $31MM from Sequoia and others, they certainly have the war chest to make an interesting go of it.

Understandably, a lot of attention is being given to the sexy UI (very similar to the new iPhone/iPod interface for browsing through album covers), but I think people are missing the key point.  To that end, here are some comments I posted over at John Battelle’s Searchblog:

Yes, there is lots of room for innovation in the manner in which search results are presented, and this interface does look compelling. But what’s most interesting here (to me anyway) is not the slick visual interface, but the deep integration of categorization.

A slick UI, on its own, isn’t going to get us beyond the limitations of keywords. Let’s face it: the current keyword-and-link-driven search delivers, at best, a mixed bag of results — “here are ten links which may or may not work for you, you figure it out from here.”

Scan through a typical set of search results from Google (or Yahoo! or Live, or …) and you see many different kinds of documents (articles, homepages, directories, …) with information about completely different domains and a wide variety of authority and quality as well. Unfortunately, this is as far as the “Bag of Words” approach and the inherent ambiguity of keywords can take us.

The addition of categorization (one can imagine multiple types of categorization that would be useful: domain, document type, authority, …) is interesting because it allows the user to express their intent with a far greater degree of precision. This greater degree of precision results in a much greater likelihood that the results delivered will meet the information need.

I haven’t played with SearchMe, so I don’t know how effective their solution is. But I like the direction.

For obvious reasons, we very much like the idea of categorization being key to search.  This will be fun to watch as it plays out.
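To show what expressing intent through categories could look like in practice, here is a minimal Python sketch. It is an assumption-laden toy and not a description of SearchMe’s system or ours: the `Result` fields, the category labels, and the `focus` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    domain: str    # hypothetical subject-domain label
    doc_type: str  # hypothetical document-type label


def focus(results, **intent):
    """Keep only results whose labels match every facet named in the declared intent."""
    return [
        r for r in results
        if all(getattr(r, facet) == value for facet, value in intent.items())
    ]


# Made-up keyword results: same query, very different kinds of documents.
results = [
    Result("https://example.com/a", "music", "article"),
    Result("https://example.com/b", "music", "directory"),
    Result("https://example.org/c", "finance", "article"),
]
music_articles = focus(results, domain="music", doc_type="article")
```

A keyword query alone returns all three documents. Declaring the intent as a music article narrows the list to the first one.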

Assume an Index

Clueray’s Recommendation Engine approach (obviously) assumes the existence of a keyword-based search engine to supply an unabridged list of seed results as the basis for the recommendations. This isn’t to say that keyword-based search isn’t an interesting problem, but it obviously is one which has already been addressed by lots of very smart people. So, when choosing where it makes most sense for us to allocate our resources, building yet another massively-scaled information utility (where efficiency is measured in Watts per billion queries) doesn’t fall very high on our list :).

At the risk of being repetitive, Clueray’s value-add lies in providing users and search-based apps the ability to declare their intent and focus results with much more precision than is possible by using keywords alone – thus our focus on segmentation (finding “interesting” regions in documents), categorization, and quality assessment.

This isn’t to say that we won’t ever implement our own keyword-based search engine (Lucene anyone?), but it seems to us that the future of keyword-driven technology looks a lot more like a few massively scaled “information utilities” than hundreds of upstarts with limited CPU cycles.