When we started Clueray, we knew that finding a robust solution to the segmentation problem was going to be key to accomplishing our goals. The basic premise behind the first version of our IntentMatch technology is that web documents are designed to communicate specific kinds of information. That design purpose, the Document Intent, is reflected in the manner in which information is presented in what we call the Content Display Area (CDA) of the document. So, one key thing our segmentation algorithm does is find the CDA of a web document: the part of the document where the unique content is presented.
Here’s a simple example of the segmentation algorithm finding the CDA of a blog (the segmented CDA is highlighted in yellow):

Once we’ve found the “interesting” part of the page, we can do all sorts of useful things, like categorize the content or extract key features (titles, dates, links, etc.). But I digress. We’ll talk more about what we do with the segmented CDA in a later post.
Why is Segmentation a Difficult Problem?
Like a lot of problems that relate to huge, unstructured data sets (like the web), document segmentation is pretty easy for a human to do, but very difficult to automate with a high degree of precision. Why is this? Well, there are a lot of reasons, but a key one is the “flexibility” of HTML as a layout specification.
HTML wasn’t intended to be a layout specification language, so it offers many different ways to arrive at the same visual result. That’s why you can have two completely different HTML files which look exactly the same when rendered in the browser:
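To make that point concrete without a browser, here is a minimal sketch in Python using only the standard library's html.parser. The two pages, the StructureAndText class, and the summarize helper are hypothetical examples made up for this post, not part of our system. The sketch only compares tag sequences and visible text; it does not render anything, so it illustrates the idea rather than proving that two pages look identical.

```python
from html.parser import HTMLParser


class StructureAndText(HTMLParser):
    """Records the sequence of opening tags and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []   # opening tags, in document order
        self.text = []   # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def summarize(html):
    """Return (list of opening tags, visible text joined by spaces)."""
    parser = StructureAndText()
    parser.feed(html)
    parser.close()
    return parser.tags, " ".join(parser.text)


# Hypothetical page laid out with a table.
TABLE_PAGE = """
<table><tr>
  <td>Home | About</td>
  <td><b>My Post</b><br>Hello, world.</td>
</tr></table>
"""

# Hypothetical page with the same content laid out with divs.
DIV_PAGE = """
<div class="nav">Home | About</div>
<div class="main"><h1>My Post</h1><p>Hello, world.</p></div>
"""

table_tags, table_text = summarize(TABLE_PAGE)
div_tags, div_text = summarize(DIV_PAGE)

print(table_text == div_text)   # same visible text
print(table_tags == div_tags)   # different markup
```

The two pages carry the same words, yet they share no tags at all, which is why a rule like "the content lives in such-and-such a tag" does not generalize across sites.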

So, solving the segmentation problem isn’t just about using standard libraries to parse HTML into a DOM structure and then looking for tags. That’s part of it, but it’s just the first step. In fact, philosophically, solving the segmentation problem is much more like an image analysis problem. That is, you want to analyze how the content is being presented: the rendered image, not the structure of the HTML.
Computationally, though, you can’t afford to do image analysis on each rendered document. So the challenge, as with all complex computational problems, is coming up with an appropriate data representation that facilitates the kind of analysis you want to do. So, we’ve invested a lot of time and effort in creating a data structure (part of our patent-pending methodology) that works extremely efficiently for its targeted use.
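For contrast, here is roughly what the "parse the HTML and look for tags" first step mentioned above might look like on its own. To be clear, this is not our segmentation algorithm or our data structure; it is a deliberately naive baseline, sketched in Python with the standard library's html.parser. The scoring rule (non-link characters minus link characters), the CANDIDATE_TAGS set, and the sample page are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Assumption: only these containers are considered as possible content areas.
CANDIDATE_TAGS = {"div", "td", "article", "section"}


class NaiveCdaFinder(HTMLParser):
    """Scores each candidate container: non-link characters minus link characters."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open elements
        self.link_depth = 0    # > 0 while inside an <a> element
        self.best = None       # (score, tag, attrs) of the best container so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a":
            self.link_depth += 1
        self.stack.append({"tag": tag, "attrs": dict(attrs), "text": 0, "link": 0})

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            key = "link" if self.link_depth else "text"
            self.stack[-1][key] += n

    def handle_endtag(self, tag):
        if not any(node["tag"] == tag for node in self.stack):
            return  # stray closing tag: ignore it
        # Pop until the matching element, tolerating unclosed children.
        while self.stack:
            node = self.stack.pop()
            self._finish(node)
            if node["tag"] == tag:
                break

    def _finish(self, node):
        if node["tag"] == "a":
            self.link_depth -= 1
        if self.stack:  # roll the character counts up into the parent
            self.stack[-1]["text"] += node["text"]
            self.stack[-1]["link"] += node["link"]
        score = node["text"] - node["link"]
        # Strict '>' means an inner container beats an outer one on ties.
        if node["tag"] in CANDIDATE_TAGS and (self.best is None or score > self.best[0]):
            self.best = (score, node["tag"], node["attrs"])


def find_cda(html):
    """Return (score, tag, attrs) for the best-scoring container, or None."""
    finder = NaiveCdaFinder()
    finder.feed(html)
    finder.close()
    return finder.best


# Hypothetical blog page: navigation, one post, and a footer.
PAGE = """
<div id="page">
  <div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/archive">Archive</a></div>
  <div id="post"><h1>My Post</h1><p>This paragraph is the unique content of the page.</p></div>
  <div id="footer"><a href="/rss">RSS</a></div>
</div>
"""

score, tag, attrs = find_cda(PAGE)
print(tag, attrs.get("id"))
```

On this tidy sample the heuristic picks the container with id "post". It falls apart quickly on real pages (link-heavy posts, content split across sibling containers, table-based layouts with text-heavy sidebars), which is exactly why tag-level analysis is only a starting point.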