Body Text Segmentation
One of the things we’ve been spending a lot of time on over the past weeks is our body text segmentation capability. This capability is a critical first step for any application that does text processing. As one might infer, the purpose of this tool is to separate the body text of an article, blog post, message board entry, etc., from the headers, footers, sidebars, inset boxes, ads, captions, comments, and other unrelated content that may appear in a document.

This is a very hard problem to solve because there are no standards as to how content is represented in HTML. Without going into detail on the perils of HTML (we probably said enough about that here), let’s just say we’re pretty sure you’re not going to solve the problem (at web scale) if you approach it in HTML space.
Before we go into more detail, let’s take a quick aside and address one question we often hear: RSS delivers the body in a structured data set, why not just use that? Three quick responses (of course, depending on your application, your mileage may vary). First of all, a feed has to exist (obviously) for RSS to be considered. Secondly, many of the “more interesting” sources’ feeds deliver only partial content, so you don’t get the full picture. And thirdly, completeness: the International Center for Media and the Public Agenda at the University of Maryland published an interesting study on this subject. One of the more interesting conclusions was “.. that the RSS feeds provided by most news outlets work very poorly for anyone who uses news as more than an entertainment medium.”
Our “secret sauce” starts with transforming the HTML representation of a document into a structured space where the document elements and their geometric relationships fall out naturally. Once in this space, we can look at computed attributes like the “flow” of the text and very easily (in a few milliseconds) pull out the body text.
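To make the idea of working in a geometric space a little more concrete, here is a minimal sketch in Python. It is not our production algorithm. It assumes some layout step has already turned the HTML into text blocks with pixel coordinates, and it uses a deliberately crude stand-in for “flow”: the column of left-aligned blocks that carries the most text. The names `Block` and `main_column` and the `tolerance` parameter are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A laid-out text block (hypothetical structure for this sketch)."""
    x: float    # left edge, in pixels
    y: float    # top edge, in pixels
    text: str


def main_column(blocks, tolerance=10.0):
    """Return the blocks of the column carrying the most text, top to bottom.

    Blocks whose left edges lie within `tolerance` pixels of a column's
    first block are treated as belonging to that column.
    """
    if not blocks:
        return []
    columns = []  # each entry is [reference_x, [blocks in this column]]
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and abs(block.x - columns[-1][0]) <= tolerance:
            columns[-1][1].append(block)
        else:
            columns.append([block.x, [block]])
    # Crude proxy for "flow": the column with the most characters wins.
    best = max(columns, key=lambda col: sum(len(b.text) for b in col[1]))
    return sorted(best[1], key=lambda b: b.y)


# Made-up page: a navigation bar, two body paragraphs, and a sidebar ad.
page = [
    Block(x=0, y=0, text="Home | News | Sport"),
    Block(x=200, y=300, text="The second paragraph continues the story in more detail. " * 3),
    Block(x=700, y=120, text="Smart ways to grow"),
    Block(x=203, y=100, text="The first paragraph introduces the story at some length. " * 3),
]
body = main_column(page)
```

In this toy page the two paragraphs share a left edge (within the tolerance) and contain far more text than the navigation bar or the ad, so they are returned in reading order. A real page needs many more signals than a single left edge, which is exactly why the geometric representation matters.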
There are, of course, lots of nuances to the problem. One interesting issue relates to how text fragments are handled when they appear at the beginning or end of the body. In some cases, these fragments correspond to subheads (with or without corresponding font changes), in others, typos, and, in a few cases, egregious violations of Strunk and White! Depending on your application, you may or may not wish to include these fragments in your analysis. Because of this, our algorithms can be set to be “greedy” or “conservative” with respect to these elements and can flag them so they don’t confuse the downstream analyses.
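The greedy versus conservative choice can be sketched as follows. This is only an illustration under stated assumptions, not our actual implementation: the function names, the `mode` parameter, and especially the fragment test (fewer than eight words and no sentence-ending punctuation) are all hypothetical simplifications.

```python
def is_fragment(text, min_words=8):
    """Hypothetical heuristic: short text with no sentence-ending punctuation."""
    stripped = text.rstrip()
    return len(stripped.split()) < min_words and not stripped.endswith((".", "!", "?"))


def segment_edges(paragraphs, mode="conservative"):
    """Handle fragments at the start and end of a candidate body.

    Returns (body, flagged). In "greedy" mode the edge fragments stay in
    the body; in "conservative" mode they are dropped. In both modes they
    are reported in `flagged` so downstream analyses can treat them
    specially.
    """
    if mode not in ("greedy", "conservative"):
        raise ValueError("mode must be 'greedy' or 'conservative'")
    start, end = 0, len(paragraphs)
    while start < end and is_fragment(paragraphs[start]):
        start += 1
    while end > start and is_fragment(paragraphs[end - 1]):
        end -= 1
    flagged = paragraphs[:start] + paragraphs[end:]
    if mode == "greedy":
        return list(paragraphs), flagged
    return paragraphs[start:end], flagged


# Made-up candidate body with a fragment at each end.
candidate = [
    "A short subhead",
    "This is the first full paragraph of the article, and it ends properly.",
    "This is the second full paragraph, which also ends with a period.",
    "Trailing fragment without punctuation",
]
conservative_body, conservative_flags = segment_edges(candidate, "conservative")
greedy_body, greedy_flags = segment_edges(candidate, "greedy")
```

Both modes flag the same two fragments; they differ only in whether those fragments remain part of the returned body.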
Below are a couple of examples that illustrate some of the trickier aspects of the problem. In the first, from www.bbc.com, the body starts with a paragraph in a different font than the balance of the article. In the second, from www.infoworld.com, the inset box is an HTML sibling of the text above, below, and to the right. In HTML space, it’s very difficult to differentiate between the ad text that states “Smart Ways to Grow Small Business IT …” and the body text that says “HP announces 24,600 layoffs …”.

Body text segmentation example where the first paragraph is in a different font than the rest of the body (text identified as body text is highlighted in green in the right image).

Body text segmentation example where the inset box (in this case, an ad) is an HTML sibling of the text above, below and to the right (text identified as body text is highlighted in green in the right image).

