About Clueray’s AutoFocus™ Technology
HTML is “Difficult”
But “Bag of Words” Works (kind of …)
AutoFocus™ Makes it Easy to Reason About Web Documents
Example Applications
Content Area Extraction
Body Text Extraction
Body Text (RSS Feed) Validation
Content Filtering
Classification, Quality Metrics, Spam Filtering and Much, Much More …
For current information retrieval and data mining techniques to work on the web, they have to make simplifying assumptions about the way information is communicated in web documents. Because the manner in which content is presented is itself a rich source of information, these assumptions significantly limit the relevance, accuracy, and level of detail of the information that can be extracted from web pages. Clueray’s AutoFocus™ technology addresses these limitations by allowing applications to incorporate a detailed understanding of how information is presented in web documents as they perform their analyses. This capability allows for greater understanding of web content and facilitates more accurate and sophisticated reasoning about documents on the web.
HTML is “Difficult”
Knowing how and where specific content appears in a document is clearly relevant to understanding and reasoning about its contents. Unfortunately, achieving this level of understanding is very hard because there are no consistently followed standards for how document elements (e.g., title, body, captions, comments, etc.) are represented on the web. The situation is even more challenging than that: HTML is so flexible that, for a given document element, the number of valid HTML encodings that correspond to any specific visual presentation of that element is practically unlimited. This means that even though you may be able to identify an interesting part of one document, that knowledge has essentially no relevance when it comes to finding the same part of a different document (see Figure 1).
Figure 1. Although two documents may look quite similar in a browser, their underlying HTML structures can be quite different. This is one big reason why finding document elements in “HTML Space” is very difficult.
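The point of Figure 1 can be shown with a minimal Java sketch. It assumes two made-up page fragments and uses a naive regular-expression tag stripper (a stand-in for illustration, not a real HTML parser and not part of AutoFocus™). The class name HtmlEncodings and the method visibleText are hypothetical.

```java
final class HtmlEncodings {
    /** Naive tag stripper for illustration only; real pages need a real HTML parser. */
    static String visibleText(String html) {
        return html.replaceAll("<[^>]*>", " ")   // replace every tag with a space
                   .replaceAll("\\s+", " ")      // collapse runs of whitespace
                   .trim();
    }

    public static void main(String[] args) {
        // Two hypothetical encodings of the same heading and paragraph.
        String pageA = "<h1>Quarterly Results</h1><p>Revenue rose.</p>";
        String pageB = "<div class=\"t\"><font size=\"6\"><b>Quarterly Results</b></font></div>"
                     + "<table><tr><td>Revenue rose.</td></tr></table>";

        // The visible text matches even though the markup does not.
        System.out.println(visibleText(pageA).equals(visibleText(pageB)));
        System.out.println(pageA.equals(pageB));
    }
}
```

A rule such as "the title is the text inside h1" finds the title in the first fragment and nothing in the second, which is why rules learned in HTML space rarely transfer between sites.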
But “Bag of Words” Works (kind of …)
To date, the usual way to address these issues has been to (more or less) ignore the presentation context and treat the contents of web documents as if everything appeared in the same manner. This “bag of words” approach certainly has some utility, as evidenced by the popularity of today’s keyword-based search engines. However, it quickly hits a wall when a high degree of precision or more sophisticated reasoning about the contents of web documents is required. Consider the following “bag of words” examples (Figure 2) and ask how easily your application could determine whether the blog post is saying negative things about your customer’s product (Hint: only one of the examples is a blog; isn’t it obvious?).
Figure 2. Ignore presentation context at your own risk! These three blocks of text correspond to the “bag of words” representation that applications currently work with. One is a home page, another a company overview, and the third is a blog entry. Can you tell which is which? Can your application?
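To make the loss of context concrete, here is a small Java sketch of a bag-of-words representation. The class name BagOfWords and the two sample strings are illustrative assumptions. The tokenizer is deliberately simple: it lowercases the text and splits on anything that is not a letter or digit.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class BagOfWords {
    /** Counts lowercase word tokens; word order and page position are discarded. */
    static Map<String, Integer> of(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {            // split can yield a leading empty token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical snippets: one could be a post headline, the other a sidebar teaser.
        String headline = "Review: the Acme widget is terrible";
        String teaser = "Is the Acme widget terrible? Review";

        // Both collapse to the same bag, so the two cannot be told apart.
        System.out.println(of(headline).equals(of(teaser)));
    }
}
```

Once text is reduced to counts like these, nothing records whether a phrase came from the body of a post, a navigation link, or an advertisement.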
AutoFocus™ Makes it Easy to Reason About Web Documents
AutoFocus™ makes it easy to reason about different document elements, their contents, their visual properties, and the spatial relationships between them. It does this by using proprietary (patent-pending) algorithms to map the contents of web documents from “HTML space” to a structured data space designed to facilitate the kinds of analysis involved in information retrieval and data mining. Put differently, AutoFocus™ takes web documents from an unstructured space, where document elements and their geometric relationships are very hard to determine, to a structured space where they fall out naturally. Your application can use this information to reason with much greater accuracy, and in new and more sophisticated ways, about the content of web documents. Because the transformation takes place at “web speed” (typically tens of milliseconds), it is feasible to apply this technology in massively scaled web apps.
Written completely in Java, our solution is platform independent and fully scalable. Depending on your application, AutoFocus™ can be deployed as a “black box” service which delivers specific content, or as “white box” libraries (or services) with full API access to the internal structures and relationships.
Figure 3. Which representation is easier for your application to reason with: the HTML version (left) or the AutoFocus™ version (right)?
Example Applications
The ability to reason in new and more sophisticated ways about the content of web documents opens up a world of new possibilities for many applications. Some applications target specific document elements with their analyses, others exclude elements with specific properties, and still others identify complex patterns of relationships between the properties of document elements, their locations, and the content itself.
Perhaps the most obvious application of AutoFocus™ is content extraction. Content extraction involves isolating one or more specific document elements and passing that information along for further analysis. Clearly, the implications of specific content appearing in a document depend on which part of the document that content appears in, but this level of detail is all but impossible to get when working in HTML space! You need AutoFocus™!
Content Area Extraction
For search applications, it is helpful to exclude potentially misleading text in headers, footers, and sidebars from the relevancy (ranking) calculation. Finding the part of the document where the core content is delivered (we call this part of the document the Content Area) is very difficult to do with high precision using HTML-based approaches, but this area is easily identified by our analysis.
Body Text Extraction
Another application of content extraction focuses on a more specific document element: body text extraction. Many kinds of social- and mainstream-media monitoring applications are interested in analyzing the body text of an article or a blog post. For the reasons described above (and many others), this task is extremely difficult to do in HTML space, especially at web scale. Unfortunately, errors in body text extraction can be very costly because of the sensitivity of the text analytics that many applications apply to the body text. Taking too much text, taking too little text, or accidentally including ad text or comments that should be associated with an image are common errors that can dramatically skew the results of text analysis. Clueray’s AutoFocus™ delivers body text with unparalleled precision and lets you focus your efforts on your competitive advantage, not on “herding cats” in HTML space.
Body Text (RSS Feed) Validation
Technologies like RSS make structured content available for some kinds of blog monitoring. However, for a variety of reasons, many blogs make only partial content available through their feeds (the latest data we’ve seen indicates that the current percentage of blogs that do this is greater than 15% and rising). Unfortunately, there is currently no “closed form” solution for looking at a feed and determining whether or not it delivers the complete body text. Applications which use AutoFocus™ can use the body text extraction capability (described above) to rapidly identify the body text and compare it to the feed data.
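One way the comparison step might look is sketched below in Java. It assumes the extracted body text and the feed text are already available as plain strings. The class FeedValidation, its methods, and the 0.9 threshold are hypothetical choices for illustration, not part of any published AutoFocus™ API. The sketch measures what fraction of the body's word tokens also appear in the feed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class FeedValidation {
    /** Splits text into lowercase word tokens. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Fraction of body tokens that can be matched one-to-one with feed tokens. */
    static double coverage(String feedText, String bodyText) {
        List<String> body = tokens(bodyText);
        if (body.isEmpty()) return 1.0;                 // nothing to cover
        Map<String, Integer> feedCounts = new HashMap<>();
        for (String t : tokens(feedText)) feedCounts.merge(t, 1, Integer::sum);
        int matched = 0;
        for (String t : body) {
            Integer left = feedCounts.get(t);
            if (left != null && left > 0) {             // consume one feed occurrence
                feedCounts.put(t, left - 1);
                matched++;
            }
        }
        return (double) matched / body.size();
    }

    /** Flags a feed as partial when it covers less than minCoverage of the body. */
    static boolean looksPartial(String feedText, String bodyText, double minCoverage) {
        return coverage(feedText, bodyText) < minCoverage;
    }

    public static void main(String[] args) {
        String body = "one two three four five six seven eight nine ten";
        String truncatedFeed = "one two three ...";
        System.out.println(coverage(truncatedFeed, body));
        System.out.println(looksPartial(truncatedFeed, body, 0.9));
    }
}
```

Token coverage is used here rather than a simple length comparison because feeds often differ from the page in whitespace and punctuation. The right threshold would need tuning against real feeds.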
Content Filtering
OK, this one is similar to Content Extraction, but we’re making it a separate item to emphasize a key point. Whether you have a search engine or a data mining application, when you try to reason about the contents of a web page, it is desirable to exclude certain types of content from your analysis. For example, if you are looking for body text, it is often very important that other types of content are not included in the analysis. AutoFocus™ categorizes key types of content so that you can automatically exclude them (if you wish). Some example content categories include advertisements, navigation bars (and navigation clusters), media elements, form fields, and many more.
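Category-based filtering can be sketched in Java as follows. The data model is entirely hypothetical: the Category enum borrows the example categories named above, and the Element class and textExcluding method are invented for illustration. None of it reflects the actual AutoFocus™ structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ContentFilter {
    /** Hypothetical categories, based on the examples named in the text. */
    enum Category { BODY_TEXT, ADVERTISEMENT, NAVIGATION, MEDIA, FORM_FIELD }

    /** Hypothetical categorized document element. */
    static final class Element {
        final Category category;
        final String text;

        Element(Category category, String text) {
            this.category = category;
            this.text = text;
        }
    }

    /** Returns the text of every element whose category is not excluded, in document order. */
    static List<String> textExcluding(List<Element> elements, Set<Category> excluded) {
        List<String> kept = new ArrayList<>();
        for (Element e : elements) {
            if (!excluded.contains(e.category)) kept.add(e.text);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Element> page = Arrays.asList(
            new Element(Category.NAVIGATION, "Home | About | Contact"),
            new Element(Category.BODY_TEXT, "The new widget ships next week."),
            new Element(Category.ADVERTISEMENT, "Buy two, get one free!"));

        // Drop ads and navigation before handing text to downstream analysis.
        System.out.println(textExcluding(page,
            EnumSet.of(Category.ADVERTISEMENT, Category.NAVIGATION)));
    }
}
```

Expressing the exclusions as a set keeps the choice with the application: a search engine might drop navigation and ads, while an ad-monitoring tool might keep only the ads.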
Classification, Quality Metrics, Spam Filtering and Much, Much More …
Sure, content extraction is great, but for many apps that’s just a first step toward the more exciting stuff. Things can get proprietary very quickly at this end of the spectrum, because the ability to reason about the visual properties of specific document elements opens up a new world of possibilities for many applications. Case in point: our search demo uses this information to do genre-based classification, compute a quality metric for each document, and (implicitly) do spam detection. Of course, other applications might use this same information in a very different way (but we can’t talk about that in much detail here :-)).
