About Clueray
Clueray, Inc. was created to improve the way knowledge is extracted from the Internet. Limitations of current information retrieval methodologies are a major barrier both to improving the performance of existing systems and to developing next-generation capabilities such as auto-discovery and agent-based data mining. By allowing applications to understand the manner in which information is presented in web documents, Clueray’s technology addresses important limitations of current approaches and is a key enabler for the next generation of search and search-based solutions.



Background
A key limitation of conventional information retrieval methodologies is that they focus almost entirely on the words in a document, largely ignoring a rich and important source of information: the manner in which information is presented to the reader. Unfortunately, without an understanding of how the information is presented, current approaches are virtually incapable of performing fundamentally important tasks like distinguishing between a home page and a product overview, or separating body text from ad text. As a result, extracting information from the Internet remains a messy, labor-intensive process.
To emphasize the point, consider the Mona Lisa. One could describe the Mona Lisa as a painting that contains the colors green, tan, and black. But any description of this painting that doesn’t refer to form – the manner in which the colors are presented – clearly leaves a lot of ambiguity. This is much like describing the United States Constitution simply as a document that contains the words “we the people.” In fact, because keyword-based approaches don’t consider form, a Google (or Yahoo!, or …) query for the keywords “we the people” can’t distinguish between the preamble to the Constitution and the home page for We The People, LLC (“save on the high cost of legal fees!”), though the two results quite obviously meet very different information needs.

Though (a) and (b) contain the same colors, their form clearly communicates very different things. It’s the same with the two web pages (c) and (d): they contain the same words (“we the people”), but the form tells us they meet very different information needs.
The point is that, without the ability to understand the manner in which information is presented in documents, it is very difficult for search to move beyond the mixed collections of documents of widely varying relevance returned by keyword-based techniques. Perhaps more importantly, without this ability, next-generation capabilities like auto-discovery, agent-based data mining, web-targeted business intelligence, and media monitoring will be very difficult to realize. This is where Clueray comes in …

Technology
Clueray’s patent-pending AutoFocus™ technology addresses the limitations of classical information retrieval methodologies by characterizing the manner in which information is presented in web documents. A unique data model allows us to automatically identify the different visual elements of a web document (e.g., body, title, comments, headers, footers, and ads), and our categorization technology uses features extracted from this data model to match documents to specific information needs. Clueray’s approach is complementary to conventional techniques: it can be used to provide structure and context to large, unstructured collections of web documents, and as an enabling technology for automated discovery or search. In a nutshell, our technology allows search and search-based applications to focus their analyses on the “right parts” of the “right documents” and dramatically improve the quality of their results.
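To make the general idea concrete, here is a minimal sketch in Python of what it means to look at where a phrase appears rather than only whether it appears. It is purely illustrative and is not Clueray's data model or the AutoFocus implementation: the tag-to-region mapping, the `RegionParser` class, the `regions_containing` helper, and the two toy pages are all assumptions made up for this example, and real pages would need far more than a lookup on tag names.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to presentation regions.
# An assumption for illustration only, not Clueray's data model.
REGION_TAGS = {"title": "title", "header": "header", "footer": "footer", "aside": "ad"}


class RegionParser(HTMLParser):
    """Collects text per presentation region instead of one bag of words."""

    def __init__(self):
        super().__init__()
        self.stack = ["body"]  # default region when no region tag is open
        self.regions = {}      # region name -> list of lowercased text chunks

    def handle_starttag(self, tag, attrs):
        if tag in REGION_TAGS:
            self.stack.append(REGION_TAGS[tag])

    def handle_endtag(self, tag):
        # Leave a region only when its own closing tag is seen.
        if tag in REGION_TAGS and len(self.stack) > 1 and self.stack[-1] == REGION_TAGS[tag]:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.regions.setdefault(self.stack[-1], []).append(text.lower())


def regions_containing(html, phrase):
    """Return the sorted names of the regions in which the phrase occurs."""
    parser = RegionParser()
    parser.feed(html)
    parser.close()
    phrase = phrase.lower()
    return sorted(
        region
        for region, chunks in parser.regions.items()
        if any(phrase in chunk for chunk in chunks)
    )


# Two toy pages that both match the keyword query "we the people".
constitution = (
    "<html><head><title>Constitution of the United States</title></head>"
    "<body><p>We the People of the United States, in Order to form "
    "a more perfect Union</p></body></html>"
)
company = (
    "<html><head><title>We The People, LLC</title></head>"
    "<body><p>Save on the high cost of legal fees!</p></body></html>"
)

print(regions_containing(constitution, "we the people"))  # ['body']
print(regions_containing(company, "we the people"))       # ['title']
```

A plain keyword match treats the two pages as equivalent hits. Recording the region in which the phrase occurs yields a feature that separates them: in the first page the phrase is part of the body text, while in the second it is only the name in the page title.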

Current Status
Clueray is currently engaged in active conversations with prospective partners and licensees for its unique content extraction and categorization technology. If you are interested in learning more about Clueray, please email us.