Saturday 31 January 2009

Google vs. FAST relevancy

Relevancy Introduction:

Relevancy is the measure of how well a set of results answers or addresses the intent of a given query.

Relevancy is a balance of the truth, the whole truth [Recall - all documents related to the query terms] and nothing but the truth [Precision - only those documents related to the query terms].

In general, customers need to strike a balance between finding everything related to a query and finding only the documents that relate to it.

Precision is the ratio of the number of relevant records retrieved to the total number of irrelevant and relevant records retrieved.

In an e-commerce or e-directory environment, users prefer much more precise search results so that they are not swamped with too many non-specific results.

Recall is the ratio of the number of relevant records retrieved to the total number of relevant records in the index. Knowledge discovery or compliance applications will rate recall as being more important than precision; in other words, customers do not want to “miss” any important documents.
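To make the two ratios concrete, here is a minimal Python sketch that computes both for a single query. The document IDs and the helper name precision_recall are made up for illustration; they are not part of any search product.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the engine returned.
    relevant:  IDs of all documents in the index that are relevant.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant records that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# while the index holds 6 relevant documents in total.
retrieved = ["d1", "d2", "d3", "d7"]
relevant = ["d1", "d2", "d3", "d4", "d5", "d6"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.75 recall=0.50
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is the balance described above.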

GOOGLE vs. FAST Relevancy

The mechanism underpinning Google's relevancy model is called PageRank. This is a formula developed by Google to determine a web page's "inbound link ranking", that is, a ranking based on the number of web pages that link to that page or, to put it another way, the number of times that page has been cited. The purpose of PageRank is to measure a page's relative importance within a set of pages.

In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote.

Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important".
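The voting idea can be illustrated with a small power-iteration sketch in Python. This is a simplified textbook version, not Google's production algorithm; the damping factor of 0.85, the iteration count and the tiny link graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: D and F each receive exactly one link, but D's
# link comes from B, which is itself linked to by two pages.
links = {"A": ["B"], "C": ["B"], "B": ["D"], "E": ["F"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

D and F have the same number of inbound links, yet D ends up with the higher score because its single vote comes from a more "important" page.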

This algorithm works well within a network of hyperlinked documents, such as the web. However, within the enterprise we do not solely have web documents, and the documents are rarely, if ever, linked. Therefore, within the firewall Google cannot lean on the PageRank algorithm that set it apart from other search engines.

FAST's enterprise search relevancy has needed to evolve beyond Web relevancy techniques. The ranking models are now based on multi-faceted quality measurements such as context, freshness, completeness, authority, statistics, quality and geography. Each of these dimensions can be adjusted by the search business manager. For example, for company news search we may want to give a greater relative weighting to the freshness of a document, whereas with the IT intranet search we may give a greater weighting to the authority of an article, that is, how many times it has been cited or viewed.

Google offers a closed, black-box approach to relevancy. However, relevancy is not one-size-fits-all; it needs to be contextualised.

Who are the users? What are they trying to achieve? What are they interested in? What are they not interested in?

FAST provides an open, flexible relevancy model. Out of the box, a set of pre-defined relevancy model profiles is available that aligns with specific uses or audiences: site search, news, shopping, self-service, market intelligence, surveillance, etc. From these starting sets the relevancy profiles can be tuned based on user feedback or query log reporting.

A convenient way to understand the importance of relevancy models is to visualize a graphic equalizer on an audio system, which has pre-sets for audio environments such as concert hall, car, home, classical, and rock, for example.

It also allows for individual adjustment to meet the needs of the listener.

Similarly, FAST provides pre-set relevancy models where each of the parameters can be independently adjusted, and a change in one does not affect the others.
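The equalizer analogy can be sketched as a weighted sum over the ranking dimensions named earlier. This is only an illustration of the idea, not FAST's actual API or scoring formula; the function names, the weights and the per-document scores below are all hypothetical.

```python
# Ranking dimensions as listed in this post.
DIMENSIONS = (
    "context", "freshness", "completeness", "authority",
    "statistics", "quality", "geography",
)


def make_profile(**weights):
    """Build a rank profile; any dimension not mentioned keeps weight 1.0."""
    unknown = set(weights) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {dim: weights.get(dim, 1.0) for dim in DIMENSIONS}


def score(doc_scores, profile):
    """Weighted sum: changing one weight only changes that one term."""
    return sum(profile[dim] * doc_scores.get(dim, 0.0) for dim in DIMENSIONS)


# Two hypothetical "pre-sets", like equalizer settings.
news_profile = make_profile(freshness=3.0)
intranet_profile = make_profile(authority=3.0)

# Hypothetical per-dimension scores in the range 0..1.
fresh_doc = {"freshness": 0.9, "authority": 0.2}
cited_doc = {"freshness": 0.2, "authority": 0.9}

for name, profile in (("news", news_profile), ("intranet", intranet_profile)):
    print(name,
          "fresh_doc:", round(score(fresh_doc, profile), 2),
          "cited_doc:", round(score(cited_doc, profile), 2))
```

With the news pre-set the fresh document scores higher, and with the intranet pre-set the frequently cited one does, even though the documents themselves have not changed.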

Six Tips for Improving Relevancy:

  1. Understand the rank profiles by understanding the user base. What are their objectives? What are their interests? What sources do they use most often? Which of the ranking dimensions do they value most?

  1. Augment the rank models. The search business manager can alter the rank calculation assigned to documents for given queries. This can be based on user feedback, query log reporting or seasonal changes.

  1. Use linguistic tools
    1. Apply lemmatization to improve precision and recall. "Go" is expanded to "going", "gone" and also "went", unlike stemming, which fails to capture "went".
    2. Synonym expansion to improve recall. "Flat" is expanded to "apartment", "studio", "condo".
    3. Abbreviation expansion to improve recall. "ST" is expanded to "street", "RD" is expanded to "road".
    4. Acronym expansion to improve recall. "U.S.A." is expanded to "United States of America".
    5. Spell checking to prevent futile queries that return zero hits.
    6. Activate antiphrasing to remove the “noise” from the query, such as the text of the phrase “how do I”.
    7. Custom vocabularies to increase recall and allow for shortcuts. A list of company-specific terms is used to generate a dictionary.

  1. Use relative query boosting to promote the ranking score of a particular document, ensuring that it is always displayed.

  1. Test, measure and refine – use a “golden set” of well-known documents and queries to test and tune relevancy. Providers should use at least 2,000 documents and more than 50 queries.

  1. Apply entity extraction to unstructured data and dynamic drill-down to structured content.

This allows users to determine what is relevant to them and gives them hints, e.g. price, rating, availability.
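Some of the linguistic tools listed under "Use linguistic tools" can be sketched as a small query-expansion step in Python. The dictionaries below are tiny, hand-written stand-ins built from the examples in this post; a real deployment would use the search platform's own dictionaries, and the function name expand_query is hypothetical.

```python
import re

# Hand-written stand-ins for real linguistic dictionaries (assumptions).
ANTIPHRASES = ["how do i", "where can i find"]
SYNONYMS = {"flat": ["apartment", "studio", "condo"]}
ABBREVIATIONS = {"st": ["street"], "rd": ["road"]}


def expand_query(query):
    """Return one list of alternatives per remaining query term."""
    text = query.lower().strip()

    # Antiphrasing: drop "noise" phrases that carry no search intent.
    for phrase in ANTIPHRASES:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)

    terms = re.findall(r"[a-z0-9]+", text)

    expanded = []
    for term in terms:
        # Each term may match itself, a synonym, or its expanded abbreviation.
        alternatives = [term]
        alternatives += SYNONYMS.get(term, [])
        alternatives += ABBREVIATIONS.get(term, [])
        expanded.append(alternatives)
    return expanded


expanded = expand_query("How do I rent a flat on Baker St")
for alternatives in expanded:
    print(" OR ".join(alternatives))
```

Each inner list would be treated as an OR group by the engine, which is how synonym and abbreviation expansion raise recall without changing what the user typed.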
