Tuesday 21 April 2009

FAST ESP Deployment Considerations - partner briefing


Dimensioning - Overview:

So, why is dimensioning important?

- It lets us avoid bottlenecks that would otherwise clog up the system.

- It provides fault tolerance to handle the failure of a subsystem. We will look at optimal fault tolerance configurations a little later.

- By working out exactly where we require redundancy, we can minimise costs without compromising service. Is the freshness of content imperative, for instance? If not, we can drop document processing redundancy to save costs.

The following terms are typically used to describe the main components of a FAST ESP configuration:

SERVER:

One machine may run several nodes of the same or different types. It is, however, not recommended to run more than one Search node on one machine because of the heavy CPU, RAM and disk load this entails.

NODES:

Examples include Document Processor node, Indexer node, Search node.

As a general rule, running several nodes of the same type on a single machine is not recommended unless it has been verified that the nodes will not become a performance bottleneck.

An important exception is the Document Processor node, for which deploying more than one instance on the same machine is often recommended.

Search Cluster:

The search cluster is the combination of rows and columns that make up the search index. It is modelled on the grid computing matrix model.

Adding more columns and more rows is a common way to accommodate growing system requirements.

Increasing the number of columns provides greater indexing capacity.

Increasing the number of rows provides a higher queries-per-second (QPS) rate and also provides fault tolerance.

N.B: The Indexer nodes are allocated one per column.

N.B: Each column has its own partitions.
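
To make the grid picture concrete, here is a minimal Python sketch of a search cluster as rows and columns. The class and its field names are illustrative only and are not part of FAST ESP. It assumes that every row holds a complete set of columns, so the number of Search nodes is simply rows times columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCluster:
    """Hypothetical model of a search cluster as a rows x columns grid."""

    columns: int  # more columns -> more indexing capacity
    rows: int     # more rows -> higher query rate and fault tolerance

    @property
    def search_nodes(self) -> int:
        # Assumption: one Search node per cell of the grid.
        return self.rows * self.columns

    @property
    def indexer_nodes(self) -> int:
        # Indexer nodes are allocated one per column.
        return self.columns


cluster = SearchCluster(columns=3, rows=2)
print(cluster.search_nodes, cluster.indexer_nodes)
```

Since running more than one Search node per machine is not recommended, the Search node count in this sketch is also a rough lower bound on the number of search machines.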

Dimensioning – Scoping:

The scoping of a search deployment is affected by several different factors. To dimension a system properly, you need a good understanding of the system requirements with respect to these factors.

Content Volume:

- What is the number of documents to be indexed? Will they all fit on a single box (roughly 8 million documents)?

- What is the average document size? If the documents are very large, we may not fit 8 million on a single box. If they are quite small, we may well squeeze more on.

- How do we project the content growth over the next few years? How much headroom do we need to provision?
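
The content volume questions above can be turned into a rough column estimate. The sketch below is a back-of-the-envelope calculation, not a FAST sizing tool: the function name, the compound-growth model and the parameter names are assumptions, and the default of 8 million documents per column is only the rule of thumb quoted above, which shifts with average document size.

```python
import math


def columns_needed(doc_count: int,
                   annual_growth: float,
                   years: int,
                   docs_per_column: int = 8_000_000) -> int:
    """Estimate how many columns are needed to hold the projected content.

    annual_growth is a fraction (0.2 means 20% per year), compounded yearly.
    docs_per_column is an assumed capacity; lower it for large documents.
    """
    projected_docs = doc_count * (1 + annual_growth) ** years
    return max(1, math.ceil(projected_docs / docs_per_column))


# 20 million documents today, growing 20% a year, planned over 3 years.
print(columns_needed(20_000_000, annual_growth=0.2, years=3))
```

With these example figures the projected volume is about 34.6 million documents, which needs 5 columns at 8 million documents each.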

Content Dynamics:

- Will we require real-time or batch indexing? How fresh does the content need to be? Instantaneous, hourly or daily updates?

- What tolerance do we have for searchable latency? That is, from when a document is captured to when it is made searchable, how long is tolerable? How much hardware do we need to throw at latency?

- How often do the documents change? Is it possible to partition the document set into groups with very different update rates or indexing latency requirements? With archive indexing, the indexers consume fewer system resources because only a few updates are expected for an archive system.

Query Rate:

- What is the current peak query rate? How many rows will be required to achieve the necessary QPS?

- Do you experience seasonal fluctuations in usage? How do we provision for the Christmas period with a retailer, for example?

- What are the query growth projections? How much headroom do we need to provision for this?
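
The query rate questions above map onto a row count in a similar way. This is a hedged sketch: the function and all parameter names are hypothetical, and qps_per_row in particular is not a FAST-published figure. It has to come from benchmarking a single row with your own content and feature set.

```python
import math


def rows_needed(peak_qps: float,
                qps_per_row: float,
                seasonal_factor: float = 1.0,
                growth_headroom: float = 0.0,
                spare_rows: int = 0) -> int:
    """Estimate how many rows are needed to serve the target query rate.

    seasonal_factor scales today's peak for busy periods (2.0 = double).
    growth_headroom is a fraction (0.25 = 25% extra for projected growth).
    spare_rows adds rows purely for fault tolerance.
    """
    target_qps = peak_qps * seasonal_factor * (1 + growth_headroom)
    return max(1, math.ceil(target_qps / qps_per_row)) + spare_rows


# Peak of 40 QPS today, doubling at Christmas, 25% growth headroom,
# an assumed 25 QPS per row, plus one spare row for fault tolerance.
print(rows_needed(40, qps_per_row=25, seasonal_factor=2.0,
                  growth_headroom=0.25, spare_rows=1))
```

In this example the target is 100 QPS, which needs 4 rows for capacity and 5 with the spare row.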

Content Characteristics:

- What is the format of the content? How complex do we need to make the pipelines, and how many stages do they need? Is it pre-structured XML or database content, or do we need to generate the structure?

- How much of the content is actually indexable, and how much of it is formatting noise or images that do not need to be brought into the index? The more content indexed, the bigger the index and the storage requirements.

- Where exactly does the content reside, and how are we going to connect to it and extract it?

Feature Set:

- What linguistic features are required: lemmatization, synonym expansion, spell checking, categorisation? The more content-side linguistics, the larger the footprint, as we inflate the target words. The more query-side linguistics, the more memory and CPU required.

Availability Requirements:

- Which of the subsystems require fault tolerance? Should the system be fail-safe, fail-soft or fail-stop?

Deployment Considerations – 3 Latency Dimensions:

When considering latency, it is important to evaluate three separate dimensions.

Indexing latency – the time between content being submitted from the client and that content being reflected in searches.

Document Processing latency – the time it takes for content to pass through the document processing pipeline.

Search latency – the time from when a user hits the search button to when the corresponding results are displayed.
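
As a rough illustration of how the three dimensions could be measured, the sketch below derives each latency from timestamps. It is only a sketch under stated assumptions: the class, its field names and the search_latency helper are hypothetical, and how you capture the timestamps depends on your own client and logging.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentTimeline:
    """Hypothetical timestamps (in seconds) for one document."""

    submitted: float            # client submits the content
    processing_started: float   # document enters the processing pipeline
    processing_finished: float  # document leaves the processing pipeline
    searchable: float           # content is first reflected in searches

    @property
    def document_processing_latency(self) -> float:
        # Time spent passing through the document processing pipeline.
        return self.processing_finished - self.processing_started

    @property
    def indexing_latency(self) -> float:
        # Time from submission until the content shows up in searches.
        return self.searchable - self.submitted


def search_latency(query_sent: float, results_shown: float) -> float:
    """Time from the user hitting search to seeing the results."""
    return results_shown - query_sent


timeline = DocumentTimeline(submitted=0.0, processing_started=1.0,
                            processing_finished=3.5, searchable=10.0)
print(timeline.document_processing_latency, timeline.indexing_latency)
print(search_latency(query_sent=2.0, results_shown=2.25))
```

Tracking the three values separately shows which part of the system the hardware should go to: document processors, indexers or search rows.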
