Tuesday 21 April 2009

ESP Subsystems at a high level - partner briefing.

Connector Subsystem:

The connector subsystem consists of the connectors, the file traverser and the crawlers.

This subsystem enables FAST ESP to connect to different repositories, web sites and file shares and vacuum up the underlying content.
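
To make that concrete, here is a toy sketch (in Python, not the ESP connector API) of what a file traverser conceptually does: walk a file share and hand each document on to the processing subsystem. The root path is a made-up example; the real connectors also deal with scheduling, incremental crawls and document security.

    import os

    def traverse(root):
        # Walk a directory tree and yield (path, raw bytes) pairs.
        # Illustrative only: ESP's File Traverser and connectors also handle
        # scheduling, incremental updates, security trimming and so on.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    yield path, f.read()

    if __name__ == "__main__":
        for path, data in traverse("/tmp/docs"):  # hypothetical share/mount point
            print(path, len(data), "bytes")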

Processing Subsystem:

The processing subsystem comprises the content distributor and the document processing framework.

Content is received from the connector subsystem via the content distributor, which handles batching and monitors callbacks before directing the content to the appropriate pipeline. There it is passed through a series of stages where different operations are performed on the data to make it more findable. So the input to the pipeline is raw data and the output is normalized XML, what we call FIXML.
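
As an illustration of the pipeline idea, here is a minimal sketch of stages applied in sequence, ending in a stand-in for the normalized XML. The stage names and the XML layout are invented for the example; real ESP pipelines have many more stages and the actual FIXML schema is much richer.

    from xml.sax.saxutils import escape

    # Each stage takes a document (a plain dict) and returns it enriched.
    def detect_language(doc):
        doc["language"] = "en"  # a real stage would actually detect this
        return doc

    def tokenize(doc):
        doc["tokens"] = " ".join(doc["body"].lower().split())
        return doc

    def to_xml(doc):
        # Stand-in for the normalized XML handed to the indexer; the real
        # FIXML schema is far richer than this.
        fields = "".join('<field name="%s">%s</field>' % (k, escape(str(v)))
                         for k, v in doc.items())
        return "<document>%s</document>" % fields

    PIPELINE = [detect_language, tokenize]

    def process(doc):
        for stage in PIPELINE:
            doc = stage(doc)
        return to_xml(doc)

    print(process({"id": "doc1", "body": "FAST ESP partner briefing"}))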

Indexing & Search Subsystem:

The indexing subsystem, as it is more commonly referred to, consists of the index dispatcher, the indexer and the search service. It has two high-level tasks (a toy example follows the list):

- Receiving processed content in the form of FIXML, persisting it to disk and indexing the new content.

- Serving search results from the index.
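
The toy example mentioned above: a tiny inverted index that persists what it receives and serves simple lookups. It is purely illustrative of the two tasks; the ESP indexer and search service add partitioning, ranking, real-time updates and much more. The file path is hypothetical.

    import json
    from collections import defaultdict

    class TinyIndex:
        # Toy inverted index: persist documents, index them, serve lookups.
        def __init__(self):
            self.postings = defaultdict(set)  # term -> set of doc ids
            self.docs = {}

        def add(self, doc_id, text):
            self.docs[doc_id] = text
            for term in text.lower().split():
                self.postings[term].add(doc_id)

        def save(self, path):
            data = {t: sorted(ids) for t, ids in self.postings.items()}
            with open(path, "w") as f:
                json.dump({"postings": data, "docs": self.docs}, f)

        def search(self, term):
            return [self.docs[d] for d in sorted(self.postings.get(term.lower(), []))]

    idx = TinyIndex()
    idx.add("1", "ESP partner briefing")
    idx.add("2", "connector subsystem overview")
    idx.save("/tmp/tiny_index.json")  # hypothetical path
    print(idx.search("briefing"))     # ['ESP partner briefing']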

Query/Results Subsystem:

The query and results subsystem has two high-level responsibilities (illustrated after the list):

- Match documents in the index against queries submitted

- Refine both queries and results
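
Here is a small illustration of those two responsibilities: rewrite the query on the way in and shape the result list on the way out. The term-to-document mapping is a hypothetical stand-in; the real matching happens in the search service described above.

    import string

    # Hypothetical term -> document mapping standing in for the real index.
    INDEX = {
        "esp": ["doc1", "doc2"],
        "briefing": ["doc1"],
        "connector": ["doc2"],
    }

    def refine_query(query):
        # Query-side refinement: normalise case and strip punctuation.
        return [t.strip(string.punctuation).lower() for t in query.split()]

    def match(terms):
        # AND semantics: a document must contain every query term.
        sets = [set(INDEX.get(t, [])) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []

    def refine_results(hits, limit=10):
        # Result-side refinement: here just truncation to the first page.
        return hits[:limit]

    print(refine_results(match(refine_query("ESP Briefing!"))))  # ['doc1']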

Administration Subsystem:

The admin subsystem comprises the Admin GUI, the Configuration Server, the Log Server, the Naming Service and the Search Business Centre. It is responsible for:

- Monitoring the health of the system.

- Reporting on the search experience.

- Augmenting the dynamic relevancy of documents.

- Synonym expansion to increase the size of the target result set (see the sketch after this list).

- Removing certain documents from the results list.
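
The sketch referred to above: a conceptual look at synonym expansion and at hiding blocked documents. The dictionary and block list are invented for the example; in a real deployment this kind of configuration comes from the Search Business Centre.

    # Invented synonym dictionary and block list for illustration only.
    SYNONYMS = {"laptop": ["notebook"], "cv": ["resume"]}
    BLOCKED = {"doc-obsolete-1"}  # documents to hide from the results list

    def expand(terms):
        # Widen the query with synonyms so it matches more of the target.
        expanded = []
        for t in terms:
            expanded.append(t)
            expanded.extend(SYNONYMS.get(t, []))
        return expanded

    def filter_results(hits):
        # Drop documents that have been explicitly removed from results.
        return [h for h in hits if h not in BLOCKED]

    print(expand(["laptop", "bag"]))                   # ['laptop', 'notebook', 'bag']
    print(filter_results(["doc1", "doc-obsolete-1"]))  # ['doc1']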

Subsystem Resource Usage:

This is a very useful chart to refer to when determining how to architect your system.

For the five dimensions of processing cycles, memory, disk I/O, storage space and bandwidth, we want to distribute the work across the nodes so as to create an optimal configuration. We don't want one server doing all the work and maxing out while another is left idle.

Your greediest processes are going to be the search engine, the indexer, the crawler and the QR Server.

You may want to consider running as many of these processes as possible on separate nodes.

For example, avoid crawling and indexing on the same nodes.
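
A hypothetical four-node layout that follows this advice, keeping the crawler and the indexer on different machines. The node names and role groupings are made up for illustration; this is not an ESP installation profile.

    # Made-up layout: keep the greedy processes apart where possible.
    LAYOUT = {
        "node1": ["crawler", "file traverser"],
        "node2": ["document processing"],
        "node3": ["indexer", "search engine"],
        "node4": ["qr server", "admin"],
    }

    def colocated(layout, a, b):
        # Return the nodes (if any) where two heavy roles ended up together.
        return [n for n, roles in layout.items() if a in roles and b in roles]

    print(colocated(LAYOUT, "crawler", "indexer") or "crawler and indexer are separated")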
