inanzzz | Some useful elasticsearch notes

These are some notes related to elasticsearch.

An Elasticsearch index is roughly equivalent to a database, a type to a table, and a mapping to the definition of a table's fields.

If a field is set as analyzed then the field is full-text searchable.

If a field is set as not_analyzed then the field is not full-text searchable; instead, it is used for exact-value search, like the = operator in SQL.
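As a minimal sketch of the difference, the snippet below builds two request bodies in Python: a match query for an analyzed field and a term query for a not_analyzed field. The field names title and status and the search values are assumptions made up for illustration.

```python
import json


def full_text_query(field, text):
    """Match query: the text is analyzed before it is compared with the field."""
    return {"query": {"match": {field: text}}}


def exact_value_query(field, value):
    """Term query: the value is compared as-is, much like = in SQL."""
    return {"query": {"term": {field: value}}}


# "title" and "status" are hypothetical field names.
full_text_body = full_text_query("title", "quick brown fox")
exact_body = exact_value_query("status", "published")

print(json.dumps(full_text_body))
print(json.dumps(exact_body))
```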

Searching and sorting only work on fields that have been mapped. If you try to sort on an unmapped field, you'll get an error like "Parse Failure [No mapping found for [author] in order to sort on".

If a string field is used both in full-text search and in sorting then you must read String Sorting and Multifields. This helps you to define a field as analyzed and not_analyzed at the same time under different names.
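A rough sketch of such a multifield mapping, written as Python dictionaries that would be serialised to JSON, could look like the following. It assumes the older string mapping syntax these notes are based on; the field name title and the sub-field name raw are illustrative choices, not requirements.

```python
import json


def multifield_mapping(field_name, raw_name="raw"):
    """Map one string field twice: analyzed for search, not_analyzed for sorting."""
    return {
        field_name: {
            "type": "string",
            "index": "analyzed",
            "fields": {
                # The sub-field is addressed as "<field_name>.<raw_name>".
                raw_name: {"type": "string", "index": "not_analyzed"}
            },
        }
    }


# "title" is a hypothetical field name.
mapping_body = {"properties": multifield_mapping("title")}

# Search on the analyzed field, sort on the not_analyzed sub-field.
search_body = {
    "query": {"match": {"title": "elasticsearch notes"}},
    "sort": [{"title.raw": "asc"}],
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(search_body, indent=2))
```

The search then runs against title while the sort uses title.raw, so the sort order is based on the whole original string rather than on individual analyzed terms.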

If the user does not specify a sort order, elasticsearch will use "sort":[{"_score":"desc"}] by default. This is good if you're doing a full-text search, because the most relevant results come first.

The match and multi_match queries provide case-insensitive search capabilities.

If you're searching for multiple keywords in multiple fields then use the "type": "most_fields" key-value pair as part of the multi_match property.
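As an illustration, a multi_match request body with most_fields might be built like this. The keywords and the field names title and body are assumptions for the example only.

```python
import json


def multi_match_most_fields(keywords, fields):
    """Build a multi_match query that combines the scores of all matching fields."""
    return {
        "query": {
            "multi_match": {
                "query": keywords,
                "type": "most_fields",
                "fields": list(fields),
            }
        }
    }


# "title" and "body" are hypothetical field names.
most_fields_body = multi_match_most_fields("quick brown fox", ["title", "body"])

print(json.dumps(most_fields_body, indent=2))
```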

For more complex queries, use Combining Filters or Bool Query.
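A small sketch of a bool query is shown below: it combines clauses that must match, clauses that should match and clauses that must not match. The field names and values are hypothetical.

```python
import json


def bool_query(must=None, should=None, must_not=None):
    """Combine clauses into a bool query, leaving out any empty section."""
    clauses = {}
    if must:
        clauses["must"] = must
    if should:
        clauses["should"] = should
    if must_not:
        clauses["must_not"] = must_not
    return {"query": {"bool": clauses}}


# "title", "tag" and "status" are hypothetical field names.
bool_body = bool_query(
    must=[{"match": {"title": "elasticsearch"}}],
    should=[{"match": {"tag": "search"}}],
    must_not=[{"term": {"status": "draft"}}],
)

print(json.dumps(bool_body, indent=2))
```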

If you want to use database-like functions such as AVG, MIN, MAX, SUM, COUNT and GROUP BY in elasticsearch, you need to use Aggregations.
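As a rough mapping from the SQL functions to aggregations, the sketch below groups documents with a terms aggregation (the GROUP BY part) and computes avg, min, max, sum and value_count inside each group. The field names category and price and the aggregation names are assumptions chosen for the example.

```python
import json


def stats_by_group(group_field, value_field):
    """GROUP BY group_field, then AVG/MIN/MAX/SUM/COUNT of value_field per group."""
    return {
        "size": 0,  # return only aggregation results, no search hits
        "aggs": {
            "by_group": {
                "terms": {"field": group_field},
                "aggs": {
                    "avg_value": {"avg": {"field": value_field}},
                    "min_value": {"min": {"field": value_field}},
                    "max_value": {"max": {"field": value_field}},
                    "sum_value": {"sum": {"field": value_field}},
                    "count_value": {"value_count": {"field": value_field}},
                },
            }
        },
    }


# "category" and "price" are hypothetical field names.
aggs_body = stats_by_group("category", "price")

print(json.dumps(aggs_body, indent=2))
```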

If you're searching for a keyword in more than one field and it appears in several of them then the "type": "most_fields" flag would score the record higher. For more information, read the Most Fields page.

If your query uses joins then you must read the Handling Relationships pages carefully. Pay attention to the "Application-side Joins", "Denormalizing Your Data" and "Nested Objects" sections.

If you're dealing with NULL or NOT NULL values then you must use missing and exists as described in Dealing with Null Values.
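A sketch of the two cases follows, using the older syntax these notes refer to: exists plays the role of IS NOT NULL and missing the role of IS NULL. The field name tags is hypothetical, and wrapping the filter in constant_score is one possible choice. Later Elasticsearch releases removed missing in favour of an exists clause inside the must_not section of a bool query.

```python
import json


def not_null_query(field):
    """Documents where the field has at least one value (IS NOT NULL)."""
    return {"query": {"constant_score": {"filter": {"exists": {"field": field}}}}}


def is_null_query(field):
    """Documents where the field has no value (IS NULL), older 'missing' syntax."""
    return {"query": {"constant_score": {"filter": {"missing": {"field": field}}}}}


# "tags" is a hypothetical field name.
not_null_body = not_null_query("tags")
is_null_body = is_null_query("tags")

print(json.dumps(not_null_body))
print(json.dumps(is_null_body))
```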

If you get a SearchParseException error when using the sort property in your query, add "ignore_unmapped": true to the sort property of your query.
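For instance, a sort clause carrying that flag could be built as below. The field name author is borrowed from the error message quoted earlier in these notes; the match_all query and the sort direction are assumptions.

```python
import json


def sort_ignoring_unmapped(field, order="asc"):
    """Sort clause that does not fail when the field has no mapping."""
    return {field: {"order": order, "ignore_unmapped": True}}


sorted_body = {
    "query": {"match_all": {}},
    "sort": [sort_ignoring_unmapped("author")],
}

# json.dumps turns Python's True into JSON's true.
print(json.dumps(sorted_body, indent=2))
```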

If you're planning to use the boost flag in your query, you're better off using the Function Score Query feature and including the "score_mode": "sum" and "boost_mode": "replace" flags as part of your query to fetch results in a more suitable scoring order.
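One way such a query might look is sketched below. Each function pairs a filter with a weight; with "score_mode": "sum" the weights of all matching functions are added together, and with "boost_mode": "replace" that sum is used as the final score instead of the original query score. The field names, values and weights are made up for the example, and the weight key assumes a version of elasticsearch that supports it.

```python
import json


def weighted_function_score(text_field, text, boosts):
    """Build a function_score query from (field, value, weight) tuples."""
    functions = [
        {"filter": {"term": {field: value}}, "weight": weight}
        for field, value, weight in boosts
    ]
    return {
        "query": {
            "function_score": {
                "query": {"match": {text_field: text}},
                "functions": functions,
                "score_mode": "sum",      # add up the matching functions' scores
                "boost_mode": "replace",  # use that sum instead of the query score
            }
        }
    }


# "title", "featured" and "category" are hypothetical field names.
function_score_body = weighted_function_score(
    "title",
    "elasticsearch",
    [("featured", True, 3), ("category", "guide", 2)],
)

print(json.dumps(function_score_body, indent=2))
```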

Basic concept

If you want in-depth information, you can visit the Basic Concept page.

Cluster (physical unit)

A cluster is a collection of one or more nodes (servers) that together hold all of your data. It provides search capabilities across all nodes. A cluster is identified by a unique name, which by default is "elasticsearch".

Node (physical unit)

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID).

Index (non-physical unit)

An index is a collection of documents (data). An index is identified by a lowercase name. In a single cluster, you can define as many indexes as you want.

Type (non-physical unit)

A type is a logical category within an index, which allows you to store different types of documents in the same index.

Document (non-physical unit)

A document is a basic unit of information that can be indexed. For example, a single customer, product or order. The document is expressed in JSON format. An index can contain as many documents as you want.
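To make this concrete, here is a made-up customer document serialised to JSON in Python. All field names and values are invented for illustration.

```python
import json

# A hypothetical customer document; every field here is an assumption.
customer = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "orders": 3,
    "active": True,
}

# This JSON string is what would be sent to elasticsearch for indexing.
document_json = json.dumps(customer, sort_keys=True)

print(document_json)
```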

Shards (physical unit) and Replica (physical unit)

An index can potentially store a large amount of data that exceeds the disk limits of a single node, or that is too large for a single node to search quickly. To solve this, Elasticsearch provides the ability to divide your index into multiple pieces called shards. Sharding is important because:

It allows you to horizontally scale your content volume.

It allows you to distribute and parallelise operations across shards to increase performance.

Elasticsearch allows you to make one or more copies of your index's shards, called replicas (replica shards). This matters because in the real world failures can be expected at any time. For example, a shard or node might go offline or disappear. Replication is important because:

It provides high availability in case a shard/node fails. Note that a replica is never allocated on the same node as the original/primary shard that it was copied from.

It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.

By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica. This means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.
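The arithmetic above can be sketched in a few lines of Python. The settings keys number_of_shards and number_of_replicas are the index settings that control these values; the helper function names are made up for this example. The second helper reflects the rule that copies of the same shard never share a node, so on a single node the replicas stay unassigned.

```python
def index_settings(primaries=5, replicas=1):
    """Index creation settings for the given shard and replica counts."""
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas,
        }
    }


def total_shards(primaries=5, replicas=1):
    """Every primary shard plus one copy of it per replica."""
    return primaries * (1 + replicas)


def assignable_shards(primaries, replicas, nodes):
    """Shards that can actually be placed, since two copies of the
    same shard are never put on the same node."""
    copies_per_shard = min(1 + replicas, nodes)
    return primaries * copies_per_shard


print(index_settings())
print(total_shards())               # 5 primaries + 5 replicas = 10
print(assignable_shards(5, 1, 1))   # one node: only the 5 primaries
print(assignable_shards(5, 1, 2))   # two nodes: all 10 shards
```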