¶ 6.1

Data types

Full-text fields and attributes

Manticore's data types can be split into two categories: full-text fields and attributes.

Full-text fields

Full-text fields:

can be indexed with natural language processing algorithms, therefore can be searched for keywords
cannot be used for sorting or grouping
original document's content can be retrieved
original document's content can be used for highlighting

Full-text fields are represented by the data type text. All other data types are called "attributes".

Attributes

Attributes are non-full-text values associated with each document that can be used to perform non-full-text filtering, sorting and grouping during a search.

It is often desired to process full-text search results based not only on matching document ID and its rank, but also on a number of other per-document values. For example, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit a blog search to posts made by selected users, or group results by month. To do this efficiently, Manticore enables not only full-text fields, but also additional attributes to be added to each document. These attributes can be used to filter, sort, or group full-text matches, or to search only by attributes.

The attributes, unlike full-text fields, are not full-text indexed. They are stored in the table, but it is not possible to search them as full-text.

A good example for attributes would be a forum posts table. Assume that only the title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e., search only those rows that have some specific values of author_id or forum_id); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

SQL

CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp);

JSON

POST /cli -d "CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('forum');
$index->create([
    'title'=>['type'=>'text'],
    'content'=>['type'=>'text'],
    'author_id'=>['type'=>'int'],
    'forum_id'=>['type'=>'int'],
    'post_date'=>['type'=>'timestamp']
]);

Python

utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)')

Javascript

res = await utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)');

Java

utilsApi.sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");

utilsApi.Sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");

config

table forum
{
    type = rt
    path = forum

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
    rt_field = content

    # this option needs to be specified for the field to be stored
    stored_fields = title, content

    rt_attr_uint = author_id
    rt_attr_uint = forum_id
    rt_attr_timestamp = post_date
}

This example shows running a full-text query filtered by author_id, forum_id and sorted by post_date.

SQL

select * from forum where author_id=123 and forum_id in (1,3,7) order by post_date desc

JSON

POST /search
{
  "index": "forum",
  "query":
  {
    "match_all": {},
    "bool":
    {
      "must":
      [
        { "equals": { "author_id": 123 } },
        { "in": { "forum_id": [1,3,7] } }
      ]
    }
  },
  "sort": [ { "post_date": "desc" } ]
}

PHP

$client->search([
        'index' => 'forum',
        'query' =>
        [
            'match_all' => [],
            'bool' => [
                'must' => [
                    'equals' => ['author_id' => 123],
                    'in' => [
                        'forum_id' => [
                            1,3,7
                        ]
                    ]
                ]
            ]
        ],
        'sort' => [
        ['post_date' => 'desc']
    ]
]);

Python

searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]})

javascript

res = await searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]});

java

HashMap<String,Object> filters = new HashMap<String,Object>(){{
    put("must", new HashMap<String,Object>(){{
        put("equals",new HashMap<String,Integer>(){{
            put("author_id",123);
        }});
        put("in",
            new HashMap<String,Object>(){{
                put("forum_id",new int[] {1,3,7});
        }});
    }});
}};
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("bool",filters);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("post_date","desc");}});
}});
SearchResponse searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var boolFilter = new BoolFilter();
boolFilter.Must = new List<Object> {
    new EqualsFilter("author_id", 123),
    new InFilter("forum_id", new List<Object> {1,3,7})
};
searchRequest.AttrFilter = boolFilter;
searchRequest.Sort = new List<Object> { new SortOrder("post_date", SortOrder.OrderEnum.Desc) };
var searchResponse = searchApi.Search(searchRequest);

Row-wise and columnar attribute storages

Manticore supports two types of attribute storages:

row-wise - traditional storage available in Manticore Search out of the box
columnar - provided by Manticore Columnar Library

As can be understood from their names, they store data differently. The traditional row-wise storage:

stores attributes uncompressed
all attributes of the same document are stored in one row close to each other
rows are stored one by one
accessing attributes is basically done by just multiplying the row ID by the stride (length of a single vector) and getting the requested attribute from the calculated memory location. It gives very low random access latency.
attributes have to be in memory to get acceptable performance, otherwise due to the row-wise nature of the storage Manticore may have to read from disk too much unneeded data which is in many cases suboptimal.

With the columnar storage:

each attribute is stored independently of all other attributes in its separate "column"
storage is split into blocks of 65536 entries
the blocks are stored compressed. This often allows storing just a few distinct values instead of storing all of them like in the row-wise storage. High compression ratio allows reading from disk faster and makes the memory requirement much lower
when data is indexed, storage scheme is selected for each block independently. For example, if all values in a block are the same, it gets "const" storage and only one value is stored for the whole block. If there are less than 256 unique values per block, it gets "table" storage and stores indexes to a table of values instead of the values themselves
search in a block can be early rejected if it's clear the requested value is not present in the block.

The columnar storage was designed to handle large data volume that does not fit into RAM, so the recommendations are:

if you have enough memory for all your attributes you will benefit from the row-wise storage
otherwise, the columnar storage can still give you decent performance with a much lower memory footprint which will allow you to store much more documents in your table

How to switch between the storages

The traditional row-wise storage is the default, so if you want everything to be stored in a row-wise fashion, you don't need to do anything when you create a table.

To enable the columnar storage you need to:

specify engine='columnar' in CREATE TABLE to make all attributes of the table columnar. Then, if you want to keep a specific attribute row-wise, you need to add engine='rowwise' when you declare it. For example:

create table tbl(title text, type int, price float engine='rowwise') engine='columnar'

specify engine='columnar' for a specific attribute in CREATE TABLE to make it columnar. For example:

create table tbl(title text, type int, price float engine='columnar');

create table tbl(title text, type int, price float engine='columnar') engine='rowwise';

in the plain mode you need to list attributes you want to be columnar in columnar_attrs.

Below is the list of data types supported by Manticore Search:

Document ID

The document identifier is a mandatory attribute, and document IDs must be unique 64-bit unsigned integers. Document IDs can be explicitly specified, but if not, they are still enabled. Document IDs cannot be updated. Note that when retrieving document IDs, they are treated as signed 64-bit integers, which means they may be negative. Use the UINT64() function to cast them to unsigned 64-bit integers if necessary.

Explicit ID

When you create a table, you can specify ID explicitly, but no matter what data type you use, it will be always as said previously - a signed 64-bit integer.

CREATE TABLE tbl(id bigint, content text);
DESC tbl;

+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)

Implicit ID

You can also omit specifying ID at all, it will be enabled automatically.

CREATE TABLE tbl(content text);
DESC tbl;

+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)

Character data types

General syntax:

string|text [stored|attribute] [indexed]

Properties:

indexed - full-text indexed (can be used in full-text queries)
stored - stored in a docstore (stored on disk, not in RAM, lazy read)
attribute - makes it a string attribute (can sort/group by it)

Specifying at least one property overrides all the default ones (see below), i.e., if you decide to use a custom combination of properties, you need to list all the properties you want.

No properties specified:

string and text are aliases, but if you don’t specify any properties, they by default mean different things:

just string by default means attribute (see details below).
just text by default means stored + indexed (see details below).

Text

The text (just text or text/string indexed) data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.

Text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations, etc. Eventually, a full-text table (a special data structure that enables quick searches for a keyword) gets built from that text.

Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation. Words are stored in an inverted index along with references to the fields they belong to and positions in the field. This allows searching a word inside each field and using advanced operators like proximity. By default, the original text of the fields is both indexed and stored in document storage. It means that the original text can be returned with the query results and used in search result highlighting.

SQL

CREATE TABLE products(title text);

JSON

POST /cli -d "CREATE TABLE products(title text)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text']
]);

Python

utilsApi.sql('CREATE TABLE products(title text)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text)');

java

utilsApi.sql("CREATE TABLE products(title text)");

utilsApi.Sql("CREATE TABLE products(title text)");

config

table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title

    # this option needs to be specified for the field to be stored
    stored_fields = title
}

This behavior can be overridden by explicitly specifying that the text is only indexed.

SQL

CREATE TABLE products(title text indexed);

JSON

POST /cli -d "CREATE TABLE products(title text indexed)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text','options'=>['indexed']]
]);

Python

utilsApi.sql('CREATE TABLE products(title text indexed)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text indexed)');

java

utilsApi.sql("CREATE TABLE products(title text indexed)");

utilsApi.Sql("CREATE TABLE products(title text indexed)");

config

table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
}

Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). You can have up to 256 full-text fields.

SQL

select * from products where match('@title first');

JSON

POST /search
{
    "index": "products",
    "query":
    {
        "match": { "title": "first" }
    }
}

PHP

$index->setName('products')->search('@title')->get();

Python

searchApi.search({"index":"products","query":{"match":{"title":"first"}}})

javascript

res = await searchApi.search({"index":"products","query":{"match":{"title":"first"}}});

java

utilsApi.sql("CREATE TABLE products(title text indexed)");

utilsApi.Sql("CREATE TABLE products(title text indexed)");

String

Unlike full-text fields, string attributes (just string or string/text attribute) are stored as they are received and cannot be used in full-text searches. Instead, they are returned in results, can be used in the WHERE clause for comparison filtering or REGEX, and can be used for sorting and aggregation. In general, it's not recommended to store large texts in string attributes, but use string attributes for metadata like names, titles, tags, keys.

If you want to also index the string attribute, you can specify both as string attribute indexed. It will allow full-text searching and works as an attribute.

SQL

CREATE TABLE products(title text, keys string);

JSON

POST /cli -d "CREATE TABLE products(title text, keys string)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'keys'=>['type'=>'string']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, keys string)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, keys string)');

java

utilsApi.sql("CREATE TABLE products(title text, keys string)");

utilsApi.Sql("CREATE TABLE products(title text, keys string)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_string = keys
}

You can create a full-text field that is also stored as a string attribute. This approach creates a full-text field and a string attribute that have the same name. Note that you can't add a `stored` property to store the data as a string attribute and in the document storage at the same time.

SQL

`string attribute indexed` means that we're working with a string data type that is stored as an attribute and indexed as a full-text field.

CREATE TABLE products ( title string attribute indexed );

JSON

POST /cli -d "CREATE TABLE products ( title string attribute indexed )"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'string','options'=>['indexed','attribute']]
]);

Python

utilsApi.sql('CREATE TABLE products ( title string attribute indexed )')

javascript

res = await utilsApi.sql('CREATE TABLE products ( title string attribute indexed )');

java

utilsApi.sql("CREATE TABLE products ( title string attribute indexed )");

utilsApi.Sql("CREATE TABLE products ( title string attribute indexed )");

config

table products
{
    type = rt
    path = products

    rt_field = title
    rt_attr_string = title
}

Integer

Integer type allows storing 32 bit unsigned integer values.

SQL

CREATE TABLE products(title text, price int);

JSON

POST /cli -d "CREATE TABLE products(title text, price int)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'int']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, price int)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price int)');

java

utilsApi.sql("CREATE TABLE products(title text, price int)");

utilsApi.Sql("CREATE TABLE products(title text, price int)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = type
}

Integers can be stored in shorter sizes than 32-bit by specifying a bit count. For example, if we want to store a numeric value which we know is not going to be bigger than 8, the type can be defined as bit(3). Bitcount integers perform slower than the full-size ones, but they require less RAM. They are saved in 32-bit chunks, so in order to save space, they should be grouped at the end of attribute definitions (otherwise a bitcount integer between 2 full-size integers will occupy 32 bits as well).

SQL

CREATE TABLE products(title text, flags bit(3), tags bit(2) );

JSON

POST /cli -d "CREATE TABLE products(title text, flags bit(3), tags bit(2))"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'flags'=>['type'=>'bit(3)'],
    'tags'=>['type'=>'bit(2)']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ');

java

utilsApi.sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");

utilsApi.Sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = flags:3
    rt_attr_uint = tags:2
}

Big Integer

Big integers (bigint) are 64-bit wide signed integers.

SQL

CREATE TABLE products(title text, price bigint );

JSON

POST /cli -d "CREATE TABLE products(title text, price bigint)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'bigint']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, price bigint )')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price bigint )');

java

utilsApi.sql("CREATE TABLE products(title text, price bigint )");

utilsApi.Sql("CREATE TABLE products(title text, price bigint )");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bigint = type
}

Boolean

Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.

SQL

CREATE TABLE products(title text, sold bool );

JSON

POST /cli -d "CREATE TABLE products(title text, sold bool)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'sold'=>['type'=>'bool']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, sold bool )')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, sold bool )');

java

utilsApi.sql("CREATE TABLE products(title text, sold bool )");

utilsApi.Sql("CREATE TABLE products(title text, sold bool )");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bool = sold
}

Timestamps

Timestamp type represents unix timestamps which is stored as a 32-bit integer. The difference is that time and date functions are available for the timestamp type.

SQL

CREATE TABLE products(title text, date timestamp);

JSON

POST /cli -d "CREATE TABLE products(title text, date timestamp)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'date'=>['type'=>'timestamp']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, date timestamp)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, date timestamp)');

java

utilsApi.sql("CREATE TABLE products(title text, date timestamp)");

utilsApi.Sql("CREATE TABLE products(title text, date timestamp)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_timestamp = date
}

Float

Real numbers are stored as 32-bit IEEE 754 single precision floats.

SQL

CREATE TABLE products(title text, coeff float);

JSON

POST /cli -d "CREATE TABLE products(title text, coeff float)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'coeff'=>['type'=>'float']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, coeff float)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, coeff float)');

java

utilsApi.sql("CREATE TABLE products(title text, coeff float)");

utilsApi.Sql("CREATE TABLE products(title text, coeff float)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float = coeff
}

Unlike integer types, comparing two floating-point numbers for equality is not recommended due to potential rounding errors. A more reliable approach is to use a near-equal comparison, by checking the absolute error margin.

SQL

select abs(a-b)<=0.00001 from products

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "eps": "abs(a-b)" }
}

PHP

$index->setName('products')->search('')->expression('eps','abs(a-b)')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"eps":"abs(a-b)"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("ebs","abs(a-b)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object>{
    new Dictionary<string, string> { {"ebs", "abs(a-b)"} }
};
var searchResponse = searchApi.Search(searchRequest);

Another alternative, which can also be used to perform IN(attr,val1,val2,val3) is to compare floats as integers by choosing a multiplier factor and convert the floats to integers in operations. The following example illustrates modifying IN(attr,2.0,2.5,3.5) to work with integer values.

SQL

select in(ceil(attr*100),200,250,350) from products

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "inc": "in(ceil(attr*100),200,250,350)" }
}

PHP

$index->setName('products')->search('')->expression('inc','in(ceil(attr*100),200,250,350)')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("inc","in(ceil(attr*100),200,250,350)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"ebs", "in(ceil(attr*100),200,250,350)"} }
};
var searchResponse = searchApi.Search(searchRequest);

JSON

This data type allows storing JSON objects, which is useful for storing schema-less data. However, it is not supported by columnar storage. However, it can be stored in traditional storage, as it's possible to combine both storage types in the same table.

SQL

CREATE TABLE products(title text, data json);

JSON

POST /cli -d "CREATE TABLE products(title text, data json)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'data'=>['type'=>'json']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, data json)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, data json)');

java

utilsApi.sql'CREATE TABLE products(title text, data json)');

utilsApi.Sql'CREATE TABLE products(title text, data json)');

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_json = data
}

JSON properties can be used in most operations. There are also special functions such as ALL(), ANY(), GREATEST(), LEAST() and INDEXOF() that allow traversal of property arrays.

SQL

select indexof(x>2 for x in data.intarray) from products

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "idx": "indexof(x>2 for x in data.intarray)" }
}

PHP

$index->setName('products')->search('')->expression('idx','indexof(x>2 for x in data.intarray)')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

Text properties are treated the same as strings, so it's not possible to use them in full-text match expressions. However, string functions such as REGEX() can be used.

SQL

select regex(data.name, 'est') as c from products where c>0

JSON

POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "range": { "c": { "gt": 0 } } }
  },
  "expressions": { "c": "regex(data.name, 'est')" }
}

PHP

$index->setName('products')->search('')->expression('idx',"regex(data.name, 'est')")->filter('c','gt',0)->get();

Python

searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("range", new HashMap<String,Object>(){{
    put("c", new HashMap<String,Object>(){{
        put("gt",0);
    }});
}});
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var rangeFilter = new RangeFilter("c");
rangeFilter.Gt = 0;
searchRequest.AttrFilter = rangeFilter;
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

In the case of JSON properties, enforcing data type may be required for proper functionality in certain situations. For example, when working with float values, DOUBLE() must be used for proper sorting.

SQL

select * from products order by double(data.myfloat) desc

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "sort": [ { "double(data.myfloat)": { "order": "desc"} } ]
}

PHP

$index->setName('products')->search('')->sort('double(data.myfloat)','desc')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("double(data.myfloat)",new HashMap<String,String>(){{ put("order","desc");}});}});
}});
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortOrder("double(data.myfloat)", SortOrder.OrderEnum.Desc)
};
var searchResponse = searchApi.Search(searchRequest);

Float vector

Float vector attributes allow storing variable-length lists of floats. It's important to note that this concept differs from multi-valued attributes. Multi-valued attributes (MVAs) are essentially sets; they do not preserve value order, and duplicate values are not retained. In contrast, float vectors perform no additional processing on values during insertion.

Float vector attributes can be used in k-nearest neighbor searches; see KNN search.

** Currently, float_vector fields can only be utilized in KNN search within real-time tables and the data type is not supported in any other functions or expressions, nor is it supported in plain tables. **

SQL

CREATE TABLE products(title text, image_vector float_vector);

JSON

POST /cli -d "CREATE TABLE products(title text, image_vector float_vector)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'image_vector'=>['type'=>'float_vector']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)');

java

utilsApi.sql("CREATE TABLE products(title text, image_vector float_vector)");

utilsApi.Sql("CREATE TABLE products(title text, image_vector float_vector)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float_vector = image_vector
}

Multi-value integer (MVA)

Multi-value attributes allow storing variable-length lists of 32-bit unsigned integers. This can be useful for storing one-to-many numeric values, such as tags, product categories, and properties.

SQL

CREATE TABLE products(title text, product_codes multi);

JSON

POST /cli -d "CREATE TABLE products(title text, product_codes multi)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'product_codes'=>['type'=>'multi']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, product_codes multi)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, product_codes multi)');

java

utilsApi.sql("CREATE TABLE products(title text, product_codes multi)");

utilsApi.Sql("CREATE TABLE products(title text, product_codes multi)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi = product_codes
}

It supports filtering and aggregation, but not sorting. Filtering can be done using a condition that requires at least one element to pass (using ANY()) or all elements (ALL()) to pass.

SQL

select * from products where any(product_codes)=3

JSON

POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "equals" : { "any(product_codes)": 3 }
  }
}

PHP

$index->setName('products')->search('')->filter('any(product_codes)','equals',3)->get();

Python

searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})'

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("equals",new HashMap<String,Integer>(){{
     put("any(product_codes)",3);
}});
searchRequest.setQuery(query);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.AttrFilter = new EqualsFilter("any(product_codes)", 3);
var searchResponse = searchApi.Search(searchRequest);

Information like least or greatest element and length of the list can be extracted. An example shows ordering by the least element of a multi-value attribute.

SQL

select least(product_codes) l from products order by l asc

JSON

POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "sort": [ { "product_codes":{ "order":"asc", "mode":"min" } } ]
  }
}

PHP

$index->setName('products')->search('')->sort('product_codes','asc','min')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("product_codes",new HashMap<String,String>(){{ put("order","asc");put("mode","min");}});}});
}});
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortMVA("product_codes", SortOrder.OrderEnum.Asc, SortMVA.ModeEnum.Min)
};
searchResponse = searchApi.search(searchRequest);

When grouping by a multi-value attribute, a document will contribute to as many groups as there are different values associated with that document. For instance, if a collection contains exactly one document having a 'product_codes' multi-value attribute with values 5, 7, and 11, grouping on 'product_codes' will produce 3 groups with COUNT(*)equal to 1 and GROUPBY() key values of 5, 7, and 11, respectively. Also, note that grouping by multi-value attributes may lead to duplicate documents in the result set because each document can participate in many groups.

SQL

insert into products values ( 1, 'doc one', (5,7,11) );
select id, count(*), groupby() from products group by product_codes;

Query OK, 1 row affected (0.00 sec)

+------+----------+-----------+
| id   | count(*) | groupby() |
+------+----------+-----------+
|    1 |        1 |        11 |
|    1 |        1 |         7 |
|    1 |        1 |         5 |
+------+----------+-----------+
3 rows in set (0.00 sec)

The order of the numbers inserted as values of multivalued attributes is not preserved. Values are stored internally as a sorted set.

SQL

insert into product values (1,'first',(4,2,1,3));
select * from products;

Query OK, 1 row affected (0.00 sec)

+------+---------------+-------+
| id   | product_codes | title |
+------+---------------+-------+
|    1 | 1,2,3,4       | first |
+------+---------------+-------+
1 row in set (0.01 sec)

JSON

POST /insert
{
    "index":"products",
    "id":1,
    "doc":
    {
        "title":"first",
        "product_codes":[4,2,1,3]
    }
}

POST /search
{
  "index": "products",
  "query": { "match_all": {} }
}

{
   "_index":"products",
   "_id":1,
   "created":true,
   "result":"created",
   "status":201
}

{
   "took":0,
   "timed_out":false,
   "hits":{
      "total":1,
      "hits":[
         {
            "_id":"1",
            "_score":1,
            "_source":{
               "product_codes":[
                  1,
                  2,
                  3,
                  4
               ],
               "title":"first"
            }
         }
      ]
   }
}

PHP

$index->addDocument([
    "title"=>"first",
    "product_codes"=>[4,2,1,3]
]);
$index->search('')-get();

Array
(
    [_index] => products
    [_id] => 1
    [created] => 1
    [result] => created
    [status] => 201
)
Array
(
    [took] => 0
    [timed_out] =>
    [hits] => Array
        (
            [total] => 1
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_id] => 1
                            [_score] => 1
                            [_source] => Array
                                (
                                    [product_codes] => Array
                                        (
                                            [0] => 1
                                            [1] => 2
                                            [2] => 3
                                            [3] => 4
                                        )

                                    [title] => first
                                )
                        )
                )
        )
)

Python

indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}})
searchApi.search({"index":"products","query":{"match_all":{}}})

{'created': True,
 'found': None,
 'id': 1,
 'index': 'products',
 'result': 'created'}
{'hits': {'hits': [{u'_id': u'1',
                    u'_score': 1,
                    u'_source': {u'product_codes': [1, 2, 3, 4],
                                 u'title': u'first'}}],
          'total': 1},
 'profile': None,
 'timed_out': False,
 'took': 29}

javascript

await indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}});
res = await searchApi.search({"index":"products","query":{"match_all":{}}});

{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1,"_source":{"product_codes":[1,2,3,4],"title":"first"}}]}}

java

InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
    put("title","first");
    put("product_codes",new int[] {4,2,1,3});
}};
newdoc.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setQuery(query);
SearchResponse searchResponse = searchApi.search(searchRequest);
System.out.println(searchResponse.toString() );

class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}

Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "first");
doc.Add("product_codes", new List<Object> {4,2,1,3});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
object query =  new { match_all=null };
var searchRequest = new SearchRequest("products", query);
var searchResponse = searchApi.Search(searchRequest);
Console.WriteLine(searchResponse.ToString())

class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}

Multi-value big integer

A data type that allows storing variable-length lists of 64-bit signed integers. It has the same functionality as multi-value integer.

SQL

CREATE TABLE products(title text, values multi64);

JSON

POST /cli -d "CREATE TABLE products(title text, values multi64)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'values'=>['type'=>'multi64']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, values multi64))')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, values multi64))');

java

utilsApi.sql("CREATE TABLE products(title text, values multi64))");

utilsApi.Sql("CREATE TABLE products(title text, values multi64))");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi_64 = values
}

Columnar attribute properties

When you use the columnar storage you can specify the following properties for the attributes.

fast_fetch

By default, Manticore Columnar storage stores all attributes in a columnar fashion, as well as in a special docstore row by row. This enables fast execution of queries like SELECT * FROM ..., especially when fetching a large number of records at once. However, if you are sure that you do not need it or wish to save disk space, you can disable it by specifying fast_fetch='0' when creating a table or (if you are defining a table in a config) by using columnar_no_fast_fetch as shown in the following example.

RT mode

create table t(a int, b int fast_fetch='0') engine='columnar'; desc t;

+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+
3 rows in set (0.00 sec)

Plain mode

source min {
    type = mysql
    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    sql_query = select 1, 1 a, 1 b
    sql_attr_uint = a
    sql_attr_uint = b
}

table tbl {
    path = tbl/col
    source = min
    columnar_attrs = *
    columnar_no_fast_fetch = b
}

+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+

¶ 6.1.1

Data types

Full-text fields and attributes

Manticore's data types can be split into two categories: full-text fields and attributes.

Full-text fields

Full-text fields:

can be indexed with natural language processing algorithms, therefore can be searched for keywords
cannot be used for sorting or grouping
original document's content can be retrieved
original document's content can be used for highlighting

Full-text fields are represented by the data type text. All other data types are called "attributes".

Attributes

Attributes are non-full-text values associated with each document that can be used to perform non-full-text filtering, sorting and grouping during a search.

The attributes, unlike full-text fields, are not full-text indexed. They are stored in the table, but it is not possible to search them as full-text.

SQL

CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp);

JSON

POST /cli -d "CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('forum');
$index->create([
    'title'=>['type'=>'text'],
    'content'=>['type'=>'text'],
    'author_id'=>['type'=>'int'],
    'forum_id'=>['type'=>'int'],
    'post_date'=>['type'=>'timestamp']
]);

Python

utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)')

Javascript

res = await utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)');

Java

utilsApi.sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");

utilsApi.Sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");

config

table forum
{
    type = rt
    path = forum

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
    rt_field = content

    # this option needs to be specified for the field to be stored
    stored_fields = title, content

    rt_attr_uint = author_id
    rt_attr_uint = forum_id
    rt_attr_timestamp = post_date
}

This example shows running a full-text query filtered by author_id, forum_id and sorted by post_date.

SQL

select * from forum where author_id=123 and forum_id in (1,3,7) order by post_date desc

JSON

POST /search
{
  "index": "forum",
  "query":
  {
    "match_all": {},
    "bool":
    {
      "must":
      [
        { "equals": { "author_id": 123 } },
        { "in": { "forum_id": [1,3,7] } }
      ]
    }
  },
  "sort": [ { "post_date": "desc" } ]
}

PHP

$client->search([
        'index' => 'forum',
        'query' =>
        [
            'match_all' => [],
            'bool' => [
                'must' => [
                    'equals' => ['author_id' => 123],
                    'in' => [
                        'forum_id' => [
                            1,3,7
                        ]
                    ]
                ]
            ]
        ],
        'sort' => [
        ['post_date' => 'desc']
    ]
]);

Python

searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]})

javascript

res = await searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]});

java

HashMap<String,Object> filters = new HashMap<String,Object>(){{
    put("must", new HashMap<String,Object>(){{
        put("equals",new HashMap<String,Integer>(){{
            put("author_id",123);
        }});
        put("in",
            new HashMap<String,Object>(){{
                put("forum_id",new int[] {1,3,7});
        }});
    }});
}};
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("bool",filters);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("post_date","desc");}});
}});
SearchResponse searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var boolFilter = new BoolFilter();
boolFilter.Must = new List<Object> {
    new EqualsFilter("author_id", 123),
    new InFilter("forum_id", new List<Object> {1,3,7})
};
searchRequest.AttrFilter = boolFilter;
searchRequest.Sort = new List<Object> { new SortOrder("post_date", SortOrder.OrderEnum.Desc) };
var searchResponse = searchApi.Search(searchRequest);

Row-wise and columnar attribute storages

Manticore supports two types of attribute storages:

row-wise - traditional storage available in Manticore Search out of the box
columnar - provided by Manticore Columnar Library

As can be understood from their names, they store data differently. The traditional row-wise storage:

stores attributes uncompressed
all attributes of the same document are stored in one row close to each other
rows are stored one by one
accessing attributes is basically done by just multiplying the row ID by the stride (length of a single vector) and getting the requested attribute from the calculated memory location. It gives very low random access latency.
attributes have to be in memory to get acceptable performance, otherwise due to the row-wise nature of the storage Manticore may have to read from disk too much unneeded data which is in many cases suboptimal.

With the columnar storage:

each attribute is stored independently of all other attributes in its separate "column"
storage is split into blocks of 65536 entries
the blocks are stored compressed. This often allows storing just a few distinct values instead of storing all of them like in the row-wise storage. High compression ratio allows reading from disk faster and makes the memory requirement much lower
when data is indexed, storage scheme is selected for each block independently. For example, if all values in a block are the same, it gets "const" storage and only one value is stored for the whole block. If there are less than 256 unique values per block, it gets "table" storage and stores indexes to a table of values instead of the values themselves
search in a block can be early rejected if it's clear the requested value is not present in the block.

The columnar storage was designed to handle large data volume that does not fit into RAM, so the recommendations are:

if you have enough memory for all your attributes you will benefit from the row-wise storage
otherwise, the columnar storage can still give you decent performance with a much lower memory footprint which will allow you to store much more documents in your table

How to switch between the storages

The traditional row-wise storage is the default, so if you want everything to be stored in a row-wise fashion, you don't need to do anything when you create a table.

To enable the columnar storage you need to:

specify engine='columnar' in CREATE TABLE to make all attributes of the table columnar. Then, if you want to keep a specific attribute row-wise, you need to add engine='rowwise' when you declare it. For example:

create table tbl(title text, type int, price float engine='rowwise') engine='columnar'

specify engine='columnar' for a specific attribute in CREATE TABLE to make it columnar. For example:

create table tbl(title text, type int, price float engine='columnar');

create table tbl(title text, type int, price float engine='columnar') engine='rowwise';

in the plain mode you need to list attributes you want to be columnar in columnar_attrs.

Below is the list of data types supported by Manticore Search:

Document ID

Explicit ID

When you create a table, you can specify ID explicitly, but no matter what data type you use, it will be always as said previously - a signed 64-bit integer.

CREATE TABLE tbl(id bigint, content text);
DESC tbl;

+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)

Implicit ID

You can also omit specifying ID at all, it will be enabled automatically.

CREATE TABLE tbl(content text);
DESC tbl;

+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)

Character data types

General syntax:

string|text [stored|attribute] [indexed]

Properties:

indexed - full-text indexed (can be used in full-text queries)
stored - stored in a docstore (stored on disk, not in RAM, lazy read)
attribute - makes it a string attribute (can sort/group by it)

Specifying at least one property overrides all the default ones (see below), i.e., if you decide to use a custom combination of properties, you need to list all the properties you want.

No properties specified:

string and text are aliases, but if you don’t specify any properties, they by default mean different things:

just string by default means attribute (see details below).
just text by default means stored + indexed (see details below).

Text

The text (just text or text/string indexed) data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.

SQL

CREATE TABLE products(title text);

JSON

POST /cli -d "CREATE TABLE products(title text)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text']
]);

Python

utilsApi.sql('CREATE TABLE products(title text)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text)');

java

utilsApi.sql("CREATE TABLE products(title text)");

utilsApi.Sql("CREATE TABLE products(title text)");

config

table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title

    # this option needs to be specified for the field to be stored
    stored_fields = title
}

This behavior can be overridden by explicitly specifying that the text is only indexed.

SQL

CREATE TABLE products(title text indexed);

JSON

POST /cli -d "CREATE TABLE products(title text indexed)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text','options'=>['indexed']]
]);

Python

utilsApi.sql('CREATE TABLE products(title text indexed)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text indexed)');

java

utilsApi.sql("CREATE TABLE products(title text indexed)");

utilsApi.Sql("CREATE TABLE products(title text indexed)");

config

table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
}

SQL

select * from products where match('@title first');

JSON

POST /search
{
    "index": "products",
    "query":
    {
        "match": { "title": "first" }
    }
}

PHP

$index->setName('products')->search('@title')->get();

Python

searchApi.search({"index":"products","query":{"match":{"title":"first"}}})

javascript

res = await searchApi.search({"index":"products","query":{"match":{"title":"first"}}});

java

utilsApi.sql("CREATE TABLE products(title text indexed)");

utilsApi.Sql("CREATE TABLE products(title text indexed)");

String

If you want to also index the string attribute, you can specify both as string attribute indexed. It will allow full-text searching and works as an attribute.

SQL

CREATE TABLE products(title text, keys string);

JSON

POST /cli -d "CREATE TABLE products(title text, keys string)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'keys'=>['type'=>'string']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, keys string)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, keys string)');

java

utilsApi.sql("CREATE TABLE products(title text, keys string)");

utilsApi.Sql("CREATE TABLE products(title text, keys string)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_string = keys
}

SQL

`string attribute indexed` means that we're working with a string data type that is stored as an attribute and indexed as a full-text field.

CREATE TABLE products ( title string attribute indexed );

JSON

POST /cli -d "CREATE TABLE products ( title string attribute indexed )"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'string','options'=>['indexed','attribute']]
]);

Python

utilsApi.sql('CREATE TABLE products ( title string attribute indexed )')

javascript

res = await utilsApi.sql('CREATE TABLE products ( title string attribute indexed )');

java

utilsApi.sql("CREATE TABLE products ( title string attribute indexed )");

utilsApi.Sql("CREATE TABLE products ( title string attribute indexed )");

config

table products
{
    type = rt
    path = products

    rt_field = title
    rt_attr_string = title
}

Integer

Integer type allows storing 32 bit unsigned integer values.

SQL

CREATE TABLE products(title text, price int);

JSON

POST /cli -d "CREATE TABLE products(title text, price int)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'int']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, price int)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price int)');

java

utilsApi.sql("CREATE TABLE products(title text, price int)");

utilsApi.Sql("CREATE TABLE products(title text, price int)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = type
}

SQL

CREATE TABLE products(title text, flags bit(3), tags bit(2) );

JSON

POST /cli -d "CREATE TABLE products(title text, flags bit(3), tags bit(2))"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'flags'=>['type'=>'bit(3)'],
    'tags'=>['type'=>'bit(2)']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ');

java

utilsApi.sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");

utilsApi.Sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = flags:3
    rt_attr_uint = tags:2
}

Big Integer

Big integers (bigint) are 64-bit wide signed integers.

SQL

CREATE TABLE products(title text, price bigint );

JSON

POST /cli -d "CREATE TABLE products(title text, price bigint)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'bigint']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, price bigint )')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price bigint )');

java

utilsApi.sql("CREATE TABLE products(title text, price bigint )");

utilsApi.Sql("CREATE TABLE products(title text, price bigint )");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bigint = type
}

Boolean

Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.

SQL

CREATE TABLE products(title text, sold bool );

JSON

POST /cli -d "CREATE TABLE products(title text, sold bool)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'sold'=>['type'=>'bool']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, sold bool )')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, sold bool )');

java

utilsApi.sql("CREATE TABLE products(title text, sold bool )");

utilsApi.Sql("CREATE TABLE products(title text, sold bool )");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bool = sold
}

Timestamps

Timestamp type represents unix timestamps which is stored as a 32-bit integer. The difference is that time and date functions are available for the timestamp type.

SQL

CREATE TABLE products(title text, date timestamp);

JSON

POST /cli -d "CREATE TABLE products(title text, date timestamp)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'date'=>['type'=>'timestamp']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, date timestamp)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, date timestamp)');

java

utilsApi.sql("CREATE TABLE products(title text, date timestamp)");

utilsApi.Sql("CREATE TABLE products(title text, date timestamp)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_timestamp = date
}

Float

Real numbers are stored as 32-bit IEEE 754 single precision floats.

SQL

CREATE TABLE products(title text, coeff float);

JSON

POST /cli -d "CREATE TABLE products(title text, coeff float)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'coeff'=>['type'=>'float']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, coeff float)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, coeff float)');

java

utilsApi.sql("CREATE TABLE products(title text, coeff float)");

utilsApi.Sql("CREATE TABLE products(title text, coeff float)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float = coeff
}

SQL

select abs(a-b)<=0.00001 from products

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "eps": "abs(a-b)" }
}

PHP

$index->setName('products')->search('')->expression('eps','abs(a-b)')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"eps":"abs(a-b)"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("ebs","abs(a-b)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object>{
    new Dictionary<string, string> { {"ebs", "abs(a-b)"} }
};
var searchResponse = searchApi.Search(searchRequest);

SQL

select in(ceil(attr*100),200,250,350) from products

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "inc": "in(ceil(attr*100),200,250,350)" }
}

PHP

$index->setName('products')->search('')->expression('inc','in(ceil(attr*100),200,250,350)')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("inc","in(ceil(attr*100),200,250,350)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"ebs", "in(ceil(attr*100),200,250,350)"} }
};
var searchResponse = searchApi.Search(searchRequest);

JSON

SQL

CREATE TABLE products(title text, data json);

JSON

POST /cli -d "CREATE TABLE products(title text, data json)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'data'=>['type'=>'json']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, data json)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, data json)');

java

utilsApi.sql'CREATE TABLE products(title text, data json)');

utilsApi.Sql'CREATE TABLE products(title text, data json)');

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_json = data
}

JSON properties can be used in most operations. There are also special functions such as ALL(), ANY(), GREATEST(), LEAST() and INDEXOF() that allow traversal of property arrays.

SQL

select indexof(x>2 for x in data.intarray) from products

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "idx": "indexof(x>2 for x in data.intarray)" }
}

PHP

$index->setName('products')->search('')->expression('idx','indexof(x>2 for x in data.intarray)')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

Text properties are treated the same as strings, so it's not possible to use them in full-text match expressions. However, string functions such as REGEX() can be used.

SQL

select regex(data.name, 'est') as c from products where c>0

JSON

POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "range": { "c": { "gt": 0 } } }
  },
  "expressions": { "c": "regex(data.name, 'est')" }
}

PHP

$index->setName('products')->search('')->expression('idx',"regex(data.name, 'est')")->filter('c','gt',0)->get();

Python

searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("range", new HashMap<String,Object>(){{
    put("c", new HashMap<String,Object>(){{
        put("gt",0);
    }});
}});
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var rangeFilter = new RangeFilter("c");
rangeFilter.Gt = 0;
searchRequest.AttrFilter = rangeFilter;
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

SQL

select * from products order by double(data.myfloat) desc

JSON

POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "sort": [ { "double(data.myfloat)": { "order": "desc"} } ]
}

PHP

$index->setName('products')->search('')->sort('double(data.myfloat)','desc')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("double(data.myfloat)",new HashMap<String,String>(){{ put("order","desc");}});}});
}});
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortOrder("double(data.myfloat)", SortOrder.OrderEnum.Desc)
};
var searchResponse = searchApi.Search(searchRequest);

Float vector

Float vector attributes can be used in k-nearest neighbor searches; see KNN search.

SQL

CREATE TABLE products(title text, image_vector float_vector);

JSON

POST /cli -d "CREATE TABLE products(title text, image_vector float_vector)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'image_vector'=>['type'=>'float_vector']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)');

java

utilsApi.sql("CREATE TABLE products(title text, image_vector float_vector)");

utilsApi.Sql("CREATE TABLE products(title text, image_vector float_vector)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float_vector = image_vector
}

Multi-value integer (MVA)

Multi-value attributes allow storing variable-length lists of 32-bit unsigned integers. This can be useful for storing one-to-many numeric values, such as tags, product categories, and properties.

SQL

CREATE TABLE products(title text, product_codes multi);

JSON

POST /cli -d "CREATE TABLE products(title text, product_codes multi)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'product_codes'=>['type'=>'multi']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, product_codes multi)')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, product_codes multi)');

java

utilsApi.sql("CREATE TABLE products(title text, product_codes multi)");

utilsApi.Sql("CREATE TABLE products(title text, product_codes multi)");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi = product_codes
}

It supports filtering and aggregation, but not sorting. Filtering can be done using a condition that requires at least one element to pass (using ANY()) or all elements (ALL()) to pass.

SQL

select * from products where any(product_codes)=3

JSON

POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "equals" : { "any(product_codes)": 3 }
  }
}

PHP

$index->setName('products')->search('')->filter('any(product_codes)','equals',3)->get();

Python

searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})'

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("equals",new HashMap<String,Integer>(){{
     put("any(product_codes)",3);
}});
searchRequest.setQuery(query);
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.AttrFilter = new EqualsFilter("any(product_codes)", 3);
var searchResponse = searchApi.Search(searchRequest);

Information like least or greatest element and length of the list can be extracted. An example shows ordering by the least element of a multi-value attribute.

SQL

select least(product_codes) l from products order by l asc

JSON

POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "sort": [ { "product_codes":{ "order":"asc", "mode":"min" } } ]
  }
}

PHP

$index->setName('products')->search('')->sort('product_codes','asc','min')->get();

Python

searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}})

javascript

res = await searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}});

java

searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("product_codes",new HashMap<String,String>(){{ put("order","asc");put("mode","min");}});}});
}});
searchResponse = searchApi.search(searchRequest);

object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortMVA("product_codes", SortOrder.OrderEnum.Asc, SortMVA.ModeEnum.Min)
};
searchResponse = searchApi.search(searchRequest);

SQL

insert into products values ( 1, 'doc one', (5,7,11) );
select id, count(*), groupby() from products group by product_codes;

Query OK, 1 row affected (0.00 sec)

+------+----------+-----------+
| id   | count(*) | groupby() |
+------+----------+-----------+
|    1 |        1 |        11 |
|    1 |        1 |         7 |
|    1 |        1 |         5 |
+------+----------+-----------+
3 rows in set (0.00 sec)

The order of the numbers inserted as values of multivalued attributes is not preserved. Values are stored internally as a sorted set.

SQL

insert into product values (1,'first',(4,2,1,3));
select * from products;

Query OK, 1 row affected (0.00 sec)

+------+---------------+-------+
| id   | product_codes | title |
+------+---------------+-------+
|    1 | 1,2,3,4       | first |
+------+---------------+-------+
1 row in set (0.01 sec)

JSON

POST /insert
{
    "index":"products",
    "id":1,
    "doc":
    {
        "title":"first",
        "product_codes":[4,2,1,3]
    }
}

POST /search
{
  "index": "products",
  "query": { "match_all": {} }
}

{
   "_index":"products",
   "_id":1,
   "created":true,
   "result":"created",
   "status":201
}

{
   "took":0,
   "timed_out":false,
   "hits":{
      "total":1,
      "hits":[
         {
            "_id":"1",
            "_score":1,
            "_source":{
               "product_codes":[
                  1,
                  2,
                  3,
                  4
               ],
               "title":"first"
            }
         }
      ]
   }
}

PHP

$index->addDocument([
    "title"=>"first",
    "product_codes"=>[4,2,1,3]
]);
$index->search('')-get();

Array
(
    [_index] => products
    [_id] => 1
    [created] => 1
    [result] => created
    [status] => 201
)
Array
(
    [took] => 0
    [timed_out] =>
    [hits] => Array
        (
            [total] => 1
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_id] => 1
                            [_score] => 1
                            [_source] => Array
                                (
                                    [product_codes] => Array
                                        (
                                            [0] => 1
                                            [1] => 2
                                            [2] => 3
                                            [3] => 4
                                        )

                                    [title] => first
                                )
                        )
                )
        )
)

Python

indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}})
searchApi.search({"index":"products","query":{"match_all":{}}})

{'created': True,
 'found': None,
 'id': 1,
 'index': 'products',
 'result': 'created'}
{'hits': {'hits': [{u'_id': u'1',
                    u'_score': 1,
                    u'_source': {u'product_codes': [1, 2, 3, 4],
                                 u'title': u'first'}}],
          'total': 1},
 'profile': None,
 'timed_out': False,
 'took': 29}

javascript

await indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}});
res = await searchApi.search({"index":"products","query":{"match_all":{}}});

{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1,"_source":{"product_codes":[1,2,3,4],"title":"first"}}]}}

java

InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
    put("title","first");
    put("product_codes",new int[] {4,2,1,3});
}};
newdoc.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setQuery(query);
SearchResponse searchResponse = searchApi.search(searchRequest);
System.out.println(searchResponse.toString() );

class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}

Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "first");
doc.Add("product_codes", new List<Object> {4,2,1,3});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
object query =  new { match_all=null };
var searchRequest = new SearchRequest("products", query);
var searchResponse = searchApi.Search(searchRequest);
Console.WriteLine(searchResponse.ToString())

class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}

Multi-value big integer

A data type that allows storing variable-length lists of 64-bit signed integers. It has the same functionality as multi-value integer.

SQL

CREATE TABLE products(title text, values multi64);

JSON

POST /cli -d "CREATE TABLE products(title text, values multi64)"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'values'=>['type'=>'multi64']
]);

Python

utilsApi.sql('CREATE TABLE products(title text, values multi64))')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, values multi64))');

java

utilsApi.sql("CREATE TABLE products(title text, values multi64))");

utilsApi.Sql("CREATE TABLE products(title text, values multi64))");

config

table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi_64 = values
}

Columnar attribute properties

When you use the columnar storage you can specify the following properties for the attributes.

fast_fetch

RT mode

create table t(a int, b int fast_fetch='0') engine='columnar'; desc t;

+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+
3 rows in set (0.00 sec)

Plain mode

source min {
    type = mysql
    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    sql_query = select 1, 1 a, 1 b
    sql_attr_uint = a
    sql_attr_uint = b
}

table tbl {
    path = tbl/col
    source = min
    columnar_attrs = *
    columnar_no_fast_fetch = b
}

+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+

¶ 6.2

Creating a local table

In Manticore Search, there are two ways to manage tables:

Online schema management (RT mode)

Real-time mode requires no table definition in the configuration file. However, the data_dir directive in the searchd section is mandatory. Index files are stored inside the data_dir.

Replication is only available in this mode.

You can use SQL commands such as CREATE TABLE, ALTER TABLE and DROP TABLE to create and modify table schema, and to drop it. This mode is particularly useful for real-time and percolate tables.

Table names are converted to lowercase when created.

Defining table schema in config (Plain mode)

In this mode, you can specify the table schema in the configuration file. Manticore reads this schema on startup and creates the table if it doesn't exist yet. This mode is particularly useful for plain tables that use data from an external storage.

To drop a table, remove it from the configuration file or remove the path setting and send a HUP signal to the server or restart it.

Table names are case-sensitive in this mode.

All table types are supported in this mode.

Table types and modes

Table type	RT mode	Plain mode
Real-time	supported	supported
Plain	not supported	supported
Percolate	supported	supported
Distributed	supported	supported
Template	not supported	supported

¶ 6.2.1

Real-time table

A real-time table is a main type of table in Manticore. It lets you add, update, and delete documents, and you can see these changes right away. You can set up a real-time Table in a configuration file or use commands like CREATE, UPDATE, DELETE, or ALTER.

Internally a real-time table consists of one or more plain tables called chunks. There are two kinds of chunks:

multiple disk chunks - these are saved on a disk and are structured like a plain table.
single ram chunk - this is kept in memory and collects all changes.

The size of the RAM chunk is controlled by the rt_mem_limit setting. Once this limit is reached, the RAM chunk is transferred to disk as a disk chunk. If there are too many disk chunks, Manticore combines some of them to improve performance.

Creating a real-time table:

You can create a new real-time table in two ways: by using the CREATE TABLE command or through the _mapping endpoint of the HTTP JSON API.

CREATE TABLE command:

You can use this command via both SQL and HTTP protocols:

SQL

CREATE TABLE products(title text, price float) morphology='stem_en';

Query OK, 0 rows affected (0.00 sec)

JSON

POST /cli -d "CREATE TABLE products(title text, price float)  morphology='stem_en'"

{
"total":0,
"error":"",
"warning":""
}

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'float'],
]);

Python

utilsApi.sql('CREATE TABLE forum(title text, price float)')

Javascript

res = await utilsApi.sql('CREATE TABLE forum(title text, price float)');

Java

utilsApi.sql("CREATE TABLE forum(title text, price float)");

utilsApi.Sql("CREATE TABLE forum(title text, price float)");

CONFIG

table products {
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
  stored_fields = title
}

_mapping API:

Alternatively, you can create a new table via the _mapping endpoint. This endpoint allows you to define an Elasticsearch-like table structure to be converted to a Manticore table.

The body of your request must have the following structure:

"properties"
{
  "FIELD_NAME_1":
  {
    "type": "FIELD_TYPE_1"
  },
  "FIELD_NAME_2":
  {
    "type": "FIELD_TYPE_2"
  },

  ...

  "FIELD_NAME_N":
  {
    "type": "FIELD_TYPE_M"
  }
}

When creating a table, Elasticsearch data types will be mapped to Manticore types according to the following rules:
- aggregate_metric => json
- binary => string
- boolean => bool
- byte => int
- completion => string
- date => timestamp
- date_nanos => bigint
- date_range => json
- dense_vector => json
- flattened => json
- flat_object => json
- float => float
- float_range => json
- geo_point => json
- geo_shape => json
- half_float => float
- histogram => json
- integer => int
- integer_range => json
- ip => string
- ip_range => json
- keyword => string
- knn_vector => float_vector
- long => bigint
- long_range => json
- match_only_text => text
- object => json
- point => json
- scaled_float => float
- search_as_you_type => text
- shape => json
- short => int
- text => text
- unsigned_long => int
- version => string

JSON

POST /your_table_name/_mapping -d '
{
  "test": {
    "mappings": {
      "properties": {
        "price": {
            "type": "float"
        },
        "title": {
            "type": "text"
        }
      }
    }
  }
}
'

{
"total":0,
"error":"",
"warning":""
}

👍 What you can do with a real-time table:

Add documents.
Update attributes and full-text fields using the Update process.
Delete documents.
Empty the table.
Change the schema online with the ALTER command, as explained in Change schema online.
Define the table in a configuration file, as detailed in Define table.
Use the UUID feature for automatic ID provisioning.

⛔ What you cannot do with a real-time table:

Ingest data using the indexer feature.
Connect it to sources for easy indexing from external storage.
Update the killlist_target, as it is automatically managed by the real-time table.

Real-time table files structure

The following table outlines the different file extensions and their respective descriptions in a real-time table:

Extension	Description
`.lock`	A lock file that ensures that only one process can access the table at a time.
`.ram`	The RAM chunk of the table, stored in memory and used as an accumulator of changes.
`.meta`	The headers of the real-time table that define its structure and settings.
`..sp`	Disk chunks that are stored on disk with the same format as plain tables. They are created when the RAM chunk size exceeds the rt_mem_limit.

For more information on the structure of disk chunks, refer to the plain table files structure.

¶ 6.2.2

Plain table

Plain table is a basic element for non-percolate searching. It can be defined only in a configuration file using the Plain mode, and is not supported in the RT mode. It is typically used in conjunction with a source to process data from the external storage and can later be attached to a real-time table.

Creating a plain table

To create a plain table, you'll need to define it in a configuration file. It's not supported by the CREATE TABLE command.

Here's an example of a plain table configuration and a source for fetching data from a MySQL database:

How to create a plain table

Plain table example

source source {
  type             = mysql
  sql_host         = localhost
  sql_user         = myuser
  sql_pass         = mypass
  sql_db           = mydb
  sql_query        = SELECT id, title, description, category_id  from mytable
  sql_attr_uint    = category_id
  sql_field_string = title
 }

table tbl {
  type   = plain
  source = source
  path   = /path/to/table
 }

👍 What you can do with a plain table:

Build it from an external storage using a source and indexer
Perform an in-place update of integer, float, string and MVA attribute
Update it's killlist_target

⛔ What you cannot do with a plain table:

Insert additional data into the table once it has been built
Delete data from the table
Create, delete, or alter the table schema online
Use UUID for automatic ID generation (data from external storage must include a unique identifier)

Numeric attributes, including MVAs, are the only elements that can be updated in a plain table. All other data in the table is immutable. If updates or new records are required, the table must be rebuilt. During the rebuilding process, the existing table remains available to serve requests, and a process called rotation is performed when the new version is ready, bringing it online and discarding the old version.

Plain table building performance

The speed at which a plain table is indexed depends on several factors, including:

Data source retrieval speed
Tokenization settings
The hardware specifications (such as CPU, RAM, and disk performance)

Plain table building scenarios

Rebuild fully when needed

For small data sets, the simplest option is to have a single plain table that is fully rebuilt as needed. This approach is acceptable when:

The data in the table is not as fresh as the data in the source
The time it takes to build the table increases as the data set grows

Main+delta scenario

For larger data sets, a plain table can be used instead of a Real-Time. The main+delta scenario involves:

Creating a smaller table for incremental indexing
Combining the two tables using a distributed table

This approach allows for infrequent rebuilding of the larger table and more frequent processing of updates from the source. The smaller table can be rebuilt more often (e.g. every minute or even every few seconds).

However, as time goes on, the indexing duration for the smaller table will become too long, requiring a rebuild of the larger table and the emptying of the smaller one.

The main+delta schema is explained in detail in this interactive course.

The mechanism of kill list and killlist_target directive is used to ensure that documents from the current table take precedence over those from the other table.

For more information on this topic, see here.

Plain table files structure

The following table outlines the various file extensions used in a plain table and their respective descriptions:

Extension	Description
`.spa`	stores document attributes in row-wise mode
`.spb`	stores blob attributes in row-wise mode: strings, MVA, json
`.spc`	stores document attributes in columnar mode
`.spd`	stores matching document ID lists for each word ID
`.sph`	stores table header information
`.sphi`	stores histograms of attribute values
`.spi`	stores word lists (word IDs and pointers to `.spd` file)
`.spidx`	stores secondary indexes data
`.spk`	stores kill-lists
`.spl`	lock file
`.spm`	stores a bitmap of killed documents
`.spp`	stores hit (aka posting, aka word occurrence) lists for each word ID
`.spt`	stores additional data structures to speed up lookups by document ids
`.spe`	stores skip-lists to speed up doc-list filtering
`.spds`	stores document texts
`.tmp*`	temporary files during index_settings_and_status
`.new.sp*`	new version of a plain table before rotation
`.old.sp*`	old version of a plain table after rotation

¶ 6.2.3

Plain and real-time table settings

Defining table schema in a configuration file

table <index_name>[:<parent table name>] {
...
}

Plain

table <table name> {
  type = plain
  path = /path/to/table
  source = <source_name>
  source = <another source_name>
  [stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
}

Real-time

table <table name> {
  type = rt
  path = /path/to/table

  rt_field = <full-text field name>
  rt_field = <another full-text field name>
  [rt_attr_uint = <integer field name>]
  [rt_attr_uint = <another integer field name, limit by N bits>:N]
  [rt_attr_bigint = <bigint field name>]
  [rt_attr_bigint = <another bigint field name>]
  [rt_attr_multi = <multi-integer (MVA) field name>]
  [rt_attr_multi = <another multi-integer (MVA) field name>]
  [rt_attr_multi_64 = <multi-bigint (MVA) field name>]
  [rt_attr_multi_64 = <another multi-bigint (MVA) field name>]
  [rt_attr_float = <float field name>]
  [rt_attr_float = <another float field name>]
  [rt_attr_float_vector = <float vector field name>]
  [rt_attr_float_vector = <another float vector field name>]
  [rt_attr_bool = <boolean field name>]
  [rt_attr_bool = <another boolean field name>]
  [rt_attr_string = <string field name>]
  [rt_attr_string = <another string field name>]
  [rt_attr_json = <json field name>]
  [rt_attr_json = <another json field name>]
  [rt_attr_timestamp = <timestamp field name>]
  [rt_attr_timestamp = <another timestamp field name>]

  [stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]

  [rt_mem_limit = <RAM chunk max size, default 128M>]
  [optimize_cutoff = <max number of RT table disk chunks>]

}

Common plain and real-time tables settings

type

type = plain

type = rt

Table type: "plain" or "rt" (real-time)

Value: plain (default), rt

path

path = path/to/table

The path to where the table will be stored or located, either absolute or relative, without the extension.

Value: The path to the table, mandatory

stored_fields

stored_fields = title, content

By default, the original content of full-text fields is indexed and stored when a table is defined in a configuration file. This setting allows you to specify the fields that should have their original values stored.

Value: A comma-separated list of full-text fields that should be stored. An empty value (i.e. stored_fields = ) disables the storage of original values for all fields.

Note: In the case of a real-time table, the fields listed in stored_fields should also be declared as rt_field.

Also, note that you don't need to list attributes in stored_fields, since their original values are stored anyway. stored_fields can only be used for full-text fields.

See also docstore_block_size, docstore_compression for document storage compression options.

SQL

CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)

JSON

POST /cli -d "
CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)"

PHP

$params = [
    'body' => [
        'columns' => [
            'title'=>['type'=>'text'],
            'content'=>['type'=>'text', 'options' => ['indexed', 'stored']],
            'name'=>['type'=>'text', 'options' => ['indexed']],
            'price'=>['type'=>'float']
        ]
    ],
    'index' => 'products'
];
$index = new \Manticoresearch\Index($client);
$index->create($params);

Python

utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)')

Javascript

res = await utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)');

Java

utilsApi.sql("CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)");

utilsApi.Sql("CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)");

CONFIG

table products {
  stored_fields = title, content # we want to store only "title" and "content", "name" shouldn't be stored

  type = rt
  path = tbl
  rt_field = title
  rt_field = content
  rt_field = name
  rt_attr_uint = price
}

stored_only_fields

stored_only_fields = title,content

List of fields that will be stored in the table, but not indexed. This setting works similarly to stored_fields except that when a field is specified in stored_only_fields t will only be stored, not indexed, and cannot be searched using full-text queries. It can only be retrieved in search results.

The value is a comma-separated list of fields that should be stored only, not indexed. By default, this value is empty. If a real-time table is being defined, the fields listed in stored_only_fields must also be declared as rt_field.

Note also, that you don't need to list attributes instored_only_fields,since their original values are stored anyway. If to compare stored_only_fields to string attributes the former (stored field):

is stored on disk and does not take up memory
is stored in a compressed format
can only be fetched, it cannot be used for sorting, filtering or grouping

In contrast, the latter (string attribute):

is stored on disk and in memory
is stored in an uncompressed format
can be used for sorting, grouping, filtering, and any other actions you want to take with attributes.

Real-time table settings:

optimize_cutoff

The maximum number of disk chunks for the RT table. Learn more here.

rt_field

rt_field = subject

This field declaration determines the full-text fields that will be indexed. The field names must be unique, and the order is preserved. When inserting data, the field values must be in the same order as specified in the configuration.

This is a multi-value, optional field.

rt_attr_uint

rt_attr_uint = gid

This declaration defines an unsigned integer attribute.

Value: the field name or field_name:N (where N is the maximum number of bits to keep).

rt_attr_bigint

rt_attr_bigint = gid

This declaration defines a BIGINT attribute.

Value: field name, multiple records allowed.

rt_attr_multi

rt_attr_multi = tags

Declares a multi-valued attribute (MVA) with unsigned 32-bit integer values.

Value: field name. Multiple records allowed.

rt_attr_multi_64

rt_attr_multi_64 = wide_tags

Declares a multi-valued attribute (MVA) with signed 64-bit BIGINT values.

Value: field name. Multiple records allowed.

rt_attr_float

rt_attr_float = lat
rt_attr_float = lon

Declares floating point attributes with single precision, 32-bit IEEE 754 format.

Value: field name. Multiple records allowed.

rt_attr_float_vector

rt_attr_float_vector = image_vector

Declares a vector of floating-point values.

Value: field name. Multiple records allowed.

rt_attr_bool

rt_attr_bool = available

Declares a boolean attribute with 1-bit unsigned integer values.

Value: field name.

rt_attr_string

rt_attr_string = title

String attribute declaration.

Value: field name.

rt_attr_json

rt_attr_json = properties

Declares a JSON attribute.

Value: field name.

rt_attr_timestamp

rt_attr_timestamp = date_added

Declares a timestamp attribute.

Value: field name.

rt_mem_limit

rt_mem_limit = 512M

Memory limit for a RAM chunk of the table. Optional, default is 128M.

RT tables store some data in memory, known as the "RAM chunk," and also maintain a number of on-disk tables, referred to as "disk chunks." This directive allows you to control the size of the RAM chunk. When there is too much data to keep in memory, RT tables will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.

Please note that the limit is strict, and RT tables will never allocate more memory than what is specified in the rt_mem_limit. Additionally, memory is not preallocated, so specifying a 512MB limit and only inserting 3MB of data will result in allocating only 3MB, not 512MB.

The rt_mem_limit is never exceeded, but the actual RAM chunk size can be significantly lower than the limit. RT tables adapt to the data insertion pace and adjust the actual limit dynamically to minimize memory usage and maximize data write speed. This is how it works:

By default, the RAM chunk size is 50% of the rt_mem_limit, referred to as the "rt_mem_limit".
As soon as the RAM chunk accumulates data equivalent to rt_mem_limit * rate data (50% of rt_mem_limit by default), Manticore starts saving the RAM chunk as a new disk chunk.
While a new disk chunk is being saved, Manticore assesses the number of new/updated documents.
After saving a new disk chunk, the rt_mem_limit rate is updated.
The rate is reset to 50% each time you restart the searchd.

For instance, if 90MB of data is saved to a disk chunk and an additional 10MB of data arrives while the save is in progress, the rate would be 90%. Next time, the RT table will collect up to 90% of rt_mem_limit before flushing the data. The faster the insertion pace, the lower the rt_mem_limit rate. The rate varies between 33.3% to 95%. You can view the current rate of a table using the SHOW TABLE STATUS command.

How to change rt_mem_limit and optimize_cutoff

In real-time mode, you can adjust the size limit of RAM chunks and the maximum number of disk chunks using the ALTER TABLE statement. To set rt_mem_limit to 1 gigabyte for the table "t," run the following query: ALTER TABLE t rt_mem_limit='1G'. To change the maximum number of disk chunks, run the query: ALTER TABLE t optimize_cutoff='5'.

In the plain mode, you can change the values of rt_mem_limit and optimize_cutoff by updating the table configuration or running the command ALTER TABLE <index_name> RECONFIGURE

Important notes about RAM chunks

Real-time tables are similar to distributed consisting of multiple local tables, also known as disk chunks.
Each RAM chunk is made up of multiple segments, which are special RAM-only tables.
While disk chunks are stored on disk, RAM chunks are stored in memory.
Each transaction made to a real-time table generates a new segment, and RAM chunk segments are merged after each transaction commit. It is more efficient to perform bulk INSERTs of hundreds or thousands of documents rather than multiple separate INSERTs with one document to reduce the overhead from merging RAM chunk segments.
When the number of segments exceeds 32, they will be merged to keep the count below 32.
Real-time tables always have one RAM chunk (which may be empty) and one or more disk chunks.
Merging larger segments takes longer, so it's best to avoid having a very large RAM chunk (and therefore rt_mem_limit).
The number of disk chunks depends on the data in the table and the rt_mem_limit setting.
Searchd flushes the RAM chunk to disk (as a persisted file, not as a disk chunk) on shutdown and periodically according to the rt_flush_period setting. Flushing several gigabytes to disk may take some time.
A large RAM chunk puts more pressure on storage, both when flushing to disk into the .ram file and when the RAM chunk is full and dumped to disk as a disk chunk.
The RAM chunk is backed up by a binary log until it is flushed to disk, and a larger rt_mem_limit, setting will increase the time it takes to replay the binary log and recover the RAM chunk.
The RAM chunk may be slightly slower than a disk chunk.
Although the RAM chunk itself doesn't take up more memory than rt_mem_limit Manticore may take up more memory in some cases, such as when you start a transaction to insert data and don't commit it for a while. In this case, the data you have already transmitted within the transaction will remain in memory.

Plain table settings:

source

source = srcpart1
source = srcpart2
source = srcpart3

The source field specifies the source from which documents will be obtained during indexing of the current table. There must be at least one source. The sources can be of different types (e.g. one could be MySQL, another PostgreSQL). For more information on indexing from external storages, indexing from external storages here

Value: The name of the source is mandatory. Multiple values are allowed.

killlist_target

killlist_target = main:kl

This setting determines the table(s) to which the kill-list will be applied. Matches in the targeted table that are updated or deleted in the current table will be suppressed. In :kl mode, the documents to suppress are taken from the kill-list. In :id mode, all document IDs from the current table are suppressed in the targeted one. If neither is specified, both modes will take effect. Learn more about kill-lists here

Value: not specified (default), target_index_name:kl, target_index_name:id, target_index_name. Multiple values are allowed

columnar_attrs

columnar_attrs = *
columnar_attrs = id, attr1, attr2, attr3

This configuration setting determines which attributes should be stored in the columnar storage instead of the row-wise storage.

You can set columnar_attrs = * to store all supported data types in the columnar storage.

Additionally, id is a supported attribute to store in the columnar storage.

columnar_strings_no_hash

columnar_strings_no_hash = attr1, attr2, attr3

By default, all string attributes stored in columnar storage store pre-calculated hashes. These hashes are used for grouping and filtering. However, they occupy extra space, and if you don't need to group by that attribute, you can save space by disabling hash generation.

Creating a real-time table online via CREATE TABLE

CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]

For more information on data types, see more about data types here.

Type	Equivalent in a configuration file	Notes	Aliases
text	rt_field	Options: indexed, stored. Default: both. To keep text stored, but indexed, specify "stored" only. To keep text indexed only, specify "indexed" only.	string
integer	rt_attr_uint	integer	int, uint
bigint	rt_attr_bigint	big integer
float	rt_attr_float	float
float_vector	rt_attr_float_vector	a vector of float values
multi	rt_attr_multi	multi-integer
multi64	rt_attr_multi_64	multi-bigint
bool	rt_attr_bool	boolean
json	rt_attr_json	JSON
string	rt_attr_string	string. Option `indexed, attribute` will make the value full-text indexed and filterable, sortable and groupable at the same time
timestamp	rt_attr_timestamp	timestamp
bit(n)	rt_attr_uint field_name:N	N is the max number of bits to keep

SQL

CREATE TABLE products (title text, price float) morphology='stem_en'

This creates the "products" table with two fields: "title" (full-text) and "price" (float), and sets the "morphology" to "stem_en".

CREATE TABLE products (title text indexed, description text stored, author text, price float)

This creates the "products" table with three fields: * "title" is indexed, but not stored. * "description" is stored, but not indexed. * "author" is both stored and indexed.

Engine

create table ... engine='columnar';
create table ... engine='rowwise';

The engine setting changes the default attribute storage for all attributes in the table. You can also specify engine separately for each attribute.

For information on how to enable columnar storage for a plain table, see columnar_attrs .

Values:

columnar - Enables columnar storage for all table attributes, except for json
rowwise (default) - Doesn't change anything and uses the traditional row-wise storage for the table.

Other settings

The following settings are applicable for both real-time and plain tables, regardless of whether they are specified in a configuration file or set online using the CREATE or ALTER command.

Accessing table files

Manticore supports two access modes for reading table data: seek+read and mmap.

In seek+read mode, the server uses the pread system call to read document lists and keyword positions, represented by the*.spd and *.spp files. The server uses internal read buffers to optimize the reading process, and the size of these buffers can be adjusted using the options read_buffer_docs and read_buffer_hits.There is also the option preopen that controls how Manticore opens files at start.

In mmap access mode, the search server maps the table's file into memory using the mmap system call, and the OS caches the file contents. The options read_buffer_docs and read_buffer_hits have no effect for corresponding files in this mode. The mmap reader can also lock the table's data in memory using themlock privileged call, which prevents the OS from swapping the cached data out to disk.

To control which access mode to use, the options access_plain_attrs, access_blob_attrs, access_doclists, access_hitlists and access_dict are available, with the following values:

Value	Description
file	server reads the table files from disk with seek+read using internal buffers on file access
mmap	server maps the table files into memory and OS caches up its contents on file access
mmap_preread	server maps the table files into memory and a background thread reads it once to warm up the cache
mlock	server maps the table files into memory and then executes the mlock() system call to cache up the file contents and lock it into memory to prevent it being swapped out

Setting	Values	Description
access_plain_attrs	mmap, mmap_preread (default), mlock	controls how `.spa` (plain attributes) `.spe` (skip lists) `.spt` (lookups) `.spm` (killed docs) will be read
access_blob_attrs	mmap, mmap_preread (default), mlock	controls how `*.spb` (blob attributes) (string, mva and json attributes) will be read
access_doclists	file (default), mmap, mlock	controls how `*.spd` (doc lists) data will be read
access_hitlists	file (default), mmap, mlock	controls how `*.spp` (hit lists) data will be read
access_dict	mmap, mmap_preread (default), mlock	controls how `*.spi` (dictionary) will be read

Here is a table which can help you select your desired mode:

table part	keep it on disk	keep it in memory	cached in memory on server start	lock it in memory
plain attributes in row-wise (non-columnar) storage, skip lists, word lists, lookups, killed docs	mmap	mmap	mmap_preread (default)	mlock
row-wise string, multi-value attributes (MVA) and json attributes	mmap	mmap	mmap_preread (default)	mlock
columnar numeric, string and multi-value attributes	always	only by means of OS	no	not supported
doc lists	file (default)	mmap	no	mlock
hit lists	file (default)	mmap	no	mlock
dictionary	mmap	mmap	mmap_preread (default)	mlock

The recommendations are:

For the fastest search response time and ample memory availability, use row-wise attributes and lock them in memory using mlock. Additionally, use mlock for doclists/hitlists.
If you prioritize can't afford lower performance after start and are willing to sacrifice longer startup time, use the --force-preread. option. If you desire faster searchd restart, stick to the default mmap_preread option.
If you are looking to conserve memory, while still having enough memory for all attributes, skip the use of mlock. The operating system will determine what should be kept in memory based on frequent disk reads.
If row-wise attributes do not fit into memory, opt for columnar attributes
If full-text search performance is not a concern, and you wish to save memory, use access_doclists/access_hitlists=file

The default mode offers a balance of:

mmap,
Prereading non-columnar attributes,
Seeking and reading columnar attributes with no preread,
Seeking and reading doclists/hitlists with no preread.

This provides a decent search performance, optimal memory utilization, and faster searchd restart in most scenarios.

attr_update_reserve

attr_update_reserve = 256k

This setting reserves extra space for updates to blob attributes such as multi-value attributes (MVA), strings, and JSON. The default value is 128k. When updating these attributes, their length may change. If the updated string is shorter than the previous one, it will overwrite the old data in the *.spb file. If the updated string is longer, it will be written to the end of the *.spb file. This file is memory-mapped, making resizing it a potentially slow process, depending on the operating system's memory-mapped file implementation. To avoid frequent resizing, you can use this setting to reserve extra space at the end of the .spb file.

Value: size, default 128k.

docstore_block_size

docstore_block_size = 32k

This setting controls the size of blocks used by the document storage. The default value is 16kb. When original document text is stored using stored_fields or stored_only_fields, it is stored within the table and compressed for efficiency. To optimize disk access and compression ratios for small documents, these documents are concatenated into blocks. The indexing process collects documents until their total size reaches the threshold specified by this option. At that point, the block of documents is compressed. This option can be adjusted to achieve better compression ratios (by increasing the block size) or faster access to document text (by decreasing the block size).

Value: size, default 16k.

docstore_compression

docstore_compression = lz4hc

This setting determines the type of compression used for compressing blocks of documents stored in document storage. If stored_fields or stored_only_fields are specified, the document storage stores compressed document blocks. 'lz4' offers fast compression and decompression speeds, while 'lz4hc' (high compression) sacrifices some compression speed for a better compression ratio. 'none' disables compression completely.

Values: lz4 (default), lz4hc, none.

docstore_compression_level

docstore_compression_level = 12

The compression level used when 'lz4hc' compression is applied in document storage. By adjusting the compression level, you can find the right balance between performance and compression ratio when using 'lz4hc' compression. Note that this option is not applicable when using 'lz4' compression.

Value: An integer between 1 and 12, with a default of 9.

preopen

preopen = 1

This setting indicates that searchd should open all table files on startup or rotation, and keep them open while running. By default, the files are not pre-opened. Pre-opened tables require a few file descriptors per table, but they eliminate the need for per-query open() calls and are immune to race conditions that might occur during table rotation under high load. However, if you are serving many tables, it may still be more efficient to open them on a per-query basis in order to conserve file descriptors.

Value: 0 (default), or 1.

read_buffer_docs

read_buffer_docs = 1M

Buffer size for storing the list of documents per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.

Value: size, default 256k, minimum value is 8k.

read_buffer_hits

read_buffer_hits = 1M

Buffer size for storing the list of hits per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.

Value: size, default 256k, minimum value is 8k.

Plain table disk footprint settings

inplace_enable

inplace_enable = {0|1}

Enables in-place table inversion. Optional, default is 0 (uses separate temporary files).

The inplace_enable option reduces the disk footprint during indexing of plain tables, while slightly slowing down indexing (it uses approximately 2 times less disk, but yields around 90-95% of the original performance).

Indexing is comprised of two primary phases. During the first phase, documents are collected, processed, and partially sorted by keyword, and the intermediate results are written to temporary files (.tmp*). During the second phase, the documents are fully sorted and the final table files are created. Rebuilding a production table on-the-fly requires approximately 3 times the peak disk footprint: first for the intermediate temporary files, second for the newly constructed copy, and third for the old table that will be serving production queries in the meantime. (Intermediate data is comparable in size to the final table.) This may be too much disk footprint for large data collections, and the inplace_enable option can be used to reduce it. When enabled, it reuses the temporary files, outputs the final data back to them, and renames them upon completion. However, this may require additional temporary data chunk relocation, which is where the performance impact comes from.

This directive has no effect on searchd, it only affects the indexer.

CONFIG

table products {
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_hit_gap

inplace_hit_gap = size

The option In-place inversion fine-tuning option. Controls preallocated hitlist gap size. Optional, default is 0.

This directive only affects the searchd tool, and does not have any impact on the indexer.

CONFIG

table products {
  inplace_hit_gap = 1M
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_reloc_factor

inplace_reloc_factor = 0.1

The inplace_reloc_factor setting determines the size of the relocation buffer within the memory arena used during indexing. The default value is 0.1.

This option is optional and only affects the indexer tool, not the searchd server.

CONFIG

table products {
  inplace_reloc_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_write_factor

inplace_write_factor = 0.1

Controls the size of the buffer used for in-place writing during indexing. Optional, with a default value of 0.1.

It's important to note that this directive only impacts the indexer tool and not the searchd server.

CONFIG

table products {
  inplace_write_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

Natural language processing specific settings

The following settings are supported. They are all described in section NLP and tokenization.

¶ 6.2.4

Percolate table

A percolate table is a special table that stores queries rather than documents. It is used for prospective searches, or "search in reverse."

To learn more about performing a search query against a percolate table, see the section Percolate query.
To learn how to prepare a table for searching, see the section Adding rules to a percolate table.

The schema of a percolate table is fixed and contains the following fields:

Field	Description
ID	An unsigned 64-bit integer with auto-increment functionality. It can be omitted when adding a PQ rule, as described in add a PQ rule
Query	Full-text query of the rule, which can be thought of as the value of MATCH clause or JSON /search. If per field operators are used inside the query, the full-text fields need to be declared in the percolate table configuration. If the stored query is only for attribute filtering (without full-text querying), the query value can be empty or omitted. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table.
Filters	Optional. Filters are an optional string containing attribute filters and/or expressions, defined the same way as in the WHERE clause or JSON filtering. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table.
Tags	Optional. Tags represent a list of string labels separated by commas that can be used for filtering/deleting PQ rules. The tags can also be returned along with matching documents when performing a Percolate query

Note that you do not need to add the above fields when creating a percolate table.

What you need to keep in mind when creating a new percolate table is to specify the expected schema of a document, which will be checked against the rules you will add later. This is done in the same way as for any other local table.

SQL

CREATE TABLE products(title text, meta json) type='pq';

Query OK, 0 rows affected (0.00 sec)

JSON

POST /cli -d "CREATE TABLE products(title text, meta json) type='pq'"

{
"total":0,
"error":"",
"warning":""
}

PHP

$index = [
    'index' => 'products',
    'body' => [
        'columns' => [
            'title' => ['type' => 'text'],
            'meta' => ['type' => 'json']
        ],
        'settings' => [
            'type' => 'pq'
        ]
    ]
];
$client->indices()->create($index);

Array(
    [total] => 0
    [error] =>
    [warning] =>
)

Python

utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'');

java

utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");

utilsApi.Sql("CREATE TABLE products(title text, meta json) type='pq'");

typescript

res = await utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");

apiClient.UtilsAPI.Sql(context.Background()).Body("CREATE TABLE products(title text, meta json) type='pq'").Execute()

CONFIG

table products {
  type = percolate
  path = tbl_pq
  rt_field = title
  rt_attr_json = meta
}

¶ 6.2.5

Template table

A Template Table is a special type of table in Manticore that doesn't store any data and doesn't create any files on your disk. Despite this, it can have the same NLP settings as a plain or real-time table. Template tables can be used for the following purposes:

As a template to inherit settings in the Plain mode, simplifying your Manticore configuration file.
Keyword generation with the help of the CALL KEYWORDS command.
Highlighting an arbitrary string using the CALL SNIPPETS command.

CONFIG

table template {
  type = template
  morphology = stem_en
  wordforms = wordforms.txt
  exceptions = exceptions.txt
  stopwords = stopwords.txt
}

¶ 6.3

NLP and tokenization

¶ 6.3.1

Data tokenization

Manticore doesn't store text as is for performing full-text searching on it. Instead, it extracts words and creates several structures that allow fast full-text searching. From the found words, a dictionary is built, which allows a quick look to discover if the word is present or not in the index. In addition, other structures record the documents and fields in which the word was found (as well as the position of it inside a field). All these are used when a full-text match is performed.

The process of demarcating and classifying words is called tokenization. The tokenization is applied at both indexing and searching, and it operates at the character and word level.

On the character level, the engine allows only certain characters to pass. This is defined by the charset_table. Anything else is replaced with a whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, defined as a phrase boundary.

At the word level, the base setting is the min_word_len which defines the minimum word length in characters to be accepted in the index. A common request is to match singular with plural forms of words. For this, morphology processors can be used.

Going further, we might want a word to be matched as another one because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.

Very common words can have some unwanted effects on searching, mostly because of their frequency they require lots of computing to process their doc/hit lists. They can be blacklisted with the stop words functionality. This helps not only in speeding up queries but also in decreasing the index size.

A more advanced blacklisting is bigrams, which allows creating a special token between a "bigram" (common) word and an uncommon word. This can speed up several times when common words are used in phrase searches.

In case of indexing HTML content, it's important not to index the HTML tags, as they can introduce a lot of "noise" in the index. HTML stripping can be used and can be configured to strip, but index certain tag attributes or completely ignore the content of certain HTML elements.

¶ 6.3.2

Supported languages

Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cjk (which is the default value).

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

Additionally, advanced morphology is available for a few languages that can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.

The table below lists all supported languages and indicates how to enable:

basic support (column "Supported")
stopwords (column "Stopwords file name")
advanced morphology (column "Advanced morphology")

Language	Supported	Stopwords file name	Advanced morphology	Notes
Afrikaans	charset_table=non_cjk	af	-
Arabic	charset_table=non_cjk	ar	morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian	charset_table=non_cjk	hy	-
Assamese	specify charset_table specify charset_table manually	-	-
Basque	charset_table=non_cjk	eu	-
Bengali	charset_table=non_cjk	bn	-
Bishnupriya	specify charset_table manually	-	-
Buhid	specify charset_table manually	-	-
Bulgarian	charset_table=non_cjk	bg	-
Catalan	charset_table=non_cjk	ca	morphology=libstemmer_ca
Chinese	charset_table=chinese or ngram_chars=chinese	zh	morphology=icu_chinese or ngram_chars=1 correspondingly	ICU dictionary based segmentation is much more accurate than ngram-based
Croatian	charset_table=non_cjk	hr	-
Kurdish	charset_table=non_cjk	ckb	-
Czech	charset_table=non_cjk	cz	morphology=stem_cz (Czech stemmer)
Danish	charset_table=non_cjk	da	morphology=libstemmer_da
Dutch	charset_table=non_cjk	nl	morphology=libstemmer_nl
English	charset_table=non_cjk	en	morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto	charset_table=non_cjk	eo	-
Estonian	charset_table=non_cjk	et	-
Finnish	charset_table=non_cjk	fi	morphology=libstemmer_fi
French	charset_table=non_cjk	fr	morphology=libstemmer_fr
Galician	charset_table=non_cjk	gl	-
Garo	specify charset_table manually	-	-
German	charset_table=non_cjk	de	morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek	charset_table=non_cjk	el	morphology=libstemmer_el
Hebrew	charset_table=non_cjk	he	-
Hindi	charset_table=non_cjk	hi	morphology=libstemmer_hi
Hmong	specify charset_table manually	-	-
Ho	specify charset_table manually	-	-
Hungarian	charset_table=non_cjk	hu	morphology=libstemmer_hu
Indonesian	charset_table=non_cjk	id	morphology=libstemmer_id
Irish	charset_table=non_cjk	ga	morphology=libstemmer_ga
Italian	charset_table=non_cjk	it	morphology=libstemmer_it
Japanese	ngram_chars=japanese	-	ngram_chars=japanese ngram_len=1	Requires ngram-based segmentation
Komi	specify charset_table manually	-	-
Korean	ngram_chars=korean	-	ngram_chars=korean ngram_len=1	Requires ngram-based segmentation
Large Flowery Miao	specify charset_table manually	-	-
Latin	charset_table=non_cjk	la	-
Latvian	charset_table=non_cjk	lv	-
Lithuanian	charset_table=non_cjk	lt	morphology=libstemmer_lt
Maba	specify charset_table manually	-	-
Maithili	specify charset_table manually	-	-
Marathi	specify charset_table manually	-	-
Marathi	charset_table=non_cjk	mr	-
Mende	specify charset_table manually	-	-
Mru	specify charset_table manually	-	-
Myene	specify charset_table manually	-	-
Nepali	specify charset_table manually	-	morphology=libstemmer_ne
Ngambay	specify charset_table manually	-	-
Norwegian	charset_table=non_cjk	no	morphology=libstemmer_no
Odia	specify charset_table manually	-	-
Persian	charset_table=non_cjk	fa	-
Polish	charset_table=non_cjk	pl	-
Portuguese	charset_table=non_cjk	pt	morphology=libstemmer_pt
Romanian	charset_table=non_cjk	ro	morphology=libstemmer_ro
Russian	charset_table=non_cjk	ru	morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali	specify charset_table manually	-	-
Sindhi	specify charset_table manually	-	-
Slovak	charset_table=non_cjk	sk	-
Slovenian	charset_table=non_cjk	sl	-
Somali	charset_table=non_cjk	so	-
Sotho	charset_table=non_cjk	st	-
Spanish	charset_table=non_cjk	es	morphology=libstemmer_es
Swahili	charset_table=non_cjk	sw	-
Swedish	charset_table=non_cjk	sv	morphology=libstemmer_sv
Sylheti	specify charset_table manually	-	-
Tamil	specify charset_table manually	-	morphology=libstemmer_ta
Thai	charset_table=non_cjk	th	-
Turkish	charset_table=non_cjk	tr	morphology=libstemmer_tr
Ukrainian	charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491	-	morphology=lemmatize_uk_all	Requires installation of UK lemmatizer
Yoruba	charset_table=non_cjk	yo	-
Zulu	charset_table=non_cjk	zu	-

¶ 6.3.3

Chinese, Japanese and Korean (CJK) languages

Manticore provides built-in support for indexing CJK texts, allowing you to process CJK texts in two different ways:

Precise segmentation using the ICU library. Currently, only Chinese is supported.

SQL

CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cjk',
            'morphology' => 'icu_chinese'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'')

Javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");

CONFIG

table products {
  charset_table = cjk
  morphology = icu_chinese

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Basic support using the N-gram options ngram_len and ngram_chars
For each CJK language, there are separate character set tables (chinese, korean, japanese) that can be used, or you can use the common cjk character set table.

SQL

CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'charset_table' => 'non_cjk',
             'ngram_len' => '1',
             'ngram_chars' => 'cjk'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

CONFIG

table products {
  charset_table = non_cjk
  ngram_len = 1
  ngram_chars = cjk

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Additionally, there is built-in support for Chinese stopwords with the alias zh.

SQL

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'chinese',
            'morphology' => 'icu_chinese',
            'stopwords' => 'zh'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");

CONFIG

table products {
  charset_table = chinese
  morphology = icu_chinese
  stopwords = zh

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

¶ 6.3.4

Low-level tokenization

When text is indexed in Manticore, it is split into words and case folding is done so that words like "Abc", "ABC", and "abc" are treated as the same word.

To perform these operations correctly, Manticore must know:

the encoding of the source text (which should always be UTF-8)
which characters are considered letters and which are not
which letters should be folded to other letters

You can configure these settings on a per-table basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters that you prefer). Characters that are not present in the array are considered to be non-letters and will be treated as word separators during indexing or searching in this table.

The default character set is non_cjk, which includes most languages.

You can also define text pattern replacement rules. For example, with the following rules:

regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR

The text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" would be indexed as PLANK 2 INCH x 4 INCH. These rules are applied in the specified order. The rules also apply to queries, so a search for BLUE TUBE would actually search for COLOR TUBE.

You can learn more about regexp_filter here.

Index configuration options

charset_table

# default
charset_table = non_cjk

# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

# english charset defined with alias
charset_table = 0..9, english, _

# you can override character mappings by redefining them, e.g. for case insensitive search with German umlauts you can use:
charset_table = non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF

charset_table specifies an array that maps letter characters to their case folded versions (or any other characters if you like). The default character set is non_cjk which includes most non-CJK languages.

charset_table is a workhorse of Manticore's tokenization process, which extracts keywords from document text or query text. It controls what characters are accepted as valid and how they should be transformed (e.g. whether case should be removed or not).

By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator. Once a character is mentioned in the table, it is mapped to another character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.

charset_table uses a comma-separated list of mappings to declare characters as valid or to map them to other characters. Syntax shortcuts are available for mapping ranges of characters at once:

Single char mapping: A->a. Declares the source character 'A' as allowed within keywords and maps it to the destination character 'a' (but does not declare 'a' as allowed).
Range mapping: A..Z->a..z. Declares all characters in the source range as allowed and maps them to the destination range. Does not declare the destination range as allowed. Checks the lengths of both ranges.
Stray char mapping: a. Declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
Stray range mapping: a..z. Declares all characters in the range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
Checkerboard range mapping: A..Z/2. Maps every pair of characters to the second character. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for Unicode blocks where uppercase and lowercase letters go in an interleaved order.

For characters with codes from 0 to 32, and those in the range of 127 to 8-bit ASCII and Unicode characters, Manticore always treats them as separators. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+XXX form, where XXX is a hexadecimal code point number. The minimal accepted Unicode character code is U+0021.

If the default mappings are insufficient for your needs, you can redefine the character mappings by specifying them again with another mapping. For example, if the built-in non_cjk array includes characters Ä and ä and maps them both to the ASCII character a, you can redefine those characters by adding the Unicode code points for them, like this:

charset_table = non_cjk,U+00E4,U+00C4

for case sensitive search or

charset_table = non_cjk,U+00E4,U+00C4->U+00E4

for case insensitive search.

SQL

CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");

CONFIG

table products {
  charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:

english
russian
non_cjk
cjk

SQL

CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => '0..9, english, _'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");

CONFIG

table products {
  charset_table = 0..9, english, _

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you want to support different languages in your search, it can be a laborious task to define sets of valid characters and folding rules for all of them. We have simplified this for you by providing default charset tables, non_cjk and cjk, that cover non-CJK and CJK (Chinese, Japanese, Korean) languages respectively. In most cases, these charsets should be sufficient for your needs.

Please note that the following languages are currently not supported:

Assamese
Bishnupriya
Buhid
Garo
Hmong
Ho
Komi
Large Flowery Miao
Maba
Maithili
Marathi
Mende
Mru
Myene
Ngambay
Odia
Santali
Sindhi
Sylheti

All other languages listed in the Unicode languages
list are supported by default.

To work with both cjk and non-cjk languages, set the options in your configuration file as shown below (with an exception for Chinese):

SQL

CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'charset_table' => 'non_cjk',
             'ngram_len' => '1',
             'ngram_chars' => 'cjk'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");

CONFIG

table products {
  charset_table       = non_cjk
  ngram_len           = 1
  ngram_chars         = cjk

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you do not require support for cjk-languages, you can simply exclude the ngram_len and ngram_chars
options. For more information on these options, refer to the corresponding documentation sections.

To map one character to multiple characters or vice versa, you can use regexp_filter can be helpful.

blend_chars

blend_chars = +, &, U+23
blend_chars = +, &->+

Blended characters list. Optional, default is empty.

Blended characters are indexed as both separators and valid characters. For example, when & is defined as a blended character and AT&T appears in an indexed document, three different keywords will be indexed, at&t, at and t.

Additionally, blended characters can influence indexing in such a way that keywords are indexed as if the blended characters were not typed at all. This behavior is particularly evident when blend_mode = trim_all is specified. For example, the phrase some_thing will be indexed as some, something, and thing with blend_mode = trim_all.

Care should be taken when using blended characters as defining a character as blended means that it is no longer a separator.

Therefore, if you put a comma to the blend_chars and search for dog,cat, it will treat that as a single token dog,cat. If dog,cat was not indexed as dog,cat, but left as dog cat only, then it will not match.
Hence, this behavior should be controlled with the blend_mode setting.

Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, and regular keywords will be indexed as if there were no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs in the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1, blending with the opening regular keyword. As a result, queries for AT&T or just AT will match that document. A phrase query for "AT T" will also match, as well as a phrase query for "AT&T company".

Blended characters can overlap with special characters used in query syntax, such as T-Mobile or @twitter. Where possible, the query parser will handle the blended character as blended. For instance, if hello @twitter is within quotes (a phrase operator), the query parser will handle the @ symbol as blended. However, if the @ symbol was not within quotes, the character would be handled as an operator. Therefore, it is recommended to escape keywords.

Blended characters can be remapped so that multiple different blended characters can be normalized into one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.

SQL

CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'blend_chars' => '+, &, U+23, @->_'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");

utilsApi.Sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");

CONFIG

table products {
  blend_chars = +, &, U+23, @->_

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

blend_mode

blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | trim_all | skip_pure

The blended tokens indexing mode is enabled by the blend_mode directive.

By default, tokens that mix blended and non-blended characters get indexed entirely. For example, when both an at-sign and an exclamation are in blend_chars, the string @dude! will be indexed as two tokens: @dude! (with all the blended characters) and dude (without any). As a result, a query of @dude will not match it.

blend_mode adds flexibility to this indexing behavior. It takes a comma-separated list of options, each of which specifies a token indexing variant.

If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with a separator) are always indexed.

The options are:

trim_none - Index the entire token
trim_head - Trim heading blended characters, and index the resulting token
trim_tail - Trim trailing blended characters, and index the resulting token
trim_both- Trim both heading and trailing blended characters, and index the resulting token
trim_all - Trim heading, trailing, and middle blended characters, and index the resulting token
skip_pure - Do not index the token if it is purely blended, that is, consists of blended characters only

Using blend_mode with the example @dude! string above, the setting blend_mode = trim_head, trim_tail would result in two indexed tokens: @dude and dude!. Using trim_both would have no effect because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that dot is blended two) would result in U.S.A being indexed. Lastly, skip_pure enables you to ignore sequences of blended characters only. For example, one @@@ two would be indexed as one two, and match that as a phrase. This is not the case by default because a fully blended token gets indexed and offsets the second keyword position.

Default behavior is to index the entire token, equivalent to blend_mode = trim_none.

Be aware that using blend modes limits your search, even with the default mode trim_none if you assume . is a blended character:

.dog. will become dog. dog during indexing
and you won't be able to find it by dog..

Using more modes increases the chance your keyword will match something.

SQL

CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'blend_mode' => 'trim_tail, skip_pure',
            'blend_chars' => '+, &'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");

utilsApi.Sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");

CONFIG

table products {
  blend_mode = trim_tail, skip_pure
  blend_chars = +, &

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

min_word_len

min_word_len = length

min_word_len is an optional index configuration option in Manticore that specifies the minimum indexed word length. The default value is 1, which means that everything is indexed.

Only those words that are not shorter than this minimum will be indexed. For example, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.

SQL

CREATE TABLE products(title text, price float) min_word_len = '4'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) min_word_len = '4'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_word_len' => '4'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) min_word_len = '4'");

utilsApi.Sql("CREATE TABLE products(title text, price float) min_word_len = '4'");

CONFIG

table products {
  min_word_len = 4

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ngram_len

ngram_len = 1

N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.

N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there may be no clear separators between the words. In some cases, you may not want to use dictionary-based segmentation the one available for Chinese. In those cases, n-gram segmentation might work well too.

When this feature is enabled, streams of CJK (or any other characters defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some CJK characters) and ngram_len is 1, it will be indexed as if it were "A B C D E F". Only ngram_len=1 is currently supported. Only those characters that are listed in ngram_chars table will be split this way; others will not be affected.

Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.

Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram CJK words can mean closer multi-character word matches) to the top.

SQL

CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'cjk',
             'ngram_len' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

CONFIG

table products {
  ngram_chars = cjk
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ngram_chars

ngram_chars = cjk

ngram_chars = cjk, U+3000..U+2FA1F

N-gram characters list. Optional, default is empty.

To be used in conjunction with in ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.

SQL

CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'U+3000..U+2FA1F',
             'ngram_len' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");

CONFIG

table products {
  ngram_chars = U+3000..U+2FA1F
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Also you can use an alias for our default N-gram table as in the example. It should be sufficient in most cases.

SQL

CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'cjk',
             'ngram_len' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");

CONFIG

table products {
  ngram_chars = cjk
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ignore_chars

ignore_chars = U+AD

Ignored characters list. Optional, default is empty.

Useful in cases when some characters, such as soft hyphenation mark (U+00AD), should be not just treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, "abc-def" text will be indexed as "abc" and "def" keywords. On the contrary, if '-' is added to ignore_chars list, the same text will be indexed as a single "abcdef" keyword.

The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.

SQL

CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'ignore_chars' => 'U+AD'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");

utilsApi.Sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");

CONFIG

table products {
  ignore_chars = U+AD

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

bigram_index

bigram_index = {none|all|first_freq|both_freq}

Bigram indexing mode. Optional, default is none.

Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent words pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.

bigram_index controls the selection of specific word pairs. The known modes are:

all, index every single word pair
first_freq, only index word pairs where the first word is in a list of frequent words (see bigram_freq_words). For example, with bigram_freq_words = the, in, i, a, indexing "alone in the dark" text will result in "in the" and "the dark" pairs being stored as bigrams, because they begin with a frequent keyword (either "in" or "the" respectively), but "alone in" would not be indexed, because "in" is a second word in that pair.
both_freq, only index word pairs where both words are frequent. Continuing with the same example, in this mode indexing "alone in the dark" would only store "in the" (the very worst of them all from searching perspective) as a bigram, but none of the other word pairs.

For most use cases, both_freq would be the best mode, but your mileage may vary.

SQL

CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'bigram_freq_words' => 'the, a, you, i',
            'bigram_index' => 'both_freq'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");

utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");

CONFIG

table products {
  bigram_index = both_freq
  bigram_freq_words = the, a, you, i

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

bigram_freq_words

bigram_freq_words = the, a, you, i

A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.

Some of the bigram indexing modes (see bigram_index) require to define a list of frequent keywords. These are not to be confused with stop words. Stop words are completely eliminated when both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.

bigram_freq_words lets you define a list of such keywords.

SQL

CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'bigram_freq_words' => 'the, a, you, i',
            'bigram_index' => 'first_freq'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");

utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");

CONFIG

table products {
  bigram_freq_words = the, a, you, i
  bigram_index = first_freq

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

dict

dict = {keywords|crc}

The type of keywords dictionary used is identified by one of two known values, 'crc' or 'keywords'. This is optional, with 'keywords' as the default.

Using the keywords dictionary mode (dict=keywords) can significantly decrease the indexing burden and enable substring searches on extensive collections. This mode can be utilized for both plain and RT tables.

CRC dictionaries do not store the original keyword text in the index. Instead, they replace keywords with a control sum value (computed using FNV64) during both searching and indexing processes. This value is used internally within the index. This approach has two disadvantages:

Firstly, there's a risk of control sum collisions between different keywords pairs. This risk grows in proportion to the number of unique keywords in the index. Nonetheless, this concern is minor as the probability of a single FNV64 collision in a dictionary of 1 billion entries is roughly 1 in 16, or 6.25 percent. Most dictionaries will have far fewer than a billion keywords given that a typical spoken human language has between 1 and 10 million word forms.
Secondly, and more crucially, it's not straightforward to perform substring searches with control sums. Manticore addressed this issue by pre-indexing all possible substrings as separate keywords (see min_prefix_len, min_infix_len directives). This method even has an added advantage of matching substrings in the fastest way possible. Yet, pre-indexing all substrings significantly increases the index size (often by factors of 3-10x or more) and subsequently affects the indexing time, making substring searches on large indexes rather impractical.

The keywords dictionary resolves both of these issues. It stores keywords in the index and performs search-time wildcard expansion. For instance, a search for a test* prefix could internally expand to a 'test|tests|testing' query based on the dictionary's contents. This expansion process is entirely invisible to the application, with the exception that the separate per-keyword statistics for all the matched keywords are now also reported.

For substring (infix) searches, extended wildcards can be used. Special characters such as ? and % are compatible with substring (infix) search (e.g., t?st*, run%, *abc*). Note that the wildcards operators and the REGEX only function with dict=keywords.

Indexing with a keywords dictionary is approximately 1.1x to 1.3x slower than regular, non-substring indexing - yet significantly faster than substring indexing (either prefix or infix). The index size should only be slightly larger than that of the standard non-substring table, with a total difference of 1..10% percent. The time it takes for regular keyword searching should be nearly the same or identical across all three index types discussed (CRC non-substring, CRC substring, keywords). Substring searching time can significantly fluctuate based on how many actual keywords match the given substring (i.e., how many keywords the search term expands into). The maximum number of matched keywords is limited by the expansion_limit directive.

In summary, keywords and CRC dictionaries offer two different trade-off decisions for substring searching. You can opt to either sacrifice indexing time and index size to achieve the fastest worst-case searches (CRC dictionary), or minimally impact indexing time but sacrifice worst-case searching time when the prefix expands into a high number of keywords (keywords dictionary).

SQL

CREATE TABLE products(title text, price float) dict = 'keywords'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) dict = 'keywords'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'dict' => 'keywords'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) dict = 'keywords'");

utilsApi.Sql("CREATE TABLE products(title text, price float) dict = 'keywords'");

CONFIG

table products {
  dict = keywords

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

embedded_limit

embedded_limit = size

Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.

When you create a table the above mentioned files can be either saved externally along with the table or embedded directly into the table. Files sized under embedded_limit get stored into the table. For bigger files, only the file names are stored. This also simplifies moving table files to a different machine; you may get by just copying a single file.

With smaller files, such embedding reduces the number of the external files on which the table depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta table. So there needs to be a size threshold, and embedded_limit is that threshold.

CONFIG

table products {
  embedded_limit = 32K

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

global_idf

global_idf = /path/to/global.idf

The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).

On a multi-table cluster, per-keyword frequencies are quite likely to differ across different tables. That means that when the ranking function uses TF-IDF based values, such as BM25 family of factors, the results might be ranked slightly differently depending on what cluster node they reside.

The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given table and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection documents counts from the global_idf file, rather than just the local table. That way, IDFs and the values that depend on them will stay consistent across the cluster.

IDF files can be shared across multiple tables. Only a single copy of an IDF file will be loaded by searchd, even when many tables refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.

You can build an .idf file using indextool utility, by dumping dictionaries using --dumpdict dict.txt --stats switch first, then converting those to .idf format using --buildidf, then merging all the .idf files across cluster using --mergeidf.

SQL

CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'global_idf' => '/usr/local/manticore/var/global.idf'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");

utilsApi.Sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");

CONFIG

table products {
  global_idf = /usr/local/manticore/var/global.idf

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

hitless_words

hitless_words = {all|path/to/file}

Hitless words list. Optional, allowed values are 'all', or a list file name.

By default, Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as hitlist). Hitlists enables phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that can not be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) nor phrase ranking.

hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.

Hitless index will generally use less space than the respective regular full-text index (about 1.5x can be expected). Both indexing and searching should be faster, at a cost of missing positional query and ranking support.

If used in positional queries (e.g. phrase queries) the hitless words are taken out from them and used as operand without a position. For example if "hello" and "world" are hitless and "simon" and "says" are not hitless, the phrase query "simon says hello world" will be converted to ("simon says" & hello & world), matching "hello" and "world" anywhere in the document and "simon says" as an exact phrase.

A positional query than contains only hitless words will result in an empty phrase node, therefore the entire query will return an empty result and a warning. If the whole dictionary is hitless (using all) only boolean matching can be used on the respective index.

SQL

CREATE TABLE products(title text, price float) hitless_words = 'all'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) hitless_words = 'all'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'hitless_words' => 'all'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");

utilsApi.Sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");

CONFIG

table products {
  hitless_words = all

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_field_lengths

index_field_lengths = {0|1}

Enables computing and storing of field lengths (both per-document and average per-index values) into the full-text index. Optional, default is 0 (do not compute and store).

When index_field_lengths is set to 1 Manticore will:

create a respective length attribute for every full-text field, sharing the same name but with __len suffix
compute a field length (counted in keywords) for every document and store in to a respective attribute
compute the per-index averages. The lengths attributes will have a special TOKENCOUNT type, but their values are in fact regular 32-bit integers, and their values are generally accessible.

BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There's also support for both a complete variant of BM25, and its extension towards multiple fields, called BM25F. They require per-document length and per-field lengths, respectively. Hence the additional directive.

SQL

CREATE TABLE products(title text, price float) index_field_lengths = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) index_field_lengths = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_field_lengths' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");

CONFIG

table products {
  index_field_lengths = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_token_filter

index_token_filter = my_lib.so:custom_blend:chars=@#&

Index-time token filter for full-text indexing. Optional, default is empty.

The index_token_filter directive specifies an optional index-time token filter for full-text indexing. This directive is used to create a custom tokenizer that makes tokens according to custom rules. The filter is created by the indexer on indexing source data into a plain table or by an RT table on processing INSERT or REPLACE statements. The plugins are defined using the format library name:plugin name:optional string of settings. For example, my_lib.so:custom_blend:chars=@#&.

SQL

CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_token_filter' => 'my_lib.so:custom_blend:chars=@#&'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");

utilsApi.Sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");

CONFIG

table products {
  index_token_filter = my_lib.so:custom_blend:chars=@#&

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

overshort_step

overshort_step = {0|1}

Position increment on overshort (less than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.

SQL

CREATE TABLE products(title text, price float) overshort_step = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) overshort_step = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'overshort_step' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) overshort_step = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) overshort_step = '1'");

CONFIG

table products {
  overshort_step = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

phrase_boundary

phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis

Phrase boundary characters list. Optional, default is empty.

This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.

On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.

Phrase boundary condition will be raised if and only if such character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.

SQL

CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'phrase_boundary' => '., ?, !, U+2026',
             'phrase_boundary_step' => '10'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");

utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");

CONFIG

table products {
  phrase_boundary = ., ?, !, U+2026
  phrase_boundary_step = 10

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

phrase_boundary_step

phrase_boundary_step = 100

Phrase boundary word position increment. Optional, default is 0.

On phrase boundary, current word position will be additionally incremented by this number.

SQL

CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'phrase_boundary_step' => '100',
             'phrase_boundary' => '., ?, !, U+2026'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");

utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");

CONFIG

table products {
  phrase_boundary_step = 100
  phrase_boundary = ., ?, !, U+2026

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

regexp_filter

# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch

# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color

Regular expressions (regexps) used to filter the fields and queries. This directive is optional, multi-valued, and its default is an empty list of regular expressions. The regular expressions engine used by Manticore Search is Google's RE2, which is known for its speed and safety. For detailed information on the syntax supported by RE2, you can visit the RE2 syntax guide.

In certain applications such as product search, there can be many ways to refer to a product, model, or property. For example, iPhone 3gs and iPhone 3 gs (or even iPhone3 gs) are very likely to refer to the same product. Another example could be different ways to express a laptop screen size, such as 13-inch, 13 inch, 13", or 13in.

Regexps provide a mechanism to specify rules tailored to handle such cases. In the first example, you could possibly use a wordforms file to handle a handful of iPhone models, but in the second example, it's better to specify rules that would normalize "13-inch" and "13in" to something identical.

Regular expressions listed in regexp_filter are applied in the order they are listed, at the earliest stage possible, before any other processing (including exceptions), even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.

SQL

CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'regexp_filter' => '(blue|red) => color'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");

utilsApi.Sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");

CONFIG

table products {
  # index '13"' as '13inch'
  regexp_filter = \b(\d+)\" => \1inch

  # index 'blue' or 'red' as 'color'
  regexp_filter = (blue|red) => color

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

¶ 6.3.5

Wildcard searching settings

Wildcard searching is a common text search type. In Manticore, it is performed at the dictionary level. By default, both plain and RT tables use a dictionary type called dict. In this mode, words are stored as they are, so enabling wildcarding does not affect the size of the table. When a wildcard search is performed, the dictionary is searched to find all possible expansions of the wildcarded word. This expansion can be problematic in terms of computation at query time when the expanded word provides many expansions or expansions that have huge hitlists, especially in the case of infixes where the wildcard is added at the start and end of the word. To avoid such problems, the expansion_limit can be used.

min_prefix_len

min_prefix_len = length

This setting determines the minimum word prefix length to index and search. By default, it is set to 0, meaning prefixes are not allowed.

Prefixes allow for wildcard searching by wordstart* wildcards.

For example, if the word "example" is indexed with min_prefix_len=3, it can be found by searching for "exa", "exam", "examp", "exampl", as well as the full word.

Note that with dict=crc min_prefix_len will affect the size of the full-text index since each word expansion will be stored additionally.

Manticore can differentiate perfect word matches from prefix matches and rank the former higher if the following conditions are met:

dict=keywords (on by default)
index_exact_words=1 (off by default),
expand_keywords=1 (also off by default)

Note that with either dict=crc mode or any of the above options disabled, it is not possible to differentiate between prefixes and full words, and perfect word matches cannot be ranked higher.

When the minimum infix length is set to a positive number, the minimum prefix length is always considered 1.

SQL

CREATE TABLE products(title text, price float) min_prefix_len = '3'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) min_prefix_len = '3'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_prefix_len' => '3'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'");

utilsApi.Sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'");

CONFIG

table products {
  min_prefix_len = 3

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

min_infix_len

min_infix_len = length

The min_infix_len setting determines the minimum length of an infix prefix to index and search. It is optional and its default value is 0, which means that infixes are not allowed. The minimum allowed non-zero value is 2.

When enabled, infixes allow for wildcard searching with term patterns like start*, *end, *middle*, , and so on. It also allows you to disable too short wildcards if they are too expensive to search for.

If the following conditions are met, Manticore can differentiate perfect word matches from infix matches and rank the former higher:

dict=keywords (on by default)
index_exact_words=1 (off by default),
expand_keywords=1 (also off by default)

Note that with the dict=crc mode or any of the above options disabled, there is no way to differentiate between infixes and full words, and thus perfect word matches cannot be ranked higher.

Infix wildcard search query time can vary greatly, depending on how many keywords the substring will actually expand to. Short and frequent syllables like *in* or *ti* might expand to way too many keywords, all of which would need to be matched and processed. Therefore, to generally enable substring searches, you would set min_infix_len to 2. To limit the impact from wildcard searches with too short wildcards, you might set it higher.

Infixes must be at least 2 characters long, and wildcards like *a* are not allowed for performance reasons.

When min_infix_len is set to a positive number, the minimum prefix length is considered 1. For dict word infixing and prefixing cannot be both enabled at the same time. For dict and other fields to have prefixes declared with prefix_fields, it is forbidden to declare the same field in both lists.

If dict=keywords, besides the wildcard * two other wildcard characters can be used:

? can match any (one) character: t?st will match test, but not teast
% can match zero or one character: tes% will match tes or test, but not testing

SQL

CREATE TABLE products(title text, price float) min_infix_len = '3'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) min_infix_len = '3'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_infix_len' => '3'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) min_infix_len = '3'");

utilsApi.Sql("CREATE TABLE products(title text, price float) min_infix_len = '3'");

CONFIG

table products {
  min_infix_len = 3

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

prefix_fields

prefix_fields = field1[, field2, ...]

The prefix_fields setting is used to limit prefix indexing to specific full-text fields in dict=crc mode. By default, all fields are indexed in prefix mode, but because prefix indexing can affect both indexing and searching performance, it may be desired to limit it to certain fields.

To limit prefix indexing to specific fields, use the prefix_fields setting followed by a comma-separated list of field names. If prefix_fields is not set, then all fields will be indexed in prefix mode.

CONFIG

table products {
  prefix_fields = title, name
  min_prefix_len = 3
  dict = crc

infix_fields

infix_fields = field1[, field2, ...]

The infix_fields setting allows you to specify a list of full-text fields to limit infix indexing to. This applies to dict=crc only and is optional, with the default being to index all fields in infix mode.
This setting is similar to prefix_fields, but instead allows you to limit infix indexing to specific fields.

CONFIG

table products {
  infix_fields = title, name
  min_infix_len = 3
  dict = crc

max_substring_len

max_substring_len = length

The max_substring_len directive sets the maximum substring length to be indexed for either prefix or infix searches. This setting is optional, and its default value is 0 (which means that all possible substrings are indexed). It only applies to dict.

By default, substring indexing in dict indexes all possible substrings as separate keywords, which can result in an overly large full-text index. Therefore, the max_substring_len directive allows you to skip too-long substrings that will probably never be searched for.

For example, a test table of 10,000 blog posts takes up a different amount of disk space depending on the settings:

6.4 MB baseline (no substrings)
24.3 MB (3.8x) with min_prefix_len = 3
22.2 MB (3.5x) with min_prefix_len = 3, max_substring_len = 8
19.3 MB (3.0x) with min_prefix_len = 3, max_substring_len = 6
94.3 MB (14.7x) with min_infix_len = 3
84.6 MB (13.2x) with min_infix_len = 3, max_substring_len = 8
70.7 MB (11.0x) with min_infix_len = 3, max_substring_len = 6

Therefore, limiting the max substring length can save 10-15% of the table size.

When using dict=keywords mode, there is no performance impact associated with substring length. Therefore, this directive is not applicable and is intentionally forbidden in that case. However, if required, you can still limit the length of a substring that you search for in the application code.

CONFIG

table products {
  max_substring_len = 12
  min_infix_len = 3
  dict = crc

expand_keywords

expand_keywords = {0|1|exact|star}

This setting expands keywords with their exact forms and/or with stars when possible. The supported values are:

1 - expand to both the exact form and the form with the stars. For instance,running will become (running | *running* | =running)
exact - - augment the keyword with only its exact form. For instance, running will become (running | =running)
star - augment the keyword by adding * around it. For instance, running will become (running | *running*)
This setting is optional, and the default value is 0 (keywords are not expanded).

Queries against tables with expand_keywords feature enabled are internally expanded as follows: if the table was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of the keyword itself and a respective prefix or infix (keyword with stars). If the table was built with both stemming and index_exact_words enabled, exact form is also added.

SQL

CREATE TABLE products(title text, price float) expand_keywords = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) expand_keywords = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'expand_keywords' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) expand_keywords = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) expand_keywords = '1'");

CONFIG

table products {
  expand_keywords = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Expanded queries take naturally longer to complete, but can possibly improve the search quality, as the documents with exact form matches should be ranked generally higher than documents with stemmed or infix matches.

Note that the existing query syntax does not allow to emulate this kind of expansion, because internal expansion works on keyword level and expands keywords within phrase or quorum operators too (which is not possible through the query syntax). Take a look at the examples and how expand_keywords affects the search result weights and how "runsy" is found by "runs" w/o the need to add a star:

expand_keywords_enabled

mysql> create table t(f text) min_infix_len='2' expand_keywords='1' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    2 | runs    |     1560 |
|    1 | running |     1500 |
|    3 | runsy   |     1500 |
+------+---------+----------+
3 rows in set (0.01 sec)

mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t(f text) min_infix_len='2' expand_keywords='exact' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1590 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)

expand_keywords_disabled

mysql> create table t(f text) min_infix_len='2' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1500 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)

mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t(f text) min_infix_len='2' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1500 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)

This directive does not affect indexer in any way, it only affects searchd.

expansion_limit

expansion_limit = number

Maximum number of expanded keywords for a single wildcard. Details are here.

¶ 6.3.6

Ignoring stop words

Stop words are words that are ignored during indexing and searching, typically due to their high frequency and low value to search results.

Manticore Search applies stemming to stop words by default, which can lead to undesired results, but this can be turned off using the stopwords_unstemmed.

Small stop word files are stored in the table header, and there is a limit to the size of files that can be embedded, as defined by the embedded_limit option.

Stop words are not indexed, but they do affect keyword positions. For example, if "the" is a stop word, and document 1 contains the phrase "in office" while document 2 contains the phrase "in the office," searching for "in office" as an exact phrase will only return the first document, even though "the" is skipped as a stop word in the second document. This behavior can be modified using the stopword_step directive.

stopwords

stopwords=path/to/stopwords/file[ path/to/another/file ...]

The stopwords setting is optional and by default empty. It allows you to specify the path to one or more stop word files, separated by spaces. All the files will be loaded. In the real-time mode, only absolute paths are allowed.

The stop word file format is simple plain text with UTF-8 encoding. The file data will be tokenized with respect to the charset_table settings, so you can use the same separators as in the indexed data.

Stop word files can be created manually or semi-automatically. The indexer provides a mode that creates a frequency dictionary of the table, sorted by the keyword frequency. Top keywords from that dictionary can usually be used as stop words. See --buildstops and --buildfreqs switch for details. Top keywords from that dictionary can usually be used as stop words.

SQL

CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'");

utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'");

CONFIG

table products {
  stopwords = /usr/local/manticore/data/stopwords.txt
  stopwords = stopwords-ru.txt stopwords-en.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Alternatively you can use one of the default stop word files that come with Manticore. Currently stop words for 50 languages are available. Here is the full list of aliases for them:

af - Afrikaans
ar - Arabic
bg - Bulgarian
bn - Bengali
ca - Catalan
ckb- Kurdish
cz - Czech
da - Danish
de - German
el - Greek
en - English
eo - Esperanto
es - Spain
et - Estonian
eu - Basque
fa - Persian
fi - Finnish
fr - French
ga - Irish
gl - Galician
hi - Hindi
he - Hebrew
hr - Croatian
hu - Hungarian
hy - Armenian
id - Indonesian
it - Italian
ja - Japanese
ko - Korean
la - Latin
lt - Lithuanian
lv - Latvian
mr - Marathi
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sk - Slovak
sl - Slovenian
so - Somali
st - Sotho
sv - Swedish
sw - Swahili
th - Thai
tr - Turkish
yo - Yoruba
zh - Chinese
zu - Zulu

For example, to use stop words for Italian language just put the following line in your config file:

SQL

CREATE TABLE products(title text, price float) stopwords = 'it'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'it'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'it'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'it'");

utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = 'it'");

CONFIG

table products {
  stopwords = it

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you need to use stop words for multiple languages you should list all their aliases, separated with commas (RT mode) or spaces (plain mode):

SQL

CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'");

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'");

CONFIG

table products {
  stopwords = en it ru

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

stopword_step

stopword_step={0|1}

The position_increment setting on stopwords is optional, and the allowed values are 0 and 1, with the default being 1.

SQL

CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru',
            'stopword_step' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'");

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'");

CONFIG

table products {
  stopwords = en
  stopword_step = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

stopwords_unstemmed

stopwords_unstemmed={0|1}

Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).

By default, stop words are stemmed themselves, and then applied to tokens after stemming (or any other morphology processing). This means that a token is stopped when stem(token) is equal to stem(stopword). This default behavior can lead to unexpected results when a token is erroneously stemmed to a stopped root. For example, "Andes" might get stemmed to "and", so when "and" is a stopword, "Andes" is also skipped.

However, you can change this behavior by enabling the stopwords_unstemmed directive. When this is enabled, stop words are applied before stemming (and therefore to the original word forms), and the tokens are skipped when the token is equal to the stopword.

SQL

CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru',
            'stopwords_unstemmed' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'");

utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'");

CONFIG

table products {
  stopwords = en
  stopwords_unstemmed = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

¶ 6.3.7

Word forms

Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

wordforms

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

Word forms dictionary. Optional, default is empty.

The dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the word forms file.

Word forms support in Manticore is designed to handle large dictionaries well. They moderately affect indexing speed; for example, a dictionary with 1 million entries slows down indexing by about 1.5 times. Searching speed is not affected at all. The additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across tables. For instance, if the very same 50 MB word forms file is specified for 10 different tables, the additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms, in UTF-8 encoding, separated by a "greater than" sign. Rules from the charset_table will be applied when the file is loaded, so if you are using built-in charset_table options, it is typically case-insensitive, just like your other full-text indexed data. Here is a sample file contents:

walks > walk
walked > walk
walking > walk

There is a bundled utility called Spelldump that helps you create a dictionary file in a format that Manticore can read. The utility can read from source .dict and .aff dictionary files in the ispell or MySpell format, as bundled with OpenOffice.

You can map several source words to a single destination word. The process happens on tokens, not the source text, so differences in whitespace and markup are ignored.

You can use the => symbol instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform will be applied after morphology, instead of before (note that only a single source and destination word are supported in this case).

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'

You can specify multiple destination tokens:

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

You can specify multiple files, not just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order.

In the RT mode, only absolute paths are allowed.

If multi-byte codepages are used and file names include foreign characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in multiple files, the latter one is used and overrides previous definitions.

SQL

CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float)wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");

utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");

CONFIG

table products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

¶ 6.3.8

Exceptions

Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.

A short summary of the differences from wordforms is as follows:

Exceptions	Word forms
Case sensitive	Case insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries

exceptions

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, the default is empty.
In the RT mode, only absolute paths are allowed.

The expected file format is plain text, with one line per exception. The line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus

All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the "at&t" text will be tokenized as two keywords "at" and "t" due to lowercase letters. On the other hand, "AT&T" will match exactly and produce a single "AT&T" keyword.

Note that this map-to keyword:

is always interpreted as a single word
is both case and space sensitive

In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". The mapping for "MS Windows" is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)

The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, the "AT & T" map-from token will match "AT & T" text, whatever the amount of space in both map-from part and the indexed text. Such text will, therefore, be indexed as a special "AT&T" keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able to search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.

Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the exceptions file.

SQL

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");

utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");

CONFIG

table products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

¶ 6.3.9

Advanced morphology

Morphology preprocessors can be applied to words during indexing to normalize different forms of the same word and improve segmentation. For example, an English stemmer can normalize "dogs" and "dog" to "dog", resulting in identical search results for both keywords.

Manticore has four built-in morphology preprocessors:

Lemmatizer: reduces a word to its root or lemma. For example, "running" can be reduced to "run" and "octopi" can be reduced to "octopus". Note that some words may have multiple corresponding root forms. For example, "dove" can be either the past tense of "dive" or a noun meaning a bird, as in "A white dove flew over the cuckoo's nest." In this case, a lemmatizer can generate all the possible root forms.
Stemmer: reduces a word to its stem by removing or replacing certain known suffixes. The resulting stem may not necessarily be a valid word. For example, the Porter English stemmer reduces "running" to "run", "business" to "busi" (not a valid word), and does not reduce "octopi" at all.
Phonetic algorithms: replace words with phonetic codes that are the same even if the words are different but phonetically close.
Word breaking algorithms: split text into words. Currently available only for Chinese.

morphology

morphology = morphology1[, morphology2, ...]

The morphology directive specifies a list of morphology preprocessors to apply to the words being indexed. This is an optional setting, with the default being no preprocessor applied.

Manticore comes with built-in morphological preprocessors for:

English, Russian, and German lemmatizers
English, Russian, Arabic, and Czech stemmers
SoundEx and MetaPhone phonetic algorithms
Chinese word breaking algorithm
Snowball (libstemmer) stemmers for more than 15 other languages are also available.

Lemmatizers require dictionary .pak files that can be downloaded from the Manticore website. The dictionaries need to be put in the directory specified by lemmatizer_base. Additionally, the lemmatizer_cache setting can be used to speed up lemmatizing by spending more RAM for an uncompressed dictionary cache.

The Chinese language segmentation can be performed using ICU. It provides more precise segmentation compared to n-grams but is slightly slower. The charset_table must include all Chinese characters, which can be done by using the "cjk" alias. When "morphology=icu_chinese" is specified, the documents are first pre-processed by ICU. Then, the result is processed by the tokenizer according to the charset_table, and finally, other morphology processors specified in the "morphology" option are applied. Only those parts of texts that contain Chinese are passed to ICU for segmentation, while others can be modified by different means such as different morphologies or charset_table.

Built-in English and Russian stemmers are faster than their libstemmer counterparts but may produce slightly different results

Soundex implementation matches that of MySQL. Metaphone implementation is based on Double Metaphone algorithm and indexes the primary code.

To use the morphology option, specify one or multiple of the built-in options, including:

none: do not perform any morphology processing
lemmatize_ru - apply Russian lemmatizer and pick a single root form
lemmatize_uk - apply Ukrainian lemmatizer and pick a single root form (install it first in Centos or Ubuntu/Debian). For correct work of the lemmatizer make sure specific Ukrainian characters are preserved in your charset_table since by default they are not. For that override them, like this: charset_table='non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. Here is an interactive course about how to install and use the urkainian lemmatizer.
lemmatize_en - apply English lemmatizer and pick a single root form
lemmatize_de - apply German lemmatizer and pick a single root form
lemmatize_ru_all - apply Russian lemmatizer and index all possible root forms
lemmatize_uk_all - apply Ukrainian lemmatizer and index all possible root forms. Find the installation links above and take care of the charset_table.
lemmatize_en_all - apply English lemmatizer and index all possible root forms
lemmatize_de_all - apply German lemmatizer and index all possible root forms
stem_en - apply Porter's English stemmer
stem_ru - apply Porter's Russian stemmer
stem_enru - apply Porter's English and Russian stemmers
stem_cz - apply Czech stemmer
stem_ar - apply Arabic stemmer
soundex - replace keywords with their SOUNDEX code
metaphone - replace keywords with their METAPHONE code
icu_chinese - apply Chinese text segmentation using ICU
libstemmer_* . Refer to the list of supported languages for details

Multiple stemmers can be specified, separated by commas. They will be applied to incoming words in the order they are listed, and the processing will stop once one of the stemmers modifies the word. Additionally, when wordforms feature is enabled, the word will be looked up in the word forms dictionary first. If there is a matching entry in the dictionary, stemmers will not be applied at all. wordforms сan be used to implement stemming exceptions.

SQL

CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'

JSON

POST /cli -d "CREATE TABLE products(title text, price float)  morphology = 'stem_en, libstemmer_sv'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology' => 'stem_en, libstemmer_sv'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");

CONFIG

table products {
  morphology = stem_en, libstemmer_sv

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

morphology_skip_fields

morphology_skip_fields = field1[, field2, ...]

A list of fields to skip morphology preprocessing. Optional, default is empty (apply preprocessors to all fields).

SQL

CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'

JSON

POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology_skip_fields' => 'name',
            'morphology' => 'stem_en'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");

CONFIG

table products {
  morphology_skip_fields = name
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_field = name
  rt_attr_uint = price
}

min_stemming_len

min_stemming_len = length

Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).

Stemmers are not perfect, and might sometimes produce undesired results. For instance, running "gps" keyword through Porter stemmer for English results in "gp", which is not really the intent. min_stemming_len feature lets you suppress stemming based on the source word length, ie. to avoid stemming too short words. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So in order to avoid stemming 3-character keywords, you should specify 4 for the value. For more finely grained control, refer to wordforms feature.

SQL

CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'min_stemming_len' => '4',
             'morphology' => 'stem_en'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");

utilsApi.Sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");

CONFIG

table products {
  min_stemming_len = 4
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_exact_words

index_exact_words = {0|1}

This option allows for the indexing of original keywords along with their morphologically modified versions. However, original keywords that are remapped by the wordforms and exceptions cannot be indexed. The default value is 0, indicating that this feature is disabled by default.

This allows the use of the exact form operator in the query language. Enabling this feature will increase the full-text index size and indexing time, but will not impact search performance.

SQL

CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'index_exact_words' => '1',
             'morphology' => 'stem_en'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");

utilsApi.Sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");

CONFIG

table products {
  index_exact_words = 1
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

¶ 6.3.10

Advanced HTML tokenization

Stripping HTML tags

html_strip

html_strip = {0|1}

This option determines whether HTML markup should be stripped from the incoming full-text data. The default value is 0, which disables stripping. To enable stripping, set the value to 1.

HTML tags and entities are considered as markup and will be processed.

HTML tags are removed, while the contents between them (e.g. everything between <p> and </p>) are left intact. You can choose to keep and index tag attributes (e.g. HREF attribute in an A tag or ALT in an IMG tag). Some well-known inline tags, such as A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, and TT, are completely removed. All other tags are treated as block level and are replaced with whitespace. For example, the text te<b>st</b> will be indexed as a single keyword 'test', while te<p>st</p> will be indexed as two keywords 'te' and 'st'.

HTML entities are decoded and replaced with their corresponding UTF-8 characters. The stripper supports both numeric forms (e.g. ï) and text forms (e.g. ó or  ) of entities, and supports all entities specified by the HTML4 standard.

The stripper is designed to work with properly formed HTML and XHTML, but may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Please note that only the tags themselves, as well as HTML comments, are stripped. To strip the contents of the tags, including embedded scripts, see the html_remove_elements option. There are no restrictions on tag names, meaning that everything that looks like a valid tag start, end, or comment will be stripped.

SQL

CREATE TABLE products(title text, price float) html_strip = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) html_strip = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_strip' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) html_strip = '1'");

CONFIG

table products {
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_index_attrs

html_index_attrs = img=alt,title; a=title;

The html_index_attrs option allows you to specify which HTML markup attributes should be indexed even though other HTML markup is stripped. The default value is empty, meaning no attributes will be indexed.
The format of the option is a per-tag enumeration of indexable attributes, as demonstrated in the example above. The contents of the specified attributes will be retained and indexed, providing a way to extract additional information from your full-text data.

SQL

CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_index_attrs' => 'img=alt,title; a=title;',
            'html_strip' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");

CONFIG

table products {
  html_index_attrs = img=alt,title; a=title;
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_remove_elements

html_remove_elements = element1[, element2, ...]

A list of HTML elements whose contents, along with the elements themselves, will be stripped. Optional, the default is an empty string (do not strip contents of any elements).

This option allows you to remove the contents of elements, meaning everything between the opening and closing tags. It is useful for removing embedded scripts, CSS, etc. The short tag form for empty elements (e.g.
) is properly supported, and the text following such a tag will not be removed.

The value is a comma-separated list of element (tag) names, the contents of which should be removed. Tag names are case-insensitive.

SQL

CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_remove_elements' => 'style, script',
            'html_strip' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");

CONFIG

table products {
  html_remove_elements = style, script
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Extracting important parts from HTML

index_sp

index_sp = {0|1}

Controls detection and indexing of sentence and paragraph boundaries. Optional, default is 0 (no detection or indexing).

This directive enables the detection and indexing of sentence and paragraph boundaries, making it possible for the SENTENCE and PARAGRAPH operators to work. Sentence boundary detection is based on plain text analysis, and only requires setting index_sp = 1 to enable it. Paragraph detection, however, relies on HTML markup and occurs during the [HTML stripping process](06-creating_a_table.html#toc-06.03.10-advanced_html_tokenization#html_strip. As such, to index paragraph boundaries, both the index_sp directive and the html_strip directive must be set to 1.

The following rules are used to determine sentence boundaries:

Question marks (?) and exclamation marks (!) always indicate a sentence boundary.
Trailing dots (.) indicate a sentence boundary, except in the following cases:
- When followed by a letter. This is considered part of an abbreviation (e.g. "S.T.A.L.K.E.R." or "Goldman Sachs S.p.A.").
- When followed by a comma. This is considered an abbreviation followed by a comma (e.g. "Telecom Italia S.p.A., founded in 1994").
- When followed by a space and a lowercase letter. This is considered an abbreviation within a sentence (e.g. "News Corp. announced in February").
- When preceded by a space and an uppercase letter, and followed by a space. This is considered a middle initial (e.g. "John D. Doe").

Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.

SQL

CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_sp' => '1',
            'html_strip' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'");

CONFIG

table products {
  index_sp = 1
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_zones

index_zones = h*, th, title

A list of HTML/XML zones within a field to be indexed. The default is an empty string (no zones will be indexed).

A "zone" is defined as everything between an opening and a matching closing tag, and all spans sharing the same tag name are referred to as a "zone." For example, everything between <H1> and </H1> in a document field belongs to the H1 zone.

The index_zones directive enables zone indexing, but the HTML stripper must also be enabled (by setting html_strip = 1). The value of index_zones should be a comma-separated list of tag names and wildcards (ending with a star) to be indexed as zones.

Zones can be nested and overlap, as long as every opening tag has a matching tag. Zones can also be used for matching with the ZONE operator, as described in the extended_query_syntax.

SQL

CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'

JSON

POST /cli -d "
CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'"

PHP

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_zones' => 'h*,th,title',
            'html_strip' => '1'
        ]);

Python

utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'')

javascript

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'');

Java

utilsApi.sql("CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'");

utilsApi.Sql("CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'");

CONFIG

table products {
  index_zones = h*, th, title
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

¶ 6.4

Creating a distributed table

Manticore allows for the creation of distributed tables, which act like regular plain or real-time tables, but are actually a collection of child tables used for searching. When a query is sent to a distributed table, it is distributed among all tables in the collection. The server then collects and processes the responses to sort and recalculate values of aggregates, if necessary.

From the client's perspective, it appears as if they are querying a single table.

Distributed tables can be composed of any combination of tables, including:

Local storage tables (plain table and Real-Time)
Remote tables
A combination of local and remote tables
Percolate tables (local, remote, or a combination)
Single local and multiple remote tables, or any other combination

Mixing percolate and template tables with plain and real-time tables is not recommended.

A distributed table is defined as type 'distributed' in the configuration file or through the SQL clause CREATE TABLE

In a configuration file

table foo {
    type = distributed
    local = bar
    local = bar1, bar2
    agent = 127.0.0.1:9312:baz
    agent = host1|host2:tbl
    agent = host1:9301:tbl1|host2:tbl2 [ha_strategy=random retry_count=10]
    ...
}

Via SQL

CREATE TABLE distributed_index type='distributed' local='local_index' agent='127.0.0.1:9312:remote_index'

Children

The essence of a distributed table lies in its list of child tables, to which it points. There are two types of child tables in a distributed table:

Local tables: These are tables that are served within the same server as the distributed table. To enumerate local tables, you use the syntax local =. You can list several local tables using multiple local = lines, or combine them into one list separated by commas.
Remote tables: These are tables that are served anywhere outside the server. To enumerate remote tables, you use the syntax agent =. Each line represents one endpoint or agent. Each agent can have multiple external locations and options for how it should work. More details here. It is important to note that the server does not have any information about the type of table it is working with. This may lead to errors if, for example, you issue a CALL PQ to a remote table 'foo' that is not a percolate table.

¶ 6.4.1

Creating a local distributed table

A distributed table in Manticore Search acts as a "master node" that proxies the demanded query to other tables and provides merged results from the responses it receives. The table doesn't hold any data on its own. It can connect to both local tables and tables located on other servers. Here's an example of a simple distributed table:

Configuration file

table index_dist {
  type  = distributed
  local = index1
  local = index2
  ...
 }

RT mode

CREATE TABLE local_dist type='distributed' local='index1' local='index2';

PHP

$params = [
    'body' => [
        'settings' => [
            'type' => 'distributed',
            'local' => [
                'index1',
                'index2'
            ]
        ]
    ],
    'index' => 'products'
];
$index = new \Manticoresearch\Index($client);
$index->create($params);

Python

utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'')

javascript

res = await utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'');

Java

utilsApi.sql("CREATE TABLE local_dist type='distributed' local='index1' local='index2'");

utilsApi.Sql("CREATE TABLE local_dist type='distributed' local='index1' local='index2'");

¶ 6.4.2

Remote tables

A remote table in Manticore Search is represented by the agent prefix in the definition of a distributed table. A distributed table can include a combination of local and remote tables. If there are no local tables provided, the distributed table will be purely remote and serve as a proxy only. For example, you might have a Manticore instance that listens on multiple ports and serves different protocols, and then redirects queries to backend servers that only accept connections via Manticore's internal binary protocol, using persistent connections to reduce the overhead of establishing connections.
Even though a purely remote distributed table doesn't serve local tables itself, it still consumes machine resources, as it still needs to perform final calculations, such as merging results and calculating final aggregated values.

agent

agent = address1 [ | address2 [...] ][:table-list]
agent = address1[:table-list [ | address2[:table-list [...] ] ] ]

agent directive declares the remote agents that are searched each time the enclosing distributed table is searched. These agents are essentially pointers to networked tables. The value specified includes the address and can also include multiple alternatives (agent mirrors) for either the address only or the address and table list.

The address specification must be one of the following:

address = hostname[:port] # eg. server2:9312
address = /absolute/unix/socket/path # eg. /var/run/manticore2.sock

The hostname is the remote host name, port is the remote TCP port number, table-list is a comma-separated list of table names, and square brackets [] indicate an optional clause.

If the table name is omitted, it is assumed to be the same table as the one where this line is defined. In other words, when defining agents for the 'mycoolindex' distributed table, you can simply point to the address, and it will be assumed that you are querying the mycoolindex table on the agent's endpoints.

If the port number is omitted, it is assumed to be 9312. If it is defined but invalid (e.g. 70000), the agent will be skipped.

You can point each agent to one or more remote tables residing on one or more networked servers with no restrictions. This enables several different usage modes:

Sharding over multiple agent servers and creating an arbitrary cluster topology
Sharding over multiple agent servers mirrored for high availability and load balancing purposes
Sharding within localhost to utilize multiple cores (however, it is simpler just to use multiple local tables)

All agents are searched in parallel. The index list is passed verbatim to the remote agent. The exact way that list is searched within the agent (i.e. sequentially or in parallel) depends solely on the agent's configuration (see the threads setting). The master has no remote control over this.

It is important to note that the LIMIT, option is ignored in agent queries. This is because each agent can contain different tables, so it is the responsibility of the client to apply the limit to the final result set. This is why the query to a physical table is different from the query to a distributed table when viewed in the query logs. The query cannot be a simple copy of the original query, as this would not produce the correct results.

For example, if a client makes a query SELECT ... LIMIT 10, 10, and there are two agents, with the second agent having only 10 documents, broadcasting the original LIMIT 10, 10 query would result in receiving 0 documents from the second agent. However, LIMIT 10,10 should return documents 10-20 from the resulting set. To resolve this, the query must be sent to the agents with a broader limit, such as the default max_matches value of 1000.

For instance, if there is a distributed table dist that refers to the remote table user, a client query SELECT * FROM dist LIMIT 10,10 would be converted to SELECT * FROM user LIMIT 0,1000 and sent to the remote table user. Once the distributed table receives the result, it will apply the LIMIT 10,10 and return the requested 10 documents.

SELECT * FROM dist LIMIT 10,10;

the query will be converted to:

SELECT * FROM user LIMIT 0,1000

Additionally, the value can specify options for each individual agent, such as:

ha_strategy - random, roundrobin, nodeads, noerrors (overrides the global ha_strategy setting for the particular agent)
conn - pconn, persistent (equivalent to setting agent_persistent at the table level)
blackhole 0,1 (identical to the agent_blackhole setting for the agent)
retry_count an integer value (corresponding to agent_retry_count , but the provided value will not be multiplied by the number of mirrors)

agent = address1:table-list[[ha_strategy=value, conn=value, blackhole=value]]

Example:

# config on box1
# sharding a table over 3 servers
agent = box2:9312:shard1
agent = box3:9312:shard2

# config on box2
# sharding a table over 3 servers
agent = box1:9312:shard2
agent = box3:9312:shard3

# config on box3
# sharding a table over 3 servers
agent = box1:9312:shard1
agent = box2:9312:shard3

# per agent options
agent = box1:9312:shard1[ha_strategy=nodeads]
agent = box2:9312:shard2[conn=pconn]
agent = box2:9312:shard2[conn=pconn,ha_strategy=nodeads]
agent = test:9312:any[blackhole=1]
agent = test:9312|box2:9312|box3:9312:any2[retry_count=2]
agent = test:9312|box2:9312:any2[retry_count=2,conn=pconn,ha_strategy=noerrors]

For optimal performance, it's recommended to place remote tables that reside on the same server within the same record. For instance, instead of:

agent = remote:9312:idx1
agent = remote:9312:idx2

you should prefer:

agent = remote:9312:idx1,idx2

agent_persistent

agent_persistent = remotebox:9312:index2

The agent_persistent option allows you to persistently connect to an agent, meaning the connection will not be dropped after a query is executed. The syntax for this directive is the same as the agent directive. However, instead of opening a new connection to the agent for each query and then closing it, the master will keep a connection open and reuse it for subsequent queries. The maximum number of persistent connections per agent host is defined by the persistent_connections_limit option in the searchd section.

It's important to note that the persistent_connections_limit must be set to a value greater than 0 in order to use persistent agent connections. If it's not defined, it defaults to 0, and the agent_persistent directive will act the same as the agentdirective.

Using persistent master-agent connections reduces TCP port pressure and saves time on connection handshakes, making it more efficient.

agent_blackhole

agent_blackhole = testbox:9312:testindex1,testindex2

The agent_blackhole directive allows you to forward queries to remote agents without waiting for or processing their responses. This is useful for debugging or testing production clusters, as you can set up a separate debugging/testing instance and forward requests to it from the production master (aggregator) instance, without interfering with production work. The master searchd will attempt to connect to the blackhole agent and send queries as normal, but will not wait for or process any responses, and all network errors on the blackhole agents will be ignored. The format of the value is identical to that of the regular agent directive.

agent_connect_timeout

agent_connect_timeout = 300

The agent_connect_timeout directive defines the timeout for connecting to remote agents. By default, the value is assumed to be in milliseconds, but can have another suffix). The default value is 1000 (1 second).

When connecting to remote agents, searchd will wait for this amount of time at most to complete the connection successfully. If the timeout is reached but the connection has not been established, and retries are enabled, a retry will be initiated.

agent_query_timeout

agent_query_timeout = 10000 # our query can be long, allow up to 10 sec

The agent_query_timeout sets the amount of time that searchd will wait for a remote agent to complete a query. The default value is 3000 milliseconds (3 seconds), but can be suffixed to indicate a different unit of time.

After establishing a connection, searchd will wait for a maximum of agent_query_timeout for remote queries to complete. Note that this timeout is separate from the agent_connection_timeout and the total possible delay caused by a remote agent will be the sum of both values. If the agent_query_timeout is reached, the query will not be retried, instead, a warning will be produced.

Note that behavior is also affected by reset_network_timeout_on_packet

agent_retry_count

The agent_retry_count is an integer that specifies how many times Manticore will attempt to connect and query remote agents in a distributed table before reporting a fatal query error. It works similarly to agent_retry_count defined in the "searchd" section of the configuration file but applies specifically to the table.

mirror_retry_count

mirror_retry_count serves the same purpose as agent_retry_count. If both values are provided, mirror_retry_count will take precedence, and a warning will be raised.

Instance-wide options

The following options manage the overall behavior of remote agents and are specified in the searchd section of the configuration file. They set default values for the entire Manticore instance.

agent_connect_timeout - default value for the agent_connect_timeout parameter.
agent_query_timeout - default value for the agent_query_timeout parameter. This can also be overridden on a per-query basis using the same setting name in a distributed (network) table.
agent_retry_count is an integer that specifies the number of times Manticore will attempt to connect and query remote agents in a distributed table before reporting a fatal query error. The default value is 0 (i.e. no retries). This value can also be specified on a per-query basis using the 'OPTION retry_count=XXX' clause. If a per-query option is provided, it will take precedence over the value specified in the config.

Note, that if you use agent mirrors in the definition of your distributed table, the server will select a different mirror before each connection attempt according to the specified ha_strategy specified. In this case the agent_retry_count will be aggregated for all mirrors in the set.

For example, if you have 10 mirrors and set agent_retry_count=5, he server will attempt up to 50 retries (assuming an average of 5 tries per every 10 mirrors). In case of the option ha_strategy = roundrobin, it will actually be exactly 5 tries per mirror.

At the same time, the value provided as the retry_count option in the agent definition serves as an absolute limit. In other words, the [retry_count=2] option in the agent definition means there will be a maximum of 2 tries, regardless of whether there is 1 or 10 mirrors in the line.

agent_retry_delay

The agent_retry_delay is an integer value that determines the amount of time, in milliseconds, that Manticore Search will wait before retrying to query a remote agent in case of a failure. This value can be specified either globally in the searchd configuration or on a per-query basis using the OPTION retry_delay=XXX clause. If both options are provided, the per-query option will take precedence over the global one. The default value is 500 milliseconds (0.5 seconds). This option is only relevant if agent_retry_count or the per-query OPTION retry_count are non-zero.

client_timeout

The client_timeout option sets the maximum waiting time between requests when using persistent connections. This value is expressed in seconds or with a time suffix. The default value is 5 minutes.

Example:

client_timeout = 1h

hostname_lookup

The hostname_lookup option defines the strategy for renewing hostnames. By default, the IP addresses of agent host names are cached at server start to avoid excessive access to DNS. However, in some cases, the IP can change dynamically (e.g. cloud hosting) and it may be desirable to not cache the IPs. Setting this option to request disables the caching and queries the DNS for each query. The IP addresses can also be manually renewed using the FLUSH HOSTNAMES command.

listen_tfo

The listen_tfo option allows for the use of the TCP_FASTOPEN flag for all listeners. By default, it is managed by the system, but it can be explicitly turned off by setting it to '0'.

For more information about the TCP Fast Open extension, please refer to Wikipedia. In short, it allows to eliminate one TCP round-trip when establishing a connection.

In practice, using TFO can optimize the client-agent network efficiency, similar to when agent_persistent is in use, but without holding active connections and without limitations on the maximum number of connections.

Most modern operating systems support TFO. Linux (as one of the most progressive) has supported it since 2011, with kernels starting from 3.7 (for the server side). Windows has supported it since some builds of Windows 10. Other systems, such as FreeBSD and MacOS, are also in the game.

For Linux systems, the server checks the variable /proc/sys/net/ipv4/tcp_fastopen and behaves accordingly. Bit 0 manages the client side, while bit 1 rules the listeners. By default, the system has this parameter set to 1, i.e., clients are enabled and listeners are disabled.

persistent_connections_limit

persistent_connections_limit = 29 # assume that each host of agents has max_connections = 30 (or 29).

The persistent_connections_limit option defines the maximum number of simultaneous persistent connections to remote persistent agents. This is an instance-wide setting and must be defined in the searchd configuration section. Each time a connection to an agent defined under agent_persistent is made, we attempt to reuse an existing connection (if one exists) or create a new connection and save it for future use. However, in some cases it may be necessary to limit the number of persistent connections. This directive defines the limit and affects the number of connections to each agent's host across all distributed tables.

It is recommended to set this value equal to or less than the max_connections option in the agent's configuration.

Distributed snippets creation

A special case of a distributed table is a single local and multiple remotes, which is used exclusively for distributed snippets creation, when snippets are sourced from files. In this case, the local table may act as a "template" table, providing settings for tokenization when building snippets.

snippets_file_prefix

snippets_file_prefix = /mnt/common/server1/

The snippets_file_prefix is an optional prefix that can be added to the local file names when generating snippets. The default value is the current working folder.

To learn more about distributed snippets creation, see CALL SNIPPETS.

Distributed percolate tables (DPQ tables)

You can create a distributed table from multiple percolate tables. The syntax for constructing this type of table is the same as for other distributed tables, and can include multiplelocal tables as well as agents.

For DPQ, the operations of listing stored queries and searching through them (using CALL PQ) are transparent and work as if all the tables were one single local table. However, data manipulation statements such as insert, replace, truncate are not available.

If you include a non-percolate table in the list of agents, the behavior will be undefined. If the incorrect agent has the same schema as the outer schema of the PQ table (id, query, tags, filters), it will not trigger an error when listing stored PQ rules, and may pollute the list of actual PQ rules stored in PQ tables with its own non-PQ strings. As a result, be cautious and aware of the confusion that this may cause. ACALL PQ to such an incorrect agent will trigger an error.

For more information on making queries to a distributed percolate table, see making queries to a distribute percolate table.