¶ 6

Creating a table

¶ 6.1

Data types

Full-text fields and attributes

Manticore's data types can be split into two categories: full-text fields and attributes.

Full-text fields

Full-text fields:

Full-text fields are represented by the data type text. All other data types are called "attributes".

Attributes

Attributes are non-full-text values associated with each document that can be used to perform non-full-text filtering, sorting and grouping during a search.

It is often desired to process full-text search results based not only on matching document ID and its rank, but also on a number of other per-document values. For example, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit a blog search to posts made by selected users, or group results by month. To do this efficiently, Manticore enables not only full-text fields, but also additional attributes to be added to each document. These attributes can be used to filter, sort, or group full-text matches, or to search only by attributes.

The attributes, unlike full-text fields, are not full-text indexed. They are stored in the table, but it is not possible to search them as full-text.

A good example for attributes would be a forum posts table. Assume that only the title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e., search only those rows that have some specific values of author_id or forum_id); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

SQL
CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp);
JSON
POST /cli -d "CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('forum');
$index->create([
    'title'=>['type'=>'text'],
    'content'=>['type'=>'text'],
    'author_id'=>['type'=>'int'],
    'forum_id'=>['type'=>'int'],
    'post_date'=>['type'=>'timestamp']
]);
Python
utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)')
Javascript
res = await utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)');
Java
utilsApi.sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
C#
utilsApi.Sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
config
table forum
{
    type = rt
    path = forum

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
    rt_field = content

    # this option needs to be specified for the field to be stored
    stored_fields = title, content

    rt_attr_uint = author_id
    rt_attr_uint = forum_id
    rt_attr_timestamp = post_date
}

This example shows running a full-text query filtered by author_id, forum_id and sorted by post_date.

SQL
select * from forum where author_id=123 and forum_id in (1,3,7) order by post_date desc
JSON
POST /search
{
  "index": "forum",
  "query":
  {
    "match_all": {},
    "bool":
    {
      "must":
      [
        { "equals": { "author_id": 123 } },
        { "in": { "forum_id": [1,3,7] } }
      ]
    }
  },
  "sort": [ { "post_date": "desc" } ]
}
PHP
$client->search([
        'index' => 'forum',
        'query' =>
        [
            'match_all' => [],
            'bool' => [
                'must' => [
                    'equals' => ['author_id' => 123],
                    'in' => [
                        'forum_id' => [
                            1,3,7
                        ]
                    ]
                ]
            ]
        ],
        'sort' => [
        ['post_date' => 'desc']
    ]
]);
Python
searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]})
javascript
res = await searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]});
java
HashMap<String,Object> filters = new HashMap<String,Object>(){{
    put("must", new HashMap<String,Object>(){{
        put("equals",new HashMap<String,Integer>(){{
            put("author_id",123);
        }});
        put("in",
            new HashMap<String,Object>(){{
                put("forum_id",new int[] {1,3,7});
        }});
    }});
}};
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("bool",filters);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("post_date","desc");}});
}});
SearchResponse searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var boolFilter = new BoolFilter();
boolFilter.Must = new List<Object> {
    new EqualsFilter("author_id", 123),
    new InFilter("forum_id", new List<Object> {1,3,7})
};
searchRequest.AttrFilter = boolFilter;
searchRequest.Sort = new List<Object> { new SortOrder("post_date", SortOrder.OrderEnum.Desc) };
var searchResponse = searchApi.Search(searchRequest);

Row-wise and columnar attribute storages

Manticore supports two types of attribute storages:

As can be understood from their names, they store data differently. The traditional row-wise storage:

With the columnar storage:

The columnar storage was designed to handle large data volume that does not fit into RAM, so the recommendations are:

How to switch between the storages

The traditional row-wise storage is the default, so if you want everything to be stored in a row-wise fashion, you don't need to do anything when you create a table.

To enable the columnar storage you need to:

create table tbl(title text, type int, price float engine='rowwise') engine='columnar'
create table tbl(title text, type int, price float engine='columnar');

or

create table tbl(title text, type int, price float engine='columnar') engine='rowwise';

Below is the list of data types supported by Manticore Search:

Document ID

The document identifier is a mandatory attribute, and document IDs must be unique 64-bit unsigned integers. Document IDs can be explicitly specified, but if not, they are still enabled. Document IDs cannot be updated. Note that when retrieving document IDs, they are treated as signed 64-bit integers, which means they may be negative. Use the UINT64() function to cast them to unsigned 64-bit integers if necessary.

Explicit ID
When you create a table, you can specify ID explicitly, but no matter what data type you use, it will be always as said previously - a signed 64-bit integer.
CREATE TABLE tbl(id bigint, content text);
DESC tbl;
+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)
Implicit ID
You can also omit specifying ID at all, it will be enabled automatically.
CREATE TABLE tbl(content text);
DESC tbl;
+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)

Character data types

General syntax:

string|text [stored|attribute] [indexed]

Properties:

  1. indexed - full-text indexed (can be used in full-text queries)
  2. stored - stored in a docstore (stored on disk, not in RAM, lazy read)
  3. attribute - makes it a string attribute (can sort/group by it)

Specifying at least one property overrides all the default ones (see below), i.e., if you decide to use a custom combination of properties, you need to list all the properties you want.

No properties specified:

string and text are aliases, but if you don’t specify any properties, they by default mean different things:

Text

The text (just text or text/string indexed) data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.

Text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations, etc. Eventually, a full-text table (a special data structure that enables quick searches for a keyword) gets built from that text.

Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation. Words are stored in an inverted index along with references to the fields they belong to and positions in the field. This allows searching a word inside each field and using advanced operators like proximity. By default, the original text of the fields is both indexed and stored in document storage. It means that the original text can be returned with the query results and used in search result highlighting.

SQL
CREATE TABLE products(title text);
JSON
POST /cli -d "CREATE TABLE products(title text)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text']
]);
Python
utilsApi.sql('CREATE TABLE products(title text)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text)');
java
utilsApi.sql("CREATE TABLE products(title text)");
C#
utilsApi.Sql("CREATE TABLE products(title text)");
config
table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title

    # this option needs to be specified for the field to be stored
    stored_fields = title
}

This behavior can be overridden by explicitly specifying that the text is only indexed.

SQL
CREATE TABLE products(title text indexed);
JSON
POST /cli -d "CREATE TABLE products(title text indexed)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text','options'=>['indexed']]
]);
Python
utilsApi.sql('CREATE TABLE products(title text indexed)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text indexed)');
java
utilsApi.sql("CREATE TABLE products(title text indexed)");
C#
utilsApi.Sql("CREATE TABLE products(title text indexed)");
config
table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
}

Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). You can have up to 256 full-text fields.

SQL
select * from products where match('@title first');
JSON
POST /search
{
    "index": "products",
    "query":
    {
        "match": { "title": "first" }
    }
}
PHP
$index->setName('products')->search('@title')->get();
Python
searchApi.search({"index":"products","query":{"match":{"title":"first"}}})
javascript
res = await searchApi.search({"index":"products","query":{"match":{"title":"first"}}});
java
utilsApi.sql("CREATE TABLE products(title text indexed)");
C#
utilsApi.Sql("CREATE TABLE products(title text indexed)");

String

Unlike full-text fields, string attributes (just string or string/text attribute) are stored as they are received and cannot be used in full-text searches. Instead, they are returned in results, can be used in the WHERE clause for comparison filtering or REGEX, and can be used for sorting and aggregation. In general, it's not recommended to store large texts in string attributes, but use string attributes for metadata like names, titles, tags, keys.

If you want to also index the string attribute, you can specify both as string attribute indexed. It will allow full-text searching and works as an attribute.

SQL
CREATE TABLE products(title text, keys string);
JSON
POST /cli -d "CREATE TABLE products(title text, keys string)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'keys'=>['type'=>'string']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, keys string)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, keys string)');
java
utilsApi.sql("CREATE TABLE products(title text, keys string)");
C#
utilsApi.Sql("CREATE TABLE products(title text, keys string)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_string = keys
}
More You can create a full-text field that is also stored as a string attribute. This approach creates a full-text field and a string attribute that have the same name. Note that you can't add a `stored` property to store the data as a string attribute and in the document storage at the same time.
SQL
`string attribute indexed` means that we're working with a string data type that is stored as an attribute and indexed as a full-text field.
CREATE TABLE products ( title string attribute indexed );
JSON
POST /cli -d "CREATE TABLE products ( title string attribute indexed )"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'string','options'=>['indexed','attribute']]
]);
Python
utilsApi.sql('CREATE TABLE products ( title string attribute indexed )')
javascript
res = await utilsApi.sql('CREATE TABLE products ( title string attribute indexed )');
java
utilsApi.sql("CREATE TABLE products ( title string attribute indexed )");
C#
utilsApi.Sql("CREATE TABLE products ( title string attribute indexed )");
config
table products
{
    type = rt
    path = products

    rt_field = title
    rt_attr_string = title
}

Integer

Integer type allows storing 32 bit unsigned integer values.

SQL
CREATE TABLE products(title text, price int);
JSON
POST /cli -d "CREATE TABLE products(title text, price int)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'int']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, price int)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price int)');
java
utilsApi.sql("CREATE TABLE products(title text, price int)");
C#
utilsApi.Sql("CREATE TABLE products(title text, price int)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = type
}

Integers can be stored in shorter sizes than 32-bit by specifying a bit count. For example, if we want to store a numeric value which we know is not going to be bigger than 8, the type can be defined as bit(3). Bitcount integers perform slower than the full-size ones, but they require less RAM. They are saved in 32-bit chunks, so in order to save space, they should be grouped at the end of attribute definitions (otherwise a bitcount integer between 2 full-size integers will occupy 32 bits as well).

SQL
CREATE TABLE products(title text, flags bit(3), tags bit(2) );
JSON
POST /cli -d "CREATE TABLE products(title text, flags bit(3), tags bit(2))"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'flags'=>['type'=>'bit(3)'],
    'tags'=>['type'=>'bit(2)']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ');
java
utilsApi.sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");
C#
utilsApi.Sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = flags:3
    rt_attr_uint = tags:2
}

Big Integer

Big integers (bigint) are 64-bit wide signed integers.

SQL
CREATE TABLE products(title text, price bigint );
JSON
POST /cli -d "CREATE TABLE products(title text, price bigint)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'bigint']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, price bigint )')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price bigint )');
java
utilsApi.sql("CREATE TABLE products(title text, price bigint )");
C#
utilsApi.Sql("CREATE TABLE products(title text, price bigint )");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bigint = type
}

Boolean

Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.

SQL
CREATE TABLE products(title text, sold bool );
JSON
POST /cli -d "CREATE TABLE products(title text, sold bool)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'sold'=>['type'=>'bool']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, sold bool )')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, sold bool )');
java
utilsApi.sql("CREATE TABLE products(title text, sold bool )");
C#
utilsApi.Sql("CREATE TABLE products(title text, sold bool )");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bool = sold
}

Timestamps

Timestamp type represents unix timestamps which is stored as a 32-bit integer. The difference is that time and date functions are available for the timestamp type.

SQL
CREATE TABLE products(title text, date timestamp);
JSON
POST /cli -d "CREATE TABLE products(title text, date timestamp)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'date'=>['type'=>'timestamp']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, date timestamp)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, date timestamp)');
java
utilsApi.sql("CREATE TABLE products(title text, date timestamp)");
C#
utilsApi.Sql("CREATE TABLE products(title text, date timestamp)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_timestamp = date
}

Float

Real numbers are stored as 32-bit IEEE 754 single precision floats.

SQL
CREATE TABLE products(title text, coeff float);
JSON
POST /cli -d "CREATE TABLE products(title text, coeff float)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'coeff'=>['type'=>'float']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, coeff float)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, coeff float)');
java
utilsApi.sql("CREATE TABLE products(title text, coeff float)");
C#
utilsApi.Sql("CREATE TABLE products(title text, coeff float)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float = coeff
}

Unlike integer types, comparing two floating-point numbers for equality is not recommended due to potential rounding errors. A more reliable approach is to use a near-equal comparison, by checking the absolute error margin.

SQL
select abs(a-b)<=0.00001 from products
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "eps": "abs(a-b)" }
}
PHP
$index->setName('products')->search('')->expression('eps','abs(a-b)')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"eps":"abs(a-b)"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("ebs","abs(a-b)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object>{
    new Dictionary<string, string> { {"ebs", "abs(a-b)"} }
};
var searchResponse = searchApi.Search(searchRequest);

Another alternative, which can also be used to perform IN(attr,val1,val2,val3) is to compare floats as integers by choosing a multiplier factor and convert the floats to integers in operations. The following example illustrates modifying IN(attr,2.0,2.5,3.5) to work with integer values.

SQL
select in(ceil(attr*100),200,250,350) from products
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "inc": "in(ceil(attr*100),200,250,350)" }
}
PHP
$index->setName('products')->search('')->expression('inc','in(ceil(attr*100),200,250,350)')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("inc","in(ceil(attr*100),200,250,350)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"ebs", "in(ceil(attr*100),200,250,350)"} }
};
var searchResponse = searchApi.Search(searchRequest);

JSON

This data type allows storing JSON objects, which is useful for storing schema-less data. However, it is not supported by columnar storage. However, it can be stored in traditional storage, as it's possible to combine both storage types in the same table.

SQL
CREATE TABLE products(title text, data json);
JSON
POST /cli -d "CREATE TABLE products(title text, data json)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'data'=>['type'=>'json']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, data json)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, data json)');
java
utilsApi.sql'CREATE TABLE products(title text, data json)');
C#
utilsApi.Sql'CREATE TABLE products(title text, data json)');
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_json = data
}

JSON properties can be used in most operations. There are also special functions such as ALL(), ANY(), GREATEST(), LEAST() and INDEXOF() that allow traversal of property arrays.

SQL
select indexof(x>2 for x in data.intarray) from products
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "idx": "indexof(x>2 for x in data.intarray)" }
}
PHP
$index->setName('products')->search('')->expression('idx','indexof(x>2 for x in data.intarray)')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

Text properties are treated the same as strings, so it's not possible to use them in full-text match expressions. However, string functions such as REGEX() can be used.

SQL
select regex(data.name, 'est') as c from products where c>0
JSON
POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "range": { "c": { "gt": 0 } } }
  },
  "expressions": { "c": "regex(data.name, 'est')" }
}
PHP
$index->setName('products')->search('')->expression('idx',"regex(data.name, 'est')")->filter('c','gt',0)->get();
Python
searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("range", new HashMap<String,Object>(){{
    put("c", new HashMap<String,Object>(){{
        put("gt",0);
    }});
}});
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var rangeFilter = new RangeFilter("c");
rangeFilter.Gt = 0;
searchRequest.AttrFilter = rangeFilter;
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

In the case of JSON properties, enforcing data type may be required for proper functionality in certain situations. For example, when working with float values, DOUBLE() must be used for proper sorting.

SQL
select * from products order by double(data.myfloat) desc
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "sort": [ { "double(data.myfloat)": { "order": "desc"} } ]
}
PHP
$index->setName('products')->search('')->sort('double(data.myfloat)','desc')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("double(data.myfloat)",new HashMap<String,String>(){{ put("order","desc");}});}});
}});
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortOrder("double(data.myfloat)", SortOrder.OrderEnum.Desc)
};
var searchResponse = searchApi.Search(searchRequest);

Float vector

Float vector attributes allow storing variable-length lists of floats. It's important to note that this concept differs from multi-valued attributes. Multi-valued attributes (MVAs) are essentially sets; they do not preserve value order, and duplicate values are not retained. In contrast, float vectors perform no additional processing on values during insertion.

Float vector attributes can be used in k-nearest neighbor searches; see KNN search.

** Currently, float_vector fields can only be utilized in KNN search within real-time tables and the data type is not supported in any other functions or expressions, nor is it supported in plain tables. **

SQL
CREATE TABLE products(title text, image_vector float_vector);
JSON
POST /cli -d "CREATE TABLE products(title text, image_vector float_vector)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'image_vector'=>['type'=>'float_vector']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)');
java
utilsApi.sql("CREATE TABLE products(title text, image_vector float_vector)");
C#
utilsApi.Sql("CREATE TABLE products(title text, image_vector float_vector)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float_vector = image_vector
}

Multi-value integer (MVA)

Multi-value attributes allow storing variable-length lists of 32-bit unsigned integers. This can be useful for storing one-to-many numeric values, such as tags, product categories, and properties.

SQL
CREATE TABLE products(title text, product_codes multi);
JSON
POST /cli -d "CREATE TABLE products(title text, product_codes multi)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'product_codes'=>['type'=>'multi']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, product_codes multi)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, product_codes multi)');
java
utilsApi.sql("CREATE TABLE products(title text, product_codes multi)");
C#
utilsApi.Sql("CREATE TABLE products(title text, product_codes multi)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi = product_codes
}

It supports filtering and aggregation, but not sorting. Filtering can be done using a condition that requires at least one element to pass (using ANY()) or all elements (ALL()) to pass.

SQL
select * from products where any(product_codes)=3
JSON
POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "equals" : { "any(product_codes)": 3 }
  }
}
PHP
$index->setName('products')->search('')->filter('any(product_codes)','equals',3)->get();
Python
searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})'
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("equals",new HashMap<String,Integer>(){{
     put("any(product_codes)",3);
}});
searchRequest.setQuery(query);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.AttrFilter = new EqualsFilter("any(product_codes)", 3);
var searchResponse = searchApi.Search(searchRequest);

Information like least or greatest element and length of the list can be extracted. An example shows ordering by the least element of a multi-value attribute.

SQL
select least(product_codes) l from products order by l asc
JSON
POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "sort": [ { "product_codes":{ "order":"asc", "mode":"min" } } ]
  }
}
PHP
$index->setName('products')->search('')->sort('product_codes','asc','min')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("product_codes",new HashMap<String,String>(){{ put("order","asc");put("mode","min");}});}});
}});
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortMVA("product_codes", SortOrder.OrderEnum.Asc, SortMVA.ModeEnum.Min)
};
searchResponse = searchApi.search(searchRequest);

When grouping by a multi-value attribute, a document will contribute to as many groups as there are different values associated with that document. For instance, if a collection contains exactly one document having a 'product_codes' multi-value attribute with values 5, 7, and 11, grouping on 'product_codes' will produce 3 groups with COUNT(*)equal to 1 and GROUPBY() key values of 5, 7, and 11, respectively. Also, note that grouping by multi-value attributes may lead to duplicate documents in the result set because each document can participate in many groups.

SQL
insert into products values ( 1, 'doc one', (5,7,11) );
select id, count(*), groupby() from products group by product_codes;
Query OK, 1 row affected (0.00 sec)

+------+----------+-----------+
| id   | count(*) | groupby() |
+------+----------+-----------+
|    1 |        1 |        11 |
|    1 |        1 |         7 |
|    1 |        1 |         5 |
+------+----------+-----------+
3 rows in set (0.00 sec)

The order of the numbers inserted as values of multivalued attributes is not preserved. Values are stored internally as a sorted set.

SQL
insert into product values (1,'first',(4,2,1,3));
select * from products;
Query OK, 1 row affected (0.00 sec)

+------+---------------+-------+
| id   | product_codes | title |
+------+---------------+-------+
|    1 | 1,2,3,4       | first |
+------+---------------+-------+
1 row in set (0.01 sec)
JSON
POST /insert
{
    "index":"products",
    "id":1,
    "doc":
    {
        "title":"first",
        "product_codes":[4,2,1,3]
    }
}

POST /search
{
  "index": "products",
  "query": { "match_all": {} }
}
{
   "_index":"products",
   "_id":1,
   "created":true,
   "result":"created",
   "status":201
}

{
   "took":0,
   "timed_out":false,
   "hits":{
      "total":1,
      "hits":[
         {
            "_id":"1",
            "_score":1,
            "_source":{
               "product_codes":[
                  1,
                  2,
                  3,
                  4
               ],
               "title":"first"
            }
         }
      ]
   }
}
PHP
$index->addDocument([
    "title"=>"first",
    "product_codes"=>[4,2,1,3]
]);
$index->search('')-get();
Array
(
    [_index] => products
    [_id] => 1
    [created] => 1
    [result] => created
    [status] => 201
)
Array
(
    [took] => 0
    [timed_out] =>
    [hits] => Array
        (
            [total] => 1
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_id] => 1
                            [_score] => 1
                            [_source] => Array
                                (
                                    [product_codes] => Array
                                        (
                                            [0] => 1
                                            [1] => 2
                                            [2] => 3
                                            [3] => 4
                                        )

                                    [title] => first
                                )
                        )
                )
        )
)
Python
indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}})
searchApi.search({"index":"products","query":{"match_all":{}}})
{'created': True,
 'found': None,
 'id': 1,
 'index': 'products',
 'result': 'created'}
{'hits': {'hits': [{u'_id': u'1',
                    u'_score': 1,
                    u'_source': {u'product_codes': [1, 2, 3, 4],
                                 u'title': u'first'}}],
          'total': 1},
 'profile': None,
 'timed_out': False,
 'took': 29}
javascript
await indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}});
res = await searchApi.search({"index":"products","query":{"match_all":{}}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1,"_source":{"product_codes":[1,2,3,4],"title":"first"}}]}}
java
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
    put("title","first");
    put("product_codes",new int[] {4,2,1,3});
}};
newdoc.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setQuery(query);
SearchResponse searchResponse = searchApi.search(searchRequest);
System.out.println(searchResponse.toString() );
class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}
C#
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "first");
doc.Add("product_codes", new List<Object> {4,2,1,3});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
object query =  new { match_all=null };
var searchRequest = new SearchRequest("products", query);
var searchResponse = searchApi.Search(searchRequest);
Console.WriteLine(searchResponse.ToString())
class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}

Multi-value big integer

A data type that allows storing variable-length lists of 64-bit signed integers. It has the same functionality as multi-value integer.

SQL
CREATE TABLE products(title text, values multi64);
JSON
POST /cli -d "CREATE TABLE products(title text, values multi64)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'values'=>['type'=>'multi64']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, values multi64))')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, values multi64))');
java
utilsApi.sql("CREATE TABLE products(title text, values multi64))");
C#
utilsApi.Sql("CREATE TABLE products(title text, values multi64))");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi_64 = values
}

Columnar attribute properties

When you use the columnar storage you can specify the following properties for the attributes.

fast_fetch

By default, Manticore Columnar storage stores all attributes in a columnar fashion, as well as in a special docstore row by row. This enables fast execution of queries like SELECT * FROM ..., especially when fetching a large number of records at once. However, if you are sure that you do not need it or wish to save disk space, you can disable it by specifying fast_fetch='0' when creating a table or (if you are defining a table in a config) by using columnar_no_fast_fetch as shown in the following example.

RT mode
create table t(a int, b int fast_fetch='0') engine='columnar'; desc t;
+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+
3 rows in set (0.00 sec)
Plain mode
source min {
    type = mysql
    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    sql_query = select 1, 1 a, 1 b
    sql_attr_uint = a
    sql_attr_uint = b
}

table tbl {
    path = tbl/col
    source = min
    columnar_attrs = *
    columnar_no_fast_fetch = b
}
+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+
¶ 6.1.1

Data types

Full-text fields and attributes

Manticore's data types can be split into two categories: full-text fields and attributes.

Full-text fields

Full-text fields:

Full-text fields are represented by the data type text. All other data types are called "attributes".

Attributes

Attributes are non-full-text values associated with each document that can be used to perform non-full-text filtering, sorting and grouping during a search.

It is often desired to process full-text search results based not only on matching document ID and its rank, but also on a number of other per-document values. For example, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit a blog search to posts made by selected users, or group results by month. To do this efficiently, Manticore enables not only full-text fields, but also additional attributes to be added to each document. These attributes can be used to filter, sort, or group full-text matches, or to search only by attributes.

The attributes, unlike full-text fields, are not full-text indexed. They are stored in the table, but it is not possible to search them as full-text.

A good example for attributes would be a forum posts table. Assume that only the title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e., search only those rows that have some specific values of author_id or forum_id); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

SQL
CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp);
JSON
POST /cli -d "CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('forum');
$index->create([
    'title'=>['type'=>'text'],
    'content'=>['type'=>'text'],
    'author_id'=>['type'=>'int'],
    'forum_id'=>['type'=>'int'],
    'post_date'=>['type'=>'timestamp']
]);
Python
utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)')
Javascript
res = await utilsApi.sql('CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)');
Java
utilsApi.sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
C#
utilsApi.Sql("CREATE TABLE forum(title text, content text, author_id int, forum_id int, post_date timestamp)");
config
table forum
{
    type = rt
    path = forum

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
    rt_field = content

    # this option needs to be specified for the field to be stored
    stored_fields = title, content

    rt_attr_uint = author_id
    rt_attr_uint = forum_id
    rt_attr_timestamp = post_date
}

This example shows running a full-text query filtered by author_id, forum_id and sorted by post_date.

SQL
select * from forum where author_id=123 and forum_id in (1,3,7) order by post_date desc
JSON
POST /search
{
  "index": "forum",
  "query":
  {
    "match_all": {},
    "bool":
    {
      "must":
      [
        { "equals": { "author_id": 123 } },
        { "in": { "forum_id": [1,3,7] } }
      ]
    }
  },
  "sort": [ { "post_date": "desc" } ]
}
PHP
$client->search([
        'index' => 'forum',
        'query' =>
        [
            'match_all' => [],
            'bool' => [
                'must' => [
                    'equals' => ['author_id' => 123],
                    'in' => [
                        'forum_id' => [
                            1,3,7
                        ]
                    ]
                ]
            ]
        ],
        'sort' => [
        ['post_date' => 'desc']
    ]
]);
Python
searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]})
javascript
res = await searchApi.search({"index":"forum","query":{"match_all":{},"bool":{"must":[{"equals":{"author_id":123}},{"in":{"forum_id":[1,3,7]}}]}},"sort":[{"post_date":"desc"}]});
java
HashMap<String,Object> filters = new HashMap<String,Object>(){{
    put("must", new HashMap<String,Object>(){{
        put("equals",new HashMap<String,Integer>(){{
            put("author_id",123);
        }});
        put("in",
            new HashMap<String,Object>(){{
                put("forum_id",new int[] {1,3,7});
        }});
    }});
}};
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("bool",filters);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("post_date","desc");}});
}});
SearchResponse searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var boolFilter = new BoolFilter();
boolFilter.Must = new List<Object> {
    new EqualsFilter("author_id", 123),
    new InFilter("forum_id", new List<Object> {1,3,7})
};
searchRequest.AttrFilter = boolFilter;
searchRequest.Sort = new List<Object> { new SortOrder("post_date", SortOrder.OrderEnum.Desc) };
var searchResponse = searchApi.Search(searchRequest);

Row-wise and columnar attribute storages

Manticore supports two types of attribute storages:

As can be understood from their names, they store data differently. The traditional row-wise storage:

With the columnar storage:

The columnar storage was designed to handle large data volume that does not fit into RAM, so the recommendations are:

How to switch between the storages

The traditional row-wise storage is the default, so if you want everything to be stored in a row-wise fashion, you don't need to do anything when you create a table.

To enable the columnar storage you need to:

create table tbl(title text, type int, price float engine='rowwise') engine='columnar'
create table tbl(title text, type int, price float engine='columnar');

or

create table tbl(title text, type int, price float engine='columnar') engine='rowwise';

Below is the list of data types supported by Manticore Search:

Document ID

The document identifier is a mandatory attribute, and document IDs must be unique 64-bit unsigned integers. Document IDs can be explicitly specified, but if not, they are still enabled. Document IDs cannot be updated. Note that when retrieving document IDs, they are treated as signed 64-bit integers, which means they may be negative. Use the UINT64() function to cast them to unsigned 64-bit integers if necessary.

Explicit ID
When you create a table, you can specify ID explicitly, but no matter what data type you use, it will be always as said previously - a signed 64-bit integer.
CREATE TABLE tbl(id bigint, content text);
DESC tbl;
+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)
Implicit ID
You can also omit specifying ID at all, it will be enabled automatically.
CREATE TABLE tbl(content text);
DESC tbl;
+---------+--------+----------------+
| Field   | Type   | Properties     |
+---------+--------+----------------+
| id      | bigint |                |
| content | text   | indexed stored |
+---------+--------+----------------+
2 rows in set (0.00 sec)

Character data types

General syntax:

string|text [stored|attribute] [indexed]

Properties:

  1. indexed - full-text indexed (can be used in full-text queries)
  2. stored - stored in a docstore (stored on disk, not in RAM, lazy read)
  3. attribute - makes it a string attribute (can sort/group by it)

Specifying at least one property overrides all the default ones (see below), i.e., if you decide to use a custom combination of properties, you need to list all the properties you want.

No properties specified:

string and text are aliases, but if you don’t specify any properties, they by default mean different things:

Text

The text (just text or text/string indexed) data type forms the full-text part of the table. Text fields are indexed and can be searched for keywords.

Text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations, etc. Eventually, a full-text table (a special data structure that enables quick searches for a keyword) gets built from that text.

Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation. Words are stored in an inverted index along with references to the fields they belong to and positions in the field. This allows searching a word inside each field and using advanced operators like proximity. By default, the original text of the fields is both indexed and stored in document storage. It means that the original text can be returned with the query results and used in search result highlighting.

SQL
CREATE TABLE products(title text);
JSON
POST /cli -d "CREATE TABLE products(title text)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text']
]);
Python
utilsApi.sql('CREATE TABLE products(title text)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text)');
java
utilsApi.sql("CREATE TABLE products(title text)");
C#
utilsApi.Sql("CREATE TABLE products(title text)");
config
table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title

    # this option needs to be specified for the field to be stored
    stored_fields = title
}

This behavior can be overridden by explicitly specifying that the text is only indexed.

SQL
CREATE TABLE products(title text indexed);
JSON
POST /cli -d "CREATE TABLE products(title text indexed)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text','options'=>['indexed']]
]);
Python
utilsApi.sql('CREATE TABLE products(title text indexed)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text indexed)');
java
utilsApi.sql("CREATE TABLE products(title text indexed)");
C#
utilsApi.Sql("CREATE TABLE products(title text indexed)");
config
table products
{
    type = rt
    path = products

    # when configuring fields via config, they are indexed (and not stored) by default
    rt_field = title
}

Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). You can have up to 256 full-text fields.

SQL
select * from products where match('@title first');
JSON
POST /search
{
    "index": "products",
    "query":
    {
        "match": { "title": "first" }
    }
}
PHP
$index->setName('products')->search('@title')->get();
Python
searchApi.search({"index":"products","query":{"match":{"title":"first"}}})
javascript
res = await searchApi.search({"index":"products","query":{"match":{"title":"first"}}});
java
utilsApi.sql("CREATE TABLE products(title text indexed)");
C#
utilsApi.Sql("CREATE TABLE products(title text indexed)");

String

Unlike full-text fields, string attributes (just string or string/text attribute) are stored as they are received and cannot be used in full-text searches. Instead, they are returned in results, can be used in the WHERE clause for comparison filtering or REGEX, and can be used for sorting and aggregation. In general, it's not recommended to store large texts in string attributes, but use string attributes for metadata like names, titles, tags, keys.

If you want to also index the string attribute, you can specify both as string attribute indexed. It will allow full-text searching and works as an attribute.

SQL
CREATE TABLE products(title text, keys string);
JSON
POST /cli -d "CREATE TABLE products(title text, keys string)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'keys'=>['type'=>'string']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, keys string)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, keys string)');
java
utilsApi.sql("CREATE TABLE products(title text, keys string)");
C#
utilsApi.Sql("CREATE TABLE products(title text, keys string)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_string = keys
}
More You can create a full-text field that is also stored as a string attribute. This approach creates a full-text field and a string attribute that have the same name. Note that you can't add a `stored` property to store the data as a string attribute and in the document storage at the same time.
SQL
`string attribute indexed` means that we're working with a string data type that is stored as an attribute and indexed as a full-text field.
CREATE TABLE products ( title string attribute indexed );
JSON
POST /cli -d "CREATE TABLE products ( title string attribute indexed )"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'string','options'=>['indexed','attribute']]
]);
Python
utilsApi.sql('CREATE TABLE products ( title string attribute indexed )')
javascript
res = await utilsApi.sql('CREATE TABLE products ( title string attribute indexed )');
java
utilsApi.sql("CREATE TABLE products ( title string attribute indexed )");
C#
utilsApi.Sql("CREATE TABLE products ( title string attribute indexed )");
config
table products
{
    type = rt
    path = products

    rt_field = title
    rt_attr_string = title
}

Integer

Integer type allows storing 32 bit unsigned integer values.

SQL
CREATE TABLE products(title text, price int);
JSON
POST /cli -d "CREATE TABLE products(title text, price int)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'int']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, price int)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price int)');
java
utilsApi.sql("CREATE TABLE products(title text, price int)");
C#
utilsApi.Sql("CREATE TABLE products(title text, price int)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = type
}

Integers can be stored in shorter sizes than 32-bit by specifying a bit count. For example, if we want to store a numeric value which we know is not going to be bigger than 8, the type can be defined as bit(3). Bitcount integers perform slower than the full-size ones, but they require less RAM. They are saved in 32-bit chunks, so in order to save space, they should be grouped at the end of attribute definitions (otherwise a bitcount integer between 2 full-size integers will occupy 32 bits as well).

SQL
CREATE TABLE products(title text, flags bit(3), tags bit(2) );
JSON
POST /cli -d "CREATE TABLE products(title text, flags bit(3), tags bit(2))"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'flags'=>['type'=>'bit(3)'],
    'tags'=>['type'=>'bit(2)']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, flags bit(3), tags bit(2) ');
java
utilsApi.sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");
C#
utilsApi.Sql("CREATE TABLE products(title text, flags bit(3), tags bit(2)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_uint = flags:3
    rt_attr_uint = tags:2
}

Big Integer

Big integers (bigint) are 64-bit wide signed integers.

SQL
CREATE TABLE products(title text, price bigint );
JSON
POST /cli -d "CREATE TABLE products(title text, price bigint)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'bigint']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, price bigint )')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price bigint )');
java
utilsApi.sql("CREATE TABLE products(title text, price bigint )");
C#
utilsApi.Sql("CREATE TABLE products(title text, price bigint )");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bigint = type
}

Boolean

Declares a boolean attribute. It's equivalent to an integer attribute with bit count of 1.

SQL
CREATE TABLE products(title text, sold bool );
JSON
POST /cli -d "CREATE TABLE products(title text, sold bool)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'sold'=>['type'=>'bool']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, sold bool )')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, sold bool )');
java
utilsApi.sql("CREATE TABLE products(title text, sold bool )");
C#
utilsApi.Sql("CREATE TABLE products(title text, sold bool )");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_bool = sold
}

Timestamps

Timestamp type represents unix timestamps which is stored as a 32-bit integer. The difference is that time and date functions are available for the timestamp type.

SQL
CREATE TABLE products(title text, date timestamp);
JSON
POST /cli -d "CREATE TABLE products(title text, date timestamp)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'date'=>['type'=>'timestamp']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, date timestamp)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, date timestamp)');
java
utilsApi.sql("CREATE TABLE products(title text, date timestamp)");
C#
utilsApi.Sql("CREATE TABLE products(title text, date timestamp)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_timestamp = date
}

Float

Real numbers are stored as 32-bit IEEE 754 single precision floats.

SQL
CREATE TABLE products(title text, coeff float);
JSON
POST /cli -d "CREATE TABLE products(title text, coeff float)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'coeff'=>['type'=>'float']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, coeff float)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, coeff float)');
java
utilsApi.sql("CREATE TABLE products(title text, coeff float)");
C#
utilsApi.Sql("CREATE TABLE products(title text, coeff float)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float = coeff
}

Unlike integer types, comparing two floating-point numbers for equality is not recommended due to potential rounding errors. A more reliable approach is to use a near-equal comparison, by checking the absolute error margin.

SQL
select abs(a-b)<=0.00001 from products
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "eps": "abs(a-b)" }
}
PHP
$index->setName('products')->search('')->expression('eps','abs(a-b)')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}},"expressions":{"eps":"abs(a-b)"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"eps":"abs(a-b)"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("ebs","abs(a-b)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object>{
    new Dictionary<string, string> { {"ebs", "abs(a-b)"} }
};
var searchResponse = searchApi.Search(searchRequest);

Another alternative, which can also be used to perform IN(attr,val1,val2,val3) is to compare floats as integers by choosing a multiplier factor and convert the floats to integers in operations. The following example illustrates modifying IN(attr,2.0,2.5,3.5) to work with integer values.

SQL
select in(ceil(attr*100),200,250,350) from products
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "inc": "in(ceil(attr*100),200,250,350)" }
}
PHP
$index->setName('products')->search('')->expression('inc','in(ceil(attr*100),200,250,350)')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"inc":"in(ceil(attr*100),200,250,350)"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("inc","in(ceil(attr*100),200,250,350)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"ebs", "in(ceil(attr*100),200,250,350)"} }
};
var searchResponse = searchApi.Search(searchRequest);

JSON

This data type allows storing JSON objects, which is useful for storing schema-less data. However, it is not supported by columnar storage. However, it can be stored in traditional storage, as it's possible to combine both storage types in the same table.

SQL
CREATE TABLE products(title text, data json);
JSON
POST /cli -d "CREATE TABLE products(title text, data json)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'data'=>['type'=>'json']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, data json)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, data json)');
java
utilsApi.sql'CREATE TABLE products(title text, data json)');
C#
utilsApi.Sql'CREATE TABLE products(title text, data json)');
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_json = data
}

JSON properties can be used in most operations. There are also special functions such as ALL(), ANY(), GREATEST(), LEAST() and INDEXOF() that allow traversal of property arrays.

SQL
select indexof(x>2 for x in data.intarray) from products
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "expressions": { "idx": "indexof(x>2 for x in data.intarray)" }
}
PHP
$index->setName('products')->search('')->expression('idx','indexof(x>2 for x in data.intarray)')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"expressions":{"idx":"indexof(x>2 for x in data.intarray)"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

Text properties are treated the same as strings, so it's not possible to use them in full-text match expressions. However, string functions such as REGEX() can be used.

SQL
select regex(data.name, 'est') as c from products where c>0
JSON
POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "range": { "c": { "gt": 0 } } }
  },
  "expressions": { "c": "regex(data.name, 'est')" }
}
PHP
$index->setName('products')->search('')->expression('idx',"regex(data.name, 'est')")->filter('c','gt',0)->get();
Python
searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{},"range":{"c":{"gt":0}}}},"expressions":{"c":"regex(data.name, 'est')"}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("range", new HashMap<String,Object>(){{
    put("c", new HashMap<String,Object>(){{
        put("gt",0);
    }});
}});
searchRequest.setQuery(query);
Object expressions = new HashMap<String,Object>(){{
    put("idx","indexof(x>2 for x in data.intarray)");
}};
searchRequest.setExpressions(expressions);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
var rangeFilter = new RangeFilter("c");
rangeFilter.Gt = 0;
searchRequest.AttrFilter = rangeFilter;
searchRequest.Expressions = new List<Object> {
    new Dictionary<string, string> { {"idx", "indexof(x>2 for x in data.intarray)"} }
};
var searchResponse = searchApi.Search(searchRequest);

In the case of JSON properties, enforcing data type may be required for proper functionality in certain situations. For example, when working with float values, DOUBLE() must be used for proper sorting.

SQL
select * from products order by double(data.myfloat) desc
JSON
POST /search
{
  "index": "products",
  "query": { "match_all": {} } },
  "sort": [ { "double(data.myfloat)": { "order": "desc"} } ]
}
PHP
$index->setName('products')->search('')->sort('double(data.myfloat)','desc')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{}}},"sort":[{"double(data.myfloat)":{"order":"desc"}}]});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("double(data.myfloat)",new HashMap<String,String>(){{ put("order","desc");}});}});
}});
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortOrder("double(data.myfloat)", SortOrder.OrderEnum.Desc)
};
var searchResponse = searchApi.Search(searchRequest);

Float vector

Float vector attributes allow storing variable-length lists of floats. It's important to note that this concept differs from multi-valued attributes. Multi-valued attributes (MVAs) are essentially sets; they do not preserve value order, and duplicate values are not retained. In contrast, float vectors perform no additional processing on values during insertion.

Float vector attributes can be used in k-nearest neighbor searches; see KNN search.

** Currently, float_vector fields can only be utilized in KNN search within real-time tables and the data type is not supported in any other functions or expressions, nor is it supported in plain tables. **

SQL
CREATE TABLE products(title text, image_vector float_vector);
JSON
POST /cli -d "CREATE TABLE products(title text, image_vector float_vector)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'image_vector'=>['type'=>'float_vector']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, image_vector float_vector)');
java
utilsApi.sql("CREATE TABLE products(title text, image_vector float_vector)");
C#
utilsApi.Sql("CREATE TABLE products(title text, image_vector float_vector)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_float_vector = image_vector
}

Multi-value integer (MVA)

Multi-value attributes allow storing variable-length lists of 32-bit unsigned integers. This can be useful for storing one-to-many numeric values, such as tags, product categories, and properties.

SQL
CREATE TABLE products(title text, product_codes multi);
JSON
POST /cli -d "CREATE TABLE products(title text, product_codes multi)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'product_codes'=>['type'=>'multi']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, product_codes multi)')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, product_codes multi)');
java
utilsApi.sql("CREATE TABLE products(title text, product_codes multi)");
C#
utilsApi.Sql("CREATE TABLE products(title text, product_codes multi)");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi = product_codes
}

It supports filtering and aggregation, but not sorting. Filtering can be done using a condition that requires at least one element to pass (using ANY()) or all elements (ALL()) to pass.

SQL
select * from products where any(product_codes)=3
JSON
POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "equals" : { "any(product_codes)": 3 }
  }
}
PHP
$index->setName('products')->search('')->filter('any(product_codes)','equals',3)->get();
Python
searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{},"equals":{"any(product_codes)":3}}}})'
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
query.put("equals",new HashMap<String,Integer>(){{
     put("any(product_codes)",3);
}});
searchRequest.setQuery(query);
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.AttrFilter = new EqualsFilter("any(product_codes)", 3);
var searchResponse = searchApi.Search(searchRequest);

Information like least or greatest element and length of the list can be extracted. An example shows ordering by the least element of a multi-value attribute.

SQL
select least(product_codes) l from products order by l asc
JSON
POST /search
{
  "index": "products",
  "query":
  {
    "match_all": {},
    "sort": [ { "product_codes":{ "order":"asc", "mode":"min" } } ]
  }
}
PHP
$index->setName('products')->search('')->sort('product_codes','asc','min')->get();
Python
searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}})
javascript
res = await searchApi.search({"index":"products","query":{"match_all":{},"sort":[{"product_codes":{"order":"asc","mode":"min"}}]}});
java
searchRequest = new SearchRequest();
searchRequest.setIndex("forum");
query = new HashMap<String,Object>();
query.put("match_all",null);
searchRequest.setQuery(query);
searchRequest.setSort(new ArrayList<Object>(){{
    add(new HashMap<String,String>(){{ put("product_codes",new HashMap<String,String>(){{ put("order","asc");put("mode","min");}});}});
}});
searchResponse = searchApi.search(searchRequest);
C#
object query =  new { match_all=null };
var searchRequest = new SearchRequest("forum", query);
searchRequest.Sort = new List<Object> {
    new SortMVA("product_codes", SortOrder.OrderEnum.Asc, SortMVA.ModeEnum.Min)
};
searchResponse = searchApi.search(searchRequest);

When grouping by a multi-value attribute, a document will contribute to as many groups as there are different values associated with that document. For instance, if a collection contains exactly one document having a 'product_codes' multi-value attribute with values 5, 7, and 11, grouping on 'product_codes' will produce 3 groups with COUNT(*)equal to 1 and GROUPBY() key values of 5, 7, and 11, respectively. Also, note that grouping by multi-value attributes may lead to duplicate documents in the result set because each document can participate in many groups.

SQL
insert into products values ( 1, 'doc one', (5,7,11) );
select id, count(*), groupby() from products group by product_codes;
Query OK, 1 row affected (0.00 sec)

+------+----------+-----------+
| id   | count(*) | groupby() |
+------+----------+-----------+
|    1 |        1 |        11 |
|    1 |        1 |         7 |
|    1 |        1 |         5 |
+------+----------+-----------+
3 rows in set (0.00 sec)

The order of the numbers inserted as values of multivalued attributes is not preserved. Values are stored internally as a sorted set.

SQL
insert into product values (1,'first',(4,2,1,3));
select * from products;
Query OK, 1 row affected (0.00 sec)

+------+---------------+-------+
| id   | product_codes | title |
+------+---------------+-------+
|    1 | 1,2,3,4       | first |
+------+---------------+-------+
1 row in set (0.01 sec)
JSON
POST /insert
{
    "index":"products",
    "id":1,
    "doc":
    {
        "title":"first",
        "product_codes":[4,2,1,3]
    }
}

POST /search
{
  "index": "products",
  "query": { "match_all": {} }
}
{
   "_index":"products",
   "_id":1,
   "created":true,
   "result":"created",
   "status":201
}

{
   "took":0,
   "timed_out":false,
   "hits":{
      "total":1,
      "hits":[
         {
            "_id":"1",
            "_score":1,
            "_source":{
               "product_codes":[
                  1,
                  2,
                  3,
                  4
               ],
               "title":"first"
            }
         }
      ]
   }
}
PHP
$index->addDocument([
    "title"=>"first",
    "product_codes"=>[4,2,1,3]
]);
$index->search('')-get();
Array
(
    [_index] => products
    [_id] => 1
    [created] => 1
    [result] => created
    [status] => 201
)
Array
(
    [took] => 0
    [timed_out] =>
    [hits] => Array
        (
            [total] => 1
            [hits] => Array
                (
                    [0] => Array
                        (
                            [_id] => 1
                            [_score] => 1
                            [_source] => Array
                                (
                                    [product_codes] => Array
                                        (
                                            [0] => 1
                                            [1] => 2
                                            [2] => 3
                                            [3] => 4
                                        )

                                    [title] => first
                                )
                        )
                )
        )
)
Python
indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}})
searchApi.search({"index":"products","query":{"match_all":{}}})
{'created': True,
 'found': None,
 'id': 1,
 'index': 'products',
 'result': 'created'}
{'hits': {'hits': [{u'_id': u'1',
                    u'_score': 1,
                    u'_source': {u'product_codes': [1, 2, 3, 4],
                                 u'title': u'first'}}],
          'total': 1},
 'profile': None,
 'timed_out': False,
 'took': 29}
javascript
await indexApi.insert({"index":"products","id":1,"doc":{"title":"first","product_codes":[4,2,1,3]}});
res = await searchApi.search({"index":"products","query":{"match_all":{}}});
{"took":0,"timed_out":false,"hits":{"total":1,"hits":[{"_id":"1","_score":1,"_source":{"product_codes":[1,2,3,4],"title":"first"}}]}}
java
InsertDocumentRequest newdoc = new InsertDocumentRequest();
HashMap<String,Object> doc = new HashMap<String,Object>(){{
    put("title","first");
    put("product_codes",new int[] {4,2,1,3});
}};
newdoc.index("products").id(1L).setDoc(doc);
sqlresult = indexApi.insert(newdoc);
Map<String,Object> query = new HashMap<String,Object>();
query.put("match_all",null);
SearchRequest searchRequest = new SearchRequest();
searchRequest.setIndex("products");
searchRequest.setQuery(query);
SearchResponse searchResponse = searchApi.search(searchRequest);
System.out.println(searchResponse.toString() );
class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}
C#
Dictionary<string, Object> doc = new Dictionary<string, Object>();
doc.Add("title", "first");
doc.Add("product_codes", new List<Object> {4,2,1,3});
InsertDocumentRequest newdoc = new InsertDocumentRequest(index: "products", id: 1, doc: doc);
var sqlresult = indexApi.Insert(newdoc);
object query =  new { match_all=null };
var searchRequest = new SearchRequest("products", query);
var searchResponse = searchApi.Search(searchRequest);
Console.WriteLine(searchResponse.ToString())
class SearchResponse {
    took: 0
    timedOut: false
    hits: class SearchResponseHits {
        total: 1
        hits: [{_id=1, _score=1, _source={product_codes=[1, 2, 3, 4], title=first}}]
        aggregations: null
    }
    profile: null
}

Multi-value big integer

A data type that allows storing variable-length lists of 64-bit signed integers. It has the same functionality as multi-value integer.

SQL
CREATE TABLE products(title text, values multi64);
JSON
POST /cli -d "CREATE TABLE products(title text, values multi64)"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'values'=>['type'=>'multi64']
]);
Python
utilsApi.sql('CREATE TABLE products(title text, values multi64))')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, values multi64))');
java
utilsApi.sql("CREATE TABLE products(title text, values multi64))");
C#
utilsApi.Sql("CREATE TABLE products(title text, values multi64))");
config
table products
{
    type = rt
    path = products

    rt_field = title
    stored_fields = title

    rt_attr_multi_64 = values
}

Columnar attribute properties

When you use the columnar storage you can specify the following properties for the attributes.

fast_fetch

By default, Manticore Columnar storage stores all attributes in a columnar fashion, as well as in a special docstore row by row. This enables fast execution of queries like SELECT * FROM ..., especially when fetching a large number of records at once. However, if you are sure that you do not need it or wish to save disk space, you can disable it by specifying fast_fetch='0' when creating a table or (if you are defining a table in a config) by using columnar_no_fast_fetch as shown in the following example.

RT mode
create table t(a int, b int fast_fetch='0') engine='columnar'; desc t;
+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+
3 rows in set (0.00 sec)
Plain mode
source min {
    type = mysql
    sql_host = localhost
    sql_user = test
    sql_pass =
    sql_db = test
    sql_query = select 1, 1 a, 1 b
    sql_attr_uint = a
    sql_attr_uint = b
}

table tbl {
    path = tbl/col
    source = min
    columnar_attrs = *
    columnar_no_fast_fetch = b
}
+-------+--------+---------------------+
| Field | Type   | Properties          |
+-------+--------+---------------------+
| id    | bigint | columnar fast_fetch |
| a     | uint   | columnar fast_fetch |
| b     | uint   | columnar            |
+-------+--------+---------------------+
¶ 6.2

Creating a local table

In Manticore Search, there are two ways to manage tables:

Online schema management (RT mode)

Real-time mode requires no table definition in the configuration file. However, the data_dir directive in the searchd section is mandatory. Index files are stored inside the data_dir.

Replication is only available in this mode.

You can use SQL commands such as CREATE TABLE, ALTER TABLE and DROP TABLE to create and modify table schema, and to drop it. This mode is particularly useful for real-time and percolate tables.

Table names are converted to lowercase when created.

Defining table schema in config (Plain mode)

In this mode, you can specify the table schema in the configuration file. Manticore reads this schema on startup and creates the table if it doesn't exist yet. This mode is particularly useful for plain tables that use data from an external storage.

To drop a table, remove it from the configuration file or remove the path setting and send a HUP signal to the server or restart it.

Table names are case-sensitive in this mode.

All table types are supported in this mode.

Table types and modes

Table type RT mode Plain mode
Real-time supported supported
Plain not supported supported
Percolate supported supported
Distributed supported supported
Template not supported supported
¶ 6.2.1

Real-time table

A real-time table is a main type of table in Manticore. It lets you add, update, and delete documents, and you can see these changes right away. You can set up a real-time Table in a configuration file or use commands like CREATE, UPDATE, DELETE, or ALTER.

Internally a real-time table consists of one or more plain tables called chunks. There are two kinds of chunks:

The size of the RAM chunk is controlled by the rt_mem_limit setting. Once this limit is reached, the RAM chunk is transferred to disk as a disk chunk. If there are too many disk chunks, Manticore combines some of them to improve performance.

Creating a real-time table:

You can create a new real-time table in two ways: by using the CREATE TABLE command or through the _mapping endpoint of the HTTP JSON API.

CREATE TABLE command:

You can use this command via both SQL and HTTP protocols:

SQL
CREATE TABLE products(title text, price float) morphology='stem_en';
Query OK, 0 rows affected (0.00 sec)
JSON
POST /cli -d "CREATE TABLE products(title text, price float)  morphology='stem_en'"
{
"total":0,
"error":"",
"warning":""
}
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
    'title'=>['type'=>'text'],
    'price'=>['type'=>'float'],
]);
Python
utilsApi.sql('CREATE TABLE forum(title text, price float)')
Javascript
res = await utilsApi.sql('CREATE TABLE forum(title text, price float)');
Java
utilsApi.sql("CREATE TABLE forum(title text, price float)");
C#
utilsApi.Sql("CREATE TABLE forum(title text, price float)");
CONFIG
table products {
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
  stored_fields = title
}

_mapping API:

Alternatively, you can create a new table via the _mapping endpoint. This endpoint allows you to define an Elasticsearch-like table structure to be converted to a Manticore table.

The body of your request must have the following structure:

"properties"
{
  "FIELD_NAME_1":
  {
    "type": "FIELD_TYPE_1"
  },
  "FIELD_NAME_2":
  {
    "type": "FIELD_TYPE_2"
  },

  ...

  "FIELD_NAME_N":
  {
    "type": "FIELD_TYPE_M"
  }
}

When creating a table, Elasticsearch data types will be mapped to Manticore types according to the following rules:
- aggregate_metric => json
- binary => string
- boolean => bool
- byte => int
- completion => string
- date => timestamp
- date_nanos => bigint
- date_range => json
- dense_vector => json
- flattened => json
- flat_object => json
- float => float
- float_range => json
- geo_point => json
- geo_shape => json
- half_float => float
- histogram => json
- integer => int
- integer_range => json
- ip => string
- ip_range => json
- keyword => string
- knn_vector => float_vector
- long => bigint
- long_range => json
- match_only_text => text
- object => json
- point => json
- scaled_float => float
- search_as_you_type => text
- shape => json
- short => int
- text => text
- unsigned_long => int
- version => string

JSON
POST /your_table_name/_mapping -d '
{
  "test": {
    "mappings": {
      "properties": {
        "price": {
            "type": "float"
        },
        "title": {
            "type": "text"
        }
      }
    }
  }
}
'
{
"total":0,
"error":"",
"warning":""
}

👍 What you can do with a real-time table:

⛔ What you cannot do with a real-time table:

Real-time table files structure

The following table outlines the different file extensions and their respective descriptions in a real-time table:

Extension Description
.lock A lock file that ensures that only one process can access the table at a time.
.ram The RAM chunk of the table, stored in memory and used as an accumulator of changes.
.meta The headers of the real-time table that define its structure and settings.
.*.sp* Disk chunks that are stored on disk with the same format as plain tables. They are created when the RAM chunk size exceeds the rt_mem_limit.

For more information on the structure of disk chunks, refer to the plain table files structure.

¶ 6.2.2

Plain table

Plain table is a basic element for non-percolate searching. It can be defined only in a configuration file using the Plain mode, and is not supported in the RT mode. It is typically used in conjunction with a source to process data from the external storage and can later be attached to a real-time table.

Creating a plain table

To create a plain table, you'll need to define it in a configuration file. It's not supported by the CREATE TABLE command.

Here's an example of a plain table configuration and a source for fetching data from a MySQL database:

How to create a plain table

Plain table example
source source {
  type             = mysql
  sql_host         = localhost
  sql_user         = myuser
  sql_pass         = mypass
  sql_db           = mydb
  sql_query        = SELECT id, title, description, category_id  from mytable
  sql_attr_uint    = category_id
  sql_field_string = title
 }

table tbl {
  type   = plain
  source = source
  path   = /path/to/table
 }

👍 What you can do with a plain table:

⛔ What you cannot do with a plain table:

Numeric attributes, including MVAs, are the only elements that can be updated in a plain table. All other data in the table is immutable. If updates or new records are required, the table must be rebuilt. During the rebuilding process, the existing table remains available to serve requests, and a process called rotation is performed when the new version is ready, bringing it online and discarding the old version.

Plain table building performance

The speed at which a plain table is indexed depends on several factors, including:

Plain table building scenarios

Rebuild fully when needed

For small data sets, the simplest option is to have a single plain table that is fully rebuilt as needed. This approach is acceptable when:

Main+delta scenario

For larger data sets, a plain table can be used instead of a Real-Time. The main+delta scenario involves:

This approach allows for infrequent rebuilding of the larger table and more frequent processing of updates from the source. The smaller table can be rebuilt more often (e.g. every minute or even every few seconds).

However, as time goes on, the indexing duration for the smaller table will become too long, requiring a rebuild of the larger table and the emptying of the smaller one.

The main+delta schema is explained in detail in this interactive course.

The mechanism of kill list and killlist_target directive is used to ensure that documents from the current table take precedence over those from the other table.

For more information on this topic, see here.

Plain table files structure

The following table outlines the various file extensions used in a plain table and their respective descriptions:

Extension Description
.spa stores document attributes in row-wise mode
.spb stores blob attributes in row-wise mode: strings, MVA, json
.spc stores document attributes in columnar mode
.spd stores matching document ID lists for each word ID
.sph stores table header information
.sphi stores histograms of attribute values
.spi stores word lists (word IDs and pointers to .spd file)
.spidx stores secondary indexes data
.spk stores kill-lists
.spl lock file
.spm stores a bitmap of killed documents
.spp stores hit (aka posting, aka word occurrence) lists for each word ID
.spt stores additional data structures to speed up lookups by document ids
.spe stores skip-lists to speed up doc-list filtering
.spds stores document texts
.tmp* temporary files during index_settings_and_status
.new.sp* new version of a plain table before rotation
.old.sp* old version of a plain table after rotation
¶ 6.2.3

Plain and real-time table settings

Defining table schema in a configuration file

table <index_name>[:<parent table name>] {
...
}
Plain
table <table name> {
  type = plain
  path = /path/to/table
  source = <source_name>
  source = <another source_name>
  [stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]
}
Real-time
table <table name> {
  type = rt
  path = /path/to/table

  rt_field = <full-text field name>
  rt_field = <another full-text field name>
  [rt_attr_uint = <integer field name>]
  [rt_attr_uint = <another integer field name, limit by N bits>:N]
  [rt_attr_bigint = <bigint field name>]
  [rt_attr_bigint = <another bigint field name>]
  [rt_attr_multi = <multi-integer (MVA) field name>]
  [rt_attr_multi = <another multi-integer (MVA) field name>]
  [rt_attr_multi_64 = <multi-bigint (MVA) field name>]
  [rt_attr_multi_64 = <another multi-bigint (MVA) field name>]
  [rt_attr_float = <float field name>]
  [rt_attr_float = <another float field name>]
  [rt_attr_float_vector = <float vector field name>]
  [rt_attr_float_vector = <another float vector field name>]
  [rt_attr_bool = <boolean field name>]
  [rt_attr_bool = <another boolean field name>]
  [rt_attr_string = <string field name>]
  [rt_attr_string = <another string field name>]
  [rt_attr_json = <json field name>]
  [rt_attr_json = <another json field name>]
  [rt_attr_timestamp = <timestamp field name>]
  [rt_attr_timestamp = <another timestamp field name>]

  [stored_fields = <comma separated list of full-text fields that should be stored, all are stored by default, can be empty>]

  [rt_mem_limit = <RAM chunk max size, default 128M>]
  [optimize_cutoff = <max number of RT table disk chunks>]

}

Common plain and real-time tables settings

type

type = plain

type = rt

Table type: "plain" or "rt" (real-time)

Value: plain (default), rt

path

path = path/to/table

The path to where the table will be stored or located, either absolute or relative, without the extension.

Value: The path to the table, mandatory

stored_fields

stored_fields = title, content

By default, the original content of full-text fields is indexed and stored when a table is defined in a configuration file. This setting allows you to specify the fields that should have their original values stored.

Value: A comma-separated list of full-text fields that should be stored. An empty value (i.e. stored_fields = ) disables the storage of original values for all fields.

Note: In the case of a real-time table, the fields listed in stored_fields should also be declared as rt_field.

Also, note that you don't need to list attributes in stored_fields, since their original values are stored anyway. stored_fields can only be used for full-text fields.

See also docstore_block_size, docstore_compression for document storage compression options.

SQL
CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)
JSON
POST /cli -d "
CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)"
PHP
$params = [
    'body' => [
        'columns' => [
            'title'=>['type'=>'text'],
            'content'=>['type'=>'text', 'options' => ['indexed', 'stored']],
            'name'=>['type'=>'text', 'options' => ['indexed']],
            'price'=>['type'=>'float']
        ]
    ],
    'index' => 'products'
];
$index = new \Manticoresearch\Index($client);
$index->create($params);
Python
utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)')
Javascript
res = await utilsApi.sql('CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)');
Java
utilsApi.sql("CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)");
C#
utilsApi.Sql("CREATE TABLE products(title text, content text stored indexed, name text indexed, price float)");
CONFIG
table products {
  stored_fields = title, content # we want to store only "title" and "content", "name" shouldn't be stored

  type = rt
  path = tbl
  rt_field = title
  rt_field = content
  rt_field = name
  rt_attr_uint = price
}

stored_only_fields

stored_only_fields = title,content

List of fields that will be stored in the table, but not indexed. This setting works similarly to stored_fields except that when a field is specified in stored_only_fields t will only be stored, not indexed, and cannot be searched using full-text queries. It can only be retrieved in search results.

The value is a comma-separated list of fields that should be stored only, not indexed. By default, this value is empty. If a real-time table is being defined, the fields listed in stored_only_fields must also be declared as rt_field.

Note also, that you don't need to list attributes instored_only_fields,since their original values are stored anyway. If to compare stored_only_fields to string attributes the former (stored field):

In contrast, the latter (string attribute):

Real-time table settings:

optimize_cutoff

The maximum number of disk chunks for the RT table. Learn more here.

rt_field

rt_field = subject

This field declaration determines the full-text fields that will be indexed. The field names must be unique, and the order is preserved. When inserting data, the field values must be in the same order as specified in the configuration.

This is a multi-value, optional field.

rt_attr_uint

rt_attr_uint = gid

This declaration defines an unsigned integer attribute.

Value: the field name or field_name:N (where N is the maximum number of bits to keep).

rt_attr_bigint

rt_attr_bigint = gid

This declaration defines a BIGINT attribute.

Value: field name, multiple records allowed.

rt_attr_multi

rt_attr_multi = tags

Declares a multi-valued attribute (MVA) with unsigned 32-bit integer values.

Value: field name. Multiple records allowed.

rt_attr_multi_64

rt_attr_multi_64 = wide_tags

Declares a multi-valued attribute (MVA) with signed 64-bit BIGINT values.

Value: field name. Multiple records allowed.

rt_attr_float

rt_attr_float = lat
rt_attr_float = lon

Declares floating point attributes with single precision, 32-bit IEEE 754 format.

Value: field name. Multiple records allowed.

rt_attr_float_vector

rt_attr_float_vector = image_vector

Declares a vector of floating-point values.

Value: field name. Multiple records allowed.

rt_attr_bool

rt_attr_bool = available

Declares a boolean attribute with 1-bit unsigned integer values.

Value: field name.

rt_attr_string

rt_attr_string = title

String attribute declaration.

Value: field name.

rt_attr_json

rt_attr_json = properties

Declares a JSON attribute.

Value: field name.

rt_attr_timestamp

rt_attr_timestamp = date_added

Declares a timestamp attribute.

Value: field name.

rt_mem_limit

rt_mem_limit = 512M

Memory limit for a RAM chunk of the table. Optional, default is 128M.

RT tables store some data in memory, known as the "RAM chunk," and also maintain a number of on-disk tables, referred to as "disk chunks." This directive allows you to control the size of the RAM chunk. When there is too much data to keep in memory, RT tables will flush it to disk, activate a newly created disk chunk, and reset the RAM chunk.

Please note that the limit is strict, and RT tables will never allocate more memory than what is specified in the rt_mem_limit. Additionally, memory is not preallocated, so specifying a 512MB limit and only inserting 3MB of data will result in allocating only 3MB, not 512MB.

The rt_mem_limit is never exceeded, but the actual RAM chunk size can be significantly lower than the limit. RT tables adapt to the data insertion pace and adjust the actual limit dynamically to minimize memory usage and maximize data write speed. This is how it works:

For instance, if 90MB of data is saved to a disk chunk and an additional 10MB of data arrives while the save is in progress, the rate would be 90%. Next time, the RT table will collect up to 90% of rt_mem_limit before flushing the data. The faster the insertion pace, the lower the rt_mem_limit rate. The rate varies between 33.3% to 95%. You can view the current rate of a table using the SHOW TABLE STATUS command.

How to change rt_mem_limit and optimize_cutoff

In real-time mode, you can adjust the size limit of RAM chunks and the maximum number of disk chunks using the ALTER TABLE statement. To set rt_mem_limit to 1 gigabyte for the table "t," run the following query: ALTER TABLE t rt_mem_limit='1G'. To change the maximum number of disk chunks, run the query: ALTER TABLE t optimize_cutoff='5'.

In the plain mode, you can change the values of rt_mem_limit and optimize_cutoff by updating the table configuration or running the command ALTER TABLE <index_name> RECONFIGURE

Important notes about RAM chunks

Plain table settings:

source

source = srcpart1
source = srcpart2
source = srcpart3

The source field specifies the source from which documents will be obtained during indexing of the current table. There must be at least one source. The sources can be of different types (e.g. one could be MySQL, another PostgreSQL). For more information on indexing from external storages, indexing from external storages here

Value: The name of the source is mandatory. Multiple values are allowed.

killlist_target

killlist_target = main:kl

This setting determines the table(s) to which the kill-list will be applied. Matches in the targeted table that are updated or deleted in the current table will be suppressed. In :kl mode, the documents to suppress are taken from the kill-list. In :id mode, all document IDs from the current table are suppressed in the targeted one. If neither is specified, both modes will take effect. Learn more about kill-lists here

Value: not specified (default), target_index_name:kl, target_index_name:id, target_index_name. Multiple values are allowed

columnar_attrs

columnar_attrs = *
columnar_attrs = id, attr1, attr2, attr3

This configuration setting determines which attributes should be stored in the columnar storage instead of the row-wise storage.

You can set columnar_attrs = * to store all supported data types in the columnar storage.

Additionally, id is a supported attribute to store in the columnar storage.

columnar_strings_no_hash

columnar_strings_no_hash = attr1, attr2, attr3

By default, all string attributes stored in columnar storage store pre-calculated hashes. These hashes are used for grouping and filtering. However, they occupy extra space, and if you don't need to group by that attribute, you can save space by disabling hash generation.

Creating a real-time table online via CREATE TABLE

CREATE TABLE [IF NOT EXISTS] name ( <field name> <field data type> [data type options] [, ...]) [table_options]

For more information on data types, see more about data types here.

Type Equivalent in a configuration file Notes Aliases
text rt_field Options: indexed, stored. Default: both. To keep text stored, but indexed, specify "stored" only. To keep text indexed only, specify "indexed" only. string
integer rt_attr_uint integer int, uint
bigint rt_attr_bigint big integer
float rt_attr_float float
float_vector rt_attr_float_vector a vector of float values
multi rt_attr_multi multi-integer
multi64 rt_attr_multi_64 multi-bigint
bool rt_attr_bool boolean
json rt_attr_json JSON
string rt_attr_string string. Option indexed, attribute will make the value full-text indexed and filterable, sortable and groupable at the same time
timestamp rt_attr_timestamp timestamp
bit(n) rt_attr_uint field_name:N N is the max number of bits to keep
SQL
CREATE TABLE products (title text, price float) morphology='stem_en'
This creates the "products" table with two fields: "title" (full-text) and "price" (float), and sets the "morphology" to "stem_en".
CREATE TABLE products (title text indexed, description text stored, author text, price float)
This creates the "products" table with three fields: * "title" is indexed, but not stored. * "description" is stored, but not indexed. * "author" is both stored and indexed.

Engine

create table ... engine='columnar';
create table ... engine='rowwise';

The engine setting changes the default attribute storage for all attributes in the table. You can also specify engine separately for each attribute.

For information on how to enable columnar storage for a plain table, see columnar_attrs .

Values:

Other settings

The following settings are applicable for both real-time and plain tables, regardless of whether they are specified in a configuration file or set online using the CREATE or ALTER command.

Accessing table files

Manticore supports two access modes for reading table data: seek+read and mmap.

In seek+read mode, the server uses the pread system call to read document lists and keyword positions, represented by the*.spd and *.spp files. The server uses internal read buffers to optimize the reading process, and the size of these buffers can be adjusted using the options read_buffer_docs and read_buffer_hits.There is also the option preopen that controls how Manticore opens files at start.

In mmap access mode, the search server maps the table's file into memory using the mmap system call, and the OS caches the file contents. The options read_buffer_docs and read_buffer_hits have no effect for corresponding files in this mode. The mmap reader can also lock the table's data in memory using themlock privileged call, which prevents the OS from swapping the cached data out to disk.

To control which access mode to use, the options access_plain_attrs, access_blob_attrs, access_doclists, access_hitlists and access_dict are available, with the following values:

Value Description
file server reads the table files from disk with seek+read using internal buffers on file access
mmap server maps the table files into memory and OS caches up its contents on file access
mmap_preread server maps the table files into memory and a background thread reads it once to warm up the cache
mlock server maps the table files into memory and then executes the mlock() system call to cache up the file contents and lock it into memory to prevent it being swapped out
Setting Values Description
access_plain_attrs mmap, mmap_preread (default), mlock controls how *.spa (plain attributes) *.spe (skip lists) *.spt (lookups) *.spm (killed docs) will be read
access_blob_attrs mmap, mmap_preread (default), mlock controls how *.spb (blob attributes) (string, mva and json attributes) will be read
access_doclists file (default), mmap, mlock controls how *.spd (doc lists) data will be read
access_hitlists file (default), mmap, mlock controls how *.spp (hit lists) data will be read
access_dict mmap, mmap_preread (default), mlock controls how *.spi (dictionary) will be read

Here is a table which can help you select your desired mode:

table part keep it on disk keep it in memory cached in memory on server start lock it in memory
plain attributes in row-wise (non-columnar) storage, skip lists, word lists, lookups, killed docs mmap mmap mmap_preread (default) mlock
row-wise string, multi-value attributes (MVA) and json attributes mmap mmap mmap_preread (default) mlock
columnar numeric, string and multi-value attributes always only by means of OS no not supported
doc lists file (default) mmap no mlock
hit lists file (default) mmap no mlock
dictionary mmap mmap mmap_preread (default) mlock
The recommendations are:

The default mode offers a balance of:

This provides a decent search performance, optimal memory utilization, and faster searchd restart in most scenarios.

attr_update_reserve

attr_update_reserve = 256k

This setting reserves extra space for updates to blob attributes such as multi-value attributes (MVA), strings, and JSON. The default value is 128k. When updating these attributes, their length may change. If the updated string is shorter than the previous one, it will overwrite the old data in the *.spb file. If the updated string is longer, it will be written to the end of the *.spb file. This file is memory-mapped, making resizing it a potentially slow process, depending on the operating system's memory-mapped file implementation. To avoid frequent resizing, you can use this setting to reserve extra space at the end of the .spb file.

Value: size, default 128k.

docstore_block_size

docstore_block_size = 32k

This setting controls the size of blocks used by the document storage. The default value is 16kb. When original document text is stored using stored_fields or stored_only_fields, it is stored within the table and compressed for efficiency. To optimize disk access and compression ratios for small documents, these documents are concatenated into blocks. The indexing process collects documents until their total size reaches the threshold specified by this option. At that point, the block of documents is compressed. This option can be adjusted to achieve better compression ratios (by increasing the block size) or faster access to document text (by decreasing the block size).

Value: size, default 16k.

docstore_compression

docstore_compression = lz4hc

This setting determines the type of compression used for compressing blocks of documents stored in document storage. If stored_fields or stored_only_fields are specified, the document storage stores compressed document blocks. 'lz4' offers fast compression and decompression speeds, while 'lz4hc' (high compression) sacrifices some compression speed for a better compression ratio. 'none' disables compression completely.

Values: lz4 (default), lz4hc, none.

docstore_compression_level

docstore_compression_level = 12

The compression level used when 'lz4hc' compression is applied in document storage. By adjusting the compression level, you can find the right balance between performance and compression ratio when using 'lz4hc' compression. Note that this option is not applicable when using 'lz4' compression.

Value: An integer between 1 and 12, with a default of 9.

preopen

preopen = 1

This setting indicates that searchd should open all table files on startup or rotation, and keep them open while running. By default, the files are not pre-opened. Pre-opened tables require a few file descriptors per table, but they eliminate the need for per-query open() calls and are immune to race conditions that might occur during table rotation under high load. However, if you are serving many tables, it may still be more efficient to open them on a per-query basis in order to conserve file descriptors.

Value: 0 (default), or 1.

read_buffer_docs

read_buffer_docs = 1M

Buffer size for storing the list of documents per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.

Value: size, default 256k, minimum value is 8k.

read_buffer_hits

read_buffer_hits = 1M

Buffer size for storing the list of hits per keyword. Increasing this value will result in higher memory usage during query execution, but may reduce I/O time.

Value: size, default 256k, minimum value is 8k.

Plain table disk footprint settings

inplace_enable

inplace_enable = {0|1}

Enables in-place table inversion. Optional, default is 0 (uses separate temporary files).

The inplace_enable option reduces the disk footprint during indexing of plain tables, while slightly slowing down indexing (it uses approximately 2 times less disk, but yields around 90-95% of the original performance).

Indexing is comprised of two primary phases. During the first phase, documents are collected, processed, and partially sorted by keyword, and the intermediate results are written to temporary files (.tmp*). During the second phase, the documents are fully sorted and the final table files are created. Rebuilding a production table on-the-fly requires approximately 3 times the peak disk footprint: first for the intermediate temporary files, second for the newly constructed copy, and third for the old table that will be serving production queries in the meantime. (Intermediate data is comparable in size to the final table.) This may be too much disk footprint for large data collections, and the inplace_enable option can be used to reduce it. When enabled, it reuses the temporary files, outputs the final data back to them, and renames them upon completion. However, this may require additional temporary data chunk relocation, which is where the performance impact comes from.

This directive has no effect on searchd, it only affects the indexer.

CONFIG
table products {
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_hit_gap

inplace_hit_gap = size

The option In-place inversion fine-tuning option. Controls preallocated hitlist gap size. Optional, default is 0.

This directive only affects the searchd tool, and does not have any impact on the indexer.

CONFIG
table products {
  inplace_hit_gap = 1M
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_reloc_factor

inplace_reloc_factor = 0.1

The inplace_reloc_factor setting determines the size of the relocation buffer within the memory arena used during indexing. The default value is 0.1.

This option is optional and only affects the indexer tool, not the searchd server.

CONFIG
table products {
  inplace_reloc_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

inplace_write_factor

inplace_write_factor = 0.1

Controls the size of the buffer used for in-place writing during indexing. Optional, with a default value of 0.1.

It's important to note that this directive only impacts the indexer tool and not the searchd server.

CONFIG
table products {
  inplace_write_factor = 0.1
  inplace_enable = 1

  path = products
  source = src_base
}

Natural language processing specific settings

The following settings are supported. They are all described in section NLP and tokenization.

¶ 6.2.4

Percolate table

A percolate table is a special table that stores queries rather than documents. It is used for prospective searches, or "search in reverse."

The schema of a percolate table is fixed and contains the following fields:

Field Description
ID An unsigned 64-bit integer with auto-increment functionality. It can be omitted when adding a PQ rule, as described in add a PQ rule
Query Full-text query of the rule, which can be thought of as the value of MATCH clause or JSON /search. If per field operators are used inside the query, the full-text fields need to be declared in the percolate table configuration. If the stored query is only for attribute filtering (without full-text querying), the query value can be empty or omitted. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table.
Filters Optional. Filters are an optional string containing attribute filters and/or expressions, defined the same way as in the WHERE clause or JSON filtering. The value of this field should correspond to the expected document schema, which is specified when creating the percolate table.
Tags Optional. Tags represent a list of string labels separated by commas that can be used for filtering/deleting PQ rules. The tags can also be returned along with matching documents when performing a Percolate query

Note that you do not need to add the above fields when creating a percolate table.

What you need to keep in mind when creating a new percolate table is to specify the expected schema of a document, which will be checked against the rules you will add later. This is done in the same way as for any other local table.

SQL
CREATE TABLE products(title text, meta json) type='pq';
Query OK, 0 rows affected (0.00 sec)
JSON
POST /cli -d "CREATE TABLE products(title text, meta json) type='pq'"
{
"total":0,
"error":"",
"warning":""
}
PHP
$index = [
    'index' => 'products',
    'body' => [
        'columns' => [
            'title' => ['type' => 'text'],
            'meta' => ['type' => 'json']
        ],
        'settings' => [
            'type' => 'pq'
        ]
    ]
];
$client->indices()->create($index);
Array(
    [total] => 0
    [error] =>
    [warning] =>
)
Python
utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, meta json) type=\'pq\'');
java
utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
C#
utilsApi.Sql("CREATE TABLE products(title text, meta json) type='pq'");
typescript
res = await utilsApi.sql("CREATE TABLE products(title text, meta json) type='pq'");
go
apiClient.UtilsAPI.Sql(context.Background()).Body("CREATE TABLE products(title text, meta json) type='pq'").Execute()
CONFIG
table products {
  type = percolate
  path = tbl_pq
  rt_field = title
  rt_attr_json = meta
}
¶ 6.2.5

Template table

A Template Table is a special type of table in Manticore that doesn't store any data and doesn't create any files on your disk. Despite this, it can have the same NLP settings as a plain or real-time table. Template tables can be used for the following purposes:

CONFIG
table template {
  type = template
  morphology = stem_en
  wordforms = wordforms.txt
  exceptions = exceptions.txt
  stopwords = stopwords.txt
}
¶ 6.3

NLP and tokenization

¶ 6.3.1

Data tokenization

Manticore doesn't store text as is for performing full-text searching on it. Instead, it extracts words and creates several structures that allow fast full-text searching. From the found words, a dictionary is built, which allows a quick look to discover if the word is present or not in the index. In addition, other structures record the documents and fields in which the word was found (as well as the position of it inside a field). All these are used when a full-text match is performed.

The process of demarcating and classifying words is called tokenization. The tokenization is applied at both indexing and searching, and it operates at the character and word level.

On the character level, the engine allows only certain characters to pass. This is defined by the charset_table. Anything else is replaced with a whitespace (which is considered the default word separator). The charset_table also allows mappings, such as lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, defined as a phrase boundary.

At the word level, the base setting is the min_word_len which defines the minimum word length in characters to be accepted in the index. A common request is to match singular with plural forms of words. For this, morphology processors can be used.

Going further, we might want a word to be matched as another one because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.

Very common words can have some unwanted effects on searching, mostly because of their frequency they require lots of computing to process their doc/hit lists. They can be blacklisted with the stop words functionality. This helps not only in speeding up queries but also in decreasing the index size.

A more advanced blacklisting is bigrams, which allows creating a special token between a "bigram" (common) word and an uncommon word. This can speed up several times when common words are used in phrase searches.

In case of indexing HTML content, it's important not to index the HTML tags, as they can introduce a lot of "noise" in the index. HTML stripping can be used and can be configured to strip, but index certain tag attributes or completely ignore the content of certain HTML elements.

¶ 6.3.2

Supported languages

Manticore supports a wide range of languages, with basic support enabled for most languages via charset_table = non_cjk (which is the default value).

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

Additionally, advanced morphology is available for a few languages that can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.

The table below lists all supported languages and indicates how to enable:

Language Supported Stopwords file name Advanced morphology Notes
Afrikaans charset_table=non_cjk af -
Arabic charset_table=non_cjk ar morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian charset_table=non_cjk hy -
Assamese specify charset_table specify charset_table manually - -
Basque charset_table=non_cjk eu -
Bengali charset_table=non_cjk bn -
Bishnupriya specify charset_table manually - -
Buhid specify charset_table manually - -
Bulgarian charset_table=non_cjk bg -
Catalan charset_table=non_cjk ca morphology=libstemmer_ca
Chinese charset_table=chinese or ngram_chars=chinese zh morphology=icu_chinese or ngram_chars=1 correspondingly ICU dictionary based segmentation is much more accurate than ngram-based
Croatian charset_table=non_cjk hr -
Kurdish charset_table=non_cjk ckb -
Czech charset_table=non_cjk cz morphology=stem_cz (Czech stemmer)
Danish charset_table=non_cjk da morphology=libstemmer_da
Dutch charset_table=non_cjk nl morphology=libstemmer_nl
English charset_table=non_cjk en morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto charset_table=non_cjk eo -
Estonian charset_table=non_cjk et -
Finnish charset_table=non_cjk fi morphology=libstemmer_fi
French charset_table=non_cjk fr morphology=libstemmer_fr
Galician charset_table=non_cjk gl -
Garo specify charset_table manually - -
German charset_table=non_cjk de morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek charset_table=non_cjk el morphology=libstemmer_el
Hebrew charset_table=non_cjk he -
Hindi charset_table=non_cjk hi morphology=libstemmer_hi
Hmong specify charset_table manually - -
Ho specify charset_table manually - -
Hungarian charset_table=non_cjk hu morphology=libstemmer_hu
Indonesian charset_table=non_cjk id morphology=libstemmer_id
Irish charset_table=non_cjk ga morphology=libstemmer_ga
Italian charset_table=non_cjk it morphology=libstemmer_it
Japanese ngram_chars=japanese - ngram_chars=japanese ngram_len=1 Requires ngram-based segmentation
Komi specify charset_table manually - -
Korean ngram_chars=korean - ngram_chars=korean ngram_len=1 Requires ngram-based segmentation
Large Flowery Miao specify charset_table manually - -
Latin charset_table=non_cjk la -
Latvian charset_table=non_cjk lv -
Lithuanian charset_table=non_cjk lt morphology=libstemmer_lt
Maba specify charset_table manually - -
Maithili specify charset_table manually - -
Marathi specify charset_table manually - -
Marathi charset_table=non_cjk mr -
Mende specify charset_table manually - -
Mru specify charset_table manually - -
Myene specify charset_table manually - -
Nepali specify charset_table manually - morphology=libstemmer_ne
Ngambay specify charset_table manually - -
Norwegian charset_table=non_cjk no morphology=libstemmer_no
Odia specify charset_table manually - -
Persian charset_table=non_cjk fa -
Polish charset_table=non_cjk pl -
Portuguese charset_table=non_cjk pt morphology=libstemmer_pt
Romanian charset_table=non_cjk ro morphology=libstemmer_ro
Russian charset_table=non_cjk ru morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali specify charset_table manually - -
Sindhi specify charset_table manually - -
Slovak charset_table=non_cjk sk -
Slovenian charset_table=non_cjk sl -
Somali charset_table=non_cjk so -
Sotho charset_table=non_cjk st -
Spanish charset_table=non_cjk es morphology=libstemmer_es
Swahili charset_table=non_cjk sw -
Swedish charset_table=non_cjk sv morphology=libstemmer_sv
Sylheti specify charset_table manually - -
Tamil specify charset_table manually - morphology=libstemmer_ta
Thai charset_table=non_cjk th -
Turkish charset_table=non_cjk tr morphology=libstemmer_tr
Ukrainian charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491 - morphology=lemmatize_uk_all Requires installation of UK lemmatizer
Yoruba charset_table=non_cjk yo -
Zulu charset_table=non_cjk zu -
¶ 6.3.3

Chinese, Japanese and Korean (CJK) languages

Manticore provides built-in support for indexing CJK texts, allowing you to process CJK texts in two different ways:

  1. Precise segmentation using the ICU library. Currently, only Chinese is supported.
SQL
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cjk',
            'morphology' => 'icu_chinese'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'')
Javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cjk\' morphology = \'icu_chinese\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'");
CONFIG
table products {
  charset_table = cjk
  morphology = icu_chinese

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
  1. Basic support using the N-gram options ngram_len and ngram_chars
    For each CJK language, there are separate character set tables (chinese, korean, japanese) that can be used, or you can use the common cjk character set table.
SQL
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'charset_table' => 'non_cjk',
             'ngram_len' => '1',
             'ngram_chars' => 'cjk'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
CONFIG
table products {
  charset_table = non_cjk
  ngram_len = 1
  ngram_chars = cjk

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Additionally, there is built-in support for Chinese stopwords with the alias zh.

SQL
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'chinese',
            'morphology' => 'icu_chinese',
            'stopwords' => 'zh'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'");
CONFIG
table products {
  charset_table = chinese
  morphology = icu_chinese
  stopwords = zh

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
¶ 6.3.4

Low-level tokenization

When text is indexed in Manticore, it is split into words and case folding is done so that words like "Abc", "ABC", and "abc" are treated as the same word.

To perform these operations correctly, Manticore must know:

You can configure these settings on a per-table basis using the charset_table option. charset_table specifies an array that maps letter characters to their case-folded versions (or any other characters that you prefer). Characters that are not present in the array are considered to be non-letters and will be treated as word separators during indexing or searching in this table.

The default character set is non_cjk, which includes most languages.

You can also define text pattern replacement rules. For example, with the following rules:

regexp_filter = \**(\d+)\" => \1 inch
regexp_filter = (BLUE|RED) => COLOR

The text RED TUBE 5" LONG would be indexed as COLOR TUBE 5 INCH LONG, and PLANK 2" x 4" would be indexed as PLANK 2 INCH x 4 INCH. These rules are applied in the specified order. The rules also apply to queries, so a search for BLUE TUBE would actually search for COLOR TUBE.

You can learn more about regexp_filter here.

Index configuration options

charset_table

# default
charset_table = non_cjk

# only English and Russian letters
charset_table = 0..9, A..Z->a..z, _, a..z, \
U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

# english charset defined with alias
charset_table = 0..9, english, _

# you can override character mappings by redefining them, e.g. for case insensitive search with German umlauts you can use:
charset_table = non_cjk, U+00E4, U+00C4->U+00E4, U+00F6, U+00D6->U+00F6, U+00FC, U+00DC->U+00FC, U+00DF, U+1E9E->U+00DF

charset_table specifies an array that maps letter characters to their case folded versions (or any other characters if you like). The default character set is non_cjk which includes most non-CJK languages.

charset_table is a workhorse of Manticore's tokenization process, which extracts keywords from document text or query text. It controls what characters are accepted as valid and how they should be transformed (e.g. whether case should be removed or not).

By default, every character maps to 0, which means that it is not considered a valid keyword and is treated as a separator. Once a character is mentioned in the table, it is mapped to another character (most frequently, either to itself or to a lowercase letter) and is treated as a valid keyword part.

charset_table uses a comma-separated list of mappings to declare characters as valid or to map them to other characters. Syntax shortcuts are available for mapping ranges of characters at once:

For characters with codes from 0 to 32, and those in the range of 127 to 8-bit ASCII and Unicode characters, Manticore always treats them as separators. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+XXX form, where XXX is a hexadecimal code point number. The minimal accepted Unicode character code is U+0021.

If the default mappings are insufficient for your needs, you can redefine the character mappings by specifying them again with another mapping. For example, if the built-in non_cjk array includes characters Ä and ä and maps them both to the ASCII character a, you can redefine those characters by adding the Unicode code points for them, like this:

charset_table = non_cjk,U+00E4,U+00C4

for case sensitive search or

charset_table = non_cjk,U+00E4,U+00C4->U+00E4

for case insensitive search.

SQL
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451'");
CONFIG
table products {
  charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Besides definitions of characters and mappings, there are several built-in aliases that can be used. Current aliases are:

SQL
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => '0..9, english, _'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'0..9, english, _\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = '0..9, english, _'");
CONFIG
table products {
  charset_table = 0..9, english, _

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you want to support different languages in your search, it can be a laborious task to define sets of valid characters and folding rules for all of them. We have simplified this for you by providing default charset tables, non_cjk and cjk, that cover non-CJK and CJK (Chinese, Japanese, Korean) languages respectively. In most cases, these charsets should be sufficient for your needs.

Please note that the following languages are currently not supported:

All other languages listed in the Unicode languages
list
are supported by default.

To work with both cjk and non-cjk languages, set the options in your configuration file as shown below (with an exception for Chinese):

SQL
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'charset_table' => 'non_cjk',
             'ngram_len' => '1',
             'ngram_chars' => 'cjk'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cjk\' ngram_len = \'1\' ngram_chars = \'cjk\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'");
CONFIG
table products {
  charset_table       = non_cjk
  ngram_len           = 1
  ngram_chars         = cjk

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you do not require support for cjk-languages, you can simply exclude the ngram_len and ngram_chars
options. For more information on these options, refer to the corresponding documentation sections.

To map one character to multiple characters or vice versa, you can use regexp_filter can be helpful.

blend_chars

blend_chars = +, &, U+23
blend_chars = +, &->+

Blended characters list. Optional, default is empty.

Blended characters are indexed as both separators and valid characters. For example, when & is defined as a blended character and AT&T appears in an indexed document, three different keywords will be indexed, at&t, at and t.

Additionally, blended characters can influence indexing in such a way that keywords are indexed as if the blended characters were not typed at all. This behavior is particularly evident when blend_mode = trim_all is specified. For example, the phrase some_thing will be indexed as some, something, and thing with blend_mode = trim_all.

Care should be taken when using blended characters as defining a character as blended means that it is no longer a separator.

Positions for tokens obtained by replacing blended characters with whitespace are assigned as usual, and regular keywords will be indexed as if there were no blend_chars specified at all. An additional token that mixes blended and non-blended characters will be put at the starting position. For instance, if AT&T company occurs in the very beginning of the text field, at will be given position 1, t position 2, company position 3, and AT&T will also be given position 1, blending with the opening regular keyword. As a result, queries for AT&T or just AT will match that document. A phrase query for "AT T" will also match, as well as a phrase query for "AT&T company".

Blended characters can overlap with special characters used in query syntax, such as T-Mobile or @twitter. Where possible, the query parser will handle the blended character as blended. For instance, if hello @twitter is within quotes (a phrase operator), the query parser will handle the @ symbol as blended. However, if the @ symbol was not within quotes, the character would be handled as an operator. Therefore, it is recommended to escape keywords.

Blended characters can be remapped so that multiple different blended characters can be normalized into one base form. This is useful when indexing multiple alternative Unicode codepoints with equivalent glyphs.

SQL
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'blend_chars' => '+, &, U+23, @->_'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_chars = \'+, &, U+23, @->_\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) blend_chars = '+, &, U+23, @->_'");
CONFIG
table products {
  blend_chars = +, &, U+23, @->_

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

blend_mode

blend_mode = option [, option [, ...]]
option = trim_none | trim_head | trim_tail | trim_both | trim_all | skip_pure

The blended tokens indexing mode is enabled by the blend_mode directive.

By default, tokens that mix blended and non-blended characters get indexed entirely. For example, when both an at-sign and an exclamation are in blend_chars, the string @dude! will be indexed as two tokens: @dude! (with all the blended characters) and dude (without any). As a result, a query of @dude will not match it.

blend_mode adds flexibility to this indexing behavior. It takes a comma-separated list of options, each of which specifies a token indexing variant.

If multiple options are specified, multiple variants of the same token will be indexed. Regular keywords (resulting from that token by replacing blended characters with a separator) are always indexed.

The options are:

Using blend_mode with the example @dude! string above, the setting blend_mode = trim_head, trim_tail would result in two indexed tokens: @dude and dude!. Using trim_both would have no effect because trimming both blended characters results in dude, which is already indexed as a regular keyword. Indexing @U.S.A. with trim_both (and assuming that dot is blended two) would result in U.S.A being indexed. Lastly, skip_pure enables you to ignore sequences of blended characters only. For example, one @@@ two would be indexed as one two, and match that as a phrase. This is not the case by default because a fully blended token gets indexed and offsets the second keyword position.

Default behavior is to index the entire token, equivalent to blend_mode = trim_none.

Be aware that using blend modes limits your search, even with the default mode trim_none if you assume . is a blended character:

Using more modes increases the chance your keyword will match something.

SQL
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'blend_mode' => 'trim_tail, skip_pure',
            'blend_chars' => '+, &'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) blend_mode = \'trim_tail, skip_pure\' blend_chars = \'+, &\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) blend_mode = 'trim_tail, skip_pure' blend_chars = '+, &'");
CONFIG
table products {
  blend_mode = trim_tail, skip_pure
  blend_chars = +, &

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

min_word_len

min_word_len = length

min_word_len is an optional index configuration option in Manticore that specifies the minimum indexed word length. The default value is 1, which means that everything is indexed.

Only those words that are not shorter than this minimum will be indexed. For example, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.

SQL
CREATE TABLE products(title text, price float) min_word_len = '4'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) min_word_len = '4'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_word_len' => '4'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_word_len = \'4\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) min_word_len = '4'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) min_word_len = '4'");
CONFIG
table products {
  min_word_len = 4

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ngram_len

ngram_len = 1

N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1.

N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there may be no clear separators between the words. In some cases, you may not want to use dictionary-based segmentation the one available for Chinese. In those cases, n-gram segmentation might work well too.

When this feature is enabled, streams of CJK (or any other characters defined in ngram_chars) are indexed as N-grams. For example, if the incoming text is "ABCDEF" (where A to F represent some CJK characters) and ngram_len is 1, it will be indexed as if it were "A B C D E F". Only ngram_len=1 is currently supported. Only those characters that are listed in ngram_chars table will be split this way; others will not be affected.

Note that if the search query is segmented, i.e. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Manticore and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.

Even if the search query is not segmented, Manticore should still produce good results, thanks to phrase-based ranking: it will pull closer phrase matches (which in the case of N-gram CJK words can mean closer multi-character word matches) to the top.

SQL
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'cjk',
             'ngram_len' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
CONFIG
table products {
  ngram_chars = cjk
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ngram_chars

ngram_chars = cjk

ngram_chars = cjk, U+3000..U+2FA1F

N-gram characters list. Optional, default is empty.

To be used in conjunction with in ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by N-gram indexing feature. The value format is identical to charset_table. N-gram characters cannot appear in the charset_table.

SQL
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'U+3000..U+2FA1F',
             'ngram_len' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'U+3000..U+2FA1F\' ngram_len = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'U+3000..U+2FA1F' ngram_len = '1'");
CONFIG
table products {
  ngram_chars = U+3000..U+2FA1F
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Also you can use an alias for our default N-gram table as in the example. It should be sufficient in most cases.

SQL
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'ngram_chars' => 'cjk',
             'ngram_len' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ngram_chars = \'cjk\' ngram_len = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) ngram_chars = 'cjk' ngram_len = '1'");
CONFIG
table products {
  ngram_chars = cjk
  ngram_len = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

ignore_chars

ignore_chars = U+AD

Ignored characters list. Optional, default is empty.

Useful in cases when some characters, such as soft hyphenation mark (U+00AD), should be not just treated as separators but rather fully ignored. For example, if '-' is simply not in the charset_table, "abc-def" text will be indexed as "abc" and "def" keywords. On the contrary, if '-' is added to ignore_chars list, the same text will be indexed as a single "abcdef" keyword.

The syntax is the same as for charset_table, but it's only allowed to declare characters, and not allowed to map them. Also, the ignored characters must not be present in charset_table.

SQL
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'ignore_chars' => 'U+AD'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) ignore_chars = \'U+AD\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) ignore_chars = 'U+AD'");
CONFIG
table products {
  ignore_chars = U+AD

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

bigram_index

bigram_index = {none|all|first_freq|both_freq}

Bigram indexing mode. Optional, default is none.

Bigram indexing is a feature to accelerate phrase searches. When indexing, it stores a document list for either all or some of the adjacent words pairs into the index. Such a list can then be used at searching time to significantly accelerate phrase or sub-phrase matching.

bigram_index controls the selection of specific word pairs. The known modes are:

For most use cases, both_freq would be the best mode, but your mileage may vary.

SQL
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'bigram_freq_words' => 'the, a, you, i',
            'bigram_index' => 'both_freq'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'both_freq\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'both_freq'");
CONFIG
table products {
  bigram_index = both_freq
  bigram_freq_words = the, a, you, i

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

bigram_freq_words

bigram_freq_words = the, a, you, i

A list of keywords considered "frequent" when indexing bigrams. Optional, default is empty.

Some of the bigram indexing modes (see bigram_index) require to define a list of frequent keywords. These are not to be confused with stop words. Stop words are completely eliminated when both indexing and searching. Frequent keywords are only used by bigrams to determine whether to index a current word pair or not.

bigram_freq_words lets you define a list of such keywords.

SQL
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'bigram_freq_words' => 'the, a, you, i',
            'bigram_index' => 'first_freq'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) bigram_freq_words = \'the, a, you, i\' bigram_index = \'first_freq\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) bigram_freq_words = 'the, a, you, i' bigram_index = 'first_freq'");
CONFIG
table products {
  bigram_freq_words = the, a, you, i
  bigram_index = first_freq

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

dict

dict = {keywords|crc}

The type of keywords dictionary used is identified by one of two known values, 'crc' or 'keywords'. This is optional, with 'keywords' as the default.

Using the keywords dictionary mode (dict=keywords) can significantly decrease the indexing burden and enable substring searches on extensive collections. This mode can be utilized for both plain and RT tables.

CRC dictionaries do not store the original keyword text in the index. Instead, they replace keywords with a control sum value (computed using FNV64) during both searching and indexing processes. This value is used internally within the index. This approach has two disadvantages:

The keywords dictionary resolves both of these issues. It stores keywords in the index and performs search-time wildcard expansion. For instance, a search for a test* prefix could internally expand to a 'test|tests|testing' query based on the dictionary's contents. This expansion process is entirely invisible to the application, with the exception that the separate per-keyword statistics for all the matched keywords are now also reported.

For substring (infix) searches, extended wildcards can be used. Special characters such as ? and % are compatible with substring (infix) search (e.g., t?st*, run%, *abc*). Note that the wildcards operators and the REGEX only function with dict=keywords.

Indexing with a keywords dictionary is approximately 1.1x to 1.3x slower than regular, non-substring indexing - yet significantly faster than substring indexing (either prefix or infix). The index size should only be slightly larger than that of the standard non-substring table, with a total difference of 1..10% percent. The time it takes for regular keyword searching should be nearly the same or identical across all three index types discussed (CRC non-substring, CRC substring, keywords). Substring searching time can significantly fluctuate based on how many actual keywords match the given substring (i.e., how many keywords the search term expands into). The maximum number of matched keywords is limited by the expansion_limit directive.

In summary, keywords and CRC dictionaries offer two different trade-off decisions for substring searching. You can opt to either sacrifice indexing time and index size to achieve the fastest worst-case searches (CRC dictionary), or minimally impact indexing time but sacrifice worst-case searching time when the prefix expands into a high number of keywords (keywords dictionary).

SQL
CREATE TABLE products(title text, price float) dict = 'keywords'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) dict = 'keywords'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'dict' => 'keywords'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) dict = \'keywords\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) dict = 'keywords'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) dict = 'keywords'");
CONFIG
table products {
  dict = keywords

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

embedded_limit

embedded_limit = size

Embedded exceptions, wordforms, or stop words file size limit. Optional, default is 16K.

When you create a table the above mentioned files can be either saved externally along with the table or embedded directly into the table. Files sized under embedded_limit get stored into the table. For bigger files, only the file names are stored. This also simplifies moving table files to a different machine; you may get by just copying a single file.

With smaller files, such embedding reduces the number of the external files on which the table depends, and helps maintenance. But at the same time it makes no sense to embed a 100 MB wordforms dictionary into a tiny delta table. So there needs to be a size threshold, and embedded_limit is that threshold.

CONFIG
table products {
  embedded_limit = 32K

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

global_idf

global_idf = /path/to/global.idf

The path to a file with global (cluster-wide) keyword IDFs. Optional, default is empty (use local IDFs).

On a multi-table cluster, per-keyword frequencies are quite likely to differ across different tables. That means that when the ranking function uses TF-IDF based values, such as BM25 family of factors, the results might be ranked slightly differently depending on what cluster node they reside.

The easiest way to fix that issue is to create and utilize a global frequency dictionary, or a global IDF file for short. This directive lets you specify the location of that file. It is suggested (but not required) to use an .idf extension. When the IDF file is specified for a given table and OPTION global_idf is set to 1, the engine will use the keyword frequencies and collection documents counts from the global_idf file, rather than just the local table. That way, IDFs and the values that depend on them will stay consistent across the cluster.

IDF files can be shared across multiple tables. Only a single copy of an IDF file will be loaded by searchd, even when many tables refer to that file. Should the contents of an IDF file change, the new contents can be loaded with a SIGHUP.

You can build an .idf file using indextool utility, by dumping dictionaries using --dumpdict dict.txt --stats switch first, then converting those to .idf format using --buildidf, then merging all the .idf files across cluster using --mergeidf.

SQL
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'global_idf' => '/usr/local/manticore/var/global.idf'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) global_idf = \'/usr/local/manticore/var/global.idf\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) global_idf = '/usr/local/manticore/var/global.idf'");
CONFIG
table products {
  global_idf = /usr/local/manticore/var/global.idf

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

hitless_words

hitless_words = {all|path/to/file}

Hitless words list. Optional, allowed values are 'all', or a list file name.

By default, Manticore full-text index stores not only a list of matching documents for every given keyword, but also a list of its in-document positions (known as hitlist). Hitlists enables phrase, proximity, strict order and other advanced types of searching, as well as phrase proximity ranking. However, hitlists for specific frequent keywords (that can not be stopped for some reason despite being frequent) can get huge and thus slow to process while querying. Also, in some cases we might only care about boolean keyword matching, and never need position-based searching operators (such as phrase matching) nor phrase ranking.

hitless_words lets you create indexes that either do not have positional information (hitlists) at all, or skip it for specific keywords.

Hitless index will generally use less space than the respective regular full-text index (about 1.5x can be expected). Both indexing and searching should be faster, at a cost of missing positional query and ranking support.

If used in positional queries (e.g. phrase queries) the hitless words are taken out from them and used as operand without a position. For example if "hello" and "world" are hitless and "simon" and "says" are not hitless, the phrase query "simon says hello world" will be converted to ("simon says" & hello & world), matching "hello" and "world" anywhere in the document and "simon says" as an exact phrase.

A positional query than contains only hitless words will result in an empty phrase node, therefore the entire query will return an empty result and a warning. If the whole dictionary is hitless (using all) only boolean matching can be used on the respective index.

SQL
CREATE TABLE products(title text, price float) hitless_words = 'all'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) hitless_words = 'all'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'hitless_words' => 'all'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) hitless_words = \'all\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) hitless_words = 'all'");
CONFIG
table products {
  hitless_words = all

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_field_lengths

index_field_lengths = {0|1}

Enables computing and storing of field lengths (both per-document and average per-index values) into the full-text index. Optional, default is 0 (do not compute and store).

When index_field_lengths is set to 1 Manticore will:

BM25A() and BM25F() functions in the expression ranker are based on these lengths and require index_field_lengths to be enabled. Historically, Manticore used a simplified, stripped-down variant of BM25 that, unlike the complete function, did not account for document length. There's also support for both a complete variant of BM25, and its extension towards multiple fields, called BM25F. They require per-document length and per-field lengths, respectively. Hence the additional directive.

SQL
CREATE TABLE products(title text, price float) index_field_lengths = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) index_field_lengths = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_field_lengths' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_field_lengths = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) index_field_lengths = '1'");
CONFIG
table products {
  index_field_lengths = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_token_filter

index_token_filter = my_lib.so:custom_blend:chars=@#&

Index-time token filter for full-text indexing. Optional, default is empty.

The index_token_filter directive specifies an optional index-time token filter for full-text indexing. This directive is used to create a custom tokenizer that makes tokens according to custom rules. The filter is created by the indexer on indexing source data into a plain table or by an RT table on processing INSERT or REPLACE statements. The plugins are defined using the format library name:plugin name:optional string of settings. For example, my_lib.so:custom_blend:chars=@#&.

SQL
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_token_filter' => 'my_lib.so:custom_blend:chars=@#&'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_token_filter = \'my_lib.so:custom_blend:chars=@#&\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) index_token_filter = 'my_lib.so:custom_blend:chars=@#&'");
CONFIG
table products {
  index_token_filter = my_lib.so:custom_blend:chars=@#&

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

overshort_step

overshort_step = {0|1}

Position increment on overshort (less than min_word_len) keywords. Optional, allowed values are 0 and 1, default is 1.

SQL
CREATE TABLE products(title text, price float) overshort_step = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) overshort_step = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'overshort_step' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) overshort_step = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) overshort_step = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) overshort_step = '1'");
CONFIG
table products {
  overshort_step = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

phrase_boundary

phrase_boundary = ., ?, !, U+2026 # horizontal ellipsis

Phrase boundary characters list. Optional, default is empty.

This list controls what characters will be treated as phrase boundaries, in order to adjust word positions and enable phrase-level search emulation through proximity search. The syntax is similar to charset_table, but mappings are not allowed and the boundary characters must not overlap with anything else.

On phrase boundary, additional word position increment (specified by phrase_boundary_step) will be added to current word position. This enables phrase-level searching through proximity queries: words in different phrases will be guaranteed to be more than phrase_boundary_step distance away from each other; so proximity search within that distance will be equivalent to phrase-level search.

Phrase boundary condition will be raised if and only if such character is followed by a separator; this is to avoid abbreviations such as S.T.A.L.K.E.R or URLs being treated as several phrases.

SQL
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'phrase_boundary' => '., ?, !, U+2026',
             'phrase_boundary_step' => '10'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary = \'., ?, !, U+2026\' phrase_boundary_step = \'10\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary = '., ?, !, U+2026' phrase_boundary_step = '10'");
CONFIG
table products {
  phrase_boundary = ., ?, !, U+2026
  phrase_boundary_step = 10

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

phrase_boundary_step

phrase_boundary_step = 100

Phrase boundary word position increment. Optional, default is 0.

On phrase boundary, current word position will be additionally incremented by this number.

SQL
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'phrase_boundary_step' => '100',
             'phrase_boundary' => '., ?, !, U+2026'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) phrase_boundary_step = \'100\' phrase_boundary = \'., ?, !, U+2026\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) phrase_boundary_step = '100' phrase_boundary = '., ?, !, U+2026'");
CONFIG
table products {
  phrase_boundary_step = 100
  phrase_boundary = ., ?, !, U+2026

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

regexp_filter

# index '13"' as '13inch'
regexp_filter = \b(\d+)\" => \1inch

# index 'blue' or 'red' as 'color'
regexp_filter = (blue|red) => color

Regular expressions (regexps) used to filter the fields and queries. This directive is optional, multi-valued, and its default is an empty list of regular expressions. The regular expressions engine used by Manticore Search is Google's RE2, which is known for its speed and safety. For detailed information on the syntax supported by RE2, you can visit the RE2 syntax guide.

In certain applications such as product search, there can be many ways to refer to a product, model, or property. For example, iPhone 3gs and iPhone 3 gs (or even iPhone3 gs) are very likely to refer to the same product. Another example could be different ways to express a laptop screen size, such as 13-inch, 13 inch, 13", or 13in.

Regexps provide a mechanism to specify rules tailored to handle such cases. In the first example, you could possibly use a wordforms file to handle a handful of iPhone models, but in the second example, it's better to specify rules that would normalize "13-inch" and "13in" to something identical.

Regular expressions listed in regexp_filter are applied in the order they are listed, at the earliest stage possible, before any other processing (including exceptions), even before tokenization. That is, regexps are applied to the raw source fields when indexing, and to the raw search query text when searching.

SQL
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'regexp_filter' => '(blue|red) => color'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) regexp_filter = \'(blue|red) => color\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) regexp_filter = '(blue|red) => color'");
CONFIG
table products {
  # index '13"' as '13inch'
  regexp_filter = \b(\d+)\" => \1inch

  # index 'blue' or 'red' as 'color'
  regexp_filter = (blue|red) => color

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
¶ 6.3.5

Wildcard searching settings

Wildcard searching is a common text search type. In Manticore, it is performed at the dictionary level. By default, both plain and RT tables use a dictionary type called dict. In this mode, words are stored as they are, so enabling wildcarding does not affect the size of the table. When a wildcard search is performed, the dictionary is searched to find all possible expansions of the wildcarded word. This expansion can be problematic in terms of computation at query time when the expanded word provides many expansions or expansions that have huge hitlists, especially in the case of infixes where the wildcard is added at the start and end of the word. To avoid such problems, the expansion_limit can be used.

min_prefix_len

min_prefix_len = length

This setting determines the minimum word prefix length to index and search. By default, it is set to 0, meaning prefixes are not allowed.

Prefixes allow for wildcard searching by wordstart* wildcards.

For example, if the word "example" is indexed with min_prefix_len=3, it can be found by searching for "exa", "exam", "examp", "exampl", as well as the full word.

Note that with dict=crc min_prefix_len will affect the size of the full-text index since each word expansion will be stored additionally.

Manticore can differentiate perfect word matches from prefix matches and rank the former higher if the following conditions are met:

Note that with either dict=crc mode or any of the above options disabled, it is not possible to differentiate between prefixes and full words, and perfect word matches cannot be ranked higher.

When the minimum infix length is set to a positive number, the minimum prefix length is always considered 1.

SQL
CREATE TABLE products(title text, price float) min_prefix_len = '3'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) min_prefix_len = '3'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_prefix_len' => '3'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_prefix_len = \'3\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) min_prefix_len = '3'");
CONFIG
table products {
  min_prefix_len = 3

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

min_infix_len

min_infix_len = length

The min_infix_len setting determines the minimum length of an infix prefix to index and search. It is optional and its default value is 0, which means that infixes are not allowed. The minimum allowed non-zero value is 2.

When enabled, infixes allow for wildcard searching with term patterns like start*, *end, *middle*, , and so on. It also allows you to disable too short wildcards if they are too expensive to search for.

If the following conditions are met, Manticore can differentiate perfect word matches from infix matches and rank the former higher:

Note that with the dict=crc mode or any of the above options disabled, there is no way to differentiate between infixes and full words, and thus perfect word matches cannot be ranked higher.

Infix wildcard search query time can vary greatly, depending on how many keywords the substring will actually expand to. Short and frequent syllables like *in* or *ti* might expand to way too many keywords, all of which would need to be matched and processed. Therefore, to generally enable substring searches, you would set min_infix_len to 2. To limit the impact from wildcard searches with too short wildcards, you might set it higher.

Infixes must be at least 2 characters long, and wildcards like *a* are not allowed for performance reasons.

When min_infix_len is set to a positive number, the minimum prefix length is considered 1. For dict word infixing and prefixing cannot be both enabled at the same time. For dict and other fields to have prefixes declared with prefix_fields, it is forbidden to declare the same field in both lists.

If dict=keywords, besides the wildcard * two other wildcard characters can be used:

SQL
CREATE TABLE products(title text, price float) min_infix_len = '3'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) min_infix_len = '3'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'min_infix_len' => '3'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_infix_len = \'3\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) min_infix_len = '3'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) min_infix_len = '3'");
CONFIG
table products {
  min_infix_len = 3

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

prefix_fields

prefix_fields = field1[, field2, ...]

The prefix_fields setting is used to limit prefix indexing to specific full-text fields in dict=crc mode. By default, all fields are indexed in prefix mode, but because prefix indexing can affect both indexing and searching performance, it may be desired to limit it to certain fields.

To limit prefix indexing to specific fields, use the prefix_fields setting followed by a comma-separated list of field names. If prefix_fields is not set, then all fields will be indexed in prefix mode.

CONFIG
table products {
  prefix_fields = title, name
  min_prefix_len = 3
  dict = crc

infix_fields

infix_fields = field1[, field2, ...]

The infix_fields setting allows you to specify a list of full-text fields to limit infix indexing to. This applies to dict=crc only and is optional, with the default being to index all fields in infix mode.
This setting is similar to prefix_fields, but instead allows you to limit infix indexing to specific fields.

CONFIG
table products {
  infix_fields = title, name
  min_infix_len = 3
  dict = crc

max_substring_len

max_substring_len = length

The max_substring_len directive sets the maximum substring length to be indexed for either prefix or infix searches. This setting is optional, and its default value is 0 (which means that all possible substrings are indexed). It only applies to dict.

By default, substring indexing in dict indexes all possible substrings as separate keywords, which can result in an overly large full-text index. Therefore, the max_substring_len directive allows you to skip too-long substrings that will probably never be searched for.

For example, a test table of 10,000 blog posts takes up a different amount of disk space depending on the settings:

Therefore, limiting the max substring length can save 10-15% of the table size.

When using dict=keywords mode, there is no performance impact associated with substring length. Therefore, this directive is not applicable and is intentionally forbidden in that case. However, if required, you can still limit the length of a substring that you search for in the application code.

CONFIG
table products {
  max_substring_len = 12
  min_infix_len = 3
  dict = crc

expand_keywords

expand_keywords = {0|1|exact|star}

This setting expands keywords with their exact forms and/or with stars when possible. The supported values are:

Queries against tables with expand_keywords feature enabled are internally expanded as follows: if the table was built with prefix or infix indexing enabled, every keyword gets internally replaced with a disjunction of the keyword itself and a respective prefix or infix (keyword with stars). If the table was built with both stemming and index_exact_words enabled, exact form is also added.

SQL
CREATE TABLE products(title text, price float) expand_keywords = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) expand_keywords = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'expand_keywords' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) expand_keywords = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) expand_keywords = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) expand_keywords = '1'");
CONFIG
table products {
  expand_keywords = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Expanded queries take naturally longer to complete, but can possibly improve the search quality, as the documents with exact form matches should be ranked generally higher than documents with stemmed or infix matches.

Note that the existing query syntax does not allow to emulate this kind of expansion, because internal expansion works on keyword level and expands keywords within phrase or quorum operators too (which is not possible through the query syntax). Take a look at the examples and how expand_keywords affects the search result weights and how "runsy" is found by "runs" w/o the need to add a star:

expand_keywords_enabled
mysql> create table t(f text) min_infix_len='2' expand_keywords='1' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    2 | runs    |     1560 |
|    1 | running |     1500 |
|    3 | runsy   |     1500 |
+------+---------+----------+
3 rows in set (0.01 sec)

mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t(f text) min_infix_len='2' expand_keywords='exact' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1590 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)
expand_keywords_disabled
mysql> create table t(f text) min_infix_len='2' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('runs');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1500 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)

mysql> drop table t;
Query OK, 0 rows affected (0.01 sec)

mysql> create table t(f text) min_infix_len='2' morphology='stem_en';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> insert into t values(1,'running'),(2,'runs'),(3,'runsy');
Query OK, 3 rows affected (0.00 sec)

mysql> select *, weight() from t where match('running');
+------+---------+----------+
| id   | f       | weight() |
+------+---------+----------+
|    1 | running |     1500 |
|    2 | runs    |     1500 |
+------+---------+----------+
2 rows in set (0.00 sec)

This directive does not affect indexer in any way, it only affects searchd.

expansion_limit

expansion_limit = number

Maximum number of expanded keywords for a single wildcard. Details are here.

¶ 6.3.6

Ignoring stop words

Stop words are words that are ignored during indexing and searching, typically due to their high frequency and low value to search results.

Manticore Search applies stemming to stop words by default, which can lead to undesired results, but this can be turned off using the stopwords_unstemmed.

Small stop word files are stored in the table header, and there is a limit to the size of files that can be embedded, as defined by the embedded_limit option.

Stop words are not indexed, but they do affect keyword positions. For example, if "the" is a stop word, and document 1 contains the phrase "in office" while document 2 contains the phrase "in the office," searching for "in office" as an exact phrase will only return the first document, even though "the" is skipped as a stop word in the second document. This behavior can be modified using the stopword_step directive.

stopwords

stopwords=path/to/stopwords/file[ path/to/another/file ...]

The stopwords setting is optional and by default empty. It allows you to specify the path to one or more stop word files, separated by spaces. All the files will be loaded. In the real-time mode, only absolute paths are allowed.

The stop word file format is simple plain text with UTF-8 encoding. The file data will be tokenized with respect to the charset_table settings, so you can use the same separators as in the indexed data.

Stop word files can be created manually or semi-automatically. The indexer provides a mode that creates a frequency dictionary of the table, sorted by the keyword frequency. Top keywords from that dictionary can usually be used as stop words. See --buildstops and --buildfreqs switch for details. Top keywords from that dictionary can usually be used as stop words.

SQL
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'");
CONFIG
table products {
  stopwords = /usr/local/manticore/data/stopwords.txt
  stopwords = stopwords-ru.txt stopwords-en.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Alternatively you can use one of the default stop word files that come with Manticore. Currently stop words for 50 languages are available. Here is the full list of aliases for them:

For example, to use stop words for Italian language just put the following line in your config file:

SQL
CREATE TABLE products(title text, price float) stopwords = 'it'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'it'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'it'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'it'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = 'it'");
CONFIG
table products {
  stopwords = it

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

If you need to use stop words for multiple languages you should list all their aliases, separated with commas (RT mode) or spaces (plain mode):

SQL
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'");
C#
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'");
CONFIG
table products {
  stopwords = en it ru

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

stopword_step

stopword_step={0|1}

The position_increment setting on stopwords is optional, and the allowed values are 0 and 1, with the default being 1.

SQL
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru',
            'stopword_step' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'");
C#
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'");
CONFIG
table products {
  stopwords = en
  stopword_step = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

stopwords_unstemmed

stopwords_unstemmed={0|1}

Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).

By default, stop words are stemmed themselves, and then applied to tokens after stemming (or any other morphology processing). This means that a token is stopped when stem(token) is equal to stem(stopword). This default behavior can lead to unexpected results when a token is erroneously stemmed to a stopped root. For example, "Andes" might get stemmed to "and", so when "and" is a stopword, "Andes" is also skipped.

However, you can change this behavior by enabling the stopwords_unstemmed directive. When this is enabled, stop words are applied before stemming (and therefore to the original word forms), and the tokens are skipped when the token is equal to the stopword.

SQL
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'stopwords' => 'en, it, ru',
            'stopwords_unstemmed' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'");
CONFIG
table products {
  stopwords = en
  stopwords_unstemmed = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
¶ 6.3.7

Word forms

Word forms are applied after tokenizing incoming text by charset_table rules. They essentially let you replace one word with another. Normally, that would be used to bring different word forms to a single normal form (e.g. to normalize all the variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.

wordforms

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

Word forms dictionary. Optional, default is empty.

The dictionaries are used to normalize incoming words both during indexing and searching. Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the word forms file.

Word forms support in Manticore is designed to handle large dictionaries well. They moderately affect indexing speed; for example, a dictionary with 1 million entries slows down indexing by about 1.5 times. Searching speed is not affected at all. The additional RAM impact is roughly equal to the dictionary file size, and dictionaries are shared across tables. For instance, if the very same 50 MB word forms file is specified for 10 different tables, the additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain text format. Each line should contain source and destination word forms, in UTF-8 encoding, separated by a "greater than" sign. Rules from the charset_table will be applied when the file is loaded, so if you are using built-in charset_table options, it is typically case-insensitive, just like your other full-text indexed data. Here is a sample file contents:

walks > walk
walked > walk
walking > walk

There is a bundled utility called Spelldump that helps you create a dictionary file in a format that Manticore can read. The utility can read from source .dict and .aff dictionary files in the ispell or MySpell format, as bundled with OpenOffice.

You can map several source words to a single destination word. The process happens on tokens, not the source text, so differences in whitespace and markup are ignored.

You can use the => symbol instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform will be applied after morphology, instead of before (note that only a single source and destination word are supported in this case).

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'

You can specify multiple destination tokens:

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

You can specify multiple files, not just one. Masks can be used as a pattern, and all matching files will be processed in simple ascending order.

In the RT mode, only absolute paths are allowed.

If multi-byte codepages are used and file names include foreign characters, the resulting order may not be exactly alphabetic. If the same wordform definition is found in multiple files, the latter one is used and overrides previous definitions.

SQL
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float)wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'");
CONFIG
table products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
¶ 6.3.8

Exceptions

Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping but have a number of important differences.

A short summary of the differences from wordforms is as follows:

Exceptions Word forms
Case sensitive Case insensitive
Can use special characters that are not in charset_table Fully obey charset_table
Underperform on huge dictionaries Designed to handle millions of entries

exceptions

exceptions = path/to/exceptions.txt

Tokenizing exceptions file. Optional, the default is empty.
In the RT mode, only absolute paths are allowed.

The expected file format is plain text, with one line per exception. The line format is as follows:

map-from-tokens => map-to-token

Example file:

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus

All tokens here are case sensitive and will not be processed by charset_table rules. Thus, with the example exceptions file above, the "at&t" text will be tokenized as two keywords "at" and "t" due to lowercase letters. On the other hand, "AT&T" will match exactly and produce a single "AT&T" keyword.

Note that this map-to keyword:

In our sample, "ms windows" query will not match the document with "MS Windows" text. The query will be interpreted as a query for two keywords, "ms" and "windows". The mapping for "MS Windows" is a single keyword "ms windows", with a space in the middle. On the other hand, "standartenfuhrer" will retrieve documents with "Standarten Fuhrer" or "Standarten Fuehrer" contents (capitalized exactly like this), or any capitalization variant of the keyword itself, e.g., "staNdarTenfUhreR". (It won't catch "standarten fuhrer", however: this text does not match any of the listed exceptions because of case sensitivity and gets indexed as two separate keywords.)

The whitespace in the map-from tokens list matters, but its amount does not. Any amount of whitespace in the map-form list will match any other amount of whitespace in the indexed document or query. For instance, the "AT & T" map-from token will match "AT & T" text, whatever the amount of space in both map-from part and the indexed text. Such text will, therefore, be indexed as a special "AT&T" keyword, thanks to the very first entry from the sample.

Exceptions also allow capturing special characters (that are exceptions from general charset_table rules; hence the name). Assume that you generally do not want to treat '+' as a valid character, but still want to be able to search for some exceptions from this rule such as 'C++'. The sample above will do just that, totally independent of what characters are in the table and what are not.

Therefore, when it comes to a plain table, it's required to rotate the table in order to pick up changes in the exceptions file.

SQL
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'");
CONFIG
table products {
  exceptions = /usr/local/manticore/data/exceptions.txt

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
¶ 6.3.9

Advanced morphology

Morphology preprocessors can be applied to words during indexing to normalize different forms of the same word and improve segmentation. For example, an English stemmer can normalize "dogs" and "dog" to "dog", resulting in identical search results for both keywords.

Manticore has four built-in morphology preprocessors:

morphology

morphology = morphology1[, morphology2, ...]

The morphology directive specifies a list of morphology preprocessors to apply to the words being indexed. This is an optional setting, with the default being no preprocessor applied.

Manticore comes with built-in morphological preprocessors for:

Lemmatizers require dictionary .pak files that can be downloaded from the Manticore website. The dictionaries need to be put in the directory specified by lemmatizer_base. Additionally, the lemmatizer_cache setting can be used to speed up lemmatizing by spending more RAM for an uncompressed dictionary cache.

The Chinese language segmentation can be performed using ICU. It provides more precise segmentation compared to n-grams but is slightly slower. The charset_table must include all Chinese characters, which can be done by using the "cjk" alias. When "morphology=icu_chinese" is specified, the documents are first pre-processed by ICU. Then, the result is processed by the tokenizer according to the charset_table, and finally, other morphology processors specified in the "morphology" option are applied. Only those parts of texts that contain Chinese are passed to ICU for segmentation, while others can be modified by different means such as different morphologies or charset_table.

Built-in English and Russian stemmers are faster than their libstemmer counterparts but may produce slightly different results

Soundex implementation matches that of MySQL. Metaphone implementation is based on Double Metaphone algorithm and indexes the primary code.

To use the morphology option, specify one or multiple of the built-in options, including:

Multiple stemmers can be specified, separated by commas. They will be applied to incoming words in the order they are listed, and the processing will stop once one of the stemmers modifies the word. Additionally, when wordforms feature is enabled, the word will be looked up in the word forms dictionary first. If there is a matching entry in the dictionary, stemmers will not be applied at all. wordforms сan be used to implement stemming exceptions.

SQL
CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'
JSON
POST /cli -d "CREATE TABLE products(title text, price float)  morphology = 'stem_en, libstemmer_sv'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology' => 'stem_en, libstemmer_sv'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'");
CONFIG
table products {
  morphology = stem_en, libstemmer_sv

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

morphology_skip_fields

morphology_skip_fields = field1[, field2, ...]

A list of fields to skip morphology preprocessing. Optional, default is empty (apply preprocessors to all fields).

SQL
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'
JSON
POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology_skip_fields' => 'name',
            'morphology' => 'stem_en'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'");
CONFIG
table products {
  morphology_skip_fields = name
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_field = name
  rt_attr_uint = price
}

min_stemming_len

min_stemming_len = length

Minimum word length at which to enable stemming. Optional, default is 1 (stem everything).

Stemmers are not perfect, and might sometimes produce undesired results. For instance, running "gps" keyword through Porter stemmer for English results in "gp", which is not really the intent. min_stemming_len feature lets you suppress stemming based on the source word length, ie. to avoid stemming too short words. Keywords that are shorter than the given threshold will not be stemmed. Note that keywords that are exactly as long as specified will be stemmed. So in order to avoid stemming 3-character keywords, you should specify 4 for the value. For more finely grained control, refer to wordforms feature.

SQL
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'min_stemming_len' => '4',
             'morphology' => 'stem_en'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'");
CONFIG
table products {
  min_stemming_len = 4
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_exact_words

index_exact_words = {0|1}

This option allows for the indexing of original keywords along with their morphologically modified versions. However, original keywords that are remapped by the wordforms and exceptions cannot be indexed. The default value is 0, indicating that this feature is disabled by default.

This allows the use of the exact form operator in the query language. Enabling this feature will increase the full-text index size and indexing time, but will not impact search performance.

SQL
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'index_exact_words' => '1',
             'morphology' => 'stem_en'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'");
CONFIG
table products {
  index_exact_words = 1
  morphology = stem_en

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
¶ 6.3.10

Advanced HTML tokenization

Stripping HTML tags

html_strip

html_strip = {0|1}

This option determines whether HTML markup should be stripped from the incoming full-text data. The default value is 0, which disables stripping. To enable stripping, set the value to 1.

HTML tags and entities are considered as markup and will be processed.

HTML tags are removed, while the contents between them (e.g. everything between <p> and </p>) are left intact. You can choose to keep and index tag attributes (e.g. HREF attribute in an A tag or ALT in an IMG tag). Some well-known inline tags, such as A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, and TT, are completely removed. All other tags are treated as block level and are replaced with whitespace. For example, the text te<b>st</b> will be indexed as a single keyword 'test', while te<p>st</p> will be indexed as two keywords 'te' and 'st'.

HTML entities are decoded and replaced with their corresponding UTF-8 characters. The stripper supports both numeric forms (e.g. &#239;) and text forms (e.g. &oacute; or &nbsp;) of entities, and supports all entities specified by the HTML4 standard.

The stripper is designed to work with properly formed HTML and XHTML, but may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Please note that only the tags themselves, as well as HTML comments, are stripped. To strip the contents of the tags, including embedded scripts, see the html_remove_elements option. There are no restrictions on tag names, meaning that everything that looks like a valid tag start, end, or comment will be stripped.

SQL
CREATE TABLE products(title text, price float) html_strip = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) html_strip = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_strip' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) html_strip = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) html_strip = '1'");
CONFIG
table products {
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_index_attrs

html_index_attrs = img=alt,title; a=title;

The html_index_attrs option allows you to specify which HTML markup attributes should be indexed even though other HTML markup is stripped. The default value is empty, meaning no attributes will be indexed.
The format of the option is a per-tag enumeration of indexable attributes, as demonstrated in the example above. The contents of the specified attributes will be retained and indexed, providing a way to extract additional information from your full-text data.

SQL
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_index_attrs' => 'img=alt,title; a=title;',
            'html_strip' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = '1'");
CONFIG
table products {
  html_index_attrs = img=alt,title; a=title;
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_remove_elements

html_remove_elements = element1[, element2, ...]

A list of HTML elements whose contents, along with the elements themselves, will be stripped. Optional, the default is an empty string (do not strip contents of any elements).

This option allows you to remove the contents of elements, meaning everything between the opening and closing tags. It is useful for removing embedded scripts, CSS, etc. The short tag form for empty elements (e.g.
) is properly supported, and the text following such a tag will not be removed.

The value is a comma-separated list of element (tag) names, the contents of which should be removed. Tag names are case-insensitive.

SQL
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_remove_elements' => 'style, script',
            'html_strip' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = '1'");
CONFIG
table products {
  html_remove_elements = style, script
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Extracting important parts from HTML

index_sp

index_sp = {0|1}

Controls detection and indexing of sentence and paragraph boundaries. Optional, default is 0 (no detection or indexing).

This directive enables the detection and indexing of sentence and paragraph boundaries, making it possible for the SENTENCE and PARAGRAPH operators to work. Sentence boundary detection is based on plain text analysis, and only requires setting index_sp = 1 to enable it. Paragraph detection, however, relies on HTML markup and occurs during the [HTML stripping process](06-creating_a_table.html#toc-06.03.10-advanced_html_tokenization#html_strip. As such, to index paragraph boundaries, both the index_sp directive and the html_strip directive must be set to 1.

The following rules are used to determine sentence boundaries:

Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.

SQL
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_sp' => '1',
            'html_strip' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = '1'");
CONFIG
table products {
  index_sp = 1
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_zones

index_zones = h*, th, title

A list of HTML/XML zones within a field to be indexed. The default is an empty string (no zones will be indexed).

A "zone" is defined as everything between an opening and a matching closing tag, and all spans sharing the same tag name are referred to as a "zone." For example, everything between <H1> and </H1> in a document field belongs to the H1 zone.

The index_zones directive enables zone indexing, but the HTML stripper must also be enabled (by setting html_strip = 1). The value of index_zones should be a comma-separated list of tag names and wildcards (ending with a star) to be indexed as zones.

Zones can be nested and overlap, as long as every opening tag has a matching tag. Zones can also be used for matching with the ZONE operator, as described in the extended_query_syntax.

SQL
CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'
JSON
POST /cli -d "
CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'"
PHP
$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_zones' => 'h*,th,title',
            'html_strip' => '1'
        ]);
Python
utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'')
javascript
res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h, th, title\' html_strip = \'1\'');
Java
utilsApi.sql("CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'");
C#
utilsApi.Sql("CREATE TABLE products(title text, price float) index_zones = 'h, th, title' html_strip = '1'");
CONFIG
table products {
  index_zones = h*, th, title
  html_strip = 1

  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}
¶ 6.4

Creating a distributed table

Manticore allows for the creation of distributed tables, which act like regular plain or real-time tables, but are actually a collection of child tables used for searching. When a query is sent to a distributed table, it is distributed among all tables in the collection. The server then collects and processes the responses to sort and recalculate values of aggregates, if necessary.

From the client's perspective, it appears as if they are querying a single table.

Distributed tables can be composed of any combination of tables, including:

Mixing percolate and template tables with plain and real-time tables is not recommended.

A distributed table is defined as type 'distributed' in the configuration file or through the SQL clause CREATE TABLE

In a configuration file

table foo {
    type = distributed
    local = bar
    local = bar1, bar2
    agent = 127.0.0.1:9312:baz
    agent = host1|host2:tbl
    agent = host1:9301:tbl1|host2:tbl2 [ha_strategy=random retry_count=10]
    ...
}

Via SQL

CREATE TABLE distributed_index type='distributed' local='local_index' agent='127.0.0.1:9312:remote_index'

Children

The essence of a distributed table lies in its list of child tables, to which it points. There are two types of child tables in a distributed table:

  1. Local tables: These are tables that are served within the same server as the distributed table. To enumerate local tables, you use the syntax local =. You can list several local tables using multiple local = lines, or combine them into one list separated by commas.

  2. Remote tables: These are tables that are served anywhere outside the server. To enumerate remote tables, you use the syntax agent =. Each line represents one endpoint or agent. Each agent can have multiple external locations and options for how it should work. More details here. It is important to note that the server does not have any information about the type of table it is working with. This may lead to errors if, for example, you issue a CALL PQ to a remote table 'foo' that is not a percolate table.

¶ 6.4.1

Creating a local distributed table

A distributed table in Manticore Search acts as a "master node" that proxies the demanded query to other tables and provides merged results from the responses it receives. The table doesn't hold any data on its own. It can connect to both local tables and tables located on other servers. Here's an example of a simple distributed table:

Configuration file
table index_dist {
  type  = distributed
  local = index1
  local = index2
  ...
 }
RT mode
CREATE TABLE local_dist type='distributed' local='index1' local='index2';
PHP
$params = [
    'body' => [
        'settings' => [
            'type' => 'distributed',
            'local' => [
                'index1',
                'index2'
            ]
        ]
    ],
    'index' => 'products'
];
$index = new \Manticoresearch\Index($client);
$index->create($params);
Python
utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'')
javascript
res = await utilsApi.sql('CREATE TABLE local_dist type=\'distributed\' local=\'index1\' local=\'index2\'');
Java
utilsApi.sql("CREATE TABLE local_dist type='distributed' local='index1' local='index2'");
C#
utilsApi.Sql("CREATE TABLE local_dist type='distributed' local='index1' local='index2'");
¶ 6.4.2

Remote tables

A remote table in Manticore Search is represented by the agent prefix in the definition of a distributed table. A distributed table can include a combination of local and remote tables. If there are no local tables provided, the distributed table will be purely remote and serve as a proxy only. For example, you might have a Manticore instance that listens on multiple ports and serves different protocols, and then redirects queries to backend servers that only accept connections via Manticore's internal binary protocol, using persistent connections to reduce the overhead of establishing connections.
Even though a purely remote distributed table doesn't serve local tables itself, it still consumes machine resources, as it still needs to perform final calculations, such as merging results and calculating final aggregated values.

agent

agent = address1 [ | address2 [...] ][:table-list]
agent = address1[:table-list [ | address2[:table-list [...] ] ] ]

agent directive declares the remote agents that are searched each time the enclosing distributed table is searched. These agents are essentially pointers to networked tables. The value specified includes the address and can also include multiple alternatives (agent mirrors) for either the address only or the address and table list.

The address specification must be one of the following:

address = hostname[:port] # eg. server2:9312
address = /absolute/unix/socket/path # eg. /var/run/manticore2.sock

The hostname is the remote host name, port is the remote TCP port number, table-list is a comma-separated list of table names, and square brackets [] indicate an optional clause.

If the table name is omitted, it is assumed to be the same table as the one where this line is defined. In other words, when defining agents for the 'mycoolindex' distributed table, you can simply point to the address, and it will be assumed that you are querying the mycoolindex table on the agent's endpoints.

If the port number is omitted, it is assumed to be 9312. If it is defined but invalid (e.g. 70000), the agent will be skipped.

You can point each agent to one or more remote tables residing on one or more networked servers with no restrictions. This enables several different usage modes:

All agents are searched in parallel. The index list is passed verbatim to the remote agent. The exact way that list is searched within the agent (i.e. sequentially or in parallel) depends solely on the agent's configuration (see the threads setting). The master has no remote control over this.

It is important to note that the LIMIT, option is ignored in agent queries. This is because each agent can contain different tables, so it is the responsibility of the client to apply the limit to the final result set. This is why the query to a physical table is different from the query to a distributed table when viewed in the query logs. The query cannot be a simple copy of the original query, as this would not produce the correct results.

For example, if a client makes a query SELECT ... LIMIT 10, 10, and there are two agents, with the second agent having only 10 documents, broadcasting the original LIMIT 10, 10 query would result in receiving 0 documents from the second agent. However, LIMIT 10,10 should return documents 10-20 from the resulting set. To resolve this, the query must be sent to the agents with a broader limit, such as the default max_matches value of 1000.

For instance, if there is a distributed table dist that refers to the remote table user, a client query SELECT * FROM dist LIMIT 10,10 would be converted to SELECT * FROM user LIMIT 0,1000 and sent to the remote table user. Once the distributed table receives the result, it will apply the LIMIT 10,10 and return the requested 10 documents.

SELECT * FROM dist LIMIT 10,10;

the query will be converted to:

SELECT * FROM user LIMIT 0,1000

Additionally, the value can specify options for each individual agent, such as:

agent = address1:table-list[[ha_strategy=value, conn=value, blackhole=value]]

Example:

# config on box1
# sharding a table over 3 servers
agent = box2:9312:shard1
agent = box3:9312:shard2

# config on box2
# sharding a table over 3 servers
agent = box1:9312:shard2
agent = box3:9312:shard3

# config on box3
# sharding a table over 3 servers
agent = box1:9312:shard1
agent = box2:9312:shard3

# per agent options
agent = box1:9312:shard1[ha_strategy=nodeads]
agent = box2:9312:shard2[conn=pconn]
agent = box2:9312:shard2[conn=pconn,ha_strategy=nodeads]
agent = test:9312:any[blackhole=1]
agent = test:9312|box2:9312|box3:9312:any2[retry_count=2]
agent = test:9312|box2:9312:any2[retry_count=2,conn=pconn,ha_strategy=noerrors]

For optimal performance, it's recommended to place remote tables that reside on the same server within the same record. For instance, instead of:

agent = remote:9312:idx1
agent = remote:9312:idx2

you should prefer:

agent = remote:9312:idx1,idx2

agent_persistent

agent_persistent = remotebox:9312:index2

The agent_persistent option allows you to persistently connect to an agent, meaning the connection will not be dropped after a query is executed. The syntax for this directive is the same as the agent directive. However, instead of opening a new connection to the agent for each query and then closing it, the master will keep a connection open and reuse it for subsequent queries. The maximum number of persistent connections per agent host is defined by the persistent_connections_limit option in the searchd section.

It's important to note that the persistent_connections_limit must be set to a value greater than 0 in order to use persistent agent connections. If it's not defined, it defaults to 0, and the agent_persistent directive will act the same as the agentdirective.

Using persistent master-agent connections reduces TCP port pressure and saves time on connection handshakes, making it more efficient.

agent_blackhole

agent_blackhole = testbox:9312:testindex1,testindex2

The agent_blackhole directive allows you to forward queries to remote agents without waiting for or processing their responses. This is useful for debugging or testing production clusters, as you can set up a separate debugging/testing instance and forward requests to it from the production master (aggregator) instance, without interfering with production work. The master searchd will attempt to connect to the blackhole agent and send queries as normal, but will not wait for or process any responses, and all network errors on the blackhole agents will be ignored. The format of the value is identical to that of the regular agent directive.

agent_connect_timeout

agent_connect_timeout = 300

The agent_connect_timeout directive defines the timeout for connecting to remote agents. By default, the value is assumed to be in milliseconds, but can have another suffix). The default value is 1000 (1 second).

When connecting to remote agents, searchd will wait for this amount of time at most to complete the connection successfully. If the timeout is reached but the connection has not been established, and retries are enabled, a retry will be initiated.

agent_query_timeout

agent_query_timeout = 10000 # our query can be long, allow up to 10 sec

The agent_query_timeout sets the amount of time that searchd will wait for a remote agent to complete a query. The default value is 3000 milliseconds (3 seconds), but can be suffixed to indicate a different unit of time.

After establishing a connection, searchd will wait for a maximum of agent_query_timeout for remote queries to complete. Note that this timeout is separate from the agent_connection_timeout and the total possible delay caused by a remote agent will be the sum of both values. If the agent_query_timeout is reached, the query will not be retried, instead, a warning will be produced.

Note that behavior is also affected by reset_network_timeout_on_packet

agent_retry_count

The agent_retry_count is an integer that specifies how many times Manticore will attempt to connect and query remote agents in a distributed table before reporting a fatal query error. It works similarly to agent_retry_count defined in the "searchd" section of the configuration file but applies specifically to the table.

mirror_retry_count

mirror_retry_count serves the same purpose as agent_retry_count. If both values are provided, mirror_retry_count will take precedence, and a warning will be raised.

Instance-wide options

The following options manage the overall behavior of remote agents and are specified in the searchd section of the configuration file. They set default values for the entire Manticore instance.

Note, that if you use agent mirrors in the definition of your distributed table, the server will select a different mirror before each connection attempt according to the specified ha_strategy specified. In this case the agent_retry_count will be aggregated for all mirrors in the set.

For example, if you have 10 mirrors and set agent_retry_count=5, he server will attempt up to 50 retries (assuming an average of 5 tries per every 10 mirrors). In case of the option ha_strategy = roundrobin, it will actually be exactly 5 tries per mirror.

At the same time, the value provided as the retry_count option in the agent definition serves as an absolute limit. In other words, the [retry_count=2] option in the agent definition means there will be a maximum of 2 tries, regardless of whether there is 1 or 10 mirrors in the line.

agent_retry_delay

The agent_retry_delay is an integer value that determines the amount of time, in milliseconds, that Manticore Search will wait before retrying to query a remote agent in case of a failure. This value can be specified either globally in the searchd configuration or on a per-query basis using the OPTION retry_delay=XXX clause. If both options are provided, the per-query option will take precedence over the global one. The default value is 500 milliseconds (0.5 seconds). This option is only relevant if agent_retry_count or the per-query OPTION retry_count are non-zero.

client_timeout

The client_timeout option sets the maximum waiting time between requests when using persistent connections. This value is expressed in seconds or with a time suffix. The default value is 5 minutes.

Example:

client_timeout = 1h

hostname_lookup

The hostname_lookup option defines the strategy for renewing hostnames. By default, the IP addresses of agent host names are cached at server start to avoid excessive access to DNS. However, in some cases, the IP can change dynamically (e.g. cloud hosting) and it may be desirable to not cache the IPs. Setting this option to request disables the caching and queries the DNS for each query. The IP addresses can also be manually renewed using the FLUSH HOSTNAMES command.

listen_tfo

The listen_tfo option allows for the use of the TCP_FASTOPEN flag for all listeners. By default, it is managed by the system, but it can be explicitly turned off by setting it to '0'.

For more information about the TCP Fast Open extension, please refer to Wikipedia. In short, it allows to eliminate one TCP round-trip when establishing a connection.

In practice, using TFO can optimize the client-agent network efficiency, similar to when agent_persistent is in use, but without holding active connections and without limitations on the maximum number of connections.

Most modern operating systems support TFO. Linux (as one of the most progressive) has supported it since 2011, with kernels starting from 3.7 (for the server side). Windows has supported it since some builds of Windows 10. Other systems, such as FreeBSD and MacOS, are also in the game.

For Linux systems, the server checks the variable /proc/sys/net/ipv4/tcp_fastopen and behaves accordingly. Bit 0 manages the client side, while bit 1 rules the listeners. By default, the system has this parameter set to 1, i.e., clients are enabled and listeners are disabled.

persistent_connections_limit

persistent_connections_limit = 29 # assume that each host of agents has max_connections = 30 (or 29).

The persistent_connections_limit option defines the maximum number of simultaneous persistent connections to remote persistent agents. This is an instance-wide setting and must be defined in the searchd configuration section. Each time a connection to an agent defined under agent_persistent is made, we attempt to reuse an existing connection (if one exists) or create a new connection and save it for future use. However, in some cases it may be necessary to limit the number of persistent connections. This directive defines the limit and affects the number of connections to each agent's host across all distributed tables.

It is recommended to set this value equal to or less than the max_connections option in the agent's configuration.

Distributed snippets creation

A special case of a distributed table is a single local and multiple remotes, which is used exclusively for distributed snippets creation, when snippets are sourced from files. In this case, the local table may act as a "template" table, providing settings for tokenization when building snippets.

snippets_file_prefix

snippets_file_prefix = /mnt/common/server1/

The snippets_file_prefix is an optional prefix that can be added to the local file names when generating snippets. The default value is the current working folder.

To learn more about distributed snippets creation, see CALL SNIPPETS.

Distributed percolate tables (DPQ tables)

You can create a distributed table from multiple percolate tables. The syntax for constructing this type of table is the same as for other distributed tables, and can include multiplelocal tables as well as agents.

For DPQ, the operations of listing stored queries and searching through them (using CALL PQ) are transparent and work as if all the tables were one single local table. However, data manipulation statements such as insert, replace, truncate are not available.

If you include a non-percolate table in the list of agents, the behavior will be undefined. If the incorrect agent has the same schema as the outer schema of the PQ table (id, query, tags, filters), it will not trigger an error when listing stored PQ rules, and may pollute the list of actual PQ rules stored in PQ tables with its own non-PQ strings. As a result, be cautious and aware of the confusion that this may cause. ACALL PQ to such an incorrect agent will trigger an error.

For more information on making queries to a distributed percolate table, see making queries to a distribute percolate table.