¶ 22

Extensions

¶ 22.1

SphinxSE

SphinxSE is a MySQL storage engine that can be compiled into MySQL/MariaDB servers using their pluggable architecture.

Despite its name, SphinxSE does not actually store any data itself. Instead, it serves as a built-in client that enables the MySQL server to communicate with searchd, execute search queries, and retrieve search results. All indexing and searching take place outside MySQL.

Some common SphinxSE applications include:

Installing SphinxSE

You will need to obtain a copy of MySQL sources, prepare those, and then recompile MySQL binary. MySQL sources (mysql-5.x.yy.tar.gz) could be obtained from http://dev.mysql.com website.

Compiling MySQL 5.0.x with SphinxSE

  1. copy sphinx.5.0.yy.diff patch file into MySQL sources directory and run
$ patch -p1 < sphinx.5.0.yy.diff

If there's no .diff file exactly for the specific version you need to: build, try applying .diff with closest version numbers. It is important that the patch should apply with no rejects.
2. in MySQL sources directory, run

$ sh BUILD/autorun.sh
  1. in MySQL sources directory, create sql/sphinx directory in and copy all files in mysqlse directory from Manticore sources there. Example:
$ cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.0.24/sql/sphinx
  1. configure MySQL and enable the new engine:
$ ./configure --with-sphinx-storage-engine
  1. build and install MySQL:
$ make
$ make install

Compiling MySQL 5.1.x with SphinxSE

  1. In the MySQL sources directory, create a storage/sphinx directory and copy all files from the mysqlse directory in the Manticore sources to this new location. For example:
$ cp -R /root/builds/sphinx-0.9.7/mysqlse /root/builds/mysql-5.1.14/storage/sphinx
  1. In the MySQL source directory, run:
$ sh BUILD/autorun.sh
  1. Configure MySQL and enable the Manticore engine:
$ ./configure --with-plugins=sphinx
  1. Build and install MySQL:
$ make
$ make install

Checking SphinxSE installation

To verify that SphinxSE has been successfully compiled into MySQL, start the newly built server, run the MySQL client, and issue the SHOW ENGINES query. You should see a list of all available engines. Manticore should be present, and the "Support" column should display "YES":

Req
mysql> show engines;
+------------+----------+-------------------------------------------------------------+
| Engine     | Support  | Comment                                                     |
+------------+----------+-------------------------------------------------------------+
| MyISAM     | DEFAULT  | Default engine as of MySQL 3.23 with great performance      |
  ...
| SPHINX     | YES      | Manticore storage engine                                       |
  ...
+------------+----------+-------------------------------------------------------------+
13 rows in set (0.00 sec)

Using SphinxSE

To search using SphinxSE, you'll need to create a special ENGINE=SPHINX "search table" and then use a SELECT statement with the full-text query placed in the WHERE clause for the query column.

Here's an example create statement and search query:

CREATE TABLE t1
(
    id          INTEGER UNSIGNED NOT NULL,
    weight      INTEGER NOT NULL,
    query       VARCHAR(3072) NOT NULL,
    group_id    INTEGER,
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:9312/test";

SELECT * FROM t1 WHERE query='test it;mode=any';

In a search table, the first three columns must have the following types: INTEGER UNSIGNED or BIGINT for the 1st column (document ID), INTEGER or BIGINT for the 2nd column (match weight), and VARCHAR or TEXT for the 3rd column (your query). This mapping is fixed; you cannot omit any of these three required columns, move them around, or change their types. Additionally, the query column must be indexed, while all others should remain unindexed. Column names are ignored, so you can use any arbitrary names.

Additional columns must be either INTEGER, TIMESTAMP, BIGINT, VARCHAR, or FLOAT. They will be bound to attributes provided in the Manticore result set by name, so their names must match the attribute names specified in sphinx.conf. If there's no matching attribute name in the Manticore search results, the column will have NULL values.

Special "virtual" attribute names can also be bound to SphinxSE columns. Use _sph_ instead of @ for that purpose. For example, to obtain the values of @groupby, @count, or @distinct virtual attributes, use _sph_groupby, _sph_count, or _sph_distinct column names, respectively.

The CONNECTION string parameter is used to specify the Manticore host, port, and table. If no connection string is specified in CREATE TABLE, the table name * (i.e., search all tables) and localhost:9312 are assumed. The connection string syntax is as follows:

CONNECTION="sphinx://HOST:PORT/TABLENAME"

You can change the default connection string later:

mysql> ALTER TABLE t1 CONNECTION="sphinx://NEWHOST:NEWPORT/NEWTABLENAME";

You can also override these parameters on a per-query basis.

As shown in the example, both the query text and search options should be placed in the WHERE clause on the search query column (i.e., the 3rd column). Options are separated by semicolons and their names from values by an equality sign. Any number of options can be specified. The available options are:

... WHERE query='test;sort=attr_asc:group_id';
... WHERE query='test;sort=extended:@weight desc, group_id asc';
... WHERE query='test;index=test1;';
... WHERE query='test;index=test1,test2,test3;';
... WHERE query='test;weights=1,2,3;';
# only include groups 1, 5 and 19
... WHERE query='test;filter=group_id,1,5,19;';
# exclude groups 3 and 11
... WHERE query='test;!filter=group_id,3,11;';
# include groups from 3 to 7, inclusive
... WHERE query='test;range=group_id,3,7;';
# exclude groups from 5 to 25
... WHERE query='test;!range=group_id,5,25;';
# filter by a float size
... WHERE query='test;floatrange=size,2,3;';
# pick all results within 1000 meter from geoanchor
... WHERE query='test;floatrange=@geodist,0,1000;';
... WHERE query='test;maxmatches=2000;';
... WHERE query='test;cutoff=10000;';
... WHERE query='test;maxquerytime=1000;';
... WHERE query='test;groupby=day:published_ts;';
... WHERE query='test;groupby=attr:group_id;';
... WHERE query='test;groupsort=@count desc;';
... WHERE query='test;groupby=attr:country_id;distinct=site_id';
... WHERE query='test;indexweights=tbl_exact,2,tbl_stemmed,1;';
... WHERE query='test;fieldweights=title,10,abstract,3,content,1;';
... WHERE query='test;comment=marker001;';
... WHERE query='test;select=2*a+3*** as myexpr;';
... WHERE query='test;host=sphinx-test.loc;port=7312;';
... WHERE query='test;mode=extended;ranker=bm25;';
... WHERE query='test;mode=extended;ranker=expr:sum(lcs);';

The "export" ranker functions similarly to ranker=expr, but it retains the per-document factor values, while ranker=expr discards them after computing the final WEIGHT() value. Keep in mind that ranker=export is intended for occasional use, such as training a machine learning (ML) function or manually defining your own ranking function, and should not be used in actual production. When utilizing this ranker, you'll likely want to examine the output of the RANKFACTORS() function, which generates a string containing all the field-level factors for each document.

Req
SELECT *, WEIGHT(), RANKFACTORS()
    FROM myindex
    WHERE MATCH('dog')
    OPTION ranker=export('100*bm25');
*************************** 1\. row ***************************
           id: 555617
    published: 1110067331
   channel_id: 1059819
        title: 7
      content: 428
     weight(): 69900
rankfactors(): bm25=699, bm25a=0.666478, field_mask=2,
doc_word_count=1, field1=(lcs=1, hit_count=4, word_count=1,
tf_idf=1.038127, min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532,
min_hit_pos=120, min_best_span_pos=120, exact_hit=0,
max_window_hits=1), word1=(tf=4, idf=0.259532)

*************************** 2\. row ***************************
           id: 555313
    published: 1108438365
   channel_id: 1058561
        title: 8
      content: 249
     weight(): 68500
rankfactors(): bm25=685, bm25a=0.675213, field_mask=3,
doc_word_count=1, field0=(lcs=1, hit_count=1, word_count=1,
tf_idf=0.259532, min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532,
min_hit_pos=8, min_best_span_pos=8, exact_hit=0, max_window_hits=1),
field1=(lcs=1, hit_count=2, word_count=1, tf_idf=0.519063,
min_idf=0.259532, max_idf=0.259532, sum_idf=0.259532, min_hit_pos=36,
min_best_span_pos=36, exact_hit=0, max_window_hits=1), word1=(tf=3,
idf=0.259532)
... WHERE query='test;geoanchor=latattr,lonattr,0.123,0.456';

One very important note is that it is much more efficient to let Manticore handle sorting, filtering, and slicing the result set, rather than increasing the max matches count and using WHERE, ORDER BY, and LIMIT clauses on the MySQL side. This is due to two reasons. First, Manticore employs a variety of optimizations and performs these tasks better than MySQL. Second, less data would need to be packed by searchd, transferred, and unpacked by SphinxSE.

You can obtain additional information related to the query results using the SHOW ENGINE SPHINX STATUS statement:

Req
mysql> SHOW ENGINE SPHINX STATUS;
+--------+-------+-------------------------------------------------+
| Type   | Name  | Status                                          |
+--------+-------+-------------------------------------------------+
| SPHINX | stats | total: 25, total found: 25, time: 126, words: 2 |
| SPHINX | words | sphinx:591:1256 soft:11076:15945                |
+--------+-------+-------------------------------------------------+
2 rows in set (0.00 sec)

You can also access this information through status variables. Keep in mind that using this method does not require super-user privileges.

Req
mysql> SHOW STATUS LIKE 'sphinx_%';
+--------------------+----------------------------------+
| Variable_name      | Value                            |
+--------------------+----------------------------------+
| sphinx_total       | 25                               |
| sphinx_total_found | 25                               |
| sphinx_time        | 126                              |
| sphinx_word_count  | 2                                |
| sphinx_words       | sphinx:591:1256 soft:11076:15945 |
+--------------------+----------------------------------+
5 rows in set (0.00 sec)

SphinxSE search tables can be joined with tables using other engines. Here's an example using the "documents" table from example.sql:

Req
mysql> SELECT content, date_added FROM test.documents docs
-> JOIN t1 ON (docs.id=t1.id)
-> WHERE query="one document;mode=any";

mysql> SHOW ENGINE SPHINX STATUS;
+-------------------------------------+---------------------+
| content                             | docdate             |
+-------------------------------------+---------------------+
| this is my test document number two | 2006-06-17 14:04:28 |
| this is my test document number one | 2006-06-17 14:04:28 |
+-------------------------------------+---------------------+
2 rows in set (0.00 sec)

+--------+-------+---------------------------------------------+
| Type   | Name  | Status                                      |
+--------+-------+---------------------------------------------+
| SPHINX | stats | total: 2, total found: 2, time: 0, words: 2 |
| SPHINX | words | one:1:2 document:2:2                        |
+--------+-------+---------------------------------------------+
2 rows in set (0.00 sec)

Building snippets via MySQL

SphinxSE also features a UDF function that allows you to create snippets using MySQL. This functionality is similar to HIGHLIGHT(), but can be accessed through MySQL+SphinxSE.

The binary providing the UDF is called sphinx.so and should be automatically built and installed in the appropriate location along with SphinxSE. If it doesn't install automatically for some reason, locate sphinx.so in the build directory and copy it to your MySQL instance's plugins directory. Once done, register the UDF with the following statement:

CREATE FUNCTION sphinx_snippets RETURNS STRING SONAME 'sphinx.so';

The function name must be sphinx_snippets; you cannot use an arbitrary name. The function arguments are as follows:

Prototype: function sphinx_snippets ( document, table, words [, options] );

The document and words arguments can be either strings or table columns. Options must be specified like this: 'value' AS option_name. For a list of supported options, refer to the Highlighting section. The only UDF-specific additional option is called sphinx and allows you to specify the searchd location (host and port).

Usage examples:

SELECT sphinx_snippets('hello world doc', 'main', 'world',
    'sphinx://192.168.1.1/' AS sphinx, true AS exact_phrase,
    '[**]' AS before_match, '[/**]' AS after_match)
FROM documents;

SELECT title, sphinx_snippets(text, 'index', 'mysql php') AS text
    FROM sphinx, documents
    WHERE query='mysql php' AND sphinx.id=documents.id;
¶ 22.2

FEDERATED

With the MySQL FEDERATED engine, you can connect to a local or remote Manticore instance from MySQL/MariaDB and perform search queries.

Using FEDERATED

An actual Manticore query can't be used directly with the FEDERATED engine and must be "proxied" (sent as a string in a column) due to the FEDERATED engine's limitations and the fact that Manticore implements custom syntax like the MATCH clause.

To search via FEDERATED, you first need to create a FEDERATED engine table. The Manticore query will be included in a query column in the SELECT performed over the FEDERATED table.

Creating a FEDERATED-compatible MySQL table:

SQL
CREATE TABLE t1
(
    id          INTEGER UNSIGNED NOT NULL,
    year        INTEGER NOT NULL,
    rating      FLOAT,
    query       VARCHAR(1024) NOT NULL,
    INDEX(query)
) ENGINE=FEDERATED
DEFAULT CHARSET=utf8
CONNECTION='mysql://FEDERATED@127.0.0.1:9306/DB/movies';
Query OK, 0 rows affected (0.00 sec)

Query FEDERATED compatible table:

SQL
SELECT * FROM t1 WHERE query='SELECT * FROM movies WHERE MATCH (\'pie\')';
+----+------+--------+------------------------------------------+
| id | year | rating | query                                    |
+----+------+--------+------------------------------------------+
|  1 | 2019 |      5 | SELECT * FROM movies WHERE MATCH ('pie') |
+----+------+--------+------------------------------------------+
1 row in set (0.04 sec)

The only fixed mapping is the query column. It is mandatory and must be the only column with a table attached.

The Manticore table linked via FEDERATED must be a physical table (plain or real-time).

The FEDERATED table should have columns with the same names as the remote Manticore table attributes since they will be bound to the attributes provided in the Manticore result set by name. However, it might map only some attributes, not all of them.

Manticore server identifies a query from a FEDERATED client by the user name "FEDERATED". The CONNECTION string parameter is used to specify the Manticore host, SQL port, and tables for queries coming through the connection. The connection string syntax is as follows:

CONNECTION="mysql://FEDERATED@HOST:PORT/DB/TABLENAME"

Since Manticore doesn't have the concept of a database, the DB string can be random as it will be ignored by Manticore, but MySQL requires a value in the CONNECTION string definition. As seen in the example, the full SELECT SQL query should be placed in a WHERE clause against the query column.

Only the SELECT statement is supported, not INSERT, REPLACE, UPDATE, or DELETE.

FEDERATED tips

One very important note is that it is much more efficient to allow Manticore to perform sorting, filtering, and slicing the result set than to increase the max matches count and use WHERE, ORDER BY, and LIMIT clauses on the MySQL side. This is for two reasons. First, Manticore implements a number of optimizations and performs better than MySQL for these tasks. Second, less data needs to be packed by searchd, transferred, and unpacked between Manticore and MySQL.

JOINs can be performed between a FEDERATED table and other MySQL tables. This can be used to retrieve information that is not stored in a Manticore table.

SQL
SELECT t1.id, t1.year, comments.comment FROM t1 JOIN comments ON t1.id=comments.post_id WHERE query='SELECT * FROM movies WHERE MATCH (\'pie\')';
+----+------+--------------+
| id | year | comment      |
+----+------+--------------+
|  1 | 2019 | was not good |
+----+------+--------------+
1 row in set (0.00 sec)
¶ 22.3

UDFs and Plugins

Manticore can be extended with user-defined functions, or UDFs for short, like this:

SELECT id, attr1, myudf (attr2, attr3+attr4) ...

You can dynamically load and unload UDFs into searchd without having to restart the server, and use them in expressions when searching, ranking, etc. A quick summary of the UDF features is as follows:

We do not yet support aggregation functions. In other words, your UDFs will be called for just a single document at a time and are expected to return some value for that document. Writing a function that can compute an aggregate value like AVG() over the entire group of documents that share the same GROUP BY key is not yet possible. However, you can use UDFs within the built-in aggregate functions: that is, even though MYCUSTOMAVG() is not supported yet, AVG(MYCUSTOMFUNC()) should work just fine!

UDFs offer a wide range of applications, such as:

Plugins

Plugins offer additional opportunities to expand search functionality. They can currently be used to compute custom rankings and tokenize documents and queries.

Here's the complete list of plugin types:

This section covers the general process of writing and managing plugins; specifics related to creating different types of plugins are discussed in their respective subsections.

So, how do you write and use a plugin? Here's a quick four-step guide:

Note that while UDFs are first-class plugins, they are installed using a separate CREATE FUNCTION statement. This allows for a neat specification of the return type, without sacrificing backward compatibility or changing the syntax.

Dynamic plugins are supported in threads and thread_pool workers. Multiple plugins (and/or UDFs) can be contained in a single library file. You may choose to either group all project-specific plugins in one large library or create a separate library for each UDF and plugin; it's up to you.

As with UDFs, you should include the src/sphinxudf.h header file. At the very least, you'll need the SPH_UDF_VERSION constant to implement an appropriate version function. Depending on the specific plugin type, you may or may not need to link your plugin with src/sphinxudf.c. However, all functions implemented in sphinxudf.c are related to unpacking the PACKEDFACTORS() blob, and no plugin types have access to that data. So currently, linking with just the header should suffice. (In fact, if you copy over the UDF version number, you won't even need the header file for some plugin types.)

Formally, plugins are simply sets of C functions that adhere to a specific naming pattern. You're typically required to define one key function for the primary task, but you can also define additional functions. For instance, to implement a ranker called "myrank", you must define a myrank_finalize() function that returns the rank value. However, you can also define myrank_init(), myrank_update(), and myrank_deinit() functions. Specific sets of well-known suffixes and call arguments differ based on the plugin type, but _init() and _deinit() are generic, and every plugin has them. Hint: for a quick reference on known suffixes and their argument types, refer to sphinxplugin.h, where the call prototypes are defined at the beginning of the file.

Even though the public interface is defined in pure C, our plugins essentially follow an object-oriented model. Indeed, every _init() function receives a void ** userdata out-parameter, and the pointer value stored at (*userdata) is then passed as the first argument to all other plugin functions. So you can think of a plugin as a class that gets instantiated every time an object of that class is needed to handle a request: the userdata pointer serves as the this pointer; the functions act as methods, and the _init() and _deinit() functions work as constructor and destructor, respectively.

This minor OOP-in-C complication arises because plugins run in a multi-threaded environment, and some need to maintain state. You can't store that state in a global variable in your plugin, so we pass around a userdata parameter, which naturally leads to the OOP model. If your plugin is simple and stateless, the interface allows you to omit _init(), _deinit(), and any other functions.

To summarize, here's the simplest complete ranker plugin in just three lines of C code:

// gcc -fPIC -shared -o myrank.so myrank.c
#include "sphinxudf.h"
int myrank_ver() { return SPH_UDF_VERSION; }
int myrank_finalize(void *u, int w) { return 123; }

Here's how to use the simple ranker plugin:

mysql> CREATE PLUGIN myrank TYPE 'ranker' SONAME 'myrank.dll';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT id, weight() FROM test1 WHERE MATCH('test') OPTION ranker=myrank('');
+------+----------+
| id   | weight() |
+------+----------+
|    1 |      123 |
|    2 |      123 |
+------+----------+
2 rows in set (0.01 sec)
¶ 22.3.1

Listing plugins

SHOW PLUGINS

SHOW PLUGINS

Displays all the loaded plugins (except for Buddy plugins, see below) and UDFs. The "Type" column should be one of the udf, ranker, index_token_filter, or query_token_filter. The "Users" column is the number of thread that are currently using that plugin in a query. The "Extra" column is intended for various additional plugin-type specific information; currently, it shows the return type for the UDFs and is empty for all the other plugin types.

Example
SHOW PLUGINS;
+------+----------+----------------+-------+-------+
| Type | Name     | Library        | Users | Extra |
+------+----------+----------------+-------+-------+
| udf  | sequence | udfexample.dll | 0     | INT   |
+------+----------+----------------+-------+-------+
1 row in set (0.00 sec)

SHOW BUDDY PLUGINS

SHOW BUDDY PLUGINS

This will display all available plugins, including core and local ones.
To remove a plugin, make sure to use the name listed in the Package column.

Example
SHOW BUDDY PLUGINS;
+------------------------------------------------+-----------------+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| Package                                        | Plugin          | Version | Type | Info                                                                                                                                                     |
+------------------------------------------------+-----------------+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+
| manticoresoftware/buddy-plugin-empty-string    | empty-string    | 2.1.5   | core | Handles empty queries, which can occur when trimming comments or dealing with specific SQL protocol instructions in comments that are not supported      |
| manticoresoftware/buddy-plugin-backup          | backup          | 2.1.5   | core | BACKUP sql statement                                                                                                                                     |
| manticoresoftware/buddy-plugin-emulate-elastic | emulate-elastic | 2.1.5   | core | Emulates some Elastic queries and generates responses as if they were made by ES                                                                         |
| manticoresoftware/buddy-plugin-insert          | insert          | 2.1.5   | core | Auto schema support. When an insert operation is performed and the table does not exist, it creates it with data types auto-detection                    |
| manticoresoftware/buddy-plugin-alias           | alias           | 2.1.5   | core |                                                                                                                                                          |
| manticoresoftware/buddy-plugin-select          | select          | 2.1.5   | core | Various SELECTs handlers needed for mysqldump and other software support, mostly aiming to work similarly to MySQL                                       |
| manticoresoftware/buddy-plugin-show            | show            | 2.1.5   | core | Various "show" queries handlers, for example, `show queries`, `show fields`, `show full tables`, etc                                                     |
| manticoresoftware/buddy-plugin-cli-table       | cli-table       | 2.1.5   | core | /cli endpoint based on /cli_json - outputs query result as a table                                                                                       |
| manticoresoftware/buddy-plugin-plugin          | plugin          | 2.1.5   | core | Core logic for plugin support and helpers. Also handles `create buddy plugin`, `delete buddy plugin`, and `show buddy plugins`                           |
| manticoresoftware/buddy-plugin-test            | test            | 2.1.5   | core | Test plugin, used exclusively for tests                                                                                                                  |
| manticoresoftware/buddy-plugin-insert-mva      | insert-mva      | 2.1.5   | core | Manages the restoration of MVA fields with mysqldump                                                                                                     |
| manticoresoftware/buddy-plugin-modify-table    | modify-table    | 2.1.5   | core | Assists in standardizing options in create and alter table statements to show option=1 for integers. Also manages the logic for creating sharded tables. |
| manticoresoftware/buddy-plugin-knn             | knn             | 2.1.5   | core | Enables KNN by document id                                                                                                                               |
| manticoresoftware/buddy-plugin-replace         | replace         | 2.1.5   | core | Enables partial replaces                                                                                                                                 |
+------------------------------------------------+-----------------+---------+------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
¶ 22.3.2

UDF

UDFs are stored in external dynamic libraries (.so files on UNIX and .dll on Windows systems). Library files must be placed in a trusted folder specified by the plugin_dir directive for security reasons: it's easier to secure a single folder than to allow anyone to install arbitrary code into searchd. You can dynamically load and unload UDFs into searchd using CREATE FUNCTION and DROP FUNCTION SQL statements, respectively. Additionally, you can seamlessly reload UDFs (and other plugins) with the RELOAD PLUGINS statement. Manticore keeps track of currently loaded functions; every time you create or drop a UDF, searchd updates its state in the sphinxql_state file as a plain SQL script.

UDFs are local. To use them on a cluster, you must place the same library on all nodes and run CREATE statements on each node as well. This process may change in future versions.

Once you successfully load a UDF, you can use it in your SELECT or other statements just like any built-in function:

SELECT id, MYCUSTOMFUNC (groupid, authorname), ... FROM myindex

Multiple UDFs (and other plugins) can reside in a single library. The library will only be loaded once and is automatically unloaded once all the UDFs and plugins within it are dropped.

In theory, you can write a UDF in any language, as long as its compiler can import standard C headers and emit standard dynamic libraries with properly exported functions. However, writing in C++ or plain C is the path of least resistance. We provide an example UDF library written in plain C that implements several functions (demonstrating various techniques) alongside our source code, found at src/udfexample.c. This example includes the src/sphinxudf.h header file, which contains definitions of several UDF-related structures and types. For most UDFs and plugins, simply using #include "sphinxudf.h" as shown in the example should be sufficient. However, if you're writing a ranking function and need to access ranking signals (factors) data from within the UDF, you'll also need to compile and link with src/sphinxudf.c (available in our source code), as the implementations of functions that let you access signal data from within the UDF reside in that file.

Both the sphinxudf.h header and sphinxudf.c are standalone, so you can copy those files individually; they don't depend on any other parts of Manticore's source code.

Within your UDF, you must implement and export only a couple of functions. First, for UDF interface version control, you must define a function int LIBRARYNAME_ver(), where LIBRARYNAME is the name of your library file, and you must return SPH_UDF_VERSION (a value defined in sphinxudf.h) from it. Here's an example.

#include <sphinxudf.h>

// our library will be called udfexample.so, thus, so it must define
// a version function named udfexample_ver()
int udfexample_ver()
{
    return SPH_UDF_VERSION;
}

This precaution protects you from accidentally loading a library with a mismatching UDF interface version into a newer or older searchd. Secondly, you must implement the actual function as well.

sphinx_int64_t testfunc ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args, char * error_flag )
{
    return 123;
}

UDF function names in SQL are case-insensitive. However, the respective C function names are not; they need to be all lower-case, or the UDF will not load. More importantly, it is crucial that:

  1. the calling convention is C (aka __cdecl),
  2. the arguments list matches the plugin system expectations exactly, and
  3. the return type matches the one you specify in CREATE FUNCTION.

Unfortunately, there is no (easy) way for us to check for these mistakes when loading the function, and they could crash the server and/or result in unexpected results. Last but not least, all the C functions you implement need to be thread-safe.

The first argument, a pointer to SPH_UDF_INIT structure, is essentially a pointer to our function state. It is optional. In the example just above, the function is stateless, as it simply returns 123 every time it gets called. So, we do not have to define an initialization function, and we can simply ignore that argument.
This argument serves one more purpose. Since a single query can be executed on multiple threads (see pseudo-sharding), the daemon tries to determine whether a UDF is stateful or stateless by checking this argument. If the argument is initialized, parallel execution will be disabled. So, if your UDF is stateful but you don't use this argument, it will be called from multiple threads, and your code needs to be aware of that.

The second argument, a pointer to SPH_UDF_ARGS, is the most important one. All the actual call arguments are passed to your UDF via this structure; it contains the call argument count, names, types, etc. So, whether your function gets called like SELECT id, testfunc(1) or like SELECT id, testfunc('abc', 1000*id+gid, WEIGHT()) or any other way, it will receive the very same SPH_UDF_ARGS structure in all of these cases. However, the data passed in the args structure will be different. In the first example, args->arg_count will be set to 1, in the second example it will be set to 3, and the args->arg_types array will contain different type data, and so on.

Finally, the third argument is an error flag. A UDF can raise it to indicate that some kind of internal error occurred, the UDF cannot continue, and the query should terminate early. You should not use this for argument type checks or for any other error reporting that is likely to happen during normal use. This flag is designed to report sudden critical runtime errors, such as running out of memory.

If we wanted to, say, allocate temporary storage for our function to use, or check upfront whether the arguments are of the supported types, then we would need to add two more functions, for UDF initialization and deinitialization, respectively.

int testfunc_init ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
    char * error_message )
{
    // allocate and initialize a little bit of temporary storage
    init->func_data = malloc ( sizeof(int) );

    *(int*)init->func_data = 123;

    // return a success code
    return 0;
}

void testfunc_deinit ( SPH_UDF_INIT * init )
{
    // free up our temporary storage
    free ( init->func_data );
}

Note how testfunc_init() also receives the call arguments structure. By the time it is called, it does not receive any actual values, so the args->arg_values will be NULL. But the argument names and types are known and will be passed. You can check them in the initialization function and return an error if they are of an unsupported type.

SPH_UDF_ARGS types

UDFs can receive arguments of pretty much any valid internal Manticore type. Refer to the sphinx_udf_argtype enumeration in sphinxudf.h for a full list. Most of the types map straightforwardly to the respective C types.

The most notable type is the SPH_UDF_TYPE_FACTORS argument type. You get that type by calling your UDF with a PACKEDFACTOR() argument. Its data is a binary blob in a certain internal format, and to extract individual ranking signals from that blob, you need to use either of the two sphinx_factors_XXX() or sphinx_get_YYY_factor() families of functions.

sphinx_factors_XXX() functions

This family consists of 3 functions.

First, you need to call init() and unpack(), then you can use the SPH_UDF_FACTORS fields, and finally, you need to clean up with deinit().

This approach is simple but may result in a bunch of memory allocations for each processed document, which could be slow.

sphinx_get_YYY_factor() functions

The other interface, consisting of a bunch of sphinx_get_YYY_factor() functions, is a bit more verbose to use but accesses the blob data directly and guarantees no allocations. For top-notch ranking UDF performance, you'll want to use this approach.

Return types of UDF

As for the return types, UDFs can currently return a single INTEGER, BIGINT, FLOAT, or STRING value. The C function return type should be sphinx_int64_t, sphinx_int64_t, double, or char* respectively. In the last case, you must use the args->fn_malloc function to allocate space for returned string values. Internally in your UDF, you can use whatever you want, so the testfunc_init() example above is correct code even though it uses malloc() directly: you manage that pointer yourself, it gets freed up using a matching free() call, and all is well. However, the returned strings values are managed by Manticore, and we have our own allocator, so for the return values specifically, you need to use it too.

Depending on how your UDFs are used in the query, the main function call (testfunc() in our example) might be called in a rather different volume and order. Specifically,

The calling sequence of the other functions is fixed, though. Namely,

¶ 22.3.2.1

CREATE FUNCTION

CREATE FUNCTION udf_name
    RETURNS {INT | INTEGER | BIGINT | FLOAT | STRING}
    SONAME 'udf_lib_file'

CREATE FUNCTION statement installs a user-defined function UDF with the specified name and type from the provided library file. The library file must be located in a trusted plugin_dir directory. Upon successful installation, the function becomes available for use in all subsequent queries received by the server. Example:

mysql> CREATE FUNCTION avgmva RETURNS INTEGER SONAME 'udfexample.dll';
Query OK, 0 rows affected (0.03 sec)

mysql> SELECT *, AVGMVA(tag) AS q from test1;
+------+--------+---------+-----------+
| id   | weight | tag     | q         |
+------+--------+---------+-----------+
|    1 |      1 | 1,3,5,7 | 4.000000  |
|    2 |      1 | 2,4,6   | 4.000000  |
|    3 |      1 | 15      | 15.000000 |
|    4 |      1 | 7,40    | 23.500000 |
+------+--------+---------+-----------+
¶ 22.3.2.2

DROP FUNCTION

DROP FUNCTION udf_name

DROP FUNCTION statement uninstalls a user-defined function UDF with the specified name. Upon successful removal, the function will no longer be available for use in subsequent queries. However, ongoing concurrent queries will not be affected, and if necessary, the library unloading will be delayed until those queries are completed. Example:

mysql> DROP FUNCTION avgmva;
Query OK, 0 rows affected (0.00 sec)
¶ 22.3.3

Plugins

¶ 22.3.3.1

System plugins

CREATE PLUGIN

CREATE PLUGIN plugin_name TYPE 'plugin_type' SONAME 'plugin_library'

Loads the given library (if it is not already loaded) and loads the specified plugin from it. The available plugin types include:

For more information on writing plugins, please refer to the plugins documentation.

mysql> CREATE PLUGIN myranker TYPE 'ranker' SONAME 'myplugins.so';
Query OK, 0 rows affected (0.00 sec)

CREATE BUDDY PLUGIN

Buddy plugins can extend Manticore Search's functionality and enable certain queries that are not natively supported. To learn more about creating Buddy plugins, we recommend reading this article.

To create a Buddy plugin, run the following SQL command:

CREATE PLUGIN <username/package name on https://packagist.org/> TYPE 'buddy' VERSION <package version>

You can also use an alias command specifically created for Buddy plugins, which is easier to remember:

CREATE BUDDY PLUGIN <username/package name on https://packagist.org/> VERSION <package version>

This command will install the show-hostname plugin to the plugin_dir and enable it without the need to restart the server.

Examples

Example
CREATE PLUGIN manticoresoftware/buddy-plugin-show-hostname TYPE 'buddy' VERSION 'dev-main';

CREATE BUDDY PLUGIN manticoresoftware/buddy-plugin-show-hostname VERSION 'dev-main';
¶ 22.3.3.2

DELETE PLUGIN

DROP PLUGIN plugin_name TYPE 'plugin_type'

Marks the designated plugin for unloading. The unloading process is not instantaneous, as concurrent queries may still be utilizing it. Nevertheless, following a DROP, new queries will no longer have access to the plugin. Subsequently, when all ongoing queries involving the plugin have finished, the plugin will be unloaded. If all plugins from the specified library are unloaded, the library will also be automatically unloaded.

mysql> DROP PLUGIN myranker TYPE 'ranker';
Query OK, 0 rows affected (0.00 sec)

DELETE BUDDY PLUGIN

DELETE BUDDY PLUGIN <username/package name on https://packagist.org/>

This action instantly and permanently removes the installed plugin from the plugin_dir. Once removed, the plugin's features will no longer be available.

Example

Example
DELETE BUDDY PLUGIN manticoresoftware/buddy-plugin-show-hostname
¶ 22.3.3.3

Enabling and disabling Buddy plugins

To simplify the control of Buddy plugins, especially when developing a new one or modifying an existing one, the enable and disable Buddy plugin commands are provided. These commands act temporarily during runtime and will reset to their defaults after restarting the daemon or performing a Buddy reset. To permanently disable a plugin, it must be removed.

You need the fully qualified package name of the plugin to enable or disable it. To find it, you can run the SHOW BUDDY PLUGINS query and look for the full qualified name in the package field. For example, the SHOW plugin has the fully qualified name manticoresoftware/buddy-plugin-show.

ENABLE BUDDY PLUGIN

ENABLE BUDDY PLUGIN <username/package name on https://packagist.org/>

This command reactivates a previously disabled Buddy plugin, allowing it to process your requests again.

Example

SQL
ENABLE BUDDY PLUGIN manticoresoftware/buddy-plugin-show

DISABLE BUDDY PLUGIN

DISABLE BUDDY PLUGIN <username/package name on https://packagist.org/>

This command deactivates an active Buddy plugin, preventing it from processing any further requests.

Example

SQL
DISABLE BUDDY PLUGIN manticoresoftware/buddy-plugin-show
After disabling, if you try the `SHOW QUERIES` command, you'll encounter an error because the plugin is disabled.
¶ 22.3.3.4

RELOADING PLUGINS

RELOAD PLUGINS FROM SONAME 'plugin_library'

Reloads all plugins (UDFs, rankers, etc.) from a given library. In a sense, the reload process is transactional, ensuring that:
1. all plugins are successfully updated to their new versions;
2. the update is atomic, meaning all plugins are replaced simultaneously. This atomicity ensures that queries using multiple functions from a reloaded library will never mix old and new versions.

During the RELOAD, the set of plugins is guaranteed to be consistent; they will either be all old or all new.

The reload process is also seamless, as some version of a reloaded plugin will always be available for concurrent queries, without any temporary disruptions. This is an improvement over using a pair of DROP and CREATE statements for reloading. With those, there is a brief window between the DROP and the subsequent CREATE during which queries technically refer to an unknown plugin and will therefore fail.

If there's any failure, RELOAD PLUGINS does nothing, retains the old plugins, and reports an error.

On Windows, overwriting or deleting a DLL library currently in use can be problematic. However, you can still rename it, place a new version under the old name, and then RELOAD will work. After a successful reload, you'll also be able to delete the renamed old library.

mysql> RELOAD PLUGINS FROM SONAME 'udfexample.dll';
Query OK, 0 rows affected (0.00 sec)
¶ 22.3.3.5

Ranker plugins

Ranker plugins let you implement a custom ranker that receives all the occurrences of the keywords matched in the document, and computes a WEIGHT() value. They can be called as follows:

SELECT id, attr1 FROM test WHERE match('hello') OPTION ranker=myranker('option1=1');

The call workflow proceeds as follows:

  1. XXX_init() is invoked once per query per table, at the very beginning. Several query-wide options are passed to it via a SPH_RANKER_INIT structure, including the user options strings (for instance, "option1=1" in the example above).
  2. XXX_update() is called multiple times for each matched document, with every matched keyword occurrence provided as its parameter, a SPH_RANKER_HIT structure. The occurrences within each document are guaranteed to be passed in ascending order of hit->hit_pos values.
  3. XXX_finalize() is called once for each matched document when there are no more keyword occurrences. It must return the WEIGHT() value. This function is the only mandatory one.
  4. XXX_deinit() is invoked once per query, at the very end.
¶ 22.3.3.6

Token filter plugins

Token filter plugins allow you to implement a custom tokenizer that creates tokens according to custom rules. There are two types:

In the text processing pipeline, token filters will run after the base tokenizer processing occurs (which processes the text from fields or queries and creates tokens out of them).

Index-time tokenizer

Index-time tokenizer is created by indexer when indexing source data into a table or by an RT table when processing INSERT or REPLACE statements.

Plugin is declared as library name:plugin name:optional string of settings. The init functions of the plugin can accept arbitrary settings that can be passed as a string in the format option1=value1;option2=value2;...

Example:

index_token_filter = my_lib.so:email_process:field=email;split=.io

The call workflow for index-time token filter is as follows:

  1. XXX_init() gets called right after indexer creates token filter with an empty fields list and then after indexer gets the table schema with the actual fields list. It must return zero for successful initialization or an error description otherwise.
  2. XXX_begin_document gets called only for RT table INSERT/REPLACE for every document. It must return zero for a successful call or an error description otherwise. Using OPTION token_filter_options, additional parameters/settings can be passed to the function.
    sql INSERT INTO rt (id, title) VALUES (1, 'some text corp@space.io') OPTION token_filter_options='.io'
  3. XXX_begin_field gets called once for each field prior to processing the field with the base tokenizer, with the field number as its parameter.
  4. XXX_push_token gets called once for each new token produced by the base tokenizer, with the source token as its parameter. It must return the token, count of extra tokens made by the token filter, and delta position for the token.
  5. XXX_get_extra_token gets called multiple times in case XXX_push_token reports extra tokens. It must return the token and delta position for that extra token.
  6. XXX_end_field gets called once right after the source tokens from the current field are processed.
  7. XXX_deinit gets called at the very end of indexing.

The following functions are mandatory to be defined: XXX_begin_document, XXX_push_token, and XXX_get_extra_token.

query-time token filter

Query-time tokenizer gets created on search each time full-text is invoked by every table involved.

The call workflow for query-time token filter is as follows:

  1. XXX_init() gets called once per table prior to parsing the query with parameters - max token length and a string set by the token_filter option
    sql SELECT * FROM index WHERE MATCH ('test') OPTION token_filter='my_lib.so:query_email_process:io'
    It must return zero for successful initialization or error description otherwise.
  2. XXX_push_token() gets called once for each new token produced by the base tokenizer with parameters: token produced by the base tokenizer, pointer to raw token at source query string, and raw token length. It must return the token and delta position for the token.
  3. XXX_pre_morph() gets called once for the token right before it gets passed to the morphology processor with a reference to the token and stopword flag. It might set the stopword flag to mark the token as a stopword.
  4. XXX_post_morph() gets called once for the token after it is processed by the morphology processor with a reference to the token and stopword flag. It might set the stopword flag to mark the token as a stopword. It must return a flag, the non-zero value of which means to use the token prior to morphology processing.
  5. XXX_deinit() gets called at the very end of query processing.

Absence of the functions is tolerated.