Overview
The indexer by default:
- Runs on port 5210
- Uses Swagger UI in development mode (endpoint: /swagger/index.html)
- Ignores API keys when in development mode
- Uses Elmah error logging (endpoint: /elmah, local files: ~/logs)
- Uses Serilog logging (local files: ~/logs)
- Uses HealthChecks (endpoint: /healthz)
Docker installation
(On Linux you might need root privileges; use sudo where necessary.)
- Configure the indexer
- Set up your indexing script(s)
- Navigate to the src directory
- Build the docker container: docker build -t embeddingsearch-indexer -f Indexer/Dockerfile .
- Run the docker container: docker run --net=host -t embeddingsearch-indexer (the -t is optional, but you get more meaningful output. Or use -d to run it in the background)
Installing the dependencies
Ubuntu 24.04
- Install the .NET SDK: sudo apt update && sudo apt install dotnet-sdk-10.0 -y
- Install the Python SDK: sudo apt install python3 python3.13 python3.13-dev
  - Note: Python 3.14 is not supported yet
Windows
Download and install the .NET SDK or follow these steps to use WSL:
- Install Ubuntu in WSL (wsl --install and wsl --install -d Ubuntu)
- Enter your WSL environment (wsl.exe) and configure it
- Update via sudo apt update && sudo apt upgrade -y && sudo snap refresh
- Continue here: Ubuntu 24.04
Configuration
The configuration is located in src/Indexer and follows the ASP.NET configuration design pattern: src/Indexer/appsettings.json is the base configuration, and src/Indexer/appsettings.Development.json overrides it.
If you plan to use multiple environments, create any appsettings.{YourEnvironment}.json (e.g. Development, Staging, Prod) and set the environment variable DOTNET_ENVIRONMENT accordingly on the target machine.
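To make the override behaviour concrete, here is a minimal sketch of how a base configuration and an environment-specific file combine. It is a simplified model, not the ASP.NET implementation: the function name merge_settings is hypothetical, and arrays are replaced wholesale here, whereas ASP.NET flattens keys and overrides array entries by index.

```python
import copy


def merge_settings(base, override):
    """Return a new dict where values from `override` win over `base`.

    Nested sections (dicts) are merged key by key; any other value,
    including lists, is replaced as a whole (a simplification).
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative values only; they stand in for appsettings.json and
# appsettings.Development.json.
base_settings = {"Indexer": {"Server": {"BaseUri": "http://localhost:5000"}}}
development_settings = {"Indexer": {"Server": {"ApiKey": "ServerApiKeyHere"}}}

effective = merge_settings(base_settings, development_settings)
print(effective["Indexer"]["Server"])
```

Keys that appear only in the base file survive, and keys that appear in both take the value from the environment-specific file.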
Setup
If you just installed the indexer and want to configure it:
- Open src/Indexer/appsettings.Development.json
- If your search server is not on the same machine as the indexer, update "BaseUri" to reflect the URL to the server.
- If you configured API keys for the search server, set "ApiKey": "<your key here>" beneath "BaseUri" in the "Server" section.
- Create your own indexing script(s) in src/Indexer/Scripts/ and configure them as shown below
Structure
"Indexer": {
    "Workers":
    [ // This is a list; you can have as many "workers" as you want
        {
            "Name": "example",
            "Script": "Scripts/example.py",
            "Calls": [ // This is also a list. A worker may have multiple different calls.
                {
                    "Type": "interval", // See: Call types
                    "Interval": 60000 // Parameter(s) as specified for the call type
                }
            ]
        },
        {
            "Name": "secondWorker",
            /* ... */
        }
    ],
    "ApiKeys": ["YourApiKeysHereForTheIndexer"], // API keys, if you want to protect the indexer's API
    "Server": {
        "BaseUri": "http://localhost:5000", // URL to the embeddingsearch server
        "ApiKey": "ServerApiKeyHere" // API key set in the server
    }
}
Call types
runonce
- What does it do: The script gets called once at startup. Use this if you need a main loop.
  - (Remember the call runs in update() like the others!)
- Parameters: None
interval
- What does it do: The script gets called periodically based on the specified Interval parameter.
- Parameters:
  - Interval (in milliseconds)
schedule
- What does it do: The script gets called based on the provided schedule.
- Parameters:
  - Schedule (Quartz Cron expression, e.g. "0/5 * * * * ?")
fileupdate
- What does it do: The script gets called whenever a file is updated in the specified subdirectory.
- Parameters:
  - Path (e.g. "Scripts/example_content")
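The call types and their parameters can be checked before starting the indexer. The following is a small sketch of such a check, assuming the worker has already been loaded into a Python dict with the comments stripped. The names REQUIRED_PARAMETERS and validate_worker are hypothetical, and the indexer's own validation may behave differently.

```python
# Parameters each call type needs, as listed in the call type overview.
REQUIRED_PARAMETERS = {
    "runonce": [],
    "interval": ["Interval"],
    "schedule": ["Schedule"],
    "fileupdate": ["Path"],
}


def validate_worker(worker):
    """Return a list of human-readable problems found in one worker entry."""
    problems = []
    for index, call in enumerate(worker.get("Calls", [])):
        call_type = call.get("Type")
        if call_type not in REQUIRED_PARAMETERS:
            problems.append(f"call {index}: unknown type {call_type!r}")
            continue
        for parameter in REQUIRED_PARAMETERS[call_type]:
            if parameter not in call:
                problems.append(f"call {index}: {call_type} needs {parameter!r}")
    return problems


example_worker = {
    "Name": "example",
    "Script": "Scripts/example.py",
    "Calls": [
        {"Type": "interval", "Interval": 60000},
        {"Type": "schedule", "Schedule": "0/5 * * * * ?"},
        {"Type": "fileupdate"},  # Missing "Path" on purpose
    ],
}

for problem in validate_worker(example_worker):
    print(problem)
```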
Scripting
Scripts should be put in src/Indexer/Scripts/. If you look there, by default you will find some example scripts that can be taken as reference when building your own.
For configuration of the scripts see: Structure
The next few sections explain some core concepts/patterns. If you want to skip to explicit code examples, see the Python and C# (Roslyn) sections.
General
Scripts need to define the following functions:
init()
- Is run at startup. Put all initialization code here.
- Do not put a main loop here! It might prevent other workers from initializing and cause other unintended behavior!
update()
- Is called by the calls as specified in Call types
- A main loop might work best here using the runonce call
probMethods
probMethods combine the similarity values from multiple models and multiple datapoints into a single result.
They need to be specified when constructing a datapoint or an entity (see: src/Indexer/Scripts/example.py in method index_files)
Currently the following probMethods are implemented:
- "Mean"
- "HarmonicMean"
- "QuadraticMean"
- "GeometricMean"
- "ExtremeValuesEmphasisWeightedAverage" or "EVEWavg"
- "HighValueEmphasisWeightedAverage" or "HVEWAvg"
- "LowValueEmphasisWeightedAverage" or "LVEWAvg"
- "DictionaryWeightedAverage"
Mean
Averages the values by accumulating the sums and dividing by the number of entries.
\frac{1}{n} \sum_{i=1}^{n} x_i
HarmonicMean
Calculates the harmonic mean of the non-zero values and scales it by the share of non-zero values, which avoids division by zero.
\text{HarmonicMean}(L) = \begin{cases}0, & \text{if } n_{nz} = 0 \\\left( \frac{n_{nz}}{\sum\limits_{x_i \in L,\ x_i \neq 0} \frac{1}{x_i}} \right) \cdot \left( \frac{n_{nz}}{n_T} \right), & \text{otherwise}\end{cases}
with
- n_{nz} being the number of non-zero elements
- n_T being the total number of elements
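The zero handling is easier to see in code. This is a minimal Python sketch of the formula as written, not the indexer's implementation; the function name harmonic_mean is chosen for illustration.

```python
def harmonic_mean(values):
    """Harmonic mean of the non-zero values, scaled by n_nz / n_T."""
    nonzero = [x for x in values if x != 0]
    if not nonzero:
        # n_nz = 0: covers both an empty list and a list of only zeros
        return 0.0
    n_nz = len(nonzero)
    n_t = len(values)
    return (n_nz / sum(1.0 / x for x in nonzero)) * (n_nz / n_t)


print(harmonic_mean([0.5, 0.5]))
print(harmonic_mean([0.5, 0.0]))
```

A zero does not force the result to zero; it is skipped in the sum and then lowers the result through the n_nz / n_T factor.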
QuadraticMean
Calculates the quadratic mean.
\text{QuadraticMean}(L) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 }
GeometricMean
Calculates the geometric mean.
\text{GeometricMean}(L) = \begin{cases}0, & \text{if } n = 0\\\left(\prod\limits_{i=1}^{n} x_i \right)^{\frac{1}{n}}, & \text{otherwise}\end{cases}
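As a quick illustration, the quadratic and geometric means can be sketched in Python as follows. The function names are hypothetical. Returning 0 for an empty list in quadratic_mean is an assumption, since the formula above only defines the empty case for the geometric mean.

```python
import math


def quadratic_mean(values):
    """Square root of the mean of the squared values."""
    if not values:
        return 0.0  # Assumption: the formula does not define n = 0
    return math.sqrt(sum(x * x for x in values) / len(values))


def geometric_mean(values):
    """n-th root of the product of the values; 0 for an empty list."""
    if not values:
        return 0.0
    return math.prod(values) ** (1.0 / len(values))


print(quadratic_mean([3.0, 4.0]))
print(geometric_mean([0.25, 1.0]))
```

Note that a single 0 makes the geometric mean 0, while the quadratic mean leans toward the larger values.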
ExtremeValuesEmphasisWeightedAverage
aka. EVEWavg
Calculates a weighted average where values near 0 or 1 are weighted much more heavily.
A single 1 makes the whole function return 1, as it has "infinite" weight.
Similarly any 0 causes the function to return 0.
(If both a 0 and a 1 are present, the function returns 1)
\text{EVEWA}(L) = \begin{cases}1, & \text{if } \exists\, x_i = 1 \\0, & \text{if } \exists\, x_i = 0 \\\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{x_i(1 - x_i)} }{ \sum\limits_{i=1}^{n} \frac{1}{x_i(1 - x_i)} }, & \text{otherwise}\end{cases}
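A minimal Python sketch of this formula, assuming all values lie in the range 0 to 1. The function name evewa and the ValueError for an empty list are illustrative choices, not taken from the indexer.

```python
def evewa(values):
    """Weighted average with weight 1 / (x * (1 - x)) per value."""
    if not values:
        raise ValueError("values must not be empty")  # Assumption
    if any(x == 1 for x in values):
        return 1.0  # Checked first, so a 1 wins over a 0
    if any(x == 0 for x in values):
        return 0.0
    weights = [1.0 / (x * (1.0 - x)) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


print(evewa([0.5, 0.9]))
print(evewa([0.0, 1.0]))
```

For 0.5 and 0.9 the result lies above the plain mean of 0.7, because 0.9 is closer to an extreme and therefore carries more weight.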
HighValueEmphasisWeightedAverage
aka. HVEWAvg
Calculates a weighted average where values near 1 are weighted much more heavily. Lower values are weighted less.
A single 1 makes the whole function return 1, as it has "infinite" weight.
A 0 has zero weight.
\text{HVEWA}(L) = \begin{cases}1, & \text{if } \exists\, x_i = 1 \\\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{1 - x_i} }{ \sum\limits_{i=1}^{n} \frac{1}{1 - x_i} }, & \text{otherwise}\end{cases}
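The same formula as a short Python sketch, assuming values in the range 0 to 1. The function name hvewa and the ValueError for an empty list are illustrative assumptions.

```python
def hvewa(values):
    """Weighted average with weight 1 / (1 - x) per value."""
    if not values:
        raise ValueError("values must not be empty")  # Assumption
    if any(x == 1 for x in values):
        return 1.0
    weights = [1.0 / (1.0 - x) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


print(hvewa([0.0, 0.5]))
print(hvewa([0.3, 1.0]))
```

For 0.0 and 0.5 the result is one third, above the plain mean of 0.25, because the higher value carries twice the weight of the lower one.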
LowValueEmphasisWeightedAverage
aka. LVEWAvg
Calculates a weighted average where values near 0 are weighted much more heavily. Higher values are weighted less.
A single 0 makes the whole function return 0, as it has "infinite" weight.
A 1 has zero weight.
\text{LVEWA}(L) = \begin{cases}0, & \text{if } \exists\, x_i = 0 \\\frac{ n}{ \sum\limits_{i=1}^{n} \frac{1}{x_i} }, & \text{otherwise}\end{cases}
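A short Python sketch that follows the prose description: any 0 yields 0, and otherwise each value is weighted by 1 / x, which reduces to n divided by the sum of the reciprocals. The function name lvewa and the ValueError for an empty list are illustrative assumptions.

```python
def lvewa(values):
    """Weighted average with weight 1 / x per value."""
    if not values:
        raise ValueError("values must not be empty")  # Assumption
    if any(x == 0 for x in values):
        return 0.0  # A 0 would otherwise cause a division by zero
    return len(values) / sum(1.0 / x for x in values)


print(lvewa([1.0, 0.5]))
print(lvewa([0.0, 0.9]))
```

For 1.0 and 0.5 the result is two thirds, below the plain mean of 0.75, because the lower value carries more weight.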
DictionaryWeightedAverage
Calculates a weighted average as specified by the user.
\text{DWA}(L, D) = \frac{ \sum\limits_{i=1}^{n} w_i x_i }{ \sum\limits_{i=1}^{n} w_i }
Where:
- L = \{(k_1, x_1), (k_2, x_2), \dots, (k_n, x_n)\} is the list of key–value pairs
- x_i is the float value associated with key k_i
- D : k_i \mapsto w_i is a dictionary mapping keys k_i to weights w_i \in \mathbb{R}
e.g.:
probmethod_datapoint = "DictionaryWeightedAverage:{\"ollama:bge-m3\": 4, \"ollama:mxbai-embed-large\": 1}"
probmethod_entity = "DictionaryWeightedAverage:{\"title\": 2, \"filename\": 0.1, \"text\": 0.25}"
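To show how such a string relates to the formula, here is a sketch that splits the method name from its JSON weights and applies them. The function names are hypothetical, and the indexer's own parsing may differ. Returning 0 when the weights sum to 0 is an assumption, and a key without a weight raises a KeyError in this sketch.

```python
import json


def parse_prob_method(spec):
    """Split 'Name:{json}' into the method name and its weight dictionary."""
    name, _, raw = spec.partition(":")  # Splits at the first colon only
    weights = json.loads(raw) if raw else {}
    return name, weights


def dictionary_weighted_average(pairs, weights):
    """Weighted average of (key, value) pairs using weights[key]."""
    total_weight = sum(weights[key] for key, _ in pairs)
    if total_weight == 0:
        return 0.0  # Assumption: not specified by the formula
    return sum(weights[key] * value for key, value in pairs) / total_weight


spec = 'DictionaryWeightedAverage:{"ollama:bge-m3": 4, "ollama:mxbai-embed-large": 1}'
name, weights = parse_prob_method(spec)
similarities = [("ollama:bge-m3", 1.0), ("ollama:mxbai-embed-large", 0.5)]
print(name, dictionary_weighted_average(similarities, weights))
```

Splitting at the first colon matters because the model names themselves contain colons.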
Python
To ease scripting, tools.py contains all definitions of the .NET objects passed to the script. This includes attributes and methods.
These definitions are not yet fully interoperable with the .NET CLR: methods that require anything more than strings or other simple data types to be passed are not yet supported. (WIP)
Supported file extensions
- .py
Code elements
Here is an overview of code elements by example:
from tools import * # Import all tools that are provided for ease of scripting

def init(toolset: Toolset): # Defining an init() function with 1 parameter is required.
    pass # Your code would go here.
    # Don't put a main loop here! Why?
    # A loop here never returns, so it prevents the application from initializing and keeps exclusive control over the GIL.

def update(toolset: Toolset): # Defining an update() function with 1 parameter is required.
    pass # Your code - including possibly a main loop - would go here.
Using the toolset passed by the .NET CLR
src/Indexer/Scripts/example.py is a good example of how to use the toolset.
Currently, Toolset, as provided by the IndexerService to the Python script, contains 3 elements:
- (only for update, not init) callbackInfos - an object that provides all information regarding the callback (e.g. what file was updated)
- client - a .NET object that has the functions as described in src/Indexer/Scripts/tools.py. It's the client that - according to the configuration - communicates with the search server and executes the API calls.
- filePath - the path to the script, as specified in the configuration
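The sketch below shows one way an update() function might read these elements. It uses a SimpleNamespace as a stand-in for the real toolset so that it runs outside the indexer. The helper describe_call and the contents of the stand-in are hypothetical; the actual structure of callbackInfos is defined in tools.py.

```python
from types import SimpleNamespace


def describe_call(toolset):
    """Return a one-line summary of why update() was invoked."""
    # callbackInfos is only set for update(), so fall back to None.
    callback_infos = getattr(toolset, "callbackInfos", None)
    return f"script={toolset.filePath} callback={callback_infos!r}"


def update(toolset):
    print(describe_call(toolset))


# Stand-in for the object the indexer would pass (illustrative values).
fake_toolset = SimpleNamespace(
    filePath="Scripts/example.py",
    client=None,
    callbackInfos="file updated",
)
update(fake_toolset)
```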
C# (Roslyn)
Supported file extensions
- .csx
Code elements
Important hint: As shown in the last two lines of the example code, simply declaring the class is not enough. The script must also return an object of that class!
// #load directives are disregarded at compile time. They are currently used for syntax highlighting only.
#load "../../Client/Client.cs"
#load "../Models/Script.cs"
#load "../Models/Interfaces.cs"
#load "../Models/WorkerResults.cs"
#load "../../Shared/Models/SearchdomainResults.cs"
#load "../../Shared/Models/JSONModels.cs"
#load "../../Shared/Models/EntityResults.cs"
using Shared.Models;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Extensions.Logging;

// Required: a class that extends Indexer.Models.IScript
public class ExampleScript : Indexer.Models.IScript
{
    public Indexer.Models.ScriptToolSet ToolSet;
    public Client.Client client;

    // Optional: constructor
    public ExampleScript()
    {
        //System.Console.WriteLine("DEBUG@example.csx - Constructor"); // logger not passed here yet
    }

    // Required: Init method as required to extend IScript
    public int Init(Indexer.Models.ScriptToolSet toolSet)
    {
        ToolSet = toolSet;
        ToolSet.Logger.LogInformation("DEBUG@example.csx - Init");
        return 0; // Required: int error value return
    }

    // Required: Update method as required to extend IScript
    public int Update(Indexer.Models.ICallbackInfos callbackInfos)
    {
        ToolSet.Logger.LogInformation("DEBUG@example.csx - Update");
        // NOTE: defaultSearchdomain is not defined in this snippet; declare it before use.
        EntityQueryResults test = ToolSet.Client.EntityQueryAsync(defaultSearchdomain, "DNA").Result;
        var firstResult = test.Results.ToArray()[0];
        ToolSet.Logger.LogInformation(firstResult.Name);
        ToolSet.Logger.LogInformation(firstResult.Value.ToString());
        return 0; // Required: int error value return
    }

    // Required: Stop method as required to extend IScript
    public int Stop()
    {
        ToolSet.Logger.LogInformation("DEBUG@example.csx - Stop");
        return 0; // Required: int error value return
    }
}

// Required: return an instance of your IScript-extending class
return new ExampleScript();
Lua
TODO
JavaScript
TODO