Updated documentation to reflect the addition of the AIProvider implementation

2025-07-07 02:41:13 +02:00
parent 7a846935a1
commit 60b6e0502a
2 changed files with 104 additions and 16 deletions


@@ -3,10 +3,21 @@
embeddingsearch is a search server that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of indexed entries.
-This repository comes with
-- a server (accessible via API calls & swagger)
-- a clientside library (C#)
-- a scripting based indexer service that supports the use of
embeddingsearch offers:
- Privacy and flexibility through self-hosted solutions like:
  - ollama
  - OpenAI-compatible APIs (like LocalAI)
- Great flexibility through deep control over
  - the amount of datapoints per entity (i.e. the thing you're trying to find)
  - which models are used (multiple per datapoint possible to improve accuracy)
  - which models are sourced from where (multiple Ollama/OpenAI-compatible sources possible)
  - similarity calculation methods (WIP)
  - aggregation of results (when multiple models are used per datapoint)
This repository comes with a
- server (accessible via API calls & swagger)
- clientside library (C#)
- scripting based indexer service that supports the use of
  - Python
  - Golang (WIP)
  - Javascript (WIP)


@@ -67,22 +67,99 @@ If you just installed the server and want to configure it:
# Scripting
## General
## probMethods
-Probmethods are used to join the multiple similarity results from multiple models, and multiple datapoints into one single value.
-The probMethod is given a list where each element consists of a string and a floating point value (0-1).
-### `probmethod_embedding` (also referred to as `probmethod_datapoint`)
-Takes list where each element contains:
-- model name (e.g. "bge-m3")
-- Result of the similarity calculation between query embeddings and the embeddings for this datapoint (per model)
-Returns a single floating point value that represents the resulting similarity for this datapoint.
-### `probmethod` (also referred to as `probmethod_entity`)
-Takes list where each element contains:
-- datapoint name (e.g. "title", "text", "filename")
-- Result from `probmethod_embedding`
Probmethods are used to join the multiple similarity values from multiple models and multiple datapoints into one single result.
They need to be specified when constructing a datapoint or an entity (see: [src/Indexer/Scripts/example.py](/src/Indexer/Scripts/example.py) in method `index_files`).
Currently the following probMethods are implemented:
- "Mean"
- "HarmonicMean"
- "QuadraticMean"
- "GeometricMean"
- "ExtremeValuesEmphasisWeightedAverage" or "EVEWavg"
- "HighValueEmphasisWeightedAverage" or "HVEWAvg"
- "LowValueEmphasisWeightedAverage" or "LVEWAvg"
- "DictionaryWeightedAverage"
### Mean
Averages the values by summing them and dividing by the number of entries.
$$
\text{Mean}(L) = \frac{1}{n} \sum_{i=1}^{n} x_i
$$
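As an illustration of how Mean could be applied in two stages (first per datapoint across models, then per entity across datapoints), here is a small Python sketch. The nested dictionary layout and the names `mean` and `entity_score` are assumptions made for this example; they are not taken from the server's data model.

```python
def mean(values):
    # Assumption: an empty list yields 0.0 (the formula does not define n = 0).
    return sum(values) / len(values) if values else 0.0


def entity_score(similarities):
    """Join per-model values per datapoint, then join the datapoint results."""
    per_datapoint = [mean(list(models.values())) for models in similarities.values()]
    return mean(per_datapoint)


# Hypothetical similarity values: datapoint name -> {model name -> similarity}
similarities = {
    "title": {"bge-m3": 0.8, "mxbai-embed-large": 0.6},
    "text": {"bge-m3": 0.4, "mxbai-embed-large": 0.2},
}
print(entity_score(similarities))
```

Here the same probMethod is used at both stages; in practice the datapoint-level and entity-level methods can differ.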
### HarmonicMean
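Calculates the harmonic mean while avoiding division by zero: zero entries are left out of the reciprocal sum, and the result is scaled by the fraction of non-zero entries.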
$$
\text{HarmonicMean}(L) = \begin{cases}0,
& \text{if } n_{nz} = 0 \\\left( \frac{n_{nz}}{\sum\limits_{x_i \in L,\ x_i \neq 0} \frac{1}{x_i}} \right) \cdot \left( \frac{n_{nz}}{n_T} \right), & \text{otherwise}
\end{cases}
$$
with
- $n_{nz}$ being the number of non-zero elements
- $n_T$ being the total number of elements
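A minimal Python sketch of the formula above, assuming the input is a plain list of floats in the range 0 to 1. The function name `harmonic_mean` is hypothetical and not taken from the embeddingsearch code base.

```python
def harmonic_mean(values):
    """Harmonic mean of the non-zero entries, scaled by the share of non-zero entries."""
    non_zero = [x for x in values if x != 0]
    if not non_zero:
        # n_nz = 0: covers an empty list as well as a list of only zeros.
        return 0.0
    n_nz = len(non_zero)
    n_t = len(values)
    return (n_nz / sum(1.0 / x for x in non_zero)) * (n_nz / n_t)


print(harmonic_mean([0.5, 0.5]))            # no zeros: ordinary harmonic mean
print(harmonic_mean([0.5, 0.5, 0.0, 0.0]))  # half the entries are zero: result is halved
```

The scaling factor means zeros still pull the result down, even though they are excluded from the reciprocal sum.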
### QuadraticMean
Calculates the quadratic mean.
$$
\text{QuadraticMean}(L) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 }
$$
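The quadratic mean can be sketched in Python as follows. The function name `quadratic_mean` and the handling of an empty list are assumptions; the formula above does not define the case n = 0.

```python
import math


def quadratic_mean(values):
    # Assumption: return 0.0 for an empty list instead of dividing by zero.
    if not values:
        return 0.0
    return math.sqrt(sum(x * x for x in values) / len(values))


scores = [0.2, 0.8]
print(quadratic_mean(scores))     # root of the mean of the squares
print(sum(scores) / len(scores))  # arithmetic mean, for comparison
```

Because the values are squared before averaging, larger values contribute more, so the quadratic mean is never smaller than the arithmetic mean of the same list.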
### GeometricMean
Calculates the geometric mean.
$$
\text{GeometricMean}(L) = \begin{cases}0, & \text{if } n = 0\\\left(\prod\limits_{i=1}^{n} x_i \right)^{\frac{1}{n}}, & \text{otherwise}\end{cases}
$$
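A short Python sketch of the geometric mean as defined above. The function name `geometric_mean` is hypothetical; `math.prod` requires Python 3.8 or newer.

```python
import math


def geometric_mean(values):
    if not values:
        # n = 0 case from the formula.
        return 0.0
    return math.prod(values) ** (1.0 / len(values))


print(geometric_mean([0.25, 1.0]))
print(geometric_mean([0.9, 0.0]))  # a single 0 makes the product, and so the result, 0
```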
### ExtremeValuesEmphasisWeightedAverage
aka. EVEWavg
Calculates a weighted average where values near 0 or 1 are weighted much more heavily.
A single `1` makes the whole function return 1, as it has "infinite" weight.
Similarly any `0` causes the function to return 0.
(If both a `0` and a `1` are present, the function returns 1)
$$
\text{EVEWA}(L) = \begin{cases}
1, & \text{if } \exists\, i : x_i = 1 \\
0, & \text{if } \exists\, i : x_i = 0 \\
\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{x_i(1 - x_i)} }{ \sum\limits_{i=1}^{n} \frac{1}{x_i(1 - x_i)} }, & \text{otherwise}
\end{cases}
$$
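The case distinction above can be sketched in Python like this. The function name `evew_avg` and the empty-list behaviour are assumptions made for the example. The order of the checks matters: testing for `1` first is what makes a list containing both `0` and `1` return 1.

```python
def evew_avg(values):
    # Assumption: an empty list returns 0.0 (not covered by the formula).
    if not values:
        return 0.0
    if any(x == 1 for x in values):
        return 1.0
    if any(x == 0 for x in values):
        return 0.0
    # Weight 1 / (x * (1 - x)) grows without bound as x approaches 0 or 1.
    weights = [1.0 / (x * (1.0 - x)) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


print(evew_avg([0.5, 0.9]))  # pulled towards 0.9, the more extreme value
print(evew_avg([0.0, 1.0]))  # 1 takes precedence over 0
```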
### HighValueEmphasisWeightedAverage
aka. HVEWAvg
Calculates a weighted average where values near 1 are weighted much more heavily. Lower values are weighted less.
A single `1` makes the whole function return 1, as it has "infinite" weight.
A `0` has zero weight.
$$
\text{HVEWA}(L) = \begin{cases}
1, & \text{if } \exists\, i : x_i = 1 \\
\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{1 - x_i} }{ \sum\limits_{i=1}^{n} \frac{1}{1 - x_i} }, & \text{otherwise}
\end{cases}
$$
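A Python sketch that follows the formula above, with each value weighted by 1 / (1 - x). The name `hvew_avg` and the empty-list behaviour are assumptions for illustration.

```python
def hvew_avg(values):
    # Assumption: an empty list returns 0.0 (not covered by the formula).
    if not values:
        return 0.0
    if any(x == 1 for x in values):
        return 1.0
    # Weight 1 / (1 - x) grows without bound as x approaches 1.
    weights = [1.0 / (1.0 - x) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


scores = [0.5, 0.9]
print(hvew_avg(scores))           # closer to 0.9 than the plain mean
print(sum(scores) / len(scores))  # plain mean, for comparison
```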
### LowValueEmphasisWeightedAverage
aka. LVEWAvg
Calculates a weighted average where values near 0 are weighted much more heavily. Higher values are weighted less.
A single `0` makes the whole function return 0, as it has "infinite" weight.
A `1` has zero weight.
$$
\text{LVEWA}(L) = \begin{cases}
0, & \text{if } \exists\, i : x_i = 0 \\
\frac{ n }{ \sum\limits_{i=1}^{n} \frac{1}{x_i} }, & \text{otherwise}
\end{cases}
$$
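A Python sketch of this method, with each value weighted by 1 / x, so that the weighted average reduces to n divided by the sum of reciprocals. The special case follows the prose description (a single `0` returns 0). The name `lvew_avg` and the empty-list behaviour are assumptions.

```python
def lvew_avg(values):
    # Assumption: an empty list returns 0.0 (not covered by the formula).
    if not values:
        return 0.0
    if any(x == 0 for x in values):
        # A 0 would have "infinite" weight, so it decides the result.
        return 0.0
    return len(values) / sum(1.0 / x for x in values)


scores = [0.5, 1.0]
print(lvew_avg(scores))           # closer to 0.5 than the plain mean
print(sum(scores) / len(scores))  # plain mean, for comparison
```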
### DictionaryWeightedAverage
Calculates a weighted average using the per-key weights supplied by the user.
$$
\text{DWA}(L, D) = \frac{ \sum\limits_{i=1}^{n} w_i x_i }{ \sum\limits_{i=1}^{n} w_i }
$$
Where:
- $L = \{(k_1, x_1), (k_2, x_2), \dots, (k_n, x_n)\}$ is the list of key-value pairs
- $x_i$ is the float value associated with key $k_i$
- $D : k_i \mapsto w_i$ is a dictionary mapping keys $k_i$ to weights $w_i \in \mathbb{R}$
e.g.:
```
probmethod_datapoint = "DictionaryWeightedAverage:{\"ollama:bge-m3\": 4, \"ollama:mxbai-embed-large\": 1}"
probmethod_entity = "DictionaryWeightedAverage:{\"title\": 2, \"filename\": 0.1, \"text\": 0.25}"
```
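To make the string format and the formula concrete, here is a Python sketch. It assumes that everything after the first `:` is a JSON object, as the example strings suggest. The helper names `parse_probmethod` and `dictionary_weighted_average` are hypothetical, and how the server treats keys that are missing from the dictionary is not documented here; in this sketch they raise a `KeyError`.

```python
import json


def parse_probmethod(spec):
    """Split 'Name:{json}' into the method name and its weight dictionary."""
    name, _, raw = spec.partition(":")  # splits at the first colon only
    weights = json.loads(raw) if raw else {}
    return name, weights


def dictionary_weighted_average(pairs, weights):
    """pairs is a list of (key, value) tuples; weights maps key -> weight."""
    total_weight = sum(weights[key] for key, _ in pairs)
    if total_weight == 0:
        # Assumption: avoid dividing by zero when all weights are 0.
        return 0.0
    return sum(weights[key] * value for key, value in pairs) / total_weight


spec = 'DictionaryWeightedAverage:{"title": 2, "filename": 0.1, "text": 0.25}'
name, weights = parse_probmethod(spec)
pairs = [("title", 0.9), ("filename", 0.2), ("text", 0.4)]  # made-up similarity values
print(name, dictionary_weighted_average(pairs, weights))
```

With these weights the title dominates the result, while the filename contributes very little.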
-Returns a single floating point value that represents the resulting similarity for this Entity.
## Python
To ease scripting, tools.py contains all definitions of the .NET objects passed to the script. This includes attributes and methods.