Updated documentation to reflect the addition of the AIProvider implementation

README.md
@@ -3,10 +3,21 @@

embeddingsearch is a search server that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of indexed entries.

embeddingsearch offers:

- Privacy and flexibility through self-hosted solutions like:
  - ollama
  - OpenAI-compatible APIs (like LocalAI)
- Great flexibility through deep control over
  - the amount of datapoints per entity (i.e. the thing you're trying to find)
  - which models are used (multiple per datapoint possible to improve accuracy)
  - which models are sourced from where (multiple Ollama/OpenAI-compatible sources possible)
  - similarity calculation methods (WIP)
  - aggregation of results (when multiple models are used per datapoint)

This repository comes with a

- server (accessible via API calls & swagger)
- clientside library (C#)
- scripting-based indexer service that supports the use of
  - Python
  - Golang (WIP)
  - Javascript (WIP)

docs/Indexer.md

@@ -67,22 +67,99 @@ If you just installed the server and want to configure it:

# Scripting

## General

## probMethods

probMethods are used to combine the similarity values from multiple models and multiple datapoints into one single result.

They need to be specified when constructing a datapoint or an entity (see: [src/Indexer/Scripts/example.py](/src/Indexer/Scripts/example.py) in method `index_files`).

Currently the following probMethods are implemented:

- "Mean"
- "HarmonicMean"
- "QuadraticMean"
- "GeometricMean"
- "ExtremeValuesEmphasisWeightedAverage" or "EVEWavg"
- "HighValueEmphasisWeightedAverage" or "HVEWAvg"
- "LowValueEmphasisWeightedAverage" or "LVEWAvg"
- "DictionaryWeightedAverage"
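
How a probMethod string is attached is shown concretely only for `DictionaryWeightedAverage` later in this document; by analogy with that example (and purely as an illustrative assumption, not a documented API), the simpler methods would be selected by name alone:

```
probmethod_datapoint = "HarmonicMean"
probmethod_entity = "Mean"
```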

### Mean

Averages the values by accumulating the sum and dividing by the number of entries.

$$
\text{Mean}(L) = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

### HarmonicMean

Calculates the harmonic mean, but also avoids division-by-zero issues:

$$
\text{HarmonicMean}(L) = \begin{cases}
0, & \text{if } n_{nz} = 0 \\
\left( \frac{n_{nz}}{\sum\limits_{x_i \in L,\ x_i \neq 0} \frac{1}{x_i}} \right) \cdot \left( \frac{n_{nz}}{n_T} \right), & \text{otherwise}
\end{cases}
$$

with

- $n_{nz}$ being the number of non-zero elements
- $n_T$ being the total number of elements

### QuadraticMean

Calculates the quadratic mean.

$$
\text{QuadraticMean}(L) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 }
$$

### GeometricMean

Calculates the geometric mean.

$$
\text{GeometricMean}(L) = \begin{cases}
0, & \text{if } n = 0 \\
\left(\prod\limits_{i=1}^{n} x_i \right)^{\frac{1}{n}}, & \text{otherwise}
\end{cases}
$$
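
The four means above can be sketched in plain Python (an illustrative rendering of the formulas, not the server's implementation; the function names are chosen here for readability):

```python
import math

def mean(values):
    # Arithmetic mean: accumulate the sum, divide by the number of entries.
    return sum(values) / len(values)

def harmonic_mean(values):
    # Harmonic mean over the non-zero entries, scaled by the fraction of
    # non-zero entries (n_nz / n_T) so zeros penalize instead of crashing.
    nonzero = [x for x in values if x != 0]
    if not nonzero:
        return 0.0
    n_nz, n_t = len(nonzero), len(values)
    return (n_nz / sum(1 / x for x in nonzero)) * (n_nz / n_t)

def quadratic_mean(values):
    # Root of the mean of the squares.
    return math.sqrt(sum(x * x for x in values) / len(values))

def geometric_mean(values):
    # n-th root of the product; an empty list yields 0 per the case above.
    if not values:
        return 0.0
    return math.prod(values) ** (1 / len(values))
```

Note the harmonic mean's scaling term: a list that is half zeros can return at most half of the non-zero harmonic mean, so zeros drag the score down rather than being silently ignored.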

### ExtremeValuesEmphasisWeightedAverage

aka. EVEWavg

Calculates a weighted average where values near 0 or 1 are weighted much more heavily.

A single `1` makes the whole function return 1, as it has "infinite" weight.
Similarly, any `0` causes the function to return 0.

(If both a `0` and a `1` are present, the function returns 1.)

$$
\text{EVEWA}(L) = \begin{cases}
1, & \text{if } \exists i : x_i = 1 \\
0, & \text{if } \exists i : x_i = 0 \\
\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{x_i(1 - x_i)} }{ \sum\limits_{i=1}^{n} \frac{1}{x_i(1 - x_i)} }, & \text{otherwise}
\end{cases}
$$

### HighValueEmphasisWeightedAverage

aka. HVEWAvg

Calculates a weighted average where values near 1 are weighted much more heavily; lower values are weighted less.

A single `1` makes the whole function return 1, as it has "infinite" weight.
A `0` contributes nothing to the numerator.

$$
\text{HVEWA}(L) = \begin{cases}
1, & \text{if } \exists i : x_i = 1 \\
\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{1 - x_i} }{ \sum\limits_{i=1}^{n} \frac{1}{1 - x_i} }, & \text{otherwise}
\end{cases}
$$

### LowValueEmphasisWeightedAverage

aka. LVEWAvg

Calculates a weighted average where values near 0 are weighted much more heavily; higher values are weighted less.

A single `0` makes the whole function return 0, as it has "infinite" weight.
A `1` receives the lowest possible weight.

$$
\text{LVEWA}(L) = \begin{cases}
0, & \text{if } \exists i : x_i = 0 \\
\frac{ n }{ \sum\limits_{i=1}^{n} \frac{1}{x_i} }, & \text{otherwise}
\end{cases}
$$
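
The three emphasis-weighted averages can be sketched similarly (illustrative only; the special cases mirror the case distinctions in the formulas above):

```python
def evew_avg(values):
    # Extremes dominate: an exact 1 wins even over an exact 0 (matching
    # the order of the cases), then an exact 0; otherwise each x is
    # weighted by 1 / (x * (1 - x)), which blows up near both extremes.
    if 1 in values:
        return 1.0
    if 0 in values:
        return 0.0
    weights = [1 / (x * (1 - x)) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def hvew_avg(values):
    # Values near 1 dominate: each x is weighted by 1 / (1 - x).
    if 1 in values:
        return 1.0
    weights = [1 / (1 - x) for x in values]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def lvew_avg(values):
    # Values near 0 dominate: each x is weighted by 1 / x, so the
    # weighted average simplifies to n / sum(1 / x).
    if 0 in values:
        return 0.0
    return len(values) / sum(1 / x for x in values)
```

For the same input `[0.9, 0.1]`, `hvew_avg` lands above 0.5 and `lvew_avg` below it, while `evew_avg` stays near 0.5 because both values sit equally far from the extremes.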

### DictionaryWeightedAverage

Calculates a weighted average as specified by the user.

$$
\text{DWA}(L, D) = \frac{ \sum\limits_{i=1}^{n} w_i x_i }{ \sum\limits_{i=1}^{n} w_i }
$$

Where:

- $L = \{(k_1, x_1), (k_2, x_2), \dots, (k_n, x_n)\}$ is the list of key–value pairs
- $x_i$ is the float value associated with key $k_i$
- $D : k_i \mapsto w_i$ is a dictionary mapping keys $k_i$ to weights $w_i \in \mathbb{R}$

e.g.:

```
probmethod_datapoint = "DictionaryWeightedAverage:{\"ollama:bge-m3\": 4, \"ollama:mxbai-embed-large\": 1}"
probmethod_entity = "DictionaryWeightedAverage:{\"title\": 2, \"filename\": 0.1, \"text\": 0.25}"
```
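
Assuming the JSON object after the colon in the strings above is the weight dictionary $D$, the aggregation itself reduces to a few lines (an illustrative sketch, not the server's implementation):

```python
def dictionary_weighted_average(pairs, weights):
    # pairs: list of (key, value) tuples, e.g. ("title", 0.82);
    # weights: dict mapping each key to its user-chosen weight w_i.
    numerator = sum(weights[key] * value for key, value in pairs)
    denominator = sum(weights[key] for key, _ in pairs)
    return numerator / denominator
```

With the entity weights from the example above, a strong "title" similarity dominates the result even when "text" and "filename" score poorly, because its weight (2) dwarfs theirs (0.25 and 0.1).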

## Python

To ease scripting, tools.py contains all definitions of the .NET objects passed to the script. This includes attributes and methods.