Updated documentation to reflect the addition of the AIProvider implementation

2025-07-07 02:41:13 +02:00
parent 7a846935a1
commit 60b6e0502a
2 changed files with 104 additions and 16 deletions
@@ -3,10 +3,21 @@
 embeddingsearch is a search server that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of indexed entries.
-This repository comes with
- a server (accessible via API calls & swagger)
- a clientside library (C#)
- a scripting based indexer service that supports the use of
+embeddingsearch offers:
+- Privacy and flexibility through self-hosted solutions like:
+  - ollama
+  - OpenAI-compatible APIs (like LocalAI)
 - Great flexibility through deep control over
  - the amount of datapoints per entity (i.e. the thing you're trying to find)
  - which models are used (multiple per datapoint possible to improve accuracy)
  - which models are sourced from where (multiple Ollama/OpenAI-compatible sources possible)
  - similarity calculation methods (WIP)
  - aggregation of results (when multiple models are used per datapoint)
 This repository comes with a
 - server (accessible via API calls & swagger)
 - clientside library (C#)
 - scripting based indexer service that supports the use of
  - Python
  - Golang (WIP)
  - Javascript (WIP)
@@ -67,22 +67,99 @@ If you just installed the server and want to configure it:
 # Scripting
 ## General
 ## probMethods
-Probmethods are used to join the multiple similarity results from multiple models, and multiple datapoints into one single value.
-The probMethod is given a list where each element consists of a string and a floating point value (0-1).
-### `probmethod_embedding` (also referred to as `probmethod_datapoint`)
-Takes list where each element contains:
- model name (e.g. "bge-m3")
- Result of the similarity calculation between query embeddings and the embeddings for this datapoint (per model)
-Returns a single floating point value that represents the resulting similarity for this datapoint.
-### `probmethod` (also referred to as `probmethod_entity`)
-Takes list where each element contains:
- datapoint name (e.g. "title", "text", "filename")
- Result from `probmethod_embedding`
+Probmethods are used to join the multiple similarity values from multiple models and multiple datapoints into one single result.
+They need to be specified when constructing a datapoint or an entity (see: [src/Indexer/Scripts/example.py](/src/Indexer/Scripts/example.py) in method `index_files`).
+Currently, the following probMethods are implemented:
+- "Mean"
+- "HarmonicMean"
+- "QuadraticMean"
 - "GeometricMean"
 - "ExtremeValuesEmphasisWeightedAverage" or "EVEWavg"
 - "HighValueEmphasisWeightedAverage" or "HVEWAvg"
 - "LowValueEmphasisWeightedAverage" or "LVEWAvg"
 - "DictionaryWeightedAverage"
+### Mean
+Averages the values by summing them and dividing by the number of entries.
+
+$\frac{1}{n} \sum_{i=1}^{n} x_i$
+
 ### HarmonicMean
 Calculates the harmonic mean over the non-zero values only (which avoids division by 0) and scales the result by the share of non-zero values.
 $$
 \text{HarmonicMean}(L) = \begin{cases}0,
 & \text{if } n_{nz} = 0 \\\left( \frac{n_{nz}}{\sum\limits_{x_i \in L,\ x_i \neq 0} \frac{1}{x_i}} \right) \cdot \left( \frac{n_{nz}}{n_T} \right), & \text{otherwise}
 \end{cases}
 $$
 with
 - $n_{nz}$ being the number of non-zero elements
 - $n_T$ being the total number of elements
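The formula above can be sketched in Python as follows. This is a minimal illustration, not the server's actual implementation; the function name `harmonic_mean` is hypothetical, and the handling of an empty list (returning 0) is an assumption derived from the $n_{nz} = 0$ case.

```python
def harmonic_mean(values):
    """Zero-tolerant harmonic mean as described by the formula above.

    Hypothetical helper for illustration; not taken from the project source.
    """
    nonzero = [x for x in values if x != 0]
    if not nonzero:
        # n_nz = 0: no non-zero values (this also covers an empty list)
        return 0.0
    n_nz = len(nonzero)
    n_total = len(values)
    # Plain harmonic mean over the non-zero values only
    plain = n_nz / sum(1.0 / x for x in nonzero)
    # Scale by the share of non-zero values, so zeros still lower the result
    return plain * (n_nz / n_total)
```

For `[0.5, 0.5, 0.0]` the harmonic mean of the non-zero values is 0.5, and scaling by 2/3 gives roughly 0.33, so a zero pulls the result down without collapsing it to 0.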
 ### QuadraticMean
 Calculates the quadratic mean.
 $$
 \text{QuadraticMean}(L) = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} x_i^2 }
 $$
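As a rough sketch of the formula above (the function name `quadratic_mean` is hypothetical, and a non-empty input list is assumed because the formula does not define the empty case):

```python
import math


def quadratic_mean(values):
    """Root mean square of the values; assumes a non-empty list."""
    return math.sqrt(sum(x * x for x in values) / len(values))
```

Because the values are squared before averaging, larger values have more influence than in the plain mean: `[0.6, 0.8]` gives about 0.707 instead of 0.7.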
 ### GeometricMean
 Calculates the geometric mean.
 $$
 \text{GeometricMean}(L) = \begin{cases}0, & \text{if } n = 0\\\left(\prod\limits_{i=1}^{n} x_i \right)^{\frac{1}{n}}, & \text{otherwise}\end{cases}
 $$
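A minimal Python sketch of the formula above, assuming Python 3.8 or newer for `math.prod`; the function name `geometric_mean` is hypothetical:

```python
import math


def geometric_mean(values):
    """n-th root of the product of the values; 0 for an empty list."""
    n = len(values)
    if n == 0:
        return 0.0
    return math.prod(values) ** (1.0 / n)
```

Note that a single `0` in the list makes the product, and therefore the result, 0.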
 ### ExtremeValuesEmphasisWeightedAverage
 aka. EVEWavg
 Calculates a weighted average where values near 0 or 1 are weighted much more heavily.
 A single `1` makes the whole function return 1, as it has "infinite" weight.
 Similarly any `0` causes the function to return 0.
 (If both a `0` and a `1` are present, the function returns 1)
 $$
 \text{EVEWA}(L) = \begin{cases}1, & \text{if } \exists\, x_i = 1 \\0, & \text{if } \exists\, x_i = 0 \\\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{x_i(1 - x_i)} }{ \sum\limits_{i=1}^{n} \frac{1}{x_i(1 - x_i)} }, & \text{otherwise}\end{cases}
 $$
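The case distinction above could look like this in Python. It is a sketch under the assumption that all values lie in the range 0 to 1 and the list is non-empty; the function name `evew_avg` is hypothetical.

```python
def evew_avg(values):
    """Weighted average with weight 1 / (x * (1 - x)) per value.

    Hypothetical helper; assumes a non-empty list of values in [0, 1].
    """
    # A 1 is checked first, so it wins when both a 0 and a 1 are present
    if any(x == 1 for x in values):
        return 1.0
    if any(x == 0 for x in values):
        return 0.0
    weights = [1.0 / (x * (1.0 - x)) for x in values]
    weighted_sum = sum(w * x for w, x in zip(weights, values))
    return weighted_sum / sum(weights)
```

For `[0.9, 0.5]` the value 0.9 receives a weight of about 11.1 against 4 for 0.5, so the result (about 0.79) lies closer to 0.9 than the plain mean of 0.7.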
 ### HighValueEmphasisWeightedAverage
 aka. HVEWAvg
 Calculates a weighted average where values near 1 are weighted much more heavily. Lower values are weighted less.
 A single `1` makes the whole function return 1, as it has "infinite" weight.
 A `0` has zero weight.
 $$
 \text{HVEWA}(L) = \begin{cases}1, & \text{if } \exists\, x_i = 1 \\\frac{ \sum\limits_{i=1}^{n} \frac{x_i}{1 - x_i} }{ \sum\limits_{i=1}^{n} \frac{1}{1 - x_i} }, & \text{otherwise}\end{cases}
 $$
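A possible Python rendering of the formula above, assuming a non-empty list of values between 0 and 1; the function name `hvew_avg` is hypothetical:

```python
def hvew_avg(values):
    """Weighted average with weight 1 / (1 - x) per value.

    Hypothetical helper; assumes a non-empty list of values in [0, 1].
    """
    if any(x == 1 for x in values):
        # A 1 would have infinite weight, so it decides the result
        return 1.0
    weights = [1.0 / (1.0 - x) for x in values]
    weighted_sum = sum(w * x for w, x in zip(weights, values))
    return weighted_sum / sum(weights)
```

For `[0.9, 0.1]` this yields roughly 0.82, well above the plain mean of 0.5, because 0.9 is weighted about nine times as heavily as 0.1.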
 ### LowValueEmphasisWeightedAverage
 aka. LVEWAvg
 Calculates a weighted average where values near 0 are weighted much more heavily. Higher values are weighted less.
 A single `0` makes the whole function return 0, as it has "infinite" weight.
 A `1` has zero weight.
 $$
 \text{LVEWA}(L) = \begin{cases}0, & \text{if } \exists\, x_i = 0 \\\frac{ n}{ \sum\limits_{i=1}^{n} \frac{1}{x_i} }, & \text{otherwise}\end{cases}
 $$
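The following Python sketch follows the prose description of this method, where any `0` makes the function return 0, and otherwise computes $n$ divided by the sum of reciprocals. It assumes a non-empty list of values between 0 and 1; the function name `lvew_avg` is hypothetical.

```python
def lvew_avg(values):
    """Weighted average with weight 1 / x per value.

    Hypothetical helper; assumes a non-empty list of values in [0, 1].
    """
    if any(x == 0 for x in values):
        # A 0 would have infinite weight, so it decides the result
        return 0.0
    # sum(w * x) with w = 1 / x simplifies to n, leaving n / sum(1 / x)
    return len(values) / sum(1.0 / x for x in values)
```

For `[0.9, 0.1]` this yields roughly 0.18, well below the plain mean of 0.5.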
 ### DictionaryWeightedAverage
 Calculates a weighted average as specified by the user.
 $$
 \text{DWA}(L, D) = \frac{ \sum\limits_{i=1}^{n} w_i x_i }{ \sum\limits_{i=1}^{n} w_i }
 $$
 Where:
 - $L = \{(k_1, x_1), (k_2, x_2), \dots, (k_n, x_n)\}$ is the list of key–value pairs
  - $x_i$ is the float value associated with key $k_i$
 - $D : k_i \mapsto w_i$ is a dictionary mapping keys $k_i$ to weights $w_i \in \mathbb{R}$
 e.g.:
 ```
 probmethod_datapoint = "DictionaryWeightedAverage:{\"ollama:bge-m3\": 4, \"ollama:mxbai-embed-large\": 1}"
 probmethod_entity = "DictionaryWeightedAverage:{\"title\": 2, \"filename\": 0.1, \"text\": 0.25}"
 ```
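To illustrate how such a specification might be evaluated, here is a small Python sketch. The parsing step (splitting at the first colon and reading the remainder as JSON) and both function names are assumptions made for this example; the server may parse the string differently. The sketch also assumes that every key in the list has a weight in the dictionary.

```python
import json


def parse_probmethod(spec):
    """Split 'Name:{json}' into the method name and its weight dictionary.

    Hypothetical helper; the real parsing logic may differ.
    """
    name, _, raw = spec.partition(":")  # split at the first colon only
    return name, json.loads(raw)


def dictionary_weighted_average(pairs, weights):
    """Weighted average of (key, value) pairs using per-key weights."""
    total_weight = sum(weights[key] for key, _ in pairs)
    return sum(weights[key] * value for key, value in pairs) / total_weight


spec = 'DictionaryWeightedAverage:{"ollama:bge-m3": 4, "ollama:mxbai-embed-large": 1}'
name, weights = parse_probmethod(spec)
# Illustrative similarity values, not real model output
pairs = [("ollama:bge-m3", 0.8), ("ollama:mxbai-embed-large", 0.3)]
result = dictionary_weighted_average(pairs, weights)
```

With these illustrative inputs, `bge-m3` counts four times as much as `mxbai-embed-large`, so the result is (4 · 0.8 + 1 · 0.3) / 5, about 0.7.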
 Returns a single floating point value that represents the resulting similarity for this Entity.
 ## Python
 To ease scripting, tools.py contains all definitions of the .NET objects passed to the script. This includes attributes and methods.