Updated README to reflect the recent addition of the Indexer

This commit is contained in:
EzFeDezy
2025-05-27 22:30:52 +02:00
parent 7096f6591f
commit 371f9c7411
3 changed files with 144 additions and 24 deletions


@@ -3,27 +3,34 @@
Embeddingsearch is a DotNet C# library that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries. Embeddingsearch is a DotNet C# library that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries.
This repository comes with This repository comes with
- a server (accessible via API calls) - a server (accessible via API calls & swagger)
- a clientside library - a clientside library
- a CLI module - a CLI module (deprecated)
- an indexer (TBD) - a scripting based indexer service that supports
- Python
- Golang (WIP)
- Javascript (WIP)
(Currently only initial retrieval is implemented.
Reranker support is planned, but its integration is not yet conceptualized.)
# How to set up / use # How to set up / use
## server ## server
1. Install [ollama](https://ollama.com/download) 1. Install [ollama](https://ollama.com/download)
2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`) 2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
3. [Install the dependencies](docs/Server.md#installing-the-dependencies) 3. [Install the dependencies](docs/Server.md#installing-the-dependencies)
4. [Set up a local mysql database](docs/Server.md#mysql-database-setup) 4. [Set up a local mysql database](docs/Server.md#mysql-database-setup)
5. (optional) [Create a searchdomain](#create-a-searchdomain) 5. [Set up the configuration](docs/Server.md#setup)
6. In `src/server` execute `dotnet build && dotnet run` to start the server
7. (optional) [Create a searchdomain using the web interface](docs/Server.md#accessing-the-api)
## client ## client
1. Download the package and add it to your project (TODO: NuGet) 1. Download the package and add it to your project (TODO: NuGet)
2. Create a new client by either: 2. Create a new client by either:
1. By injecting IConfiguration (e.g. `services.AddSingleton<Client>();`) 1. By injecting IConfiguration (e.g. `services.AddSingleton<Client>();`)
2. By specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`) 2. By specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)
## indexer ## indexer
TBD 1. [Install the dependencies](docs/Indexer.md#installing-the-dependencies)
2. [Set up the server](#server)
3. [Configure the indexer](docs/Indexer.md#configuration)
4. [Set up your indexing script(s)](docs/Indexer.md#scripting)
5. Run with `dotnet build && dotnet run` (Or `/usr/bin/dotnet build && /usr/bin/dotnet run`)
## CLI ## CLI
Before anything follow these steps: Before anything follow these steps:
1. Enter the project's `src` directory (used as the working directory in all examples) 1. Enter the project's `src` directory (used as the working directory in all examples)
@@ -37,7 +44,7 @@ All commands, parameters and examples are documented here: [docs/CLI.md](docs/CL
| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using `/usr/bin/dotnet` instead of `dotnet`. | | Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using `/usr/bin/dotnet` instead of `dotnet`. |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist | | Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD | | Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |
| System.DllNotFoundException: Could not load libpython3.12.so with flags RTLD_NOW \| RTLD_GLOBAL: libpython3.12.so: cannot open shared object file: No such file or directory | Install python3.12-dev via apt |
# To-do # To-do
- (High priority) Add default indexer - (High priority) Add default indexer
- Library - Library
@@ -53,22 +60,21 @@ All commands, parameters and examples are documented here: [docs/CLI.md](docs/CL
- Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page) - Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
- Other? (TBD) - Other? (TBD)
- Server - Server
- Scripting capability (Python; perhaps also lua) - ~~Scripting capability (Python; perhaps also lua)~~ (Done with the latest commits)
- Intended sourcing possibilities: - ~~Intended sourcing possibilities:~~
- Local/Remote files (CIFS, SMB, FTP) - ~~Local/Remote files (CIFS, SMB, FTP)~~
- Database contents (MySQL, MSSQL) - ~~Database contents (MySQL, MSSQL)~~
- Web requests (E.g. manual crawling) - ~~Web requests (E.g. manual crawling)~~
- Script call management (interval based & event based) - ~~Script call management (interval based & event based)~~
- NuGet packaging and according README documentation - Implement Healthz check
- Implement [ReaderWriterLock](https://learn.microsoft.com/en-us/dotnet/api/system.threading.readerwriterlockslim?view=net-9.0&redirectedfrom=MSDN) for entityCache to allow for multithreaded read access while retaining single-threaded write access.
- NuGet packaging and corresponding README documentation
- Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?) - Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support - Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
- Implement environment variable use in CLI - Remove the CLI
- fix the `--help` functionality
- Rename `cli` to something unique but still short, e.g. `escli`?
- Improve error messaging for when retrieving a searchdomain fails. - Improve error messaging for when retrieving a searchdomain fails.
- Remove the `id` columns from the database tables where the table is actually identified (and should be unique by) the name, which should become the new primary key. - Remove the `id` columns from the database tables where the table is actually identified (and should be unique by) the name, which should become the new primary key.
- Improve performance & latency (Create ready-to-go processes where each contain an n'th share of the entity cache, ready to perform a query. Prepare it after creating the entity cache.) - Improve performance & latency (Create ready-to-go processes where each contain an n'th share of the entity cache, ready to perform a query. Prepare it after creating the entity cache.)
- Write a Linux installer for the CLI tool
- Make the API server (and indexer, once it is done) a docker container - Make the API server (and indexer, once it is done) a docker container
# Future features # Future features


@@ -0,0 +1,83 @@
# Overview
## Installing the dependencies
### Ubuntu 24.04
1. Install the .NET SDK: `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
2. Install Python and its development files: `sudo apt install python3 python3.12 python3.12-dev`
### Windows
Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
2. Enter your WSL environment `wsl.exe` and configure it
3. Update via `sudo apt update && sudo apt upgrade -y && sudo snap refresh`
4. Continue here: [Ubuntu 24.04](#Ubuntu-24.04)
# Configuration
The configuration is located in `src/Indexer` and conforms to the [ASP.NET configuration design pattern](https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-9.0), i.e. `src/Indexer/appsettings.json` is the base configuration, and `src/Indexer/appsettings.Development.json` overrides it in the `Development` environment.
If you plan to use multiple environments, create any `appsettings.{YourEnvironment}.json` (e.g. `Development`, `Staging`, `Prod`) and set the environment variable `DOTNET_ENVIRONMENT` accordingly on the target machine.
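The layering described above can be illustrated with a rough Python sketch. This is not how .NET implements it; `config_files` and `layer` are hypothetical helper names, the sample values are placeholders, and lists are simply replaced here, which may not match how .NET treats arrays.

```python
import json
import os


def config_files(environment):
    """Return the settings files in load order (later files override earlier ones)."""
    files = ["appsettings.json"]
    if environment:
        files.append(f"appsettings.{environment}.json")
    return files


def layer(base, override):
    """Recursively overlay `override` on `base`; nested objects are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = layer(merged[key], value)
        else:
            merged[key] = value
    return merged


# Placeholder settings, standing in for the contents of the two JSON files.
base = {"Embeddingsearch": {"BaseUri": "http://localhost:5146"}}
prod = {"Embeddingsearch": {"ApiKey": "<your key here>"}}

print(config_files(os.environ.get("DOTNET_ENVIRONMENT", "")))
print(json.dumps(layer(base, prod), indent=2))
```

The merged result keeps `BaseUri` from the base file and adds `ApiKey` from the environment-specific file.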
## Setup
If you just installed the indexer and want to configure it:
1. Open `src/Indexer/appsettings.Development.json`
2. If your search server is not on the same machine as the indexer, update `"BaseUri"` to reflect the URL of the server.
3. If your search server requires API keys (i.e. it is operating outside of the "Development" environment), set `"ApiKey": "<your key here>"` beneath `"BaseUri"` in the `"Embeddingsearch"` section.
4. Create your own indexing script(s) in `src/Indexer/Scripts/` and configure their use as described under [Structure](#structure).
## Structure
```json
"EmbeddingsearchIndexer": {
    "Worker":
    [ // This is a list; you can have as many "workers" as you want
        {
            "Name": "example",
            "Searchdomains": [
                "example"
            ],
            "Script": "Scripts/example.py",
            "Calls": [ // This is also a list. You can have as many different calls as you need.
                {
                    "Type": "interval", // See: Call types
                    "Interval": 60000
                }
            ]
        }
    ]
}
```
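As a sanity check before starting the indexer, a worker entry can be inspected with a few lines of Python. This is only a sketch: `check_worker` is a hypothetical helper, and the rules it applies are inferred from the example above rather than taken from the indexer's own validation.

```python
def check_worker(worker):
    """Return a list of problems found in one worker entry (empty list = looks fine)."""
    problems = []
    expected_types = (("Name", str), ("Searchdomains", list), ("Script", str), ("Calls", list))
    for key, expected in expected_types:
        if not isinstance(worker.get(key), expected):
            problems.append(f"{key} should be a {expected.__name__}")
    calls = worker.get("Calls")
    if isinstance(calls, list):
        for index, call in enumerate(calls):
            # Only the `interval` call type is checked; the other types are still WIP.
            if isinstance(call, dict) and call.get("Type") == "interval":
                interval = call.get("Interval")
                if not isinstance(interval, int) or interval <= 0:
                    problems.append(f"Calls[{index}]: Interval should be a positive number of milliseconds")
    return problems


# The worker from the example configuration, written as a Python dict.
example = {
    "Name": "example",
    "Searchdomains": ["example"],
    "Script": "Scripts/example.py",
    "Calls": [{"Type": "interval", "Interval": 60000}],
}

print(check_worker(example))
```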
## Call types
- `interval`
  - What does it do: The script gets called periodically based on the specified `Interval` parameter.
  - Parameters:
    - Interval (in milliseconds)
- `schedule` (WIP)
  - What does it do: The script gets called based on the provided schedule
  - Parameters: (WIP)
- `fileupdate` (WIP)
  - What does it do: The script gets called whenever a file is updated in the specified subdirectory
  - Parameters: (WIP)
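To get a feel for the `interval` call type, the following sketch lists when calls would fire for a given `Interval`. It assumes the first call happens one full interval after start, which may differ from what the indexer actually does; `due_calls` is a hypothetical helper, not part of the indexer.

```python
def due_calls(interval_ms, duration_ms):
    """Timestamps (ms since start) at which an `interval` call would fire within duration_ms."""
    if interval_ms <= 0:
        raise ValueError("Interval must be positive")
    # Assumption: calls fire at interval, 2 * interval, 3 * interval, ...
    return list(range(interval_ms, duration_ms + 1, interval_ms))


# With "Interval": 60000, three minutes of runtime would yield three calls.
print(due_calls(60000, 180000))
```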
# Scripting
## Python
To ease scripting, `tools.py` contains definitions of all .NET objects passed to the script, including their attributes and methods.
These definitions are not yet fully interoperable with the .NET CLR: methods that take anything more complex than strings or other simple data types are not yet supported. (WIP)
### Required elements
Here is an overview of required elements by example:
```python
from tools import * # Import all tools that are provided for ease of scripting

def init(toolset: Toolset): # Defining an init() function with 1 parameter is required.
    pass # Your code would go here.
    # DO NOT put a main loop here! Why?
    # init() would never return, which prevents the application from initializing,
    # and the script would keep exclusive control over the GIL.

def update(toolset: Toolset): # Defining an update() function with 1 parameter is required.
    pass # Your code would go here.
```
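One way to avoid a main loop is to prepare the work in `init()` and handle a bounded share of it per `update()` call. The sketch below assumes that module-level state survives between calls, which is not confirmed here; the file names and the harness at the bottom are hypothetical stand-ins so the example runs without the indexer or `tools.py`.

```python
from types import SimpleNamespace

pending = []  # Module-level state shared between init() and update() (assumed to persist)


def init(toolset):
    # One-off preparation only: return quickly so the host can finish starting up.
    pending.extend(["a.txt", "b.txt", "c.txt"])  # hypothetical work items


def update(toolset):
    # Handle a bounded amount of work per call instead of looping forever.
    if pending:
        item = pending.pop(0)
        print(f"processing {item}")


# Stand-in harness (hypothetical): the real service would call these functions itself.
fake_toolset = SimpleNamespace(filePath="Scripts/example.py")
init(fake_toolset)
for _ in range(4):
    update(fake_toolset)
```

Each `update()` call returns after one item, so control goes back to the host between calls.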
### Using the toolset passed by the .NET CLR
`src/Indexer/Scripts/example.py` demonstrates how to use the toolset.
Currently, `Toolset`, as provided by the IndexerService to the Python script, contains 3 elements:
1. (only for `update`, not `init`) `callbackInfos` - an object that provides all information regarding the callback. (e.g. what file was updated)
2. `client` - a .NET object that has the functions as described in `src/Indexer/Scripts/tools.py`. It's the client that - according to the configuration - communicates with the search server and executes the API calls.
3. `filePath` - the path to the script, as specified in the configuration
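The sketch below shows how a script might read `filePath` and `callbackInfos` defensively. Only the attribute names come from the list above; the stand-in objects, the `data` directory and the helper names are assumptions for illustration, and `client` is left untouched because its methods are defined in `tools.py`.

```python
import os
from types import SimpleNamespace


def data_dir(toolset):
    """Directory next to the script, derived from toolset.filePath ("data" is hypothetical)."""
    return os.path.join(os.path.dirname(toolset.filePath), "data")


def describe(toolset):
    # callbackInfos is only provided to update(), so fall back to None elsewhere.
    info = getattr(toolset, "callbackInfos", None)
    return "init" if info is None else f"update triggered by {info}"


# Stand-ins for the objects the indexer would pass (hypothetical values).
init_toolset = SimpleNamespace(filePath="Scripts/example.py", client=None)
update_toolset = SimpleNamespace(filePath="Scripts/example.py", client=None, callbackInfos="interval")

print(data_dir(init_toolset))
print(describe(init_toolset))
print(describe(update_toolset))
```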
## Golang
TODO
## Javascript
TODO


@@ -1,4 +1,4 @@
# Server # Overview
The server by default The server by default
- runs on port 5146 - runs on port 5146
- Uses Swagger UI in development mode (`/swagger/index.html`) - Uses Swagger UI in development mode (`/swagger/index.html`)
@@ -6,7 +6,7 @@ The server by default
# Installing the dependencies # Installing the dependencies
## Ubuntu 24.04 ## Ubuntu 24.04
1. `sudo apt update && sudo apt install dotnet-sdk-8.0 -y` 1. Install the .NET SDK: `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
## Windows ## Windows
Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL: Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`) 1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
@@ -19,8 +19,39 @@ Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow t
- Linux/WSL: `sudo apt install mysql-server` - Linux/WSL: `sudo apt install mysql-server`
- Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/) - Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/)
2. connect to it: `sudo mysql -u root` (Or from outside of WSL: `mysql -u root`) 2. connect to it: `sudo mysql -u root` (Or from outside of WSL: `mysql -u root`)
3. Create the database 3. Create the database:
`CREATE DATABASE embeddingsearch; use embeddingsearch;` `CREATE DATABASE embeddingsearch; use embeddingsearch;`
4. Create the user 4. Create the user (replace "somepassword!" with a secure password):
`CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;` `CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;`
5. Create the tables using the CLI tool: `dotnet build` and `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup` 5. Create the tables using the CLI tool: `cd src/cli; dotnet build` and `bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup` (replace the variables with the actual values)
# Configuration
## Environments
The configuration is located in `src/server/` and conforms to the [ASP.NET configuration design pattern](https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-9.0), i.e. `src/server/appsettings.json` is the base configuration, and `src/server/appsettings.Development.json` overrides it in the `Development` environment.
If you plan to use multiple environments, create any `appsettings.{YourEnvironment}.json` (e.g. `Development`, `Staging`, `Prod`) and set the environment variable `DOTNET_ENVIRONMENT` accordingly on the target machine.
## Setup
If you just installed the server and want to configure it:
1. Open `src/server/appsettings.Development.json`
2. Change the password in the "SQL" section (`pwd=<your password goes here>;`)
3. If your Ollama instance does not run locally, update "OllamaURL" to point at your Ollama instance.
4. If you plan on using the server in production:
   1. Set the environment variable `DOTNET_ENVIRONMENT` to something that is not "Development". (e.g. "Prod")
   2. Rename the `appsettings.Development.json` - replace "Development" with whatever you chose. (e.g. "Prod")
   3. Set API keys in the "ApiKeys" section (generate keys using the `uuid` command on Linux)
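If the `uuid` command is not available, a key of the same shape can be generated with Python's standard library. This is just an alternative sketch; `generate_api_key` is a hypothetical helper name, and the server does not require keys to be created this way.

```python
import uuid


def generate_api_key():
    """Return a random (version 4) UUID as a string."""
    return str(uuid.uuid4())


print(generate_api_key())
```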
# API
## Accessing the api
Once started, the server's API can be conveniently viewed and used via Swagger.
By default it is accessible under: `http://localhost:5146/swagger/index.html`
To make an API request from within swagger:
1. Open one of the actions ("GET" / "POST")
2. Click the "Try it out" button. The input fields (if there are any for your action) should now be editable.
3. Fill in the necessary information
4. Click "Execute"
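Before opening the browser, a script can check whether the Swagger page is being served. The sketch below uses only the default address given above; `swagger_url` and `is_reachable` are hypothetical helper names, and no API endpoints are called because their paths are not listed here.

```python
import urllib.error
import urllib.request


def swagger_url(base_uri="http://localhost:5146"):
    """Build the Swagger UI address from the server's base URI."""
    return base_uri.rstrip("/") + "/swagger/index.html"


def is_reachable(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


print(swagger_url())
```

Calling `is_reachable(swagger_url())` while the server runs in development mode should then tell you whether the page is up.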
## Restricting access
API keys do **not** get checked in the Development environment!
Set up a non-development environment as described in [Configuration>Setup](#setup) to enable API key authentication.