From 371f9c7411c08c6fa35e85ee7f374f22eb1e57a3 Mon Sep 17 00:00:00 2001 From: EzFeDezy Date: Tue, 27 May 2025 22:30:52 +0200 Subject: [PATCH] Updated README to reflect the recent addition of the Indexer --- README.md | 44 +++++++++++++++----------- docs/Indexer.md | 83 +++++++++++++++++++++++++++++++++++++++++++++++++ docs/Server.md | 41 +++++++++++++++++++++--- 3 files changed, 144 insertions(+), 24 deletions(-) diff --git a/README.md b/README.md index d594855..ca332e1 100644 --- a/README.md +++ b/README.md @@ -3,27 +3,34 @@ Embeddingsearch is a DotNet C# library that uses Embedding Similarity Search (similiarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries. This repository comes with -- a server (accessible via API calls) +- a server (accessible via API calls & swagger) - a clientside library -- a CLI module -- an indexer (TBD) +- a CLI module (deprecated) +- a scripting based indexer service that supports + - Python + - Golang (WIP) + - Javascript (WIP) -(Currently only initial retrieval is implemented. -Reranker support is planned, but its integration is not yet conceptualized.) # How to set up / use ## server 1. Install [ollama](https://ollama.com/download) 2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`) 3. [Install the depencencies](docs/Server.md#installing-the-dependencies) 4. [Set up a local mysql database](docs/Server.md#mysql-database-setup) -5. (optional) [Create a searchdomain](#create-a-searchdomain) +5. [Set up the configuration](docs/Server.md#setup) +6. In `src/server` execute `dotnet build && dotnet run` to start the server +7. (optional) [Create a searchdomain using the web interface](docs/Server.md#accessing-the-api) ## client 1. Download the package and add it to your project (TODO: NuGet) 2. Create a new client by either: 1. By injecting IConfiguration (e.g. 
`services.AddSingleton();`) 2. By specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`) ## indexer -TBD +1. [Install the dependencies](docs/Indexer.md#installing-the-dependencies) +2. [Set up the server](#server) +3. [Configure the indexer](docs/Indexer.md#configuration) +4. [Set up your indexing script(s)](docs/Indexer.md#scripting) +5. Run with `dotnet build && dotnet run` (Or `/usr/bin/dotnet build && /usr/bin/dotnet run`) ## CLI Before anything follow these steps: 1. Enter the project's `src` directory (used as the working directory in all examples) @@ -37,7 +44,7 @@ All commands, parameters and examples are documented here: [docs/CLI.md](docs/CL | Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using `/usr/bin/dotnet` instead of `dotnet`. | | Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist | | Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD | - +| System.DllNotFoundException: Could not load libpython3.12.so with flags RTLD_NOW \| RTLD_GLOBAL: libpython3.12.so: cannot open shared object file: No such file or directory | Install python3.12-dev via apt | # To-do - (High priority) Add default indexer - Library @@ -53,22 +60,21 @@ All commands, parameters and examples are documented here: [docs/CLI.md](docs/CL - Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page) - Other? 
(TBD) - Server - - Scripting capability (Python; perhaps also lua) - - Intended sourcing possibilities: - - Local/Remote files (CIFS, SMB, FTP) - - Database contents (MySQL, MSSQL) - - Web requests (E.g. manual crawling) - - Script call management (interval based & event based) -- NuGet packaging and according README documentation + - ~~Scripting capability (Python; perhaps also lua)~~ (Done with the latest commits) + - ~~Intended sourcing possibilities:~~ + - ~~Local/Remote files (CIFS, SMB, FTP)~~ + - ~~Database contents (MySQL, MSSQL)~~ + - ~~Web requests (E.g. manual crawling)~~ + - ~~Script call management (interval based & event based)~~ +- Implement Healthz check +- Implement [ReaderWriterLock](https://learn.microsoft.com/en-us/dotnet/api/system.threading.readerwriterlockslim?view=net-9.0&redirectedfrom=MSDN) for entityCache to allow for multithreaded read access while retaining single-threaded write access. +- NuGet packaging and corresponding README documentation - Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?) - Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support -- Implement environment variable use in CLI -- fix the `--help` functionality -- Rename `cli` to something unique but still short, e.g. `escli`? +- Remove the CLI - Improve error messaging for when retrieving a searchdomain fails. - Remove the `id` collumns from the database tables where the table is actually identified (and should be unique by) the name, which should become the new primary key. - Improve performance & latency (Create ready-to-go processes where each contain an n'th share of the entity cache, ready to perform a query. Prepare it after creating the entity cache.) 
-- Write a Linux installer for the CLI tool - Make the API server (and indexer, once it is done) a docker container # Future features diff --git a/docs/Indexer.md b/docs/Indexer.md index e69de29..ad3b06d 100644 --- a/docs/Indexer.md +++ b/docs/Indexer.md @@ -0,0 +1,83 @@ +# Overview + +## Installing the dependencies +## Ubuntu 24.04 +1. Install the .NET SDK: `sudo apt update && sudo apt install dotnet-sdk-8.0 -y` +2. Install the python SDK: `sudo apt install python3 python3.12 python3.12-dev` +## Windows +Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL: +1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`) +2. Enter your WSL environment `wsl.exe` and configure it +3. Update via `sudo apt update && sudo apt upgrade -y && sudo snap refresh` +4. Continue here: [Ubuntu 24.04](#Ubuntu-24.04) +# Configuration +The configuration is located in `src/Indexer` and conforms to the [ASP.NET configuration design pattern](https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-9.0), i.e. `src/Indexer/appsettings.json` is the base configuration, and `/src/Indexer/appsettings.Development.json` overrides it. + +If you plan to use multiple environments, create any `appsettings.{YourEnvironment}.json` (e.g. `Development`, `Staging`, `Prod`) and set the environment variable `DOTNET_ENVIRONMENT` accordingly on the target machine. +## Setup +If you just installed the server and want to configure it: +1. Open `src/server/appsettings.Development.json` +2. If your search server is not on the same machine as the indexer, update "BaseUri" to reflect the URL to the server. +3. If your search server requires API keys, (i.e. it's operating outside of the "Development" environment) set `"ApiKey": ""` beneath `"BaseUri"` in the `"Embeddingsearch"` section. +4. 
Create your own indexing script(s) in `src/Indexer/Scripts/` and configure their use as +## Structure +```json + "EmbeddingsearchIndexer": { + "Worker": + [ // This is a list; you can have as many "workers" as you want + { + "Name": "example", + "Searchdomains": [ + "example" + ], + "Script": "Scripts/example.py", + "Calls": [ // This is also a list. You can have as many different calls as you need. + { + "Type": "interval", // See: Call types + "Interval": 60000 + } + ] + } + ] + } +``` +## Call types +- `interval` + - What does it do: The script gets called periodically based on the specified `Interval` parameter. + - Parameters: + - Interval (in milliseconds) +- `schedule` (WIP) + - What does it do: The script gets called based on the provided schedule + - Parameters: (WIP) +- `fileupdate` (WIP) + - What does it do: The script gets called whenever a file is updated in the specified subdirectory + - Parameters: (WIP) +# Scripting +## Python +To ease scripting, `tools.py` contains all definitions of the .NET objects passed to the script. This includes attributes and methods. + +These are not yet defined in a way that makes them 100% interoperable with the .NET CLR, meaning some methods that require anything more than strings or other simple data types to be passed are not yet supported. (WIP) +### Required elements +Here is an overview of required elements by example: +```python +from tools import * # Import all tools that are provided for ease of scripting + +def init(toolset: Toolset): # defining an init() function with 1 parameter is required. + pass # Your code would go here. + # DO NOT put a main loop here! Why? + # Until init() returns, the application cannot finish initializing and the script keeps exclusive control over the GIL + +def update(toolset: Toolset): # defining an update() function with 1 parameter is required. + pass # Your code would go here. 
+``` +### Using the toolset passed by the .NET CLR +The use of the toolset is laid out in good example by `src/Indexer/Scripts/example.py`. + +Currently, `Toolset`, as provided by the IndexerService to the Python script, contains 3 elements: +1. (only for `update`, not `init`) `callbackInfos` - an object that provides all information regarding the callback. (e.g. what file was updated) +2. `client` - a .NET object that has the functions as described in `src/Indexer/Scripts/tools.py`. It's the client that - according to the configuration - communicates with the search server and executes the API calls. +3. `filePath` - the path to the script, as specified in the configuration +## Golang +TODO +## Javascript +TODO \ No newline at end of file diff --git a/docs/Server.md b/docs/Server.md index 71be705..f9c0e2e 100644 --- a/docs/Server.md +++ b/docs/Server.md @@ -1,4 +1,4 @@ -# Server +# Overview The server by default - runs on port 5146 - Uses Swagger UI in development mode (`/swagger/index.html`) @@ -6,7 +6,7 @@ The server by default # Installing the dependencies ## Ubuntu 24.04 -1. `sudo apt update && sudo apt install dotnet-sdk-8.0 -y` +1. Install the .NET SDK: `sudo apt update && sudo apt install dotnet-sdk-8.0 -y` ## Windows Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL: 1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`) @@ -19,8 +19,39 @@ Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow t - Linux/WSL: `sudo apt install mysql-server` - Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/) 2. connect to it: `sudo mysql -u root` (Or from outside of WSL: `mysql -u root`) -3. Create the database +3. Create the database: `CREATE DATABASE embeddingsearch; use embeddingsearch;` -4. Create the user +4. Create the user (replace "somepassword! 
with a secure password): `CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;` -5. Create the tables using the CLI tool: `dotnet build` and `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup` +5. Create the tables using the CLI tool: `cd src/cli; dotnet build` and `bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup` (replace the variables with the actual values) + +# Configuration +## Environments +The configuration is located in `src/server/` and conforms to the [ASP.NET configuration design pattern](https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-9.0), i.e. `src/server/appsettings.json` is the base configuration, and `src/server/appsettings.Development.json` overrides it. + +If you plan to use multiple environments, create any `appsettings.{YourEnvironment}.json` (e.g. `Development`, `Staging`, `Prod`) and set the environment variable `DOTNET_ENVIRONMENT` accordingly on the target machine. +## Setup +If you just installed the server and want to configure it: +1. Open `src/server/appsettings.Development.json` +2. Change the password in the "SQL" section (`pwd=;`) +3. If your Ollama instance does not run locally, update "OllamaURL" to point at your Ollama instance. +4. If you plan on using the server in production: + 1. Set the environment variable `DOTNET_ENVIRONMENT` to something that is not "Development" (e.g. "Prod"). + 2. Rename `appsettings.Development.json`, replacing "Development" with whatever you chose (e.g. "Prod"). + 3. Set API keys in the "ApiKeys" section (generate keys using the `uuid` command on Linux) + +# API +## Accessing the API +Once started, the server's API can comfortably be viewed and manipulated via Swagger. 
+ +By default, it is accessible at: `http://localhost:5146/swagger/index.html` + +To make an API request from within Swagger: +1. Open one of the actions ("GET" / "POST") +2. Click the "Try it out" button. The input fields (if there are any for your action) should now be editable. +3. Fill in the necessary information +4. Click "Execute" +## Restricting access +API keys do **not** get checked in the Development environment! + +Set up a non-development environment as described in [Configuration>Setup](#setup) to enable API key authentication. \ No newline at end of file
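The production steps from the server's Setup section and the access rule under "Restricting access" can be sketched in a few lines of Python. This is only an illustration, not part of the server: the helper names (`settings_file_name`, `api_keys_checked`, `generate_api_keys`) are hypothetical, the case-insensitive comparison of the environment name is an assumption, and it is assumed that any UUID string is acceptable as an API key.

```python
import uuid


def settings_file_name(environment: str) -> str:
    """Build the per-environment settings file name, e.g. appsettings.Prod.json."""
    return f"appsettings.{environment}.json"


def api_keys_checked(environment: str) -> bool:
    """Return True if API keys would be checked for this environment name.

    Per the docs, keys are not checked in the Development environment.
    Assumption: the environment name is compared case-insensitively.
    """
    return environment.strip().lower() != "development"


def generate_api_keys(count: int) -> list[str]:
    """Generate random UUID strings (version 4) to use as API keys.

    Assumption: any UUID string is accepted as a key by the server.
    """
    return [str(uuid.uuid4()) for _ in range(count)]


if __name__ == "__main__":
    env = "Prod"  # the value you would also set in DOTNET_ENVIRONMENT
    print("Settings file:", settings_file_name(env))
    print("API keys checked:", api_keys_checked(env))
    for key in generate_api_keys(2):
        print("Generated key:", key)
```

The generated keys still have to be entered by hand into the "ApiKeys" section of the settings file; the sketch deliberately does not write to the file, since the exact layout of that section is not described here.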