Updated README to reflect the recent addition of the Indexer
44
README.md
@@ -3,27 +3,34 @@
|
|||||||
Embeddingsearch is a DotNet C# library that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries.
|
Embeddingsearch is a DotNet C# library that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries.
|
||||||
|
|
||||||
This repository comes with
|
This repository comes with
|
||||||
- a server (accessible via API calls)
|
- a server (accessible via API calls & swagger)
|
||||||
- a clientside library
|
- a clientside library
|
||||||
- a CLI module
|
- a CLI module (deprecated)
|
||||||
- an indexer (TBD)
|
- a scripting based indexer service that supports
|
||||||
|
- Python
|
||||||
|
- Golang (WIP)
|
||||||
|
- Javascript (WIP)
|
||||||
|
|
||||||
(Currently only initial retrieval is implemented.
|
|
||||||
Reranker support is planned, but its integration is not yet conceptualized.)
|
|
||||||
# How to set up / use
|
# How to set up / use
|
||||||
## server
|
## server
|
||||||
1. Install [ollama](https://ollama.com/download)
|
1. Install [ollama](https://ollama.com/download)
|
||||||
2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
|
2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
|
||||||
3. [Install the dependencies](docs/Server.md#installing-the-dependencies)
|
3. [Install the dependencies](docs/Server.md#installing-the-dependencies)
|
||||||
4. [Set up a local mysql database](docs/Server.md#mysql-database-setup)
|
4. [Set up a local mysql database](docs/Server.md#mysql-database-setup)
|
||||||
5. (optional) [Create a searchdomain](#create-a-searchdomain)
|
5. [Set up the configuration](docs/Server.md#setup)
|
||||||
|
6. In `src/server` execute `dotnet build && dotnet run` to start the server
|
||||||
|
7. (optional) [Create a searchdomain using the web interface](docs/Server.md#accessing-the-api)
|
||||||
## client
|
## client
|
||||||
1. Download the package and add it to your project (TODO: NuGet)
|
1. Download the package and add it to your project (TODO: NuGet)
|
||||||
2. Create a new client by either:
|
2. Create a new client by either:
|
||||||
1. By injecting IConfiguration (e.g. `services.AddSingleton<Client>();`)
|
1. By injecting IConfiguration (e.g. `services.AddSingleton<Client>();`)
|
||||||
2. By specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)
|
2. By specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)
|
||||||
## indexer
|
## indexer
|
||||||
TBD
|
1. [Install the dependencies](docs/Indexer.md#installing-the-dependencies)
|
||||||
|
2. [Set up the server](#server)
|
||||||
|
3. [Configure the indexer](docs/Indexer.md#configuration)
|
||||||
|
4. [Set up your indexing script(s)](docs/Indexer.md#scripting)
|
||||||
|
5. Run with `dotnet build && dotnet run` (Or `/usr/bin/dotnet build && /usr/bin/dotnet run`)
|
||||||
## CLI
|
## CLI
|
||||||
Before anything follow these steps:
|
Before anything follow these steps:
|
||||||
1. Enter the project's `src` directory (used as the working directory in all examples)
|
1. Enter the project's `src` directory (used as the working directory in all examples)
|
||||||
@@ -37,7 +44,7 @@ All commands, parameters and examples are documented here: [docs/CLI.md](docs/CL
|
|||||||
| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using `/usr/bin/dotnet` instead of `dotnet`. |
|
| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using `/usr/bin/dotnet` instead of `dotnet`. |
|
||||||
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
|
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
|
||||||
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |
|
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |
|
||||||
|
| System.DllNotFoundException: Could not load libpython3.12.so with flags RTLD_NOW \| RTLD_GLOBAL: libpython3.12.so: cannot open shared object file: No such file or directory | Install python3.12-dev via apt |
|
||||||
# To-do
|
# To-do
|
||||||
- (High priority) Add default indexer
|
- (High priority) Add default indexer
|
||||||
- Library
|
- Library
|
||||||
@@ -53,22 +60,21 @@ All commands, parameters and examples are documented here: [docs/CLI.md](docs/CL
|
|||||||
- Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
|
- Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
|
||||||
- Other? (TBD)
|
- Other? (TBD)
|
||||||
- Server
|
- Server
|
||||||
- Scripting capability (Python; perhaps also lua)
|
- ~~Scripting capability (Python; perhaps also lua)~~ (Done with the latest commits)
|
||||||
- Intended sourcing possibilities:
|
- ~~Intended sourcing possibilities:~~
|
||||||
- Local/Remote files (CIFS, SMB, FTP)
|
- ~~Local/Remote files (CIFS, SMB, FTP)~~
|
||||||
- Database contents (MySQL, MSSQL)
|
- ~~Database contents (MySQL, MSSQL)~~
|
||||||
- Web requests (E.g. manual crawling)
|
- ~~Web requests (E.g. manual crawling)~~
|
||||||
- Script call management (interval based & event based)
|
- ~~Script call management (interval based & event based)~~
|
||||||
- NuGet packaging and according README documentation
|
- Implement Healthz check
|
||||||
|
- Implement [ReaderWriterLock](https://learn.microsoft.com/en-us/dotnet/api/system.threading.readerwriterlockslim?view=net-9.0&redirectedfrom=MSDN) for entityCache to allow for multithreaded read access while retaining single-threaded write access.
|
||||||
|
- NuGet packaging and corresponding README documentation
|
||||||
- Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
|
- Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
|
||||||
- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
|
- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
|
||||||
- Implement environment variable use in CLI
|
- Remove the CLI
|
||||||
- fix the `--help` functionality
|
|
||||||
- Rename `cli` to something unique but still short, e.g. `escli`?
|
|
||||||
- Improve error messaging for when retrieving a searchdomain fails.
|
- Improve error messaging for when retrieving a searchdomain fails.
|
||||||
- Remove the `id` columns from the database tables where the table is actually identified by (and should be unique by) the name, which should become the new primary key.
|
- Remove the `id` columns from the database tables where the table is actually identified by (and should be unique by) the name, which should become the new primary key.
|
||||||
- Improve performance & latency (Create ready-to-go processes that each contain an n'th share of the entity cache, ready to perform a query. Prepare them after creating the entity cache.)
|
- Improve performance & latency (Create ready-to-go processes that each contain an n'th share of the entity cache, ready to perform a query. Prepare them after creating the entity cache.)
|
||||||
- Write a Linux installer for the CLI tool
|
|
||||||
- Make the API server (and indexer, once it is done) a docker container
|
- Make the API server (and indexer, once it is done) a docker container
|
||||||
|
|
||||||
# Future features
|
# Future features
|
||||||
|
|||||||
@@ -0,0 +1,83 @@
|
|||||||
|
# Overview
|
||||||
|
|
||||||
|
## Installing the dependencies
|
||||||
|
## Ubuntu 24.04
|
||||||
|
1. Install the .NET SDK: `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
|
||||||
|
2. Install the python SDK: `sudo apt install python3 python3.12 python3.12-dev`
|
||||||
|
## Windows
|
||||||
|
Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
|
||||||
|
1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
|
||||||
|
2. Enter your WSL environment `wsl.exe` and configure it
|
||||||
|
3. Update via `sudo apt update && sudo apt upgrade -y && sudo snap refresh`
|
||||||
|
4. Continue here: [Ubuntu 24.04](#ubuntu-2404)
|
||||||
|
# Configuration
|
||||||
|
The configuration is located in `src/Indexer` and conforms to the [ASP.NET configuration design pattern](https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-9.0), i.e. `src/Indexer/appsettings.json` is the base configuration, and `/src/Indexer/appsettings.Development.json` overrides it.
|
||||||
|
|
||||||
|
If you plan to use multiple environments, create any `appsettings.{YourEnvironment}.json` (e.g. `Development`, `Staging`, `Prod`) and set the environment variable `DOTNET_ENVIRONMENT` accordingly on the target machine.
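
The layering described above can be pictured with a short Python sketch. It only illustrates the override idea and is not how .NET implements it: `deep_merge` is a hypothetical helper, and the two dictionaries are invented stand-ins for `appsettings.json` and `appsettings.Development.json`.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which values from `override` win over `base`.

    Nested dicts are merged key by key; every other value is replaced.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Invented stand-ins for the base file and the environment-specific file.
base_settings = {"Embeddingsearch": {"BaseUri": "http://localhost:5146"}}
development_settings = {"Embeddingsearch": {"ApiKey": "<your key here>"}}

effective = deep_merge(base_settings, development_settings)
print(effective)
```

Keys that only exist in the base file survive, and keys present in both files take the environment-specific value. Note that .NET flattens JSON arrays into indexed keys, so lists are overridden per index there, whereas this sketch replaces them as a whole.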
|
||||||
|
## Setup
|
||||||
|
If you just installed the indexer and want to configure it:
|
||||||
|
1. Open `src/Indexer/appsettings.Development.json`
|
||||||
|
2. If your search server is not on the same machine as the indexer, update "BaseUri" to reflect the URL to the server.
|
||||||
|
3. If your search server requires API keys, (i.e. it's operating outside of the "Development" environment) set `"ApiKey": "<your key here>"` beneath `"BaseUri"` in the `"Embeddingsearch"` section.
|
||||||
|
4. Create your own indexing script(s) in `src/Indexer/Scripts/` and configure their use as described under [Structure](#structure).
|
||||||
|
## Structure
|
||||||
|
```json
"EmbeddingsearchIndexer": {
    "Worker":
    [ // This is a list; you can have as many "workers" as you want
        {
            "Name": "example",
            "Searchdomains": [
                "example"
            ],
            "Script": "Scripts/example.py",
            "Calls": [ // This is also a list. You can have as many different calls as you need.
                {
                    "Type": "interval", // See: Call types
                    "Interval": 60000
                }
            ]
        }
    ]
}
```
|
||||||
|
## Call types
|
||||||
|
- `interval`
|
||||||
|
- What does it do: The script gets called periodically based on the specified `Interval` parameter.
|
||||||
|
- Parameters:
|
||||||
|
- Interval (in milliseconds)
|
||||||
|
- `schedule` (WIP)
|
||||||
|
- What does it do: The script gets called based on the provided schedule
|
||||||
|
- Parameters: (WIP)
|
||||||
|
- `fileupdate` (WIP)
|
||||||
|
- What does it do: The script gets called whenever a file is updated in the specified subdirectory
|
||||||
|
- Parameters: (WIP)
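
To catch typos in a worker's `Calls` list before starting the indexer, the entries could be sanity-checked with a few lines of Python. This is a minimal sketch under assumptions: `validate_call` and `KNOWN_CALL_TYPES` are hypothetical names that are not part of the indexer, and only the `interval` parameters are checked because the other call types are still WIP.

```python
KNOWN_CALL_TYPES = {"interval", "schedule", "fileupdate"}


def validate_call(call: dict) -> list[str]:
    """Return a list of problems found in one entry of a worker's "Calls" list."""
    problems = []
    call_type = call.get("Type")
    if call_type not in KNOWN_CALL_TYPES:
        problems.append(f"unknown call type: {call_type!r}")
    elif call_type == "interval":
        interval = call.get("Interval")
        # Interval is given in milliseconds and must be a positive number.
        is_number = isinstance(interval, (int, float)) and not isinstance(interval, bool)
        if not is_number or interval <= 0:
            problems.append("interval calls need a positive Interval (milliseconds)")
    return problems


print(validate_call({"Type": "interval", "Interval": 60000}))
print(validate_call({"Type": "interval"}))
print(validate_call({"Type": "cron"}))
```

An empty list means the entry looks fine; each string describes one problem.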
|
||||||
|
# Scripting
|
||||||
|
## Python
|
||||||
|
To ease scripting, `tools.py` contains definitions of all .NET objects passed to the script, including their attributes and methods.
|
||||||
|
|
||||||
|
These definitions are not yet fully interoperable with the .NET CLR: methods that require anything more complex than strings or other simple data types as arguments are not yet supported. (WIP)
|
||||||
|
### Required elements
|
||||||
|
Here is an overview of required elements by example:
|
||||||
|
```python
from tools import * # Import all tools that are provided for ease of scripting

def init(toolset: Toolset): # defining an init() function with 1 parameter is required.
    pass # Your code would go here.
    # DO NOT put a main loop here! Why?
    # A main loop would keep init() from returning, which prevents the application from finishing initialization and keeps exclusive control over the GIL.

def update(toolset: Toolset): # defining an update() function with 1 parameter is required.
    pass # Your code would go here.
```
|
||||||
|
### Using the toolset passed by the .NET CLR
|
||||||
|
`src/Indexer/Scripts/example.py` demonstrates how to use the toolset.
|
||||||
|
|
||||||
|
Currently, `Toolset`, as provided by the IndexerService to the Python script, contains 3 elements:
|
||||||
|
1. (only for `update`, not `init`) `callbackInfos` - an object that provides all information regarding the callback. (e.g. what file was updated)
|
||||||
|
2. `client` - a .NET object that has the functions as described in `src/Indexer/Scripts/tools.py`. It's the client that - according to the configuration - communicates with the search server and executes the API calls.
|
||||||
|
3. `filePath` - the path to the script, as specified in the configuration
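
To try out a script's `init`/`update` pair without running the indexer, the toolset can be replaced by a small stand-in. The sketch below is an assumption-laden dry run: `FakeToolset` is a hypothetical class that only mirrors the three element names listed above, the shape of `callbackInfos` is invented (the real one is a .NET object), and no `client` methods are called because their definitions live in `tools.py`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeToolset:
    """Hypothetical stand-in for the Toolset that the indexer would pass in."""
    filePath: str
    client: Any = None
    callbackInfos: Optional[Any] = None  # only set for update()


seen = []  # collects what the script observed so it can be inspected afterwards


def init(toolset):
    # Runs once; do the setup and return (no main loop here).
    seen.append(("init", toolset.filePath))


def update(toolset):
    # Runs on every call, e.g. on each interval tick.
    seen.append(("update", toolset.callbackInfos))


# Dry run: call init() once, then update() twice with made-up callback data.
init(FakeToolset(filePath="Scripts/example.py"))
for tick in range(2):
    update(FakeToolset(filePath="Scripts/example.py", callbackInfos={"tick": tick}))
print(seen)
```

In a real script the two functions keep their names and single parameter, while the stand-in class and the dry-run loop are dropped.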
|
||||||
|
## Golang
|
||||||
|
TODO
|
||||||
|
## Javascript
|
||||||
|
TODO
|
||||||
@@ -1,4 +1,4 @@
|
|||||||
# Server
|
# Overview
|
||||||
The server by default
|
The server by default
|
||||||
- runs on port 5146
|
- runs on port 5146
|
||||||
- Uses Swagger UI in development mode (`/swagger/index.html`)
|
- Uses Swagger UI in development mode (`/swagger/index.html`)
|
||||||
@@ -6,7 +6,7 @@ The server by default
|
|||||||
|
|
||||||
# Installing the dependencies
|
# Installing the dependencies
|
||||||
## Ubuntu 24.04
|
## Ubuntu 24.04
|
||||||
1. `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
|
1. Install the .NET SDK: `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
|
||||||
## Windows
|
## Windows
|
||||||
Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
|
Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
|
||||||
1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
|
1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
|
||||||
@@ -19,8 +19,39 @@ Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow t
|
|||||||
- Linux/WSL: `sudo apt install mysql-server`
|
- Linux/WSL: `sudo apt install mysql-server`
|
||||||
- Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/)
|
- Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/)
|
||||||
2. connect to it: `sudo mysql -u root` (Or from outside of WSL: `mysql -u root`)
|
2. connect to it: `sudo mysql -u root` (Or from outside of WSL: `mysql -u root`)
|
||||||
3. Create the database
|
3. Create the database:
|
||||||
`CREATE DATABASE embeddingsearch; use embeddingsearch;`
|
`CREATE DATABASE embeddingsearch; use embeddingsearch;`
|
||||||
4. Create the user
|
4. Create the user (replace "somepassword!" with a secure password):
|
||||||
`CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;`
|
`CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;`
|
||||||
5. Create the tables using the CLI tool: `dotnet build` and `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup`
|
5. Create the tables using the CLI tool: `cd src/cli; dotnet build` and `bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup` (replace the variables with the actual values)
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
## Environments
|
||||||
|
The configuration is located in `src/server/` and conforms to the [ASP.NET configuration design pattern](https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-9.0), i.e. `src/server/appsettings.json` is the base configuration, and `/src/server/appsettings.Development.json` overrides it.
|
||||||
|
|
||||||
|
If you plan to use multiple environments, create any `appsettings.{YourEnvironment}.json` (e.g. `Development`, `Staging`, `Prod`) and set the environment variable `DOTNET_ENVIRONMENT` accordingly on the target machine.
|
||||||
|
## Setup
|
||||||
|
If you just installed the server and want to configure it:
|
||||||
|
1. Open `src/server/appsettings.Development.json`
|
||||||
|
2. Change the password in the "SQL" section (`pwd=<your password goes here>;`)
|
||||||
|
3. If your Ollama instance does not run locally, update "OllamaURL" to point at your Ollama instance.
|
||||||
|
4. If you plan on using the server in production:
|
||||||
|
1. Set the environment variable `DOTNET_ENVIRONMENT` to something that is not "Development". (e.g. "Prod")
|
||||||
|
2. Rename the `appsettings.Development.json` - replace "Development" with whatever you chose. (e.g. "Prod")
|
||||||
|
3. Set API keys in the "ApiKeys" section (generate keys using the `uuid` command on Linux)
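
If the `uuid` command is not available, keys of the same kind can be generated with Python's standard library. This is a minimal sketch, assuming any UUID string is acceptable as a key; `generate_api_keys` is a hypothetical helper, and writing the keys into the "ApiKeys" section is left to you.

```python
import uuid


def generate_api_keys(count: int) -> list[str]:
    """Return `count` random (version 4) UUID strings to use as API keys."""
    return [str(uuid.uuid4()) for _ in range(count)]


keys = generate_api_keys(2)
for key in keys:
    print(key)
```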
|
||||||
|
|
||||||
|
# API
|
||||||
|
## Accessing the api
|
||||||
|
Once started, the server's API can comfortably be viewed and manipulated via swagger.
|
||||||
|
|
||||||
|
By default it is accessible under: `http://localhost:5146/swagger/index.html`
|
||||||
|
|
||||||
|
To make an API request from within swagger:
|
||||||
|
1. Open one of the actions ("GET" / "POST")
|
||||||
|
2. Click the "Try it out" button. The input fields (if there are any for your action) should now be editable.
|
||||||
|
3. Fill in the necessary information
|
||||||
|
4. Click "Execute"
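
Before opening the page in a browser, it can be handy to check from a script that the Swagger UI answers at all. The sketch below assumes the default address given above; `swagger_url` and `is_reachable` are hypothetical helpers, and no API endpoint of the server is called.

```python
import http.client
import urllib.error
import urllib.request


def swagger_url(host: str = "localhost", port: int = 5146) -> str:
    """Build the Swagger UI address from host and port."""
    return f"http://{host}:{port}/swagger/index.html"


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts and HTTP error responses.
        return False


url = swagger_url()
print(url, "is reachable" if is_reachable(url) else "is not reachable")
```

If the check fails although the server is running, keep in mind that the Swagger UI is only served in development mode.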
|
||||||
|
## Restricting access
|
||||||
|
API keys do **not** get checked in the Development environment!
|
||||||
|
|
||||||
|
Set up a non-development environment as described in [Configuration>Setup](#setup) to enable API key authentication.
|
||||||