Improved documentation
168
README.md
@@ -3,150 +3,72 @@
Embeddingsearch is a .NET C# library that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries.

This repository comes with
- the library
- a server
- a client-side library
- a ready-to-use CLI module
- a REST API server (WIP), in case you want to process the data on a remote server or make it available to other languages.
- an API, in case you want to process the data on a remote server or make it available to other languages.

(Reranker support is planned, but its integration is not yet conceptualized.)
(Currently only initial retrieval is implemented.
Reranker support is planned, but its integration is not yet conceptualized.)
# How to set up
## server
1. Install [ollama](https://ollama.com/download)
2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
3. [Install the dependencies](#installing-the-dependencies)
4. [Set up a local MySQL database](#mysql-database-setup)
3. [Install the dependencies](docs/Server.md#installing-the-dependencies)
4. [Set up a local MySQL database](docs/Server.md#mysql-database-setup)
5. (optional) [Create a searchdomain](#create-a-searchdomain)
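
Step 2 can be scripted. The following is a minimal sketch, assuming a POSIX-like shell with `ollama` on the `PATH`; the `pull_models` function name is illustrative, and the model list simply repeats the examples from step 2:

```shell
# Models named in step 2; trim the list to the ones you actually need.
models="paraphrase-multilingual bge-m3 mxbai-embed-large nomic-embed-text"

# Hypothetical helper: pulls each model in turn via `ollama pull`.
pull_models() {
    for model in $models; do
        ollama pull "$model" || return 1   # stop at the first failure
    done
}
```

Calling `pull_models` once after installing ollama should be enough; models that are already present are not an error.
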
# Installing the dependencies
## Ubuntu 24.04
1. `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
## Windows
Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
2. Enter your WSL environment (`wsl.exe`) and configure it
3. Update via `sudo apt update && sudo apt upgrade -y && sudo snap refresh`
4. Continue with [Ubuntu 24.04](#Ubuntu-24.04)

# MySQL database setup
1. Install the MySQL server:
   - Linux/WSL: `sudo apt install mysql-server`
   - Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/)
2. Connect to it: `sudo mysql -u root` (or from outside of WSL: `mysql -u root`)
3. Create the database
   `CREATE DATABASE embeddingsearch; use embeddingsearch;`
4. Create the user
   `CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;`
5. Create the tables: run `dotnet build`, then `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup`

# Using the CLI
Before anything else, follow these steps:
1. Enter the project's `src` directory (used as the working directory in all examples)
2. Build the project: `dotnet build`

All user-defined parameters are denoted using the `$` symbol. For example, `$mysql_ip` means: replace this with your MySQL IP address or set it as a local variable in your terminal session.

## Database
### Create or check
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --create [--setup]`

Without the `--setup` parameter, a dry run is performed: no changes are made, and the CLI only checks that the database is readable and that all tables exist.

## Searchdomain
### Create a searchdomain
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --create -s $searchdomain_name`

Creates a searchdomain named `$searchdomain_name`.

### List searchdomains
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --list`

Lists all searchdomains.

### Update searchdomain
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --update -s $searchdomain_name [-n $searchdomain_newname] [-S $searchdomain_newsettings]`

Sets a new name and/or updates the settings for the searchdomain.

### Delete searchdomain
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --delete -s $searchdomain_name`

Deletes a searchdomain and its corresponding entities.

## Entity
### Create / Index entity
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e $entities_as_JSON`

Creates the entities from the JSON string passed as `$entities_as_JSON`.

Example:
- Linux: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{"name": "myfile.txt", "probmethod": "weighted_average", "searchdomain": "mysearchdomain", "attributes": {"mimetype": "text-plain"}, "datapoints": [{"name": "text", "text": "this is the full text", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}, {"name": "filepath", "text": "/home/myuser/myfile.txt", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}]}]'`
- Powershell: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{\"name\": \"myfile.txt\", \"probmethod\": \"weighted_average\", \"searchdomain\": \"mysearchdomain\", \"attributes\": {\"mimetype\": \"text-plain\"}, \"datapoints\": [{\"name\": \"text\", \"text\": \"this is the full text\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}, {\"name\": \"filepath\", \"text\": \"\/home\/myuser\/myfile.txt\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}]}]'`

Only the JSON:
```json
[
    {
        "name": "myfile.txt",
        "probmethod": "weighted_average",
        "searchdomain": "mysearchdomain",
        "attributes": {
            "mimetype": "text-plain"
        },
        "datapoints": [
            {
                "name": "text",
                "text": "this is the full text",
                "probmethod_embedding": "weighted_average",
                "model": [
                    "bge-m3",
                    "nomic-embed-text",
                    "paraphrase-multilingual"
                ]
            },
            {
                "name": "filepath",
                "text": "/home/myuser/myfile.txt",
                "probmethod_embedding": "weighted_average",
                "model": [
                    "bge-m3",
                    "nomic-embed-text",
                    "paraphrase-multilingual"
                ]
            }
        ]
    }
]
```
### Evaluate query (i.e. "search"; that's what you're here for)
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --evaluate -o $ollama_URL -s $searchdomain_name -q $query_string [-n $max_results]`

Executes a search using the specified query string and outputs the results.

### List entities
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --list -s $searchdomain_name`

Lists all entities in the searchdomain, together with their attributes, datapoints, and probmethod.

### Delete entity
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --remove -s $searchdomain_name -n $entity_name`

Deletes the entity specified by `$entity_name`.
## indexer
TBD
## client
1. Download the package and add it to your project (TODO: NuGet)
2. Create a new client by either:
   1. Injecting IConfiguration (e.g. `services.AddSingleton<Client>();`)
   2. Specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)

# Known issues
| Issue | Solution |
| --- | --- |
| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap. Try using `/usr/bin/dotnet` instead of `dotnet`. |
| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using `/usr/bin/dotnet` instead of `dotnet`. |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |

# To-do
- (High priority) Add default indexer
  - Library
    - Processing:
      - Text / Markdown documents: file name, full text, paragraphs
      - Documents
        - PDF: file name, full text, headline?, paragraphs, images?
        - odt/docx: file name, full text, headline?, images?
        - msg/eml: file name, title, recipients, cc, text
      - Images: file name, OCR, image description?
      - Videos?
      - Presentations (Impress/Powerpoint): file name, full text, first slide title, titles, slide texts
      - Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
      - Other? (TBD)
  - Server
    - Scripting capability (Python; perhaps also Lua)
    - Intended sourcing possibilities:
      - Local/Remote files (CIFS, SMB, FTP)
      - Database contents (MySQL, MSSQL)
      - Web requests (e.g. manual crawling)
    - Script call management (interval based & event based)
- NuGet packaging and corresponding README documentation
- Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
- Implement environment variable use in CLI
- Fix the `--help` functionality
- Rename `cli` to something unique but still short, e.g. `escli`?
- Improve error messaging for when retrieving a searchdomain fails.
- Remove the `id` columns from database tables whose rows are actually identified by (and should be unique by) name; the name should become the new primary key.
- Implement the API server
- Improve performance & latency (Create ready-to-go processes that each contain an n-th share of the entity cache, ready to perform a query. Prepare them after creating the entity cache.)
- Write a Linux installer for the CLI tool
- Make the API server a docker container
- Maybe add a config such that one does not need to always specify the MySQL connection info
- Make the API server (and indexer, once it is done) a docker container

# Future features
- Support for other database types (TSQL, SQLite)
- Support for other database types (MSSQL, SQLite)

# Community
[Discord](https://discord.gg/MUKeZM3k)
91
docs/CLI.md
Normal file
@@ -0,0 +1,91 @@
# Using the CLI
Before anything else, follow these steps:
1. Enter the project's `src` directory (used as the working directory in all examples)
2. Build the project: `dotnet build`

All user-defined parameters are denoted using the `$` symbol. For example, `$mysql_ip` means: replace this with your MySQL IP address or set it as a local variable in your terminal session.

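One way to avoid retyping the connection flags is to set the placeholders as shell variables and wrap the CLI call in a small function. The following is a minimal sketch, assuming a POSIX-like shell; the values and the `escli` function name are illustrative assumptions, not part of the project:

```shell
# Illustrative values only; replace them with your own MySQL settings.
mysql_ip='127.0.0.1'
mysql_port='3306'
mysql_username='embeddingsearch'
mysql_password='somepassword!'   # single quotes keep the "!" literal

# Path to the built CLI, as used throughout this document.
cli_bin='src/cli/bin/Debug/net8.0/cli'

# Hypothetical helper: prepends the connection flags to any CLI call.
escli() {
    "$cli_bin" -h "$mysql_ip" -p "$mysql_port" -U "$mysql_username" -P "$mysql_password" "$@"
}
```

With this in place, `escli --searchdomain --list` should be equivalent to spelling out the full command with all four connection flags.
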
## Database
### Create or check
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --create [--setup]`

Without the `--setup` parameter, a dry run is performed: no changes are made, and the CLI only checks that the database is readable and that all tables exist.

## Searchdomain
### Create a searchdomain
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --create -s $searchdomain_name`

Creates a searchdomain named `$searchdomain_name`.

### List searchdomains
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --list`

Lists all searchdomains.

### Update searchdomain
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --update -s $searchdomain_name [-n $searchdomain_newname] [-S $searchdomain_newsettings]`

Sets a new name and/or updates the settings for the searchdomain.

### Delete searchdomain
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --delete -s $searchdomain_name`

Deletes a searchdomain and its corresponding entities.

## Entity
### Create / Index entity
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e $entities_as_JSON`

Creates the entities from the JSON string passed as `$entities_as_JSON`.

Example:
- Linux: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{"name": "myfile.txt", "probmethod": "weighted_average", "searchdomain": "mysearchdomain", "attributes": {"mimetype": "text-plain"}, "datapoints": [{"name": "text", "text": "this is the full text", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}, {"name": "filepath", "text": "/home/myuser/myfile.txt", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}]}]'`
- Powershell: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{\"name\": \"myfile.txt\", \"probmethod\": \"weighted_average\", \"searchdomain\": \"mysearchdomain\", \"attributes\": {\"mimetype\": \"text-plain\"}, \"datapoints\": [{\"name\": \"text\", \"text\": \"this is the full text\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}, {\"name\": \"filepath\", \"text\": \"\/home\/myuser\/myfile.txt\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}]}]'`

Only the JSON:
```json
[
    {
        "name": "myfile.txt",
        "probmethod": "weighted_average",
        "searchdomain": "mysearchdomain",
        "attributes": {
            "mimetype": "text-plain"
        },
        "datapoints": [
            {
                "name": "text",
                "text": "this is the full text",
                "probmethod_embedding": "weighted_average",
                "model": [
                    "bge-m3",
                    "nomic-embed-text",
                    "paraphrase-multilingual"
                ]
            },
            {
                "name": "filepath",
                "text": "/home/myuser/myfile.txt",
                "probmethod_embedding": "weighted_average",
                "model": [
                    "bge-m3",
                    "nomic-embed-text",
                    "paraphrase-multilingual"
                ]
            }
        ]
    }
]
```
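
Because the payload is long, it can help to keep it in a shell variable rather than typing it inline. The following is a minimal sketch, assuming a POSIX-like shell; the variable name mirrors the `$entities_as_JSON` placeholder, and the payload is a trimmed, single-datapoint version of the example above. It also assumes the CLI parses the argument with a standard JSON parser, so that the line breaks are insignificant:

```shell
# Read the JSON payload into a variable using a quoted here-document,
# so the shell does not expand or alter anything inside it.
entities_as_JSON="$(cat <<'EOF'
[{"name": "myfile.txt", "probmethod": "weighted_average", "searchdomain": "mysearchdomain",
  "attributes": {"mimetype": "text-plain"},
  "datapoints": [{"name": "text", "text": "this is the full text",
                  "probmethod_embedding": "weighted_average", "model": ["bge-m3"]}]}]
EOF
)"
```

The variable can then be passed as `-e "$entities_as_JSON"`; the double quotes keep the whole payload together as a single argument.
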
### Evaluate query (i.e. "search"; that's what you're here for)
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --evaluate -o $ollama_URL -s $searchdomain_name -q $query_string [-n $max_results]`

Executes a search using the specified query string and outputs the results.

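A query usually contains spaces, so it needs quoting, and `-n` is optional. The following is a minimal sketch of a wrapper that handles both, assuming a POSIX-like shell; the `es_search` function name is hypothetical, and `http://localhost:11434` is Ollama's default local address, which may differ in your setup:

```shell
# Ollama's default local address; override by setting ollama_URL beforehand.
ollama_URL="${ollama_URL:-http://localhost:11434}"
cli_bin="${cli_bin:-src/cli/bin/Debug/net8.0/cli}"

# Hypothetical helper. Usage: es_search <searchdomain> <query> [max_results]
# Expects mysql_ip, mysql_port, mysql_username and mysql_password to be set.
es_search() {
    # Quoting "$2" keeps a query with spaces together as one argument.
    # ${3:+...} appends "-n <max_results>" only when a third argument is given.
    "$cli_bin" -h "$mysql_ip" -p "$mysql_port" -U "$mysql_username" -P "$mysql_password" \
        --entity --evaluate -o "$ollama_URL" -s "$1" -q "$2" ${3:+-n "$3"}
}
```

For example, `es_search mysearchdomain "this is the full text" 5` would request at most five results from `mysearchdomain`.
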
### List entities
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --list -s $searchdomain_name`

Lists all entities in the searchdomain, together with their attributes, datapoints, and probmethod.

### Delete entity
`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --remove -s $searchdomain_name -n $entity_name`

Deletes the entity specified by `$entity_name`.
0
docs/Indexer.md
Normal file
26
docs/Server.md
Normal file
@@ -0,0 +1,26 @@
# Server
By default, the server
- runs on port 5146
- uses Swagger UI in development mode (`/swagger/index.html`)
- ignores API keys when not in development mode

# Installing the dependencies
## Ubuntu 24.04
1. `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
## Windows
Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
2. Enter your WSL environment (`wsl.exe`) and configure it
3. Update via `sudo apt update && sudo apt upgrade -y && sudo snap refresh`
4. Continue here: [Ubuntu 24.04](#Ubuntu-24.04)

# MySQL database setup
1. Install the MySQL server:
   - Linux/WSL: `sudo apt install mysql-server`
   - Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/)
2. Connect to it: `sudo mysql -u root` (or from outside of WSL: `mysql -u root`)
3. Create the database
   `CREATE DATABASE embeddingsearch; use embeddingsearch;`
4. Create the user
   `CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;`
5. Create the tables using the CLI tool: run `dotnet build`, then `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup`