diff --git a/README.md b/README.md
index d1b7f1e..187a106 100644
--- a/README.md
+++ b/README.md
@@ -3,150 +3,72 @@
 Embeddingsearch is a DotNet C# library that uses Embedding Similarity Search (similiarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries.
 This repository comes with
-- the library
+- a server
+- a clientside library
 - a ready-to-use CLI module
-- a REST API server (WIP) for if you want to process the data on a remote server or make it available to other languages.
+- an API for if you want to process the data on a remote server or make it available to other languages.
-(Reranker support is planned, but its integration is not yet conceptualized.)
+(Currently only initial retrieval is implemented.
+Reranker support is planned, but its integration is not yet conceptualized.)
 # How to set up
+## server
 1. Install [ollama](https://ollama.com/download)
 2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
-3. [Install the depencencies](#installing-the-dependencies)
-4. [Set up a local mysql database](#mysql-database-setup)
+3. [Install the depencencies](docs/Server.md#installing-the-dependencies)
+4. [Set up a local mysql database](docs/Server.md#mysql-database-setup)
 5. (optional) [Create a searchdomain](#create-a-searchdomain)
-
-# Installing the dependencies
-## Ubuntu 24.04
-1. `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
-## Windows
-Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
-1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
-2. Enter your WSL environment `wsl.exe` and configure it
-3. Update via `sudo apt update && sudo apt upgrade -y && sudo snap refresh`
-4. GOTO [Ubuntu 24.04](#Ubuntu-24.04)
-
-# MySQL database setup
-1. Install the MySQL server:
-- Linux/WSL: `sudo apt install mysql-server`
-- Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/)
-2. connect to it: `sudo mysql -u root` (Or from outside of WSL: `mysql -u root`)
-3. Create the database
-`CREATE DATABASE embeddingsearch; use embeddingsearch;`
-4. Create the user
-`CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;`
-5. Create the tables: `dotnet build` and `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup`
-
-# Using the CLI
-Before anything follow these steps:
-1. Enter the project's `src` directory (used as the working directory in all examples)
-2. Build the project: `dotnet build`
-All user-defined parameters are denoted using the `$` symbol. I.e. `$mysql_ip` means: replace this with your MySQL IP address or set it as a local variable in your terminal session.
-
-## Database
-### Create or check
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --create [--setup]`
-
-Without the `--setup` parameter a "dry-run" is performed. I.e. no actions are taken. Only the database is checked for read access and that all tables exist.
-
-## Searchdomain
-### Create a searchdomain
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --create -s $searchdomain_name`
-
-Creates the searchdomain as specified under `$searchdomain_name`
-
-### List searchdomains
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --list`
-
-List all searchdomains
-
-### Update searchdomain
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --update -s $searchdomain_name [-n $searchdomain_newname] [-S $searchdomain_newsettings]`
-
-Set a new name and/or update the settings for the searchdomain.
-
-### Delete searchdomain
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --delete -s $searchdomain_name`
-
-Deletes a searchdomain and its corresponding entites.
-
-## Entity
-### Create / Index entity
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e $entities_as_JSON`
-Creates the entities using the json string as specified under $entities_as_JSON
-
-Example:
-- Linux: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{"name": "myfile.txt", "probmethod": "weighted_average", "searchdomain": "mysearchdomain", "attributes": {"mimetype": "text-plain"}, "datapoints": [{"name": "text", "text": "this is the full text", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}, {"name": "filepath", "text": "/home/myuser/myfile.txt", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}]}]'`
-- Powershell: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{\"name\": \"myfile.txt\", \"probmethod\": \"weighted_average\", \"searchdomain\": \"mysearchdomain\", \"attributes\": {\"mimetype\": \"text-plain\"}, \"datapoints\": [{\"name\": \"text\", \"text\": \"this is the full text\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}, {\"name\": \"filepath\", \"text\": \"\/home\/myuser\/myfile.txt\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}]}]'`
-
-Only the json:
-```json
-[
-    {
-        "name": "myfile.txt",
-        "probmethod": "weighted_average",
-        "searchdomain": "mysearchdomain",
-        "attributes": {
-            "mimetype": "text-plain"
-        },
-        "datapoints": [
-            {
-                "name": "text",
-                "text": "this is the full text",
-                "probmethod_embedding": "weighted_average",
-                "model": [
-                    "bge-m3",
-                    "nomic-embed-text",
-                    "paraphrase-multilingual"
-                ]
-            },
-            {
-                "name": "filepath",
-                "text": "/home/myuser/myfile.txt",
-                "probmethod_embedding": "weighted_average",
-                "model": [
-                    "bge-m3",
-                    "nomic-embed-text",
-                    "paraphrase-multilingual"
-                ]
-            }
-        ]
-    }
-]
-```
-### Evaluate query (i.e. "search"; that what you're here for)
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --evaluate -o $ollama_URL -s $searchdomain_name -q $query_string [-n $max_results]`
-
-Executes a search using the specified query string and outputs the results.
-
-### List entities
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --list -s $searchdomain_name`
-
-Lists all entities in that domain (together with its attributes and datapoints and probmethod)
-
-### Delete entity
-`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --remove -s $searchdomain_name -n $entity_name`
-
-Deletes the entity specified by `$entity_name`.
+## indexer
+TBD
+## client
+1. Download the package and add it to your project (TODO: NuGet)
+2. Create a new client by either:
+    1. By injecting IConfiguration (e.g. `services.AddSingleton();`)
+    2. By specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)
 # Known issues
 | Issue | Solution |
 | --- | --- |
-| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap. Try using `/usr/bin/dotnet` instead of `dotnet`. |
+| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using `/usr/bin/dotnet` instead of `dotnet`. |
 | Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
 | Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |
 # To-do
+- (High priority) Add default indexer
+    - Library
+        - Processing:
+            - Text / Markdown documents: file name, full text, paragraphs
+            - Documents
+                - PDF: file name, full text, headline?, paragraphs, images?
+                - odt/docx: file name, full text, headline?, images?
+                - msg/eml: file name, title, recipients, cc, text
+            - Images: file name, OCR, image description?
+            - Videos?
+            - Presentations (Impress/Powerpoint): file name, full text, first slide title, titles, slide texts
+            - Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
+            - Other? (TBD)
+    - Server
+        - Scripting capability (Python; perhaps also lua)
+        - Intended sourcing possibilities:
+            - Local/Remote files (CIFS, SMB, FTP)
+            - Database contents (MySQL, MSSQL)
+            - Web requests (E.g. manual crawling)
+        - Script call management (interval based & event based)
+- NuGet packaging and according README documentation
 - Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
+- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
 - Implement environment variable use in CLI
 - fix the `--help` functionality
 - Rename `cli` to something unique but still short, e.g. `escli`?
 - Improve error messaging for when retrieving a searchdomain fails.
 - Remove the `id` collumns from the database tables where the table is actually identified (and should be unique by) the name, which should become the new primary key.
-- Implement the api server
 - Improve performance & latency (Create ready-to-go processes where each contain an n'th share of the entity cache, ready to perform a query. Prepare it after creating the entity cache.)
 - Write a Linux installer for the CLI tool
-- Make the API server a docker container
-- Maybe add a config such that one does not need to always specify the MySQL connection info
+- Make the API server (and indexer, once it is done) a docker container
 # Future features
-- Support for other database types (TSQL, SQLite)
+- Support for other database types (MSSQL, SQLite)
+
+
+# Community
+[![Discord
+Server](https://img.shields.io/badge/Join%20Discord-7289DA?style=flat&logo=discord&logoColor=whiteServer](https://img.shields.io/badge/Join%20Discord-7289DA?style=flat&logo=discord&lgoColor=white)](https://discord.gg/MUKeZM3k)
\ No newline at end of file
diff --git a/docs/CLI.md b/docs/CLI.md
new file mode 100644
index 0000000..84597f2
--- /dev/null
+++ b/docs/CLI.md
@@ -0,0 +1,91 @@
+# Using the CLI
+Before anything follow these steps:
+1. Enter the project's `src` directory (used as the working directory in all examples)
+2. Build the project: `dotnet build`
+All user-defined parameters are denoted using the `$` symbol. I.e. `$mysql_ip` means: replace this with your MySQL IP address or set it as a local variable in your terminal session.
+
+## Database
+### Create or check
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --create [--setup]`
+
+Without the `--setup` parameter a "dry-run" is performed. I.e. no actions are taken. Only the database is checked for read access and that all tables exist.
+
+## Searchdomain
+### Create a searchdomain
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --create -s $searchdomain_name`
+
+Creates the searchdomain as specified under `$searchdomain_name`
+
+### List searchdomains
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --list`
+
+List all searchdomains
+
+### Update searchdomain
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --update -s $searchdomain_name [-n $searchdomain_newname] [-S $searchdomain_newsettings]`
+
+Set a new name and/or update the settings for the searchdomain.
+
+### Delete searchdomain
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --searchdomain --delete -s $searchdomain_name`
+
+Deletes a searchdomain and its corresponding entites.
+
+## Entity
+### Create / Index entity
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e $entities_as_JSON`
+Creates the entities using the json string as specified under $entities_as_JSON
+
+Example:
+- Linux: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{"name": "myfile.txt", "probmethod": "weighted_average", "searchdomain": "mysearchdomain", "attributes": {"mimetype": "text-plain"}, "datapoints": [{"name": "text", "text": "this is the full text", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}, {"name": "filepath", "text": "/home/myuser/myfile.txt", "probmethod_embedding": "weighted_average", "model": ["bge-m3", "nomic-embed-text", "paraphrase-multilingual"]}]}]'`
+- Powershell: `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --index -o $ollama_URL -s $searchdomain_name -e '[{\"name\": \"myfile.txt\", \"probmethod\": \"weighted_average\", \"searchdomain\": \"mysearchdomain\", \"attributes\": {\"mimetype\": \"text-plain\"}, \"datapoints\": [{\"name\": \"text\", \"text\": \"this is the full text\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}, {\"name\": \"filepath\", \"text\": \"\/home\/myuser\/myfile.txt\", \"probmethod_embedding\": \"weighted_average\", \"model\": [\"bge-m3\", \"nomic-embed-text\", \"paraphrase-multilingual\"]}]}]'`
+
+Only the json:
+```json
+[
+    {
+        "name": "myfile.txt",
+        "probmethod": "weighted_average",
+        "searchdomain": "mysearchdomain",
+        "attributes": {
+            "mimetype": "text-plain"
+        },
+        "datapoints": [
+            {
+                "name": "text",
+                "text": "this is the full text",
+                "probmethod_embedding": "weighted_average",
+                "model": [
+                    "bge-m3",
+                    "nomic-embed-text",
+                    "paraphrase-multilingual"
+                ]
+            },
+            {
+                "name": "filepath",
+                "text": "/home/myuser/myfile.txt",
+                "probmethod_embedding": "weighted_average",
+                "model": [
+                    "bge-m3",
+                    "nomic-embed-text",
+                    "paraphrase-multilingual"
+                ]
+            }
+        ]
+    }
+]
+```
+### Evaluate query (i.e. "search"; that what you're here for)
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --evaluate -o $ollama_URL -s $searchdomain_name -q $query_string [-n $max_results]`
+
+Executes a search using the specified query string and outputs the results.
+
+### List entities
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --list -s $searchdomain_name`
+
+Lists all entities in that domain (together with its attributes and datapoints and probmethod)
+
+### Delete entity
+`src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --entity --remove -s $searchdomain_name -n $entity_name`
+
+Deletes the entity specified by `$entity_name`.
\ No newline at end of file
diff --git a/docs/Indexer.md b/docs/Indexer.md
new file mode 100644
index 0000000..e69de29
diff --git a/docs/Server.md b/docs/Server.md
new file mode 100644
index 0000000..71be705
--- /dev/null
+++ b/docs/Server.md
@@ -0,0 +1,26 @@
+# Server
+The server by default
+- runs on port 5146
+- Uses Swagger UI in development mode (`/swagger/index.html`)
+- Ignores API keys when not in development mode
+
+# Installing the dependencies
+## Ubuntu 24.04
+1. `sudo apt update && sudo apt install dotnet-sdk-8.0 -y`
+## Windows
+Download the [.NET SDK](https://dotnet.microsoft.com/en-us/download) or follow these steps to use WSL:
+1. Install Ubuntu in WSL (`wsl --install` and `wsl --install -d Ubuntu`)
+2. Enter your WSL environment `wsl.exe` and configure it
+3. Update via `sudo apt update && sudo apt upgrade -y && sudo snap refresh`
+4. Continue here: [Ubuntu 24.04](#Ubuntu-24.04)
+
+# MySQL database setup
+1. Install the MySQL server:
+- Linux/WSL: `sudo apt install mysql-server`
+- Windows: [MySQL Community Server](https://dev.mysql.com/downloads/mysql/)
+2. connect to it: `sudo mysql -u root` (Or from outside of WSL: `mysql -u root`)
+3. Create the database
+`CREATE DATABASE embeddingsearch; use embeddingsearch;`
+4. Create the user
+`CREATE USER 'embeddingsearch'@'%' identified by "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch; FLUSH PRIVILEGES;`
+5. Create the tables using the CLI tool: `dotnet build` and `src/cli/bin/Debug/net8.0/cli -h $mysql_ip -p $mysql_port -U $mysql_username -P $mysql_password --database --setup`