embeddingsearch
Embeddingsearch is a .NET C# library that uses embedding similarity search (similarly to Magna) to semantically compare a given input to a database of pre-processed entries.
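To illustrate the general idea, here is a minimal Python sketch of embedding similarity search. It assumes cosine similarity as the comparison metric and uses tiny hand-written vectors in place of real model output. The names `cosine_similarity` and `rank_entries` are hypothetical and not part of this library's API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)

def rank_entries(query_embedding, entries):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in entries.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional embeddings (illustrative values, not real model output)
entries = {
    "cats.txt": [0.9, 0.1, 0.0],
    "dogs.txt": [0.7, 0.3, 0.1],
    "taxes.txt": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]
ranking = rank_entries(query, entries)
```

In practice the vectors would come from an embedding model, and each entry would be scored against the query in the same way.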
This repository comes with
- a server (accessible via API calls & swagger)
- a clientside library
- a CLI module (deprecated)
- a scripting based indexer service that supports
  - Python
  - Golang (WIP)
  - Javascript (WIP)
How to set up / use
server
- Install ollama
- Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
- Install the dependencies
- Set up a local MySQL database
- Set up the configuration
- In `src/server` execute `dotnet build && dotnet run` to start the server
- (optional) Create a searchdomain using the web interface
client
- Download the package and add it to your project (TODO: NuGet)
- Create a new client by either:
  - Injecting IConfiguration (e.g. `services.AddSingleton<Client>();`)
  - Specifying the baseUri, apiKey, and searchdomain (e.g. `new Client.Client(baseUri, apiKey, searchdomain)`)
indexer
- Install the dependencies
- Set up the server
- Configure the indexer
- Set up your indexing script(s)
- Run with `dotnet build && dotnet run` (or `/usr/bin/dotnet build && /usr/bin/dotnet run`)
CLI
Before anything else, follow these steps:
- Enter the project's `src` directory (used as the working directory in all examples)
- Build the project: `dotnet build`

All user-defined parameters are denoted using the `$` symbol. For example, `$mysql_ip` means: replace this with your MySQL IP address or set it as a local variable in your terminal session.
All commands, parameters and examples are documented here: docs/CLI.md
Known issues
| Issue | Solution |
|---|---|
| Failed to load /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so, error: /snap/core20/current/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.29' not found (required by /usr/lib/dotnet/host/fxr/8.0.15/libhostfxr.so) | You likely installed dotnet via snap instead of apt. Try running the CLI using /usr/bin/dotnet instead of dotnet. |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Invalid attempt to access a field before calling Read() | The searchdomain you entered does not exist |
| Unhandled exception. MySql.Data.MySqlClient.MySqlException (0x80004005): Authentication to host 'localhost' for user 'embeddingsearch' using method 'caching_sha2_password' failed with message: Access denied for user 'embeddingsearch'@'localhost' (using password: YES) | TBD |
| System.DllNotFoundException: Could not load libpython3.12.so with flags RTLD_NOW \| RTLD_GLOBAL: libpython3.12.so: cannot open shared object file: No such file or directory | Install python3.12-dev via apt |
To-do
- (High priority) Add default indexer
  - Library
    - Processing:
      - Text / Markdown documents: file name, full text, paragraphs
      - Documents
        - PDF: file name, full text, headline?, paragraphs, images?
        - odt/docx: file name, full text, headline?, images?
        - msg/eml: file name, title, recipients, cc, text
      - Images: file name, OCR, image description?
      - Videos?
      - Presentations (Impress/Powerpoint): file name, full text, first slide title, titles, slide texts
      - Tables (Calc / Excel): file name, tab/page names?, full text (per tab/page)
      - Other? (TBD)
  - Server
    - Scripting capability (Python; perhaps also lua) (Done with the latest commits)
    - Intended sourcing possibilities:
      - Local/Remote files (CIFS, SMB, FTP)
      - Database contents (MySQL, MSSQL)
      - Web requests (E.g. manual crawling)
    - Script call management (interval based & event based)
- Implement hash value to reduce wasteful re-indexing (Perhaps as a default property for an entity, set by the default indexer)
- Implement Healthz check
- Implement ReaderWriterLock for entityCache to allow for multithreaded read access while retaining single-threaded write access.
- NuGet packaging and corresponding README documentation
- Add option for query result detail levels. e.g.:
  - Level 0: `{"Name": "...", "Value": 0.53}`
  - Level 1: `{"Name": "...", "Value": 0.53, "Datapoints": [{"Name": "title", "Value": 0.65}, {...}]}`
  - Level 2: `{"Name": "...", "Value": 0.53, "Datapoints": [{"Name": "title", "Value": 0.65, "Embeddings": [{"Model": "bge-m3", "Value": 0.87}, {...}]}, {...}]}`
- Add "Click-Through" result evaluation (For each entity: store a list of queries that led to the entity being chosen by the user. Then at query-time choose the best-fitting entry and maybe use it as another datapoint? Or use a separate weight function?)
- Reranker/Crossencoder/RAG (or anything else beyond initial retrieval) support
- Remove the CLI
- Improve error messaging for when retrieving a searchdomain fails.
- Remove the `id` columns from the database tables where the table is actually identified by (and should be unique by) the name, which should become the new primary key.
- Improve performance & latency (Create ready-to-go processes that each contain an n-th share of the entity cache, ready to perform a query. Prepare them after creating the entity cache.)
- Make the API server (and indexer, once it is done) a docker container
- Implement dynamic invocation based database migrations
- Remove remaining DRY violations using the SQLHelper
- Update server setup in README.md to reflect the removal of the CLI
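
The query result detail levels proposed in the to-do list could be derived from a single fully detailed result. The following Python sketch shows one way to do that, assuming the Level 2 shape from the list is the full result. The function name `trim_result` and the sample entity name are hypothetical; this is not the server's implementation.

```python
import copy

def trim_result(full_result, level):
    """Reduce a fully detailed (Level 2) result to the requested detail level."""
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1 or 2")
    # Level 0: only the entity name and its overall score
    result = {"Name": full_result["Name"], "Value": full_result["Value"]}
    if level >= 1:
        datapoints = []
        for datapoint in full_result.get("Datapoints", []):
            trimmed = {"Name": datapoint["Name"], "Value": datapoint["Value"]}
            if level == 2:
                # Level 2 additionally keeps the per-model embedding scores
                trimmed["Embeddings"] = copy.deepcopy(datapoint.get("Embeddings", []))
            datapoints.append(trimmed)
        result["Datapoints"] = datapoints
    return result

# Sample result in the Level 2 shape (values taken from the examples above)
full = {
    "Name": "example-entity",
    "Value": 0.53,
    "Datapoints": [
        {
            "Name": "title",
            "Value": 0.65,
            "Embeddings": [{"Model": "bge-m3", "Value": 0.87}],
        },
    ],
}
level0 = trim_result(full, 0)
level1 = trim_result(full, 1)
level2 = trim_result(full, 2)
```

Computing the full result once and trimming it per request keeps the scoring path identical for all levels; whether the server should instead skip computing unused details is a separate design question.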
Future features
- Support for other database types (MSSQL, SQLite)
Community