2025-03-23 22:22:40 +01:00

# embeddingsearch
**This is still very much a work in progress.**
Embeddingsearch is a Python library that uses embedding similarity search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries.
The initial concept only covered importing files into the database.
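
As a rough illustration of what "embedding similarity search" means, the sketch below ranks a few entries against a query vector. It assumes cosine similarity as the metric, which this README does not confirm; the function names, file names, and toy vectors are hypothetical and not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as "not similar"
    return dot / (norm_a * norm_b)


def rank_entries(query_vec, entries):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in entries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
entries = {
    "cats.txt": [0.9, 0.1, 0.0],
    "finance.txt": [0.0, 0.2, 0.9],
}
ranking = rank_entries([1.0, 0.0, 0.0], entries)
```

Here `ranking` lists `cats.txt` first, because its vector points in nearly the same direction as the query.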
# How to set up
1. Install [ollama](https://ollama.com/download)
2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
3. [Install the dependencies](#installing-the-dependencies)
4. [Set up a local MySQL database](#mysql-database-setup)
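
To check that a pulled model actually returns embeddings, one option is to call Ollama's local REST API directly. The sketch below is independent of this library: it assumes Ollama is listening on its default address `http://localhost:11434` and uses the `/api/embed` endpoint; the helper names `build_embed_request` and `embed` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default local address


def build_embed_request(model, text, base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(model, text, base_url=OLLAMA_URL):
    """Send the request and return the first embedding as a list of floats."""
    with urllib.request.urlopen(build_embed_request(model, text, base_url)) as resp:
        body = json.load(resp)
    return body["embeddings"][0]


# Requires a running Ollama server with the model already pulled:
# vector = embed("nomic-embed-text", "hello world")
```

If the commented-out call succeeds, `len(vector)` gives the model's embedding dimension.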
# How to run the example script
1. Start the script: `python3 dbtest.py`
2. Generate the index: type in `index_folder` and submit, then type in `target` and submit. (This might take a while without GPU acceleration, so go get some coffee.)
3. Once indexing is done, you can run searches using `search`
# Installing the dependencies
## Ubuntu 24.04
- `pip install mysql.connector`
- `sudo apt install python3-magic`
## Windows
TODO
# MySQL database setup
1. Install MySQL: `sudo apt install mysql-server`, then connect to it: `sudo mysql -u root`
2. Create the database:
   `CREATE DATABASE embeddingsearch; USE embeddingsearch;`
3. Create the user:
   `CREATE USER embeddingsearch IDENTIFIED BY "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch;`
4. Create the tables:
```sql
CREATE TABLE searchdomain (id int PRIMARY KEY auto_increment, name varchar(512), settings JSON);
CREATE TABLE query (id int PRIMARY KEY auto_increment, id_searchdomain int, query TEXT, FOREIGN KEY (id_searchdomain) REFERENCES searchdomain(id));
CREATE TABLE entity (id int PRIMARY KEY auto_increment, name varchar(512), probmethod varchar(128), id_searchdomain int, FOREIGN KEY (id_searchdomain) REFERENCES searchdomain(id));
CREATE TABLE queryresult (id int PRIMARY KEY auto_increment, id_query int, id_entity int, result double, FOREIGN KEY (id_query) REFERENCES query(id), FOREIGN KEY (id_entity) REFERENCES entity(id));
CREATE TABLE attribute (id int PRIMARY KEY auto_increment, id_entity int, attribute varchar(512), value longtext, FOREIGN KEY (id_entity) REFERENCES entity(id));
CREATE TABLE datapoint (id int PRIMARY KEY auto_increment, name varchar(512), probmethod_embedding varchar(512), id_entity int, FOREIGN KEY (id_entity) REFERENCES entity(id));
CREATE TABLE embedding (id int PRIMARY KEY auto_increment, id_datapoint int, model varchar(512), embedding blob, FOREIGN KEY (id_datapoint) REFERENCES datapoint(id));
```
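
The `embedding` table stores each vector in a `blob` column. This README does not say how the library serializes vectors, so the sketch below is only one possible approach: it assumes little-endian 32-bit floats and `%s`-style placeholders as used by `mysql.connector` cursors. The helpers `vector_to_blob`, `blob_to_vector`, and `store_embedding` are hypothetical names.

```python
import struct

# Column names match the CREATE TABLE statement for `embedding`.
INSERT_SQL = "INSERT INTO embedding (id_datapoint, model, embedding) VALUES (%s, %s, %s)"


def vector_to_blob(vector):
    """Pack floats as little-endian float32 (assumed encoding, 4 bytes per value)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def blob_to_vector(blob):
    """Inverse of vector_to_blob: unpack bytes back into a list of floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


def store_embedding(cursor, datapoint_id, model, vector):
    """Insert one embedding row using any DB-API cursor that accepts %s parameters."""
    cursor.execute(INSERT_SQL, (datapoint_id, model, vector_to_blob(vector)))
```

With this encoding a 1024-dimensional vector takes 4096 bytes, well within the 65,535-byte limit of a MySQL `BLOB`. Values are rounded to float32 precision when packed.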
# To-do
- Implement the API server (WSGI via gunicorn / falcon)
- Add database setup script?
- Remove tables related to caching (caching is no longer done on the SQL server side)
# Off-scope
- Support for other database types