first commit

EzFeDezy
2025-03-23 22:01:46 +01:00
commit 70759e7870
759 changed files with 9591 additions and 0 deletions

README.md Normal file

@@ -0,0 +1,54 @@
# embeddingsearch
Embeddingsearch is a Python library that uses Embedding Similarity Search (similarly to [Magna](https://github.com/yousef-rafat/Magna/tree/main)) to semantically compare a given input to a database of pre-processed entries.
The first implementation of the idea was designed only to import files into the database.
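
To illustrate what an embedding similarity search does, here is a minimal sketch in plain Python. It assumes embeddings are lists of floats and uses cosine similarity as the metric. The function names `cosine_similarity` and `rank_entries` and the toy vectors are hypothetical and are not taken from the library's code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def rank_entries(query_embedding, entries):
    """Sort (name, embedding) pairs by similarity to the query, best first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in entries
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors; real embedding models return far more dimensions.
entries = [
    ("cats.txt", [0.9, 0.1, 0.0]),
    ("cars.txt", [0.0, 0.2, 0.9]),
]
ranking = rank_entries([1.0, 0.0, 0.0], entries)
print(ranking[0][0])  # cats.txt
```

In a real setup the query and the entries would be embedded by the same model, since vectors from different models are not comparable.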
# How to set up
1. Install [ollama](https://ollama.com/download)
2. Pull a few models using ollama (e.g. `paraphrase-multilingual`, `bge-m3`, `mxbai-embed-large`, `nomic-embed-text`)
3. [Install the dependencies](#installing-the-dependencies)
4. [Set up a local MySQL database](#mysql-database-setup)
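
Once ollama is running and a model has been pulled, an embedding can be requested over its local HTTP API. The sketch below assumes ollama's default address `http://localhost:11434` and its `/api/embed` endpoint. The helper names `build_embed_request` and `embed` are hypothetical, and this is not necessarily how embeddingsearch itself talks to ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # assumed default local address


def build_embed_request(model, text, url=OLLAMA_URL):
    """Build (but do not send) a POST request asking ollama to embed `text`."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(model, text):
    """Send the request and return the embedding as a list of floats.

    Requires a running ollama server with `model` already pulled.
    """
    with urllib.request.urlopen(build_embed_request(model, text)) as response:
        body = json.load(response)
    return body["embeddings"][0]
```

With the server up, `embed("nomic-embed-text", "some text")` should return one vector for the given text.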
# How to run the example script
1. Start the script: `python3 dbtest.py`
2. Generate the index. Type in `index_folder` and submit. Then type in `target` and submit. (This might take a while with no GPU acceleration - go get some coffee)
3. After the indexing is done, you can run searches using `search`
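
As a rough picture of what indexing a folder involves, the sketch below walks a directory tree and collects the text of every file that decodes as UTF-8. It is an assumption about the general approach, not the code in `dbtest.py`; the name `iter_text_files` is hypothetical, and the UTF-8 check is a simple stand-in for proper file type detection.

```python
import os


def iter_text_files(folder):
    """Yield (path, text) for every file under `folder` that decodes as UTF-8."""
    for root, _dirs, files in os.walk(folder):
        for filename in sorted(files):
            path = os.path.join(root, filename)
            try:
                with open(path, encoding="utf-8") as handle:
                    text = handle.read()
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            yield path, text
```

Each yielded text could then be embedded with one or more models and stored alongside the file name.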
# Installing the dependencies
## Ubuntu 24.04
`pip install mysql-connector-python`
`sudo apt install python3-magic`
## Windows
TODO
# MySQL database setup
1. Install MySQL: `sudo apt install mysql-server` and connect to it: `sudo mysql -u root`
2. Create the database
`CREATE DATABASE embeddingsearch; USE embeddingsearch;`
3. Create the user
`CREATE USER embeddingsearch IDENTIFIED BY 'somepassword!'; GRANT ALL ON embeddingsearch.* TO embeddingsearch;`
4. Create the tables
```sql
CREATE TABLE searchdomain (id int PRIMARY KEY auto_increment, name varchar(512), settings JSON);
CREATE TABLE query (id int PRIMARY KEY auto_increment, id_searchdomain int, query TEXT, FOREIGN KEY (id_searchdomain) REFERENCES searchdomain(id));
CREATE TABLE entity (id int PRIMARY KEY auto_increment, name varchar(512), probmethod varchar(128), id_searchdomain int, FOREIGN KEY (id_searchdomain) REFERENCES searchdomain(id));
CREATE TABLE queryresult (id int PRIMARY KEY auto_increment, id_query int, id_entity int, result double, FOREIGN KEY (id_query) REFERENCES query(id), FOREIGN KEY (id_entity) REFERENCES entity(id));
CREATE TABLE attribute (id int PRIMARY KEY auto_increment, id_entity int, attribute varchar(512), value longtext, FOREIGN KEY (id_entity) REFERENCES entity(id));
CREATE TABLE datapoint (id int PRIMARY KEY auto_increment, name varchar(512), probmethod_embedding varchar(512), id_entity int, FOREIGN KEY (id_entity) REFERENCES entity(id));
CREATE TABLE embedding (id int PRIMARY KEY auto_increment, id_datapoint int, model varchar(512), embedding blob, FOREIGN KEY (id_datapoint) REFERENCES datapoint(id));
```
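
The `embedding` table stores each vector in a `blob` column, so the floats have to be serialized to bytes before insertion. The sketch below shows one possible encoding (little-endian 32-bit floats via the standard `struct` module). The encoding and the names `pack_embedding` and `unpack_embedding` are assumptions for illustration; the project may store its vectors differently.

```python
import struct


def pack_embedding(vector):
    """Encode a list of floats as little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def unpack_embedding(blob):
    """Decode bytes produced by pack_embedding back into a list of floats."""
    count = len(blob) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", blob))


blob = pack_embedding([0.5, -1.25, 2.0])
print(len(blob))               # 12
print(unpack_embedding(blob))  # [0.5, -1.25, 2.0]
```

The resulting bytes could be passed as a query parameter, for example to `INSERT INTO embedding (id_datapoint, model, embedding) VALUES (%s, %s, %s)`. A MySQL `BLOB` holds up to 65,535 bytes, which is enough for 16,383 float32 values in this encoding.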
# To-do
- Proper config file
- Add support for other databases?
- Add database setup script?
- Remove tables related to caching (caching is no longer done on the SQL server side)
# Off-scope
- Support for other database types