2025-03-23 22:01:46 +01:00

embeddingsearch

Embeddingsearch is a Python library that uses Embedding Similarity Search (similarly to Magna) to semantically compare a given input against a database of pre-processed entries.

The original concept was to import only files into the database.
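
To illustrate the similarity search described above, here is a minimal sketch of ranking stored entries against a query vector. It assumes cosine similarity as the metric, which is a common choice but is not confirmed by this README; the function names cosine_similarity and rank_entries are hypothetical and not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_entries(query_vector, entries):
    """Hypothetical helper: entries maps a name to its embedding vector.

    Returns (name, score) pairs sorted from most to least similar.
    """
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in entries.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In practice the vectors would come from an embedding model rather than being written by hand, and they would have hundreds of dimensions.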

How to set up

  1. Install ollama
  2. Pull a few models using ollama (e.g. paraphrase-multilingual, bge-m3, mxbai-embed-large, nomic-embed-text)
  3. Install the dependencies
  4. Set up a local MySQL database
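
Once ollama is installed and a model is pulled, a quick way to confirm that embeddings can be generated is to call Ollama's local HTTP API directly. The sketch below assumes Ollama is listening on its default address (localhost:11434) and uses its /api/embeddings endpoint; the helper names are hypothetical, and this is not necessarily how embeddingsearch itself talks to Ollama.

```python
import json
import urllib.request

# Assumption: Ollama runs locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_embedding_request(model, text):
    """Build (but do not send) a POST request asking Ollama for one embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(model, text):
    """Send the request and return the embedding vector from the JSON reply.

    Requires a running Ollama instance with the given model already pulled.
    """
    request = build_embedding_request(model, text)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embedding"]
```

For example, fetch_embedding("nomic-embed-text", "hello world") should return a list of floats if the model from step 2 is available.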

How to run the example script

  1. Start the script: python3 dbtest.py
  2. Generate the index. Type in index_folder and submit. Then type in the target and submit. (This might take a while with no GPU acceleration - go get some coffee)
  3. Once indexing is done, you can run searches using search.

Installing the dependencies

Ubuntu 24.04

pip install mysql.connector
sudo apt install python3-magic
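
To check that the install worked, the following sketch tries to import each dependency and reports the ones that fail. It assumes the two packages expose the import names mysql.connector and magic; missing_modules is a hypothetical helper, not part of the library.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError.
            missing.append(name)
    return missing
```

Calling missing_modules(["mysql.connector", "magic"]) should return an empty list when both dependencies are installed.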

Windows

TODO

MySQL database setup

  1. Install MySQL: sudo apt install mysql-server, then connect to it: sudo mysql -u root
  2. Create the database: CREATE DATABASE embeddingsearch; USE embeddingsearch;
  3. Create the user: CREATE USER embeddingsearch IDENTIFIED BY "somepassword!"; GRANT ALL ON embeddingsearch.* TO embeddingsearch;
  4. Create the tables:
CREATE TABLE searchdomain (id int PRIMARY KEY auto_increment, name varchar(512), settings JSON);

CREATE TABLE query (id int PRIMARY KEY auto_increment, id_searchdomain int, query TEXT, FOREIGN KEY (id_searchdomain) REFERENCES searchdomain(id));

CREATE TABLE entity (id int PRIMARY KEY auto_increment, name varchar(512), probmethod varchar(128), id_searchdomain int, FOREIGN KEY (id_searchdomain) REFERENCES searchdomain(id));

CREATE TABLE queryresult (id int PRIMARY KEY auto_increment, id_query int, id_entity int, result double, FOREIGN KEY (id_query) REFERENCES query(id), FOREIGN KEY (id_entity) REFERENCES entity(id));

CREATE TABLE attribute (id int PRIMARY KEY auto_increment, id_entity int, attribute varchar(512), value longtext, FOREIGN KEY (id_entity) REFERENCES entity(id));

CREATE TABLE datapoint (id int PRIMARY KEY auto_increment, name varchar(512), probmethod_embedding varchar(512), id_entity int, FOREIGN KEY (id_entity) REFERENCES entity(id));

CREATE TABLE embedding (id int PRIMARY KEY auto_increment, id_datapoint int, model varchar(512), embedding blob, FOREIGN KEY (id_datapoint) REFERENCES datapoint(id));
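
The embedding table stores each vector in a blob column. This README does not document the byte format the library uses, so the sketch below is only one plausible encoding: little-endian 32-bit floats packed back to back. Both function names are hypothetical.

```python
import struct

# Assumption: vectors are serialized as little-endian float32 values.
BYTES_PER_FLOAT = 4


def embedding_to_blob(vector):
    """Pack a list of floats into bytes suitable for a BLOB column."""
    return struct.pack(f"<{len(vector)}f", *vector)


def blob_to_embedding(blob):
    """Unpack bytes produced by embedding_to_blob back into a list of floats."""
    count = len(blob) // BYTES_PER_FLOAT
    return list(struct.unpack(f"<{count}f", blob))
```

With this encoding a 1024-dimensional vector takes 4096 bytes, which fits within MySQL's 65,535-byte limit for the BLOB type. Values lose precision beyond float32 on the round trip.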

To-do

  • Proper config file
  • Add support for other databases?
  • Add database setup script?
  • Remove tables related to caching (caching is no longer done on the SQL server side)

Off-scope

  • Support for other database types