
For similarity it is important to consider the dimensionality of your embeddings. The larger the text you wish to compare, the bigger each embedding should be (to my limited understanding).

So a paragraph might be fine as a 384-dim vector, but if you have 1,000 words you might want a 768-dim embedding, if not higher. Embedding models are slightly more or less accurate depending on the training data they were fed, but higher dimensionality definitely gives better results, to a significant extent. If you have a very long piece of text, it's easier to chunk it into pieces and create a separate embedding for each. You do have to manually stitch the chunks back together and do some cleanup when displaying results, but it works.
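To make the chunking concrete, here is a minimal sketch assuming the sentence-transformers library; the model name, chunk size, overlap, and input file are illustrative choices, not anything prescribed above:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim output

    def chunk_words(text, size=200, overlap=40):
        # Overlapping word windows keep each chunk small enough for the
        # model while preserving some context across chunk boundaries.
        words = text.split()
        step = size - overlap
        return [" ".join(words[i:i + size])
                for i in range(0, max(len(words) - overlap, 1), step)]

    long_document = open("document.txt").read()  # hypothetical input

    chunks = chunk_words(long_document)
    embeddings = model.encode(chunks)  # one vector per chunk
    # Store (chunk_text, embedding) pairs; when a chunk matches a query,
    # stitch it back together with its neighbors for display.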

Once you have embeddings for all your data, the rest is just cosine similarity; play around with the min_similarity threshold. You will need to build good indexes in Postgres, but that is basically all you need.
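The Postgres side might look like the sketch below, assuming the pgvector extension and the psycopg driver; the table, column, and index names are made up for illustration, and the query vector would come from embedding the user's query with the same model:

    import psycopg

    MIN_SIMILARITY = 0.75  # tune this threshold against your own data

    # pgvector accepts vectors as '[x1, x2, ...]' text; a real query
    # embedding would replace this dummy 384-dim vector.
    query_vec = str([0.1] * 384)

    conn = psycopg.connect("dbname=mydb")
    with conn, conn.cursor() as cur:
        # An IVFFlat (or HNSW) index makes the nearest-neighbor scan fast.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS chunks_embedding_idx
            ON chunks USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100)
        """)
        # pgvector's <=> operator is cosine *distance*, so
        # similarity = 1 - distance.
        cur.execute("""
            SELECT chunk_text, 1 - (embedding <=> %s::vector) AS similarity
            FROM chunks
            WHERE 1 - (embedding <=> %s::vector) >= %s
            ORDER BY embedding <=> %s::vector
            LIMIT 10
        """, (query_vec, query_vec, MIN_SIMILARITY, query_vec))
        for text, sim in cur.fetchall():
            print(f"{sim:.3f}  {text[:80]}")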
