- Explain the current workflow for getting data out of a database and into an application. [surister]
- Explain the queryzen workflow in juxtaposition with the previous one (it is much better). [darkfus]
- Explain every point with slides/graphs. [surister]
- In-person demo!!! [darkfus]
Right now CrateDB has ..
What other databases are doing:
services:
  cratedb01:
    image: crate/crate:5.10.1
    ports:
      - "4200:4200"
      - "5432:5432"
    volumes:
      - cratedb1:/data
    command: ["crate",
              "-Chttp.cors.enabled=true",
In this talk, Ivan, a Database Ecosystem Engineer at @CrateDB, will show you how to build a hybrid search service (keyword search plus vector search) from scratch in Python with CrateDB.
He will explain:
SELECT
  table_name,
  SUM(num_docs) AS records,
  (SUM(size) / (1024 * 1024)) AS total_size_mib,
  (SUM(size) / count(*)) / (1024 * 1024) AS avg_size_per_shard_in_mib,
  (SUM(size) / SUM(num_docs) :: DOUBLE) AS avg_size_in_bytes_per_record
FROM
  sys.shards
<!--
To display math equations in Vue 3, you need a library such as MathJax or KaTeX. I prefer
KaTeX since it seems to be the most powerful solution.
Most Vue 3 KaTeX libraries are unmaintained or do not work properly on my setup: as of 2024-12-07 they break
on my latest Vue 3 + Nuxt projects. This is the simplest way I found to make it work.
You only need to run
CrateDB - Storage usage on disk
CrateDB stores data in both a row store and a column store and, on top of that, automatically creates an index. On reads, it leverages the index and, depending on the query, uses whichever store is most efficient.
This is one of the many features that make CrateDB very fast when reading and aggregating data, but it has an impact on storage.
We are going to use the Yellow Taxi trip data for January 2024, which has 2_964_624 rows.
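To make the storage figures concrete, the metrics of interest reduce to simple arithmetic over per-shard sizes and document counts. The following is a minimal sketch in Python, not CrateDB's own tooling: the `storage_stats` helper and the sample shard values are hypothetical, and the `size` and `num_docs` keys are assumed to mirror the columns of the same name in `sys.shards`.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def storage_stats(shards):
    """Summarise storage for one table from per-shard figures.

    Each shard is a dict with `size` (bytes on disk) and `num_docs`
    (number of records), assumed to mirror the sys.shards columns.
    """
    total_size = sum(shard["size"] for shard in shards)
    total_docs = sum(shard["num_docs"] for shard in shards)
    return {
        "records": total_docs,
        "total_size_mib": total_size / MIB,
        "avg_size_per_shard_mib": total_size / len(shards) / MIB,
        "avg_size_bytes_per_record": total_size / total_docs,
    }


# Hypothetical shard values, chosen only to keep the arithmetic easy to follow.
example_shards = [
    {"size": 2 * MIB, "num_docs": 1024},
    {"size": 2 * MIB, "num_docs": 1024},
]
stats = storage_stats(example_shards)
```

With real data, the input would come from querying the shard metadata of the table under test rather than from hand-written values.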
One of the most effective ways to improve query performance is through indexing. At CrateDB we asked: what's faster than one index? Everything indexed! We took the bold approach of indexing every column by default. But we didn't stop there: we leverage multiple data structures for every indexed column. At query time, CrateDB selects the optimal index based on the query type, enabling faster and more efficient results.
But you probably have many questions. Does this actually work? How did you do it? Isn't there a performance penalty on write speed? And updates? How about storage size?
In this talk we will tell you all about Hybrid Indexes, one of the fundamental aspects of CrateDB: an open-source distributed SQL database for real-time analytics and hybrid search.
In https://cratedb.com/blog/hybrid-search-explained we learned about Hybrid Search and how to do it in pure SQL. The resulting query can be hard to understand if you don't have much experience with common table expressions (CTEs), so in this piece we will dive deeper into CTEs and the finer details of the query.
In the last chapter, we learned that Hybrid Search essentially runs several queries that each capture a different kind of meaning and then combines their results. Keep this in mind, as we will see that CTEs work in a very similar way.
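The "run several queries, then combine them" idea can be sketched outside SQL as well. The example below is a minimal Python illustration that assumes the combination step is Reciprocal Rank Fusion (RRF); this text does not state which fusion method the original query uses, so treat RRF, the `reciprocal_rank_fusion` helper, the constant `k=60`, and the sample document ids as assumptions made for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Higher total score ranks first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; ties keep first-seen order (stable sort).
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from a keyword query and a vector query.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear near the top of both lists ("b" and "a" here) end up ahead of documents that appear in only one list.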
## Common Table Expressions
CTEs are subqueries that can be referenced from a main query. They are temporary, meaning that they do not exist outside of that main query.
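A small runnable sketch can make both properties visible: the CTE is referenced by name from the main query, and the name is gone once the query finishes. To keep the example self-contained it uses Python's built-in `sqlite3` module instead of CrateDB; the `trips` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used here only so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, distance REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1, 1.5), (2, 12.0), (3, 7.25)],
)

# `long_trips` is a CTE: a named subquery the main query can reference.
query = """
WITH long_trips AS (
    SELECT id, distance FROM trips WHERE distance > 5
)
SELECT id FROM long_trips ORDER BY distance DESC
"""
long_trip_ids = [row[0] for row in conn.execute(query)]

# Outside the query that defined it, the CTE name does not exist.
try:
    conn.execute("SELECT * FROM long_trips")
    cte_visible_outside = True
except sqlite3.OperationalError:
    cte_visible_outside = False
```

The second statement fails because `long_trips` was never a table; it only lived for the duration of the query that declared it.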
There are already a number of hybrid search talks from past Haystack conferences:
EU 2023: [Mastering Hybrid Search: Blending Classic Ranking Functions with Vector Search for Superior Search Relevance](https://haystackconf.com/eu2023/talk-10/)
EU 2023: [Reciprocal Rank Fusion (RRF) or How to Stop Worrying about Boosting](https://haystackconf.com/eu2023/talk-2/)
US 2024: [All Vector Search is Hybrid Search](https://haystackconf.com/us2024/talk-1/)
US 2024: Better Semantic Search with Hybrid (Sparse-Dense) Search
# Doing hybrid search on your real-time data in pure SQL with CrateDB's index-all strategy.
Points to highlight: