Database Federation: Decentralized and ACL-Compliant Hive Databases (11 minute read)
Uber Engineering decentralized its monolithic Hive data warehouse by implementing Database Federation, migrating datasets into smaller, domain- or team-owned Hive databases without data duplication or downtime. A one-time Bootstrap Migrator copied data to new HDFS locations and updated pointers, while real-time (Apache Flink + Piper) and batch synchronizers kept HMS metadata consistent bidirectionally.
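The metadata-pointer flip is the core of the zero-downtime trick: once files are copied, only the Hive Metastore location changes. A minimal sketch, assuming a dict-based stand-in for HMS; all names and paths here are illustrative, not Uber's actual migrator API.

```python
# A minimal sketch of the pointer-update step in a bootstrap migration:
# a table's metadata is remapped from the monolithic warehouse path to a
# per-team HDFS location without touching the underlying data.
# The `hms` dict and path layout are illustrative, not Uber's actual system.

def migrate_table(hms: dict, db: str, table: str, target_root: str) -> dict:
    """Repoint a Hive table's HDFS location under a new federated root."""
    entry = hms[(db, table)]
    old_location = entry["location"]
    # Data files are assumed already copied to the new root by the migrator;
    # only the metadata pointer flips here, so readers see no downtime.
    entry = {**entry, "location": f"{target_root}/{table}", "previous": old_location}
    hms[(db, table)] = entry
    return entry

# Example: move `trips` from the monolithic warehouse to a team-owned root.
hms = {("warehouse", "trips"): {"location": "hdfs://prod/warehouse/trips"}}
updated = migrate_table(hms, "warehouse", "trips", "hdfs://prod/teams/mobility")
```

In the real system the synchronizers would then keep both copies of the metadata consistent until cutover completes.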
|
HNSW at Scale: Why Adding More Documents to Your Database Breaks RAG (27 minute read)
When HNSW-based RAG systems exceed around 100K vectors, latency grows super-linearly and recall drops, often returning highly similar but irrelevant results for rare queries. The causes include local-minima traps, hubness in high dimensions, and memory pressure. Mitigations include tuning M, ef_construct, and ef_search, using hybrid two-stage retrieval, applying quantization with oversampling and rescoring, and relying on optimized engines like Qdrant.
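The quantize-with-oversampling-and-rescoring mitigation can be sketched in a few lines: search cheap quantized vectors with an oversampling factor, then rescore the shortlist at full precision. The crude int8 quantizer and function names are illustrative assumptions, not Qdrant's actual implementation.

```python
# Two-stage retrieval sketch: rank by quantized similarity first (cheap),
# keep k * oversample candidates, then rescore with full-precision vectors.
# The per-vector max-abs int8 quantizer below is a deliberately simple stand-in.

def quantize(v, scale=127.0):
    m = max(abs(x) for x in v) or 1.0
    return [round(x / m * scale) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, vectors, k=2, oversample=3):
    q8 = quantize(query)
    codes = [quantize(v) for v in vectors]
    # Stage 1: coarse ranking over quantized codes, oversampled shortlist.
    shortlist = sorted(range(len(vectors)),
                       key=lambda i: -dot(q8, codes[i]))[: k * oversample]
    # Stage 2: exact rescoring of the shortlist recovers recall lost to quantization.
    return sorted(shortlist, key=lambda i: -dot(query, vectors[i]))[:k]
```

Oversampling trades a little extra rescoring work for recall that the lossy stage-1 ranking would otherwise give up.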
|
Inside Tencent Games' Real-Time Event-Driven Analytics System (8 minute read)
Tencent Games built a real-time analytics architecture on CQRS and event-sourcing principles, using Apache Pulsar for high-throughput event ingestion and ScyllaDB to efficiently fan out events to millions of gameplay sessions. Events are partitioned by session ID, and ScyllaDB's keyspaces and cross-region replication streamline multi-tenant data management. This design decouples application logic from data distribution and delivers low-latency, globally consistent operations for risk monitoring and in-game moderation.
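The session-ID partitioning idea is simple to sketch: a stable hash of the session ID routes every event for a session to the same partition, preserving per-session ordering. The hash choice and partition counts below are illustrative, not Pulsar's or Tencent's actual routing.

```python
# Key-based fan-out sketch: all events of one session land on one partition,
# so consumers see that session's events in order. SHA-256 here is just a
# convenient stable hash for illustration.
import hashlib

def partition_for(session_id: str, num_partitions: int) -> int:
    digest = hashlib.sha256(session_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def fan_out(events, num_partitions=4):
    partitions = {i: [] for i in range(num_partitions)}
    for event in events:
        partitions[partition_for(event["session_id"], num_partitions)].append(event)
    return partitions
```

Because the routing key is deterministic, adding consumers never splits a single session's stream across readers.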
|
Have Your Cake and Decompress it Too (11 minute read)
SpiralDB's Vortex columnar format implements Cascading Compression, a recursive, data-driven approach that chains multiple lightweight, fast-decoding encodings per column. It evaluates schemes on stratified samples (~1% of data), selects the best greedy path, and recursively compresses intermediate outputs (codes, dictionaries, and lengths) in a depth-limited tree to handle skewed, clustered, or mixed distributions without relying on a fixed codec like ZSTD.
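The recursive, greedy selection described above can be sketched with two toy encodings: pick whichever shrinks the column most, then recurse on its intermediate outputs up to a depth limit. The encodings and the item-count cost model are illustrative assumptions, not Vortex's actual codecs.

```python
# Cascading-compression sketch: greedily choose the cheapest lightweight
# encoding for a column, then recursively compress its outputs (runs,
# lengths, codes, dictionaries) in a depth-limited tree.

def rle(values):
    runs, lengths = [], []
    for v in values:
        if runs and runs[-1] == v:
            lengths[-1] += 1
        else:
            runs.append(v)
            lengths.append(1)
    return {"scheme": "rle", "runs": runs, "lengths": lengths}

def dictionary(values):
    order = {}
    for v in values:
        order.setdefault(v, len(order))
    return {"scheme": "dict", "dict": list(order), "codes": [order[v] for v in values]}

def size(encoded):  # crude cost model: total number of stored items
    return sum(len(v) for v in encoded.values() if isinstance(v, list))

def compress(values, depth=0, max_depth=2):
    if depth >= max_depth or len(values) < 2:
        return {"scheme": "plain", "values": list(values)}
    best = min((rle(values), dictionary(values)), key=size)
    if size(best) >= len(values):  # no win: keep the column as-is
        return {"scheme": "plain", "values": list(values)}
    # Recurse into each intermediate output list.
    return {k: (compress(v, depth + 1, max_depth) if isinstance(v, list) else v)
            for k, v in best.items()}
```

In Vortex the candidate set is evaluated on a ~1% stratified sample rather than the full column, which keeps the greedy search cheap.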
|
|
Analytics Engineering's Unfinished Work (5 minute read)
Analytics engineering is resurging as semantic layers, AI context, and structured business logic become critical across structural, measurement, and interpretive tiers. dbt shaped the role but narrowed it to SQL. With AI agents consuming business logic, ambiguity is riskier, making strong context engineering and clear semantics essential to keep data trusted and machine-ready.
|
Using LLMs to amplify human labeling and improve Dash search relevance (6 minute read)
Dropbox enhanced search relevance in Dropbox Dash by combining a small seed set of high-quality human-labeled query-document relevance judgments with LLM-assisted labeling to scale training data by ~100x. Its team tuned LLM prompts on human data to minimize disagreements, then used the calibrated LLM offline as a "teacher" to generate massive synthetic labels from error-prone or representative samples.
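The calibrate-then-scale loop can be sketched as: score candidate prompts by agreement with the human seed set, then let the winning prompt drive bulk labeling. `llm_judge` is a stand-in for a real model call; everything here illustrates the pattern, not Dropbox's system.

```python
# Teacher-labeling sketch: prompts are calibrated against a small human-labeled
# seed set (minimizing disagreement), then the best prompt labels unlabeled
# query-document pairs at scale.

def agreement(llm_judge, prompt, seed_set):
    """Fraction of human seed labels the LLM reproduces under this prompt."""
    hits = sum(llm_judge(prompt, q, d) == label for q, d, label in seed_set)
    return hits / len(seed_set)

def calibrate_and_scale(llm_judge, prompts, seed_set, unlabeled_pairs):
    # Pick the prompt that agrees most with human judgments...
    best_prompt = max(prompts, key=lambda p: agreement(llm_judge, p, seed_set))
    # ...then use the calibrated "teacher" to generate synthetic labels.
    return [(q, d, llm_judge(best_prompt, q, d)) for q, d in unlabeled_pairs]
```

The human seed set stays small and expensive; the calibrated teacher is what provides the ~100x multiplier.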
|
To Live in an AI World, Knowing is Half the Battle (28 minute podcast)
True human agency in an AI-driven world requires a deep understanding of algorithms and technology, as people need to move from passive consumers to active shapers who can critique and improve systems. Developers should explain ideas clearly, prioritize societal good over raw speed, and actively redesign tech through curiosity, smart policy, and responsible choices.
|
|
Hardwood: A New Parser for Apache Parquet (7 minute read)
Hardwood is a lightweight, open-source Parquet parser for Java 21+ built for high-throughput, multi-threaded reads with minimal dependencies. Using page-level parallelism, adaptive prefetching, and memory mapping, it can read 9.2 GB (650M rows) in about 1.2 seconds on 16 cores, over 2x faster than row-wise reads, and offers row and column APIs with broad compression support and planned predicate pushdown.
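The page-level parallelism idea generalizes: independent column pages decode concurrently, and results reassemble in file order. A minimal sketch of that scheduling pattern, with a fake `decode_page` standing in for real decompression and decoding; this illustrates the idea, not Hardwood's actual Java API.

```python
# Page-parallel read sketch: a thread pool decodes pages concurrently while
# map() preserves submission order, so rows come back in file order.
from concurrent.futures import ThreadPoolExecutor

def decode_page(page):
    # Placeholder for decompress + decode of one Parquet data page.
    return [v * 2 for v in page["values"]]

def read_column(pages, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        decoded = pool.map(decode_page, pages)
    return [v for chunk in decoded for v in chunk]
```

Because pages are independently compressed units in Parquet, they are a natural grain for this kind of parallelism.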
|
Rivet (GitHub Repo)
Rivet is a serverless actor platform where each actor has built-in state, storage, and real-time workflows, letting you run per-user or per-agent systems that scale from zero to millions with one API.
|
SpacetimeDB 2.02 Released (GitHub Repo)
SpacetimeDB combines the database and server into one system, running your app logic inside the database for fast, real-time applications without separate backend infrastructure.
|
|
Ontology driven Dimensional Modeling (12 minute read)
Traditional data models and semantic layers strip away real business meaning, so AI systems end up guessing context and giving confident but wrong answers. Adding a clear ontology, a structured map of how the business actually works, enables AI to understand cause and effect and move from basic reporting to true strategic insight.
|
The Lunatics are taking over the Asylum (13 minute read)
Agentic AI accelerates code writing but mainly amplifies existing team maturity: strong teams improve faster, weak teams create more bugs and risk. The bottleneck shifts to testing, CI/CD, security, governance, and legacy friction. Expect fewer, higher-skilled engineers, and better results from leaders who invest in capability rather than headcount cuts.
|
|
Selectstar (Website)
This Apache Flink SQL and Table API sandbox lets you experiment with dynamic tables and changelog streams, run live SQL against streaming data, and see how queries translate between tables and streams in real time.