Database Federation: Decentralized and ACL-Compliant Hive Databases (11 minute read)
Uber Engineering decentralized its monolithic Hive data warehouse by implementing Database Federation, migrating datasets into smaller, domain- or team-owned Hive databases without data duplication or downtime. A one-time Bootstrap Migrator copied data to new HDFS locations and updated pointers, while real-time (Apache Flink + Piper) and batch synchronizers kept HMS metadata consistent bidirectionally.
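The metadata-pointer flip is the core of the zero-downtime trick: once files are copied, only the Hive Metastore location changes. A minimal sketch, assuming a dict-based stand-in for HMS; all names and paths here are illustrative, not Uber's actual migrator API.

```python
# A minimal sketch of the pointer-update step in a bootstrap migration:
# a table's metadata is remapped from the monolithic warehouse path to a
# per-team HDFS location without touching the underlying data.
# The `hms` dict and path layout are illustrative, not Uber's actual system.

def migrate_table(hms: dict, db: str, table: str, target_root: str) -> dict:
    """Repoint a Hive table's HDFS location under a new federated root."""
    entry = hms[(db, table)]
    old_location = entry["location"]
    # Data files are assumed already copied to the new root by the migrator;
    # only the metadata pointer flips here, so readers see no downtime.
    entry = {**entry, "location": f"{target_root}/{table}", "previous": old_location}
    hms[(db, table)] = entry
    return entry

# Example: move `trips` from the monolithic warehouse to a team-owned root.
hms = {("warehouse", "trips"): {"location": "hdfs://prod/warehouse/trips"}}
updated = migrate_table(hms, "warehouse", "trips", "hdfs://prod/teams/mobility")
```

In the real system the synchronizers would then keep both copies of the metadata consistent until cutover completes.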
|
HNSW at Scale: Why Adding More Documents to Your Database Breaks RAG (27 minute read)
When HNSW-based RAG systems exceed around 100K vectors, latency grows super-linearly and recall drops, often returning highly similar but irrelevant results for rare queries. The causes include local-minima traps, hubness in high dimensions, and memory pressure. Mitigations include tuning M, ef_construct, and ef_search, using hybrid two-stage retrieval, applying quantization with oversampling and rescoring, and relying on optimized engines like Qdrant.
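The quantize-with-oversampling-and-rescoring mitigation can be sketched in a few lines: search cheap quantized vectors with an oversampling factor, then rescore the shortlist at full precision. The crude int8 quantizer and function names are illustrative assumptions, not Qdrant's actual implementation.

```python
# Two-stage retrieval sketch: rank by quantized similarity first (cheap),
# keep k * oversample candidates, then rescore with full-precision vectors.
# The per-vector max-abs int8 quantizer below is a deliberately simple stand-in.

def quantize(v, scale=127.0):
    m = max(abs(x) for x in v) or 1.0
    return [round(x / m * scale) for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, vectors, k=2, oversample=3):
    q8 = quantize(query)
    codes = [quantize(v) for v in vectors]
    # Stage 1: coarse ranking over quantized codes, oversampled shortlist.
    shortlist = sorted(range(len(vectors)),
                       key=lambda i: -dot(q8, codes[i]))[: k * oversample]
    # Stage 2: exact rescoring of the shortlist recovers recall lost to quantization.
    return sorted(shortlist, key=lambda i: -dot(query, vectors[i]))[:k]
```

Oversampling trades a little extra rescoring work for recall that the lossy stage-1 ranking would otherwise give up.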
|
Inside Tencent Games' Real-Time Event-Driven Analytics System (8 minute read)
Tencent Games built a real-time analytics architecture on CQRS and event-sourcing principles, using Apache Pulsar for high-throughput event ingestion and ScyllaDB to efficiently fan out events to millions of gameplay sessions. Events are partitioned by session ID, and ScyllaDB's keyspaces and cross-region replication streamline multi-tenant data management. This design decouples application logic from data distribution and delivers low-latency, globally consistent operations for risk monitoring and in-game moderation.
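The session-ID partitioning idea is simple to sketch: a stable hash of the session ID routes every event for a session to the same partition, preserving per-session ordering. The hash choice and partition counts below are illustrative, not Pulsar's or Tencent's actual routing.

```python
# Key-based fan-out sketch: all events of one session land on one partition,
# so consumers see that session's events in order. SHA-256 here is just a
# convenient stable hash for illustration.
import hashlib

def partition_for(session_id: str, num_partitions: int) -> int:
    digest = hashlib.sha256(session_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def fan_out(events, num_partitions=4):
    partitions = {i: [] for i in range(num_partitions)}
    for event in events:
        partitions[partition_for(event["session_id"], num_partitions)].append(event)
    return partitions
```

Because the routing key is deterministic, adding consumers never splits a single session's stream across readers.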
|
Have Your Cake and Decompress it Too (11 minute read)
SpiralDB's Vortex columnar format implements Cascading Compression, a recursive, data-driven approach that chains multiple lightweight, fast-decoding encodings per column. It evaluates schemes on stratified samples (~1% of data), selects the best greedy path, and recursively compresses intermediate outputs (codes, dictionaries, and lengths) in a depth-limited tree to handle skewed, clustered, or mixed distributions without relying on a fixed codec like ZSTD.
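The recursive, greedy selection described above can be sketched with two toy encodings: pick whichever shrinks the column most, then recurse on its intermediate outputs up to a depth limit. The encodings and the item-count cost model are illustrative assumptions, not Vortex's actual codecs.

```python
# Cascading-compression sketch: greedily choose the cheapest lightweight
# encoding for a column, then recursively compress its outputs (runs,
# lengths, codes, dictionaries) in a depth-limited tree.

def rle(values):
    runs, lengths = [], []
    for v in values:
        if runs and runs[-1] == v:
            lengths[-1] += 1
        else:
            runs.append(v)
            lengths.append(1)
    return {"scheme": "rle", "runs": runs, "lengths": lengths}

def dictionary(values):
    order = {}
    for v in values:
        order.setdefault(v, len(order))
    return {"scheme": "dict", "dict": list(order), "codes": [order[v] for v in values]}

def size(encoded):  # crude cost model: total number of stored items
    return sum(len(v) for v in encoded.values() if isinstance(v, list))

def compress(values, depth=0, max_depth=2):
    if depth >= max_depth or len(values) < 2:
        return {"scheme": "plain", "values": list(values)}
    best = min((rle(values), dictionary(values)), key=size)
    if size(best) >= len(values):  # no win: keep the column as-is
        return {"scheme": "plain", "values": list(values)}
    # Recurse into each intermediate output list.
    return {k: (compress(v, depth + 1, max_depth) if isinstance(v, list) else v)
            for k, v in best.items()}
```

In Vortex the candidate set is evaluated on a ~1% stratified sample rather than the full column, which keeps the greedy search cheap.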
|
|
Analytics Engineering's Unfinished Work (5 minute read)
Analytics engineering is resurging as semantic layers, AI context, and structured business logic become critical across structural, measurement, and interpretive tiers. dbt shaped the role but narrowed it to SQL. With AI agents consuming business logic, ambiguity is riskier, making strong context engineering and clear semantics essential to keep data trusted and machine-ready.
|
Using LLMs to amplify human labeling and improve Dash search relevance (6 minute read)
Dropbox enhanced search relevance in Dropbox Dash by combining a small seed set of high-quality human-labeled query-document relevance judgments with LLM-assisted labeling to scale training data by ~100x. Its team tuned LLM prompts on human data to minimize disagreements, then used the calibrated LLM offline as a "teacher" to generate massive synthetic labels from error-prone or representative samples.
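The calibrate-then-scale loop can be sketched as: score candidate prompts by agreement with the human seed set, then let the winning prompt drive bulk labeling. `llm_judge` is a stand-in for a real model call; everything here illustrates the pattern, not Dropbox's system.

```python
# Teacher-labeling sketch: prompts are calibrated against a small human-labeled
# seed set (minimizing disagreement), then the best prompt labels unlabeled
# query-document pairs at scale.

def agreement(llm_judge, prompt, seed_set):
    """Fraction of human seed labels the LLM reproduces under this prompt."""
    hits = sum(llm_judge(prompt, q, d) == label for q, d, label in seed_set)
    return hits / len(seed_set)

def calibrate_and_scale(llm_judge, prompts, seed_set, unlabeled_pairs):
    # Pick the prompt that agrees most with human judgments...
    best_prompt = max(prompts, key=lambda p: agreement(llm_judge, p, seed_set))
    # ...then use the calibrated "teacher" to generate synthetic labels.
    return [(q, d, llm_judge(best_prompt, q, d)) for q, d in unlabeled_pairs]
```

The human seed set stays small and expensive; the calibrated teacher is what provides the ~100x multiplier.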
|
To Live in an AI World, Knowing is Half the Battle (28 minute podcast)
True human agency in an AI-driven world requires a deep understanding of algorithms and technology, as people need to move from passive consumers to active shapers who can critique and improve systems. Developers should explain ideas clearly, prioritize societal good over raw speed, and actively redesign tech through curiosity, smart policy, and responsible choices.
|
|
Hardwood: A New Parser for Apache Parquet (7 minute read)
Hardwood is a lightweight, open-source Parquet parser for Java 21+ built for high-throughput, multi-threaded reads with minimal dependencies. Using page-level parallelism, adaptive prefetching, and memory mapping, it can read 9.2 GB (650M rows) in about 1.2 seconds on 16 cores, over 2x faster than row-wise reads, and offers row and column APIs with broad compression support and planned predicate pushdown.
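The page-level parallelism idea generalizes: independent column pages decode concurrently, and results reassemble in file order. A minimal sketch of that scheduling pattern, with a fake `decode_page` standing in for real decompression and decoding; this illustrates the idea, not Hardwood's actual Java API.

```python
# Page-parallel read sketch: a thread pool decodes pages concurrently while
# map() preserves submission order, so rows come back in file order.
from concurrent.futures import ThreadPoolExecutor

def decode_page(page):
    # Placeholder for decompress + decode of one Parquet data page.
    return [v * 2 for v in page["values"]]

def read_column(pages, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        decoded = pool.map(decode_page, pages)
    return [v for chunk in decoded for v in chunk]
```

Because pages are independently compressed units in Parquet, they are a natural grain for this kind of parallelism.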
|
Rivet (GitHub Repo)
Rivet is a serverless actor platform where each actor has built-in state, storage, and real-time workflows, letting you run per-user or per-agent systems that scale from zero to millions with one API.
|
SpacetimeDB 2.02 Released (GitHub Repo)
SpacetimeDB combines the database and server into one system, running your app logic inside the database for fast, real-time applications without separate backend infrastructure.
|
|
Ontology driven Dimensional Modeling (12 minute read)
Traditional data models and semantic layers strip away real business meaning, so AI systems end up guessing context and giving confident but wrong answers. Adding a clear ontology, a structured map of how the business actually works, enables AI to understand cause and effect and move from basic reporting to true strategic insight.
|
The Lunatics are taking over the Asylum (13 minute read)
Agentic AI accelerates code writing but mainly amplifies existing team maturity: strong teams improve faster, weak teams create more bugs and risk. The bottleneck shifts to testing, CI/CD, security, governance, and legacy friction. Expect fewer, higher-skilled engineers, and better results from leaders who invest in capability rather than headcount cuts.
|
|
Selectstar (Website)
This Apache Flink SQL and Table API sandbox lets you experiment with dynamic tables and changelog streams, run live SQL against streaming data, and see how queries translate between tables and streams in real time.