Unified Context-Intent Embeddings for Scalable Text-to-SQL (17 minute read)
Pinterest's Analytics Agent transforms data discovery and Text-to-SQL by consolidating the query history of 2,500+ analysts, governance metadata, and AI-generated documentation into a centralized, semantically searchable knowledge base. By embedding both SQL intent and structural patterns, the system delivers context-aware, asset-first analytics, achieving 40% user adoption, cutting documentation effort by 70%, and markedly improving query reliability and data trust.
|
The Practical Limits of DuckDB on Commodity Hardware (9 minute read)
DuckDB delivers warehouse-style, columnar analytics with sub-second performance on datasets up to 5 million rows and remains comfortably interactive for GROUP BY and percentile queries up to 10 million rows, even on $500 laptops (16GB RAM). On this setup, window functions become noticeably slower beyond 10M rows (1.7s at 5M, 6s at 10M, ~1 minute at 50M rows). Memory usage remains modest (<1.2GB for 50M rows), making the 1M-20M zone the sweet spot for local interactive analytics with DuckDB on consumer laptops.
|
|
Claude Code + Dives = Any data UI (11 minute read)
With Claude Code connected to MotherDuck's MCP server for live data access, anyone can quickly build custom, interactive, refreshable data apps and visualizations as shareable "Dives" (React files with embedded live SQL queries), iterate rapidly using diff previews, and publish them directly in MotherDuck for team use.
|
It's about the strategy, stupid (15 minute read)
Most analytics work focuses on tactics (tools, audits, and tracking improvements) without asking how the data will actually support business strategy. Effective analytics starts by understanding the company's goals and decisions first, then designing data and metrics that directly support those strategic priorities.
|
|
Feldera (GitHub Repo)
Feldera is a query engine for incremental computation written in Rust. It continuously updates materialized views from inserts, updates, and deletes instead of recomputing everything. It supports full SQL, handles larger-than-memory data, connects to sources like Kafka, S3, CDC, and warehouses, and provides strong consistency so results match equivalent batch execution, with low-latency, high-throughput processing for real-time analytics and ETL workloads at scale.
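Feldera itself is a Rust engine driven by full SQL, but the core idea of incremental view maintenance can be sketched in a few lines of toy Python (not Feldera's API): treat each insert or delete as a weighted delta applied to the materialized aggregate, rather than rescanning the base table.

```python
class IncrementalSumView:
    """Toy materialized view for SELECT key, COUNT(*), SUM(value) GROUP BY key,
    maintained from a stream of weighted changes (+1 = insert, -1 = delete).
    An update is modeled as a delete followed by an insert."""

    def __init__(self):
        self.groups = {}  # key -> (count, sum)

    def apply(self, key, value, weight):
        cnt, total = self.groups.get(key, (0, 0.0))
        cnt += weight              # delta to COUNT(*)
        total += weight * value    # delta to SUM(value)
        if cnt == 0:
            self.groups.pop(key, None)  # all rows deleted: group vanishes
        else:
            self.groups[key] = (cnt, total)

view = IncrementalSumView()
view.apply("a", 10, +1)
view.apply("a", 5, +1)
view.apply("a", 10, -1)   # retraction of the first insert
print(view.groups)        # {'a': (1, 5.0)}
```

Each change costs O(1) work regardless of table size, which is the property that lets an incremental engine match the results of equivalent batch execution at streaming latencies.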
|
Skore Is Live: Track Your Data Science (8 minute read)
Skore is an open-source layer around scikit-learn focused on model evaluation, comparison, and experiment/report persistence for teams. It reduces evaluation boilerplate, adds methodological guardrails, and makes collaboration more reproducible. The payoff is cleaner handoffs from notebooks to production through structured artifacts, metrics, and project-level tracking.
|
Inside the flight path of real-time ingestion in Apache Pinot (13 minute read)
To guarantee a single consistent "winning" segment across the multiple Pinot server replicas consuming the same Kafka partition, Pinot uses a lightweight, controller-orchestrated blocking commit protocol: the controller elects a committer based on the maximum consumed offset, and the non-committers discard their local segments and download the official committed version.
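The election step above can be sketched in toy Python (illustrative only, not Pinot's actual implementation; replica names and offsets are invented): the controller picks the replica that has consumed the furthest Kafka offset, and that offset becomes the committed segment's end offset that the other replicas must converge to.

```python
def elect_committer(replica_offsets):
    """Pick the replica with the max consumed offset as committer.
    Toy sketch of the controller's decision; the losers will discard
    their local segment builds and download the committed copy."""
    committer = max(replica_offsets, key=replica_offsets.get)
    return committer, replica_offsets[committer]

# Hypothetical replicas of one Kafka partition, at different offsets.
offsets = {"server-1": 4_210, "server-2": 4_350, "server-3": 4_200}
committer, end_offset = elect_committer(offsets)
print(committer, end_offset)  # server-2 4350
```

Fixing a single end offset is what makes the segment deterministic: every replica ends up serving byte-identical data regardless of how far its own consumption had progressed.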
|
|
Scaling to 120+ AI Agents Without Losing Control (23 minute read)
Switch to a multi-agent system with a conductor-specialists pattern when scaling beyond 15 tools or 3 conflicting domains. Multi-agent design maximizes per-task quality but adds orchestration complexity; hybrid retrieval outperforms any single approach; and tightly scoped tool profiles reduce token waste. Example architecture: VoltAgent for orchestration, SurrealDB for unified vector/graph/relational storage, hybrid retrieval (0.6 vector, 0.2 graph, 0.2 keyword), and cost control via dynamic task routing (with Haiku for classification).
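The hybrid retrieval weighting quoted above can be sketched as a simple score fusion (toy Python; it assumes each signal is already normalized to [0, 1], which is our assumption, and the document names/scores are invented):

```python
def hybrid_score(vector_s, graph_s, keyword_s,
                 w_vector=0.6, w_graph=0.2, w_keyword=0.2):
    """Weighted fusion of three retrieval signals, using the
    0.6/0.2/0.2 split from the article. Each input score is
    assumed pre-normalized to the [0, 1] range."""
    return w_vector * vector_s + w_graph * graph_s + w_keyword * keyword_s

# Hypothetical candidates: (vector, graph, keyword) scores per document.
docs = {
    "doc-a": (0.9, 0.1, 0.3),  # strong semantic match only
    "doc-b": (0.4, 0.8, 0.9),  # strong graph + keyword match
}
ranked = sorted(docs, key=lambda d: hybrid_score(*docs[d]), reverse=True)
print(ranked)  # ['doc-a', 'doc-b']
```

With vector similarity weighted at 0.6, a strong semantic-only match still outranks a document that wins on the two lighter signals, which is the trade-off the weighting encodes.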
|
Anthropic's Compute Advantage: Why Silicon Strategy is Becoming an AI Moat (11 minute read)
Anthropic has established a structurally superior, diversified compute stack (leveraging AWS Trainium2 and Google TPUv7) delivering 30-60% lower per-token costs than Nvidia-only configurations and enabling 2+ gigawatts of dedicated capacity. This architecture, secured through $52 billion in long-term commitments with Broadcom, AWS, and Google, grants Anthropic unmatched negotiating leverage, cost-efficiency, and iteration speed. In contrast, OpenAI and Microsoft remain largely dependent on Nvidia, facing significantly higher inference costs and delayed internal silicon programs.
|