Data Info Knowledge

I build probabilistic ML systems and much more

Pietrasanta, IT

01 About

Probability over certainty, end-to-end over hand-off.

I'm a Data Scientist and ML Engineer with a statistics background. BSc in Statistical Sciences from Bologna (110 cum laude, 2018), MSc in Data Science from Milano-Bicocca (108/110, 2022).
That order matters: I think in distributions first, in models second.

My working line is probabilistic ML and LLM agent systems, but I don't stop at the model. I've shipped products end-to-end — from ingestion and warehouse modeling, through training and evaluation, up to the UI a user actually clicks. The reason is selfish: a model that never reaches a user is a paper, not a product, and I prefer products.

Right now I'm Data Scientist at Menumal, designing and implementing the company's data platform: heterogeneous ingestion with Airbyte, transforms in dbt, orchestration on Temporal.io, probabilistic record linkage with Splink, Postgres underneath, Metabase on top. Production decisions documented, reconciliation audited, security audited.

Before Menumal, I spent close to three years at Tomato AI as Full-Stack Developer and AI Engineer, rewriting a hospitality SaaS from legacy PHP to Next.js + Django + PostgreSQL with a team of three, migrating active clients to the new platform, and building AI modules for revenue management and supplier price monitoring on the OpenAI API. Deployment was hybrid AWS + Vercel. Independent consulting since 2019 keeps the muscles for market research, segmentation, and lightweight modeling honest.

On the side I'm building vague — a Python library that represents an LLM agent's memory as a Gaussian Mixture Model in embedding space, instead of a list of retrievable chunks. Two primitives, benchmarked on LongBench (3 tasks × 2 models). GaussianBelief is F1-equivalent to naive RAG — the value is structural: belief states compose, update incrementally, and merge between agents in closed form as mixture parameters. SummaryBelief (experimental) compresses the injected context 15–40×: on small/quantized models (Qwen3-8B-4bit) it recovers +30–65% F1 over the best baseline; on a frontier model (Haiku 4.5) it costs −15%. A real engineering trade-off, surfaced empirically. Integrations with the Anthropic SDK and LangGraph. Pre-release; numbers are real, API isn't frozen yet.
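To make "merge between agents in closed form as mixture parameters" concrete, here is a minimal sketch of one way two Gaussian mixtures can be combined without refitting. It is not the vague API: the `MixtureBelief` class, its fields, and the `merge` function are hypothetical names chosen for illustration, and weighting each agent by its observation count is an assumption. The merged belief is the mixture of the two mixtures, so components are concatenated and weights are rescaled.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MixtureBelief:
    """Hypothetical container for a Gaussian mixture in embedding space."""
    weights: np.ndarray  # shape (K,), sums to 1
    means: np.ndarray    # shape (K, D)
    covs: np.ndarray     # shape (K, D, D)
    n_obs: int           # observations behind this belief (assumed bookkeeping)


def merge(a: MixtureBelief, b: MixtureBelief) -> MixtureBelief:
    """Combine two beliefs as a mixture of mixtures.

    Each agent's components keep their means and covariances; only the
    weights are rescaled by that agent's share of the total observations.
    """
    total = a.n_obs + b.n_obs
    alpha = a.n_obs / total
    return MixtureBelief(
        weights=np.concatenate([alpha * a.weights, (1.0 - alpha) * b.weights]),
        means=np.vstack([a.means, b.means]),
        covs=np.concatenate([a.covs, b.covs], axis=0),
        n_obs=total,
    )


# Toy example in 3 dimensions: agent A has two components, agent B has one.
a = MixtureBelief(
    weights=np.array([0.5, 0.5]),
    means=np.zeros((2, 3)),
    covs=np.stack([np.eye(3), np.eye(3)]),
    n_obs=30,
)
b = MixtureBelief(
    weights=np.array([1.0]),
    means=np.ones((1, 3)),
    covs=np.eye(3)[None, :, :],
    n_obs=10,
)
merged = merge(a, b)
```

The merge involves no optimization and no access to the original chunks, which is the structural property the paragraph above describes. A real implementation would likely also prune or collapse near-duplicate components so the mixture does not grow without bound; that step is omitted here.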

There's a consistent line through the academic work too — my MSc thesis was on a framework for unbiased word embeddings, and I have a separate project on adversarial fair classification. Responsible AI isn't a sticker I add later; it's how I was trained to think.

MSc Data Science
BSc Statistics
4Y+ production ML
Open source

02 Stack

Honest fingerprint of what I use in production.

ML / DL

Probabilistic and deep models.

Python
PyTorch
scikit-learn
Gaussian Mixture Models
Probabilistic ML
Embeddings

LLM / Agents

Agentic systems, retrieval, evaluation.

Anthropic SDK
OpenAI API
LangGraph
RAG
LongBench
Needle-in-haystack

Data Engineering

Ingestion, modeling, orchestration.

Airbyte
dbt
Temporal.io
Splink
Postgres
SQL
pandas

Web / Infra

Product surface, APIs, and infrastructure.

Next.js
Django
Docker
AWS
Vercel
Railway
Git

Practices: production deployment · data quality & reconciliation · security audit documentation

03 Contact

Fastest way to reach me.

Email: chongsu.p@gmail.com
LinkedIn: linkedin.com/in/lorenzopastore1
GitHub: github.com/LorenzoPastore