Data Prep Kit


Data Prep Kit accelerates unstructured data preparation for LLM app developers. Developers can use Data Prep Kit to cleanse, transform, and enrich use case-specific unstructured data to pre-train LLMs, fine-tune LLMs, instruct-tune LLMs, or build Retrieval Augmented Generation (RAG) applications for LLMs.

Data Prep Kit scales from a single laptop to a cluster.

Repo: data-prep-kit/data-prep-kit

My Contribution to Data Prep Kit

I worked with the dev team to make Data Prep Kit more accessible to new users and easier to use.

Talks / Workshops Using Data Prep Kit

2025-Nov: Allycat workshop at QConSF
session details

2024-Dec: PyData Global 2024, Online

2024-Oct: AI Summit Silicon Valley

2024-Oct: IBM TechXchange, Las Vegas, NV

Pics