Data Prep Kit accelerates unstructured data preparation for LLM app developers. Developers can use Data Prep Kit to cleanse, transform, and enrich use case-specific unstructured data to pre-train LLMs, fine-tune LLMs, instruct-tune LLMs, or build Retrieval Augmented Generation (RAG) applications for LLMs.
Data Prep Kit scales from a single laptop to a cluster.
Repo: data-prep-kit/data-prep-kit

My Contribution to Data Prep Kit
I worked with the dev team to make Data Prep Kit easier to use and more accessible to new users.
- Created numerous examples for PDF processing / RAG / HTML processing
- Ran workshops and gave talks at conferences
- My examples repo: sujee/data-prep-kit-examples
- Issues opened by me
- PRs submitted by me
Talks / Workshops Using Data Prep Kit
2025-Nov: Allycat workshop at QConSF
session details
2024-Dec: PyData Global 2024, Online
2024-Oct: AI Summit Silicon Valley
2024-Oct: IBM TechXchange, Las Vegas, NV



