I have been an avid user and advocate of open source since the early days of Linux (circa 1995). I also contribute frequently to open source projects and have authored a few of my own.
💿 Repos: github.com/sujee • github.com/elephantscale
⭐️ Project highlights: Allycat • Data Prep Kit
My Open Source Projects
Allycat
Allycat is an end-to-end open-source RAG pipeline for website content.
Dockerized Stacks
I created these dockerized stacks of Big Data components to make them easier to develop with and run:
- Kafka in docker - run a mini Kafka cluster on a single machine
- Spark in docker - run a mini Spark cluster
- Training sandbox docker - a sandbox with Spark, Kafka, TensorFlow, ML stack, DL stack, and Anaconda all pre-installed and configured to work together seamlessly
- BigDL docker - run the Intel BigDL framework
Hadoop DNS Checker
Hadoop is very particular about the DNS records of servers in the cluster. DNS record mismatches can cause runtime errors.
My Hadoop DNS checker utility verifies the DNS records of cluster machines.
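To illustrate the kind of check involved, here is a minimal Python sketch. It is not the actual utility. It assumes that a "consistent" record means the forward lookup (hostname to IP) and the reverse lookup (IP back to hostname) round-trip to the same name. The function names `check_dns` and `check_cluster` are hypothetical, and the resolver functions are injectable so the logic can be exercised without a live DNS server.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def _reverse_lookup(ip: str) -> str:
    # socket.gethostbyaddr returns (hostname, aliases, addresses)
    return socket.gethostbyaddr(ip)[0]


def check_dns(
    hostname: str,
    forward: Callable[[str], str] = socket.gethostbyname,
    reverse: Callable[[str], str] = _reverse_lookup,
) -> Tuple[bool, str]:
    """Assumed check: hostname -> IP -> hostname must round-trip."""
    try:
        ip = forward(hostname)
    except (OSError, KeyError) as exc:
        return False, f"forward lookup failed for {hostname}: {exc}"
    try:
        name = reverse(ip)
    except (OSError, KeyError) as exc:
        return False, f"reverse lookup failed for {ip}: {exc}"
    # Compare case-insensitively and ignore a trailing dot
    if name.rstrip(".").lower() != hostname.rstrip(".").lower():
        return False, f"mismatch: {hostname} -> {ip} -> {name}"
    return True, f"ok: {hostname} <-> {ip}"


def check_cluster(hostnames: Iterable[str], **resolvers) -> List[str]:
    """Return the list of problem descriptions (empty if all hosts pass)."""
    problems = []
    for host in hostnames:
        ok, message = check_dns(host, **resolvers)
        if not ok:
            problems.append(message)
    return problems
```

Calling `check_cluster(["node1.example.com", "node2.example.com"])` with the default resolvers would query the system's DNS configuration; the hostnames here are placeholders.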
My Open Source Contributions
Data Prep Kit
Data Prep Kit is a set of open source tools designed to prepare data at large scale for training AI models.
Spark Job Server
Spark Job Server allows running Spark jobs with low latency.
I submitted multiple patches and pull requests.
HBase
I contributed a performance patch and documentation patches to HBase, a distributed NoSQL database:
- HBASE-4440 : A write benchmark writes a large number of records. When a region splits, writes are paused until the region is split and migrated to another server. This delay negatively affects the benchmark. My patch adds an option to pre-split the table, so that writes can be performed in parallel across multiple regions and servers.
- HBASE-5555 : Documentation and scripts to verify DNS records of HBase machines.
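The idea behind pre-splitting can be sketched in a few lines of Python. This is an illustration only, not the code from the patch. It assumes row keys are zero-padded decimal numbers spread evenly over the key space, and the helper names `split_keys` and `region_for` are hypothetical.

```python
import bisect
from typing import List


def split_keys(total_rows: int, num_regions: int, width: int = 10) -> List[str]:
    """Evenly spaced split points for zero-padded numeric row keys.

    N regions need N - 1 split points; region i holds keys in
    [split[i-1], split[i]).
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    if total_rows < num_regions:
        raise ValueError("total_rows must be >= num_regions")
    return [
        str(i * total_rows // num_regions).zfill(width)
        for i in range(1, num_regions)
    ]


def region_for(row_key: str, splits: List[str]) -> int:
    """Index of the region that would receive row_key."""
    # Keys equal to a split point belong to the region that starts there
    return bisect.bisect_right(splits, row_key)


splits = split_keys(total_rows=1000, num_regions=4, width=4)
print(splits)                      # ['0250', '0500', '0750']
print(region_for("0249", splits))  # 0
print(region_for("0250", splits))  # 1
```

With the table created using these split points up front, writers targeting different key ranges land on different regions from the first write, rather than all hitting a single region until it grows large enough to split.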