A dataset miner and manager that automates DeepSeek through Selenium.
Datasets are one of the most important factors for LM (language model) development.
A dataset with perfect examples and the right size makes a perfect LM. Not too big, not too small: it should fit the model's size (parameter count).
Making language models can seem very hard. But in fact, they are just math, trained on your dataset. The hard parts are:
- Computational power to train them
- Creating a perfect dataset
Yes, creating a dataset manually would take years. This is why DataSeek exists.
Demo video coming soon.
- Clone the repository
    git clone https://github.com/MYusufY/dataseek.git
    cd dataseek
- Launch it
    pip install -r requirements.txt
    python3 main.py
- Enter your system prompt
  - In DataSeek, the system prompt is how you describe to DeepSeek what kind of dataset you want generated.
  - Include details such as the desired output count per interval, the output format, the style, etc.
  - You can see some examples in the examples folder.
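As a rough illustration of the points above, here is a minimal Python sketch that assembles a system prompt from those ingredients. The function name `build_system_prompt`, its parameters, and the wording of the prompt are all hypothetical and not part of DataSeek. It also assumes that "interval" means one DeepSeek reply. The prompts in the examples folder remain the real reference.

```python
def build_system_prompt(topic: str, examples_per_reply: int,
                        output_format: str, style: str) -> str:
    """Compose a system prompt string (hypothetical helper, not DataSeek's API)."""
    if examples_per_reply < 1:
        raise ValueError("examples_per_reply must be at least 1")
    lines = [
        f"You are generating a dataset about: {topic}.",
        # Assumption: one "interval" corresponds to one reply from DeepSeek.
        f"Return exactly {examples_per_reply} new examples in every reply.",
        f"Output format: {output_format}",
        f"Style: {style}",
        "Do not repeat earlier examples and do not add text outside the format.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_system_prompt(
        topic="basic Python questions",
        examples_per_reply=10,
        output_format='a JSON list of {"question": ..., "answer": ...} objects',
        style="short and beginner friendly",
    ))
```

The printed text can then be pasted into DataSeek as the system prompt.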
- Enter your example base JSON dataset (optional)
  - If you enter or import an existing JSON dataset, its last 30 examples are sent to DeepSeek right after the system prompt, so it has a better idea of the format and the dataset.
  - This slightly improves performance. You can provide a few examples or a whole dataset; new examples are then added on top of it instead of starting from scratch.
  - It's completely optional.
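The behaviour described above (take the last 30 examples as context, then add new examples on top of the existing dataset) can be sketched in a few lines of Python. This is only an illustration, not DataSeek's actual code: the helper names are hypothetical, and it assumes the dataset file is a top-level JSON list of examples, which the README does not specify.

```python
import json
from pathlib import Path

CONTEXT_SIZE = 30  # the README states that the last 30 examples are sent


def load_dataset(path):
    """Read a dataset file; assumes a top-level JSON list (an assumption)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON list of examples")
    return data


def context_examples(dataset, n=CONTEXT_SIZE):
    """Return the last n examples, or the whole dataset if it is shorter."""
    return dataset[-n:] if n > 0 else []


def extend_dataset(dataset, new_examples):
    """Return a new list with the freshly generated examples added on top."""
    return dataset + list(new_examples)
```

For example, `context_examples(load_dataset("base.json"))` would give the examples to send after the system prompt, where `base.json` is a placeholder file name.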
DataSeek will be released as a standalone app soon, for Linux, macOS and (maybe) Windows. If you want to, you can open an issue to make this process faster!
This repository is for research purposes only. I am not responsible for misuse. Please do not use it in production!
📧 [email protected]
☕ Buy me a coffee
Thanks — hope this helps!