
Using and training an LLM and/or ML model to identify data leakage and code anomalies 🤖 #561

Open
JamieSlome opened this issue May 11, 2024 · 3 comments

Comments

@JamieSlome
Member

LLMs, Machine Learning and Data Leakage Prevention

LLMs appear to be very capable (with caveats, no doubt) of detecting anomalous text, data and information in a given context. A couple of experiments quickly demonstrate this capability. With the help of Perplexity AI, I have run the following experiments:

Experiment 1, Data Exfiltration

Inserting a price property into a package.json and asking Perplexity AI to identify the anomalous property.

[Screenshot: Perplexity AI identifying the anomalous price property in the package.json]
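
For reproducibility, here is a minimal sketch of this setup. It assumes an OpenAI-compatible chat-completions endpoint (Perplexity exposes one at api.perplexity.ai); the model name, environment variable, sample package.json and prompt wording are all illustrative assumptions, not project configuration:

```typescript
// Minimal sketch of Experiment 1: embed an anomalous "price" property in a
// package.json and ask a chat-completions LLM to flag it.
// Assumptions: Node 18+ (global fetch), an OpenAI-compatible endpoint,
// and an API key in LLM_API_KEY.
const packageJson = `{
  "name": "example-service",
  "version": "1.0.0",
  "price": "4500 USD per enterprise seat",
  "dependencies": { "express": "^4.19.2" }
}`;

const prompt =
  "The following package.json was submitted in a pull request. " +
  "Identify any property that looks anomalous or out of place for an npm manifest:\n\n" +
  packageJson;

async function askModel(question: string): Promise<string> {
  const res = await fetch("https://api.perplexity.ai/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.LLM_API_KEY}`, // key name is an assumption
    },
    body: JSON.stringify({
      model: "sonar", // placeholder; substitute whichever model is available
      messages: [{ role: "user", content: question }],
    }),
  });
  const data = await res.json();
  // Chat-completions responses carry the answer in choices[0].message.content.
  return data.choices[0].message.content;
}

askModel(prompt).then(console.log);
```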

Experiment 2, Treasure Island

Inserting a phrase about social media in the second paragraph of the first chapter of Treasure Island by Robert Louis Stevenson.

Input 🔽

[Screenshot: the opening of Treasure Island with the inserted social media phrase]

Output 📤

[Screenshot: Perplexity AI identifying the inserted phrase as anomalous]

Models of usage ♻️

A couple of "finger in the air" models for implementation:

  1. Converting company policies around code creation and open source into prompts run against the code before it is accepted for push or publication, with the results shared as informational output (see the sketch after this list)
  2. Implementing a prompt for internal agents to ask further or more expansive questions of the code set; different coding ecosystems and languages introduce nuances that people can bring expertise to
  3. Using the inputs and outputs, once confirmed with human support, to re-train and strengthen a given model
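
As a concrete illustration of model 1, a rough sketch of a pre-push policy check. This is not the git-proxy plugin API; the policy text, the diff range and the import of the hypothetical askModel() helper from the Experiment 1 sketch are assumptions for illustration only:

```typescript
// Rough sketch of usage model 1: a written policy is turned into a prompt and
// run against the outgoing diff before a push is accepted.
import { execSync } from "node:child_process";
import { askModel } from "./askModel"; // hypothetical module exporting the helper from the Experiment 1 sketch

const POLICY_PROMPT = [
  "Company policy: no credentials, customer data, internal hostnames or",
  "commercial pricing information may leave the organisation in source code.",
  "Review the following diff and list any lines that appear to violate the",
  'policy. Reply "NO FINDINGS" if nothing looks suspicious.',
].join(" ");

async function reviewPendingPush(): Promise<void> {
  // Diff of the commits that are about to be pushed; the range is illustrative.
  const diff = execSync("git diff origin/main...HEAD", { encoding: "utf8" });

  const verdict = await askModel(`${POLICY_PROMPT}\n\n${diff}`);

  // Shared as informational output rather than used to hard-block the push.
  console.log("LLM policy review:\n", verdict);
}

reviewPendingPush().catch(console.error);
```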

Feedback 💬

On reflection, this seems extremely useful for our mission of reducing enterprise data leakage during code creation and contribution. Whilst I am not per se advocating for full automation with an LLM, it could certainly be additive and informational when a given agent is assessing potential leakage in a given snippet of code or an entire codebase.

I'd be keen to hear the opinions of others and reflections on whether this is "just another AI idea (JAAII)" or something that is practical, implementable and helpful to us. I am aware that the accuracy of any such model is no doubt a looming issue that would need to be addressed. Hallucinations and false positives carry a high risk in this scenario (i.e. leakage) - I would be curious to hear any feedback on how this could be mitigated or addressed.

If you think you can present experiments that better demonstrate the power of an LLM in detecting and preventing data or IP leakage in code, feel free to share ❤️

@JamieSlome JamieSlome added the question Further information is requested label May 11, 2024
@JamieSlome JamieSlome self-assigned this May 11, 2024
@JamieSlome JamieSlome changed the title R&D: Using and training an LLM and/or ML model to identify data leakage and code anomalies 🤖 Using and training an LLM and/or ML model to identify data leakage and code anomalies 🤖 May 11, 2024
@vaibssingh
Contributor

@JamieSlome Hey, I am taking this up for initial R&D

@JamieSlome JamieSlome assigned vaibssingh and unassigned JamieSlome Jul 29, 2024
@JamieSlome JamieSlome added the citi-hackathon Related to the Citi India Hackathon (Oct '24) label Oct 22, 2024
@chhokarpardeep

@JamieSlome Hi, starting work on this use case.

@ssachis

ssachis commented Nov 3, 2024

Hi @JamieSlome can I take this issue up?
