LLM Abuse Prevention Tool Using GCG Jailbreak Attack Detection and DistilBERT-Based Ethics Judgment
Blog Article
In recent years, the misuse of large language models (LLMs) has emerged as a significant issue. This paper focuses on a specific attack method known as the greedy coordinate gradient (GCG) jailbreak attack, which compels LLMs to generate responses beyond ethical boundaries. We have developed a tool that suppresses the improper use of LLMs by employing a high-precision detection method combining syntactic tree analysis with the perplexity of the generated text. Furthermore, the tool incorporates a small language model (SLM), DistilBERT, to evaluate the harmfulness of sentences, thereby preventing harmful content from reaching the LLM.
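
To make the detection stage concrete, here is a minimal sketch of how a perplexity filter and a syntactic-tree check might be combined. The reference model (GPT-2), the spaCy pipeline, both thresholds, and the root-fragmentation heuristic are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the GCG detection stage: flag a prompt when its perplexity under a
# small reference LM is abnormally high OR its dependency parse fragments into
# many disconnected roots. Both thresholds are hypothetical and need tuning.
import math

import spacy
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()
nlp = spacy.load("en_core_web_sm")

PPL_THRESHOLD = 1000.0    # hypothetical cutoff for "unnaturally surprising" text
FRAGMENT_THRESHOLD = 0.3  # hypothetical cutoff for parse fragmentation

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing input_ids as labels yields the mean token cross-entropy.
        loss = lm(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def fragmentation(text: str) -> float:
    """Share of tokens acting as parse roots; gibberish GCG suffixes tend
    to break into many tiny, disconnected fragments instead of one tree."""
    doc = nlp(text)
    tokens = [t for t in doc if not t.is_space]
    if not tokens:
        return 0.0
    roots = sum(1 for t in tokens if t.dep_ == "ROOT")
    return roots / len(tokens)

def looks_like_gcg(prompt: str) -> bool:
    return (perplexity(prompt) > PPL_THRESHOLD
            or fragmentation(prompt) > FRAGMENT_THRESHOLD)
```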
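
The ethics-judgment stage could then look like the sketch below, which reuses `looks_like_gcg` from the previous snippet. The checkpoint name, label mapping, and threshold are placeholders: the paper relies on a DistilBERT model fine-tuned on harmful/benign data that is not reproduced here.

```python
# Sketch of the DistilBERT ethics-judgment stage. A fine-tuned harmfulness
# classifier is assumed; "distilbert-base-uncased" is only a stand-in whose
# classification head is randomly initialized without fine-tuning.
from transformers import pipeline

harm_classifier = pipeline("text-classification", model="distilbert-base-uncased")

def is_harmful(prompt: str, threshold: float = 0.5) -> bool:
    result = harm_classifier(prompt)[0]
    # Assumed label mapping: LABEL_1 = harmful, LABEL_0 = benign.
    return result["label"] == "LABEL_1" and result["score"] >= threshold

def query_llm(prompt: str) -> str:
    # Placeholder for the protected model's API; not part of the paper.
    return f"(LLM response to: {prompt!r})"

def gatekeeper(prompt: str) -> str:
    """Combine both checks so unsafe prompts never reach the LLM."""
    if looks_like_gcg(prompt) or is_harmful(prompt):
        return "Request blocked: prompt judged unsafe."
    return query_llm(prompt)
```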
Experimental results demonstrate that the tool effectively detects GCG jailbreak attacks and contributes to the secure usage of LLMs. In testing, the defense success rate reached 90.8%.