Reinforcement Learning and the Rise of Task-Specialized AI Malware
July 30, 2025
The discussion around AI-assisted cyberattacks has long been speculative, with most real-world use limited to simple code generation or phishing content. Recent work by Kyle Avery marks a turning point. Using Reinforcement Learning, Avery transformed a standard open-source language model into a lightweight engine capable of generating malware that evades Microsoft Defender, without access to large malware datasets and at very low cost. This signals a significant shift: attackers may soon rely on specialized AI models trained to defeat specific security tools rather than on general-purpose assistance from traditional LLMs.
Since late 2023, observers have warned that LLMs would eventually enable attackers to build advanced malware at scale, but so far the practical use of AI by attackers has mostly been limited to generating simple code, phishing emails, and basic research. That narrative is now changing. At the upcoming Black Hat conference, Kyle Avery (Outflank) plans to demonstrate a lightweight model that can consistently evade Microsoft Defender for Endpoint, developed for just a few thousand dollars.
The key turning point is Reinforcement Learning (RL). Most mainstream LLMs (GPT-3.5, GPT-4, etc.) are trained by feeding them massive volumes of general data via self-supervised next-token prediction; the jump from 3.5 to 4 simply made the model better at everything. But when OpenAI released o1, its behavior was different: stronger at math and coding, weaker at writing. That seemed odd until DeepSeek released R1 and explained how it used RL with verifiable rewards. Instead of "learn everything from the internet," the model is trained to repeatedly solve one type of task and receives automatic feedback on whether its answer is correct.
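To make the loop concrete, here is a deliberately tiny, self-contained toy of RL with verifiable rewards on a harmless task (addition): a "policy" (a softmax over three canned answer strategies, standing in for a language model) is sampled, an automatic verifier scores the output, and a REINFORCE-style update nudges the policy toward rewarded behavior. Every name and parameter here is an illustrative assumption, not the setup DeepSeek or Avery actually used.

    import math
    import random

    STRATEGIES = ["correct_sum", "off_by_one", "random_guess"]
    logits = [0.0, 0.0, 0.0]   # trainable parameters of the toy "policy"

    def policy_probs():
        weights = [math.exp(l) for l in logits]
        total = sum(weights)
        return [w / total for w in weights]

    def sample_strategy():
        # Sample one strategy according to the current policy distribution.
        probs = policy_probs()
        r, acc = random.random(), 0.0
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                return i
        return len(probs) - 1

    def answer(strategy, a, b):
        if STRATEGIES[strategy] == "correct_sum":
            return a + b
        if STRATEGIES[strategy] == "off_by_one":
            return a + b + 1
        return random.randint(0, 200)

    def verify(a, b, out):
        # Automatic, verifiable reward: no labeled dataset, just a checker.
        return 1.0 if out == a + b else 0.0

    LEARNING_RATE = 0.1
    for _ in range(2000):
        a, b = random.randint(0, 99), random.randint(0, 99)
        s = sample_strategy()
        reward = verify(a, b, answer(s, a, b))
        # REINFORCE-style update: raise the probability of rewarded behavior.
        probs = policy_probs()
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == s else 0.0) - probs[i]
            logits[i] += LEARNING_RATE * reward * grad_log_prob

    print("learned preferences:", dict(zip(STRATEGIES, (round(l, 2) for l in logits))))

Run repeatedly, the policy drifts toward "correct_sum" simply because that strategy keeps getting rewarded; the model is never shown a single worked example, only the verdict of the checker.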
This means a model can be shaped to become very good at one specific objective — for example evading a security product — even if you don’t have a large training dataset of malware samples.
Avery proved this in practice. He took the general open-source model Qwen 2.5, placed it in a sandbox with Microsoft Defender, and wrote a scoring script. At first the model only rarely produced a working piece of malicious code (maybe 1 in 1,000 attempts). Whenever it produced something that actually ran, it received a reward. After many iterations, the model became good at producing functional malware without ever having been shown examples.
In the next step, the model was connected to Defender's alert API, and the scoring was based on the severity of the alert raised by Defender: a high-severity alert meant a low reward, a low-severity alert a high reward. After enough loops, the model started producing malware that still worked but triggered lower and lower alerts, until some samples fully evaded detection.
After three months of RL training (costing roughly $1.5k), Avery's model could generate undetectable malware in about 8% of attempts, which means an attacker using it repeatedly could expect a working evasive sample after roughly a dozen tries. Generalist models (Anthropic's, DeepSeek's) managed less than 1% in similar tests. The model is also small and lightweight and could run locally on an ordinary GPU.
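A quick back-of-the-envelope check of the "roughly a dozen tries" figure, under the simplifying assumption that each attempt succeeds independently with the reported 8% probability:

    p = 0.08
    expected_attempts = 1 / p            # mean of a geometric distribution: ~12.5 tries
    p_within_20 = 1 - (1 - p) ** 20      # chance of at least one success in 20 tries: ~81%
    print(expected_attempts, p_within_20)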
The research suggests that in the near to medium term it will be very realistic for criminals to use task-specialized models to defeat commercial security tools — even without large training datasets or massive budgets.