The Email That Started It All
Imagine you work at a 40-person logistics company. It's a Tuesday afternoon. A message lands in your inbox: "Microsoft Support Alert — unusual sign-in activity detected on your corporate account within the last 24 hours." There is a link. There is a deadline. There is urgency.
You click.
Within hours, your company's email credentials are compromised. The attackers pivot to your finance team. An invoice gets rerouted. By the time anyone notices, real money is gone.
This is not a hypothetical. According to the Verizon 2025 Data Breach Investigations Report, phishing remains the single most common initial access vector across industries. The Hoxhunt Phishing Trends Report shows the Media Production sector alone faces over 4,500 phishing attempts per 1,000 organizations annually. Government and Manufacturing are not far behind.
"The attack didn't exploit a software vulnerability. It exploited a person — someone who had no tool to help them stop and think."
That gap — between a sophisticated attack and a person with no context to evaluate it — is exactly what this project set out to close.
Why Existing Tools Fall Short for SMEs
Large organizations have Security Operations Centers, threat intelligence feeds, and dedicated analysts. Small and Medium Enterprises (SMEs) have none of that. They have an email client and a hope that their spam filter is having a good day.
The tools that do exist fall into three categories, each with a critical flaw:
Black-box ML models
Modern phishing classifiers built on BERT or RoBERTa can detect phishing with impressive accuracy. But when they flag an email, they return one word: PHISHING. No explanation. No evidence. No reason for the employee to understand why this specific email is dangerous or what to look for next time. A black box does not build security literacy — it builds dependency.
Cloud LLMs
You could pipe every email into GPT-4 or Gemini and ask for an explanation. The output would be excellent. But to do that, every email in your organization passes through a third-party server. For companies subject to HIPAA, GDPR, or basic confidentiality obligations, this is a non-starter. Sensitive client correspondence, internal financial emails, HR communications — all of it transmitted to an external API.
Simulation platforms
Phishing awareness training tools like KnowBe4 are genuinely valuable for building security culture. But they are training-time tools. They do not sit in your inbox and reason about live emails as they arrive. Training is not the same as real-time defense.
No tool existed that combined on-device privacy, real-time operation, and human-readable natural language explanation grounded in actual email evidence.
Our Approach
What if you had a small, local AI that reads your email, makes a decision, and then explains itself to you in plain English — pointing to the exact phrases that gave it away?
That is our system. A Chrome extension that sits alongside Gmail. When you open an email, it runs a locally-deployed AI model on your own machine. No internet call. No data leaving your device. The model returns:

- A verdict: phishing or legitimate
- A plain-English explanation of its reasoning
- Evidence snippets quoted verbatim from the email
- Concrete advice on what to do next
The key design constraint: everything runs on your machine. The model is small enough to run on a consumer GPU. No subscription. No API key. No data leaving your network.
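How does a browser extension talk to a model without touching the network? One plausible wiring is a loopback-only HTTP bridge: the extension posts the open email to 127.0.0.1 and gets the structured verdict back. The sketch below is illustrative; the endpoint name, port, and FastAPI choice are assumptions, not the project's actual implementation.

```python
# Sketch of a loopback-only bridge between the extension and the model.
# Endpoint name and port are hypothetical; the key point is binding to
# 127.0.0.1, so email content never crosses the network boundary.
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class EmailIn(BaseModel):
    subject: str
    body: str

def run_local_model(subject: str, body: str) -> dict:
    # Stand-in for the fine-tuned Gemma pipeline described below.
    return {"label": "...", "explanation": "...",
            "evidence_snippets": [], "user_advice": []}

@app.post("/analyze")
def analyze(email: EmailIn) -> dict:
    return run_local_model(email.subject, email.body)

if __name__ == "__main__":
    # Loopback only: nothing outside this machine can reach the server.
    uvicorn.run(app, host="127.0.0.1", port=8765)
```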
How We Built It
Three decisions shaped the technical architecture. Each one was a deliberate trade-off.
Decision 1: Why a Small Language Model?
We evaluated the landscape of Small Language Models — models with under 3 billion parameters that can run locally without data center infrastructure. We chose Google Gemma-2-2B-IT, an instruction-tuned variant with strong reasoning relative to its size. Crucially, it runs on a consumer Nvidia RTX 4060. No cloud. No subscription. An SME employee with a modern laptop can use it.
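To give a sense of how little infrastructure this requires, here is a minimal local-inference sketch using the Hugging Face transformers stack. The prompt is illustrative; only the model ID is the real release name.

```python
# Minimal local-inference sketch using Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-2b-it"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~5 GB of weights, fits an RTX 4060
    device_map="auto",           # uses the local GPU when present
)

prompt = "Analyze the following email for phishing indicators:\n..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```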
Decision 2: The Teacher-Student Pipeline
Here is the core insight of this project. A 2-billion-parameter model, out of the box, is terrible at phishing detection — our zero-shot recall was just 0.118. It catches barely one in ten phishing emails. Useless.
But what if we could teach it? We used Gemini 2.5 Pro as a teacher. For each of our 996 training emails, we gave Gemini the true label and asked it to generate a structured explanation: the reasoning, the evidence, the advice. Gemini is large, expensive, and cloud-hosted — but we only used it once, during dataset creation. The student (Gemma) learns from these high-quality examples and from then on runs entirely locally.
```
// Gemini 2.5 Pro receives the true label and email.
// It generates structured JSON that Gemma learns to replicate.
{
  "label": "PHISHING",
  "explanation": "This email uses urgency...",
  "evidence_snippets": ["micr0soft-security.example", "within 12 hours"],
  "user_advice": ["Do not click links", "Verify sender directly"]
}
```
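A sketch of that annotation step, assuming the google-generativeai Python SDK; the prompt wording and error handling are simplified, not our exact pipeline.

```python
# Teacher-side annotation sketch, assuming the google-generativeai SDK.
# This runs once, during dataset creation, never inside the product.
import json
import google.generativeai as genai

genai.configure(api_key="...")
teacher = genai.GenerativeModel("gemini-2.5-pro")

def annotate(email_text: str, true_label: str) -> dict:
    prompt = (
        f"This email is {true_label}. Explain why, returning JSON with "
        f"keys label, explanation, evidence_snippets, user_advice. "
        f"Quote all evidence verbatim from the email.\n\n{email_text}"
    )
    response = teacher.generate_content(prompt)
    # A real pipeline should strip markdown fences and validate the schema.
    return json.loads(response.text)
```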
Decision 3: LoRA Fine-Tuning
Fully fine-tuning a 2-billion-parameter model, updating every weight, requires enormous compute. We used Low-Rank Adaptation (LoRA), a technique that inserts small trainable matrices into the model's attention layers and updates only those, leaving the rest frozen. We updated just 12.45 million parameters, 0.48% of the total. The training ran on a single A100 GPU in under two epochs.
[Stats: 996 emails annotated by teacher · 12.45M parameters updated via LoRA · under two epochs on A100 GPU]
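In code, the whole idea fits in a short configuration. The sketch below uses Hugging Face peft; the rank, alpha, dropout, and target modules are illustrative assumptions, not necessarily the exact hyperparameters used in our runs.

```python
# LoRA setup sketch with Hugging Face peft. Hyperparameters here are
# illustrative, not our exact training configuration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it")

lora_config = LoraConfig(
    r=16,           # rank of the small trainable matrices
    lora_alpha=32,  # scaling factor for the LoRA updates
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total counts
```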
Seeing It in Action
Here is our system analyzing a real phishing email — a crafted "Microsoft Support Alert" designed to steal account credentials. The email uses three classic techniques: urgency framing, threat of account restriction, and a spoofed domain (micr0soft-security.example) where the letter "o" is replaced by a zero.
The sidebar shows exactly what makes this dangerous: a suspicious domain that impersonates Microsoft, urgency language designed to prevent careful thinking, and a deadline pressure tactic. The advice is immediate and actionable — no jargon, no technical background required.
This is the difference between a tool that says "phishing" and a tool that says "here is why, here is the evidence, here is what to do."
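As an aside, the zero-for-o substitution is simple enough that a crude string heuristic can catch it once you know to look. The sketch below is purely illustrative and is not how our model works; the model's value is that it surfaces this kind of evidence without a hand-written rule for every trick.

```python
# Illustrative heuristic, not part of our model: flag a domain that
# matches a known brand once common character swaps are undone.
SUBSTITUTIONS = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})
KNOWN_BRANDS = {"microsoft", "paypal", "google"}

def looks_like_brand_spoof(domain: str) -> bool:
    label = domain.split(".")[0].lower()         # "micr0soft-security"
    normalized = label.translate(SUBSTITUTIONS)  # "microsoft-security"
    return label != normalized and any(b in normalized for b in KNOWN_BRANDS)

print(looks_like_brand_spoof("micr0soft-security.example"))  # True
```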
The Numbers
Numbers are only meaningful when you understand what they measure. Here is what our evaluation found — and why each metric matters in practice.
| Model Setting | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Zero-Shot (no fine-tuning) | 0.60 | 0.42 | 0.12 | 0.18 |
| Fine-tuned — validation | 0.89 | 0.98 | 0.83 | 0.89 |
| Fine-tuned — test | 0.96 | 0.98 | 0.95 | 0.96 |
Why recall matters most here: In security, a false negative — a phishing email that gets through — is far more costly than a false alarm. Our zero-shot model had a recall of 0.12, meaning it missed 88% of phishing emails. After fine-tuning, recall jumped to 0.95. The teacher-student pipeline made the critical difference.
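To make that concrete, here is the arithmetic on a hypothetical batch of 100 phishing emails:

```python
# Recall in concrete terms, on a hypothetical batch of 100 phishing emails.
def missed(total_phishing: int, recall: float) -> int:
    return round(total_phishing * (1 - recall))

print(missed(100, 0.12))  # 88 phishing emails reach the inbox unflagged
print(missed(100, 0.95))  # 5 reach the inbox unflagged
```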
Beyond classification, we measured explanation quality:
[Stats: 97% of explanations align with the model's decision · fabricated evidence rate · inference latency on RTX 4060]
The 97% label match rate is the result we are most proud of. It means the model does not just classify correctly — it generates explanations that are semantically consistent with its own decision. When it says PHISHING, its explanation says PHISHING too, and points to real evidence from the email. That consistency is what earns user trust.
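Both checks can be automated. A simplified sketch of the idea follows; field names match the teacher's JSON schema, and our actual label-consistency check may be more involved than a keyword lookup.

```python
# Simplified sketches of the two explanation-quality checks.
def label_match(output: dict) -> bool:
    # Does the explanation's stance agree with the predicted label?
    # Simplified to a keyword check; a stricter version could use an
    # NLI model or a second classifier.
    return output["label"].lower() in output["explanation"].lower()

def evidence_is_verbatim(output: dict, email_text: str) -> bool:
    # Every cited snippet must appear word-for-word in the source email;
    # anything else counts as fabricated evidence.
    return all(s in email_text for s in output["evidence_snippets"])
```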
What We Learned
Every project that runs into real constraints teaches you something. Here are ours, honestly:
Small datasets have limits
We trained on 996 emails. That is enough to prove a concept and hit strong numbers on our test set. It is not enough to cover the full diversity of real-world phishing: multilingual attacks, QR code redirects, malicious PDF attachments, targeted spear-phishing against specific industries. The model is good at what it saw. It has not seen everything.
GPU dependency is a real barrier
Our latency numbers are based on an Nvidia RTX 4060. On a CPU-only machine, which is what most SME employees actually have, inference is substantially slower. The model is practically usable on modern workstations with discrete GPUs, but we cannot honestly claim it is ready for every laptop in a 40-person company today.
Automated evaluation has a ceiling
We measured whether explanations are label-consistent and whether evidence appears verbatim in the email. We did not measure whether a real non-technical person, reading the explanation, would make a better decision. That is the evaluation that actually matters, and it requires a human study we have not yet run.
We built something that works well on a well-defined problem with a clean dataset. Taking it to production — diverse attacks, diverse users, diverse devices — is a separate and harder problem.
What Comes Next
The current system analyzes text. Phishing in 2026 is not only text. Attackers embed malicious instructions inside images, attach weaponized PDFs, and increasingly use QR codes to route victims to credential-harvesting sites. A complete defense needs to see what the user sees — not just the plain text content of an email body.
We also want to run a proper human study. Does our system's explanation actually change what a real employee does when they see a suspicious email? Does reading "this URL uses a zero instead of the letter O to impersonate Microsoft" make someone less likely to click next time? We believe it does. We want to measure it.
On the model side, there is interesting work to be done on quantized inference — running the model in 4-bit precision to bring latency down on CPU-only machines. That would remove the GPU dependency and make our system genuinely accessible to the SMEs it is designed for.
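On the GPU side, 4-bit loading is already a small change with bitsandbytes, sketched below. A genuinely CPU-only path would more likely go through a GGUF build served by llama.cpp, which we have not benchmarked; treat this as a direction, not a result.

```python
# 4-bit loading sketch with bitsandbytes (GPU path). A CPU-only
# deployment would more likely use a GGUF build with llama.cpp.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    quantization_config=bnb,
)
```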
Closing Thoughts
Go back to the person at the logistics company. The one who clicked the link on a Tuesday afternoon.
They were not careless. They were not uninformed. They were a normal person facing a well-crafted attack designed by professionals, with no tool to help them slow down and look critically at what was in front of them.
Our work is an attempt at that tool. Not a filter that silently removes emails before anyone sees them. Not a black box that returns a verdict without evidence. A transparent, local, explainable assistant that says: here is what I found, here is why I am worried, here is what I would suggest you do.
The results suggest this is technically feasible at consumer hardware scale. The next step is putting it in front of real people and finding out if it actually helps.
The goal was never 100% detection. The goal was to give one more person the information they needed to make a better decision.
If you have questions about the project, the dataset, or the fine-tuning pipeline, reach out. The research is open and the conversation is welcome.