@dstreefkerk
Created April 1, 2026 23:04
YARA Agent Skill - Concept

Advanced Techniques

Read this file when the sample is packed or obfuscated, uses stack strings, or you need detailed hex pattern guidance (atoms, jumps, XOR/base64).


Handling obfuscation and packing

Modern malware rarely presents strings in the clear. Adapt the detection strategy rather than giving up.

XOR-encoded strings

Use the xor modifier when the sample uses single-byte XOR on known strings. Always pair with tight scope guards — xor multiplies the search space by 255x:

$x1 = "ThisIsMyC2" xor(0x01-0xff) fullword  // skip 0x00, that's the plaintext
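A minimal sketch of that pairing — the rule name, C2 string, and size threshold are illustrative placeholders, not taken from a real sample:

```yara
rule Example_XOR_Guarded
{
    strings:
        // Hypothetical C2 marker — xor(0x01-0xff) deliberately skips 0x00 (the plaintext)
        $x1 = "ThisIsMyC2" xor(0x01-0xff) fullword

    condition:
        uint16(0) == 0x5A4D and   // PE-only: keeps the 255x search off other file types
        filesize < 2MB and        // size guard evaluated before the expensive xor scan
        $x1
}
```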

Base64-encoded strings

Use base64 or base64wide when the sample embeds base64 payloads. Same performance caveat — constrain with filesize and file-type guards.
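A sketch of the same idea for base64 — the marker string is hypothetical, and note that base64/base64wide cannot be combined with nocase, xor, or fullword:

```yara
rule Example_Base64_Guarded
{
    strings:
        // Hypothetical marker; base64 matches all three encoding alignments automatically
        $b1 = "powershell -enc" base64
        $b2 = "powershell -enc" base64wide

    condition:
        uint16(0) == 0x5A4D and   // constrain file type before the expanded search
        filesize < 4MB and        // size guard — illustrative threshold
        1 of ($b*)
}
```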

Packed/encrypted binaries with no visible strings

When a sample is highly entropic and yields no useful strings, pivot to structural detection:

  • math.entropy(0, filesize) > 7.0 flags packed content (but pair with other conditions — many legitimate installers are packed too).
  • PE section anomalies: zero raw_data_size with large virtual_size indicates a section that unpacks at runtime.
  • pe.imphash() catches families that use the same import table across variants.
  • Small import table with only LoadLibrary + GetProcAddress suggests dynamic API resolution (a packer/crypter pattern).
  • Unusual section names (.UPX0, .themida, custom names) indicate specific packers.

The goal is to detect the packing behaviour or structural fingerprint rather than the (invisible) payload strings.
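These structural signals can be combined in a single condition. A sketch with illustrative thresholds — the section iterator requires a recent YARA (4.x) with the pe and math modules available:

```yara
import "pe"
import "math"

rule Example_Packed_Structural
{
    condition:
        uint16(0) == 0x5A4D and
        filesize < 4MB and
        // Entropy alone is weak evidence — legitimate installers are packed too
        math.entropy(0, filesize) > 7.0 and
        // A section that is empty on disk but large in memory unpacks at runtime
        for any section in pe.sections : (
            section.raw_data_size == 0 and
            section.virtual_size > 0x10000
        ) and
        // A tiny import table suggests dynamic API resolution (packer/crypter pattern)
        pe.number_of_imported_functions < 10
}
```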

Stack strings

Modern malware (especially C++ and Go) frequently builds strings character-by-character on the stack to evade static string extraction. The resulting assembly produces a distinctive pattern of mov byte instructions:

// Stack string pattern for "cmd" built via mov byte [rbp+offset], char
// C6 45 = mov byte ptr [rbp+...], followed by the character
$stack_cmd = { C6 45 ?? 63 C6 45 ?? 6D C6 45 ?? 64 }  // 'c', 'm', 'd'

When cleartext strings are missing but the sample clearly uses certain strings (visible in dynamic analysis or capa output), look for this mov byte pattern in the disassembly and translate it into a hex pattern. FLOSS is specifically designed to recover stack strings automatically — run it first before resorting to manual hex extraction.

Hex patterns

Hex patterns with wildcards are fast and precise for byte-level matching:

// Single-byte wildcards for relative offsets
$h1 = { 48 8B 05 ?? ?? ?? ?? 48 85 C0 74 ?? }

// Variable-length jumps — prefer [min-max] over long chains of ??
$h2 = { E8 [4] 85 C0 0F 84 [4-8] 48 89 }

// Bad — 10 wildcards when the gap is always 4-6 bytes
// $h3 = { E8 ?? ?? ?? ?? 85 C0 ?? ?? ?? ?? ?? ?? ?? ?? 48 89 }
// Good — express the actual variability
$h3 = { E8 [4] 85 C0 [4-6] 48 89 }

Use ?? for single-byte wildcards and [min-max] for variable-length gaps. Prefer [N] or [min-max] jumps over long chains of ?? — they are more performant and express intent more clearly.

Atom length rule

YARA's pre-filter extracts 4-byte "atoms" from strings to decide which files deserve a full scan. Strings shorter than 4 bytes (or hex patterns whose only literal run is < 4 bytes) cannot form a useful atom, forcing the engine to scan every file. Avoid:

  • Strings shorter than 4 characters.
  • Hex patterns where all literal segments are very short (e.g., { AA ?? BB ?? CC }).
  • Byte sequences that are extremely common at the binary level (e.g., { 00 00 00 00 }, { CC CC CC CC }).

These degrade to near-brute-force scanning.

Jump range limits

Keep variable-length jumps reasonable — under ~200 bytes. Large jumps (e.g., [0-1000]) cause state explosion in the engine's matching automaton and drastically slow scanning. If the gap between meaningful byte sequences is larger or unpredictable, split into two separate strings and constrain them with a condition:

$part_a and $part_b and @part_b > @part_a and @part_b - @part_a < 500
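Put together, a split-string rule might look like this sketch — both hex fragments are hypothetical placeholders standing in for real code sequences:

```yara
rule Example_Split_Jump
{
    strings:
        // Hypothetical fragments that sit too far apart for a single [min-max] jump
        $part_a = { 48 8B 05 ?? ?? ?? ?? 48 85 C0 }
        $part_b = { E8 [4] 85 C0 0F 84 }

    condition:
        $part_a and $part_b and
        @part_b > @part_a and
        @part_b - @part_a < 500   // ordered, and within 500 bytes of each other
}
```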

Negative guards

When a rule risks matching a known legitimate binary, add explicit exclusions rather than removing useful strings:

condition:
    uint16(0) == 0x5A4D and
    filesize < 400KB and
    not pe.exports("DllRegisterServer") and  // exclude legit COM DLLs
    not pe.imphash() == "a1b2c3d4..." and    // exclude known-good imphash
    (1 of ($s*) or 3 of ($x*))

Prefer adding constraints over removing strings. Removing a string reduces detection coverage; adding a negative guard preserves it.

Complementary tooling

YARA works best as part of a broader analysis workflow. These FLARE team open-source tools are particularly useful before and during rule writing:

FLOSS (FLARE Obfuscated String Solver) — Run FLOSS against a sample before writing strings. Unlike the standard strings utility, FLOSS recovers stack-constructed strings, decoded strings, and tight strings that malware authors use to evade basic static analysis. The output feeds directly into string selection for the rule. Especially valuable for Go binaries where the compiler generates stack strings by default.

capa (capability detection) — Run capa against a sample to understand what it does before deciding how to detect it. capa identifies capabilities at the code level (API call patterns, behaviours) and maps them to MITRE ATT&CK and MBC. Where YARA matches byte sequences, capa describes features at the function level. Use capa output to: identify which capabilities to target in the rule, find the right MITRE ATT&CK IDs for the meta block, and decide whether to pivot to structural detection (import combinations, PE anomalies) when strings alone are insufficient.

FakeNet-NG — Run FakeNet-NG during dynamic analysis to capture network indicators (C2 URLs, User-Agent strings, HTTP headers, DNS queries). These network artefacts are often embedded as strings in the binary and make excellent $s* or $x* candidates for the rule.

The workflow is: FLOSS + capa for static triage → FakeNet-NG for dynamic indicators → YARA rule using the combined output.

name: yara-rules
description: Write high-quality, performant, low-false-positive YARA rules for malware detection. Use this skill whenever the user asks to write, review, improve, refine, or debug a YARA rule, or when they provide malware samples, IOCs, or threat intel and want detection signatures. Also trigger when the user mentions YARA, detection engineering, malware signatures, hunting rules, retrohunting, or asks to convert IOCs/hashes into detection logic. Trigger when a user reports a false positive on an existing rule and wants help fixing it. Even if the user just pastes strings or hex and says "make a rule", use this skill.

Writing YARA Rules

This skill produces sound, reliable, performant, low-false-positive YARA rules. It draws on detection engineering best practices from Florian Roth (Nextron Systems), the official YARA documentation, and community-tested patterns.

Reference files

Read the relevant reference file before writing a rule when the situation applies:

| File | Read when... |
| --- | --- |
| references/targets.md | The target is non-PE (LNK, OOXML, RTF, .NET, Go), you need magic bytes, or you need module predicates |
| references/advanced-techniques.md | The sample is packed/obfuscated, uses stack strings, or you need hex pattern guidance (atoms, jumps, XOR/base64) |
| references/validation.md | You need to test a rule, handle a false positive report, or want the full annotated example |

Why most AI-generated YARA rules are bad

AI models reproduce anti-patterns from poor-quality training data. Understanding why they fail prevents repeating them:

  • Hash-equivalent rules — strings so specific they detect one sample only. Rules should catch variants.
  • Modifier spam — ascii wide nocase on everything widens the search space enormously. Each modifier must be a deliberate choice.
  • Regex overuse — expensive; defeats YARA's atom-based optimiser. Use literal strings and hex patterns instead.
  • Missing scope guards — no filesize or magic-byte check forces the engine to evaluate every file. A filesize < 5MB guard eliminates 90%+.
  • No goodware testing — a rule that fires on svchost.exe is worse than no rule at all.
  • Lazy any of them — sets the bar too low. Use quantified counts (3 of ($x*)) requiring convergence of evidence.

Rule structure

Every rule follows this layout. Deviations need justification.

import "pe"  // Import only what the condition actually uses

rule <Namespace>_<Family>_<VariantOrPurpose>
{
    meta:
        author      = "<author>"
        description = "<one-line purpose — what it detects and why>"
        reference   = "<VT hash, report URL, or analysis ticket>"
        malpedia    = "<https://malpedia.caad.fkie.fraunhofer.de/details/...>"
        date        = "<YYYY-MM-DD>"
        version     = "1.0"
        mitre_attack = "<TXXXX>"  // recommended, not mandatory

    strings:
        // Categorised by signal strength
        $s1 = "..."  // high-signal
        $x1 = "..."  // medium-signal
        $z1 = "..."  // supporting

    condition:
        // Scope guards first, then string logic
        uint16(0) == 0x5A4D and
        filesize < 400KB and
        (1 of ($s*) or 3 of ($x*) or (2 of ($x*) and 2 of ($z*)))
}

Naming convention

Use Namespace_Family_Variant format:

  • APT29_CobaltStrike_Beacon — threat actor + tooling + variant
  • Ransomware_Conti_Dropper — category + family + stage
  • Hacktool_Mimikatz_Generic — category + tool + scope

Meta block

Always populate: author, description, reference, date. Include malpedia when a Malpedia entry exists — it is the industry standard for malware classification. The description should let another analyst understand what the rule targets without reading the strings.

String selection and categorisation

This is where rules succeed or fail. The goal is to select strings that are unique to the malware family and absent from legitimate software.

Identify the pivot first

Before extracting strings, ask: "What is the one thing the author cannot easily change without breaking the malware?" This is the pivot — the anchor your rule is built around. Examples: a custom crypto constant, a hardcoded C2 URI format string, a mutex name, a specific API call sequence.

The pivot becomes your highest-signal string ($s*). Everything else supports it. If you cannot identify a pivot, the rule will be fragile — flag this and recommend structural detection (PE anomalies, import combinations) instead.

What to look for in samples

Prioritise these artefact types in order of signal strength:

  1. Typos and misspellings in metadata or version info — gold, because legitimate software is spell-checked.
  2. Unique GUIDs, mutex names, pipe names, registry paths — author-generated, unlikely to collide with legitimate software.
  3. Unique file paths or filenames for staging, persistence, or exfiltration.
  4. Distinctive command/format strings — e.g., "GET /beacon/%s/%d".
  5. Unusual import combinations — better expressed via pe.imports() in the condition than as strings.
  6. Specific byte sequences at known offsets — custom headers, XOR keys.
  7. .NET metadata — dotnet.assembly.name, dotnet.typelib, GUIDs, resource counts. Often higher signal than ASCII strings for managed assemblies.

Categorisation system

Use prefixed variable names to express signal strength:

| Prefix | Signal | Description | Example |
| --- | --- | --- | --- |
| $s* | High | Typos, unique GUIDs, distinctive paths. One match = strong evidence. | $s1 = "Micorsoft Corportation" fullword wide |
| $x* | Medium | Unlikely in goodware, not unique to one sample. Need 2+. | $x1 = "imemonsvc.dll" fullword |
| $z* | Low | Common strings that add confidence when combined with higher signals. | $z1 = "urlmon" fullword ascii |

The decision test: "Would I expect to find this string in a freshly installed Windows system, a browser, Office, or common enterprise software?" If yes, it is $z* at best. If no, $x* or $s* depending on uniqueness.

String modifiers

Apply the minimum set of modifiers needed. Justify every modifier in a comment above the string — this forces you to verify your choice:

  • fullword — use when natural word boundaries exist. Prevents substring FPs.
  • ascii — default for ASCII-only strings. Explicit is better.
  • wide — only when the string genuinely appears as UTF-16LE in samples.
  • ascii wide — only when both encodings appear across your sample set.
  • nocase — sparingly. Justified for registry values, HTTP headers. Not for GUIDs, paths, or API names.
  • private — helper strings that should match but not appear in output.
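A strings block annotated in this justify-every-modifier style might look like the following sketch (all values are hypothetical):

```yara
strings:
    // wide: appears only as UTF-16LE in the version-info resource
    // fullword: bounded by null bytes in the resource data
    $s1 = "Example Corportation" fullword wide

    // nocase: registry value names are case-insensitive by design
    $x1 = "SOFTWARE\\Example\\Run" nocase ascii

    // private: anchors the match but should not clutter yara -s output
    $p1 = "EXAMPLEHDR" private ascii
```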

For obfuscation modifiers (xor, base64), hex patterns, and stack string detection, read references/advanced-techniques.md.

Condition construction

The condition is evaluated top-to-bottom. Place the cheapest, most restrictive checks first so the engine can bail out early.

Scope guards (always include)

condition:
    // 1. Magic bytes — eliminates non-matching file types instantly
    uint16(0) == 0x5A4D and

    // 2. File size — eliminates oversized files before string scanning
    filesize < 400KB and

    // 3. Module predicates — structural requirements
    pe.number_of_sections > 4 and

    // 4. String logic — evaluated last, most expensive
    (1 of ($s*) or 3 of ($x*) or (2 of ($x*) and 2 of ($z*)))

For magic bytes, module predicates, and non-PE file types (LNK, OOXML, RTF, .NET, Go), read references/targets.md.

Evidence layering

Prefer quantified counts over any of them. Set minimum match thresholds that require convergence of evidence:

// Good — escalating confidence
(
    1 of ($s*) or         // one hard indicator is enough
    3 of ($x*) or         // three medium indicators converge
    (2 of ($x*) and 2 of ($z*))  // two medium + two supporting
)

// Good — threshold with offset constraint
3 of ($x*) and $x1 in (0..1024)

// Avoid — one weak string triggers detection
any of them

Use in (offset..offset) or at offset to restrict matches to specific file regions. Avoid counting loops (for any i in ...) unless necessary — they are slower than X of patterns.

Performance ranking

Fastest to slowest:

  1. Literal text strings and hex patterns
  2. fullword, ascii, wide modifiers
  3. nocase modifier
  4. PE module predicates
  5. xor / base64 modifiers
  6. Regular expressions
  7. Counting loops and for expressions
  8. math.entropy() over large ranges

Endpoint scanning demands fast rules (scope guards + simple strings). Retrohunting (VT, internal corpus) amplifies even small inefficiencies across millions of files.

Safety guard

If the user asks to write a YARA rule targeting a legitimate system file (e.g., kernel32.dll, svchost.exe) for learning purposes, provide it with a clear warning: deploying such a rule on an endpoint scanner will cause a denial-of-service condition. Recommend testing only against offline copies in an isolated environment.

Workflow

When the user requests a YARA rule, follow these steps:

  1. Clarify the target — Identify the malware family, variant, behaviour, or TTP. If the user has no samples, explicitly ask: "Do you have a sample hash, a list of imports, or a report? A sample yields a much tighter rule — without one, I'll base this on known TTPs."

  2. Identify the pivot — Ask: "What is the one thing the author cannot easily change without breaking the malware?" If no clear pivot, lean on structural detection.

  3. Extract candidate strings — Pull at least 5-8 candidates across signal levels. Use FLOSS rather than strings to recover obfuscated and stack strings. Run capa to identify capabilities and inform which behaviours to target.

  4. Categorise strings — Assign $s*, $x*, $z* prefixes using the decision test: "Would this appear in legitimate software?"

  5. Select modifiers — Apply the minimum set. Justify each in a comment.

  6. Assess obfuscation — If packed or encoded, decide: xor/base64 with tight guards, or pivot to structural detection. Read references/advanced-techniques.md for patterns.

  7. Build the condition — Scope guards first, then quantified X of patterns. Prefer evidence layering over low-bar any of.

  8. Write full meta — Author, description, reference, malpedia, date, MITRE ATT&CK.

  9. Perform the negative test — Identify at least one legitimate file type that could plausibly match, and explain how the rule avoids it.

  10. Self-review against common pitfalls:

    • Any string that could match common software?
    • Any unnecessary modifiers?
    • Any string shorter than 4 bytes (cannot form a useful atom)?
    • Is filesize constrained?
    • Is the file type checked?
    • Does the condition short-circuit on cheap checks first?
    • Would this rule match only one sample (too narrow) or half the internet (too broad)?
    • Is any of them used? If so, replace with a quantified count.
  11. Output the rule with these mandatory annotations:

    • A comment above each string explaining why it was chosen and its signal level (this is not optional — it ensures every string is justified).
    • A comment justifying why each modifier is needed (e.g., // wide: appears as UTF-16LE in version info).
    • A brief explanation after the rule covering key design decisions, the negative test result, and testing recommendations.

For the full annotated example, validation steps, and FP handling process, read references/validation.md.

Target-Specific Detection Patterns

Read this file when the target is non-PE, you need magic bytes for scope guards, or you need module predicates.


Magic bytes

Use these as the first line of the condition for instant file-type filtering:

Note that uint16/uint32 read little-endian, so each hex constant below is the byte-reversed on-disk magic (use uint32be to compare in natural byte order):

| On-disk bytes | Format | Check |
| --- | --- | --- |
| 4D 5A ("MZ") | MZ/PE | uint16(0) == 0x5A4D |
| 7F 45 4C 46 | ELF | uint32(0) == 0x464C457F |
| 50 4B 03 04 | ZIP/OOXML | uint32(0) == 0x04034B50 |
| 25 50 44 46 ("%PDF") | PDF | uint32(0) == 0x46445025 |
| D0 CF 11 E0 | OLE (legacy Office) | uint32(0) == 0xE011CFD0 |
| 4C 00 00 00 | LNK | uint32(0) == 0x0000004C |
| 7B 5C 72 74 ("{\rt") | RTF | uint32(0) == 0x74725C7B |
| CF FA ED FE | Mach-O 64-bit | uint32(0) == 0xFEEDFACF |

Module usage guide

Import only what you use. Each module adds overhead. Not all YARA environments have all modules compiled in — endpoint agents, SIEM integrations, and VT retrohunts may support different module sets. If the user has a specific deployment target, ask which YARA version and modules are available. When in doubt, rules that rely only on uint16(0), filesize, and string logic are the most portable.

| Module | When to use | Common predicates |
| --- | --- | --- |
| pe | Windows executables and DLLs | pe.is_pe(), pe.imphash(), pe.number_of_sections, pe.imports("kernel32.dll", "VirtualAlloc"), pe.exports("DllRegisterServer"), pe.rich_signature.key |
| elf | Linux binaries | elf.type, elf.machine, elf.number_of_sections, elf.symtab_entries |
| dotnet | .NET assemblies — often higher signal than ASCII strings | dotnet.assembly.name, dotnet.typelib, dotnet.number_of_resources, dotnet.module_name, GUID matching |
| math | Entropy checks for packed/encrypted content | math.entropy(0, filesize) > 7.0 — always combine with other conditions |
| hash | Known-section or known-resource matching | hash.md5(pe.rich_signature.clear_data) — not for full-file hashes |
| magic | File type when no magic byte check suffices | magic.mime_type() == "application/pdf" |

Non-PE targets

Not all malware is a Windows PE. Adapt scope guards for the file type.

LNK files

Increasingly used for initial access. Small files with a distinctive header:

uint32(0) == 0x0000004C and filesize < 100KB
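A sketch of a LNK rule built on those guards — the command-line fragments are illustrative, chosen because LNK argument data is stored as UTF-16LE:

```yara
rule Example_LNK_Downloader
{
    strings:
        // Hypothetical embedded command-line fragments; wide because LNK
        // argument data is UTF-16LE
        $x1 = "powershell" wide nocase
        $x2 = "-WindowStyle Hidden" wide
        $x3 = "http://" wide

    condition:
        uint32(0) == 0x0000004C and   // LNK header
        filesize < 100KB and          // LNK files are small
        2 of ($x*)
}
```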

Office documents (OOXML)

ZIP-based format. Look for embedded macros, OLE objects, or suspicious URLs:

uint32(0) == 0x04034B50 and filesize < 10MB  // little-endian read of 50 4B 03 04
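For example, a minimal macro-bearing OOXML check might look like this sketch. Flagging any macro-enabled document is deliberately broad — in practice, pair it with family-specific strings:

```yara
rule Example_OOXML_Macro
{
    strings:
        // ZIP entry name of the embedded VBA project — present whenever
        // the document carries macros
        $macro = "vbaProject.bin" ascii

    condition:
        // 0x04034B50 is the little-endian read of the ZIP local-file magic 50 4B 03 04
        uint32(0) == 0x04034B50 and
        filesize < 10MB and
        $macro
}
```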

RTF documents

Text-based format frequently exploited for embedded OLE objects:

uint32(0) == 0x74725C7B and filesize < 5MB  // little-endian read of "{\rt"

.NET assemblies

Prefer dotnet module predicates over string matching — they are more specific and resistant to trivial obfuscation:

import "dotnet"

rule Malware_DotNet_Example
{
    condition:
        uint16(0) == 0x5A4D and
        dotnet.assembly.name == "SuspiciousPayload" and
        dotnet.number_of_resources > 5 and
        dotnet.typelib == "{MALICIOUS-GUID-HERE}"
}

Go binaries

Increasingly common for malware due to easy cross-compilation, static linking, and large binary size that can evade some AV scanners. Go binaries have distinctive structural features:

rule Malware_Go_Example
{
    meta:
        description = "Example Go malware detection using embedded metadata"

    strings:
        // Go build metadata — present in most non-stripped Go binaries
        $go_build = "Go build" fullword ascii
        $go_path  = "go.buildid" fullword ascii

        // User-defined Go module paths are high-signal — the malware author's
        // module path (e.g., "github.com/attacker/c2tool") is unique and
        // persists across compilations. Extract these with FLOSS or strings.
        $s1 = "github.com/malicious/implant" fullword ascii

        // User-defined function names survive in non-stripped Go binaries
        // and are excellent pivot strings
        $s2 = "main.runC2Beacon" fullword ascii
        $s3 = "main.exfilData" fullword ascii

    condition:
        uint16(0) == 0x5A4D and
        // Go binaries are large due to static linking — typically 2-15MB
        filesize > 1MB and filesize < 20MB and
        ($go_build or $go_path) and
        (1 of ($s*))
}

Key characteristics for detection:

  • Go module paths and function names are embedded as plaintext in non-stripped builds — these are the best pivot strings.
  • The .gopclntab section is unique to Go, and a .symtab section inside a PE binary is another strong Go indicator — both serve as structural fingerprints.
  • Average Go malware size is ~4.6MB due to static linking, so filesize > 1MB guards against FPs on small legitimate utilities.
  • Go cross-compiles to Windows (PE), Linux (ELF), and macOS (Mach-O) — adjust the magic byte check accordingly (uint32(0) == 0x464C457F for ELF, 0xFEEDFACF for Mach-O 64; YARA's uintXX reads are little-endian).
  • Use FLOSS to extract strings — it recovers obfuscated and stack strings that Go's compiler frequently generates.

Validation, Refinement, and Examples

Read this file when you need to test a rule, handle a false positive report, or want the full annotated example.


Goodware corpus

A rule that has not been tested against goodware is not finished.

A practical goodware corpus for Windows-targeted rules includes: contents of C:\Windows\System32, C:\Program Files, a fresh Office install, common browsers (Chrome, Firefox, Edge), and common enterprise tools (7-Zip, Notepad++, PuTTY). For Linux rules: /usr/bin, /usr/lib, common packages for the target distro.

Aim for zero false positives. One FP on a common system file makes the rule operationally useless.

The negative test

Before finalising a rule, explicitly identify at least one legitimate file type that could plausibly trigger it, and explain how the rule avoids that match.

Example: "This rule could theoretically match a legitimate Microsoft-signed binary with a large import table, but the filesize < 400KB guard and the pe.number_of_sections > 4 check eliminate all known legitimate cases."

This forces a deliberate assessment of FP risk rather than hoping for the best.

Validation steps

  1. Scan known-bad samples — confirm the rule matches at least 3-5 variants. Use yara -s <rule> <samples> to verify which strings hit and where.
  2. Scan goodware corpus — confirm zero or near-zero matches. If FPs appear, identify which string(s) caused them and either tighten the string, add fullword, restrict with offset constraints, or recategorise.
  3. Check string coverage — use yara -s output to confirm that each string contributes to at least one match. Dead strings are clutter.
  4. Performance test — time the scan on a realistic dataset. If noticeably slow, check for regex, nocase, or missing scope guards.
  5. Variant test — if possible, scan related samples (same family, different versions) to verify the rule generalises beyond the initial sample set.

Iterative tightening

Start slightly broad, then add constraints until the FP rate is acceptable. It is easier to tighten a rule that catches too much than to broaden one that misses variants.

Rule refinement (handling false positives)

When a user reports a false positive, follow this triage process:

1. Identify the offending string(s)

Ask which file triggered the rule. Run yara -s against that file to see exactly which strings matched.

2. Classify the FP cause

  • String collision — a $x* or $z* string appears in legitimate software. Fix: add fullword, restrict with in (offset..offset), or downgrade to $z* and raise the count threshold in the condition.
  • Overly broad condition — the string logic threshold is too low. Fix: increase the required count (e.g., 2 of ($x*) → 3 of ($x*)).
  • Missing scope guard — the rule matches a file type or size it shouldn't. Fix: tighten filesize, add a magic-byte check, or add module constraints.
  • Legitimate software overlap — a specific legitimate binary shares structural features with the malware. Fix: add a negative guard excluding the specific legitimate binary by export, imphash, or signed certificate.

3. Apply the least disruptive fix

Prefer adding constraints over removing strings. Removing a string reduces detection coverage; adding a negative guard or tightening a threshold preserves it.

4. Re-validate

After any change, re-scan both the malware sample set (confirm no detection loss) and the goodware corpus (confirm the FP is resolved).

Example refinement

Rule FP'd on svchost.exe:

// Before: FP because svchost.exe imports urlmon and has >4 sections
(2 of ($x*) and all of ($z*))

// After: added negative guard for known-good export
not pe.exports("ServiceMain") and
(2 of ($x*) and all of ($z*))

Annotated example

This example demonstrates every principle from the skill: signal categorisation, scope guards, evidence layering, mandatory annotations, and the negative test.

import "pe"

rule Ransomware_Conti_Dropper_Generic
{
    meta:
        author      = "Detection Engineering"
        description = "Conti ransomware dropper/loader — detected via version-info typo, unique service DLL name, and anomalous PE section count"
        reference   = "https://example.com/conti-analysis-report"
        malpedia    = "https://malpedia.caad.fkie.fraunhofer.de/details/win.conti"
        date        = "2026-04-02"
        version     = "1.0"
        mitre_attack = "T1027, T1055"

    strings:
        // HIGH-SIGNAL (pivot): Version-info typo — legitimate software would not
        // ship with "Micorsoft Corportation". The author cannot fix this without
        // rebuilding their toolchain. One match is strong evidence.
        // wide: appears as UTF-16LE in PE version info resource
        $s1 = "Micorsoft Corportation" fullword wide

        // HIGH-SIGNAL: Unique GUID used as mutex — generated by the malware
        // author, no legitimate collision expected.
        // fullword: bounded by null bytes in the binary
        $s2 = "{53A4988C-F91F-4054-9076-220AC5EC03F3}" fullword

        // MEDIUM-SIGNAL: Service DLL name unique to this family. Not in any
        // known legitimate software, but could theoretically appear elsewhere.
        // wide: appears as UTF-16LE in service registration strings
        $x1 = "imemonsvc.dll" fullword wide

        // MEDIUM-SIGNAL: Temp file used during payload extraction.
        // fullword: bounded by path separators
        $x2 = "iphlpsvc.tmp" fullword

        // MEDIUM-SIGNAL: Distinctive format string for C2 callback.
        // ascii: appears as ASCII in HTTP request construction
        $x3 = "/gate.php?id=%s&build=%d" fullword ascii

        // SUPPORTING: Common API library — present in many legitimate binaries.
        // Only meaningful when combined with higher-signal matches.
        // ascii: ASCII import name in PE import table
        $z1 = "urlmon" fullword ascii

        // SUPPORTING: Common API used for process injection.
        // ascii: ASCII import name
        $z2 = "VirtualAllocEx" fullword ascii

    condition:
        // Guard 1: Must be a PE file (MZ header check)
        uint16(0) == 0x5A4D and

        // Guard 2: Conti droppers are typically small — eliminates large
        // binaries before expensive string matching begins
        filesize < 400KB and

        // Guard 3: Anomalous section count adds structural confidence
        pe.number_of_sections > 4 and

        // String logic with evidence layering:
        // - One high-signal match is sufficient
        // - Three medium-signal matches converge independently
        // - Two medium + two supporting provides blended confidence
        (
            1 of ($s*) or
            3 of ($x*) or
            (2 of ($x*) and 2 of ($z*))
        )
}

Why this rule works

The scope guards eliminate ~95% of files before string evaluation. The high-signal strings ($s*) are unique enough that a single match justifies detection. The medium-signal path requires convergence of three rare strings, or two rare strings plus two supporting indicators — this evidence layering is robust against slight sample variations. The fullword modifier on every string prevents substring matches against innocent content. The wide modifier appears only where the string genuinely appears as UTF-16LE in the sample.

Negative test: This rule could theoretically match a small PE with many sections that happens to import urlmon and VirtualAllocEx. However, the $x* strings (imemonsvc.dll, iphlpsvc.tmp, the C2 format string) are unique to this family and absent from any known legitimate software, so the medium-signal path requires at least two of these to match simultaneously.

Common pitfalls — quick reference

| Pitfall | Why it fails | Fix |
| --- | --- | --- |
| Full-file hash in condition | Detects exactly one sample; a hash list does this faster | Use strings and module features |
| Regex for a simple string | Slow, defeats atom optimiser | Use literal string + fullword |
| ascii wide nocase on everything | Massive search space, FPs from case-insensitive wide matches | Apply only the modifiers the sample requires |
| any of them in condition | One weak string triggers detection; high FP rate | Use quantified counts: 3 of ($x*) |
| No goodware testing | Unknown FP rate = operationally useless | Scan a goodware corpus before releasing |
| Overly generic strings ("http", "cmd.exe") | Matches thousands of legitimate files | Choose strings unique to the malware |
| No filesize or file-type guard | Engine evaluates strings against every file | Always constrain scope |
| Single-sample rule | Breaks on the next variant | Require 3+ samples or use structural features |
| Undocumented meta | Other analysts cannot maintain the rule | Always populate author, description, reference, date |
| Long ?? chains in hex patterns | Slow when distance varies; obscures intent | Use [min-max] jumps instead |
| Giving up on packed samples | No detection for obfuscated malware | Pivot to structural detection (PE anomalies, imphash, entropy) |
| Strings shorter than 4 bytes | Cannot form a useful atom for YARA's pre-filter | Use longer, more specific strings |