Email Parsing for AI Agents: From Raw MIME to Structured JSON

You'd think parsing an email would be easy. It's just text, right? A subject, a body, maybe some attachments. Your AI agent receives an email, you extract the text, pull out the OTP with a regex, and move on.

Then you encounter your first multipart/mixed message with a quoted-printable encoded HTML body containing a 6-digit code rendered as individual <span> elements with CSS letter-spacing, and suddenly you understand why email parsing is a career-ending bug factory.

This post is a technical deep-dive into what actually happens between "email arrives" and "agent has structured data." If you're building AI agents that interact with email, understanding this pipeline will save you weeks of debugging.

MIME: The Format That Refuses to Die

Every email your agent receives is encoded as a MIME message (Multipurpose Internet Mail Extensions), defined primarily by RFC 2045 through RFC 2049, with updates scattered across dozens of subsequent RFCs. Those RFCs date to 1996, the original MIME specification (RFC 1341) to 1992, and it shows.

Here's what a "simple" email looks like at the wire level:

From: noreply@example.com
To: agent-7x9k@inboxes.lumbox.dev
Subject: Your verification code
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="----=_Part_123_456789.1712000000000"

------=_Part_123_456789.1712000000000
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

Your verification code is 847291.

------=_Part_123_456789.1712000000000
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<body>
<div style=3D"font-family: Arial, sans-serif;">
<p>Your verification code is:</p>
<div style=3D"font-size: 32px; letter-spacing: 8px; font-weight: bold;">
<span>8</span><span>4</span><span>7</span><span>2</span><span>9</span><span>1</span>
</div>
</div>
</body>
</html>
------=_Part_123_456789.1712000000000--

Even in this "simple" example, there are multiple layers to unpack:

The multipart/alternative content type means the email contains multiple representations of the same content — the recipient's client picks which to display
The boundary string ----=_Part_123_456789.1712000000000 separates the parts: each delimiter line is the boundary prefixed with --, and the closing delimiter also has -- appended
The HTML part uses quoted-printable encoding, where =3D represents the = character
The OTP "847291" appears as individual <span> elements in the HTML, a pattern used for styling and sometimes as a deliberate anti-scraping measure

And this is the easy case. Let's look at what makes it worse.

Content-Transfer-Encoding: The First Layer of Pain

Email was designed for 7-bit ASCII transmission. Any content outside that range must be encoded. You'll encounter three main encodings:

7bit / 8bit

Plain text, sent as-is with no encoding to undo: 7bit is pure ASCII, while 8bit allows raw non-ASCII bytes. HTML emails more often arrive as quoted-printable or base64.

quoted-printable

Encodes non-ASCII bytes as =XX hex pairs. The equals sign itself becomes =3D. Lines are limited to 76 characters, with a trailing = marking a soft line break. This means a single HTML tag might be split across multiple lines:

Content-Transfer-Encoding: quoted-printable

<div style=3D"background-color: #f0f0f0; padding: 20px; text-align: cente=
r; font-size: 24px;">Your code: <strong>8472=
91</strong></div>

Notice the OTP "847291" is split across a soft line break (8472=\n91). A naive regex looking for \d{6} in the raw, still-encoded body will miss it entirely; the soft line breaks have to be removed first.
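To make the decoding step concrete, here is a minimal sketch of a quoted-printable decoder, assuming Node.js (it relies on the global TextDecoder). The name decodeQuotedPrintable is made up for this example; in practice a MIME library would normally do this for you.

```javascript
// Sketch: decode a quoted-printable body into a string.
// decodeQuotedPrintable is a hypothetical helper name, not a library API.
function decodeQuotedPrintable(input, charset = "utf-8") {
  // 1. Remove soft line breaks: "=" (plus optional trailing whitespace) at end of line.
  const joined = input.replace(/=[ \t]*\r?\n/g, "");

  // 2. Turn =XX escapes into raw bytes; everything else passes through unchanged.
  const bytes = [];
  for (let i = 0; i < joined.length; i++) {
    const hex = joined.slice(i + 1, i + 3);
    if (joined[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2;
    } else {
      bytes.push(joined.charCodeAt(i) & 0xff);
    }
  }

  // 3. Interpret the bytes using the part's declared charset.
  return new TextDecoder(charset).decode(Uint8Array.from(bytes));
}

const qpBody =
  '<div style=3D"text-align: cente=\r\n' +
  'r;">Your code: <strong>8472=\r\n' +
  "91</strong></div>";

const qpDecoded = decodeQuotedPrintable(qpBody);
console.log(qpDecoded);
```

Once the soft line breaks are gone, the six digits are contiguous again and a digit regex can see them.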

base64

The entire body is base64-encoded. Common when the content is mostly non-ASCII (quoted-printable would spend three characters on every such byte), and some clients and servers simply use it for every part:

Content-Transfer-Encoding: base64

PCFET0NUWVBFIGh0bWw+CjxodG1sPgo8Ym9keT4KPGRpdj5Zb3VyIHZl
cmlmaWNhdGlvbiBjb2RlIGlzOiA8c3Ryb25nPjg0NzI5MTwvc3Ryb25n
PjwvZGl2Pgo8L2JvZHk+CjwvaHRtbD4=

You must decode first, then parse. Miss this step and your "email body" is a wall of alphanumeric gibberish.
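A minimal sketch of that decode step, assuming Node.js (Buffer and TextDecoder are globals there). decodeBase64Body is a name invented for this example, and the charset would come from the part's Content-Type header.

```javascript
// Sketch: decode a base64 MIME body into text.
// decodeBase64Body is a hypothetical helper name.
function decodeBase64Body(body, charset = "utf-8") {
  // MIME wraps base64 output across lines, so strip all whitespace first.
  const compact = body.replace(/\s+/g, "");
  const bytes = Buffer.from(compact, "base64");
  return new TextDecoder(charset).decode(bytes);
}

// The base64 body from the example above.
const b64Body = [
  "PCFET0NUWVBFIGh0bWw+CjxodG1sPgo8Ym9keT4KPGRpdj5Zb3VyIHZl",
  "cmlmaWNhdGlvbiBjb2RlIGlzOiA8c3Ryb25nPjg0NzI5MTwvc3Ryb25n",
  "PjwvZGl2Pgo8L2JvZHk+CjwvaHRtbD4=",
].join("\r\n");

const b64Html = decodeBase64Body(b64Body);
console.log(b64Html.match(/\d{6}/)[0]); // prints "847291"
```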

Multipart Structures: It's Trees All the Way Down

Real-world emails are nested multipart structures. Here's a common pattern for an email with HTML body, plain text fallback, and an inline image:

Content-Type: multipart/mixed
├── Content-Type: multipart/related
│   ├── Content-Type: multipart/alternative
│   │   ├── Content-Type: text/plain          ← plaintext body
│   │   └── Content-Type: text/html           ← HTML body
│   └── Content-Type: image/png               ← inline logo (CID referenced)
│       Content-ID: <logo@example.com>
│       Content-Disposition: inline
└── Content-Type: application/pdf             ← attachment
    Content-Disposition: attachment; filename="receipt.pdf"

To extract the readable body, you need to:

Walk the tree recursively
Identify which parts are "body" vs "attachment" vs "inline resource"
Prefer HTML over plaintext (or extract both)
Resolve CID references (cid:logo@example.com) to the actual inline images
Decode each part according to its Content-Transfer-Encoding
Handle character sets (the HTML part might be charset=windows-1252, not UTF-8)

Get any of these wrong and your agent receives garbled text, missing content, or — worst case — parses an attachment's binary content as the email body.
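The walk described above can be sketched roughly as follows. This assumes the message has already been parsed into a tree; the part shape (contentType, disposition, contentId, filename, content, children) is hypothetical, and real parsers expose similar fields under their own names.

```javascript
// Sketch: classify the leaves of an already-parsed MIME tree.
// The part shape used here is an assumption made for illustration.
const MAX_DEPTH = 20; // arbitrary guard against pathologically deep nesting

function collectParts(
  part,
  out = { text: null, html: null, inline: [], attachments: [] },
  depth = 0
) {
  if (depth > MAX_DEPTH) return out;

  const type = (part.contentType || "text/plain").toLowerCase();

  if (type.startsWith("multipart/")) {
    for (const child of part.children || []) {
      collectParts(child, out, depth + 1);
    }
    return out;
  }

  if (part.disposition === "attachment") {
    out.attachments.push(part);
  } else if (type === "text/html" && out.html === null) {
    out.html = part.content;
  } else if (type === "text/plain" && out.text === null) {
    out.text = part.content;
  } else if (part.contentId) {
    // Referenced from the HTML as cid:<id>; strip the angle brackets for lookup.
    out.inline.push({ cid: part.contentId.replace(/^<|>$/g, ""), part });
  } else {
    out.attachments.push(part);
  }
  return out;
}

// The tree from the diagram above, with placeholder content.
const message = {
  contentType: "multipart/mixed",
  children: [
    {
      contentType: "multipart/related",
      children: [
        {
          contentType: "multipart/alternative",
          children: [
            { contentType: "text/plain", content: "Your code is 847291." },
            { contentType: "text/html", content: "<p>Your code is <b>847291</b>.</p>" },
          ],
        },
        { contentType: "image/png", disposition: "inline", contentId: "<logo@example.com>", content: "(binary)" },
      ],
    },
    { contentType: "application/pdf", disposition: "attachment", filename: "receipt.pdf", content: "(binary)" },
  ],
};

const parts = collectParts(message);
const readableBody = parts.html ?? parts.text; // prefer HTML, fall back to plaintext
console.log(readableBody);
```

Checking the disposition before the content type is what keeps an attached .txt or .html file from being mistaken for the body.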

Why Naive Regex Fails for OTP Extraction

Let's say you've correctly decoded the email body. Now you just need to find the 6-digit code. Easy, right? /\d{6}/?

Here are real-world OTP emails that break naive regex:

Example 1: Digits split across HTML elements

<div class="code">
  <span class="digit">8</span>
  <span class="digit">4</span>
  <span class="digit">7</span>
  <span class="digit">2</span>
  <span class="digit">9</span>
  <span class="digit">1</span>
</div>

After stripping HTML tags, you get 8 4 7 2 9 1 with spaces. The regex /\d{6}/ finds nothing. You need to account for whitespace and HTML between digits.
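One way to account for this, as a rough sketch: strip the tags, then rejoin any run of lone digits that are separated only by whitespace. rejoinSplitDigits is a hypothetical helper, the threshold of four digits is an arbitrary choice, and regex tag-stripping is itself fragile, so treat this as an illustration rather than a robust parser.

```javascript
// Sketch: strip tags, then rejoin runs of single digits separated only by whitespace.
// rejoinSplitDigits is a hypothetical helper name.
function rejoinSplitDigits(html) {
  const text = html.replace(/<[^>]+>/g, " ");
  return text
    // A run of 4+ lone digits ("8 4 7 2 9 1") is collapsed into one token.
    .replace(/\b\d(?:\s+\d\b){3,}/g, (run) => run.replace(/\s+/g, ""))
    .replace(/\s+/g, " ")
    .trim();
}

const splitDigitsHtml = [
  '<div class="code">',
  '  <span class="digit">8</span>',
  '  <span class="digit">4</span>',
  '  <span class="digit">7</span>',
  '  <span class="digit">2</span>',
  '  <span class="digit">9</span>',
  '  <span class="digit">1</span>',
  "</div>",
].join("\n");

console.log(rejoinSplitDigits(splitDigitsHtml)); // prints "847291"
```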

Example 2: Zero-width characters

<span>8&#8203;4&#8203;7&#8203;2&#8203;9&#8203;1</span>

The &#8203; entities are zero-width spaces. Visually the code renders as "847291" but the raw text is 8\u200B4\u200B7\u200B2\u200B9\u200B1. Your regex matches nothing. Some services insert these deliberately to prevent automated extraction.
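A small sketch of the cleanup this calls for: decode numeric character references, then drop invisible code points before matching. The list of invisible characters here is a reasonable starting set, not an exhaustive one, and stripInvisible is a name invented for the example.

```javascript
// Sketch: decode numeric character references, then drop invisible code points.
// Soft hyphen, zero-width space/non-joiner/joiner, word joiner, BOM.
const INVISIBLE = /[\u00AD\u200B-\u200D\u2060\uFEFF]/g;

// Guard against out-of-range references, which would make fromCodePoint throw.
function toChar(codePoint) {
  return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : "\uFFFD";
}

function stripInvisible(fragment) {
  return fragment
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => toChar(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_, dec) => toChar(Number(dec)))
    .replace(/&zwnj;/gi, "\u200C")
    .replace(/&zwj;/gi, "\u200D")
    .replace(INVISIBLE, "");
}

const zwsHtml = "<span>8&#8203;4&#8203;7&#8203;2&#8203;9&#8203;1</span>";
console.log(stripInvisible(zwsHtml)); // prints "<span>847291</span>"
```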

Example 3: The code is in an image

Some services render the OTP as an image. There's no text to extract at all — you need OCR. This is rare for OTP emails but common enough to be aware of.

Example 4: Multiple number sequences

Your order #483920 has been confirmed.
Use code 847291 to verify your account.
This code expires in 600 seconds.

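A naive /\d{6}/ match returns "483920" (the order number), not "847291" (the OTP). You need contextual extraction: looking for numbers that appear near keywords like "code," "verification," or "OTP." Pick trigger words carefully; a generic one like "confirm" would match "confirmed" here and point straight back at the order number.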
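As a rough sketch of contextual extraction: collect every candidate number, then keep the one that sits closest after a trigger word. The trigger list, the 4 to 8 digit range, and the 40-character window are all assumptions chosen for this example.

```javascript
// Sketch: prefer the digit sequence that most closely follows a trigger word.
// Trigger words, digit range, and window size are illustrative assumptions.
const TRIGGERS = /\b(?:verification|code|otp|passcode|pin)\b/gi;
const MAX_GAP = 40; // max characters between the trigger word and the number

function extractNumericOtp(text) {
  // End offsets of every trigger word in the text.
  const triggerEnds = [...text.matchAll(TRIGGERS)].map((m) => m.index + m[0].length);

  let best = null;
  for (const m of text.matchAll(/\b\d{4,8}\b/g)) {
    for (const end of triggerEnds) {
      const gap = m.index - end; // negative when the trigger comes after the number
      if (gap >= 0 && gap <= MAX_GAP && (best === null || gap < best.gap)) {
        best = { code: m[0], gap };
      }
    }
  }
  return best ? best.code : null;
}

const orderEmail = [
  "Your order #483920 has been confirmed.",
  "Use code 847291 to verify your account.",
  "This code expires in 600 seconds.",
].join("\n");

console.log(extractNumericOtp(orderEmail)); // prints "847291"
```

The order number is ignored because no trigger word precedes it, and "600" is too short to be a candidate at all.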

Example 5: Alphanumeric codes

Your verification code is: A8K-29F

Not all verification codes are purely numeric. Some are alphanumeric with hyphens. Your digit-only regex won't even consider this a candidate.
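A sketch of a candidate pattern that does consider such codes: look for an uppercase alphanumeric token, optionally hyphenated, right after the word "code", and require at least one digit so ordinary words are rejected. The exact pattern is an assumption for illustration; real senders vary a lot.

```javascript
// Sketch: find a code that follows the word "code", allowing letters and hyphens.
// The pattern is an illustrative assumption, not an exhaustive rule.
function extractAlphanumericCode(text) {
  // The capture group is case-sensitive on purpose: it only accepts uppercase
  // letters and digits, so ordinary lowercase words after "code" are skipped.
  const pattern =
    /\b(?:code|Code|CODE)\b(?:\s+is)?\s*:?\s*([A-Z0-9]{2,}(?:-[A-Z0-9]{2,})*)/g;

  for (const match of text.matchAll(pattern)) {
    const candidate = match[1];
    if (/\d/.test(candidate)) return candidate; // require at least one digit
  }
  return null;
}

console.log(extractAlphanumericCode("Your verification code is: A8K-29F")); // prints "A8K-29F"
```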

The Extraction Pipeline

Here's the full pipeline that Lumbox runs when an email arrives at an agent's inbox. Understanding this pipeline helps you appreciate why "just parse the email" is a week of work, not an afternoon.

Stage 1: Raw MIME → Decoded Parts

The raw RFC 5322 message (the successor to RFC 2822) is parsed into a tree structure. Each MIME part is decoded according to its Content-Transfer-Encoding. Character sets are normalized to UTF-8. Malformed messages (and there are many — not every mail server follows the RFCs) are handled with fallback heuristics.

Stage 2: Decoded Parts → Body Extraction

The tree is walked to find the "primary" body parts. HTML is preferred when available. Inline images are resolved. The HTML is sanitized (script tags removed, dangerous attributes stripped) but structural markup is preserved for parsing.

Stage 3: Body → Text Extraction

The HTML body is converted to clean text with structural awareness. Tables are flattened. List items are preserved. Zero-width characters and invisible Unicode are stripped. HTML entities are decoded.

Stage 4: Text → OTP / Link Extraction

Multiple extraction strategies run in parallel:

Pattern matching: Context-aware regex that looks for codes near trigger words, not just bare digit sequences
HTML structure analysis: Identifies "code" elements by class names, inline styles (large font, letter-spacing, monospace), and structural position
Link extraction: Verification links (https://example.com/verify?token=...) are identified and extracted separately
Digit reassembly: Digits split across HTML elements or separated by zero-width characters are reassembled
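To illustrate the HTML structure idea, here is a deliberately simplified sketch that looks for elements whose inline style mentions letter-spacing or monospace and reads the code out of them. This is not Lumbox's implementation; it is a regex-based approximation that cannot handle an element nested inside another of the same tag name, where a real DOM parser would be the better tool.

```javascript
// Sketch: find elements whose inline style suggests "this is the code".
// Regex-based approximation for illustration only; use a DOM parser in real code.
const STYLED_BLOCK =
  /<(div|p|td|span|strong|b)\b[^>]*\bstyle="[^"]*(?:letter-spacing|monospace)[^"]*"[^>]*>([\s\S]*?)<\/\1>/gi;

function findStyledCodes(html) {
  const found = [];
  for (const match of html.matchAll(STYLED_BLOCK)) {
    // Drop inner tags and whitespace so digit-per-span markup is reassembled.
    const inner = match[2].replace(/<[^>]+>/g, "").replace(/\s+/g, "");
    if (/^[A-Za-z0-9-]{4,10}$/.test(inner) && /\d/.test(inner)) {
      found.push(inner);
    }
  }
  return found;
}

const styledHtml =
  "<p>Use the code below to verify your identity:</p>" +
  '<div style="font-size:36px;letter-spacing:12px;font-family:monospace">' +
  "<span>8</span><span>4</span><span>7</span>" +
  "<span>2</span><span>9</span><span>1</span></div>" +
  '<p style="color:#666;font-size:12px">This code expires in 10 minutes.</p>';

console.log(findStyledCodes(styledHtml)); // prints [ '847291' ]
```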

Stage 5: Categorization → Structured JSON

The email is categorized (OTP, verification link, notification, newsletter, etc.) and all extracted data is assembled into a clean JSON response.
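A simplified sketch of link classification and categorization, to show the shape of this stage. The hint keywords, the category names ("verification_link", "notification"), and the rule order are assumptions made for this example, not Lumbox's actual rules.

```javascript
// Sketch: classify links, then pick a category for the message.
// Keywords, category names, and rule order are illustrative assumptions.
const VERIFY_HINT = /verif|confirm|activate|magic|token/i;

function extractLinks(html) {
  const links = [];
  for (const match of html.matchAll(/href="(https?:\/\/[^"]+)"/gi)) {
    let url;
    try {
      // Attribute values are HTML-escaped, so undo &amp; before parsing.
      url = new URL(match[1].replace(/&amp;/g, "&"));
    } catch {
      continue; // skip anything that is not a parseable URL
    }
    const type = VERIFY_HINT.test(url.pathname + url.search) ? "verification" : "other";
    links.push({ url: url.href, type });
  }
  return links;
}

function categorize(extracted) {
  if (extracted.otp) return "otp";
  if (extracted.links.some((link) => link.type === "verification")) return "verification_link";
  return "notification";
}

const linkHtml =
  '<a href="https://example.com/verify?token=abc123&amp;u=1">Verify</a> ' +
  '<a href="https://example.com/unsubscribe">Unsubscribe</a>';

const extracted = { otp: null, links: extractLinks(linkHtml) };
console.log(JSON.stringify({ category: categorize(extracted), extracted }, null, 2));
```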

Before and After: Raw MIME vs. Agent-Ready JSON

Here's what arrives at the wire level (abbreviated):

MIME-Version: 1.0
From: security@bigcorp.com
To: agent-7x9k@inboxes.lumbox.dev
Subject: =?UTF-8?B?WW91ciB2ZXJpZmljYXRpb24gY29kZQ==?=
Content-Type: multipart/alternative;
 boundary="0000000000001234567890"
Date: Thu, 09 Apr 2026 14:23:01 -0700
Message-ID: <abc123@mail.bigcorp.com>

--0000000000001234567890
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: base64

WW91ciB2ZXJpZmljYXRpb24gY29kZSBpcyA4NDcyOTEuIEl0IGV4cGly
ZXMgaW4gMTAgbWludXRlcy4=

--0000000000001234567890
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html><html><head><meta charset=3D"UTF-8"></head=
><body style=3D"margin:0;padding:0"><table width=3D"100%" ce=
llpadding=3D"0"><tr><td align=3D"center"><table width=3D"600=
"><tr><td><img src=3D"https://bigcorp.com/logo.png" /></td=
></tr><tr><td style=3D"padding:40px"><h1>Verification Code=
</h1><p>Use the code below to verify your identity:</p><div=
 style=3D"font-size:36px;letter-spacing:12px;font-family:monos=
pace;text-align:center;background:#f5f5f5;padding:20px;border-=
radius:8px"><span>8</span><span>4</span><span>7</span>=
<span>2</span><span>9</span><span>1</span></div><p st=
yle=3D"color:#666;font-size:12px">This code expires in 10 minu=
tes. If you didn't request this, please ignore this email.</p=
></td></tr></table></td></tr></table></body></html>
--0000000000001234567890--

Note the Subject header is RFC 2047 encoded (=?UTF-8?B?...?= means base64-encoded UTF-8). The plaintext body is base64. The HTML body is quoted-printable with soft line breaks splitting HTML attributes, element tags, and even the OTP digits across lines.
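For the Subject header, a minimal sketch of RFC 2047 encoded-word decoding, assuming Node.js (global Buffer and TextDecoder). decodeEncodedWords is a hypothetical helper name; it handles both the B (base64) and Q (quoted-printable-like) forms and leaves a word untouched if its charset is not recognized.

```javascript
// Sketch: decode RFC 2047 encoded-words (=?charset?B|Q?text?=) in a header value.
// decodeEncodedWords is a hypothetical helper name.
function decodeEncodedWords(header) {
  // Whitespace between two adjacent encoded-words is not part of the text.
  const collapsed = header.replace(/(\?=)\s+(=\?)/g, "$1$2");

  return collapsed.replace(
    /=\?([^?]+)\?([BbQq])\?([^?]*)\?=/g,
    (whole, charset, encoding, text) => {
      let bytes;
      if (encoding.toUpperCase() === "B") {
        bytes = Buffer.from(text, "base64");
      } else {
        // Q encoding: "_" means space, "=XX" is a hex-encoded byte.
        const raw = text.replace(/_/g, " ");
        const out = [];
        for (let i = 0; i < raw.length; i++) {
          const hex = raw.slice(i + 1, i + 3);
          if (raw[i] === "=" && /^[0-9A-Fa-f]{2}$/.test(hex)) {
            out.push(parseInt(hex, 16));
            i += 2;
          } else {
            out.push(raw.charCodeAt(i) & 0xff);
          }
        }
        bytes = Uint8Array.from(out);
      }
      try {
        return new TextDecoder(charset).decode(bytes);
      } catch {
        return whole; // unknown charset label: keep the original text
      }
    }
  );
}

const rawSubject = "=?UTF-8?B?WW91ciB2ZXJpZmljYXRpb24gY29kZQ==?=";
console.log(decodeEncodedWords(rawSubject)); // prints "Your verification code"
```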

Here's what Lumbox delivers to your agent:

{
  "id": "msg_a1b2c3d4",
  "inbox_id": "inb_7x9k2m",
  "from": {
    "name": "BigCorp Security",
    "address": "security@bigcorp.com"
  },
  "to": ["agent-7x9k@inboxes.lumbox.dev"],
  "subject": "Your verification code",
  "text": "Use the code below to verify your identity: 847291. This code expires in 10 minutes.",
  "html": "<h1>Verification Code</h1><p>Use the code below...</p>...",
  "received_at": "2026-04-09T21:23:02Z",
  "category": "otp",
  "extracted": {
    "otp": {
      "code": "847291",
      "type": "numeric",
      "confidence": 0.98,
      "source": "html-structure",
      "expires_hint": "10 minutes"
    },
    "links": [
      {
        "url": "https://bigcorp.com/verify?token=abc",
        "type": "verification",
        "confidence": 0.85
      }
    ]
  },
  "headers": {
    "message_id": "<abc123@mail.bigcorp.com>",
    "date": "2026-04-09T21:23:01Z"
  }
}

The agent gets clean, structured JSON with the OTP already extracted, categorized, and confidence-scored. The source: "html-structure" field indicates the OTP was found by analyzing the HTML DOM structure (the <span>-per-digit pattern with monospace styling), not just regex. The expiration hint is parsed from the surrounding text.

Using the Parsed Output in Your Agent

With structured JSON, agent code becomes trivial:

import { Lumbox } from "lumbox";

const client = new Lumbox({ apiKey: "ak_your_key" });
const inbox = await client.inboxes.create();

// ... agent triggers a signup flow using inbox.email ...

// Wait for the OTP email and get structured data
const otp = await client.messages.waitForOTP({
  inboxId: inbox.id,
  timeout: 60000,
});

if (otp.code) {
  // Agent enters the code
  await page.fill("#verification-input", otp.code);
  await page.click("#verify-button");
}

Compare this to the DIY approach:

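// DIY: The painful way
import { simpleParser } from "mailparser";

// fetchNewEmails() and inboxId are placeholders for your own mail-fetching code
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll for new emails (no long-poll, so you're looping)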
let email = null;
const start = Date.now();
while (!email && Date.now() - start < 60000) {
  const emails = await fetchNewEmails(inboxId);
  email = emails[0];
  if (!email) await sleep(2000); // poll every 2s
}

if (!email) throw new Error("Timeout waiting for email");

// Parse the MIME
const parsed = await simpleParser(email.raw);

// Extract OTP — good luck
const text = parsed.text || "";
const html = parsed.html || "";

// Try plaintext first
let code = text.match(/\b(\d{6})\b/)?.[1];

// If not found, try HTML after stripping tags
if (!code) {
  const stripped = html
    .replace(/&#8203;/g, "")     // zero-width spaces
    .replace(/&zwnj;/g, "")      // zero-width non-joiners
    .replace(/\u200B/g, "")         // unicode zero-width
    .replace(/<[^>]+>/g, " ")     // strip tags
    .replace(/&nbsp;/g, " ")      // nbsp entities
    .replace(/\s+/g, "");           // collapse whitespace
  code = stripped.match(/(\d{6})/)?.[1];
}

// Still not found? Try digit-per-element pattern
if (!code) {
  // Capture the digit between the tags; matching /\d/ on the whole tag
  // would pick up digits inside attributes (e.g. class="d1")
  const digits = [...html.matchAll(/<span[^>]*>\s*(\d)\s*<\/span>/g)].map(
    (m) => m[1]
  );
  if (digits.length >= 6) {
    code = digits.slice(0, 6).join("");
  }
}

// Hope for the best
console.log("Maybe the code is:", code);

And even with mailparser handling the MIME decoding, this DIY version still doesn't handle: decoy numbers such as order IDs (the plaintext regex returns the first 6-digit match), alphanumeric codes, codes in tables, codes rendered as images, or the dozen other edge cases that appear in production.

Edge Cases That Will Ruin Your Week

A quick tour of email parsing horrors that I've encountered in production:

Nested multipart depth: I've seen emails with 7 levels of multipart nesting (usually forwarded emails containing forwarded emails). Seven levels won't hurt you, but a malformed or hostile message can nest far deeper, so your recursive parser needs a depth limit or it can overflow the stack.
Malformed boundaries: Some mail servers add extra whitespace or newlines around boundaries. Strict parsing fails; you need fuzzy boundary matching.
Mixed charsets: One MIME part is UTF-8, another is ISO-8859-1, the Subject is GB2312. Each must be decoded separately before combining.
Missing Content-Type: Some parts omit the Content-Type header entirely. Per RFC 2045, the default is text/plain; charset=us-ascii, but in practice you should sniff the content.
Outlook winmail.dat: Microsoft's proprietary TNEF format, sent as application/ms-tnef. The actual email body might be trapped inside this binary blob. You need a TNEF parser to extract it.
Calendar invites as body: Emails where the "body" is a text/calendar part and there's no text/html or text/plain at all. Your agent sees an iCal blob instead of readable text.
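For the missing Content-Type case, a small sketch of a header parser that falls back to the RFC 2045 default. parseContentType is a hypothetical helper; it splits naively on semicolons, so a quoted parameter value containing a semicolon would trip it up, and it does no content sniffing.

```javascript
// Sketch: parse a Content-Type header value, defaulting to text/plain; charset=us-ascii.
// parseContentType is a hypothetical helper; the naive ";" split is a known limitation.
function parseContentType(headerValue) {
  const fallback = { type: "text/plain", params: { charset: "us-ascii" } };
  if (!headerValue || !headerValue.trim()) return fallback;

  const [rawType, ...rawParams] = headerValue.split(";");
  const type = rawType.trim().toLowerCase();
  // Anything that is not "type/subtype" is treated as missing.
  if (!/^[a-z0-9!#$&^_.+-]+\/[a-z0-9!#$&^_.+-]+$/.test(type)) return fallback;

  const params = {};
  for (const raw of rawParams) {
    const eq = raw.indexOf("=");
    if (eq === -1) continue;
    const name = raw.slice(0, eq).trim().toLowerCase();
    // Strip surrounding quotes; folded header whitespace is removed by trim().
    const value = raw.slice(eq + 1).trim().replace(/^"(.*)"$/, "$1");
    if (name) params[name] = value;
  }
  return { type, params };
}

console.log(parseContentType(undefined));
console.log(
  parseContentType('multipart/alternative;\r\n boundary="----=_Part_123_456789.1712000000000"')
);
```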

Key Takeaways

Email parsing for AI agents is a deep technical problem that's easy to underestimate. The MIME format is a product of 1990s constraints that we're stuck with forever. OTP extraction is a game of cat-and-mouse with services that actively try to prevent automated extraction.

If you're building agents that need reliable email parsing, you have two real options: invest significant engineering time in building and maintaining a robust parser, or use a service like Lumbox that has already done that work and exposes clean structured JSON through a simple API.

The emails keep getting weirder. The edge cases never stop. Choose wisely where you spend your debugging time.
