Conversion Pipeline

The converter follows an 11-step sequential pipeline to transform a ChatGPT export into a browsable Markdown archive.

Pipeline Overview

graph LR
    A[Load] --> B[Index]
    B --> C[Parse]
    C --> D[Enrich]
    D --> E[Resolve]
    E --> F[Linearize]
    F --> G[Filter]
    G --> H[Render]
    H --> I[Organize]
    I --> J[Generate Indices]
    J --> K[Validate]

Each step is implemented as a pure function in the pipeline/ module, operating on immutable data structures.

Step 1: Load

Module: pipeline/loader.py

Purpose: Parse core metadata files using Pydantic models

Inputs:

  • export_manifest.json
  • user.json
  • user_settings.json
  • message_feedback.json

Outputs:

  • ExportManifest object
  • User object (with optional PII redaction)
  • UserSettings object
  • list[MessageFeedback]

Key functions:

  • load_manifest(input_dir: Path) -> ExportManifest
  • load_user(input_dir: Path, redact_pii: bool) -> User
  • load_user_settings(input_dir: Path) -> UserSettings | None
  • load_feedback(input_dir: Path) -> list[MessageFeedback]

PII redaction:

When redact_pii=True, replaces:

  • email → "[REDACTED]"
  • phone_number → None
  • birth_year → None

Step 2: Index

Module: pipeline/indexer.py

Purpose: Build file ID → path lookup table and SHA-256 deduplication index

Inputs:

  • ExportManifest object
  • Export directory path

Outputs:

  • dict[str, str]: File ID → relative path
  • dict[str, list[str]]: SHA-256 hash → list of file IDs

Key functions:

  • build_file_index(manifest: ExportManifest) -> dict[str, str]
  • hash_file(file_path: Path) -> str
  • build_dedup_index(file_index: dict, input_dir: Path) -> dict[str, list[str]]

File ID extraction:

  • Modern: file-1tisCpYYvMfEMvXcf5uNTb-name.ext → file-1tisCpYYvMfEMvXcf5uNTb
  • Legacy: file_000000006d18720cb0249e36a7f3d2d5-name.ext → file_000000006d18720cb0249e36a7f3d2d5

Step 3: Parse

Module: pipeline/parser.py

Purpose: Parse all conversations-*.json files into validated Pydantic models

Inputs:

  • Export directory path
  • conversations-0.json, conversations-1.json, ...

Outputs:

  • list[Conversation]

Key functions:

  • parse_conversations(input_dir: Path) -> list[Conversation]

Behavior:

  • Globs conversations-*.json files
  • Sorts by partition number (e.g., 0, 1, 2)
  • Validates each conversation with Pydantic
  • Skips malformed entries (logs warning)
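
Numeric ordering matters once there are ten or more partitions, since lexicographic sorting would place conversations-10.json before conversations-2.json. A sketch of the sort key:

```python
from pathlib import Path

def partition_number(path: Path) -> int:
    """Sort key: the N in 'conversations-N.json'."""
    return int(path.stem.split("-")[-1])

files = [Path("conversations-10.json"), Path("conversations-2.json"), Path("conversations-0.json")]
ordered = sorted(files, key=partition_number)
# conversations-0.json, conversations-2.json, conversations-10.json
```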

Step 4: Enrich

Module: pipeline/enricher.py

Purpose: Build feedback lookup index

Inputs:

  • list[MessageFeedback]

Outputs:

  • dict[str, list[MessageFeedback]]: Conversation ID → feedback records

Key functions:

  • build_feedback_index(feedback: list[MessageFeedback]) -> dict[str, list[MessageFeedback]]

Usage:

This index is used when generating metadata files to show which conversations received thumbs-up/thumbs-down ratings.
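
The grouping can be sketched with a defaultdict; the dataclass here is a hypothetical minimal MessageFeedback carrying only the fields needed for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical minimal stand-in for the MessageFeedback model.
@dataclass
class MessageFeedback:
    conversation_id: str
    rating: str

def build_feedback_index(feedback: list[MessageFeedback]) -> dict[str, list[MessageFeedback]]:
    """Group feedback records by conversation ID."""
    index: dict[str, list[MessageFeedback]] = defaultdict(list)
    for record in feedback:
        index[record.conversation_id].append(record)
    return dict(index)
```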

Step 5: Resolve

Module: pipeline/resolver.py

Purpose: Resolve file-service:// and sediment:// URIs to local file paths

Inputs:

  • File index from Step 2
  • Asset pointer URI (e.g., file-service://file-8Vk2ls8JSO2iOVBq87yJ880Q)

Outputs:

  • ResolvedAsset dataclass with source URI, resolved path, and status

Key functions:

  • strip_pointer_prefix(uri: str) -> str
  • resolve_asset_pointer(file_id: str, file_index: dict) -> Path | None
  • resolve_conversation_assets(conversation: Conversation, file_index: dict) -> list[ResolvedAsset]

Resolution process:

  1. Strip URI scheme prefix (file-service:// or sediment://)
  2. Look up file ID in index
  3. If not found, return None (logged as missing asset)

Example:

resolve_asset_pointer(
    "file-8Vk2ls8JSO2iOVBq87yJ880Q",
    file_index
)
# Returns: Path("file-8Vk2ls8JSO2iOVBq87yJ880Q-D2A52410.jpeg")
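
The first step of the process can be sketched as follows, assuming only the two schemes listed above are recognized:

```python
def strip_pointer_prefix(uri: str) -> str:
    """Drop a known URI scheme prefix, leaving the bare file ID."""
    for scheme in ("file-service://", "sediment://"):
        if uri.startswith(scheme):
            return uri[len(scheme):]
    return uri  # Already a bare ID; pass through unchanged.
```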

Step 6: Linearize

Module: pipeline/linearizer.py

Purpose: Walk message DAG from current_node to root, producing chronological message list

Inputs:

  • Conversation object with mapping and current_node

Outputs:

  • list[Message] in chronological order

Key functions:

  • linearize(conversation: Conversation) -> list[Message]

Algorithm:

def linearize(conversation: dict) -> list[dict]:
    """Walk from current_node to root, return messages in chronological order."""
    path: list[dict] = []
    node_id = conversation["current_node"]
    mapping = conversation["mapping"]
    visited = set()  # Cycle detection

    while node_id is not None:
        if node_id in visited:
            break  # Cycle detected
        visited.add(node_id)

        node = mapping.get(node_id)
        if node and node.get("message"):
            path.append(node["message"])
        node_id = node.get("parent") if node else None

    path.reverse()
    return path

Edge cases:

  • Cycle detection: Tracks visited nodes; breaks on repeated ID
  • Missing current_node: Falls back to DFS from root
  • Root node: Always has parent: null and message: null

Step 7: Filter

Module: pipeline/filterer.py

Purpose: Apply visibility rules to remove hidden/system messages

Inputs:

  • list[Message] from linearization
  • include_thinking flag

Outputs:

  • Filtered list[Message]

Key functions:

  • filter_messages(messages: list[Message], include_thinking: bool) -> list[Message]

Filtering rules:

  • message.weight == 0.0 → Skip (hidden/system)
  • author.role == "system" and empty content → Skip
  • metadata.is_visually_hidden_from_conversation == true → Skip
  • content_type == "user_editable_context" → Skip (custom instructions)
  • content_type == "tether_browsing_display" → Skip (loading placeholder)
  • channel == "commentary" and include_thinking=False → Skip (thinking blocks excluded by default)
  • channel == "commentary" and include_thinking=True → Include with special rendering

Empty content detection:

For TextContent and MultimodalTextContent, checks if all parts are empty strings or whitespace.
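
A minimal sketch of that check (the helper name is illustrative); note that a non-string part, such as an image, fails the isinstance test and so counts as content:

```python
def all_parts_empty(parts: list[object]) -> bool:
    """True when every part is an empty or whitespace-only string."""
    return all(isinstance(part, str) and not part.strip() for part in parts)
```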

Step 8: Render

Module: pipeline/renderer.py

Purpose: Convert filtered messages to Markdown using Jinja2 templates

Inputs:

  • Filtered list[Message]
  • Conversation metadata
  • File index and resolved assets

Outputs:

  • list[RenderedMessage] with Markdown strings

Key functions:

  • render_conversation(conversation: Conversation, messages: list[Message], config: ConverterConfig) -> str

Template: templates/conversation.md.j2

Rendering by content type:

  • text → Join parts with newlines
  • multimodal_text → Text + ![alt](media/file.ext) for images
  • code → Fenced code block with language
  • thoughts → Joined content from all thought objects
  • sonic_webpage → Blockquote with title, URL, and snippet
  • tether_quote → Blockquote with file title and extracted text
  • execution_output → Inline code or fenced block
  • system_error → Warning block with error name and text
  • Missing asset → <!-- MISSING ASSET: file-service://file-ID -->

Front matter:

---
id: 17cd7535-aa77-4553-8c04-ee082d0d702f
title: Image Creation Prompt
created: 2024-08-24T07:33:32Z
updated: 2024-08-24T07:33:40Z
model: gpt-4o
---
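
The front matter block can be sketched as a simple serializer; the real pipeline renders it through the Jinja2 template, and this naive version does not quote or escape YAML-special characters:

```python
def front_matter(meta: dict[str, str]) -> str:
    """Serialize metadata as a YAML front matter block (naive, illustrative)."""
    body = [f"{key}: {value}" for key, value in meta.items()]
    return "\n".join(["---", *body, "---"])
```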

Step 9: Organize

Module: pipeline/organizer.py

Purpose: Create conversation directories and copy media/attachment files

Inputs:

  • Conversation object
  • Resolved assets
  • Output directory

Outputs:

  • Created directories and copied files

Key functions:

  • organize_conversation(conversation: Conversation, output_dir: Path) -> Path
  • organize_attachments(conversation: Conversation, assets: list[ResolvedAsset], output_dir: Path)
  • organize_dalle(dalle_assets: list[ResolvedAsset], output_dir: Path) -> list[DalleImage]

Directory structure:

archive/
└── conversations/
    └── 2024-08-24_image-creation-prompt_17cd7535/
        ├── index.md
        ├── media/
        │   └── 001-a1b2c3d4.jpeg
        └── attachments/
            └── document.pdf

Naming conventions:

  • Conversation dir: <YYYY-MM-DD>_<slug>_<short-id>/
  • Media files: <NNN>-<hash8>.<ext> (001-a1b2c3d4.jpeg)
  • Attachments: Sanitized original filename
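
The conversation-directory convention can be sketched as follows; the exact slug rules (lowercase, runs of non-alphanumerics collapsed to hyphens) are an assumption:

```python
import re

def conversation_dirname(date: str, title: str, conv_id: str) -> str:
    """Build '<YYYY-MM-DD>_<slug>_<short-id>' per the convention above."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{date}_{slug}_{conv_id[:8]}"

conversation_dirname("2024-08-24", "Image Creation Prompt", "17cd7535-aa77-4553-8c04-ee082d0d702f")
# "2024-08-24_image-creation-prompt_17cd7535"
```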

Step 10: Generate Indices

Module: pipeline/index_generator.py

Purpose: Create index.md files for navigation

Inputs:

  • All conversations
  • Metadata objects
  • Output directory

Outputs:

  • archive/index.md (root index)
  • archive/conversations/index.md (conversation table)
  • archive/dalle/index.md (DALL-E gallery)
  • archive/metadata/*.md (metadata pages)

Key functions:

  • generate_root_index(output_dir: Path, stats: dict)
  • generate_conversations_index(conversations: list[Conversation], output_dir: Path)
  • generate_dalle_index(dalle_assets: list[ResolvedAsset], output_dir: Path)
  • generate_metadata_files(user: User, settings: UserSettings, feedback: list[MessageFeedback], output_dir: Path)

Templates:

  • templates/root_index.md.j2
  • templates/conversation_index.md.j2
  • templates/dalle_index.md.j2
  • templates/metadata/*.md.j2

Step 11: Validate

Module: pipeline/validator.py

Purpose: Check output integrity and log issues

Inputs:

  • Output directory

Outputs:

  • ValidationReport dataclass

Key functions:

  • validate_output(output_dir: Path) -> ValidationReport

Validation checks:

  1. Missing assets: Scan for <!-- MISSING ASSET: ... --> comments
  2. Broken links: Check relative links point to existing files
  3. Empty conversations: Flag conversations with no visible messages

Report fields:

@dataclass
class ValidationReport:
    missing_assets: list[str]
    broken_links: list[tuple[str, str]]  # (source_file, target)
    empty_conversations: list[str]
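
The first check can be sketched as a regex scan over each rendered Markdown file:

```python
import re

# Matches the placeholder comment emitted by the renderer for unresolved assets.
MISSING_ASSET_RE = re.compile(r"<!-- MISSING ASSET: (.+?) -->")

def scan_missing_assets(markdown: str) -> list[str]:
    """Collect the URIs recorded in missing-asset placeholder comments."""
    return MISSING_ASSET_RE.findall(markdown)
```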

Pipeline Orchestration

The ChatGPTExportConverter class in converter.py orchestrates all steps:

class ChatGPTExportConverter:
    def run(self):
        # Step 1: Load
        manifest = load_manifest(self.input_dir)
        user = load_user(self.input_dir, self.config.redact_pii)

        # Step 2: Index
        file_index = build_file_index(manifest)

        # Step 3: Parse
        conversations = parse_conversations(self.input_dir)

        # Step 4-11: Process each conversation
        for conv in conversations:
            assets = resolve_conversation_assets(conv, file_index)
            messages = linearize(conv)
            filtered = filter_messages(messages, self.config.include_thinking)
            markdown = render_conversation(conv, filtered, self.config)
            organize_conversation(conv, self.output_dir)
            # ...

        # Generate indices and validate
        generate_root_index(self.output_dir, stats)
        report = validate_output(self.output_dir)

Error Handling

  • Missing files: Logged as warnings; conversion continues
  • Malformed JSON: Skipped with error log
  • Validation errors: Reported in ValidationReport; they do not fail the run
  • Asset resolution failures: Placeholder comment in output

Performance Considerations

  • Parallelization: Currently sequential; could parallelize per-conversation processing
  • Deduplication: SHA-256 hashing is I/O bound; cache results
  • Memory usage: Loads all conversations into memory; could stream for large exports
