Conversion Pipeline

The converter follows an 11-step sequential pipeline to transform a ChatGPT export into a browsable Markdown archive.

Pipeline Overview

graph LR
    A[Load] --> B[Index]
    B --> C[Parse]
    C --> D[Enrich]
    D --> E[Resolve]
    E --> F[Linearize]
    F --> G[Filter]
    G --> H[Render]
    H --> I[Organize]
    I --> J[Generate Indices]
    J --> K[Validate]

Each step is implemented as a pure function in the pipeline/ module, operating on immutable data structures.

Step 1: Load

Module: pipeline/loader.py

Purpose: Parse core metadata files using Pydantic models

Inputs:

  • export_manifest.json
  • user.json
  • user_settings.json
  • message_feedback.json

Outputs:

  • ExportManifest object
  • User object (with optional PII redaction)
  • UserSettings object
  • list[MessageFeedback]

Key functions:

  • load_manifest(input_dir: Path) -> ExportManifest
  • load_user(input_dir: Path, redact_pii: bool) -> User
  • load_user_settings(input_dir: Path) -> UserSettings | None
  • load_feedback(input_dir: Path) -> list[MessageFeedback]

PII redaction:

When redact_pii=True, replaces:

  • email → "[REDACTED]"
  • phone_number → None
  • birth_year → None

Step 2: Index

Module: pipeline/indexer.py

Purpose: Build file ID → path lookup table and SHA-256 deduplication index

Inputs:

  • ExportManifest object
  • Export directory path

Outputs:

  • dict[str, str]: File ID → relative path
  • dict[str, list[str]]: SHA-256 hash → list of file IDs

Key functions:

  • build_file_index(manifest: ExportManifest) -> dict[str, str]
  • hash_file(file_path: Path) -> str
  • build_dedup_index(file_index: dict, input_dir: Path) -> dict[str, list[str]]

File ID extraction:

  • Modern: file-1tisCpYYvMfEMvXcf5uNTb-name.ext → file-1tisCpYYvMfEMvXcf5uNTb
  • Legacy: file_000000006d18720cb0249e36a7f3d2d5-name.ext → file_000000006d18720cb0249e36a7f3d2d5

Step 3: Parse

Module: pipeline/parser.py

Purpose: Parse all conversations-*.json files into validated Pydantic models

Inputs:

  • Export directory path
  • conversations-0.json, conversations-1.json, ...

Outputs:

  • list[Conversation]

Key functions:

  • parse_conversations(input_dir: Path) -> list[Conversation]

Behavior:

  • Globs conversations-*.json files
  • Sorts by partition number (e.g., 0, 1, 2)
  • Validates each conversation with Pydantic
  • Skips malformed entries (logs warning)
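
Numeric ordering matters once there are ten or more partitions, since lexicographic sorting would place conversations-10.json before conversations-2.json. A sketch of the sort key:

```python
from pathlib import Path

def partition_number(path: Path) -> int:
    """Sort key: the N in 'conversations-N.json'."""
    return int(path.stem.split("-")[-1])

files = [Path("conversations-10.json"), Path("conversations-2.json"), Path("conversations-0.json")]
ordered = sorted(files, key=partition_number)
# conversations-0.json, conversations-2.json, conversations-10.json
```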

Step 4: Enrich

Module: pipeline/enricher.py

Purpose: Build feedback lookup index

Inputs:

  • list[MessageFeedback]

Outputs:

  • dict[str, list[MessageFeedback]]: Conversation ID → feedback records

Key functions:

  • build_feedback_index(feedback: list[MessageFeedback]) -> dict[str, list[MessageFeedback]]

Usage:

This index is used when generating metadata files to show which conversations received thumbs-up/thumbs-down ratings.
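
The grouping can be sketched with a defaultdict; the dataclass here is a hypothetical minimal MessageFeedback carrying only the fields needed for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical minimal stand-in for the MessageFeedback model.
@dataclass
class MessageFeedback:
    conversation_id: str
    rating: str

def build_feedback_index(feedback: list[MessageFeedback]) -> dict[str, list[MessageFeedback]]:
    """Group feedback records by conversation ID."""
    index: dict[str, list[MessageFeedback]] = defaultdict(list)
    for record in feedback:
        index[record.conversation_id].append(record)
    return dict(index)
```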

Step 5: Resolve

Module: pipeline/resolver.py

Purpose: Resolve file-service:// and sediment:// URIs to local file paths

Inputs:

  • File index from Step 2
  • Asset pointer URI (e.g., file-service://file-8Vk2ls8JSO2iOVBq87yJ880Q)

Outputs:

  • ResolvedAsset dataclass with source URI, resolved path, and status

Key functions:

  • strip_pointer_prefix(uri: str) -> str
  • resolve_asset_pointer(file_id: str, file_index: dict) -> Path | None
  • resolve_conversation_assets(conversation: Conversation, file_index: dict) -> list[ResolvedAsset]

Resolution process:

  1. Strip URI scheme prefix (file-service:// or sediment://)
  2. Look up file ID in index
  3. If not found, return None (logged as missing asset)

Example:

resolve_asset_pointer(
    "file-8Vk2ls8JSO2iOVBq87yJ880Q",
    file_index
)
# Returns: Path("file-8Vk2ls8JSO2iOVBq87yJ880Q-D2A52410.jpeg")
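
The first step of the process can be sketched as follows, assuming only the two schemes listed above are recognized:

```python
def strip_pointer_prefix(uri: str) -> str:
    """Drop a known URI scheme prefix, leaving the bare file ID."""
    for scheme in ("file-service://", "sediment://"):
        if uri.startswith(scheme):
            return uri[len(scheme):]
    return uri  # Already a bare ID; pass through unchanged.
```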

Step 6: Linearize

Module: pipeline/linearizer.py

Purpose: Walk message DAG from current_node to root, producing chronological message list

Inputs:

  • Conversation object with mapping and current_node

Outputs:

  • list[Message] in chronological order

Key functions:

  • linearize(conversation: Conversation) -> list[Message]

Algorithm:

def linearize(conversation: dict) -> list[dict]:
    """Walk from current_node to root, return messages in chronological order."""
    path: list[dict] = []
    node_id = conversation["current_node"]
    mapping = conversation["mapping"]
    visited = set()  # Cycle detection

    while node_id is not None:
        if node_id in visited:
            break  # Cycle detected
        visited.add(node_id)

        node = mapping.get(node_id)
        if node and node.get("message"):
            path.append(node["message"])
        node_id = node.get("parent") if node else None

    path.reverse()
    return path

Edge cases:

  • Cycle detection: Tracks visited nodes; breaks on repeated ID
  • Missing current_node: Falls back to DFS from root
  • Root node: Always has parent: null and message: null

Step 7: Filter

Module: pipeline/filterer.py

Purpose: Apply visibility rules to remove hidden/system messages

Inputs:

  • list[Message] from linearization
  • include_thinking flag

Outputs:

  • Filtered list[Message]

Key functions:

  • filter_messages(messages: list[Message], include_thinking: bool) -> list[Message]

Filtering rules:

  • message.weight == 0.0 → Skip (hidden/system)
  • author.role == "system" and empty content → Skip
  • metadata.is_visually_hidden_from_conversation == true → Skip
  • content_type == "user_editable_context" → Skip (custom instructions)
  • content_type == "tether_browsing_display" → Skip (loading placeholder)
  • channel == "commentary" and include_thinking=False → Skip (thinking blocks excluded by default)
  • channel == "commentary" and include_thinking=True → Include with special rendering

Empty content detection:

For TextContent and MultimodalTextContent, checks if all parts are empty strings or whitespace.
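
A minimal sketch of that check (the helper name is illustrative); note that a non-string part, such as an image, fails the isinstance test and so counts as content:

```python
def all_parts_empty(parts: list[object]) -> bool:
    """True when every part is an empty or whitespace-only string."""
    return all(isinstance(part, str) and not part.strip() for part in parts)
```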

Step 8: Render

Module: pipeline/renderer.py

Purpose: Convert filtered messages to Markdown using Jinja2 templates

Inputs:

  • Filtered list[Message]
  • Conversation metadata
  • File index and resolved assets

Outputs:

  • list[RenderedMessage] with Markdown strings

Key functions:

  • render_conversation(conversation: Conversation, messages: list[Message], config: ConverterConfig) -> str

Template: templates/conversation.md.j2

Rendering by content type:

  • text → Join parts with newlines
  • multimodal_text → Text + ![alt](media/file.ext) for images
  • code → Fenced code block with language
  • thoughts → Joined content from all thought objects
  • sonic_webpage → Blockquote with title, URL, and snippet
  • tether_quote → Blockquote with file title and extracted text
  • execution_output → Inline code or fenced block
  • system_error → Warning block with error name and text
  • Missing asset → <!-- MISSING ASSET: file-service://file-ID -->

Front matter:

---
id: 17cd7535-aa77-4553-8c04-ee082d0d702f
title: Image Creation Prompt
created: 2024-08-24T07:33:32Z
updated: 2024-08-24T07:33:40Z
model: gpt-4o
---
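
The front matter block can be sketched as a simple serializer; the real pipeline renders it through the Jinja2 template, and this naive version does not quote or escape YAML-special characters:

```python
def front_matter(meta: dict[str, str]) -> str:
    """Serialize metadata as a YAML front matter block (naive, illustrative)."""
    body = [f"{key}: {value}" for key, value in meta.items()]
    return "\n".join(["---", *body, "---"])
```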

Step 9: Organize

Module: pipeline/organizer.py

Purpose: Create conversation directories and copy media/attachment files

Inputs:

  • Conversation object
  • Resolved assets
  • Output directory

Outputs:

  • Created directories and copied files

Key functions:

  • organize_conversation(conversation: Conversation, output_dir: Path) -> Path
  • organize_attachments(conversation: Conversation, assets: list[ResolvedAsset], output_dir: Path)
  • organize_dalle(dalle_assets: list[ResolvedAsset], output_dir: Path) -> list[DalleImage]

Directory structure:

archive/
└── conversations/
    └── 2024-08-24_image-creation-prompt_17cd7535/
        ├── index.md
        ├── media/
        │   └── 001-a1b2c3d4.jpeg
        └── attachments/
            └── document.pdf

Naming conventions:

  • Conversation dir: <YYYY-MM-DD>_<slug>_<short-id>/
  • Media files: <NNN>-<hash8>.<ext> (001-a1b2c3d4.jpeg)
  • Attachments: Sanitized original filename
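
The conversation-directory convention can be sketched as follows; the exact slug rules (lowercase, runs of non-alphanumerics collapsed to hyphens) are an assumption:

```python
import re

def conversation_dirname(date: str, title: str, conv_id: str) -> str:
    """Build '<YYYY-MM-DD>_<slug>_<short-id>' per the convention above."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{date}_{slug}_{conv_id[:8]}"

conversation_dirname("2024-08-24", "Image Creation Prompt", "17cd7535-aa77-4553-8c04-ee082d0d702f")
# "2024-08-24_image-creation-prompt_17cd7535"
```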

Step 10: Generate Indices

Module: pipeline/index_generator.py

Purpose: Create index.md files for navigation

Inputs:

  • All conversations
  • Metadata objects
  • Output directory

Outputs:

  • archive/index.md (root index)
  • archive/conversations/index.md (conversation table)
  • archive/dalle/index.md (DALL-E gallery)
  • archive/metadata/*.md (metadata pages)

Key functions:

  • generate_root_index(output_dir: Path, stats: dict)
  • generate_conversations_index(conversations: list[Conversation], output_dir: Path)
  • generate_dalle_index(dalle_assets: list[ResolvedAsset], output_dir: Path)
  • generate_metadata_files(user: User, settings: UserSettings, feedback: list[MessageFeedback], output_dir: Path)

Templates:

  • templates/root_index.md.j2
  • templates/conversation_index.md.j2
  • templates/dalle_index.md.j2
  • templates/metadata/*.md.j2

Step 11: Validate

Module: pipeline/validator.py

Purpose: Check output integrity and log issues

Inputs:

  • Output directory

Outputs:

  • ValidationReport dataclass

Key functions:

  • validate_output(output_dir: Path) -> ValidationReport

Validation checks:

  1. Missing assets: Scan for <!-- MISSING ASSET: ... --> comments
  2. Broken links: Check relative links point to existing files
  3. Empty conversations: Flag conversations with no visible messages

Report fields:

@dataclass
class ValidationReport:
    missing_assets: list[str]
    broken_links: list[tuple[str, str]]  # (source_file, target)
    empty_conversations: list[str]
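
The first check can be sketched as a regex scan over each rendered Markdown file:

```python
import re

# Matches the placeholder comment emitted by the renderer for unresolved assets.
MISSING_ASSET_RE = re.compile(r"<!-- MISSING ASSET: (.+?) -->")

def scan_missing_assets(markdown: str) -> list[str]:
    """Collect the URIs recorded in missing-asset placeholder comments."""
    return MISSING_ASSET_RE.findall(markdown)
```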

Pipeline Orchestration

The ChatGPTExportConverter class in converter.py orchestrates all steps:

class ChatGPTExportConverter:
    def run(self):
        # Step 1: Load
        manifest = load_manifest(self.input_dir)
        user = load_user(self.input_dir, self.config.redact_pii)

        # Step 2: Index
        file_index = build_file_index(manifest)

        # Step 3: Parse
        conversations = parse_conversations(self.input_dir)

        # Step 4-11: Process each conversation
        for conv in conversations:
            assets = resolve_conversation_assets(conv, file_index)
            messages = linearize(conv)
            filtered = filter_messages(messages, self.config.include_thinking)
            markdown = render_conversation(conv, filtered, self.config)
            organize_conversation(conv, self.output_dir)
            # ...

        # Generate indices and validate
        generate_root_index(self.output_dir, stats)
        report = validate_output(self.output_dir)

Error Handling

  • Missing files: Logged as warnings; conversion continues
  • Malformed JSON: Skipped with error log
  • Validation errors: Reported in ValidationReport; they do not fail the run
  • Asset resolution failures: Placeholder comment in output

Performance Considerations

  • Parallelization: Currently sequential; could parallelize per-conversation processing
  • Deduplication: SHA-256 hashing is I/O bound; cache results
  • Memory usage: Loads all conversations into memory; could stream for large exports
