# Conversion Pipeline
The converter follows an 11-step sequential pipeline to transform a ChatGPT export into a browsable Markdown archive.
## Pipeline Overview
```mermaid
graph LR
    A[Load] --> B[Index]
    B --> C[Parse]
    C --> D[Enrich]
    D --> E[Resolve]
    E --> F[Linearize]
    F --> G[Filter]
    G --> H[Render]
    H --> I[Organize]
    I --> J[Generate Indices]
    J --> K[Validate]
```
Each step is implemented as a pure function in the `pipeline/` module, operating on immutable data structures.
## Step 1: Load

Module: `pipeline/loader.py`

Purpose: Parse core metadata files using Pydantic models

Inputs:

- `export_manifest.json`
- `user.json`
- `user_settings.json`
- `message_feedback.json`

Outputs:

- `ExportManifest` object
- `User` object (with optional PII redaction)
- `UserSettings` object
- `list[MessageFeedback]`

Key functions:

- `load_manifest(input_dir: Path) -> ExportManifest`
- `load_user(input_dir: Path, redact_pii: bool) -> User`
- `load_user_settings(input_dir: Path) -> UserSettings | None`
- `load_feedback(input_dir: Path) -> list[MessageFeedback]`

PII redaction:

When `redact_pii=True`, the loader replaces:

- `email` → `"[REDACTED]"`
- `phone_number` → `None`
- `birth_year` → `None`
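The redaction rules above can be expressed as a small helper. This is a hypothetical dict-based sketch (field names from the list above), not the actual `load_user` implementation, which works on the Pydantic `User` model:

```python
def redact_user(record: dict, redact_pii: bool) -> dict:
    """Apply the PII redaction rules: email is masked, phone_number and
    birth_year are cleared. Returns a copy; the input is left untouched."""
    if not redact_pii:
        return record
    redacted = dict(record)
    redacted["email"] = "[REDACTED]"
    redacted["phone_number"] = None
    redacted["birth_year"] = None
    return redacted
```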
## Step 2: Index

Module: `pipeline/indexer.py`

Purpose: Build a file ID → path lookup table and a SHA-256 deduplication index

Inputs:

- `ExportManifest` object
- Export directory path

Outputs:

- `dict[str, str]`: File ID → relative path
- `dict[str, list[str]]`: SHA-256 hash → list of file IDs

Key functions:

- `build_file_index(manifest: ExportManifest) -> dict[str, str]`
- `hash_file(file_path: Path) -> str`
- `build_dedup_index(file_index: dict, input_dir: Path) -> dict[str, list[str]]`

File ID extraction:

- Modern: `file-1tisCpYYvMfEMvXcf5uNTb-name.ext` → `file-1tisCpYYvMfEMvXcf5uNTb`
- Legacy: `file_000000006d18720cb0249e36a7f3d2d5-name.ext` → `file_000000006d18720cb0249e36a7f3d2d5`
## Step 3: Parse

Module: `pipeline/parser.py`

Purpose: Parse all `conversations-*.json` files into validated Pydantic models

Inputs:

- Export directory path
- `conversations-0.json`, `conversations-1.json`, ...

Outputs:

- `list[Conversation]`

Key functions:

- `parse_conversations(input_dir: Path) -> list[Conversation]`

Behavior:

- Globs `conversations-*.json` files
- Sorts by partition number (e.g., 0, 1, 2)
- Validates each conversation with Pydantic
- Skips malformed entries (logs a warning)
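The glob-and-sort behavior can be sketched as follows; `partition_number` and `iter_conversation_files` are hypothetical helper names. Numeric sorting matters because lexicographic order would put `conversations-10.json` before `conversations-2.json`:

```python
import re
from pathlib import Path


def partition_number(path: Path) -> int:
    """Extract the numeric suffix from conversations-<N>.json for sorting."""
    match = re.search(r"conversations-(\d+)\.json$", path.name)
    return int(match.group(1)) if match else 0


def iter_conversation_files(input_dir: Path) -> list[Path]:
    """Glob partition files and sort them numerically, not lexicographically."""
    return sorted(input_dir.glob("conversations-*.json"), key=partition_number)
```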
## Step 4: Enrich

Module: `pipeline/enricher.py`

Purpose: Build a feedback lookup index

Inputs:

- `list[MessageFeedback]`

Outputs:

- `dict[str, list[MessageFeedback]]`: Conversation ID → feedback records

Key functions:

- `build_feedback_index(feedback: list[MessageFeedback]) -> dict[str, list[MessageFeedback]]`

Usage:

This index is used when generating metadata files to show which conversations received thumbs-up/thumbs-down ratings.
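Grouping feedback by conversation is a straightforward bucketing pass. This sketch uses plain dicts in place of the `MessageFeedback` model and assumes each record carries a `conversation_id` field:

```python
from collections import defaultdict


def build_feedback_index(feedback: list[dict]) -> dict[str, list[dict]]:
    """Group feedback records by conversation ID (dict-based sketch of the
    Pydantic-backed function described above)."""
    index: dict[str, list[dict]] = defaultdict(list)
    for record in feedback:
        index[record["conversation_id"]].append(record)
    return dict(index)
```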
## Step 5: Resolve

Module: `pipeline/resolver.py`

Purpose: Resolve `file-service://` and `sediment://` URIs to local file paths

Inputs:

- File index from Step 2
- Asset pointer URI (e.g., `file-service://file-8Vk2ls8JSO2iOVBq87yJ880Q`)

Outputs:

- `ResolvedAsset` dataclass with source URI, resolved path, and status

Key functions:

- `strip_pointer_prefix(uri: str) -> str`
- `resolve_asset_pointer(file_id: str, file_index: dict) -> Path | None`
- `resolve_conversation_assets(conversation: Conversation, file_index: dict) -> list[ResolvedAsset]`

Resolution process:

1. Strip the URI scheme prefix (`file-service://` or `sediment://`)
2. Look up the file ID in the index
3. If not found, return `None` (logged as a missing asset)

Example:

```python
resolve_asset_pointer(
    "file-8Vk2ls8JSO2iOVBq87yJ880Q",
    file_index,
)
# Returns: Path("file-8Vk2ls8JSO2iOVBq87yJ880Q-D2A52410.jpeg")
```
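A minimal sketch of `strip_pointer_prefix`, assuming the two schemes listed above are the only ones handled and that unknown URIs pass through unchanged:

```python
def strip_pointer_prefix(uri: str) -> str:
    """Strip the file-service:// or sediment:// scheme, leaving the file ID.
    URIs with any other (or no) scheme are returned as-is."""
    for prefix in ("file-service://", "sediment://"):
        if uri.startswith(prefix):
            return uri[len(prefix):]
    return uri
```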
## Step 6: Linearize

Module: `pipeline/linearizer.py`

Purpose: Walk the message DAG from `current_node` to the root, producing a chronological message list

Inputs:

- `Conversation` object with `mapping` and `current_node`

Outputs:

- `list[Message]` in chronological order

Key functions:

- `linearize(conversation: Conversation) -> list[Message]`

Algorithm:

```python
def linearize(conversation: dict) -> list[dict]:
    """Walk from current_node to root, return messages in chronological order."""
    path: list[dict] = []
    node_id = conversation["current_node"]
    mapping = conversation["mapping"]
    visited = set()  # Cycle detection
    while node_id is not None:
        if node_id in visited:
            break  # Cycle detected
        visited.add(node_id)
        node = mapping.get(node_id)
        if node and node.get("message"):
            path.append(node["message"])
        node_id = node.get("parent") if node else None
    path.reverse()
    return path
```
Edge cases:

- Cycle detection: Tracks visited nodes; breaks on a repeated ID
- Missing `current_node`: Falls back to DFS from the root
- Root node: Always has `parent: null` and `message: null`
## Step 7: Filter

Module: `pipeline/filterer.py`

Purpose: Apply visibility rules to remove hidden/system messages

Inputs:

- `list[Message]` from linearization
- `include_thinking` flag

Outputs:

- Filtered `list[Message]`

Key functions:

- `filter_messages(messages: list[Message], include_thinking: bool) -> list[Message]`

Filtering rules:

| Condition | Action |
|---|---|
| `message.weight == 0.0` | Skip (hidden/system) |
| `author.role == "system"` and empty content | Skip |
| `metadata.is_visually_hidden_from_conversation == true` | Skip |
| `content_type == "user_editable_context"` | Skip (custom instructions) |
| `content_type == "tether_browsing_display"` | Skip (loading placeholder) |
| `channel == "commentary"` and `include_thinking=False` | Skip (thinking blocks excluded by default) |
| `channel == "commentary"` and `include_thinking=True` | Include with special rendering |

Empty content detection:

For `TextContent` and `MultimodalTextContent`, checks whether all parts are empty strings or whitespace.
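The rules in the table can be collapsed into a single predicate. This sketch uses plain dicts in place of the Pydantic `Message` model, with field names taken from the table; the system-role/empty-content check is omitted for brevity:

```python
def is_visible(message: dict, include_thinking: bool = False) -> bool:
    """Return True if a message survives the visibility rules above."""
    if message.get("weight") == 0.0:
        return False  # hidden/system
    metadata = message.get("metadata", {})
    if metadata.get("is_visually_hidden_from_conversation"):
        return False
    content_type = message.get("content", {}).get("content_type")
    if content_type in ("user_editable_context", "tether_browsing_display"):
        return False  # custom instructions / loading placeholder
    if message.get("channel") == "commentary" and not include_thinking:
        return False  # thinking blocks excluded by default
    return True
```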
## Step 8: Render

Module: `pipeline/renderer.py`

Purpose: Convert filtered messages to Markdown using Jinja2 templates

Inputs:

- Filtered `list[Message]`
- `Conversation` metadata
- File index and resolved assets

Outputs:

- `list[RenderedMessage]` with Markdown strings

Key functions:

- `render_conversation(conversation: Conversation, messages: list[Message], config: ConverterConfig) -> str`

Template: `templates/conversation.md.j2`

Rendering by content type:

| Content Type | Markdown Format |
|---|---|
| `text` | Join parts with newlines |
| `multimodal_text` | Text + Markdown image embeds for images |
| `code` | Fenced code block with language |
| `thoughts` | Joined content from all thought objects |
| `sonic_webpage` | Blockquote with title, URL, and snippet |
| `tether_quote` | Blockquote with file title and extracted text |
| `execution_output` | Inline code or fenced block |
| `system_error` | Warning block with error name and text |
| Missing asset | `<!-- MISSING ASSET: file-service://file-ID -->` |

Front matter:

```yaml
---
id: 17cd7535-aa77-4553-8c04-ee082d0d702f
title: Image Creation Prompt
created: 2024-08-24T07:33:32Z
updated: 2024-08-24T07:33:40Z
model: gpt-4o
---
```
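The front matter above can be produced with plain string formatting. `front_matter` is a hypothetical helper for illustration; the real renderer emits this block via the Jinja2 template:

```python
def front_matter(meta: dict) -> str:
    """Build the YAML front matter block shown above, emitting only the
    fields that are present, in a fixed order."""
    lines = ["---"]
    for key in ("id", "title", "created", "updated", "model"):
        if key in meta:
            lines.append(f"{key}: {meta[key]}")
    lines.append("---")
    return "\n".join(lines)
```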
## Step 9: Organize

Module: `pipeline/organizer.py`

Purpose: Create conversation directories and copy media/attachment files

Inputs:

- `Conversation` object
- Resolved assets
- Output directory

Outputs:

- Created directories and copied files

Key functions:

- `organize_conversation(conversation: Conversation, output_dir: Path) -> Path`
- `organize_attachments(conversation: Conversation, assets: list[ResolvedAsset], output_dir: Path)`
- `organize_dalle(dalle_assets: list[ResolvedAsset], output_dir: Path) -> list[DalleImage]`

Directory structure:

```
archive/
└── conversations/
    └── 2024-08-24_image-creation-prompt_17cd7535/
        ├── index.md
        ├── media/
        │   └── 001-a1b2c3d4.jpeg
        └── attachments/
            └── document.pdf
```

Naming conventions:

- Conversation dir: `<YYYY-MM-DD>_<slug>_<short-id>/`
- Media files: `<NNN>-<hash8>.<ext>` (`001-a1b2c3d4.jpeg`)
- Attachments: Sanitized original filename
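The directory naming convention can be sketched as follows. The slug rule (lowercase, runs of non-alphanumerics collapsed to hyphens) and the 8-character short ID are assumptions inferred from the example tree above:

```python
import re


def conversation_dirname(created: str, title: str, conv_id: str) -> str:
    """Build the <YYYY-MM-DD>_<slug>_<short-id> directory name."""
    date = created[:10]  # ISO 8601 timestamp -> YYYY-MM-DD
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{date}_{slug}_{conv_id[:8]}"
```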
## Step 10: Generate Indices

Module: `pipeline/index_generator.py`

Purpose: Create `index.md` files for navigation

Inputs:

- All conversations
- Metadata objects
- Output directory

Outputs:

- `archive/index.md` (root index)
- `archive/conversations/index.md` (conversation table)
- `archive/dalle/index.md` (DALL-E gallery)
- `archive/metadata/*.md` (metadata pages)

Key functions:

- `generate_root_index(output_dir: Path, stats: dict)`
- `generate_conversations_index(conversations: list[Conversation], output_dir: Path)`
- `generate_dalle_index(dalle_assets: list[ResolvedAsset], output_dir: Path)`
- `generate_metadata_files(user: User, settings: UserSettings, feedback: list[MessageFeedback], output_dir: Path)`

Templates:

- `templates/root_index.md.j2`
- `templates/conversation_index.md.j2`
- `templates/dalle_index.md.j2`
- `templates/metadata/*.md.j2`
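The conversation table could be assembled along these lines. `conversations_index_rows` and its input fields (`date`, `title`, `dirname`) are hypothetical, since the real generator renders `templates/conversation_index.md.j2`:

```python
def conversations_index_rows(conversations: list[dict]) -> list[str]:
    """Build Markdown table rows linking each conversation's index.md."""
    rows = ["| Date | Title |", "|---|---|"]
    for conv in conversations:
        rows.append(
            f"| {conv['date']} | [{conv['title']}]({conv['dirname']}/index.md) |"
        )
    return rows
```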
## Step 11: Validate

Module: `pipeline/validator.py`

Purpose: Check output integrity and log issues

Inputs:

- Output directory

Outputs:

- `ValidationReport` dataclass

Key functions:

- `validate_output(output_dir: Path) -> ValidationReport`

Validation checks:

- Missing assets: Scan for `<!-- MISSING ASSET: ... -->` comments
- Broken links: Check that relative links point to existing files
- Empty conversations: Flag conversations with no visible messages

Report fields:

```python
@dataclass
class ValidationReport:
    missing_assets: list[str]
    broken_links: list[tuple[str, str]]  # (source_file, target)
    empty_conversations: list[str]
```
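The missing-asset check amounts to a regex scan over rendered Markdown. `find_missing_assets` is a hypothetical helper that operates on one document's text; the real validator walks every `.md` file under the output directory:

```python
import re

# Matches the placeholder comment the renderer emits for unresolved assets.
_MISSING_RE = re.compile(r"<!-- MISSING ASSET: (\S+) -->")


def find_missing_assets(markdown: str) -> list[str]:
    """Return the URIs of all missing-asset placeholders in a document."""
    return _MISSING_RE.findall(markdown)
```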
## Pipeline Orchestration

The `ChatGPTExportConverter` class in `converter.py` orchestrates all steps:

```python
class ChatGPTExportConverter:
    def run(self):
        # Step 1: Load
        manifest = load_manifest(self.input_dir)
        user = load_user(self.input_dir, self.config.redact_pii)

        # Step 2: Index
        file_index = build_file_index(manifest)

        # Step 3: Parse
        conversations = parse_conversations(self.input_dir)

        # Steps 4-11: Process each conversation
        for conv in conversations:
            assets = resolve_conversation_assets(conv, file_index)
            messages = linearize(conv)
            filtered = filter_messages(messages, self.config.include_thinking)
            markdown = render_conversation(conv, filtered, self.config)
            organize_conversation(conv, self.output_dir)
            # ...

        # Generate indices and validate
        generate_root_index(self.output_dir, stats)
        report = validate_output(self.output_dir)
```
## Error Handling

- Missing files: Logged as warnings; conversion continues
- Malformed JSON: Skipped with an error log
- Validation errors: Reported in the `ValidationReport`; they do not fail the run
- Asset resolution failures: Placeholder comment in the output
## Performance Considerations
- Parallelization: Currently sequential; could parallelize per-conversation processing
- Deduplication: SHA-256 hashing is I/O bound; cache results
- Memory usage: Loads all conversations into memory; could stream for large exports
## Next Steps
- Review data models used in the pipeline
- See CLI reference for running the converter
- Check output structure for final archive layout