# ChatGPT Export Format
Understanding the structure of a ChatGPT data export is essential for troubleshooting conversion issues or extending the converter.
## Overview
A ChatGPT data export is a ZIP archive produced by OpenAI's "Export data" feature under Settings → Data controls. It contains:
- All conversations (partitioned JSON files)
- User-uploaded files and DALL-E generations
- Canvas/Project artifacts
- User metadata and feedback records
The export is not a relational database. It's a denormalized dump where:
- Conversations are stored as partitioned JSON arrays
- File assets are scattered across multiple directories
- References between conversations and files use internal identifiers (`file-service://` URIs)
- Message graphs within conversations are DAGs (directed acyclic graphs), not flat lists
## Root-Level Files

| File | Format | Purpose |
|---|---|---|
| `conversations-NNN.json` | JSON array | Partitioned conversation data (zero-indexed) |
| `export_manifest.json` | JSON object | Inventory of all exported files with paths and byte sizes |
| `user.json` | JSON object | Account profile (email, phone, subscription status) |
| `user_settings.json` | JSON array | Feature flags, model preferences, onboarding state |
| `message_feedback.json` | JSON array | Thumbs-up/thumbs-down ratings linked to conversations |
| `chat.html` | HTML | Interactive conversation browser (often 100+ MB) |
| `file-<ID>-<name>.<ext>` | Various | Root-level exported artifacts (modern format) |
| `file_<HEX>-<name>.<ext>` | Various | Root-level exported artifacts (legacy format) |
## Directory Structure

| Directory Pattern | Contents |
|---|---|
| `<conversation-UUID>/image/` | PNG images from a specific conversation |
| `dalle-generations/` | DALL-E-generated WebP images |
| `user-<user-ID>/` | Canvas/Project workspace files |
| `user-<user-ID>/<hex-project-ID>/mnt/data/` | Individual project sandbox files (images, code) |
## File Naming Conventions
Two naming generations coexist in exports:
### Modern Format

Pattern: `file-<Base62ID>-<descriptive-name>.<ext>`

Example: `file-1tisCpYYvMfEMvXcf5uNTb-python_guidelines.md`

File ID extraction: take the first two dash-separated tokens → `file-1tisCpYYvMfEMvXcf5uNTb`
### Legacy Format

Pattern: `file_<HexID>-<descriptive-name>.<ext>`

Example: `file_000000006d18720cb0249e36a7f3d2d5-Untitled 1.md`

File ID extraction: take everything before the first hyphen → `file_000000006d18720cb0249e36a7f3d2d5`

> **Both formats in one export.** A single export can contain both modern and legacy file naming formats. The converter handles both automatically.
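The two extraction rules above can be sketched as a single helper. This is an illustrative function (`extract_file_id` is a hypothetical name, not part of the converter's documented API):

```python
import re
from typing import Optional

def extract_file_id(filename: str) -> Optional[str]:
    """Extract the file ID from an exported filename (hypothetical helper).

    Modern: file-<Base62ID>-<name>.<ext> -> keep the first two dash tokens.
    Legacy: file_<HexID>-<name>.<ext>    -> keep everything before the
                                            first hyphen.
    """
    modern = re.match(r"^(file-[A-Za-z0-9]+)-", filename)
    if modern:
        return modern.group(1)
    legacy = re.match(r"^(file_[0-9a-fA-F]+)-", filename)
    if legacy:
        return legacy.group(1)
    return None  # not an exported-file name
```

Both example filenames from this section resolve to the IDs shown above.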
## Key JSON Files

### Export Manifest (`export_manifest.json`)

Authoritative inventory of all files:
```json
{
  "export_files": [
    {
      "path": "conversations-0.json",
      "size_bytes": 1234567
    },
    {
      "path": "file-8Vk2ls8JSO2iOVBq87yJ880Q-example.jpeg",
      "size_bytes": 129054
    }
  ]
}
```
Used by the converter to build the file ID → path lookup index.
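Building that index might look roughly like this. A minimal sketch, assuming the manifest layout shown above; the function name and the inline ID extraction mirror the naming conventions described earlier:

```python
import json
from pathlib import Path

def build_file_index(export_root: Path) -> dict:
    """Map file ID -> full path, using export_manifest.json (sketch)."""
    manifest = json.loads(
        (export_root / "export_manifest.json").read_text(encoding="utf-8")
    )
    index = {}
    for entry in manifest["export_files"]:
        name = Path(entry["path"]).name
        if name.startswith("file-"):
            # Modern format: the ID is the first two dash-separated tokens.
            parts = name.split("-")
            if len(parts) >= 3:
                index["-".join(parts[:2])] = export_root / entry["path"]
        elif name.startswith("file_"):
            # Legacy format: the ID is everything before the first hyphen.
            index[name.split("-", 1)[0]] = export_root / entry["path"]
    return index
```

Non-asset entries such as `conversations-0.json` simply never match either prefix and are skipped.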
### User Profile (`user.json`)

Account information (contains PII):
```json
{
  "id": "user-abc123",
  "email": "user@example.com",
  "phone_number": "+1234567890",
  "birth_year": 1990,
  "chatgpt_plus_user": true
}
```
> **PII Fields.** By default, `email`, `phone_number`, and `birth_year` are redacted during conversion. Use `--no-redact-pii` to preserve them.
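The redaction behavior can be sketched as follows. This is a hypothetical illustration, not the converter's actual implementation; the `"[REDACTED]"` placeholder string is an assumption:

```python
# Fields named in user.json that are masked by default.
PII_FIELDS = ("email", "phone_number", "birth_year")

def redact_pii(profile: dict, redact: bool = True) -> dict:
    """Return a copy of the user profile with PII fields masked (sketch).

    Passing redact=False corresponds to running with --no-redact-pii.
    """
    if not redact:
        return dict(profile)
    return {
        key: ("[REDACTED]" if key in PII_FIELDS else value)
        for key, value in profile.items()
    }
```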
### User Settings (`user_settings.json`)

Feature flags and preferences:
```json
[
  {
    "user_id": "user-abc123",
    "settings": {
      "training_allowed": false,
      "developer_mode": false,
      "voice_name": "glimmer",
      "last_used_model_config": {
        "slugs": {
          "default": "gpt-4o",
          "web": "gpt-4o",
          "ios_app": "gpt-4o-mini"
        }
      }
    }
  }
]
```
### Message Feedback (`message_feedback.json`)

Thumbs-up/thumbs-down ratings:
```json
[
  {
    "id": "fb-uuid",
    "conversation_id": "conv-uuid",
    "rating": "thumbs_up",
    "create_time": "2026-01-19T20:41:07.303611Z"
  }
]
```
The `conversation_id` field links each feedback record to a specific conversation.
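A converter that wants to attach ratings to conversations can group the records by that field. A minimal sketch (the function name is illustrative):

```python
from collections import defaultdict

def feedback_by_conversation(records: list) -> dict:
    """Group message_feedback.json records by conversation_id (sketch)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["conversation_id"]].append(record)
    return dict(grouped)
```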
### Conversations (`conversations-NNN.json`)

Each file contains a JSON array of conversation objects. Large exports are split into multiple partitioned files (`conversations-0.json`, `conversations-1.json`, etc.).
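Iterating over all conversations therefore means reading every partition in numeric order. A sketch, assuming the partition naming above (a plain lexicographic sort would put `conversations-10.json` before `conversations-2.json`, hence the numeric key):

```python
import json
from pathlib import Path
from typing import Iterator

def iter_conversations(export_root: Path) -> Iterator[dict]:
    """Yield conversation objects from every conversations-NNN.json
    partition, in numeric partition order (illustrative sketch)."""
    parts = sorted(
        export_root.glob("conversations-*.json"),
        key=lambda p: int(p.stem.split("-")[-1]),  # numeric, not lexicographic
    )
    for part in parts:
        yield from json.loads(part.read_text(encoding="utf-8"))
```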
See Data Models for complete conversation structure.
## Asset Reference Resolution

### File Pointer URIs

Asset references use these URI schemes:
| Scheme | Format | Example |
|---|---|---|
| `file-service://` | `file-service://file-<ID>` | `file-service://file-8Vk2ls8JSO2iOVBq87yJ880Q` |
| `sediment://` | `sediment://file_<HEX>` | `sediment://file_000000006d18720cb0249e36a7f3d2d5` |
The converter strips the scheme prefix and looks up the file ID in the pre-built index.
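That two-step lookup can be sketched as follows (hypothetical helper; the real converter's function names may differ):

```python
from typing import Optional

# The two URI schemes used for asset pointers in exports.
SCHEMES = ("file-service://", "sediment://")

def resolve_pointer(pointer: str, index: dict) -> Optional[str]:
    """Strip the URI scheme, then look the bare file ID up in the
    pre-built file ID -> path index (sketch)."""
    for scheme in SCHEMES:
        if pointer.startswith(scheme):
            return index.get(pointer[len(scheme):])
    return None  # unknown scheme
```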
### Resolution Priority
When resolving a file ID, the converter searches locations in this order:
1. Root directory: `file-<ID>-*`
2. Conversation image directories: `<conv-UUID>/image/file*`
3. DALL-E directory: `dalle-generations/file-<ID>-*`
4. User workspace directories: `user-<user-ID>/**/file*`
5. Legacy root format: `file_<HEX>-*`
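The search order above can be approximated with glob patterns. A hypothetical sketch (the real converter may search differently; since the ID prefix already encodes modern `file-` vs. legacy `file_` naming, one root pattern covers both root-level cases):

```python
from pathlib import Path
from typing import Optional

def find_asset(export_root: Path, file_id: str) -> Optional[Path]:
    """Try each location in priority order; return the first match (sketch)."""
    patterns = [
        f"{file_id}-*",                    # root directory (modern & legacy)
        f"*/image/{file_id}*",             # conversation image directories
        f"dalle-generations/{file_id}-*",  # DALL-E generations
        f"user-*/**/{file_id}*",           # user workspace directories
    ]
    for pattern in patterns:
        match = next(export_root.glob(pattern), None)
        if match is not None:
            return match
    return None
```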
### Missing Assets
If a file ID cannot be resolved:
- A warning is logged with the conversation ID, message ID, and unresolved pointer
- The Markdown output includes a placeholder comment: `<!-- MISSING ASSET: file-service://file-<ID> -->`
- Conversion continues (it does not fail)
## Conversation Structure

Each conversation in `conversations-*.json` contains:

- **Metadata**: title, timestamps, model slug, flags
- **Mapping**: dictionary of nodes forming a DAG
- **Current node**: pointer to the active conversation leaf
### Message DAG
Messages are organized as a directed acyclic graph (DAG):
- **Root node**: always has `parent: null` and `message: null`
- **Branches**: created when users edit previous messages
- **Current node**: points to the leaf of the primary conversation path
The linearizer walks from `current_node` back to the root to reconstruct the conversation timeline.
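That walk can be sketched in a few lines. This assumes the node shape described above (each mapping entry has `parent` and `message` keys, with both `null` at the root); function and variable names are illustrative:

```python
def linearize(mapping: dict, current_node: str) -> list:
    """Follow parent links from current_node up to the root, then
    reverse, yielding the primary path in chronological order (sketch).
    Branches not on the current path are simply never visited."""
    path = []
    node_id = current_node
    while node_id is not None:
        node = mapping[node_id]
        if node.get("message") is not None:  # skip the message: null root
            path.append(node["message"])
        node_id = node.get("parent")
    path.reverse()
    return path
```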
## Common Export Sizes
| Account Activity | Export Size | Conversations | Files |
|---|---|---|---|
| Light user | 1-10 MB | 10-50 | 0-10 |
| Moderate user | 10-100 MB | 50-500 | 10-100 |
| Heavy user | 100 MB-1 GB | 500-5,000 | 100+ |
| Power user | 1-10 GB | 5,000+ | 1,000+ |
The `chat.html` file alone can exceed 100 MB for active accounts.
## Next Steps
- Explore the data models for detailed JSON schemas
- Understand the conversion pipeline
- Review output structure conventions