Design decisions

This page explains the key architectural and library choices in the converter and the rationale behind each one.

Pydantic v2 for data models

ChatGPT export JSON is user data — it can have missing fields, unexpected types, extra keys, and entirely new content types introduced by OpenAI API changes. Pydantic v2 handles this robustly:

  • Strict validation with graceful fallback — a discriminated union routes unknown content_type values to FallbackContent rather than failing the whole conversion.
  • Coercion where sensible — string values are coerced to the declared Python types automatically (e.g. "true" → bool). Timestamps are stored as raw float | None Unix epoch values throughout the models.
  • Forward compatibility — model_config = ConfigDict(extra="allow") preserves unrecognised fields without breaking validation.

The alternative of plain dict access would lose all of this and spread raw string handling throughout the pipeline code.
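A minimal sketch of the fallback pattern described above. The model and field names here (TextContent, CodeContent, parse_content) are illustrative assumptions, not the converter's actual class names:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, ConfigDict, Field, TypeAdapter, ValidationError


class TextContent(BaseModel):
    model_config = ConfigDict(extra="allow")  # keep unrecognised fields
    content_type: Literal["text"]
    parts: list[str] = []


class CodeContent(BaseModel):
    model_config = ConfigDict(extra="allow")
    content_type: Literal["code"]
    text: str = ""


class FallbackContent(BaseModel):
    """Catch-all for content types the converter does not know about."""
    model_config = ConfigDict(extra="allow")
    content_type: str


# Known types are routed by the `content_type` discriminator.
KnownContent = Annotated[
    Union[TextContent, CodeContent],
    Field(discriminator="content_type"),
]
_known = TypeAdapter(KnownContent)


def parse_content(raw: dict) -> BaseModel:
    """Validate strictly against known types; fall back gracefully."""
    try:
        return _known.validate_python(raw)
    except ValidationError:
        return FallbackContent.model_validate(raw)
```

An unknown content_type never aborts the conversion: it simply produces a FallbackContent instance that still carries all the original data.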

Cyclopts for the CLI

The CLI needs a small surface: two positional paths and three boolean flags. Cyclopts provides:

  • Parameter declarations directly on the function signature via Annotated — no separate schema
  • Automatic --flag / --no-flag toggle pairs for booleans
  • Help text from type annotations and docstrings

argparse or click would also work. Cyclopts was chosen for its minimal boilerplate and idiomatic modern Python typing integration (using Annotated and union types).

pydantic-settings for configuration

ConverterConfig needs to merge values from CLI flags, environment variables, and .env files with a clear precedence order. pydantic-settings provides exactly this:

  • Automatic CONVERTER_ prefix for environment variables
  • .env file loading without extra tooling
  • Type coercion from string environment values to Python types
  • A single class as the source of truth for every configuration field

Writing this by hand would require a custom merging layer and would duplicate all type information.

Jinja2 for Markdown rendering

Each content type produces different Markdown structure. Jinja2 templates let non-Python contributors edit the rendered output without touching Python code:

  • Templates live in src/chatgpt_to_markdown/templates/
  • Custom filters in renderer.py handle Python-level transformations
  • Changing output format is a template edit, not a code change

The alternative of f-strings or string concatenation in Python would tangle rendering logic into the pipeline code and make output changes harder to review.
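A self-contained sketch of the template-plus-filter split. A DictLoader stands in for the on-disk templates directory, and the template text and filter name (iso_date) are assumptions:

```python
from datetime import datetime, timezone

from jinja2 import DictLoader, Environment

# DictLoader keeps this sketch self-contained; the real templates live
# as files under src/chatgpt_to_markdown/templates/.
env = Environment(loader=DictLoader({
    "message.md.j2": "### {{ role | title }} ({{ ts | iso_date }})\n\n{{ text }}\n",
}))

# Custom filter registered from Python, as renderer.py does for
# Python-level transformations (here: raw epoch float -> ISO date).
env.filters["iso_date"] = lambda ts: (
    datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
)


def render_message(role: str, ts: float, text: str) -> str:
    return env.get_template("message.md.j2").render(role=role, ts=ts, text=text)
```

Changing the heading style or date format is then a one-line template or filter edit, reviewable in isolation from the pipeline code.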

src/ layout

The project uses the src/ layout (src/chatgpt_to_markdown/) to:

  • Prevent accidental imports of the local source tree during tests before installation
  • Make the installed package and the development tree identical in import behaviour
  • Follow PEP 517 / PEP 660 best practices for build isolation
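The resulting tree looks roughly like this (only renderer.py and templates/ are named elsewhere in these docs; the other entries are typical placeholders):

```
project-root/
├── pyproject.toml
├── src/
│   └── chatgpt_to_markdown/
│       ├── __init__.py
│       ├── renderer.py
│       └── templates/
└── tests/
```

Because the package only exists under src/, `import chatgpt_to_markdown` can never accidentally resolve to the working directory — tests always exercise the installed package.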

Sequential pipeline (not parallel)

The 11 pipeline steps are sequential by design:

  • Each step consumes the output of the previous one (manifest → index → parsed conversations → resolved assets → linearised messages → etc.)
  • The design is easy to debug: a failure is isolated to a single numbered step with a clear input and output
  • Parallelising within a step (e.g., processing multiple conversations concurrently) is possible but adds complexity not yet justified by benchmarks
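The step-by-step structure can be sketched as a generic sequential runner. This is an illustrative pattern, not the converter's actual driver code:

```python
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_pipeline(steps: list[tuple[str, Step]], initial: Any) -> Any:
    """Run named steps in order; each consumes the previous step's output.

    A failure is reported with the step's number and name, which is what
    makes the sequential design easy to debug.
    """
    data = initial
    for number, (name, step) in enumerate(steps, start=1):
        try:
            data = step(data)
        except Exception as exc:
            raise RuntimeError(f"step {number} ({name}) failed") from exc
    return data
```

Each step has exactly one input and one output, so a broken run points at a single numbered step rather than an interleaving of concurrent tasks.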

SHA-256 deduplication

The same image can appear in multiple conversations. Without deduplication, a power-user export with hundreds of repeated images produces a redundant archive. SHA-256:

  • Is deterministic — the same file always produces the same hash
  • Is collision-resistant in practice — accidental collisions between distinct files are vanishingly unlikely
  • Is fast enough for the file sizes typical in ChatGPT exports

Content-based addressing also makes the archive stable across re-runs: re-running the converter on the same export produces the same output hashes.

DAG linearisation

Message histories in ChatGPT exports are directed acyclic graphs, not linear lists. When a user edits a previous message, a new branch is created while the old one remains in the export data. The converter walks from current_node back to the root to reconstruct only the active conversation path — which is what the user actually saw and intended.

Cycle detection is included as a safeguard against malformed export data.
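The walk can be sketched as follows, assuming a simplified node table of the shape found in export mappings (id → node with a parent pointer); the real export records are richer than this:

```python
def linearise(nodes: dict[str, dict], current_node: str) -> list[dict]:
    """Follow parent pointers from current_node back to the root,
    then reverse, yielding only the active conversation path.

    The `seen` set guards against cycles in malformed export data.
    """
    path: list[dict] = []
    seen: set[str] = set()
    node_id: str | None = current_node
    while node_id is not None:
        if node_id in seen:
            raise ValueError(f"cycle detected at node {node_id!r}")
        seen.add(node_id)
        node = nodes[node_id]
        path.append(node)
        node_id = node.get("parent")
    path.reverse()
    return path
```

Abandoned branches (e.g. the pre-edit version of a message) are simply never reached, because the walk starts at current_node and only ever follows parent links.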