# Design decisions
This page explains the key architectural and library choices in the converter and the rationale behind each one.
## Pydantic v2 for data models
ChatGPT export JSON is user data — it can have missing fields, unexpected types, extra keys, and entirely new content types introduced by OpenAI API changes. Pydantic v2 handles this robustly:
- Strict validation with graceful fallback — a discriminated union routes unknown `content_type` values to `FallbackContent` rather than failing the whole conversion.
- Coercion where sensible — string values are coerced to the declared Python types automatically (e.g. `"true"` → `bool`). Timestamps are stored as raw `float | None` Unix epoch values throughout the models.
- Forward compatibility — `model_config = ConfigDict(extra="allow")` preserves unrecognised fields without breaking validation.
The alternative of plain dict access would lose all of this and spread raw string handling throughout the pipeline code.
## Cyclopts for the CLI
The CLI needs a small surface: two positional paths and three boolean flags. Cyclopts provides:
- Parameter declarations directly on the function signature via `Annotated` — no separate schema
- Automatic `--flag` / `--no-flag` toggle pairs for booleans
- Help text from type annotations and docstrings
`argparse` or `click` would also work; Cyclopts was chosen for its minimal boilerplate and its idiomatic integration with modern Python typing (`Annotated` and union types).
## pydantic-settings for configuration
`ConverterConfig` needs to merge values from CLI flags, environment variables, and `.env` files with a clear precedence order. pydantic-settings provides exactly this:
- Automatic `CONVERTER_` prefix for environment variables
- `.env` file loading without extra tooling
- Type coercion from string environment values to Python types
- A single class as the source of truth for every configuration field
Writing this by hand would require a custom merging layer and would duplicate all type information.
## Jinja2 for Markdown rendering
Each content type produces different Markdown structure. Jinja2 templates let non-Python contributors edit the rendered output without touching Python code:
- Templates live in `src/chatgpt_to_markdown/templates/`
- Custom filters in `renderer.py` handle Python-level transformations
- Changing output format is a template edit, not a code change
The alternative of f-strings or string concatenation in Python would tangle rendering logic into the pipeline code and make output changes harder to review.
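As a sketch of the template-plus-filter split (the filter name and template are hypothetical, not taken from `renderer.py`), a Python-level transformation is registered once and the Markdown shape stays editable in the template string:

```python
from datetime import datetime, timezone

from jinja2 import Environment

env = Environment()


def format_timestamp(ts: float) -> str:
    # Python-level transformation: raw Unix epoch float -> readable UTC date.
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


env.filters["format_timestamp"] = format_timestamp

# Changing the output format is an edit to this template, not to Python code.
template = env.from_string("## {{ title }}\n\n*{{ create_time | format_timestamp }}*\n")
out = template.render(title="My conversation", create_time=1700000000.0)
```

A non-Python contributor can reorder the heading and date or change the emphasis markers by editing only the template text.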
## src/ layout
The project uses the `src/` layout (`src/chatgpt_to_markdown/`) to:
- Prevent accidental imports of the local source tree during tests before installation
- Make the installed package and the development tree identical in import behaviour
- Follow PEP 517 / PEP 660 best practices for build isolation
## Sequential pipeline (not parallel)
The 11 pipeline steps are sequential by design:
- Each step consumes the output of the previous one (manifest → index → parsed conversations → resolved assets → linearised messages → etc.)
- The design is easy to debug: a failure is isolated to a single numbered step with a clear input and output
- Parallelising within a step (e.g., processing multiple conversations concurrently) is possible but adds complexity not yet justified by benchmarks
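The sequential pattern can be sketched with plain functions, where each step's return value is the next step's argument (the toy step names below stand in for the real numbered pipeline stages):

```python
# Hypothetical steps; the real pipeline has 11 numbered stages.
def build_index(manifest: dict) -> list[str]:
    # Step N: derive a sorted conversation index from the manifest.
    return sorted(manifest["conversations"])


def parse_conversations(index: list[str]) -> list[dict]:
    # Step N+1: consumes exactly the index the previous step produced.
    return [{"id": cid} for cid in index]


PIPELINE = [build_index, parse_conversations]


def run(data):
    for step in PIPELINE:
        data = step(data)  # each step's output is the next step's input
    return data


result = run({"conversations": ["b", "a"]})
# result == [{"id": "a"}, {"id": "b"}]
```

Because every step has one well-defined input and output, a failure in `run` pinpoints a single stage, which is the debuggability property the design above trades parallelism for.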
## SHA-256 deduplication
The same image can appear in multiple conversations. Without deduplication, a power-user export with hundreds of repeated images produces a redundant archive. SHA-256:
- Is deterministic — the same file always produces the same hash
- Is collision-resistant — two different files will not produce the same hash in practice
- Is fast enough for the file sizes typical in ChatGPT exports
Content-based addressing also makes the archive stable across re-runs: re-running the converter on the same export produces the same output hashes.
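A minimal sketch of content-based addressing with `hashlib` (the `dedupe` helper is hypothetical, not the project's API): identical bytes hash to the same digest, so repeated images collapse to one stored file regardless of their original names.

```python
import hashlib
from pathlib import Path


def dedupe(assets: dict[str, bytes]) -> dict[str, str]:
    """Map each original filename to a content-addressed target name.

    Files with identical bytes share one target, so repeated images
    are stored once. (Illustrative helper, not the converter's API.)
    """
    by_digest: dict[str, str] = {}
    mapping: dict[str, str] = {}
    for name, data in assets.items():
        digest = hashlib.sha256(data).hexdigest()  # deterministic per content
        target = by_digest.setdefault(digest, f"{digest[:16]}{Path(name).suffix}")
        mapping[name] = target
    return mapping


mapping = dedupe({
    "chat1/cat.png": b"\x89PNG...",
    "chat2/cat-copy.png": b"\x89PNG...",  # same bytes, different name
    "chat3/dog.png": b"\x89PNGdog",
})
# The two identical files map to the same target; only two files are stored.
```

Because the target name is derived purely from the content, re-running the converter on the same export reproduces the same names, which is the re-run stability noted above.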
## DAG linearisation
Message histories in ChatGPT exports are directed acyclic graphs, not linear lists. When a user edits a previous message, a new branch is created while the old one remains in the export data. The converter walks from `current_node` back to the root to reconstruct only the active conversation path — which is what the user actually saw and intended.
Cycle detection is included as a safeguard against malformed export data.
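The walk-to-root with a cycle guard can be sketched as follows (the `linearise` helper and graph shape are illustrative, not the project's actual code):

```python
def linearise(graph: dict[str, dict], current_node: str) -> list[str]:
    """Follow parent pointers from current_node back to the root.

    Returns the active path in root-to-leaf order. Raises ValueError on a
    cycle, as a safeguard against malformed export data.
    """
    path: list[str] = []
    seen: set[str] = set()
    node: str | None = current_node
    while node is not None:
        if node in seen:
            raise ValueError(f"cycle detected at node {node!r}")
        seen.add(node)
        path.append(node)
        node = graph[node].get("parent")
    return list(reversed(path))


graph = {
    "root": {"parent": None},
    "a1": {"parent": "root"},   # original message
    "a2": {"parent": "a1"},     # reply on the active branch
    "b1": {"parent": "root"},   # abandoned branch left by an edit
}
# linearise(graph, "a2") -> ["root", "a1", "a2"]; "b1" is never visited
```

The abandoned branch `b1` stays in the graph but is simply never reached, which is why walking backwards from `current_node` recovers exactly the conversation the user last saw.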