AI Coding Assistant Training Data Extraction Toolkit
Overview
The AI Coding Assistant Training Data Extraction Toolkit is an open-source project that automatically discovers and extracts conversation histories, code contexts, and metadata from popular AI coding assistants.
Built for machine learning researchers, developers, and teams who want to understand or train on their own AI coding interactions, this toolkit provides a unified interface across multiple platforms.
Repository: github.com/0xSero/ai-data-extraction
The Problem
AI coding assistants have become essential to modern development workflows. But the data they generate (your prompts, the AI's responses, code diffs, tool executions) is locked away in platform-specific formats:
- SQLite databases with undocumented schemas
- JSONL session files scattered across directories
- Version-specific storage that changes with each update
Researchers and teams who want to:
- Fine-tune models on their own coding patterns
- Audit AI interactions for security or compliance
- Analyze productivity and workflow optimization
- Build custom tooling on top of conversation data
...were left manually reverse-engineering each tool.
Solution
The toolkit provides zero-dependency extraction across the most popular AI coding assistants:
Supported Platforms
| Platform | Storage Format | Versions Supported |
|---|---|---|
| Claude Code | JSONL sessions | All versions |
| Claude Desktop | JSONL | All versions |
| Cursor | SQLite + JSONL | v0.43 through v2.0+ |
| Codex | JSONL | All versions |
| Trae | SQLite | All versions |
| Windsurf | JSONL | All versions |
| Continue AI | SQLite | All versions |
Key Features
Auto-Discovery
- Identifies installations across macOS, Linux, and Windows
- Handles version-specific storage locations automatically
- No manual configuration required for standard setups
Complete Context Extraction
- User messages and AI responses
- Code diffs and suggested edits
- Multi-file contexts with line numbers and paths
- Tool use and execution results
- Timestamps and conversation metadata
Clean Output
- Organized, timestamped JSONL files
- Consistent schema across all platforms
- Ready for downstream ML training pipelines
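The bullets above imply one record shape shared by every platform adapter. A minimal sketch of what a normalized message record might look like; the field names here are illustrative assumptions, not the toolkit's actual schema:

```python
import datetime
import json

def to_record(platform, role, content, timestamp=None, code_context=None):
    """Build one normalized message record (field names are illustrative)."""
    return {
        "platform": platform,        # e.g. "cursor", "claude-code"
        "role": role,                # "user" or "assistant"
        "content": content,
        "timestamp": timestamp
        or datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "code_context": code_context or [],  # e.g. [{"path": ..., "diff": ...}]
    }

# Each record serializes to one JSONL line, identical in shape across platforms.
record = to_record("cursor", "user", "Refactor this function")
print(json.dumps(record))
```

Because every adapter emits this same shape, downstream training pipelines only need to handle one format.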
Technical Implementation
Architecture
┌─────────────────────────────────────────────────────────┐
│ Auto-Discovery │
│ Scans OS-specific paths for AI assistant installations │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Platform Adapters │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Claude │ │ Cursor │ │ Codex │ │Windsurf │ ... │
│ │ Adapter │ │ Adapter │ │ Adapter │ │ Adapter │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Unified Data Model │
│ Conversation → Messages → Code Context → Metadata │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ JSONL Output │
│ Timestamped, organized files for ML pipelines │
└─────────────────────────────────────────────────────────┘
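The adapter layer in the diagram can be sketched roughly as follows; `PlatformAdapter` and `JsonlAdapter` are hypothetical names standing in for the real classes:

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path

class PlatformAdapter(ABC):
    """Hypothetical base class: each platform plugs in discovery and parsing."""

    @abstractmethod
    def find_sessions(self):
        """Return paths to this platform's session files."""

    @abstractmethod
    def extract(self, session_path):
        """Yield messages from one session file in the unified schema."""

class JsonlAdapter(PlatformAdapter):
    """Adapter for platforms that store sessions as JSONL files."""

    def __init__(self, root):
        self.root = Path(root)

    def find_sessions(self):
        return sorted(self.root.rglob("*.jsonl"))

    def extract(self, session_path):
        with open(session_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)
```

New platforms are added by implementing the two abstract methods, which is also what makes community contributions straightforward.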
Design Principles
Zero Dependencies
- Python 3.6+ standard library only
- No pip install, no virtual environment headaches
- Works immediately on any system with Python
Defensive Parsing
- Handles corrupted or incomplete session data gracefully
- Version detection for format migrations (especially Cursor v0.x → v2.x)
- Clear error messages when data can't be extracted
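A defensive JSONL reader along these lines might look like the following sketch (not the toolkit's actual parser): it yields whatever parses cleanly and reports what it skips instead of aborting.

```python
import json

def read_jsonl(path):
    """Yield valid records from a JSONL file, skipping malformed lines."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, raw in enumerate(f, start=1):
            raw = raw.strip()
            if not raw:
                continue
            try:
                yield json.loads(raw)
            except json.JSONDecodeError as exc:
                # Report and continue rather than failing the whole extraction.
                print(f"{path}:{lineno}: skipped malformed entry ({exc.msg})")
```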
Privacy-Aware
- Outputs to local files only; no network calls are made
- Documentation emphasizes secret scanning before use
- Clear warnings about proprietary code in extracted data
Challenges Overcome
Cursor's Evolving Storage
Cursor changed storage formats significantly between versions:
- v0.43–1.x: SQLite with specific table structures
- v2.0+: Hybrid JSONL + SQLite approach
The toolkit detects the installed version and applies the correct extraction logic automatically.
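One plausible way to sketch that dispatch, assuming the format can be inferred from the files present on disk (the real heuristic may differ):

```python
from pathlib import Path

def pick_cursor_extractor(data_dir):
    """Choose extraction logic from what's on disk (illustrative heuristic)."""
    d = Path(data_dir)
    has_sqlite = any(d.rglob("*.vscdb")) or any(d.rglob("*.sqlite"))
    has_jsonl = any(d.rglob("*.jsonl"))
    if has_jsonl and has_sqlite:
        return "hybrid-jsonl-sqlite"   # v2.0+ layout
    if has_sqlite:
        return "sqlite-only"           # v0.43-1.x layout
    raise ValueError(f"no recognizable Cursor session data under {d}")
```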
Cross-Platform Path Resolution
AI assistants store data in wildly different locations:
- macOS: `~/Library/Application Support/...`
- Linux: `~/.config/...` or `~/.local/share/...`
- Windows: `%APPDATA%\...`
Auto-discovery handles all three platforms with a unified interface.
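A minimal sketch of that resolution using only the standard library; the base directories are the standard per-OS locations, and the app directory names are assumptions for illustration:

```python
import os
import sys
from pathlib import Path

def data_roots():
    """Base directories to scan per platform (tools may deviate from these)."""
    if sys.platform == "darwin":
        return [Path.home() / "Library" / "Application Support"]
    if sys.platform.startswith("linux"):
        return [Path.home() / ".config", Path.home() / ".local" / "share"]
    if sys.platform == "win32":
        appdata = os.environ.get("APPDATA")
        return [Path(appdata)] if appdata else []
    return []

def discover(app_names=("Cursor", "Windsurf", "Claude")):
    """Return existing per-app data directories (names are illustrative)."""
    return [root / name
            for root in data_roots()
            for name in app_names
            if (root / name).is_dir()]
```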
Incomplete Session Data
Users frequently have partial sessions: interrupted conversations, crashed processes, or migrations between machines. The toolkit:
- Extracts what's available without failing
- Reports skipped entries with clear context
- Produces valid output even from corrupted inputs
Results
Community Adoption
- 121+ GitHub stars within weeks of release
- 14 forks with active community contributions
- Used by ML researchers, productivity analysts, and developer tool builders
Use Cases Enabled
- Fine-tuning: Teams training models on their own coding patterns
- Workflow Analysis: Understanding how developers interact with AI assistants
- Compliance Auditing: Reviewing AI interactions for sensitive data exposure
- Tool Building: Creating custom dashboards and analytics on top of conversation data
Lessons Learned
Zero-dependency pays off – Friction kills adoption. A single Python file that "just works" spreads faster than a full package with requirements.
Version detection is essential – Tools evolve quickly. Building in version awareness from day one saved countless support issues.
Privacy documentation matters – Being explicit about what data is extracted and how to handle it built trust with cautious users.
Community contributions scale – Open-sourcing the adapter pattern let the community add support for tools faster than any single developer could.
Future Directions
- Real-time sync: Stream new conversations as they happen
- Anonymization utilities: Built-in PII scrubbing for safer sharing
- Analysis dashboards: Visualizations for productivity patterns
- Additional platforms: VS Code Copilot, JetBrains AI, and others
This project reflects a core principle in 0xSero's work: find the friction, remove it, and make the tool disappear. When extracting AI coding data is as simple as running a Python script, researchers and teams can focus on what they're actually trying to build.
Explore the code: github.com/0xSero/ai-data-extraction