Long-Document LLM Pipeline for Financial Research
GitHub Repo: Junyu06/LLM-Pipeline
Problem
Long-form documents (10k–20k words) need to be converted into structured outputs for researchers. Raw LLM output is unreliable: malformed or schema-violating responses can silently corrupt data in downstream systems.
Approach
- Designed a MapReduce-style LLM orchestration pipeline for 10k–20k word inputs
- Implemented schema-constrained generation with Pydantic and explicit validation boundaries
- Engineered idempotent, resumable stages with self-repair logic for partial-failure recovery
- Deployed local LLM inference to control cost and data flow
Results
- Consistently produced schema-valid structured outputs across long-context inputs
- Eliminated full-job reprocessing through stage-level recovery design
- Maintained predictable processing behavior across long-running workloads
Tech Stack
Python, Pydantic, semantic chunking, MapReduce-style pipeline, LLM orchestration, local LLM deployment