Architecture Overview
This project is an end-to-end localization pipeline that translates a large, proprietary Java-based simulation application into Japanese. The application contains a large volume of technical, lore-rich text spread across more than 6,700 interactive nodes, environmental descriptions, and strings embedded in proprietary Java logic.
Unlike localizing a standard web application, localizing a proprietary legacy Java engine requires reverse-engineering its text pipelines and establishing a workflow that can translate large datasets without breaking the application's compilation, UI rendering, or underlying script calls.
Localization Pipeline & Bytecode Injection Architecture
[Diagram: offline processing pipeline (AST extraction, Translation Memory, Gemma engine, critic loop) and runtime execution layers (static data overrides, Javassist agent).]
Engineering Challenges Solved
Bytecode Engineering & Hybrid Architecture Pivot
Localizing this application required overcoming a fundamental limitation: the proprietary UI engine uses display strings as internal lookup keys, so translating a string at runtime can change the key the engine later looks up. This made purely dynamic runtime translation highly unstable.
To resolve this, I pivoted from a pure dynamic JVM agent to a two-layer hybrid delivery architecture:
- Static Data Override Layer (90% of content): Standard override bundles for structured CSVs, configuration JSONs, and dialogue tables are injected natively into the application's resource path, requiring no bytecode interception.
- Surgical Runtime Agent Layer (10% of content): A lightweight JVM agent that uses Javassist constant pool manipulation is reserved exclusively for rewriting hardcoded string literals inside obfuscated classes (such as main menu labels and system warnings), so that these strings are substituted at runtime without crashes.
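The engine's internals are not described here, so the following is only a toy model of the lookup-key problem, written in Python for brevity. The functions `register_handlers` and `click` and the `JA` table are hypothetical; they simply assume that a handler is stored under its display string and resolved by the label currently on screen.

```python
def register_handlers(labels):
    """Toy engine: each widget's handler is stored under its display string."""
    return {label: "handler for " + label for label in labels}


def click(handlers, shown_label):
    """Toy engine: a click resolves the handler by the label currently on screen."""
    return handlers.get(shown_label)


JA = {"Start": "開始", "Quit": "終了"}  # hypothetical translations

# Dynamic approach: handlers are registered in English, labels are translated
# only at render time, so the key on screen no longer matches the stored key.
dynamic_handlers = register_handlers(["Start", "Quit"])
dynamic_result = click(dynamic_handlers, JA["Start"])

# Static approach: the data file itself is overridden, so the engine only ever
# sees Japanese and the stored key and the on-screen label agree.
static_handlers = register_handlers([JA["Start"], JA["Quit"]])
static_result = click(static_handlers, JA["Start"])
```

In this model the dynamic lookup returns nothing while the static one resolves normally, which is one plausible reading of why moving most content into static overrides removed the instability.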
Modular CLI Tooling & Automated Translation Memory
Manually running multiple isolated cleanup and validation scripts proved unsustainable. I consolidated the entire workflow into a unified command-line orchestrator (app_localizer_cli) executing a deterministic 8-phase pipeline.
To reduce LLM token overhead, the pipeline integrates a local Translation Memory (TM) database stored in SQLite. Before routing strings to the translation model, the CLI checks the TM for exact and fuzzy matches (>85% similarity). Translated text blocks then pass through an automated LLM critic loop that audits glossary compliance, length limits, and placeholder stability, and verified results are written back to the TM database.
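A minimal sketch of the TM lookup and write-back step, under several assumptions: the two-column `tm` schema, the use of `difflib.SequenceMatcher` as the similarity measure, and the placeholder syntax matched by `PLACEHOLDER_RE` are all illustrative choices, not the project's actual implementation. Only the placeholder check is shown; the LLM critic, glossary, and length audits are omitted.

```python
import re
import sqlite3
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.85  # the ">85% similarity" cutoff from the text

# Hypothetical placeholder syntax: printf-style (%s, %d) and brace-style ({0}, {name}).
PLACEHOLDER_RE = re.compile(r"%[sd]|\{\w+\}")


def open_tm(path=":memory:"):
    """Create (or open) a TM database with an assumed two-column schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tm (source TEXT PRIMARY KEY, target TEXT NOT NULL)"
    )
    return conn


def tm_lookup(conn, source):
    """Return (target, score) for the best match, or None if nothing qualifies."""
    row = conn.execute("SELECT target FROM tm WHERE source = ?", (source,)).fetchone()
    if row:
        return row[0], 1.0  # exact match, no model call needed
    best = None
    # Linear scan over all entries; fine for a sketch, slow for a large TM.
    for cand_source, cand_target in conn.execute("SELECT source, target FROM tm"):
        score = SequenceMatcher(None, source, cand_source).ratio()
        if score > FUZZY_THRESHOLD and (best is None or score > best[1]):
            best = (cand_target, score)
    return best


def placeholders_stable(source, target):
    """True when both strings contain the same multiset of placeholders."""
    return sorted(PLACEHOLDER_RE.findall(source)) == sorted(PLACEHOLDER_RE.findall(target))


def tm_store(conn, source, target):
    """Write a translation back only if its placeholders survived; report success."""
    if not placeholders_stable(source, target):
        return False
    conn.execute("INSERT OR REPLACE INTO tm (source, target) VALUES (?, ?)", (source, target))
    conn.commit()
    return True
```

A caller would try `tm_lookup` first and send the string to the translation model only when it returns `None`. A fuzzy hit is a translation of a similar but different source string, so it would normally be reviewed or post-edited rather than reused verbatim.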
Strategic Outcomes
- Zero-Crash Localization: Merged thousands of translated nodes back into the application while leaving the original parameters and script calls completely untouched, verified via SHA-256 integrity checks.
- Performance Stability: Shifting 90% of translations to native static resource overrides eliminated classloader desyncs and the memory overhead of runtime interception for that content.
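One way such a SHA-256 integrity check could work is sketched below. The node layout (a dict with `id`, `text`, and other fields) and the `TEXT_FIELDS` set are assumptions made for illustration; the idea is to hash everything except the translatable text before and after the merge and compare the digests.

```python
import hashlib
import json

TEXT_FIELDS = {"text"}  # hypothetical: the only field translators may change


def structural_digest(node):
    """SHA-256 over every field except translatable text, in canonical JSON form."""
    frozen = {k: v for k, v in node.items() if k not in TEXT_FIELDS}
    payload = json.dumps(frozen, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_nodes(originals, translated):
    """Return ids of nodes whose non-text content differs after the merge."""
    before = {n["id"]: structural_digest(n) for n in originals}
    return [n["id"] for n in translated if before.get(n["id"]) != structural_digest(n)]
```

An empty result from `changed_nodes` means every parameter and script call hashed identically before and after translation. Any id in the list points at a node where something other than the text was altered.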