Legacy Java Application Localization Pipeline

Architecture Overview

This project is a complete localization pipeline built to translate a large, proprietary Java-based simulation application into Japanese. The application carries a large volume of highly technical, lore-rich text spread across more than 6,700 interactive nodes, environmental descriptions, and strings embedded in proprietary Java logic.

Unlike localizing a standard web application, localizing a proprietary legacy Java engine requires reverse-engineering its text pipelines and establishing a robust workflow that can translate massive datasets without breaking the application's compilation, UI rendering, or underlying script calls.

Localization Pipeline & Bytecode Injection Architecture

Offline processing pipeline: AST extraction, Translation Memory, Gemma engine, critic loop.

Runtime execution layers: static data overrides, Javassist agent.

Engineering Challenges Solved

Bytecode Engineering & Hybrid Architecture Pivot

Localizing this application required overcoming a fundamental limitation: the proprietary UI engine used display strings as internal lookup keys, so rewriting a string at runtime could also change the key used to find it. This made dynamic runtime translation highly unstable.
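The failure mode can be illustrated with a minimal Python sketch. This is not the engine's code: `WIDGETS`, `naive_translate`, and `resolve_action` are hypothetical names, and the labels are invented for illustration. It only shows why a lookup keyed on the display string stops working once that string is swapped at runtime.

```python
# Hypothetical miniature of a UI engine that keys actions by display text.
WIDGETS = {"Start Simulation": "open_simulation_screen"}

def naive_translate(text: str) -> str:
    """Stand-in for a runtime hook that swaps every string it sees."""
    return {"Start Simulation": "シミュレーション開始"}.get(text, text)

def resolve_action(label: str) -> str:
    """The engine finds the action using the label itself as the key."""
    return WIDGETS[label]

original = "Start Simulation"
translated = naive_translate(original)

# The untranslated label resolves normally.
action = resolve_action(original)

# After the label is rewritten, the same lookup fails, because the key
# stored in WIDGETS is still the English string.
try:
    resolve_action(translated)
    lookup_survived = True
except KeyError:
    lookup_survived = False
```

In this toy model the translated label raises `KeyError`, which is the kind of breakage that motivates moving most content out of the runtime path.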

To resolve this, I pivoted from a pure dynamic JVM agent to a two-layer hybrid delivery architecture:

Static Data Override Layer (90% of content): Standard override bundles for structured CSVs, configuration JSONs, and dialogue tables are injected natively into the application's resource path, requiring no bytecode interception.
Surgical Runtime Agent Layer (10% of content): A lightweight JVM agent using Javassist constant pool manipulation is reserved exclusively for rewriting hardcoded string literals inside obfuscated classes (such as main menu labels and system warnings), which kept runtime substitution crash-free.
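The static layer can be pictured as an override-first resource lookup. The sketch below is an assumption about how such a layer behaves, not the engine's actual loader: `resolve_resource`, `override_root`, and `base_root` are hypothetical names, and the real application may search its resource path differently.

```python
from pathlib import Path
from typing import Optional

def resolve_resource(
    relative_path: str, override_root: Path, base_root: Path
) -> Optional[Path]:
    """Return the translated override if present, else the original file.

    Assumed search order: the override bundle first, then the base
    resources. Returns None when neither location has the file.
    """
    for root in (override_root, base_root):
        candidate = root / relative_path
        if candidate.is_file():
            return candidate
    return None
```

Because the translated CSV, JSON, or dialogue table simply shadows the original file under the same relative path, no class needs to be intercepted for this content.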

Modular CLI Tooling & Automated Translation Memory

Manually running multiple isolated cleanup and validation scripts proved unsustainable. I consolidated the entire workflow into a unified command-line orchestrator (app_localizer_cli) executing a deterministic 8-phase pipeline.
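A fixed-order orchestrator of this kind can be sketched in a few lines of Python. The source does not list the eight phases, so the two phases below (`extract` and `translate`) are placeholders invented for illustration, as are `run_pipeline` and the shape of the `state` dictionary.

```python
from typing import Callable, Dict, List, Tuple

Phase = Tuple[str, Callable[[Dict], Dict]]

def run_pipeline(phases: List[Phase], state: Dict) -> Dict:
    """Run phases in their listed order, passing state from one to the next.

    An exception raised by any phase stops the run, so later phases never
    see partially processed data.
    """
    for index, (name, phase) in enumerate(phases, start=1):
        print(f"[{index}/{len(phases)}] {name}")
        state = phase(state)
    return state

# Placeholder phases; the real pipeline's phase names are not given.
def extract(state: Dict) -> Dict:
    return {**state, "strings": ["Start", "Quit"]}

def translate(state: Dict) -> Dict:
    # Upper-casing stands in for the actual translation step.
    return {**state, "translated": [s.upper() for s in state["strings"]]}

result = run_pipeline([("extract", extract), ("translate", translate)], {})
```

Keeping the phase list as plain data makes the run order explicit and repeatable, which is the property the word "deterministic" points at.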

To reduce LLM token overhead, the pipeline integrates a local Translation Memory (TM) database in SQLite. Before routing strings to the translation model, the CLI checks the TM for exact and fuzzy matches (>85% similarity). Translated text blocks then pass through an automated LLM critic loop that audits glossary compliance, length limits, and placeholder stability, and verified results are written back to the TM database.
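The TM check described above can be sketched with Python's standard `sqlite3` and `difflib` modules. The table schema, function names, and placeholder pattern are assumptions made for illustration; the project's real schema and similarity metric are not specified, and `difflib.SequenceMatcher` is used here only as a readily available stand-in. The `placeholders_stable` helper shows a deterministic version of one audit the critic loop performs, not the LLM critic itself.

```python
import re
import sqlite3
from difflib import SequenceMatcher
from typing import Optional, Tuple

FUZZY_THRESHOLD = 0.85  # the ">85% similarity" cutoff from the text

def open_tm(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a TM database with an assumed two-column schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tm "
        "(source TEXT PRIMARY KEY, target TEXT NOT NULL)"
    )
    return conn

def store(conn: sqlite3.Connection, source: str, target: str) -> None:
    """Write a verified translation back to the TM."""
    conn.execute(
        "INSERT OR REPLACE INTO tm (source, target) VALUES (?, ?)",
        (source, target),
    )
    conn.commit()

def lookup(conn: sqlite3.Connection, source: str) -> Optional[Tuple[str, float]]:
    """Return (target, score) for an exact or fuzzy hit, or None on a miss."""
    row = conn.execute(
        "SELECT target FROM tm WHERE source = ?", (source,)
    ).fetchone()
    if row:
        return row[0], 1.0
    best: Optional[Tuple[str, float]] = None
    # Linear scan; adequate for a sketch, slow for very large memories.
    for candidate, target in conn.execute("SELECT source, target FROM tm"):
        score = SequenceMatcher(None, source, candidate).ratio()
        if score > FUZZY_THRESHOLD and (best is None or score > best[1]):
            best = (target, score)
    return best

# Assumed placeholder styles: %s, %d, %1$s, {0}, {name}.
PLACEHOLDER = re.compile(r"%\d*\$?[sd]|\{\w*\}")

def placeholders_stable(source: str, target: str) -> bool:
    """True when both strings contain the same multiset of placeholders."""
    return sorted(PLACEHOLDER.findall(source)) == sorted(PLACEHOLDER.findall(target))
```

Only a miss (`None`) would need to reach the translation model. A fuzzy hit returns the translation of a slightly different source string, so in practice it would likely still go through the critic loop before being accepted.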

Strategic Outcomes

Zero-Crash Localization: Successfully merged thousands of translated nodes back into the original application while leaving parameters and script calls completely untouched, verified via SHA-256 integrity checks.
Performance Stability: Shifting 90% of translations to native static resource overrides eliminated classloader desyncs and memory footprint overhead.
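One way to picture the SHA-256 integrity check is to hash everything in a node except its translatable text, before and after the merge. The node layout (`text`, `params`, `script_call`) and the function names below are hypothetical; the source does not describe what exactly is hashed.

```python
import hashlib
import json
from typing import Dict

def structural_digest(node: Dict, text_field: str = "text") -> str:
    """SHA-256 over every field of a node except the translatable text."""
    frozen = {key: value for key, value in node.items() if key != text_field}
    # sort_keys gives a canonical serialization, so key order cannot
    # change the digest.
    canonical = json.dumps(frozen, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def merge_is_clean(original: Dict, merged: Dict, text_field: str = "text") -> bool:
    """True when only the text field differs between the two nodes."""
    return structural_digest(original, text_field) == structural_digest(
        merged, text_field
    )

# Hypothetical node shape, invented for illustration.
node = {"id": 17, "text": "Open the airlock.", "params": [3, 5], "script_call": "door.open(3)"}
good = {**node, "text": "エアロックを開ける。"}
bad = {**good, "script_call": "door.open(4)"}
```

Here `merge_is_clean(node, good)` holds, while the altered script call in `bad` changes the digest and is flagged.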
