Multimodal Marine Assistant
An AI-powered assistant purpose-built for marine engineers, combining document intelligence and visual recognition. The system integrates OCR-parsed technical manuals, real-time image analysis, and a domain-specific Retrieval-Augmented Generation (RAG) pipeline built on proprietary marine datasets, and it delivers accurate, context-aware responses through a memory-enabled LLM. The assistant handles natural-language queries, highlights reference text in engineering PDFs, and correlates visual inputs (e.g., engine parts, control panels) to provide real-time, multimodal support in high-stakes environments.
Solving operational gaps
Marine engineers often work in complex environments with outdated manuals, limited technical support, and no unified interface for documentation and visual data. We built an assistant that combines OCR, image processing, and language understanding to bridge this gap.
Engineers can upload snapshots of engine parts or query large PDF manuals, and the system returns precise, context-rich results with reference highlights and visual cues.
Multimodal document understanding
The system parses scanned or digitally structured PDFs using OCR, and maps figures and tables to their surrounding text using object detection and layout modeling. Users can ask natural-language questions like "Where is the camshaft located?" and receive both text and image answers, linked by figure IDs and page numbers.
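For illustration, a grounded answer might carry a payload like the sketch below; the field names and values here are hypothetical, not the production response schema.

```python
# Hypothetical answer payload for "Where is the camshaft located?";
# field names are illustrative, not the production schema.
answer = {
    "text": "The camshaft is mounted along the top of the engine block ...",
    "source": {"manual": "main-engine-manual", "page": 142},
    "figures": [
        {"figure_id": "fig-7.3", "page": 142, "caption": "Camshaft assembly"},
    ],
}
```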
Technical flow
● Used Azure Document Intelligence for high-accuracy OCR and layout-aware parsing of scanned ship manuals and technical PDFs (see the parsing sketch after this list).
● Custom-trained a YOLOv8 model, with annotation and dataset management in Roboflow, to detect diagrams, machinery, and labeled figures in engineering manuals; this included extensive data augmentation, annotation pipelines, and class balancing (see the detection sketch below).
● Integrated MediaPipe for real-time eye tracking and pose estimation during live interactions, powering behavior and engagement analytics (see the tracking sketch below).
● Employed a Retrieval-Augmented Generation (RAG) architecture with a DeepLake vector store and Azure OpenAI GPT models for precise semantic retrieval and context-aware response generation (see the retrieval sketch below).
● Structured document metadata and visual content into a unified schema to support multimodal grounding of answers (text + image-based references); an illustrative schema follows below.
● Designed an asynchronous pipeline to extract figure IDs, text content, and page-level embeddings for scalable ingestion of marine documents (see the async sketch below).
● Built a real-time chat interface that supports follow-up queries, memory retention, and visual highlighting of referenced manual sections for better explainability; conversational memory appears in the retrieval sketch below.
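A minimal sketch of the layout-aware parsing step via the azure-ai-formrecognizer SDK (the service's earlier SDK name); the endpoint, key, and file name are placeholders, and the production pipeline adds figure and table mapping on top of this output.

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: supply your own Document Intelligence endpoint and key.
client = DocumentAnalysisClient(
    endpoint="https://<resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

# Layout-aware analysis of a scanned manual (file name is illustrative).
with open("main_engine_manual.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

# Page-level text lines, later chunked and embedded for retrieval.
for page in result.pages:
    for line in page.lines:
        print(page.page_number, line.content)

# Tables come back with full row/column structure.
for table in result.tables:
    print(f"table: {table.row_count} rows x {table.column_count} columns")
```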
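A hedged sketch of the figure-detection step with Ultralytics YOLOv8; the dataset path, class set, and page image are stand-ins for the Roboflow export and the rendered manual pages.

```python
from ultralytics import YOLO

# Fine-tune a pretrained YOLOv8 checkpoint on the Roboflow-exported
# dataset (path and classes are illustrative).
model = YOLO("yolov8n.pt")
model.train(data="marine_figures/data.yaml", epochs=100, imgsz=640)

# Detect diagrams/machinery on a manual page rendered to an image.
results = model("manual_page_0142.png")
for box in results[0].boxes:
    label = results[0].names[int(box.cls)]
    print(label, box.xyxy.tolist())  # class name and bounding box
```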
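A minimal engagement-tracking loop using MediaPipe's classic mp.solutions API, shown here with the pose model; the deployed system also tracks eye landmarks (e.g., via the face-mesh solution), which follows the same pattern.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

cap = cv2.VideoCapture(0)
with mp_pose.Pose(min_detection_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV captures BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
            print(f"nose at ({nose.x:.2f}, {nose.y:.2f})")  # engagement signal
cap.release()
```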
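A sketch of the retrieval-and-memory loop using LangChain's DeepLake and Azure OpenAI integrations; the deployment names, dataset path, and the choice of ConversationalRetrievalChain are assumptions, and Azure credentials are expected in environment variables.

```python
from langchain_community.vectorstores import DeepLake
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Deployment names and dataset path are placeholders.
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-ada-002")
db = DeepLake(
    dataset_path="hub://<org>/marine-manuals",
    embedding=embeddings,
    read_only=True,
)

llm = AzureChatOpenAI(azure_deployment="gpt-4", temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Memory-enabled retrieval chain: follow-up questions are condensed
# against the chat history before hitting the vector store.
chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=db.as_retriever(search_kwargs={"k": 4}),
    memory=memory,
)
print(chain.invoke({"question": "Where is the camshaft located?"})["answer"])
```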
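The unified schema looked roughly like the dataclass below; the field names are illustrative, but the idea is that every text chunk keeps the visual references needed to ground an answer in both prose and figures.

```python
from dataclasses import dataclass, field

@dataclass
class ManualChunk:
    doc_id: str
    page: int
    text: str                                             # OCR text for this region
    figure_ids: list[str] = field(default_factory=list)   # linked diagrams
    table_ids: list[str] = field(default_factory=list)    # linked tables
    bbox: tuple[float, float, float, float] | None = None  # layout region

chunk = ManualChunk(
    doc_id="main-engine-manual",
    page=142,
    text="The camshaft is mounted along the engine block...",
    figure_ids=["fig-7.3"],
)
```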
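A self-contained sketch of the asynchronous ingestion step; run_ocr and detect_figures are stubs standing in for the real Azure OCR and YOLOv8 calls, and page-level embeddings would be gathered the same way.

```python
import asyncio

async def run_ocr(doc_id: str, page: int) -> str:
    await asyncio.sleep(0)  # placeholder for the OCR round trip
    return f"text of {doc_id} p.{page}"

async def detect_figures(doc_id: str, page: int) -> list[str]:
    await asyncio.sleep(0)  # placeholder for the detector round trip
    return [f"fig-{page}.1"]

async def process_page(doc_id: str, page: int) -> dict:
    # OCR and figure detection for one page run concurrently.
    text, figures = await asyncio.gather(
        run_ocr(doc_id, page), detect_figures(doc_id, page)
    )
    return {"doc_id": doc_id, "page": page, "text": text, "figure_ids": figures}

async def ingest(doc_id: str, pages: int) -> list[dict]:
    # All pages of a manual are processed in parallel.
    return await asyncio.gather(
        *(process_page(doc_id, p) for p in range(1, pages + 1))
    )

records = asyncio.run(ingest("main-engine-manual", 3))
print(len(records), "pages ingested")
```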
Project outcomes
The intelligent assistant significantly transformed the way marine technicians interact with complex ship manuals and troubleshooting workflows:
Reduced Query Resolution Time by 60%+: Enabled engineers to get instant, accurate answers from thousands of pages of technical documentation without manual searching.
Real-Time Visual Troubleshooting: Empowered ship staff to upload images from engine rooms and receive AI-guided diagnostics and object-specific insights—without internet dependency in hybrid deployments.
Frictionless Document Navigation: Delivered deep linking, page-level highlights, and figure references within parsed manuals for intuitive navigation of dense PDF materials.
Multimodal Intelligence Made Simple: Seamlessly combined image understanding, OCR-parsed content, and semantic search into a unified interface accessible even to non-technical users.
Operational Efficiency & Training Impact: Reduced onboarding time for new crew, improved accuracy in maintenance procedures, and decreased reliance on manual consultations with senior engineers.