DocuReel

Intro
Transforming long-form documents into interactive video knowledge

Context

Modern knowledge is increasingly locked inside long documents: research papers, technical reports, corporate strategy documents, policy briefs, and internal knowledge bases. While these documents contain valuable insights, their format makes them difficult to consume quickly. Most readers skim the abstract or executive summary and rarely engage with the full content.

DocuReel addresses this problem by converting static documents into short, narrated video summaries paired with a real-time conversational interface. Instead of reading a 40-page report, a user can watch a concise 30–60 second vertical video that highlights the key ideas and then interact with the document through voice. The system combines multimodal AI generation, distributed processing, and conversational agents to create a new interaction model for document consumption.

⸻

Problem

Long-form documents suffer from three major usability issues:

1. Low information accessibility
Reports and papers require significant time investment. Even motivated readers often abandon them early.

2. Static information delivery
Traditional documents are passive. If a reader wants clarification or deeper insight, they must manually search through the document.

3. Poor knowledge transfer efficiency
Professionals, researchers, and students frequently need to understand the essence of a document quickly before deciding whether deeper reading is worthwhile.

DocuReel reframes document consumption by shifting from reading-first workflows to audiovisual summaries with conversational exploration.

⸻

System Overview

DocuReel converts any PDF into two complementary outputs:

1. A short-form narrated video summary (30–60 seconds)
2. A live voice-based Q&A agent grounded in the document

This interaction model allows users to rapidly understand a document and ask questions about specific details without manually searching through the text.
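The two-output flow can be sketched as a small async pipeline. This is a minimal illustration with stub agents, not the real ADK interfaces: the function names, return values, and bucket path are all assumptions.

```python
# Sketch of the DocuReel pipeline shape. Each coroutine is a stand-in
# for one agent; the real system calls Gemini / Veo via Google's ADK.
import asyncio

async def summarize(pdf_text: str) -> str:
    # Placeholder for the summary agent (Agent 1).
    return f"summary of {len(pdf_text)} chars"

async def extract_knowledge(pdf_text: str) -> dict:
    # Placeholder for the in-depth extraction agent (Agent 2).
    return {"definitions": [], "evidence": [], "source_chars": len(pdf_text)}

async def write_script(summary: str) -> list[str]:
    # Placeholder for the script agent (Agent 3): one entry per scene.
    return [f"scene 1: {summary}", "scene 2: key findings"]

async def render_video(script: list[str]) -> str:
    # Placeholder for the video agent (Agent 4); returns a storage path.
    return f"gs://bucket/videos/{len(script)}-scenes.mp4"

async def process_document(pdf_text: str) -> tuple[str, dict]:
    # Summary and deep extraction run in parallel; script and video
    # generation then run sequentially on the summary output.
    summary, knowledge = await asyncio.gather(
        summarize(pdf_text), extract_knowledge(pdf_text)
    )
    script = await write_script(summary)
    video_uri = await render_video(script)
    return video_uri, knowledge
```

The returned pair mirrors the two outputs above: the video URI feeds the player, while the knowledge dict grounds the live Q&A agent.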
The system operates as an asynchronous multi-agent pipeline deployed on cloud infrastructure.

⸻

High-Level Architecture

The platform consists of two main subsystems:

1. Document-to-video generation pipeline
2. Live conversational interface

The generation pipeline processes the document asynchronously, while the conversational interface enables real-time interaction with the extracted knowledge.

User Uploads PDF
↓
Distributed Processing Pipeline
↓
Video Generation
↓
Video Stored in Cloud Storage
↓
Signed URL Delivered to Frontend
↓
Optional Voice Interaction with Document Agent

⸻

Multi-Agent Processing Pipeline

DocuReel uses a four-agent pipeline orchestrated through Google’s Agent Development Kit (ADK). Each agent performs a specialized task in the document transformation process.

Agent 1: Summary Extraction

The first agent processes the uploaded PDF using Gemini 3.1 Pro to produce a concise semantic summary of the document. The goal of this stage is to extract:

• core thesis or objective
• key supporting arguments
• important findings or conclusions
• structural organization of the document

The output becomes the foundation for video content generation.

⸻

Agent 2: In-Depth Knowledge Extraction

While the summary agent runs, a second agent executes in parallel to extract deeper contextual information from the document. This includes:

• important definitions
• key technical details
• supporting evidence
• background context
• frequently referenced concepts

This structured information becomes the knowledge base used by the live conversational agent. Separating summary generation from deeper extraction improves both speed and answer quality during interactive sessions.

⸻

Agent 3: Video Script Generation

Once the summary agent completes its output, the results are passed to the script generation agent. This agent converts the textual summary into a scene-based video script optimized for short-form content.
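A scene-based script of this kind can be represented with a small data structure. The sketch below is hypothetical: the field names and the 60-second cap are assumptions based on the 30–60 second target, not the actual DocuReel schema.

```python
# Hypothetical scene-based script structure for the short-form video.
from dataclasses import dataclass, field

@dataclass
class Scene:
    narration: str        # voice-over text for this scene
    visual_cue: str       # prompt handed to the video generator
    duration_s: float     # target scene length in seconds
    emphasis: list[str] = field(default_factory=list)  # phrases to stress

@dataclass
class VideoScript:
    title: str
    scenes: list[Scene]

    def total_duration(self) -> float:
        return sum(s.duration_s for s in self.scenes)

    def validate(self, max_total_s: float = 60.0) -> bool:
        # Keep the summary inside the short-form window.
        return 0 < self.total_duration() <= max_total_s
```

A script agent producing this structure can be checked for length before any expensive video generation is triggered.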
The script includes:

• narrative flow
• scene segmentation
• voice narration
• emphasis points
• visual cues for the video generator

The objective is to transform static document content into a format suitable for short-form storytelling.

⸻

Agent 4: Video Generation

The final agent converts the script into video content using Veo 3.1 Fast on Vertex AI. The system generates:

• video visuals
• synchronized narration
• timing alignment between scenes

This stage produces the raw video assets used to construct the final video summary.

⸻

Video Assembly

After video generation, the assets are processed using FFmpeg to assemble the final output. This stage performs:

• video segment stitching
• audio synchronization
• final encoding

The completed video is then stored in Google Cloud Storage.

⸻

Storage and Delivery

The final output video is stored in a cloud object storage bucket. To deliver the video securely, the system generates a signed URL which allows temporary access to the file without exposing the storage bucket publicly. The frontend receives the signed URL and streams the generated video to the user.

⸻

Live Interactive Agent

In addition to the video summary, DocuReel supports real-time voice interaction with the document. When users activate the “Connect to Agent” feature, the system launches a conversational session using Gemini Live.

The live agent uses the structured output from the in-depth extraction agent as contextual grounding. This allows the system to answer questions about the document in real time. Example interactions include:

• clarifying specific concepts
• explaining methodologies
• summarizing particular sections
• providing deeper interpretation of findings

The conversation occurs over a persistent WebSocket connection that streams audio responses to the user.

⸻

Infrastructure Architecture

DocuReel runs on a distributed cloud architecture designed to support asynchronous processing and long-running AI workloads.
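The FFmpeg assembly stage described above can be sketched as two helpers: one writes the concat list that FFmpeg's concat demuxer reads, the other builds the stitching-plus-narration command. Paths and codec flags here are illustrative assumptions, not the pipeline's actual encoding settings.

```python
# Sketch of the video assembly step: segment stitching plus narration
# muxing via FFmpeg. Codec choices (libx264/aac) are assumptions.
from pathlib import Path

def concat_list_file(segments: list[str], list_path: str = "segments.txt") -> str:
    # FFmpeg's concat demuxer reads "file '<path>'" lines from a text file.
    Path(list_path).write_text("".join(f"file '{s}'\n" for s in segments))
    return list_path

def build_assembly_cmd(list_path: str, narration: str, out: str) -> list[str]:
    # Stitch segments, attach the narration track, and encode once.
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_path,  # stitched video input
        "-i", narration,                                # narration audio input
        "-map", "0:v", "-map", "1:a",                   # pick video + audio streams
        "-c:v", "libx264", "-c:a", "aac",               # final encoding
        "-shortest", out,                               # stop at the shorter stream
    ]
```

The resulting command list would be executed with `subprocess.run(cmd, check=True)` inside the Cloud Run worker before uploading the output to Cloud Storage.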
Frontend

Hosted on Vercel, responsible for:

• user interaction
• document uploads
• video playback
• voice interaction controls

Compute Layer

The backend services run on Google Cloud Run, providing serverless containerized compute environments. This allows the system to scale dynamically depending on workload.

Task Orchestration

Video generation jobs are dispatched using Google Cloud Tasks. Cloud Tasks enables:

• asynchronous execution
• retry logic for failed jobs
• distributed task scheduling
• resilience against service restarts

This architecture ensures that long-running AI generation pipelines execute reliably even under failure conditions.

Storage

All generated video outputs are stored in Google Cloud Storage, enabling scalable and durable storage for media assets.

⸻

Technology Stack

• Frontend: Vercel
• Backend compute: Google Cloud Run
• Task queue: Google Cloud Tasks
• Storage: Google Cloud Storage
• AI models: Gemini 3.1 Pro, Gemini Live, Veo 3.1 Fast
• Video processing: FFmpeg

⸻

Key Design Decisions

Parallel Information Extraction

Running summary generation and deep extraction in parallel reduces latency while ensuring the live agent has access to detailed contextual knowledge.

Asynchronous Processing

AI video generation can take tens of seconds. Using Cloud Tasks decouples user requests from heavy compute operations, preventing timeouts and improving reliability.

Multi-Agent Decomposition

Dividing the system into specialized agents improves modularity and allows each stage to focus on a specific transformation.

Cloud-Native Architecture

Serverless infrastructure allows the system to scale dynamically without manual infrastructure management.

⸻

Potential Applications

DocuReel can be applied across several domains:

Research consumption
Summarizing academic papers into digestible video explanations.

Corporate knowledge management
Transforming internal strategy documents into quick knowledge briefs.
Education
Helping students understand complex reading material through audiovisual summaries.

Policy communication
Converting long policy documents into accessible video explanations for public audiences.

⸻

Project Outcome

DocuReel demonstrates how generative AI can transform the way people consume long-form information. By combining short-form video generation with conversational AI, the system provides a faster and more interactive way to explore complex documents.

The project was developed as part of the Google Gemini AI Hackathon, exploring how multimodal models and distributed AI pipelines can reshape information accessibility.
