ai-multimodal

Community

Turn media into smart, actionable outputs—fast.

Author: PhucMPham
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill provides end-to-end multimodal processing with Gemini and related models. It analyzes, transcribes, and extracts content from audio, images, video, and documents, and it generates new media, which cuts manual work and supports richer AI-driven media workflows.

Core Features & Use Cases

  • Audio/Video analysis: Transcription with timestamps, summarization, scene detection, and processing of YouTube videos, including videos that run for hours.
  • Image understanding: Captioning, object detection, segmentation, OCR, and multi-image comparisons.
  • Document understanding: PDF extraction of tables, forms, charts, and diagrams.
  • Generation: Text-to-image and text-to-video generation with Imagen 4 and Veo 3, plus editing and refinement.
  • Model & key management: Supports Google Gemini, Imagen, and Veo models with rotation and orchestration.
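
The listing does not describe how key rotation works. As a rough illustration only, round-robin rotation over several API keys could be sketched as follows; the `KeyRotator` class and the `GEMINI_API_KEYS` variable name are hypothetical and are not taken from the Skill's scripts.

```python
import itertools
import os


class KeyRotator:
    """Hypothetical helper: hand out API keys in round-robin order."""

    def __init__(self, keys):
        cleaned = [k.strip() for k in keys if k.strip()]
        if not cleaned:
            raise ValueError("at least one API key is required")
        self._keys = cleaned
        self._cycle = itertools.cycle(cleaned)

    def next_key(self):
        # Each call returns the next key, wrapping around at the end.
        return next(self._cycle)

    @classmethod
    def from_env(cls, var="GEMINI_API_KEYS"):
        # Assumed convention: comma-separated keys in one environment variable.
        return cls(os.environ.get(var, "").split(","))


rotator = KeyRotator(["key-a", "key-b", "key-c"])
order = [rotator.next_key() for _ in range(4)]
print(order)
```

A real implementation would probably also skip keys that have hit a rate limit; this sketch only shows the rotation itself.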

Quick Start

Example: Analyze an image

    python3 gemini_batch_process.py --task analyze --files sample.jpg
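
The internals of `gemini_batch_process.py` are not shown on this page. For orientation, a direct image-analysis call with the two required modules (`google-genai` and `python-dotenv`) might look like the sketch below. The function names, the `GEMINI_API_KEY` variable, the default prompt, and the `gemini-2.5-flash` model choice are all assumptions, not the Skill's actual behavior.

```python
import mimetypes
import os
from pathlib import Path


def guess_mime(path):
    """Guess a MIME type from the file extension, defaulting to binary."""
    mime, _ = mimetypes.guess_type(str(path))
    return mime or "application/octet-stream"


def analyze_image(path, prompt="Describe this image.", model="gemini-2.5-flash"):
    """Send one image plus a text prompt to a Gemini model and return the text reply."""
    # Imported lazily so guess_mime() works even without the SDK installed.
    from dotenv import load_dotenv
    from google import genai
    from google.genai import types

    load_dotenv()  # reads a local .env file if one exists
    # Assumed variable name; the Skill may read its key differently.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_mime(path),
    )
    response = client.models.generate_content(model=model, contents=[part, prompt])
    return response.text
```

With a valid key in `.env`, calling `print(analyze_image("sample.jpg"))` would print the model's description of the image.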

Dependency Matrix

Required Modules

  • google-genai
  • python-dotenv

Components

  • scripts
  • references

💻 Claude Code Installation

Recommended: Let Claude install automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: ai-multimodal
Download link: https://github.com/PhucMPham/threejs-christmas-tree/archive/main.zip#ai-multimodal

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
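
For readers who prefer to install by hand, the extraction step could be sketched as below. It assumes the archive contains a folder named `ai-multimodal` somewhere inside it (suggested by the `#ai-multimodal` fragment, but not confirmed here); the `extract_skill` helper and the file names in the demo archive are hypothetical. The demo builds a small in-memory zip instead of downloading the real one.

```python
import io
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def extract_skill(zip_bytes, skill_name, dest_root):
    """Copy every file under a folder named skill_name into dest_root/skill_name."""
    dest_root = Path(dest_root)
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            parts = PurePosixPath(info.filename).parts
            # Skip directories and files that are not inside the skill folder.
            if info.is_dir() or skill_name not in parts[:-1]:
                continue
            # Keep only the path from the skill folder downward.
            rel = Path(*parts[parts.index(skill_name):])
            if ".." in rel.parts:
                continue  # skip unsafe entries
            target = dest_root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(zf.read(info))
            written.append(rel.as_posix())
    return sorted(written)


# Demo archive with an assumed layout; the real repository may differ.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("repo-main/README.md", "top-level file")
    zf.writestr("repo-main/skills/ai-multimodal/SKILL.md", "skill body")
    zf.writestr("repo-main/skills/ai-multimodal/scripts/run.py", "print('hi')")

with tempfile.TemporaryDirectory() as tmp:
    files = extract_skill(buf.getvalue(), "ai-multimodal", Path(tmp) / ".claude" / "skills")
print(files)
```

In real use, `zip_bytes` would come from downloading the link above, and `dest_root` would be the `.claude/skills/` directory.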