ai-multimodal
Community
Turn media into smart, actionable outputs, fast.
Software Engineering
#multimodal, #content-generation, #Gemini API, #video-analysis, #image-analysis, #audio-processing, #document-extraction
Author: PhucMPham
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill enables end-to-end multimodal processing using Gemini and related models to analyze, transcribe, extract, and generate content from audio, images, video, and documents, reducing manual toil and enabling richer AI-driven media workflows.
Core Features & Use Cases
- Audio/Video analysis: Transcription with timestamps, summarization, scene detection, and processing of YouTube videos, including ones that run for hours.
- Image understanding: Captioning, object detection, segmentation, OCR, and multi-image comparisons.
- Document understanding: PDF extraction of tables, forms, charts, and diagrams.
- Generation: Text-to-image and text-to-video generation with Imagen 4 and Veo 3, plus editing and refinement.
- Model & key management: Supports Google Gemini, Imagen, and Veo models, with API key rotation and model orchestration.
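The key rotation mentioned in the last item is not documented here, so the following is only a minimal sketch of one common approach: round-robin over a pool of API keys. `KeyRotator` and `keys_from_env` are hypothetical names used for illustration and are not taken from the Skill's scripts.

```python
import itertools
from typing import Iterable, Iterator


class KeyRotator:
    """Round-robin over a fixed pool of API keys (hypothetical helper)."""

    def __init__(self, keys: Iterable[str]) -> None:
        # Drop empty entries so a trailing comma in a config value is harmless.
        self._keys = [k.strip() for k in keys if k and k.strip()]
        if not self._keys:
            raise ValueError("at least one API key is required")
        self._cycle: Iterator[str] = itertools.cycle(self._keys)

    def next_key(self) -> str:
        """Return the next key, wrapping around at the end of the pool."""
        return next(self._cycle)


def keys_from_env(value: str) -> list[str]:
    """Split a comma-separated string such as "key1,key2" into a key list."""
    return [part.strip() for part in value.split(",") if part.strip()]
```

A caller would ask the rotator for a fresh key before each request, for example after a rate-limit error on the previous key. Whether the Skill rotates per request or only on failure is not stated in this page.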
Quick Start
Example: Analyze an image
python3 gemini_batch_process.py --task analyze --files sample.jpg
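For readers who want to see what an image analysis call could look like underneath, here is a rough sketch that uses the google-genai SDK directly. It is not the code of gemini_batch_process.py, whose internals are not shown on this page. The model name `gemini-2.5-flash`, the `GEMINI_API_KEY` environment variable, and the helper names `guess_mime` and `analyze_image` are assumptions made for this example.

```python
import mimetypes
import os
from pathlib import Path


def guess_mime(path: str) -> str:
    """Guess a MIME type from the file extension, defaulting to binary."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def analyze_image(path: str, prompt: str = "Describe this image.",
                  model: str = "gemini-2.5-flash") -> str:
    """Send one image plus a text prompt to Gemini and return the reply text."""
    # Imported lazily so this module loads even without google-genai installed.
    from google import genai
    from google.genai import types

    # Assumption: the API key is exported as GEMINI_API_KEY.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    image_part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_mime(path),
    )
    response = client.models.generate_content(
        model=model,
        contents=[image_part, prompt],
    )
    return response.text
```

With the key set in the environment, `print(analyze_image("sample.jpg"))` would print the model's description of the image. Sending the bytes inline suits small files; larger media would more likely go through an upload step first.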
Dependency Matrix
Required Modules
- google-genai
- python-dotenv
Components
- scripts
- references
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: ai-multimodal
Download link: https://github.com/PhucMPham/threejs-christmas-tree/archive/main.zip#ai-multimodal
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
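If you prefer to install by hand, the extraction step can be sketched in Python as below. This assumes the archive contains a directory named ai-multimodal somewhere in its tree, which the `#ai-multimodal` fragment in the link suggests but this page does not confirm. `extract_skill` is a hypothetical helper, not part of the Skill.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_skill(archive: zipfile.ZipFile, skill_name: str,
                  dest_root: Path) -> list[Path]:
    """Copy every file under a `skill_name/` directory in the archive
    into dest_root/skill_name/ and return the written paths."""
    written = []
    for info in archive.infolist():
        if info.is_dir():
            continue
        parts = PurePosixPath(info.filename).parts
        if skill_name not in parts[:-1]:
            continue  # file is not inside a skill_name/ directory
        # Keep only the path from the skill directory downward.
        relative = Path(*parts[parts.index(skill_name):])
        if ".." in relative.parts:
            continue  # skip entries that would escape dest_root
        target = dest_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(archive.read(info))
        written.append(target)
    return written
```

After downloading the .zip file, calling `extract_skill(zipfile.ZipFile("main.zip"), "ai-multimodal", Path(".claude/skills"))` would place the files under .claude/skills/ai-multimodal/. If the function returns an empty list, the assumed directory name does not match the archive layout and the archive should be inspected manually.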