ai-multimodal
Community
Analyze and transcribe media with Gemini.
Author: BoneTheDeveloper
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
In many teams, extracting structured insights from multimedia content is time-consuming and error-prone. This Skill automates media understanding with Google Gemini's multimodal API: it analyzes images, audio, and video, performs transcription and OCR, and can optionally generate new assets (images and videos). The goal is to accelerate research, education, marketing, and content workflows.
Core Features & Use Cases
- Vision and audio analysis: captioning, object detection, OCR, transcription, and multimodal reasoning for datasets, media libraries, and reports.
- Media generation: produce complementary images with Imagen 4 and short videos with Veo 3 to augment presentations, tutorials, and marketing assets.
- Workflow integration: batch processing, API key rotation, centralized resolver usage, robust error handling, and support for scripts, references, and assets.
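The listing does not say how the Skill implements API key rotation or batch processing, so the following is only a minimal sketch of the general idea, using the standard library alone. The names `QuotaError`, `call_with_key_rotation`, and `process_batch` are hypothetical and are not taken from the Skill's scripts.

```python
from typing import Callable, Dict, Iterable, List, TypeVar, Union

T = TypeVar("T")


class QuotaError(Exception):
    """Hypothetical stand-in for a rate-limit or quota failure from the API."""


def call_with_key_rotation(keys: Iterable[str], call: Callable[[str], T]) -> T:
    """Try `call` with each key in order until one succeeds.

    Only QuotaError triggers a switch to the next key; any other
    exception propagates immediately so real bugs are not hidden.
    """
    key_list: List[str] = list(keys)
    if not key_list:
        raise ValueError("at least one API key is required")
    last_error = None
    for key in key_list:
        try:
            return call(key)
        except QuotaError as exc:
            last_error = exc  # remember the failure and try the next key
    raise RuntimeError("all API keys exhausted") from last_error


def process_batch(
    items: Iterable[str],
    keys: Iterable[str],
    handler: Callable[[str, str], T],
) -> Dict[str, Union[T, Exception]]:
    """Run `handler(key, item)` for every item, collecting results per item.

    A failed item is recorded as its exception instead of aborting the batch.
    """
    key_list = list(keys)
    results: Dict[str, Union[T, Exception]] = {}
    for item in items:
        try:
            results[item] = call_with_key_rotation(
                key_list, lambda key: handler(key, item)
            )
        except RuntimeError as exc:
            results[item] = exc
    return results
```

In practice the `handler` would wrap a real API request and translate the SDK's rate-limit error into `QuotaError`; recording failures per item keeps one bad file from stopping the rest of the batch.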
Quick Start
Provide sample media (image, audio, or video) together with a prompt. The Skill analyzes the media and returns captions, transcripts, or generated assets.
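As a rough illustration of what such a request might look like when made directly with the `google-genai` package, here is a minimal sketch. It is not the Skill's own script: the helper names, the `GEMINI_API_KEY` environment variable, and the `gemini-2.5-flash` model name are assumptions chosen for the example.

```python
import mimetypes
import os
from pathlib import Path

SUPPORTED_KINDS = {"image", "audio", "video"}


def guess_media_type(path: str) -> str:
    """Return the MIME type for an image, audio, or video file name."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or mime.split("/")[0] not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported or unknown media type: {path}")
    return mime


def analyze_media(path: str, prompt: str, model: str = "gemini-2.5-flash") -> str:
    """Send one media file plus a prompt to Gemini and return the text reply.

    Assumptions: the API key is in GEMINI_API_KEY, and the file is small
    enough to be sent inline as bytes.
    """
    # Imported lazily so the helper above works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    media = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_media_type(path),
    )
    response = client.models.generate_content(model=model, contents=[media, prompt])
    return response.text
```

A call such as `analyze_media("talk.mp3", "Transcribe this recording.")` would then return the model's text response, provided the key is valid and the file exists.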
Dependency Matrix
Required Modules
- google-genai
- python-dotenv
- Pillow
- pypdf
- markdown
- python-docx
- docx2pdf
Components
- scripts
- references
💻 Claude Code Installation
Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill: Name: ai-multimodal Download link: https://github.com/BoneTheDeveloper/Electronic-Contact-Contact-Book/archive/main.zip#ai-multimodal Please download this .zip file, extract it, and install it in the .claude/skills/ directory.