ai-multimodal

Community

Analyze and transcribe media with Gemini.

Author: BoneTheDeveloper
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

In many teams, extracting structured insights from multimedia content is time-consuming and error-prone. This Skill automates media understanding with Google Gemini's multimodal API: it analyzes images, audio, and video, performs transcription and OCR, and can optionally generate new images and videos. The aim is to speed up research, education, marketing, and content workflows.

Core Features & Use Cases

  • Vision and audio analysis: captioning, object detection, OCR, transcription, and multimodal reasoning for datasets, media libraries, and reports.
  • Media generation: produce complementary images with Imagen 4 and short videos with Veo 3 to augment presentations, tutorials, and marketing assets.
  • Workflow integration: batch processing, API key rotation, centralized resolver usage, robust error handling, and support for scripts, references, and assets.
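The batch processing, API key rotation, and error handling listed above could be combined along the following lines. This is a minimal sketch, not the Skill's actual implementation: `KeyRotator`, `process_batch`, and the `analyze(path, api_key)` callable are hypothetical names introduced here for illustration.

```python
from itertools import cycle
from typing import Callable, Dict, Iterable, List, Tuple


class KeyRotator:
    """Cycle through API keys, moving to the next one on demand (hypothetical helper)."""

    def __init__(self, keys: Iterable[str]) -> None:
        key_list = list(keys)
        if not key_list:
            raise ValueError("at least one API key is required")
        self._cycle = cycle(key_list)
        self.current = next(self._cycle)

    def rotate(self) -> str:
        """Switch to the next key and return it."""
        self.current = next(self._cycle)
        return self.current


def process_batch(
    paths: List[str],
    analyze: Callable[[str, str], str],
    rotator: KeyRotator,
    max_attempts: int = 3,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Call analyze(path, api_key) for each path; return (results, errors)."""
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        last_error = None
        for _ in range(max_attempts):
            try:
                results[path] = analyze(path, rotator.current)
                last_error = None
                break
            except Exception as exc:  # assumption: narrow this to the SDK's error types
                last_error = exc
                rotator.rotate()  # try the next key on the following attempt
        if last_error is not None:
            # One failing file is recorded and does not stop the rest of the batch.
            errors[path] = str(last_error)
    return results, errors
```

Passing `analyze` in as a callable keeps the retry logic independent of any particular SDK, so it can be exercised with a stub before real API keys are involved.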

Quick Start

Provide a sample media file (image, audio, or video) and a prompt. The Skill analyzes the file and returns captions, transcripts, or generated assets.
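As a rough illustration of the "media plus prompt" flow, the sketch below sends one local file and a prompt to Gemini through the `google-genai` package listed under Required Modules. It is not the Skill's own script. The function names, the `GEMINI_API_KEY` environment variable, and the default model name are assumptions made for this example. The file is sent inline, which suits small files only; larger media would likely need to be uploaded separately.

```python
import mimetypes
import os
from pathlib import Path


def guess_media_type(path: str) -> str:
    """Return the MIME type of an image, audio, or video file, or raise ValueError."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or mime.split("/")[0] not in {"image", "audio", "video"}:
        raise ValueError(f"unsupported media file: {path}")
    return mime


def analyze_media(path: str, prompt: str, model: str = "gemini-2.5-flash") -> str:
    """Send one media file and a prompt to Gemini and return the text reply.

    Assumptions: the API key is in GEMINI_API_KEY and the model name is
    available to your account. Adjust both to match your setup.
    """
    # Imported lazily so guess_media_type works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    media = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_media_type(path),
    )
    response = client.models.generate_content(model=model, contents=[media, prompt])
    return response.text
```

A call such as `analyze_media("lecture.mp3", "Transcribe this recording.")` would then return the transcript as plain text, assuming the key and model are valid.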

Dependency Matrix

Required Modules

google-genai, python-dotenv, Pillow, pypdf, markdown, python-docx, docx2pdf
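Before running the Skill, it may help to confirm that these modules are importable. The sketch below maps each package name to the module name it is normally imported under and reports which ones are missing. `REQUIRED` and `missing_modules` are hypothetical names for this example, and the check says nothing about which versions the Skill expects.

```python
from importlib.util import find_spec
from typing import Dict, List

# Package name as installed with pip -> module name used in import statements.
REQUIRED: Dict[str, str] = {
    "google-genai": "google.genai",
    "python-dotenv": "dotenv",
    "Pillow": "PIL",
    "pypdf": "pypdf",
    "markdown": "markdown",
    "python-docx": "docx",
    "docx2pdf": "docx2pdf",
}


def is_importable(module: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return find_spec(module) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example "google") is absent.
        return False


def missing_modules(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the package names whose modules cannot be found."""
    return [package for package, module in required.items() if not is_importable(module)]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All required modules are available.")
```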

Components

scripts, references

💻 Claude Code Installation

Recommended: let Claude install the Skill automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: ai-multimodal
Download link: https://github.com/BoneTheDeveloper/Electronic-Contact-Contact-Book/archive/main.zip#ai-multimodal

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.