AI Explorer



japanese-clip-vit-b-16

by rinna

Open source · 42k downloads · 24 likes

1.7 (24 reviews) · Embedding · API & Local
About

The *Japanese CLIP ViT B/16* model is a Japanese version of CLIP, designed to understand and connect images with Japanese textual descriptions. It pairs an image encoder based on the ViT-B/16 architecture with a Japanese text encoder, embedding images and text in a shared space so they can be compared and retrieved consistently. This makes it well suited to tasks such as text-based image search, zero-shot image classification, and image–caption matching, leveraging training data tailored to Japanese. What sets it apart is its ability to operate effectively in a Japanese linguistic context while building on the robust technical foundations of existing models. It is aimed at developers and researchers who want to integrate Japanese-language multimodal functionality into their applications.
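In practice, text-based image search with a CLIP-style model reduces to ranking precomputed image embeddings by cosine similarity to a text embedding. A minimal sketch with toy placeholder vectors (real embeddings would come from the model's encoders; the vectors and labels below are purely illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_images(text_vec, image_vecs, top_k=2):
    """Return indices of the image embeddings closest to the text embedding."""
    order = sorted(range(len(image_vecs)),
                   key=lambda i: cosine(text_vec, image_vecs[i]),
                   reverse=True)
    return order[:top_k]

# Toy 4-dimensional embeddings standing in for real encoder outputs.
image_vecs = [
    [1.0, 0.0, 0.0, 0.0],   # e.g. a dog photo
    [0.0, 1.0, 0.0, 0.0],   # e.g. a cat photo
    [0.7, 0.7, 0.0, 0.0],   # e.g. a photo with both
]
query = [1.0, 0.1, 0.0, 0.0]  # e.g. the embedding of the text "犬"
print(rank_images(query, image_vecs))  # dog photo ranks first
```

The same loop scales to a real index: encode the image collection once, store the vectors, and only the text query needs encoding at search time.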

Documentation

rinna/japanese-clip-vit-b-16


This is a Japanese CLIP (Contrastive Language-Image Pre-Training) model trained by rinna Co., Ltd.

Please see japanese-clip for the other available models.

How to use the model

  1. Install package
Shell
$ pip install git+https://github.com/rinnakk/japanese-clip.git
  2. Run
Python
import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"


model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", cache_dir="/tmp/japanese_clip", device=device)
tokenizer = ja_clip.load_tokenizer()

img = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(img).unsqueeze(0).to(device)
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer, # optional; if not passed, the tokenizer is loaded on each call
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]
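The last two lines turn scaled cosine similarities into label probabilities. That softmax step in isolation, with made-up similarity scores (not real model outputs), shows why the output is so peaked:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for ["犬", "猫", "象"] against a dog photo.
sims = [0.31, 0.18, 0.12]
probs = softmax([100.0 * s for s in sims])  # the 100.0 logit scale sharpens the distribution
print(probs)
```

Multiplying by 100.0 before the softmax stretches small similarity gaps into large logit gaps, which is why the example above prints probabilities very close to [1.0, 0.0, 0.0].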

Model architecture

The model uses a ViT-B/16 Transformer architecture as its image encoder and a 12-layer BERT as its text encoder. The image encoder was initialized from the AugReg vit-base-patch16-224 model.
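The "/16" in ViT-B/16 is the patch size: the image encoder splits a 224×224 input into 16×16 patches, giving (224/16)² = 196 patch tokens, plus one class token, as the Transformer's input sequence. The arithmetic:

```python
image_size = 224   # input resolution of the vit-base-patch16-224 checkpoint
patch_size = 16    # the "/16" in ViT-B/16
patches_per_side = image_size // patch_size   # 224 / 16 = 14
num_patches = patches_per_side ** 2           # 14 * 14 = 196
seq_len = num_patches + 1                     # +1 for the [CLS] token
print(patches_per_side, num_patches, seq_len)
```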

Training

The model was trained on CC12M with the captions translated to Japanese.

Release date

May 12, 2022

How to cite

Bibtex
@misc{rinna-japanese-clip-vit-b-16,
    title = {rinna/japanese-clip-vit-b-16},
    author = {Shing, Makoto and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-clip-vit-b-16}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

License

The Apache 2.0 license

Capabilities & Tags
transformers · pytorch · safetensors · clip · zero-shot-image-classification · feature-extraction · vision · ja
Specifications

Category: Embedding
Access: API & Local
License: Open Source
Pricing: Open Source