by rinna
The *Japanese CLIP ViT B/16* model is a Japanese version of CLIP, designed to connect images with Japanese text descriptions. It pairs an image encoder based on the ViT-B/16 architecture with a Japanese text encoder, so that images and text are embedded in a shared space and can be compared and retrieved consistently. This makes it suited to tasks such as text-based image search, zero-shot image classification, and ranking candidate captions for an image, using training data with Japanese captions. It works directly with Japanese text while building on established model architectures, and it is aimed at developers and researchers who want to add Japanese multimodal functionality to their applications.

This is a Japanese CLIP (Contrastive Language-Image Pre-Training) model trained by rinna Co., Ltd.
Please see japanese-clip for the other available models.
$ pip install git+https://github.com/rinnakk/japanese-clip.git
import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", cache_dir="/tmp/japanese_clip", device=device)
tokenizer = ja_clip.load_tokenizer()
img = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(img).unsqueeze(0).to(device)
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,  # optional; if omitted, the tokenizer is loaded on every call
)
with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs) # prints: [[1.0, 0.0, 0.0]]
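The last step of the example turns image-text similarity scores into label probabilities with a scaled softmax. The dependency-free sketch below reproduces that arithmetic on made-up 3-dimensional vectors. The vectors, the helper names (`l2_normalize`, `label_probs`), and the explicit L2 normalization are illustrative assumptions and not part of the `japanese_clip` API; real embeddings would come from `get_image_features` and `get_text_features`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def label_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    image_vec = l2_normalize(image_vec)
    logits = [
        scale * sum(a * b for a, b in zip(image_vec, l2_normalize(t)))
        for t in text_vecs
    ]
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up embeddings (assumption): the image vector points mostly along the first text vector.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
probs = label_probs(image, texts)
print([round(p, 3) for p in probs])
```

With a scale of 100, even a moderate gap in similarity produces a nearly one-hot distribution, which is consistent with the `[[1.0, 0.0, 0.0]]` output shown in the example.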
The model uses a ViT-B/16 Transformer architecture as its image encoder and a 12-layer BERT as its text encoder. The image encoder was initialized from the AugReg vit-base-patch16-224 model.
The model was trained on CC12M, with the captions translated into Japanese.
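CLIP-style models are generally trained with a symmetric contrastive objective over the image-caption pairs in a batch. The source does not state the exact loss or hyperparameters used for this model, so the following is only a generic sketch of that kind of objective in plain Python. The function names and the toy similarity matrices are assumptions made for illustration.

```python
import math


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    losses = []
    for i, row in enumerate(logits):
        top = max(row)
        log_sum_exp = top + math.log(sum(math.exp(x - top) for x in row))
        losses.append(log_sum_exp - row[i])
    return sum(losses) / len(losses)


def symmetric_contrastive_loss(sim):
    """Average of image-to-text and text-to-image cross-entropy.

    sim[i][j] is the (already scaled) similarity between image i and caption j;
    matching pairs sit on the diagonal.
    """
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))


# Toy similarity matrices (assumption): matched pairs score high in `aligned`,
# mismatched pairs score high in `shuffled`.
aligned = [[5.0, 0.0], [0.0, 5.0]]
shuffled = [[0.0, 5.0], [5.0, 0.0]]
print(symmetric_contrastive_loss(aligned) < symmetric_contrastive_loss(shuffled))
```

The loss is small when each image is most similar to its own caption and large when it is most similar to another caption in the batch, which is what pushes matching image and text embeddings together during training.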
Release date: May 12, 2022
@misc{rinna-japanese-clip-vit-b-16,
title = {rinna/japanese-clip-vit-b-16},
author = {Shing, Makoto and Zhao, Tianyu and Sawada, Kei},
url = {https://huggingface.co/rinna/japanese-clip-vit-b-16}
}
@inproceedings{sawada2024release,
title = {Release of Pre-Trained Models for the {J}apanese Language},
author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
month = {5},
year = {2024},
pages = {13898--13905},
url = {https://aclanthology.org/2024.lrec-main.1213},
note = {\url{https://arxiv.org/abs/2404.01657}}
}