by line-corporation
The *clip-japanese-base* model is a Japanese version of CLIP that maps Japanese text and images into a shared embedding space. Because it scores how well a textual description matches an image, it can be used for zero-shot image classification and for retrieving images from text or text from images. It was trained on roughly one billion image-text pairs collected from the web. Use cases include visual content analysis, automatic moderation, and multimodal search. In the comparison reported in this card, it has the highest Recruit Datasets accuracy (0.89) and ties for the highest STAIR Captions R@1 (0.30) while using an 86M-parameter image encoder.
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. It was trained on ~1B web-collected image-text pairs and is applicable to various visual tasks, including zero-shot image classification and text-to-image or image-to-text retrieval.
pip install pillow requests sentencepiece transformers torch timm
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)
# Download a sample image and preprocess it into a tensor batch.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
# Candidate labels: "犬" (dog), "猫" (cat), "象" (elephant).
text = tokenizer(["犬", "猫", "象"]).to(device)
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    # Scaled image-text similarity, converted to per-label probabilities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
# [[1., 0., 0.]]
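The scoring step in the example (`100.0 * image_features @ text_features.T` followed by a softmax) can be illustrated without loading the model. The sketch below is written in plain Python and uses made-up 3-dimensional vectors, not real model outputs. It assumes the features are compared by cosine similarity, so it L2-normalizes them explicitly; whether the model already returns normalized features is an assumption here. The helper names (`l2_normalize`, `softmax`, `label_probs`) are hypothetical and only serve the illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(image_feature, text_features, scale=100.0):
    """Mirror `(scale * image @ text.T).softmax(dim=-1)` for a single image."""
    image_feature = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text_feature = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(image_feature, text_feature))
        logits.append(scale * cosine)
    return softmax(logits)


# Toy embeddings (assumed values for illustration only).
toy_image = [0.9, 0.1, 0.0]
toy_texts = [
    [1.0, 0.0, 0.0],  # stands in for "犬"
    [0.0, 1.0, 0.0],  # stands in for "猫"
    [0.0, 0.0, 1.0],  # stands in for "象"
]
probs = label_probs(toy_image, toy_texts)
print([round(p, 3) for p in probs])  # prints [1.0, 0.0, 0.0]
```

With a scale of 100, even a moderate gap in cosine similarity produces a near one-hot distribution, which is consistent with the `[[1., 0., 0.]]` output shown above. Calling `label_probs(toy_image, toy_texts, scale=1.0)` gives a much flatter distribution over the same labels.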
The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.
| Model | Image Encoder Params | Text Encoder Params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
|---|---|---|---|---|---|
| Ours | 86M(Eva02-B) | 100M(BERT) | 0.30 | 0.89 | 0.58 |
| Stable-ja-clip | 307M(ViT-L) | 100M(BERT) | 0.24 | 0.77 | 0.68 |
| Rinna-ja-clip | 86M(ViT-B) | 100M(BERT) | 0.13 | 0.54 | 0.56 |
| Laion-clip | 632M(ViT-H) | 561M(XLM-RoBERTa) | 0.30 | 0.83 | 0.58 |
| Hakuhodo-ja-clip | 632M(ViT-H) | 100M(BERT) | 0.21 | 0.82 | 0.46 |
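The table reports R@1 for retrieval and acc@1 for classification, but the card does not describe the evaluation protocol. As a rough guide to reading those columns, the sketch below computes recall@1 from a query-by-candidate similarity matrix under a simplifying assumption: each query has exactly one correct candidate, located at the same index. The function name `recall_at_1` and the toy scores are hypothetical, and the actual evaluation may differ, for instance if an image has several reference captions.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the paired one.

    `similarity[i][j]` is the score between query i and candidate j. The
    correct candidate for query i is assumed to sit at index i.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == query_index)
    return hits / len(similarity)


# Toy similarity scores (assumed values, not measured results).
toy_similarity = [
    [0.9, 0.2, 0.1],  # query 0 retrieves candidate 0: hit
    [0.3, 0.8, 0.4],  # query 1 retrieves candidate 1: hit
    [0.7, 0.1, 0.5],  # query 2 retrieves candidate 0: miss
]
print(recall_at_1(toy_similarity))  # 2 of 3 queries hit
```

Top-1 accuracy for zero-shot classification follows the same arg-max idea, with class-name embeddings as the candidates and the ground-truth class index as the target.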
The Apache License, Version 2.0
@misc{clip-japanese-base,
title = {CLIP Japanese Base},
author = {Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama},
url = {https://huggingface.co/line-corporation/clip-japanese-base},
}