by microsoft
UniXcoder is an AI model developed by Microsoft that leverages multimodal data, such as code comments and abstract syntax trees (AST), to learn rich representations of code. It stands out for its ability to operate in three modes: as an encoder alone for tasks like code search, as a decoder alone for code completion, and as an encoder-decoder for applications such as function name prediction, API recommendation, or code summarization. This versatile model enhances the precision of code understanding and generation by integrating both syntactic structure and semantic context. Its use cases span static analysis, development assistance, and automation of code-related tasks.
UniXcoder is a unified cross-modal pre-trained model that leverages multimodal data (i.e., code comments and ASTs) to pre-train code representations.
We provide a class for using UniXcoder; the code below shows how to build the model. You can download the class with:
wget https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py
import torch
from unixcoder import UniXcoder
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = UniXcoder("microsoft/unixcoder-base")
model.to(device)
Below, we give zero-shot examples for several tasks in the different modes: code search (encoder-only), code completion (decoder-only), and function name prediction, API recommendation, and code summarization (encoder-decoder).
For encoder-only mode, we give an example of code search.
First, we obtain code fragment embeddings from UniXcoder.
# Encode maximum function
func = "def f(a,b): if a>b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,max_func_embedding = model(source_ids)
# Encode minimum function
func = "def f(a,b): if a<b: return a else return b"
tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,min_func_embedding = model(source_ids)
# Encode NL
nl = "return maximum value"
tokens_ids = model.tokenize([nl],max_length=512,mode="<encoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
tokens_embeddings,nl_embedding = model(source_ids)
print(max_func_embedding.shape)
print(max_func_embedding)
torch.Size([1, 768])
tensor([[ 8.6533e-01, -1.9796e+00, -8.6849e-01, 4.2652e-01, -5.3696e-01,
-1.5521e-01, 5.3770e-01, 3.4199e-01, 3.6305e-01, -3.9391e-01,
-1.1816e+00, 2.6010e+00, -7.7133e-01, 1.8441e+00, 2.3645e+00,
...,
-2.9188e+00, 1.2555e+00, -1.9953e+00, -1.9795e+00, 1.7279e+00,
6.4590e-01, -5.2769e-02, 2.4965e-01, 2.3962e-02, 5.9996e-02,
2.5659e+00, 3.6533e+00, 2.0301e+00]], device='cuda:0',
grad_fn=<DivBackward0>)
Now, we calculate the cosine similarity between the NL query and each of the two functions. Although the two functions differ by only a single operator (< versus >), UniXcoder can distinguish them: the maximum function scores higher against "return maximum value".
# Normalize embedding
norm_max_func_embedding = torch.nn.functional.normalize(max_func_embedding, p=2, dim=1)
norm_min_func_embedding = torch.nn.functional.normalize(min_func_embedding, p=2, dim=1)
norm_nl_embedding = torch.nn.functional.normalize(nl_embedding, p=2, dim=1)
max_func_nl_similarity = torch.einsum("ac,bc->ab",norm_max_func_embedding,norm_nl_embedding)
min_func_nl_similarity = torch.einsum("ac,bc->ab",norm_min_func_embedding,norm_nl_embedding)
print(max_func_nl_similarity)
print(min_func_nl_similarity)
tensor([[0.3002]], device='cuda:0', grad_fn=<ViewBackward>)
tensor([[0.1881]], device='cuda:0', grad_fn=<ViewBackward>)
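The normalize-then-`einsum` step above is plain cosine similarity, so ranking several candidate functions against one query comes down to sorting by that score. The following is a minimal sketch in pure Python, assuming each embedding has already been converted to a list of floats (for example with `embedding[0].detach().cpu().tolist()`). The helper names `cosine` and `rank_candidates` are illustrative and are not part of `unixcoder.py`, and the toy vectors are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query, candidates):
    """Return (label, score) pairs sorted from most to least similar.

    `candidates` maps a label to its embedding vector.
    """
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
query = [1.0, 0.0, 1.0]
candidates = {
    "max_func": [2.0, 0.1, 1.9],
    "min_func": [0.1, 2.0, 0.2],
}
ranking = rank_candidates(query, candidates)
print(ranking[0][0])  # label of the best-matching candidate
```

With real embeddings, the same ranking can be kept on the GPU by stacking the normalized candidate embeddings and reusing the `einsum` call shown earlier.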
For decoder-only mode, we give an example of code completion.
context = """
def f(data,file_path):
    # write json data into file_path in python language
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<decoder-only>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=True, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print(context+predictions[0][0])
def f(data,file_path):
    # write json data into file_path in python language
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
For encoder-decoder mode, we give three examples: function name prediction, API recommendation, and code summarization. In each, the span to predict is replaced with <mask0>.
# Function name prediction: mask the function name
context = """
def <mask0>(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])
['write_json', 'write_file', 'to_json']
# API recommendation: mask the API call
context = """
def write_json(data,file_path):
    data = <mask0>(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])
['json.dumps', 'json.loads', 'str']
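Both examples above follow the same recipe: replace the span to be predicted with `<mask0>` and let the decoder fill it in. A small helper can build such inputs from ordinary source code. This is only a sketch, under the assumption that masking the first occurrence of the target string is sufficient. `mask_first` is a hypothetical name and is not part of `unixcoder.py`.

```python
def mask_first(code, target, mask="<mask0>"):
    """Replace the first occurrence of `target` in `code` with `mask`.

    Raises ValueError if `target` does not occur, so that a typo does not
    silently produce an unmasked input.
    """
    if target not in code:
        raise ValueError(f"{target!r} not found in code")
    return code.replace(target, mask, 1)

code = """
def write_json(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""

name_context = mask_first(code, "write_json")  # input for function name prediction
api_context = mask_first(code, "json.dumps")   # input for API recommendation
print(api_context)
```

The resulting strings can be passed to `model.tokenize` with `mode="<encoder-decoder>"` exactly as in the hand-written examples.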
# Code summarization: mask a comment placed above the function
context = """
# <mask0>
def write_json(data,file_path):
    data = json.dumps(data)
    with open(file_path, 'w') as f:
        f.write(data)
"""
tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
source_ids = torch.tensor(tokens_ids).to(device)
prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
predictions = model.decode(prediction_ids)
print([x.replace("<mask0>","").strip() for x in predictions[0]])
['Write JSON to file', 'Write json to file', 'Write a json file']
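Each encoder-decoder example strips the `<mask0>` marker from the beam candidates by hand, and the summarization output contains two candidates that differ only in letter case. One possible way to factor this out is sketched below. `clean_predictions` is an illustrative helper, not part of `unixcoder.py`, and the case-insensitive deduplication is an assumption about what is useful rather than something the model card prescribes. The `raw` strings are made up to resemble the candidates for one input.

```python
def clean_predictions(beams, mask="<mask0>", dedupe_case=True):
    """Strip the mask token and surrounding whitespace from each candidate,
    drop empty results, and remove duplicates while keeping beam order."""
    seen = set()
    cleaned = []
    for text in beams:
        text = text.replace(mask, "").strip()
        key = text.lower() if dedupe_case else text
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned

# Hypothetical raw candidates shaped like predictions[0] above.
raw = [
    "<mask0>Write JSON to file",
    "<mask0>Write json to file",
    "<mask0>Write a json file",
]
print(clean_predictions(raw))  # ['Write JSON to file', 'Write a json file']
```

Because beam order is preserved, the first element of the returned list is still the highest-ranked candidate.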
If you use this code or UniXcoder, please consider citing us.
@article{guo2022unixcoder,
title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
journal={arXiv preprint arXiv:2203.03850},
year={2022}
}