CodeBERT-base by Microsoft (open source)
CodeBERT-base is a pre-trained language model designed to understand both programming language and natural language text. It supports tasks such as code search, generating documentation from code, and code completion by leveraging a joint understanding of both modalities. Its training approach combines masked language modeling (MLM) with replaced token detection (RTD), a discriminative objective that distinguishes original tokens from plausible replacements, which helps it capture relationships between code and textual descriptions. The model handles multiple programming languages while maintaining strong performance, making it useful for developers and researchers who want to automate code-related tasks or build programming assistance tools.
Pretrained weights for CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
The model is trained on the bi-modal data (documentation & code pairs) of CodeSearchNet.
It is initialized with RoBERTa-base and trained with the MLM+RTD objective (cf. the paper).
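Because the checkpoint is published on the Hugging Face Hub as `microsoft/codebert-base`, it can be loaded with the standard `transformers` auto classes. The sketch below shows how a natural-language query and a code snippet can be encoded together as a bi-modal pair; the example strings are illustrative, and the first token's hidden state is used here simply as a summary vector (downstream tasks typically fine-tune on top of it).

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

# Bi-modal input: a natural-language description paired with a code snippet.
nl = "return maximum value"
code = "def max(a, b): return a if a > b else b"
inputs = tokenizer(nl, code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Hidden state of the first token, often used as a sequence-level summary.
embedding = outputs.last_hidden_state[:, 0]
print(embedding.shape)  # torch.Size([1, 768]) — RoBERTa-base hidden size
```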
Please see the official repository for scripts that support "code search" and "code-to-document generation".
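The official repository contains the full fine-tuning pipelines for those tasks. As a rough illustration of the code-search idea only, the sketch below ranks candidate snippets against a query by cosine similarity of their encoder embeddings. Note that raw, un-fine-tuned CodeBERT embeddings are not expected to match the quality of the fine-tuned models in the official scripts; the query and snippets here are made-up examples.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

def embed(text: str) -> torch.Tensor:
    """Encode text and return the first token's hidden state as a vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state[:, 0].squeeze(0)

query = "sort a list in descending order"
snippets = [
    "def f(xs): return sorted(xs, reverse=True)",
    "def g(path): return open(path).read()",
]

# Score each snippet by cosine similarity to the query embedding.
q = embed(query)
scores = [F.cosine_similarity(q, embed(s), dim=0).item() for s in snippets]
best = snippets[scores.index(max(scores))]
print(best)
```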
@misc{feng2020codebert,
      title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
      author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
      year={2020},
      eprint={2002.08155},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}