by llmware
industry-bert-contracts-v0.1 is a language model specialized in the legal and contractual domain, based on a BERT architecture fine-tuned to generate semantic embeddings. Designed as a drop-in solution, it excels at analyzing and understanding contractual documents in English, providing precise vector representations for tasks such as information retrieval, classification, or clause comparison. Its training on a large corpus of publicly available contracts gives it expertise in the issues and vocabulary specific to legal documents, setting it apart from general-purpose models. Well suited to automating business processes in law, finance, or compliance, it helps extract relevant insights while reducing interpretation errors. Unlike generic models, it is optimized to capture the nuances and subtleties specific to contractual texts.
industry-bert-contracts-v0.1 is part of a series of industry-fine-tuned sentence_transformer embedding models.
industry-bert-contracts-v0.1 is a domain fine-tuned, BERT-based Sentence Transformer model producing 768-dimensional embeddings, intended as a "drop-in" substitute for use in the contractual and legal domains. This model was trained on a wide range of publicly available commercial contracts, including open source contract datasets.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("llmware/industry-bert-contracts-v0.1")
model = AutoModel.from_pretrained("llmware/industry-bert-contracts-v0.1")
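As a usage sketch (not taken from the model card), the snippet below mean-pools the token embeddings to obtain one 768-dimensional vector per sentence and compares two clauses with cosine similarity. The mean-pooling choice, the embed helper, and the example clauses are assumptions for illustration, not documented behavior of this checkpoint.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("llmware/industry-bert-contracts-v0.1")
model = AutoModel.from_pretrained("llmware/industry-bert-contracts-v0.1")

def embed(sentences):
    # Tokenize a batch of contract sentences
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings, ignoring padding positions
    # (pooling strategy is an assumption, not specified by the card)
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

clauses = [
    "The Supplier shall indemnify the Buyer against all third-party claims.",
    "Either party may terminate this Agreement upon thirty days written notice.",
]
vectors = embed(clauses)
# Cosine similarity between the two clause embeddings
similarity = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(vectors.shape, similarity.item())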
This is a semantic embedding model, fine-tuned on public domain contracts and related documents. Results may vary if it is used outside of this domain, and like any embedding model, there is always the potential for anomalies in the vector embedding space. No specific safeguards have been put in place for safety or to mitigate potential bias in the dataset.
This model was fine-tuned using a custom self-supervised procedure and a custom dataset, combining contrastive techniques with stochastic injections of distortions into the samples. The methodology was derived from, adapted from, and primarily inspired by the three research papers cited below: TSDAE (Wang, Reimers & Gurevych), DeCLUTR (Giorgi et al.), and Contrastive Tension (Carlsson et al.).
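For illustration only, the sketch below shows the kind of stochastic distortion (here, TSDAE-style random token deletion) that such a protocol might inject into training samples. The helper name, the 0.6 deletion ratio, and the use of deletion noise alone are assumptions; they do not describe llmware's actual procedure, which also combined contrastive objectives.

import random

def inject_deletion_noise(tokens, deletion_ratio=0.6):
    # Hypothetical helper: randomly drop a fraction of tokens so the
    # encoder must recover the sentence meaning from a corrupted view.
    kept = [t for t in tokens if random.random() > deletion_ratio]
    # Keep at least one token so the corrupted sample is never empty.
    return kept if kept else [random.choice(tokens)]

sentence = "The Supplier shall indemnify the Buyer against all third-party claims".split()
print(inject_deletion_noise(sentence))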
A custom self-supervised training protocol was used to train the model, derived from and inspired by the following papers:
@article{wang-2021-TSDAE,
  title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning",
  author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna",
  journal = "arXiv preprint arXiv:2104.06979",
  month = "4",
  year = "2021",
  url = "https://arxiv.org/abs/2104.06979",
}
@inproceedings{giorgi-etal-2021-declutr,
  title = {{D}e{CLUTR}: Deep Contrastive Learning for Unsupervised Textual Representations},
  author = {Giorgi, John and Nitski, Osvald and Wang, Bo and Bader, Gary},
  year = 2021,
  month = aug,
  booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
  publisher = {Association for Computational Linguistics},
  address = {Online},
  pages = {879--895},
  doi = {10.18653/v1/2021.acl-long.72},
  url = {https://aclanthology.org/2021.acl-long.72},
}
@article{Carlsson-2021-CT,
  title = {Semantic Re-tuning with Contrastive Tension},
  author = {Carlsson, Fredrik and Cuba Gyllensten, Amaru and Gogoulou, Evangelia and Ylipää Hellqvist, Erik and Sahlgren, Magnus},
  year = {2021},
  month = {January},
}
Darren Oberst @ llmware