# SentenceTransformers

## formed.integrations.sentence_transformers.analyzers

## formed.integrations.sentence_transformers.utils

### load_sentence_transformer (cached)

load_sentence_transformer(model_name_or_path, **kwargs)
Source code in src/formed/integrations/sentence_transformers/utils.py, lines 10-17.
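The `cached` marker indicates the loader memoizes models by name, so repeated steps that ask for the same model share one instance instead of reloading weights. A minimal sketch of that pattern, assuming memoization via `functools.lru_cache`; the returned dict is a hypothetical stand-in for the real `SentenceTransformer` object so the caching behavior can be shown offline:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def load_sentence_transformer(model_name_or_path: str, device: str = "cpu"):
    # Stand-in for SentenceTransformer(model_name_or_path, device=device);
    # a placeholder object is returned so no weights need to be downloaded.
    return {"model": model_name_or_path, "device": device}

a = load_sentence_transformer("all-MiniLM-L6-v2")
b = load_sentence_transformer("all-MiniLM-L6-v2")
print(a is b)  # True: the second call returns the cached instance
```

Because the arguments form the cache key, any keyword arguments passed through must be hashable for this scheme to work.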
## formed.integrations.sentence_transformers.workflow
Workflow steps for the Sentence Transformers integration.
This module provides workflow steps for loading, training, and converting sentence transformer models.
Available steps:

- `sentence_transformers::load`: Load a pre-trained sentence transformer model.
- `sentence_transformers::train`: Train a sentence transformer model.
- `sentence_transformers::convert_tokenizer`: Convert a sentence transformer tokenizer to a formed `Tokenizer` (requires the ml integration).
### SentenceTransformerFormat

Bases: `Generic[SentenceTransformerT]`, `Format[SentenceTransformerT]`
#### identifier (property)

Get the unique identifier for this format.

| RETURNS | DESCRIPTION |
|---|---|
| `str` | Format identifier string. |
#### write

write(artifact, directory)

Source code in src/formed/integrations/sentence_transformers/workflow.py, lines 42-43.
#### read

read(directory)

Source code in src/formed/integrations/sentence_transformers/workflow.py, lines 45-46.
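`write` and `read` mirror each other: `write` serializes the artifact into a step's output directory and `read` restores it. A hedged sketch of that roundtrip contract, using a dummy format and a JSON payload in place of the real model serialization (the actual class plausibly delegates to the model's own save/load machinery, which is not shown here):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

class DummyFormat:
    # Hypothetical stand-in for SentenceTransformerFormat; it writes a small
    # JSON file so the write/read symmetry can be demonstrated without a model.
    def write(self, artifact: dict, directory: Path) -> None:
        (directory / "artifact.json").write_text(json.dumps(artifact))

    def read(self, directory: Path) -> dict:
        return json.loads((directory / "artifact.json").read_text())

with TemporaryDirectory() as tmp:
    fmt = DummyFormat()
    fmt.write({"name": "all-MiniLM-L6-v2"}, Path(tmp))
    restored = fmt.read(Path(tmp))

print(restored)  # {'name': 'all-MiniLM-L6-v2'}
```

Whatever `write` produces in the directory, `read` must be able to reconstruct; that invariant is what lets the workflow cache and reload step outputs transparently.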
#### is_default_of (classmethod)

is_default_of(obj)

Check if this format is the default for the given object type.

| PARAMETER | DESCRIPTION |
|---|---|
| `obj` | Object to check. |

| RETURNS | DESCRIPTION |
|---|---|
| `bool` | True if this format should be used by default for this type. |

Source code in src/formed/workflow/format.py, lines 101-112.
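`is_default_of` lets the workflow choose a format automatically: each candidate format is asked whether it is the default handler for a given artifact. A minimal sketch of that dispatch, assuming a plain `isinstance` check (the real implementation in `src/formed/workflow/format.py` may inspect types differently; `ListFormat` and `pick_format` are hypothetical names):

```python
class Format:
    @classmethod
    def is_default_of(cls, obj) -> bool:
        # Base formats claim nothing by default.
        return False

class ListFormat(Format):
    # Hypothetical format that declares itself the default for lists.
    @classmethod
    def is_default_of(cls, obj) -> bool:
        return isinstance(obj, list)

def pick_format(obj, formats):
    # Return the first registered format claiming obj's type.
    for fmt in formats:
        if fmt.is_default_of(obj):
            return fmt
    raise TypeError(f"no default format for {type(obj).__name__}")

chosen = pick_format([1, 2, 3], [Format, ListFormat])
print(chosen.__name__)  # ListFormat
```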
### load_pretrained_model

load_pretrained_model(model_name_or_path, **kwargs)

Load a pre-trained sentence transformer model.

| PARAMETER | DESCRIPTION |
|---|---|
| `model_name_or_path` | Model identifier or path to model directory. |
| `**kwargs` | Additional arguments to pass to the SentenceTransformer constructor. |

| RETURNS | DESCRIPTION |
|---|---|
| `SentenceTransformer` | Loaded SentenceTransformer model. |

Source code in src/formed/integrations/sentence_transformers/workflow.py, lines 49-65.
### train_sentence_transformer

train_sentence_transformer(
    model,
    loss,
    args,
    dataset=None,
    loss_modifier=None,
    data_collator=None,
    tokenizer=None,
    evaluator=None,
    callbacks=None,
    model_init=None,
    compute_metrics=None,
    optimizers=(None, None),
    preprocess_logits_for_metrics=None,
    train_dataset_key="train",
    eval_dataset_key="validation",
)
Train a sentence transformer model.
This step trains a SentenceTransformer model using the provided loss function, datasets, and training arguments.
| PARAMETER | DESCRIPTION |
|---|---|
| `model` | SentenceTransformer model to train. |
| `loss` | Loss function(s) for training (single or mapping by dataset key). |
| `args` | Training arguments configuration. |
| `dataset` | Training/validation datasets. |
| `loss_modifier` | Optional modifier(s) to apply to the loss function. |
| `data_collator` | Optional data collator for batching. |
| `tokenizer` | Optional tokenizer. |
| `evaluator` | Optional evaluator(s) for validation. |
| `callbacks` | Optional training callbacks. |
| `model_init` | Optional model initialization function. |
| `compute_metrics` | Optional metrics computation function. |
| `optimizers` | Optional optimizer and learning rate scheduler. |
| `preprocess_logits_for_metrics` | Optional logits preprocessing function. |
| `train_dataset_key` | Key for the training dataset split. |
| `eval_dataset_key` | Key for the evaluation dataset split. |

| RETURNS | DESCRIPTION |
|---|---|
| `SentenceTransformer` | Trained SentenceTransformer model. |

Source code in src/formed/integrations/sentence_transformers/workflow.py, lines 68-182.
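Two of these parameters interact: `dataset` is a mapping of named splits, and `train_dataset_key`/`eval_dataset_key` pick which splits feed training and evaluation, while `loss` may be a single module or a mapping keyed the same way. A sketch of that resolution logic, as pure Python independent of the real trainer (`resolve_training_inputs` is a hypothetical helper, and strings stand in for real datasets and loss modules):

```python
def resolve_training_inputs(dataset, loss,
                            train_dataset_key="train",
                            eval_dataset_key="validation"):
    # Pull the named splits out of the dataset mapping; eval is optional.
    train_split = dataset[train_dataset_key]
    eval_split = dataset.get(eval_dataset_key)
    # A single loss applies everywhere; a mapping is keyed by dataset name.
    train_loss = loss[train_dataset_key] if isinstance(loss, dict) else loss
    return train_split, eval_split, train_loss

dataset = {"train": ["pair1", "pair2"], "validation": ["pair3"]}
losses = {"train": "MultipleNegativesRankingLoss"}
train, evals, loss_fn = resolve_training_inputs(dataset, losses)
print(train, evals, loss_fn)
```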
### convert_tokenizer

convert_tokenizer(
    model_name_or_path,
    pad_token=VALUE,
    unk_token=VALUE,
    bos_token=VALUE,
    eos_token=VALUE,
    freeze=True,
    accessor=None,
    characters=None,
    text_vector=None,
    token_vectors=None,
)
Convert a sentence transformer model's tokenizer to a formed Tokenizer.
This step extracts the tokenizer from a sentence transformer model and converts it into a formed Tokenizer with specified special tokens.
| PARAMETER | DESCRIPTION |
|---|---|
| `model_name_or_path` | Model identifier or path to model directory. |
| `pad_token` | Padding token (uses model default if not specified). |
| `unk_token` | Unknown token (uses model default if not specified). |
| `bos_token` | Beginning-of-sequence token (uses model default if not specified). |
| `eos_token` | End-of-sequence token (uses model default if not specified). |
| `freeze` | Whether to freeze the vocabulary. |
| `accessor` | Optional accessor for token extraction. |

| RETURNS | DESCRIPTION |
|---|---|
| `Tokenizer` | Converted formed Tokenizer. |

| RAISES | DESCRIPTION |
|---|---|
| `AssertionError` | If pad_token is not specified and not available in the model. |

Source code in src/formed/integrations/sentence_transformers/workflow.py, lines 197-263.
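The special-token parameters fall back to the model's own tokens when left unspecified, and the documented `AssertionError` fires when no pad token can be resolved either way. A sketch of that defaulting behavior, using a hypothetical dict of model defaults in place of the real tokenizer object (`resolve_special_tokens` is an illustrative name, not the library's API):

```python
def resolve_special_tokens(model_defaults, pad_token=None, unk_token=None,
                           bos_token=None, eos_token=None):
    # Fall back to the model's own special tokens when none are given.
    resolved = {
        "pad": pad_token or model_defaults.get("pad_token"),
        "unk": unk_token or model_defaults.get("unk_token"),
        "bos": bos_token or model_defaults.get("bos_token"),
        "eos": eos_token or model_defaults.get("eos_token"),
    }
    # Mirrors the documented contract: a pad token must exist one way or another.
    assert resolved["pad"] is not None, \
        "pad_token is not specified and not available in the model"
    return resolved

tokens = resolve_special_tokens({"pad_token": "[PAD]", "unk_token": "[UNK]"})
print(tokens["pad"])  # [PAD]
```

Passing an explicit `pad_token` would take precedence over the model default, and omitting both triggers the assertion.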