Causal Language Modeling with 🤗 Transformers¶
This tutorial shows you how to fine-tune a pre-trained language model using formed's Transformers integration. Unlike custom model development, this tutorial focuses on using formed's built-in workflow steps to orchestrate training with Hugging Face Transformers.
What you'll build: A causal language model fine-tuned on question-answer data, using DistilGPT-2 as the base model.
What you'll learn:
- Loading datasets with the datasets integration
- Tokenizing text data for language modeling
- Fine-tuning transformer models with transformers::train_model
- Configuring training arguments and data collators
- Tracking experiments with the MLflow integration
Prerequisites¶
Install formed with required integrations:
pip install formed[transformers,datasets,mlflow]
Note: The transformers integration provides seamless access to Hugging Face's ecosystem, including pre-trained models, tokenizers, and training utilities.
What is Causal Language Modeling?¶
Causal language modeling (CLM) trains models to predict the next token given previous tokens. This is the training objective used by GPT-style models.
Key characteristics:
- Models see only left context (previous tokens)
- Commonly used for text generation tasks
- Training uses the language modeling head with cross-entropy loss (illustrated in the sketch below)
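To make the objective concrete, here is a minimal PyTorch sketch of next-token cross-entropy. The token IDs and logits below are stand-ins for illustration, not output from a real model:

import torch
import torch.nn.functional as F

vocab_size = 50257                                        # GPT-2 vocabulary size
input_ids = torch.tensor([[464, 3616, 286, 1204, 318]])   # example token IDs (illustrative)
logits = torch.randn(1, input_ids.size(1), vocab_size)    # stand-in for model output

# Position t is scored against the token at position t + 1 (left context only)
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
loss = F.cross_entropy(shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1))
print(loss.item())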
Project Setup¶
Create a new directory:
mkdir causallm_tutorial
cd causallm_tutorial
We'll create two files:
- config.jsonnet - Workflow configuration
- formed.yml - Project settings
Step 1: Configure Project Settings¶
Create formed.yml:
workflow:
  organizer:
    type: mlflow
    log_execution_metrics: true
required_modules:
  - formed.integrations.mlflow
  - formed.integrations.datasets
  - formed.integrations.transformers
What this does:
- MLflow organizer: Automatically tracks all experiments, metrics, and model artifacts
- Required modules: Imports integrations that provide workflow steps
  - datasets: Load data from Hugging Face Hub or local files
  - transformers: Tokenization and model training
  - mlflow: Experiment tracking
Step 2: Define the Workflow¶
Create config.jsonnet:
// Define model and tokenizer configurations at the top for reusability
local base_model = {
  type: 'transformers:AutoModelForCausalLM.from_pretrained',
  pretrained_model_name_or_path: 'distilbert/distilgpt2',
};

local tokenizer = {
  type: 'transformers:AutoTokenizer.from_pretrained',
  pretrained_model_name_or_path: base_model.pretrained_model_name_or_path,
  pad_token: '<|endoftext|>',
};

{
  steps: {
    // Step 1: Load dataset from Hugging Face Hub
    train_dataset: {
      type: 'datasets::load',
      path: 'sentence-transformers/eli5',
      split: 'train[:10000]',
    },

    // Step 2: Tokenize text data
    tokenized_dataset: {
      type: 'transformers::tokenize',
      dataset: { type: 'ref', ref: 'train_dataset' },
      tokenizer: tokenizer,
      text_column: 'answer',
    },

    // Step 3: Train the model
    trained_model: {
      type: 'transformers::train_model',
      model: base_model,
      dataset: { type: 'ref', ref: 'tokenized_dataset' },

      // Training arguments (passed to transformers.TrainingArguments)
      args: {
        per_device_train_batch_size: 8,
        per_device_eval_batch_size: 8,
        learning_rate: 2e-5,
        warmup_ratio: 0.1,
        num_train_epochs: 3,
        fp16: false,
        bf16: true,  // Use bfloat16 for training (requires compatible hardware)
        report_to: 'none',  // Don't report to external trackers (we use MLflow)
        do_train: true,
        do_eval: false,  // No validation set in this example
        save_strategy: 'steps',
        save_steps: 100,
        save_total_limit: 2,
        eval_strategy: 'no',
        logging_strategy: 'steps',
        logging_first_step: true,
        logging_steps: 10,
      },

      // Data collator for language modeling
      data_collator: {
        type: 'transformers:DataCollatorForLanguageModeling',
        tokenizer: tokenizer,
        mlm: false,  // Use causal LM (not masked LM)
      },

      // Processing class for tokenization during training
      processing_class: tokenizer,

      // Callbacks for experiment tracking
      callbacks: [
        {
          type: 'formed.integrations.transformers.training:MlflowTrainerCallback',
        },
      ],
    },
  },
}
Let's break down each component:
Step 1: Loading Data with datasets::load¶
train_dataset: {
  type: 'datasets::load',
  path: 'sentence-transformers/eli5',
  split: 'train[:10000]',
}
What happens:
- Loads the ELI5 (Explain Like I'm 5) dataset from Hugging Face Hub
- Takes first 10,000 examples from the training split
- Returns a datasets.Dataset object
Key parameters:
- path: Dataset name on Hugging Face Hub or path to a local dataset
- split: Which split to load (supports slice notation like train[:1000])
- Additional kwargs are passed to datasets.load_dataset()
Dataset format: The ELI5 dataset contains question-answer pairs. We'll use the answer field for language modeling.
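Conceptually, the step behaves like calling datasets.load_dataset directly. A quick sketch (not formed's actual implementation; the column names assume the ELI5 dataset described above):

from datasets import load_dataset

# Load the first 10,000 training examples, mirroring the datasets::load step above
dataset = load_dataset("sentence-transformers/eli5", split="train[:10000]")
print(dataset.num_rows)           # 10000
print(dataset.column_names)       # expected to include 'question' and 'answer'
print(dataset[0]["answer"][:80])  # preview of the text used for language modeling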
Step 2: Tokenizing with transformers::tokenize¶
tokenized_dataset: {
  type: 'transformers::tokenize',
  dataset: { type: 'ref', ref: 'train_dataset' },
  tokenizer: tokenizer,
  text_column: 'answer',
}
What happens:
- Applies tokenization to the specified text column
- Converts text to token IDs compatible with the model
- Removes the original text column (keeps only token IDs)
- Returns a tokenized datasets.Dataset
Key parameters:
- dataset: Input dataset (reference to a previous step)
- tokenizer: Tokenizer configuration or pre-loaded tokenizer
- text_column: Name of the column containing text to tokenize
- padding: Padding strategy (default: false; padding is handled by the data collator)
- truncation: Whether to truncate sequences
- max_length: Maximum sequence length
Tokenizer configuration:
local tokenizer = {
  type: 'transformers:AutoTokenizer.from_pretrained',
  pretrained_model_name_or_path: 'distilbert/distilgpt2',
  pad_token: '<|endoftext|>',
};
This loads DistilGPT-2's tokenizer and sets the padding token (GPT-2 doesn't have one by default).
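In plain Python, this configuration corresponds roughly to the following sketch (the extra pad_token kwarg is forwarded to the tokenizer):

from transformers import AutoTokenizer

# Load DistilGPT-2's tokenizer and reuse the end-of-text token for padding
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert/distilgpt2",
    pad_token="<|endoftext|>",
)
encoded = tokenizer("Explain why the sky is blue.")
print(encoded["input_ids"])
print(tokenizer.pad_token, tokenizer.pad_token_id)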
Step 3: Training with transformers::train_model¶
trained_model: {
  type: 'transformers::train_model',
  model: base_model,
  dataset: { type: 'ref', ref: 'tokenized_dataset' },
  args: { ... },
  data_collator: { ... },
  processing_class: tokenizer,
  callbacks: [ ... ],
}
What happens:
- Initializes the model from the pre-trained checkpoint
- Creates a transformers.Trainer with the specified arguments (roughly the Python sketch shown after this list)
- Trains the model on the tokenized dataset
- Saves checkpoints according to save_strategy
- Returns the trained model
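Conceptually, the step wires these pieces into a transformers.Trainer. A rough, self-contained sketch of that wiring (not formed's actual code; the tiny in-memory dataset only stands in for the tokenized ELI5 data):

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("distilbert/distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2", pad_token="<|endoftext|>")

# Stand-in for the tokenized_dataset step (normally the 10k tokenized ELI5 answers)
tokenized_dataset = Dataset.from_dict(
    dict(tokenizer(["The sky is blue because of Rayleigh scattering.", "Water is wet."]))
)

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    processing_class=tokenizer,  # requires a recent transformers version
)
trainer.train()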
Key parameters:
Model Configuration¶
model: {
  type: 'transformers:AutoModelForCausalLM.from_pretrained',
  pretrained_model_name_or_path: 'distilbert/distilgpt2',
}
Uses AutoModelForCausalLM to load a model with a causal language modeling head.
Training Arguments¶
The args field accepts any parameters from transformers.TrainingArguments:
Batch size and epochs:
- per_device_train_batch_size: Batch size per GPU/CPU
- num_train_epochs: Number of training epochs
Optimization:
- learning_rate: Learning rate for the optimizer (default: 5e-5)
- warmup_ratio: Fraction of steps used for learning rate warmup
Mixed precision:
- fp16: Use float16 (older GPUs)
- bf16: Use bfloat16 (newer GPUs, more numerically stable)
Checkpointing:
- save_strategy: When to save ("steps", "epoch", "no")
- save_steps: Save a checkpoint every N steps
- save_total_limit: Keep only the N most recent checkpoints
Logging:
- logging_strategy: When to log ("steps", "epoch")
- logging_steps: Log every N steps
- logging_first_step: Whether to log after the first step
Data Collator¶
data_collator: {
  type: 'transformers:DataCollatorForLanguageModeling',
  tokenizer: tokenizer,
  mlm: false,
}
The data collator handles batching and prepares labels:
DataCollatorForLanguageModeling:
- mlm: false: Causal language modeling (predict the next token)
- mlm: true: Masked language modeling (BERT-style, predict masked tokens)
For causal LM:
- Copies input_ids to create labels (the model shifts them internally to predict the next token); see the sketch below
- Applies padding to create a uniform batch size, setting padded label positions to -100 so they are ignored by the loss
- Handles attention masks automatically
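A small sketch of what the collator produces for a batch of variable-length examples (assuming the DistilGPT-2 tokenizer configured above):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2", pad_token="<|endoftext|>")
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

features = [tokenizer("Hello world"), tokenizer("A somewhat longer example sentence")]
batch = collator(features)

print(batch["input_ids"].shape)  # both rows padded to the longest sequence in the batch
print(batch["labels"])           # copies of input_ids, with padded positions set to -100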
Callbacks¶
callbacks: [
  {
    type: 'formed.integrations.transformers.training:MlflowTrainerCallback',
  },
]
MlflowTrainerCallback:
- Logs training metrics to MLflow automatically
- Tracks loss, learning rate, and other training stats
- Integrates seamlessly with formed's MLflow organizer
Step 3: Run the Workflow¶
Execute the workflow:
formed workflow run config.jsonnet --execution-id causallm-distilgpt2
What happens during execution:
- Dataset loading: Downloads and caches the ELI5 dataset
- Tokenization: Tokenizes all examples and caches results
- Training: Runs the training loop with the Hugging Face Trainer
  - Logs metrics every 10 steps
  - Saves checkpoints every 100 steps
  - Uses bfloat16 mixed precision
- Model saving: Caches the trained model by fingerprint
Step 4: View Training Results¶
Launch MLflow UI:
mlflow ui
Open http://localhost:5000 to see:
Metrics:
- Training loss curve
- Learning rate schedule
- Steps per second
Parameters:
- All training arguments
- Model architecture
- Dataset configuration
Artifacts:
- Trained model checkpoints
- Tokenizer files
- Training logs
Understanding the Components¶
The datasets Integration¶
The datasets integration provides workflow steps for working with Hugging Face datasets:
Available steps:
- datasets::load - Load datasets from the Hub or local files
- datasets::compose - Combine multiple datasets into a DatasetDict
- datasets::concatenate - Concatenate datasets
- datasets::train_test_split - Split a dataset into train/test
Benefits:
- Automatic caching of downloaded datasets
- Memory-efficient processing with Apache Arrow
- Seamless integration with transformers
The transformers Integration¶
The transformers integration wraps Hugging Face Transformers for workflow use:
Key steps:
- transformers::tokenize - Tokenize text data
- transformers::train_model - Train models with the Trainer API
- transformers::load_model - Load pre-trained models
- transformers::load_tokenizer - Load tokenizers
- transformers::convert_tokenizer - Convert to formed's Tokenizer format
Benefits:
- Access to thousands of pre-trained models
- Battle-tested training infrastructure
- Automatic gradient accumulation, mixed precision, and distributed training
Data Collators¶
Data collators prepare batches during training:
DataCollatorForLanguageModeling:
- Handles causal and masked language modeling
- Creates labels automatically from inputs
- Applies dynamic padding for efficiency
Other common collators:
- DataCollatorWithPadding - Simple padding without label generation
- DataCollatorForSeq2Seq - For encoder-decoder models
- DataCollatorForTokenClassification - For NER and similar tasks
Customization Examples¶
Use a Different Model¶
Replace DistilGPT-2 with another model:
local base_model = {
  type: 'transformers:AutoModelForCausalLM.from_pretrained',
  pretrained_model_name_or_path: 'gpt2',  // or 'gpt2-medium', 'EleutherAI/gpt-neo-125M', etc.
};
Add Validation Set¶
Split the dataset and enable evaluation:
{
  steps: {
    raw_dataset: {
      type: 'datasets::load',
      path: 'sentence-transformers/eli5',
      split: 'train[:10000]',
    },

    // Split into train and validation
    split_dataset: {
      type: 'datasets::train_test_split',
      dataset: { type: 'ref', ref: 'raw_dataset' },
      test_size: 0.1,
      seed: 42,
    },

    // Tokenize both splits
    tokenized_train: {
      type: 'transformers::tokenize',
      dataset: { type: 'ref', ref: 'split_dataset.train' },
      tokenizer: tokenizer,
      text_column: 'answer',
    },
    tokenized_val: {
      type: 'transformers::tokenize',
      dataset: { type: 'ref', ref: 'split_dataset.test' },
      tokenizer: tokenizer,
      text_column: 'answer',
    },

    // Combine for training
    dataset: {
      type: 'datasets::compose',
      train: { type: 'ref', ref: 'tokenized_train' },
      validation: { type: 'ref', ref: 'tokenized_val' },
    },

    trained_model: {
      type: 'transformers::train_model',
      // ...
      dataset: { type: 'ref', ref: 'dataset' },
      args: {
        // ...
        do_eval: true,
        eval_strategy: 'steps',
        eval_steps: 100,
      },
    },
  },
}
Adjust Training Settings¶
Longer training with more frequent evaluation:
args: {
  num_train_epochs: 5,
  eval_strategy: 'steps',
  eval_steps: 50,
  logging_steps: 5,
}
Larger batch size with gradient accumulation:
args: {
  per_device_train_batch_size: 4,
  gradient_accumulation_steps: 4,  // Effective batch size: 16
  learning_rate: 1e-5,
}
Different optimizer:
trained_model: {
  type: 'transformers::train_model',
  // ...
  args: {
    // ...
    optim: 'adamw_torch',  // or 'adafactor', 'adamw_8bit', etc.
    weight_decay: 0.01,
  },
}
Custom Learning Rate Schedule¶
args: {
  learning_rate: 5e-5,
  lr_scheduler_type: 'cosine',
  warmup_steps: 500,
}
Truncate Long Sequences¶
tokenized_dataset: {
  type: 'transformers::tokenize',
  dataset: { type: 'ref', ref: 'train_dataset' },
  tokenizer: tokenizer,
  text_column: 'answer',
  truncation: true,
  max_length: 512,
}
Using the Trained Model¶
After training, you can load the cached model for inference:
from formed.settings import load_formed_settings
from formed.workflow import WorkflowExecutionID
# Load the workflow execution
settings = load_formed_settings("./formed.yml")
organizer = settings.workflow.organizer
context = organizer.get(WorkflowExecutionID("your-execution-id"))
# Get the trained model from cache
model_step_id = context.info.graph["trained_model"]
model = context.cache[model_step_id]
# Load tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilgpt2")
# Generate text
inputs = tokenizer("The meaning of life is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))
Or reference the model in a new workflow step:
{
  steps: {
    // ... training steps ...

    // Generate text using trained model
    generated_text: {
      type: 'my_custom::generate',
      model: { type: 'ref', ref: 'trained_model' },
      prompts: ['The meaning of life is', 'Once upon a time'],
      max_length: 100,
    },
  },
}
Key Takeaways¶
Workflow Steps:
- datasets::load loads datasets from the Hugging Face Hub
- transformers::tokenize prepares text for model input
- transformers::train_model orchestrates training with the Trainer API
- All steps are cached by fingerprint for reproducibility
Training Configuration:
- TrainingArguments control all aspects of training
- Data collators handle batch preparation and label creation
- Callbacks enable custom logging and monitoring
MLflow Integration:
- Automatic experiment tracking
- Metrics, parameters, and artifacts logged transparently
- Easy comparison across training runs
Workflow Benefits:
- No custom Python code needed for standard tasks
- Configuration-driven experimentation
- Automatic caching and dependency management
- Seamless integration with Hugging Face ecosystem
Next Steps¶
Fine-tune on Custom Data¶
Replace the dataset with your own:
train_dataset: {
  type: 'datasets::load',
  path: '/path/to/your/dataset.jsonl',
}
Your data should be in a format supported by Hugging Face datasets (JSON, CSV, Parquet, etc.).
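Before wiring a file into the workflow, you can sanity-check that 🤗 Datasets reads it. A sketch (the path and column names are placeholders):

from datasets import load_dataset

# Each JSONL line should be one record, e.g. {"answer": "some text to model"}
dataset = load_dataset("json", data_files="/path/to/your/dataset.jsonl", split="train")
print(dataset.column_names)  # should include the column you pass as text_column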
Try Masked Language Modeling¶
For BERT-style models:
local base_model = {
  type: 'transformers:AutoModelForMaskedLM.from_pretrained',
  pretrained_model_name_or_path: 'bert-base-uncased',
};

// ...

data_collator: {
  type: 'transformers:DataCollatorForLanguageModeling',
  tokenizer: tokenizer,
  mlm: true,
  mlm_probability: 0.15,
}
Further Reading¶
- Text Classification Tutorial: Build custom models with PyTorch
- Transformers Documentation: Complete Hugging Face Transformers guide
- Workflow Guide: Advanced workflow patterns
- MLflow Integration: Experiment tracking details
For more examples, see examples/causallm/ in the repository.