Hugging Face Skill

Hugging Face Transformers reference, generated from the official documentation.

When to Use This Skill

This skill should be triggered when:

Working with Hugging Face Transformers
Asking about Hugging Face features or APIs
Implementing Hugging Face solutions
Debugging Hugging Face code
Learning Hugging Face best practices

Quick Reference

Common Patterns

Pattern 1: Configuration

The base class PreTrainedConfig implements the common methods for loading/saving a configuration either from a local file or directory, or from a pretrained model configuration provided by the library (downloaded from Hugging Face's AWS S3 repository). Each derived config class implements model specific attributes. Common attributes present in all config classes are: hidden_size, num_attention_heads, and num_hidden_layers. Text models further implement: vocab_size.

PreTrainedConfig

class transformers.PreTrainedConfig(output_hidden_states: bool = False, output_attentions: bool = False, return_dict: bool = True, dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None, chunk_size_feed_forward: int = 0, is_encoder_decoder: bool = False, architectures: list[str] | None = None, id2label: dict[int, str] | None = None, label2id: dict[str, int] | None = None, num_labels: int | None = None, problem_type: str | None = None, **kwargs)

Parameters

name_or_path (str, optional, defaults to ""): Stores the string that was passed to PreTrainedModel.from_pretrained() as pretrained_model_name_or_path if the configuration was created with such a method.
output_hidden_states (bool, optional, defaults to False): Whether or not the model should return all hidden-states.
output_attentions (bool, optional, defaults to False): Whether or not the model should return all attentions.
return_dict (bool, optional, defaults to True): Whether or not the model should return a ModelOutput instead of a plain tuple.
is_encoder_decoder (bool, optional, defaults to False): Whether the model is used as an encoder/decoder or not.
chunk_size_feed_forward (int, optional, defaults to 0): The chunk size of all feed forward layers in the residual attention blocks. A chunk size of 0 means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time. For more information on feed forward chunking, see "How does Feed Forward Chunking work?".
Parameters for fine-tuning tasks

architectures (list[str], optional): Model architectures that can be used with the model pretrained weights.
id2label (dict[int, str], optional): A map from index (for instance prediction index, or target index) to label.
label2id (dict[str, int], optional): A map from label to index for the model.
num_labels (int, optional): Number of labels to use in the last layer added to the model, typically for a classification task.
problem_type (str, optional): Problem type for XxxForSequenceClassification models. Can be one of "regression", "single_label_classification" or "multi_label_classification".

PyTorch specific parameters

dtype (str, optional): The dtype of the weights. This attribute can be used to initialize the model to a non-default dtype (which is normally float32) and thus allow for optimal storage allocation. For example, if the saved model is float16, ideally we want to load it back using the minimal amount of memory needed to load float16 weights.

Base class for all configuration classes. Handles a few parameters common to all models' configurations as well as methods for loading/downloading/saving configurations.

A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does not load the model weights. It only affects the model's configuration.

Class attributes (overridden by derived classes):

model_type (str): An identifier for the model type, serialized into the JSON file, and used to recreate the correct object in AutoConfig.
has_no_defaults_at_init (bool): Whether the config class can be initialized without providing input arguments. Some configurations require inputs to be defined at init and have no default values. These are usually (but not necessarily) composite configs such as EncoderDecoderConfig or RagConfig, which have to be initialized from two or more configs of type PreTrainedConfig.
keys_to_ignore_at_inference (list[str]): A list of keys to ignore by default when looking at dictionary outputs of the model during inference.
attribute_map (dict[str, str]): A dict that maps model specific attribute names to the standardized naming of attributes.
base_model_tp_plan (dict[str, Any]): A dict that maps sub-module FQNs of a base model to a tensor parallel plan applied to the sub-module when model.tensor_parallel is called.
base_model_pp_plan (dict[str, tuple[list[str]]]): A dict that maps child-modules of a base model to a pipeline parallel plan that enables users to place the child-module on the appropriate device.

Common attributes (present in all subclasses):

vocab_size (int): The number of tokens in the vocabulary, which is also the first dimension of the embeddings matrix (this attribute may be missing for models that don't have a text modality like ViT).
hidden_size (int): The hidden size of the model.
num_attention_heads (int): The number of attention heads used in the multi-head attention layers of the model.
num_hidden_layers (int): The number of blocks in the model.

Setting parameters for sequence generation in the model config is deprecated. For backward compatibility, loading some of them will still be possible, but attempting to overwrite them will throw an exception; you should set them in a GenerationConfig. Check the documentation of GenerationConfig for more information about the individual parameters.

push_to_hub(repo_id: str, commit_message: str | None = None, commit_description: str | None = None, private: bool | None = None, token: bool | str | None = None, revision: str | None = None, create_pr: bool = False, max_shard_size: int | str | None = '50GB', tags: list[str] | None = None)

Parameters

repo_id (str): The name of the repository you want to push your config to. It should contain your organization name when pushing to a given organization.
commit_message (str, optional): Message to commit while pushing. Will default to "Upload config".
commit_description (str, optional): The description of the commit that will be created.
private (bool, optional): Whether to make the repo private. If None (default), the repo will be public unless the organization's default is private. This value is ignored if the repo already exists.
token (bool or str, optional): The token to use as HTTP bearer authorization for remote files. If True (default), will use the token generated when running hf auth login (stored in ~/.huggingface).
revision (str, optional): Branch to push the uploaded files to.
create_pr (bool, optional, defaults to False): Whether to create a PR with the uploaded files or commit directly.
max_shard_size (int or str, optional, defaults to "50GB"): Only applicable for models. The maximum size for a checkpoint before being sharded. Each checkpoint shard will then be smaller than this size. If expressed as a string, it needs to be digits followed by a unit (like "5MB").
tags (list[str], optional): List of tags to push on the Hub.

Upload the configuration file to the 🤗 Model Hub.

Examples:

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("google-bert/bert-base-cased")

    # Push the config to your namespace with the name "my-finetuned-bert".
    config.push_to_hub("my-finetuned-bert")

    # Push the config to an organization with the name "my-finetuned-bert".
    config.push_to_hub("huggingface/my-finetuned-bert")

dict_dtype_to_str(d: dict)

Checks whether the passed dictionary and its nested dicts have a dtype key and, if it is not None, converts the torch.dtype to a string of just the type. For example, torch.float32 is converted into the string "float32", which can then be stored in the JSON format.
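The dtype conversion described for dict_dtype_to_str can be sketched in plain Python. This is an illustration of the described behavior only, not the library's implementation: FakeDType is a hypothetical stand-in for torch.dtype (so the snippet runs without PyTorch), and dtype_to_str_sketch is a hypothetical helper name.

```python
from typing import Any


class FakeDType:
    """Hypothetical stand-in for torch.dtype; str() mimics e.g. "torch.float32"."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __str__(self) -> str:
        return f"torch.{self.name}"


def dtype_to_str_sketch(d: dict[str, Any]) -> None:
    """Replace non-None, non-string "dtype" values with their short name, in place."""
    if d.get("dtype") is not None and not isinstance(d["dtype"], str):
        # "torch.float32" -> "float32"
        d["dtype"] = str(d["dtype"]).split(".")[1]
    # Recurse into nested dicts (for example sub-configs of a composite config).
    for value in d.values():
        if isinstance(value, dict):
            dtype_to_str_sketch(value)


config_dict = {
    "dtype": FakeDType("float32"),
    "text_config": {"dtype": FakeDType("float16"), "hidden_size": 768},
    "vision_config": {"dtype": None},
}
dtype_to_str_sketch(config_dict)
print(config_dict)
```

After the call, every non-None dtype entry is a plain string, so the dictionary can be passed to a JSON serializer.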
from_dict(config_dict: dict, **kwargs) → PreTrainedConfig

Parameters

config_dict (dict[str, Any]): Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the get_config_dict() method.
kwargs (dict[str, Any]): Additional parameters from which to initialize the configuration object.

Returns

PreTrainedConfig: The configuration object instantiated from those parameters.

Instantiates a PreTrainedConfig from a Python dictionary of parameters.

from_json_file(json_file: str | os.PathLike) → PreTrainedConfig

Parameters

json_file (str or os.PathLike): Path to the JSON file containing the parameters.

Returns

PreTrainedConfig: The configuration object instantiated from that JSON file.

Instantiates a PreTrainedConfig from the path to a JSON file of parameters.

from_pretrained(pretrained_model_name_or_path: str | os.PathLike, cache_dir: str | os.PathLike | None = None, force_download: bool = False, local_files_only: bool = False, token: str | bool | None = None, revision: str = 'main', **kwargs) → PreTrainedConfig

Parameters

pretrained_model_name_or_path (str or os.PathLike): This can be either:
  a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co.
  a path to a directory containing a configuration file saved using the save_pretrained() method, e.g., ./my_model_directory/.
  a path or url to a saved configuration JSON file, e.g., ./my_model_directory/configuration.json.
cache_dir (str or os.PathLike, optional): Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
force_download (bool, optional, defaults to False): Whether or not to force a (re-)download of the configuration files and override the cached versions if they exist.
proxies (dict[str, str], optional): A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
token (str or bool, optional): The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running hf auth login (stored in ~/.huggingface).
revision (str, optional, defaults to "main"): The specific model version to use. It can be a branch name, a tag name, or a commit id. Because models and other artifacts on huggingface.co are stored in a git-based system, revision can be any identifier allowed by git. To test a pull request you made on the Hub, you can pass revision="refs/pr/<pr_number>".
return_unused_kwargs (bool, optional, defaults to False): If False, this function returns just the final configuration object. If True, this function returns a Tuple(config, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the part of kwargs which has not been used to update config and is otherwise ignored.
subfolder (str, optional, defaults to ""): In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
kwargs (dict[str, Any], optional): The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter.

Returns

PreTrainedConfig: The configuration object instantiated from this pretrained model.

Instantiates a PreTrainedConfig (or a derived class) from a pretrained model configuration.
Examples:

    from transformers import BertConfig

    # The base class PreTrainedConfig can't be instantiated directly, so the examples use a
    # derived class: BertConfig.

    # Download configuration from huggingface.co and cache.
    config = BertConfig.from_pretrained("google-bert/bert-base-uncased")

    # E.g. config (or model) was saved using save_pretrained('./test/saved_model/').
    config = BertConfig.from_pretrained("./test/saved_model/")
    config = BertConfig.from_pretrained("./test/saved_model/my_configuration.json")

    # Keyword arguments that are configuration attributes override the loaded values.
    config = BertConfig.from_pretrained("google-bert/bert-base-uncased", output_attentions=True, foo=False)
    assert config.output_attentions == True

    # With return_unused_kwargs=True, the unused keyword arguments are returned as well.
    config, unused_kwargs = BertConfig.from_pretrained(
        "google-bert/bert-base-uncased", output_attentions=True, foo=False, return_unused_kwargs=True
    )
    assert config.output_attentions == True
    assert unused_kwargs == {"foo": False}

get_config_dict(pretrained_model_name_or_path: str | os.PathLike, **kwargs) → tuple[Dict, Dict]

Parameters

pretrained_model_name_or_path (str or os.PathLike): The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.

Returns

tuple[Dict, Dict]: The dictionary(ies) that will be used to instantiate the configuration object.

From a pretrained_model_name_or_path, resolve to a dictionary of parameters, to be used for instantiating a PreTrainedConfig using from_dict.

get_text_config(decoder = None, encoder = None)

Parameters

decoder (Optional[bool], optional): If set to True, then only search for decoder config names.
encoder (Optional[bool], optional): If set to True, then only search for encoder config names.

Returns the text config related to the text input (encoder) or text output (decoder) of the model. The decoder and encoder input arguments can be used to specify which end of the model we are interested in, which is useful on models that have both text input and output modalities.
There are three possible outcomes of using this method:
  On most models, it returns the original config instance itself.
  On newer (2024+) composite models, it returns the text section of the config, which is nested under a set of valid names.
  On older (2023-) composite models, it discards decoder-only parameters when encoder=True and vice-versa.

register_for_auto_class(auto_class = 'AutoConfig')

Parameters

auto_class (str or type, optional, defaults to "AutoConfig"): The auto class to register this new configuration with.

Register this class with a given auto class. This should only be used for custom configurations as the ones in the library are already mapped with AutoConfig.

save_pretrained(save_directory: str | os.PathLike, push_to_hub: bool = False, **kwargs)

Parameters

save_directory (str or os.PathLike): Directory where the configuration JSON file will be saved (will be created if it does not exist).
push_to_hub (bool, optional, defaults to False): Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
kwargs (dict[str, Any], optional): Additional keyword arguments passed along to the push_to_hub() method.

Save a configuration object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method.

to_dict() → dict[str, Any]

Returns

dict[str, Any]: Dictionary of all the attributes that make up this configuration instance.

Serializes this instance to a Python dictionary.

to_diff_dict() → dict[str, Any]

Returns

dict[str, Any]: Dictionary of all the attributes that make up this configuration instance.

Removes all attributes from the configuration that correspond to the default config attributes for better readability, while always retaining the config attribute from the class. Serializes to a Python dictionary.
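The difference between a full serialization (to_dict) and a diff against defaults (to_diff_dict) can be illustrated with plain dictionaries. The sketch below is an assumption-laden simplification, not the library's code: DEFAULTS and diff_dict_sketch are hypothetical names, the default values are made up for illustration, and the rule about always retaining certain class attributes is not modeled.

```python
from typing import Any

# Hypothetical default values, standing in for a default-constructed config.
DEFAULTS: dict[str, Any] = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "output_attentions": False,
}


def diff_dict_sketch(values: dict[str, Any], defaults: dict[str, Any]) -> dict[str, Any]:
    """Keep only the entries that are absent from the defaults or differ from them."""
    return {
        key: value
        for key, value in values.items()
        if key not in defaults or defaults[key] != value
    }


# A "full" dictionary: every default, one overridden value, one extra key.
full = dict(DEFAULTS, hidden_size=1024, model_type="bert")
diff = diff_dict_sketch(full, DEFAULTS)
print(diff)
```

Only the overridden hidden_size and the extra model_type key survive the diff, which is why a diff-style file is shorter and easier to read than a full dump.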
to_json_file(json_file_path: str | os.PathLike, use_diff: bool = True)

Parameters

json_file_path (str or os.PathLike): Path to the JSON file in which this configuration instance's parameters will be saved.
use_diff (bool, optional, defaults to True): If set to True, only the difference between the config instance and the default PreTrainedConfig() is serialized to the JSON file.

Save this instance to a JSON file.

to_json_string(use_diff: bool = True) → str

Parameters

use_diff (bool, optional, defaults to True): If set to True, only the difference between the config instance and the default PreTrainedConfig() is serialized to the JSON string.

Returns

str: String containing all the attributes that make up this configuration instance in JSON format.

Serializes this instance to a JSON string.

update(config_dict: dict)

Parameters

config_dict (dict[str, Any]): Dictionary of attributes that should be updated for this class.

Updates attributes of this class with attributes from config_dict.

update_from_string(update_str: str)

Parameters

update_str (str): String with attributes that should be updated for this class.

Updates attributes of this class with attributes from update_str. The expected format is ints, floats and strings as is, and for booleans use true or false. For example: "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index". The keys to change have to already exist in the config object.
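The update_str format described for update_from_string can be illustrated with a small parser over a plain dictionary. This is a sketch under stated assumptions, not the library's implementation: update_from_string_sketch is a hypothetical helper, the starting values in config are made up, and the assumption that each new value is cast to the type of the existing value is ours.

```python
from typing import Any


def update_from_string_sketch(config: dict[str, Any], update_str: str) -> None:
    """Apply "key=value,key=value" updates to existing keys, casting to the old type."""
    for pair in update_str.split(","):
        key, raw = pair.split("=", 1)
        if key not in config:
            # Mirrors the documented rule: keys must already exist.
            raise ValueError(f"key {key!r} is not in the config")
        old = config[key]
        if isinstance(old, bool):  # check bool first: bool is a subclass of int
            if raw.lower() not in ("true", "false"):
                raise ValueError(f"cannot parse {raw!r} as a boolean")
            config[key] = raw.lower() == "true"
        elif isinstance(old, int):
            config[key] = int(raw)
        elif isinstance(old, float):
            config[key] = float(raw)
        elif isinstance(old, str):
            config[key] = raw
        else:
            raise TypeError(f"unsupported type for key {key!r}")


# Hypothetical starting values; the update string is the one from the text above.
config = {"n_embd": 768, "resid_pdrop": 0.1, "scale_attn_weights": True, "summary_type": "first"}
update_from_string_sketch(
    config, "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
)
print(config)
```

Casting by the existing value's type is what lets the same string syntax carry ints, floats, strings and booleans without any type annotations in the string itself.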

hidden_size

Pattern 2: class transformers.PreTrainedConfig < source > ( output_hidden_states: bool = False output_attentions: bool = False return_dict: bool = True dtype: typing.Union[str, ForwardRef('torch.dtype'), NoneType] = None chunk_size_feed_forward: int = 0 is_encoder_decoder: bool = False architectures: list[str] | None = None id2label: dict[int, str] | None = None label2id: dict[str, int] | None = None num_labels: int | None = None problem_type: str | None = None **kwargs ) Parameters name_or_path (str, optional, defaults to "") — Store the string that was passed to PreTrainedModel.from_pretrained() as pretrained_model_name_or_path if the configuration was created with such a method. output_hidden_states (bool, optional, defaults to False) — Whether or not the model should return all hidden-states. output_attentions (bool, optional, defaults to False) — Whether or not the model should returns all attentions. return_dict (bool, optional, defaults to True) — Whether or not the model should return a ModelOutput instead of a plain tuple. is_encoder_decoder (bool, optional, defaults to False) — Whether the model is used as an encoder/decoder or not. chunk_size_feed_forward (int, optional, defaults to 0) — The chunk size of all feed forward layers in the residual attention blocks. A chunk size of 0 means that the feed forward layer is not chunked. A chunk size of n means that the feed forward layer processes n < sequence_length embeddings at a time. For more information on feed forward chunking, see How does Feed Forward Chunking work?. Parameters for fine-tuning tasks architectures (list[str], optional) — Model architectures that can be used with the model pretrained weights. id2label (dict[int, str], optional) — A map from index (for instance prediction index, or target index) to label. label2id (dict[str, int], optional) — A map from label to index for the model. 
num_labels (int, optional) — Number of labels to use in the last layer added to the model, typically for a classification task. problem_type (str, optional) — Problem type for XxxForSequenceClassification models. Can be one of "regression", "single_label_classification" or "multi_label_classification". PyTorch specific parameters dtype (str, optional) — The dtype of the weights. This attribute can be used to initialize the model to a non-default dtype (which is normally float32) and thus allow for optimal storage allocation. For example, if the saved model is float16, ideally we want to load it back using the minimal amount of memory needed to load float16 weights. Base class for all configuration classes. Handles a few parameters common to all models’ configurations as well as methods for loading/downloading/saving configurations. A configuration file can be loaded and saved to disk. Loading the configuration file and using this file to initialize a model does not load the model weights. It only affects the model’s configuration. Class attributes (overridden by derived classes): model_type (str) — An identifier for the model type, serialized into the JSON file, and used to recreate the correct object in AutoConfig. has_no_defaults_at_init (bool) — Whether the config class can be initialized without providing input arguments. Some configurations requires inputs to be defined at init and have no default values, usually these are composite configs, (but not necessarily) such as EncoderDecoderConfig or ~RagConfig. They have to be initialized from two or more configs of type PreTrainedConfig. keys_to_ignore_at_inference (list[str]) — A list of keys to ignore by default when looking at dictionary outputs of the model during inference. attribute_map (dict[str, str]) — A dict that maps model specific attribute names to the standardized naming of attributes. 
base_model_tp_plan (dict[str, Any]) — A dict that maps sub-modules FQNs of a base model to a tensor parallel plan applied to the sub-module when model.tensor_parallel is called. base_model_pp_plan (dict[str, tuple[list[str]]]) — A dict that maps child-modules of a base model to a pipeline parallel plan that enables users to place the child-module on the appropriate device. Common attributes (present in all subclasses): vocab_size (int) — The number of tokens in the vocabulary, which is also the first dimension of the embeddings matrix (this attribute may be missing for models that don’t have a text modality like ViT). hidden_size (int) — The hidden size of the model. num_attention_heads (int) — The number of attention heads used in the multi-head attention layers of the model. num_hidden_layers (int) — The number of blocks in the model. Setting parameters for sequence generation in the model config is deprecated. For backward compatibility, loading some of them will still be possible, but attempting to overwrite them will throw an exception — you should set them in a [~transformers.GenerationConfig]. Check the documentation of [~transformers.GenerationConfig] for more information about the individual parameters. push_to_hub < source > ( repo_id: str commit_message: str | None = None commit_description: str | None = None private: bool | None = None token: bool | str | None = None revision: str | None = None create_pr: bool = False max_shard_size: int | str | None = '50GB' tags: list[str] | None = None ) Parameters repo_id (str) — The name of the repository you want to push your config to. It should contain your organization name when pushing to a given organization. commit_message (str, optional) — Message to commit while pushing. Will default to "Upload config". commit_description (str, optional) — The description of the commit that will be created private (bool, optional) — Whether to make the repo private. 
If None (default), the repo will be public unless the organization’s default is private. This value is ignored if the repo already exists. token (bool or str, optional) — The token to use as HTTP bearer authorization for remote files. If True (default), will use the token generated when running hf auth login (stored in ~/.huggingface). revision (str, optional) — Branch to push the uploaded files to. create_pr (bool, optional, defaults to False) — Whether or not to create a PR with the uploaded files or directly commit. max_shard_size (int or str, optional, defaults to "50GB") — Only applicable for models. The maximum size for a checkpoint before being sharded. Checkpoints shard will then be each of size lower than this size. If expressed as a string, needs to be digits followed by a unit (like "5MB"). tags (list[str], optional) — List of tags to push on the Hub. Upload the configuration file to the 🤗 Model Hub. Examples: Copied from transformers import AutoConfig config = AutoConfig.from_pretrained("google-bert/bert-base-cased") # Push the config to your namespace with the name "my-finetuned-bert". config.push_to_hub("my-finetuned-bert") # Push the config to an organization with the name "my-finetuned-bert". config.push_to_hub("huggingface/my-finetuned-bert") dict_dtype_to_str < source > ( d: dict ) Checks whether the passed dictionary and its nested dicts have a dtype key and if it’s not None, converts torch.dtype to a string of just the type. For example, torch.float32 get converted into “float32” string, which can then be stored in the json format. from_dict < source > ( config_dict: dict **kwargs ) → PreTrainedConfig Parameters config_dict (dict[str, Any]) — Dictionary that will be used to instantiate the configuration object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the get_config_dict() method. kwargs (dict[str, Any]) — Additional parameters from which to initialize the configuration object. 
Returns PreTrainedConfig The configuration object instantiated from those parameters. Instantiates a PreTrainedConfig from a Python dictionary of parameters. from_json_file < source > ( json_file: str | os.PathLike ) → PreTrainedConfig Parameters json_file (str or os.PathLike) — Path to the JSON file containing the parameters. Returns PreTrainedConfig The configuration object instantiated from that JSON file. Instantiates a PreTrainedConfig from the path to a JSON file of parameters. from_pretrained < source > ( pretrained_model_name_or_path: str | os.PathLike cache_dir: str | os.PathLike | None = None force_download: bool = False local_files_only: bool = False token: str | bool | None = None revision: str = 'main' **kwargs ) → PreTrainedConfig Parameters pretrained_model_name_or_path (str or os.PathLike) — This can be either: a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co. a path to a directory containing a configuration file saved using the save_pretrained() method, e.g., ./my_model_directory/. a path or url to a saved configuration JSON file, e.g., ./my_model_directory/configuration.json. cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used. force_download (bool, optional, defaults to False) — Whether or not to force to (re-)download the configuration files and override the cached versions if they exist. proxies (dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request. token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running hf auth login (stored in ~/.huggingface). revision (str, optional, defaults to "main") — The specific model version to use. 
It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git. To test a pull request you made on the Hub, you can pass revision="refs/pr/<pr_number>". return_unused_kwargs (bool, optional, defaults to False) — If False, then this function returns just the final configuration object. If True, then this functions returns a Tuple(config, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the part of kwargs which has not been used to update config and is otherwise ignored. subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here. kwargs (dict[str, Any], optional) — The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter. Returns PreTrainedConfig The configuration object instantiated from this pretrained model. Instantiate a PreTrainedConfig (or a derived class) from a pretrained model configuration. Examples: Copied # We can't instantiate directly the base class PreTrainedConfig so let's show the examples on a # derived class: BertConfig config = BertConfig.from_pretrained( "google-bert/bert-base-uncased" ) # Download configuration from huggingface.co and cache. config = BertConfig.from_pretrained( "./test/saved_model/" ) # E.g. 
config (or model) was saved using save_pretrained('./test/saved_model/') config = BertConfig.from_pretrained("./test/saved_model/my_configuration.json") config = BertConfig.from_pretrained("google-bert/bert-base-uncased", output_attentions=True, foo=False) assert config.output_attentions == True config, unused_kwargs = BertConfig.from_pretrained( "google-bert/bert-base-uncased", output_attentions=True, foo=False, return_unused_kwargs=True ) assert config.output_attentions == True assert unused_kwargs == {"foo": False} get_config_dict < source > ( pretrained_model_name_or_path: str | os.PathLike **kwargs ) → tuple[Dict, Dict] Parameters pretrained_model_name_or_path (str or os.PathLike) — The identifier of the pre-trained checkpoint from which we want the dictionary of parameters. Returns tuple[Dict, Dict] The dictionary(ies) that will be used to instantiate the configuration object. From a pretrained_model_name_or_path, resolve to a dictionary of parameters, to be used for instantiating a PreTrainedConfig using from_dict. get_text_config < source > ( decoder = None encoder = None ) Parameters decoder (Optional[bool], optional) — If set to True, then only search for decoder config names. encoder (Optional[bool], optional) — If set to True, then only search for encoder config names. Returns the text config related to the text input (encoder) or text output (decoder) of the model. The decoder and encoder input arguments can be used to specify which end of the model we are interested in, which is useful on models that have both text input and output modalities. There are three possible outcomes of using this method: On most models, it returns the original config instance itself. On newer (2024+) composite models, it returns the text section of the config, which is nested under a set of valid names. On older (2023-) composite models, it discards decoder-only parameters when encoder=True and vice-versa. 
register_for_auto_class < source > ( auto_class = 'AutoConfig' ) Parameters auto_class (str or type, optional, defaults to "AutoConfig") — The auto class to register this new configuration with. Register this class with a given auto class. This should only be used for custom configurations as the ones in the library are already mapped with AutoConfig. save_pretrained < source > ( save_directory: str | os.PathLike push_to_hub: bool = False **kwargs ) Parameters save_directory (str or os.PathLike) — Directory where the configuration JSON file will be saved (will be created if it does not exist). push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace). kwargs (dict[str, Any], optional) — Additional key word arguments passed along to the push_to_hub() method. Save a configuration object to the directory save_directory, so that it can be re-loaded using the from_pretrained() class method. to_dict < source > ( ) → dict[str, Any] Returns dict[str, Any] Dictionary of all the attributes that make up this configuration instance. Serializes this instance to a Python dictionary. to_diff_dict < source > ( ) → dict[str, Any] Returns dict[str, Any] Dictionary of all the attributes that make up this configuration instance. Removes all attributes from the configuration that correspond to the default config attributes for better readability, while always retaining the config attribute from the class. Serializes to a Python dictionary. to_json_file < source > ( json_file_path: str | os.PathLike use_diff: bool = True ) Parameters json_file_path (str or os.PathLike) — Path to the JSON file in which this configuration instance’s parameters will be saved. 
use_diff (bool, optional, defaults to True) — If set to True, only the difference between the config instance and the default PreTrainedConfig() is serialized to JSON file. Save this instance to a JSON file. to_json_string < source > ( use_diff: bool = True ) → str Parameters use_diff (bool, optional, defaults to True) — If set to True, only the difference between the config instance and the default PreTrainedConfig() is serialized to JSON string. Returns str String containing all the attributes that make up this configuration instance in JSON format. Serializes this instance to a JSON string. update < source > ( config_dict: dict ) Parameters config_dict (dict[str, Any]) — Dictionary of attributes that should be updated for this class. Updates attributes of this class with attributes from config_dict. update_from_string < source > ( update_str: str ) Parameters update_str (str) — String with attributes that should be updated for this class. Updates attributes of this class with attributes from update_str. The expected format is ints, floats and strings as is, and for booleans use true or false. For example: “n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index” The keys to change have to already exist in the config object.
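To illustrate the update_str format described above, here is a small pure-Python sketch. It is not the library's code: ToyConfig and update_from_string_sketch are hypothetical names, and casting each value to the type of the existing attribute is an assumption about how "ints, floats and strings as is" could be resolved. The two documented rules it does follow are that booleans are written as true or false and that keys must already exist.

```python
class ToyConfig:
    """Hypothetical config with attributes of several types (not a transformers class)."""

    def __init__(self) -> None:
        self.n_embd = 768
        self.resid_pdrop = 0.1
        self.scale_attn_weights = True
        self.summary_type = "first"


def update_from_string_sketch(config: ToyConfig, update_str: str) -> None:
    """Parse 'key=value,key=value' and cast each value to the existing attribute's type."""
    for pair in update_str.split(","):
        key, value = pair.split("=", 1)
        if not hasattr(config, key):
            raise ValueError(f"key {key!r} does not exist in the config")
        old = getattr(config, key)
        # bool is checked before int because bool is a subclass of int
        if isinstance(old, bool):
            if value.lower() not in ("true", "false"):
                raise ValueError(f"expected true or false, got {value!r}")
            new = value.lower() == "true"
        elif isinstance(old, int):
            new = int(value)
        elif isinstance(old, float):
            new = float(value)
        else:
            new = value  # strings are kept as is
        setattr(config, key, new)


config = ToyConfig()
update_from_string_sketch(
    config, "n_embd=10,resid_pdrop=0.2,scale_attn_weights=false,summary_type=cls_index"
)
print(config.n_embd, config.resid_pdrop, config.scale_attn_weights, config.summary_type)
```

Passing a key that the config does not have raises an error in this sketch, which matches the requirement that the keys to change already exist.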


Pattern 3: Transformers documentation BERT

This model was released on 2018-10-11 and added to Hugging Face Transformers on 2020-11-16.

BERT

BERT is a bidirectional transformer pretrained on unlabeled text to predict masked tokens in a sentence and to predict whether one sentence follows another. The main idea is that by randomly masking some tokens, the model can train on text to the left and right, giving it a more thorough understanding. BERT is also very versatile because its learned language representations can be adapted for other NLP tasks by fine-tuning an additional layer or head. You can find all the original BERT checkpoints under the BERT collection. Click on the BERT models in the right sidebar for more examples of how to apply BERT to different language tasks.
The example below demonstrates how to predict the [MASK] token with Pipeline; the source page also covers AutoModel and the transformers CLI.

import torch
from transformers import pipeline

# Named fill_mask so the pipeline() factory function is not shadowed
fill_mask = pipeline(
    task="fill-mask",
    model="google-bert/bert-base-uncased",
    dtype=torch.float16,
    device=0,
)
fill_mask("Plants create [MASK] through a process known as photosynthesis.")

Notes
Inputs should be padded on the right because BERT uses absolute position embeddings.

BertConfig

class transformers.BertConfig(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, use_cache=True, classifier_dropout=None, is_decoder=False, add_cross_attention=False, bos_token_id=None, eos_token_id=None, tie_word_embeddings=True, **kwargs)

Parameters
vocab_size (int, optional, defaults to 30522) — Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling BertModel.
hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the “intermediate” (often named feed-forward) layer in the Transformer encoder.
hidden_act (str or Callable, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler. attention_probs_dropout_prob (float, optional, defaults to 0.1) — The dropout ratio for the attention probabilities. max_position_embeddings (int, optional, defaults to 512) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). type_vocab_size (int, optional, defaults to 2) — The vocabulary size of the token_type_ids passed when calling BertModel. initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers. is_decoder (bool, optional, defaults to False) — Whether the model is used as a decoder or not. If False, the model is used as an encoder. use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True. classifier_dropout (float, optional) — The dropout ratio for the classification head. This is the configuration class to store the configuration of a BertModel. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT google-bert/bert-base-uncased architecture. Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information. 
Examples: Copied >>> from transformers import BertConfig, BertModel >>> # Initializing a BERT google-bert/bert-base-uncased style configuration >>> configuration = BertConfig() >>> # Initializing a model (with random weights) from the google-bert/bert-base-uncased style configuration >>> model = BertModel(configuration) >>> # Accessing the model configuration >>> configuration = model.config BertTokenizer class transformers.BertTokenizer < source > ( vocab: str | dict[str, int] | None = None do_lower_case: bool = False unk_token: str = '[UNK]' sep_token: str = '[SEP]' pad_token: str = '[PAD]' cls_token: str = '[CLS]' mask_token: str = '[MASK]' tokenize_chinese_chars: bool = True strip_accents: bool | None = None **kwargs ) Parameters vocab (str or dict[str, int], optional) — Custom vocabulary dictionary. If not provided, vocabulary is loaded from vocab_file. do_lower_case (bool, optional, defaults to False) — Whether or not to lowercase the input when tokenizing. unk_token (str, optional, defaults to "[UNK]") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. sep_token (str, optional, defaults to "[SEP]") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. pad_token (str, optional, defaults to "[PAD]") — The token used for padding, for example when batching sequences of different lengths. cls_token (str, optional, defaults to "[CLS]") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens. mask_token (str, optional, defaults to "[MASK]") — The token used for masking values. 
This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. tokenize_chinese_chars (bool, optional, defaults to True) — Whether or not to tokenize Chinese characters. strip_accents (bool, optional) — Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for lowercase (as in the original BERT). Construct a BERT tokenizer (backed by HuggingFace’s tokenizers library). Based on WordPiece. This tokenizer inherits from TokenizersBackend which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. get_special_tokens_mask < source > ( token_ids_0: list[int] token_ids_1: list[int] | None = None already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1] Parameters token_ids_0 — List of IDs for the (possibly already formatted) sequence. token_ids_1 — Unused when already_has_special_tokens=True. Must be None in that case. already_has_special_tokens — Whether the sequence is already formatted with special tokens. Returns A list of integers in the range [0, 1] 1 for a special token, 0 for a sequence token. Retrieve sequence ids from a token list that has no special tokens added. For fast tokenizers, data collators call this with already_has_special_tokens=True to build a mask over an already-formatted sequence. In that case, we compute the mask by checking membership in all_special_ids. 
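The membership check described for already_has_special_tokens=True can be sketched in a few lines of plain Python. This is an illustration of the idea, not the tokenizer's code: special_tokens_mask_sketch is a hypothetical helper, the ID set is assumed to match the bert-base-uncased vocabulary, and the example token IDs are arbitrary.

```python
# Assumed special-token IDs, matching the bert-base-uncased vocabulary:
# [PAD]=0, [UNK]=100, [CLS]=101, [SEP]=102, [MASK]=103
ALL_SPECIAL_IDS = {0, 100, 101, 102, 103}


def special_tokens_mask_sketch(token_ids: list[int]) -> list[int]:
    """Return 1 for a special token and 0 for a sequence token."""
    return [1 if token_id in ALL_SPECIAL_IDS else 0 for token_id in token_ids]


# Hypothetical, already-formatted pair: [CLS] A A [SEP] B [SEP] [PAD] [PAD]
formatted = [101, 7592, 2088, 102, 2129, 102, 0, 0]
mask = special_tokens_mask_sketch(formatted)
print(mask)  # [1, 0, 0, 1, 0, 1, 1, 1]
```

A data collator for masked language modeling can use such a mask to avoid selecting [CLS], [SEP], or padding positions for masking.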
save_vocabulary < source > ( save_directory: str filename_prefix: str | None = None ) BertTokenizerLegacy class transformers.BertTokenizerLegacy < source > ( vocab_file do_lower_case = True do_basic_tokenize = True never_split = None unk_token = '[UNK]' sep_token = '[SEP]' pad_token = '[PAD]' cls_token = '[CLS]' mask_token = '[MASK]' tokenize_chinese_chars = True strip_accents = None clean_up_tokenization_spaces = True **kwargs ) Parameters vocab_file (str) — File containing the vocabulary. do_lower_case (bool, optional, defaults to True) — Whether or not to lowercase the input when tokenizing. do_basic_tokenize (bool, optional, defaults to True) — Whether or not to do basic tokenization before WordPiece. never_split (Iterable, optional) — Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True unk_token (str, optional, defaults to "[UNK]") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. sep_token (str, optional, defaults to "[SEP]") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. pad_token (str, optional, defaults to "[PAD]") — The token used for padding, for example when batching sequences of different lengths. cls_token (str, optional, defaults to "[CLS]") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens. mask_token (str, optional, defaults to "[MASK]") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. 
tokenize_chinese_chars (bool, optional, defaults to True) — Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see this issue). strip_accents (bool, optional) — Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for lowercase (as in the original BERT). clean_up_tokenization_spaces (bool, optional, defaults to True) — Whether or not to cleanup spaces after decoding, cleanup consists in removing potential artifacts like extra spaces. Construct a BERT tokenizer. Based on WordPiece. This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. build_inputs_with_special_tokens < source > ( token_ids_0: list token_ids_1: list[int] | None = None ) → List[int] Parameters token_ids_0 (List[int]) — List of IDs to which the special tokens will be added. token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs. Returns List[int] List of input IDs with the appropriate special tokens. Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format: single sequence: [CLS] X [SEP] pair of sequences: [CLS] A [SEP] B [SEP] convert_tokens_to_string < source > ( tokens ) Converts a sequence of tokens (string) in a single string. get_special_tokens_mask < source > ( token_ids_0: list token_ids_1: list[int] | None = None already_has_special_tokens: bool = False ) → List[int] Parameters token_ids_0 (List[int]) — List of IDs. token_ids_1 (List[int], optional) — Optional second list of IDs for sequence pairs. already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model. 
Returns List[int] A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method. BertTokenizerFast class transformers.BertTokenizer < source > ( vocab: str | dict[str, int] | None = None do_lower_case: bool = False unk_token: str = '[UNK]' sep_token: str = '[SEP]' pad_token: str = '[PAD]' cls_token: str = '[CLS]' mask_token: str = '[MASK]' tokenize_chinese_chars: bool = True strip_accents: bool | None = None **kwargs ) Parameters vocab (str or dict[str, int], optional) — Custom vocabulary dictionary. If not provided, vocabulary is loaded from vocab_file. do_lower_case (bool, optional, defaults to False) — Whether or not to lowercase the input when tokenizing. unk_token (str, optional, defaults to "[UNK]") — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. sep_token (str, optional, defaults to "[SEP]") — The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens. pad_token (str, optional, defaults to "[PAD]") — The token used for padding, for example when batching sequences of different lengths. cls_token (str, optional, defaults to "[CLS]") — The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens. mask_token (str, optional, defaults to "[MASK]") — The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict. 
tokenize_chinese_chars (bool, optional, defaults to True) — Whether or not to tokenize Chinese characters.
strip_accents (bool, optional) — Whether or not to strip all accents. If this option is not specified, then it will be determined by the value for lowercase (as in the original BERT).

Construct a BERT tokenizer (backed by HuggingFace’s tokenizers library). Based on WordPiece. This tokenizer inherits from TokenizersBackend which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

BertModel

class transformers.BertModel(config, add_pooling_layer=True)

Parameters
config (BertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
add_pooling_layer (bool, optional, defaults to True) — Whether to add a pooling layer.

The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of cross-attention is added between the self-attention layers, following the architecture described in Attention Is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

To behave as a decoder, the model needs to be initialized with the is_decoder argument of the configuration set to True. To be used in a Seq2Seq model, the model needs to be initialized with both the is_decoder argument and add_cross_attention set to True; an encoder_hidden_states is then expected as an input to the forward pass.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. forward < source > ( input_ids: torch.Tensor | None = None attention_mask: torch.Tensor | None = None token_type_ids: torch.Tensor | None = None position_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None encoder_hidden_states: torch.Tensor | None = None encoder_attention_mask: torch.Tensor | None = None past_key_values: transformers.cache_utils.Cache | None = None use_cache: bool | None = None cache_position: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor) Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? 
inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder. encoder_attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length). use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values). 
cache_position (torch.Tensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length. Returns transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or tuple(torch.FloatTensor) A transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model. pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights of the decoder’s cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads. past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=True in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. The BertModel forward method, overrides the call special method. Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them. BertForPreTraining class transformers.BertForPreTraining < source > ( config ) Parameters config (BertForPreTraining) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights. 
Bert Model with two heads on top as done during the pretraining: a masked language modeling head and a next sentence prediction (classification) head. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. forward < source > ( input_ids: torch.Tensor | None = None attention_mask: torch.Tensor | None = None token_type_ids: torch.Tensor | None = None position_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None labels: torch.Tensor | None = None next_sentence_label: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.bert.modeling_bert.BertForPreTrainingOutput or tuple(torch.FloatTensor) Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? 
position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size] next_sentence_label (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see input_ids docstring) Indices should be in [0, 1]: 0 indicates sequence B is a continuation of sequence A, 1 indicates sequence B is a random sequence. Returns transformers.models.bert.modeling_bert.BertForPreTrainingOutput or tuple(torch.FloatTensor) A transformers.models.bert.modeling_bert.BertForPreTrainingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (optional, returned when labels is provided, torch.FloatTensor of shape (1,)) — Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss. 
prediction_logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). seq_relationship_logits (torch.FloatTensor of shape (batch_size, 2)) — Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax). hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads. The BertForPreTraining forward method overrides the __call__ special method. Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
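As a rough illustration of how the two heads and the summed loss described above fit together, the sketch below builds a tiny, randomly initialised BertForPreTraining directly from a BertConfig (so no weights are downloaded) and recomputes the total loss by hand. The config sizes, the random input ids, and the choice of which positions carry labels are assumptions made for this example only.

```python
import torch
import torch.nn.functional as F
from transformers import BertConfig, BertForPreTraining

torch.manual_seed(0)

# Hypothetical tiny configuration, chosen only so the example runs quickly.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForPreTraining(config).eval()  # random weights, dropout disabled

batch_size, seq_len = 2, 8
input_ids = torch.randint(1, config.vocab_size, (batch_size, seq_len))

# MLM labels: -100 (ignored) everywhere except one position per sequence
# that we pretend was masked.
labels = torch.full_like(input_ids, -100)
labels[:, 3] = input_ids[:, 3]

# NSP labels: 0 = sequence B continues sequence A, 1 = sequence B is random.
next_sentence_label = torch.tensor([0, 1])

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        labels=labels,
        next_sentence_label=next_sentence_label,
    )

# Recompute both losses from the returned logits and add them.
mlm_loss = F.cross_entropy(
    outputs.prediction_logits.reshape(-1, config.vocab_size),
    labels.reshape(-1),
    ignore_index=-100,
)
nsp_loss = F.cross_entropy(outputs.seq_relationship_logits, next_sentence_label)
manual_total = mlm_loss + nsp_loss

print(outputs.prediction_logits.shape)
print(outputs.seq_relationship_logits.shape)
print(torch.allclose(outputs.loss, manual_total, atol=1e-5))
```

Because the weights are random, the loss value itself is meaningless here; the point is only the shapes of the two logit tensors and the fact that the returned loss matches the sum of the two cross-entropy terms.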
Example:

>>> from transformers import AutoTokenizer, BertForPreTraining
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertForPreTraining.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> prediction_logits = outputs.prediction_logits
>>> seq_relationship_logits = outputs.seq_relationship_logits

BertLMHeadModel

class transformers.BertLMHeadModel(config)

Parameters config (BertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bert Model with a language modeling head on top for CLM fine-tuning. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward < source > ( input_ids: torch.Tensor | None = None attention_mask: torch.Tensor | None = None token_type_ids: torch.Tensor | None = None position_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None encoder_hidden_states: torch.Tensor | None = None encoder_attention_mask: torch.Tensor | None = None labels: torch.Tensor | None = None past_key_values: transformers.cache_utils.Cache | None = None use_cache: bool | None = None cache_position: torch.Tensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor) Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? 
inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder. encoder_attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size]. past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, a DynamicCache will be initialized by default. The model will output the same cache format that is fed as input.
If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length). use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values). cache_position (torch.Tensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length. logits_to_keep (Union[int, torch.Tensor], optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length). Returns transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or tuple(torch.FloatTensor) A transformers.modeling_outputs.CausalLMOutputWithCrossAttentions or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction). logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). 
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross attentions weights after the attention softmax, used to compute the weighted average in the cross-attention heads. past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding. The BertLMHeadModel forward method, overrides the call special method. Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them. 
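The "left-to-right language modeling loss (next word prediction)" described for labels can be checked with a small experiment. The sketch below assumes a recent transformers release in which the model shifts the labels internally, so that passing labels=input_ids trains each position to predict the following token. The tiny config and random ids are hypothetical values picked for the example, and is_decoder is set explicitly so that attention is causal.

```python
import torch
import torch.nn.functional as F
from transformers import BertConfig, BertLMHeadModel

torch.manual_seed(0)

# Hypothetical tiny configuration; real checkpoints are much larger.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
config.is_decoder = True  # use causal (left-to-right) self-attention

model = BertLMHeadModel(config).eval()  # random weights, dropout disabled

input_ids = torch.randint(1, config.vocab_size, (2, 8))

with torch.no_grad():
    outputs = model(input_ids=input_ids, labels=input_ids)

# Manual next-token loss: the logits at position t are scored against
# the token at position t + 1, so the last logit and first label are dropped.
shift_logits = outputs.logits[:, :-1, :]
shift_labels = input_ids[:, 1:]
manual_loss = F.cross_entropy(
    shift_logits.reshape(-1, config.vocab_size),
    shift_labels.reshape(-1),
)

print(outputs.logits.shape)
print(torch.allclose(outputs.loss, manual_loss, atol=1e-5))
```

If the two values agree, no manual shifting of labels is needed before calling the model; positions that should not contribute to the loss can still be set to -100.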
Example:

>>> import torch
>>> from transformers import AutoTokenizer, BertLMHeadModel
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertLMHeadModel.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> loss = outputs.loss
>>> logits = outputs.logits

BertForMaskedLM

class transformers.BertForMaskedLM(config)

Parameters config (BertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Bert Model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward(input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None, token_type_ids: torch.Tensor | None = None, position_ids: torch.Tensor | None = None, inputs_embeds: torch.Tensor | None = None, encoder_hidden_states: torch.Tensor | None = None, encoder_attention_mask: torch.Tensor | None = None, labels: torch.Tensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor)

Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer.
See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder. encoder_attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. 
Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size] Returns transformers.modeling_outputs.MaskedLMOutput or tuple(torch.FloatTensor) A transformers.modeling_outputs.MaskedLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Masked language modeling (MLM) loss. logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. The BertForMaskedLM forward method, overrides the call special method. Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them. 
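To make the -100 convention for labels concrete, here is a minimal sketch under stated assumptions: a tiny, randomly initialised BertForMaskedLM, a made-up mask token id, and two hand-picked masked positions. None of these values come from a real checkpoint. It checks that the returned loss only depends on the positions whose label is not -100.

```python
import torch
import torch.nn.functional as F
from transformers import BertConfig, BertForMaskedLM

torch.manual_seed(0)

# Hypothetical tiny configuration so nothing has to be downloaded.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config).eval()  # random weights, dropout disabled

mask_token_id = 4  # assumption for this sketch; use tokenizer.mask_token_id in practice
original_ids = torch.randint(5, config.vocab_size, (2, 8))

# Pick one position per sequence to mask.
masked_positions = torch.zeros_like(original_ids, dtype=torch.bool)
masked_positions[0, 2] = True
masked_positions[1, 5] = True

# Inputs carry the mask token; labels carry the original token only where
# the input was masked and -100 everywhere else.
input_ids = original_ids.masked_fill(masked_positions, mask_token_id)
labels = original_ids.masked_fill(~masked_positions, -100)

with torch.no_grad():
    outputs = model(input_ids=input_ids, labels=labels)

# Cross-entropy restricted to the masked positions.
manual_loss = F.cross_entropy(
    outputs.logits[masked_positions],
    original_ids[masked_positions],
)

print(outputs.logits.shape)
print(torch.allclose(outputs.loss, manual_loss, atol=1e-5))
```

The same masking pattern applies with a real tokenizer: replace the random ids with tokenized text and the hypothetical mask_token_id with tokenizer.mask_token_id.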
Example:

>>> from transformers import AutoTokenizer, BertForMaskedLM
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertForMaskedLM.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> # retrieve index of [MASK]
>>> mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
>>> predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
>>> tokenizer.decode(predicted_token_id)
...
>>> labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
>>> # mask labels of non-[MASK] tokens
>>> labels = torch.where(inputs.input_ids == tokenizer.mask_token_id, labels, -100)
>>> outputs = model(**inputs, labels=labels)
>>> round(outputs.loss.item(), 2)
...

BertForNextSentencePrediction

class transformers.BertForNextSentencePrediction(config)

Parameters config (BertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bert Model with a next sentence prediction (classification) head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward < source > ( input_ids: torch.Tensor | None = None attention_mask: torch.Tensor | None = None token_type_ids: torch.Tensor | None = None position_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None labels: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.NextSentencePredictorOutput or tuple(torch.FloatTensor) Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. 
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see input_ids docstring). Indices should be in [0, 1]: 0 indicates sequence B is a continuation of sequence A, 1 indicates sequence B is a random sequence. Returns transformers.modeling_outputs.NextSentencePredictorOutput or tuple(torch.FloatTensor) A transformers.modeling_outputs.NextSentencePredictorOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (torch.FloatTensor of shape (1,), optional, returned when next_sentence_label is provided) — Next sequence prediction (classification) loss. logits (torch.FloatTensor of shape (batch_size, 2)) — Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax). hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. The BertForNextSentencePrediction forward method, overrides the call special method. 
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoTokenizer, BertForNextSentencePrediction
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertForNextSentencePrediction.from_pretrained("google-bert/bert-base-uncased")
>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
>>> encoding = tokenizer(prompt, next_sentence, return_tensors="pt")
>>> outputs = model(**encoding, labels=torch.LongTensor([1]))
>>> logits = outputs.logits
>>> assert logits[0, 0] < logits[0, 1]  # next sentence was random

BertForSequenceClassification

class transformers.BertForSequenceClassification(config)

Parameters config (BertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Bert Model transformer with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for GLUE tasks. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward < source > ( input_ids: torch.Tensor | None = None attention_mask: torch.Tensor | None = None token_type_ids: torch.Tensor | None = None position_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None labels: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor) Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. 
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy). Returns transformers.modeling_outputs.SequenceClassifierOutput or tuple(torch.FloatTensor) A transformers.modeling_outputs.SequenceClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss. logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax). hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. The BertForSequenceClassification forward method, overrides the call special method. 
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example of single-label classification:

>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_id = logits.argmax().item()
>>> model.config.id2label[predicted_class_id]
...
>>> # To train a model on num_labels classes, you can pass num_labels=num_labels to .from_pretrained(...)
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", num_labels=num_labels)
>>> labels = torch.tensor([1])
>>> loss = model(**inputs, labels=labels).loss
>>> round(loss.item(), 2)
...

Example of multi-label classification:

>>> import torch
>>> from transformers import AutoTokenizer, BertForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertForSequenceClassification.from_pretrained("google-bert/bert-base-uncased", problem_type="multi_label_classification")
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_class_ids = torch.arange(0, logits.shape[-1])[torch.sigmoid(logits).squeeze(dim=0) > 0.5]
>>> # To train a model on num_labels classes, you can pass num_labels=num_labels to .from_pretrained(...)
>>> num_labels = len(model.config.id2label)
>>> model = BertForSequenceClassification.from_pretrained(
...     "google-bert/bert-base-uncased", num_labels=num_labels, problem_type="multi_label_classification"
... )
>>> labels = torch.sum(
...     torch.nn.functional.one_hot(predicted_class_ids[None, :].clone(), num_classes=num_labels), dim=1
... ).to(torch.float)
>>> loss = model(**inputs, labels=labels).loss

BertForMultipleChoice

class transformers.BertForMultipleChoice(config)

Parameters config (BertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward(input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None, token_type_ids: torch.Tensor | None = None, position_ids: torch.Tensor | None = None, inputs_embeds: torch.Tensor | None = None, labels: torch.Tensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor)

Parameters input_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length)) — Indices of input sequence tokens in the vocabulary. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.LongTensor of shape (batch_size, num_choices, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.max_position_embeddings - 1]. What are position IDs? inputs_embeds (torch.FloatTensor of shape (batch_size, num_choices, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the multiple choice classification loss. Indices should be in [0, ..., num_choices-1] where num_choices is the size of the second dimension of the input tensors. (See input_ids above) Returns transformers.modeling_outputs.MultipleChoiceModelOutput or tuple(torch.FloatTensor) A transformers.modeling_outputs.MultipleChoiceModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss. 
logits (torch.FloatTensor of shape (batch_size, num_choices)) — num_choices is the second dimension of the input tensors. (see input_ids above). Classification scores (before SoftMax). hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. The BertForMultipleChoice forward method, overrides the call special method. Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them. Example: Copied >>> from transformers import AutoTokenizer, BertForMultipleChoice >>> import torch >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") >>> model = BertForMultipleChoice.from_pretrained("google-bert/bert-base-uncased") >>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced." >>> choice0 = "It is eaten with a fork and a knife." >>> choice1 = "It is eaten while held in the hand." 
>>> labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1
>>> encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)
>>> outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)  # batch size is 1
>>> # the linear classifier still needs to be trained
>>> loss = outputs.loss
>>> logits = outputs.logits

BertForTokenClassification

class transformers.BertForTokenClassification ( config )

Parameters

config (BertConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Bert transformer with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward ( input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None, token_type_ids: torch.Tensor | None = None, position_ids: torch.Tensor | None = None, inputs_embeds: torch.Tensor | None = None, labels: torch.Tensor | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default.
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the token classification loss. Indices should be in [0, ..., config.num_labels - 1]. Returns transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor) A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification loss. 
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) — Classification scores (before SoftMax). hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. The BertForTokenClassification forward method, overrides the call special method. Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them. Example: Copied >>> from transformers import AutoTokenizer, BertForTokenClassification >>> import torch >>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased") >>> model = BertForTokenClassification.from_pretrained("google-bert/bert-base-uncased") >>> inputs = tokenizer( ... "HuggingFace is a company based in Paris and New York", add_special_tokens=False, return_tensors="pt" ... ) >>> with torch.no_grad(): ... logits = model(**inputs).logits >>> predicted_token_class_ids = logits.argmax(-1) >>> # Note that tokens are classified rather then input words which means that >>> # there might be more predicted token classes than words. 
>>> # Multiple token classes might account for the same word >>> predicted_tokens_classes = [model.config.id2label[t.item()] for t in predicted_token_class_ids[0]] >>> predicted_tokens_classes ... >>> labels = predicted_token_class_ids >>> loss = model(**inputs, labels=labels).loss >>> round(loss.item(), 2) ... BertForQuestionAnswering class transformers.BertForQuestionAnswering < source > ( config ) Parameters config (BertForQuestionAnswering) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights. The Bert transformer with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. forward < source > ( input_ids: torch.Tensor | None = None attention_mask: torch.Tensor | None = None token_type_ids: torch.Tensor | None = None position_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None start_positions: torch.Tensor | None = None end_positions: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor) Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. 
Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. start_positions (torch.Tensor of shape (batch_size,), optional) — Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss. end_positions (torch.Tensor of shape (batch_size,), optional) — Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (sequence_length). Position outside of the sequence are not taken into account for computing the loss. 
Returns transformers.modeling_outputs.QuestionAnsweringModelOutput or tuple(torch.FloatTensor) A transformers.modeling_outputs.QuestionAnsweringModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-start scores (before SoftMax). end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) — Span-end scores (before SoftMax). hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. The BertForQuestionAnswering forward method, overrides the call special method. Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them. 
Example:

>>> from transformers import AutoTokenizer, BertForQuestionAnswering
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertForQuestionAnswering.from_pretrained("google-bert/bert-base-uncased")

>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
>>> inputs = tokenizer(question, text, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()

>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens, skip_special_tokens=True)
...

>>> # target is "nice puppet"
>>> target_start_index = torch.tensor([14])
>>> target_end_index = torch.tensor([15])

>>> outputs = model(**inputs, start_positions=target_start_index, end_positions=target_end_index)
>>> loss = outputs.loss
>>> round(loss.item(), 2)
...

Bert specific outputs

class transformers.models.bert.modeling_bert.BertForPreTrainingOutput ( loss: torch.FloatTensor | None = None, prediction_logits: torch.FloatTensor | None = None, seq_relationship_logits: torch.FloatTensor | None = None, hidden_states: tuple[torch.FloatTensor] | None = None, attentions: tuple[torch.FloatTensor] | None = None )

Parameters

loss (optional, returned when labels is provided, torch.FloatTensor of shape (1,)) — Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.

prediction_logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

seq_relationship_logits (torch.FloatTensor of shape (batch_size, 2)) — Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Output type of BertForPreTraining.
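The span-selection step in the BertForQuestionAnswering example earlier in this pattern (take the argmax of the start scores and of the end scores, then slice the tokens between them) can be tried without downloading a model. The following is a minimal pure-Python sketch under stated assumptions: argmax and extract_answer are hypothetical helper names, and the token list and scores are made-up values for illustration, not real model or tokenizer output.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def extract_answer(tokens, start_logits, end_logits):
    """Pick the best start and end positions independently, then slice.

    Mirrors input_ids[0, start : end + 1] from the documented example.
    If the predicted end comes before the start, the slice is empty.
    """
    start = argmax(start_logits)
    end = argmax(end_logits)
    return tokens[start : end + 1]


# Hypothetical tokenization of a (question, context) pair.
tokens = ["[CLS]", "who", "was", "jim", "henson", "?", "[SEP]",
          "jim", "henson", "was", "a", "nice", "puppet", "[SEP]"]

# Made-up scores: flat everywhere except one clear start and one clear end.
start_logits = [0.1] * len(tokens)
end_logits = [0.1] * len(tokens)
start_logits[10] = 4.0  # position of "a"
end_logits[12] = 5.0    # position of "puppet"

answer = extract_answer(tokens, start_logits, end_logits)
print(" ".join(answer))  # a nice puppet
```

Because the start and end positions are chosen independently, nothing prevents the end from landing before the start; the sketch returns an empty list in that case, and real post-processing may prefer to score valid (start, end) pairs jointly.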

[MASK]

Pattern 4: class transformers.BertForPreTraining < source > ( config ) Parameters config (BertForPreTraining) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights. Bert Model with two heads on top as done during the pretraining: a masked language modeling head and a next sentence prediction (classification) head. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.) This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior. forward < source > ( input_ids: torch.Tensor | None = None attention_mask: torch.Tensor | None = None token_type_ids: torch.Tensor | None = None position_ids: torch.Tensor | None = None inputs_embeds: torch.Tensor | None = None labels: torch.Tensor | None = None next_sentence_label: torch.Tensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.bert.modeling_bert.BertForPreTrainingOutput or tuple(torch.FloatTensor) Parameters input_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. What are input IDs? attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked. What are attention masks? 
token_type_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Segment token indices to indicate first and second portions of the inputs. Indices are selected in [0, 1]: 0 corresponds to a sentence A token, 1 corresponds to a sentence B token. What are token type IDs? position_ids (torch.Tensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1]. What are position IDs? inputs_embeds (torch.Tensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix. labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see input_ids docstring) Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size] next_sentence_label (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the next sequence prediction (classification) loss. Input should be a sequence pair (see input_ids docstring) Indices should be in [0, 1]: 0 indicates sequence B is a continuation of sequence A, 1 indicates sequence B is a random sequence. Returns transformers.models.bert.modeling_bert.BertForPreTrainingOutput or tuple(torch.FloatTensor) A transformers.models.bert.modeling_bert.BertForPreTrainingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (BertConfig) and inputs. 
loss (optional, returned when labels is provided, torch.FloatTensor of shape (1,)) — Total loss as the sum of the masked language modeling loss and the next sequence prediction (classification) loss.

prediction_logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

seq_relationship_logits (torch.FloatTensor of shape (batch_size, 2)) — Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).

hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The BertForPreTraining forward method overrides the __call__ special method. Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:

>>> from transformers import AutoTokenizer, BertForPreTraining
>>> import torch

>>> tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
>>> model = BertForPreTraining.from_pretrained("google-bert/bert-base-uncased")

>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)

>>> prediction_logits = outputs.prediction_logits
>>> seq_relationship_logits = outputs.seq_relationship_logits
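The labels convention described above (positions set to -100 are skipped by the masked language modeling loss) can be illustrated without loading a model. The following is a minimal pure-Python sketch, not the library's implementation: masked_lm_loss, log_softmax and IGNORE_INDEX are hypothetical names, the scores are made-up numbers, and it assumes the loss is a mean cross-entropy over the non-ignored positions, which is the default behavior of PyTorch's CrossEntropyLoss with ignore_index=-100.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def masked_lm_loss(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.

    logits: one list of vocabulary scores per token position.
    labels: one target token id (or IGNORE_INDEX) per token position.
    """
    losses = [
        -log_softmax(scores)[label]
        for scores, label in zip(logits, labels)
        if label != IGNORE_INDEX
    ]
    if not losses:
        # Design choice of this sketch: fail loudly when nothing is labeled.
        raise ValueError("every position is ignored; nothing to average")
    return sum(losses) / len(losses)


# Toy sequence: 3 positions over a 4-token vocabulary (made-up scores).
logits = [
    [2.0, 0.5, 0.1, 0.1],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 3.0, 0.3],
]
# Only the middle position was masked in the input, so only it has a label.
labels = [IGNORE_INDEX, 1, IGNORE_INDEX]

loss = masked_lm_loss(logits, labels)
print(round(loss, 4))  # 1.3863, i.e. log(4) for uniform scores over 4 tokens
```

Changing the scores at the two ignored positions leaves the result unchanged, which is the point of the -100 convention: only the tokens that were actually masked contribute to the loss.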

masked language modeling

Pattern 5: class transformers.AlignConfig < source > ( text_config = None vision_config = None projection_dim = 640 temperature_init_value = 1.0 initializer_range = 0.02 **kwargs ) Parameters text_config (dict, optional) — Dictionary of configuration options used to initialize AlignTextConfig. vision_config (dict, optional) — Dictionary of configuration options used to initialize AlignVisionConfig. projection_dim (int, optional, defaults to 640) — Dimensionality of text and vision projection layers. temperature_init_value (float, optional, defaults to 1.0) — The initial value of the temperature parameter. Default is used as per the original ALIGN implementation. initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices. kwargs (optional) — Dictionary of keyword arguments. AlignConfig is the configuration class to store the configuration of a AlignModel. It is used to instantiate a ALIGN model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the ALIGN kakaobrain/align-base architecture. Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information. 
Example:

>>> from transformers import AlignConfig, AlignModel

>>> # Initializing an AlignConfig with kakaobrain/align-base style configuration
>>> configuration = AlignConfig()

>>> # Initializing an AlignModel (with random weights) from the kakaobrain/align-base style configuration
>>> model = AlignModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

>>> # We can also initialize an AlignConfig from an AlignTextConfig and an AlignVisionConfig
>>> from transformers import AlignTextConfig, AlignVisionConfig

>>> # Initializing ALIGN Text and Vision configurations
>>> config_text = AlignTextConfig()
>>> config_vision = AlignVisionConfig()

>>> config = AlignConfig(text_config=config_text, vision_config=config_vision)

dict

Pattern 6: Callbacks. Callbacks are objects that can customize the behavior of the training loop in the PyTorch Trainer that can inspect the training loop state (for progress reporting,
logging on TensorBoard or other ML platforms…) and take decisions (like early stopping). Callbacks are “read only” pieces of code, apart from the TrainerControl object they return, they cannot change anything in the training loop. For customizations that require changes in the training loop, you should subclass Trainer and override the methods you need (see trainer for examples). By default, TrainingArguments.report_to is set to "all", so a Trainer will use the following callbacks. DefaultFlowCallback which handles the default behavior for logging, saving and evaluation. PrinterCallback or ProgressCallback to display progress and print the logs (the first one is used if you deactivate tqdm through the TrainingArguments, otherwise it’s the second one). TensorBoardCallback if tensorboard is accessible (either through PyTorch >= 1.4 or tensorboardX). TrackioCallback if trackio is installed. WandbCallback if wandb is installed. CometCallback if comet_ml is installed. MLflowCallback if mlflow is installed. AzureMLCallback if azureml-sdk is installed. CodeCarbonCallback if codecarbon is installed. ClearMLCallback if clearml is installed. DagsHubCallback if dagshub is installed. FlyteCallback if flyte is installed. DVCLiveCallback if dvclive is installed. SwanLabCallback if swanlab is installed. If a package is installed but you don’t wish to use the accompanying integration, you can change TrainingArguments.report_to to a list of just those integrations you want to use (e.g. ["azure_ml", "wandb"]). The main class that implements callbacks is TrainerCallback. It gets the TrainingArguments used to instantiate the Trainer, can access that Trainer’s internal state via TrainerState, and can take some actions on the training loop via TrainerControl. Available Callbacks Here is the list of the available TrainerCallback in the library: class transformers.integrations.CometCallback < source > ( ) A TrainerCallback that sends the logs to Comet ML. 
setup ( args, state, model )

Setup the optional Comet integration.

Environment:
COMET_MODE (str, optional, defaults to get_or_create): Controls whether to create and log to a new Comet experiment or append to an existing experiment. It accepts the following values:
get_or_create: Decides automatically depending on whether COMET_EXPERIMENT_KEY is set and whether an Experiment with that key already exists.
create: Always create a new Comet Experiment.
get: Always try to append to an existing Comet Experiment. Requires COMET_EXPERIMENT_KEY to be set.
COMET_START_ONLINE (bool, optional): Whether to create an online or offline Experiment.
COMET_PROJECT_NAME (str, optional): Comet project name for experiments.
COMET_LOG_ASSETS (str, optional, defaults to TRUE): Whether or not to log training assets (checkpoints, etc.) to Comet. Can be TRUE or FALSE.

For the other configurable items in the environment, see the Comet documentation.

class transformers.DefaultFlowCallback ( )

A TrainerCallback that handles the default flow of the training loop for logs, evaluation and checkpoints.

class transformers.PrinterCallback ( )

A bare TrainerCallback that just prints the logs.

class transformers.ProgressCallback ( max_str_len: int = 100 )

A TrainerCallback that displays the progress of training or evaluation. You can modify max_str_len to control how long strings are truncated when logging.

class transformers.EarlyStoppingCallback ( early_stopping_patience: int = 1, early_stopping_threshold: float | None = 0.0 )

Parameters

early_stopping_patience (int) — Use with metric_for_best_model to stop training when the specified metric worsens for early_stopping_patience evaluation calls.

early_stopping_threshold (float, optional) — Use with TrainingArguments metric_for_best_model and early_stopping_patience to denote how much the specified metric must improve to satisfy early stopping conditions.

A TrainerCallback that handles early stopping.
This callback depends on the TrainingArguments argument load_best_model_at_end to set best_metric in TrainerState. Note that if the TrainingArguments argument save_steps differs from eval_steps, early stopping will not occur until the next save step.

class transformers.integrations.TensorBoardCallback ( tb_writer = None )

Parameters

tb_writer (SummaryWriter, optional) — The writer to use. Will instantiate one if not set.

A TrainerCallback that sends the logs to TensorBoard.

Environment:
TENSORBOARD_LOGGING_DIR (str, optional, defaults to None): The logging dir to log the results. Default value is os.path.join(args.output_dir, default_logdir())

class transformers.integrations.TrackioCallback ( )

A TrainerCallback that logs metrics to Trackio. It records training metrics, model (and PEFT) configuration, and GPU memory usage. If nvidia-ml-py is installed, GPU power consumption is also tracked.

Requires: pip install trackio

setup ( args, state, model, **kwargs )

Setup the optional Trackio integration. To customize the setup you can also set the arguments project, trackio_space_id and hub_private_repo in TrainingArguments. Please refer to the docstring of for more details.

class transformers.integrations.WandbCallback ( )

A TrainerCallback that logs metrics, media, and model checkpoints to Weights & Biases.

setup ( args, state, model, **kwargs )

Setup the optional Weights & Biases (wandb) integration. One can subclass and override this method to customize the setup if needed. Find more information here. You can also override the following environment variables:

Environment:
WANDB_LOG_MODEL (str, optional, defaults to "false"): Whether to log model and checkpoints during training. Can be "end", "checkpoint" or "false". If set to "end", the model will be uploaded at the end of training. If set to "checkpoint", the checkpoint will be uploaded every args.save_steps.
If set to "false", the model will not be uploaded. Use along with load_best_model_at_end() to upload best model. WANDB_WATCH (str, optional defaults to "false"): Can be "gradients", "all", "parameters", or "false". Set to "all" to log gradients and parameters. WANDB_PROJECT (str, optional, defaults to "huggingface"): Set this to a custom string to store results in a different project. class transformers.integrations.MLflowCallback < source > ( ) A TrainerCallback that sends the logs to MLflow. Can be disabled by setting environment variable DISABLE_MLFLOW_INTEGRATION = TRUE. setup < source > ( args state model ) Setup the optional MLflow integration. Environment: HF_MLFLOW_LOG_ARTIFACTS (str, optional): Whether to use MLflow .log_artifact() facility to log artifacts. This only makes sense if logging to a remote server, e.g. s3 or GCS. If set to True or 1, will copy each saved checkpoint on each save in TrainingArguments’s output_dir to the local or remote artifact storage. Using it without a remote storage will just copy the files to your artifact location. MLFLOW_TRACKING_URI (str, optional): Whether to store runs at a specific path or remote server. Unset by default, which skips setting the tracking URI entirely. MLFLOW_EXPERIMENT_NAME (str, optional, defaults to None): Whether to use an MLflow experiment_name under which to launch the run. Default to None which will point to the Default experiment in MLflow. Otherwise, it is a case sensitive name of the experiment to be activated. If an experiment with this name does not exist, a new experiment with this name is created. MLFLOW_TAGS (str, optional): A string dump of a dictionary of key/value pair to be added to the MLflow run as tags. Example: os.environ['MLFLOW_TAGS']='{"release.candidate": "RC1", "release.version": "2.2.0"}'. MLFLOW_NESTED_RUN (str, optional): Whether to use MLflow nested runs. If set to True or 1, will create a nested run inside the current run. 
MLFLOW_RUN_ID (str, optional): Allows reattaching to an existing run, which can be useful when resuming training from a checkpoint. When the MLFLOW_RUN_ID environment variable is set, start_run attempts to resume a run with the specified run ID and other parameters are ignored.
MLFLOW_FLATTEN_PARAMS (str, optional, defaults to False): Whether to flatten the parameters dictionary before logging.
MLFLOW_MAX_LOG_PARAMS (int, optional): The maximum number of parameters to log in the run.

class transformers.integrations.AzureMLCallback(azureml_run = None)

A TrainerCallback that sends the logs to AzureML.

class transformers.integrations.CodeCarbonCallback()

A TrainerCallback that tracks the CO2 emission of training.

class transformers.integrations.ClearMLCallback()

A TrainerCallback that sends the logs to ClearML.

Environment:
CLEARML_PROJECT (str, optional, defaults to HuggingFace Transformers): ClearML project name.
CLEARML_TASK (str, optional, defaults to Trainer): ClearML task name.
CLEARML_LOG_MODEL (bool, optional, defaults to False): Whether to log models as artifacts during training.

class transformers.integrations.DagsHubCallback()

A TrainerCallback that logs to DagsHub. Extends MLflowCallback.

setup(*args, **kwargs)

Setup the DagsHub logging integration.

Environment:
HF_DAGSHUB_LOG_ARTIFACTS (str, optional): Whether to save the data and model artifacts for the experiment. Defaults to False.

class transformers.integrations.FlyteCallback(save_log_history: bool = True, sync_checkpoints: bool = True)

Parameters:
save_log_history (bool, optional, defaults to True): When set to True, the training logs are saved as a Flyte Deck.
sync_checkpoints (bool, optional, defaults to True): When set to True, checkpoints are synced with Flyte and can be used to resume training in the case of an interruption.

A TrainerCallback that sends the logs to Flyte.
NOTE: This callback only works within a Flyte task.

Example:

    # Note: This example skips over some setup steps for brevity.
    from flytekit import current_context, task
    from transformers import Trainer
    from transformers.integrations import FlyteCallback

    @task
    def train_hf_transformer():
        # Flyte's checkpoint object is used to resume after an interruption.
        cp = current_context().checkpoint
        trainer = Trainer(..., callbacks=[FlyteCallback()])
        output = trainer.train(resume_from_checkpoint=cp.restore())

class transformers.integrations.DVCLiveCallback(live: typing.Optional[typing.Any] = None, log_model: typing.Union[typing.Literal['all'], bool, NoneType] = None, **kwargs)

Parameters:
live (dvclive.Live, optional, defaults to None): Optional Live instance. If None, a new instance will be created using **kwargs.
log_model (Union[Literal["all"], bool], optional, defaults to None): Whether to use dvclive.Live.log_artifact() to log checkpoints created by Trainer. If set to True, the final checkpoint is logged at the end of training. If set to "all", the entire TrainingArguments output_dir is logged at each checkpoint.

A TrainerCallback that sends the logs to DVCLive. Use the environment variables below in setup to configure the integration. To customize this callback beyond those environment variables, see here.

setup(args, state, model)

Setup the optional DVCLive integration. To customize this callback beyond the environment variables below, see here.

Environment:
HF_DVCLIVE_LOG_MODEL (str, optional): Whether to use dvclive.Live.log_artifact() to log checkpoints created by Trainer. If set to True or 1, the final checkpoint is logged at the end of training. If set to all, the entire TrainingArguments output_dir is logged at each checkpoint.

class transformers.integrations.SwanLabCallback()

A TrainerCallback that logs metrics, media, and model checkpoints to SwanLab.

setup(args, state, model, **kwargs)

Setup the optional SwanLab (swanlab) integration. One can subclass and override this method to customize the setup if needed. Find more information here. You can also override the following environment variables. Find more information about environment variables here.

Environment:
SWANLAB_API_KEY (str, optional, defaults to None): Cloud API key. During login, this environment variable is checked first. If it doesn't exist, the system checks if the user is already logged in. If not, the login process is initiated. If a string is passed to the login interface, this environment variable is ignored. If the user is already logged in, this environment variable takes precedence over locally stored login information.
SWANLAB_PROJECT (str, optional, defaults to None): Set this to a custom string to store results in a different project. If not specified, the name of the current running directory is used.
SWANLAB_LOG_DIR (str, optional, defaults to swanlog): The storage path for log files when running in local mode. By default, logs are saved in a folder named swanlog under the working directory.
SWANLAB_MODE (Literal["local", "cloud", "disabled"], optional, defaults to cloud): SwanLab's parsing mode, which involves callbacks registered by the operator. Currently, there are three modes: local, cloud, and disabled. Note: case-sensitive. Find more information here.
SWANLAB_LOG_MODEL (str, optional, defaults to None): SwanLab does not currently support the save mode functionality. This feature will be available in a future release.
SWANLAB_WEB_HOST (str, optional, defaults to None): Web address for the SwanLab cloud environment for the private version (it is free).
SWANLAB_API_HOST (str, optional, defaults to None): API address for the SwanLab cloud environment for the private version (it is free).

TrainerCallback

class transformers.TrainerCallback()

Parameters:
args (TrainingArguments): The training arguments used to instantiate the Trainer.
state (TrainerState): The current state of the Trainer.
control (TrainerControl): The object that is returned to the Trainer and can be used to make some decisions.
model (PreTrainedModel or torch.nn.Module): The model being trained.
processing_class (PreTrainedTokenizer or BaseImageProcessor or ProcessorMixin or FeatureExtractionMixin): The processing class used for encoding the data. Can be a tokenizer, a processor, an image processor or a feature extractor.
optimizer (torch.optim.Optimizer): The optimizer used for the training steps.
lr_scheduler (torch.optim.lr_scheduler.LambdaLR): The scheduler used for setting the learning rate.
train_dataloader (torch.utils.data.DataLoader, optional): The current dataloader used for training.
eval_dataloader (torch.utils.data.DataLoader, optional): The current dataloader used for evaluation.
metrics (dict[str, float]): The metrics computed by the last evaluation phase. Those are only accessible in the event on_evaluate.
logs (dict[str, float]): The values to log. Those are only accessible in the event on_log.

A class for objects that will inspect the state of the training loop at some events and take some decisions. At each of those events, the arguments listed under Parameters above are available.

The control object is the only one that can be changed by the callback, in which case the event that changes it should return the modified version.

The arguments args, state and control are positional for all events; all the others are grouped in kwargs. You can unpack the ones you need in the signature of the event using them. As an example, see the code of the simple PrinterCallback.

Example:

    from transformers import TrainerCallback

    class PrinterCallback(TrainerCallback):
        def on_log(self, args, state, control, logs=None, **kwargs):
            # Drop the total_flos entry before printing.
            _ = logs.pop("total_flos", None)
            # Print only on the local main process.
            if state.is_local_process_zero:
                print(logs)

on_epoch_begin(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the beginning of an epoch.
on_epoch_end(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the end of an epoch.
on_evaluate(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called after an evaluation phase.
on_init_end(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the end of the initialization of the Trainer.
on_log(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called after logging the last logs.
on_optimizer_step(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called after the optimizer step but before gradients are zeroed out. Useful for monitoring gradients.
on_pre_optimizer_step(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called before the optimizer step but after gradient clipping. Useful for monitoring gradients.
on_predict(args: TrainingArguments, state: TrainerState, control: TrainerControl, metrics, **kwargs): Event called after a successful prediction.
on_prediction_step(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called after a prediction step.
on_push_begin(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called before pushing the model to the hub, at the beginning of Trainer.push_to_hub and Trainer._push_from_checkpoint.
on_save(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called after a checkpoint save.
on_step_begin(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the beginning of a training step. If using gradient accumulation, one training step might take several inputs.
on_step_end(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the end of a training step. If using gradient accumulation, one training step might take several inputs.
on_substep_end(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the end of a substep during gradient accumulation.
on_train_begin(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the beginning of training.
on_train_end(args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs): Event called at the end of training.

Here is an example of how to register a custom callback with the PyTorch Trainer:

    class MyCallback(TrainerCallback):
        "A callback that prints a message at the beginning of training"

        def on_train_begin(self, args, state, control, **kwargs):
            print("Starting training")


    trainer = Trainer(
        model,
        args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        callbacks=[MyCallback],  # We can either pass the callback class this way or an instance of it (MyCallback())
    )

Another way to register a callback is to call trainer.add_callback() as follows:

    trainer = Trainer(...)
    trainer.add_callback(MyCallback)
    # Alternatively, we can pass an instance of the callback class
    trainer.add_callback(MyCallback())

TrainerState

class transformers.TrainerState(epoch: float | None = None, global_step: int = 0, max_steps: int = 0, logging_steps: int = 500, eval_steps: int = 500, save_steps: int = 500, train_batch_size: int | None = None, num_train_epochs: int = 0, num_input_tokens_seen: int = 0, total_flos: float = 0, log_history: list = None, best_metric: float | None = None, best_global_step: int | None = None, best_model_checkpoint: str | None = None, is_local_process_zero: bool = True, is_world_process_zero: bool = True, is_hyper_param_search: bool = False, trial_name: str | None = None, trial_params: dict[str, str | float | int | bool] | None = None, stateful_callbacks: list['TrainerCallback'] | None = None)

Parameters:
epoch (float, optional): Only set during training; represents the epoch the training is at (the decimal part being the percentage of the current epoch completed).
global_step (int, optional, defaults to 0): During training, represents the number of update steps completed.
max_steps (int, optional, defaults to 0): The number of update steps to do during the current training.
logging_steps (int, optional, defaults to 500): Log every X update steps.
eval_steps (int, optional): Run an evaluation every X steps.
save_steps (int, optional, defaults to 500): Save a checkpoint every X update steps.
train_batch_size (int, optional): The batch size for the training dataloader. Only needed when auto_find_batch_size has been used.
num_input_tokens_seen (int, optional, defaults to 0): When tracking the input tokens, the number of tokens seen during training (number of input tokens, not the number of prediction tokens).
total_flos (float, optional, defaults to 0): The total number of floating-point operations done by the model since the beginning of training (stored as floats to avoid overflow).
log_history (list[dict[str, float]], optional): The list of logs done since the beginning of training.
best_metric (float, optional): When tracking the best model, the value of the best metric encountered so far.
best_global_step (int, optional): When tracking the best model, the step at which the best metric was encountered. Used for setting best_model_checkpoint.
best_model_checkpoint (str, optional): When tracking the best model, the name of the checkpoint for the best model encountered so far.
is_local_process_zero (bool, optional, defaults to True): Whether or not this process is the local (e.g., on one machine if training in a distributed fashion on several machines) main process.
is_world_process_zero (bool, optional, defaults to True): Whether or not this process is the global main process (when training in a distributed fashion on several machines, this is only going to be True for one process).
is_hyper_param_search (bool, optional, defaults to False): Whether we are in the process of a hyperparameter search using Trainer.hyperparameter_search. This will impact the way data will be logged in TensorBoard.
stateful_callbacks (list[StatefulTrainerCallback], optional): Callbacks attached to the Trainer that should have their states saved or restored. Relevant callbacks should implement a state and a from_state function.

A class containing the Trainer inner state that will be saved along with the model and optimizer when checkpointing and passed to the TrainerCallback.

Throughout this class, one step is to be understood as one update step. When using gradient accumulation, one update step may require several forward and backward passes: if you use gradient_accumulation_steps=n, then one update step requires going through n batches.

compute_steps(args, max_steps): Calculates and stores the absolute value for logging, eval, and save steps, based on whether each was given as a proportion.
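The step bookkeeping described above (intervals given as a proportion, and update steps under gradient accumulation) can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's code: the helper names resolve_steps and update_steps_per_epoch are hypothetical, values below 1 are assumed to be read as a fraction of max_steps and rounded up, and a trailing partial group of batches is assumed to still count as one update step.

```python
import math


def resolve_steps(value: float, max_steps: int) -> int:
    """Turn a logging/eval/save interval into an absolute step count.

    Assumption: values below 1 are a fraction of max_steps, rounded up.
    """
    if value < 1:
        return math.ceil(max_steps * value)
    return int(value)


def update_steps_per_epoch(num_batches: int, gradient_accumulation_steps: int) -> int:
    """Optimizer updates in one epoch.

    Assumption: a trailing partial group of batches still yields one update.
    """
    return math.ceil(num_batches / gradient_accumulation_steps)


max_steps = 1000
print(resolve_steps(0.25, max_steps))  # interval given as a proportion of max_steps
print(resolve_steps(500, max_steps))   # interval given as an absolute count
print(update_steps_per_epoch(num_batches=100, gradient_accumulation_steps=8))
```

With gradient_accumulation_steps=8, the 100 batches above are consumed in groups of 8, so global_step advances once per group rather than once per batch.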
init_training_references(trainer, max_steps, num_train_epochs, trial): Stores the initial training references needed in self.
load_from_json(json_path: str): Create an instance from the content of json_path.
save_to_json(json_path: str): Save the content of this instance in JSON format inside json_path.

TrainerControl

class transformers.TrainerControl(should_training_stop: bool = False, should_epoch_stop: bool = False, should_save: bool = False, should_evaluate: bool = False, should_log: bool = False)

Parameters:
should_training_stop (bool, optional, defaults to False): Whether or not the training should be interrupted. If True, this variable will not be set back to False. The training will just stop.
should_epoch_stop (bool, optional, defaults to False): Whether or not the current epoch should be interrupted. If True, this variable will be set back to False at the beginning of the next epoch.
should_save (bool, optional, defaults to False): Whether or not the model should be saved at this step. If True, this variable will be set back to False at the beginning of the next step.
should_evaluate (bool, optional, defaults to False): Whether or not the model should be evaluated at this step. If True, this variable will be set back to False at the beginning of the next step.
should_log (bool, optional, defaults to False): Whether or not the logs should be reported at this step. If True, this variable will be set back to False at the beginning of the next step.

A class that handles the Trainer control flow. This class is used by the TrainerCallback to activate some switches in the training loop.
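To illustrate how a callback can activate one of the TrainerControl switches, here is a minimal sketch of patience-based stopping. It is not the implementation of EarlyStoppingCallback: the class name PatienceStopper and the metric key eval_loss are hypothetical, and types.SimpleNamespace objects stand in for TrainingArguments, TrainerState and TrainerControl so the snippet runs without the library installed. In real use the class would subclass transformers.TrainerCallback and the Trainer would supply the real objects.

```python
from types import SimpleNamespace


class PatienceStopper:
    """Hypothetical callback: stop when eval_loss fails to improve `patience` times in a row.

    In real use this would subclass transformers.TrainerCallback.
    """

    def __init__(self, patience: int = 2, threshold: float = 0.0):
        self.patience = patience
        self.threshold = threshold
        self.best = None
        self.bad_evals = 0

    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        loss = (metrics or {}).get("eval_loss")
        if loss is None:
            return control
        if self.best is None or self.best - loss > self.threshold:
            self.best = loss      # improvement: remember it and reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1   # no improvement at this evaluation
        if self.bad_evals >= self.patience:
            control.should_training_stop = True  # the switch the training loop checks
        return control


# Stand-ins for the real objects (assumptions for this demo only).
args = SimpleNamespace()
state = SimpleNamespace(global_step=0)
control = SimpleNamespace(should_training_stop=False)

callback = PatienceStopper(patience=2)
stopped_at = None
for step, loss in enumerate([0.9, 0.7, 0.71, 0.72, 0.5], start=1):
    state.global_step = step
    control = callback.on_evaluate(args, state, control, metrics={"eval_loss": loss})
    if control.should_training_stop:
        stopped_at = step
        break

print(stopped_at)
```

The event returns the modified control object, matching the rule that control is the only argument a callback may change.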

TrainingArguments.report_to

Pattern 7: class transformers.integrations.MLflowCallback()

A TrainerCallback that sends the logs to MLflow. Can be disabled by setting environment variable DISABLE_MLFLOW_INTEGRATION = TRUE.

setup(args, state, model)

Setup the optional MLflow integration.

Environment:
HF_MLFLOW_LOG_ARTIFACTS (str, optional): Whether to use the MLflow .log_artifact() facility to log artifacts. This only makes sense if logging to a remote server, e.g. S3 or GCS. If set to True or 1, each saved checkpoint is copied on each save from the TrainingArguments output_dir to the local or remote artifact storage. Using it without remote storage will just copy the files to your artifact location.
MLFLOW_TRACKING_URI (str, optional): Whether to store runs at a specific path or remote server. Unset by default, which skips setting the tracking URI entirely.
MLFLOW_EXPERIMENT_NAME (str, optional, defaults to None): The MLflow experiment_name under which to launch the run. Defaults to None, which points to the Default experiment in MLflow. Otherwise, it is the case-sensitive name of the experiment to be activated. If an experiment with this name does not exist, a new experiment with this name is created.
MLFLOW_TAGS (str, optional): A string dump of a dictionary of key/value pairs to be added to the MLflow run as tags. Example: os.environ['MLFLOW_TAGS']='{"release.candidate": "RC1", "release.version": "2.2.0"}'.
MLFLOW_NESTED_RUN (str, optional): Whether to use MLflow nested runs. If set to True or 1, a nested run is created inside the current run.
MLFLOW_RUN_ID (str, optional): Allows reattaching to an existing run, which can be useful when resuming training from a checkpoint. When the MLFLOW_RUN_ID environment variable is set, start_run attempts to resume a run with the specified run ID and other parameters are ignored.
MLFLOW_FLATTEN_PARAMS (str, optional, defaults to False): Whether to flatten the parameters dictionary before logging.
MLFLOW_MAX_LOG_PARAMS (int, optional): The maximum number of parameters to log in the run.

DISABLE_MLFLOW_INTEGRATION = TRUE
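
The MLflow environment variables described above are plain strings, so a dictionary of tags has to be serialized before it is assigned. The sketch below uses only the standard library and shows one way to do that; the experiment name is a hypothetical placeholder, and it assumes the variables are set before training starts, since the text describes them as being read during setup.

```python
import json
import os

# Tags are passed as a JSON string, not as a dict.
tags = {"release.candidate": "RC1", "release.version": "2.2.0"}
os.environ["MLFLOW_TAGS"] = json.dumps(tags)

# Boolean-like switches are strings; the text above names "True" or "1" as enabling values.
os.environ["MLFLOW_NESTED_RUN"] = "1"
os.environ["HF_MLFLOW_LOG_ARTIFACTS"] = "1"
os.environ["MLFLOW_EXPERIMENT_NAME"] = "my-experiment"  # hypothetical experiment name

# Round-trip check: the stored string decodes back to the original dictionary.
decoded = json.loads(os.environ["MLFLOW_TAGS"])
print(decoded)
```

Using json.dumps avoids hand-written quoting mistakes in the tag string.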

Pattern 8: Apertus (Transformers documentation)

This model was released on 2025-09-02 and added to Hugging Face Transformers on 2025-08-28.

Apertus

Overview

Apertus is a family of large language models from the Swiss AI Initiative. Coming soon.

The example below demonstrates how to generate text with Pipeline or the AutoModel, and from the command line.
Pipeline:

    import torch
    from transformers import pipeline

    # Named generator so the imported pipeline function is not shadowed.
    generator = pipeline(
        task="text-generation",
        model="swiss-ai/Apertus-8B",
        dtype=torch.bfloat16,
        device=0,
    )
    generator("Plants create energy through a process known as")

ApertusConfig

class transformers.ApertusConfig(vocab_size: int | None = 131072, hidden_size: int | None = 4096, intermediate_size: int | None = 14336, num_hidden_layers: int | None = 32, num_attention_heads: int | None = 32, num_key_value_heads: int | None = None, hidden_act: str | None = 'xielu', max_position_embeddings: int | None = 65536, initializer_range: float | None = 0.02, rms_norm_eps: float | None = 1e-05, use_cache: bool | None = True, pad_token_id: int | None = 3, bos_token_id: int | None = 1, eos_token_id: int | None = 2, tie_word_embeddings: bool | None = False, rope_parameters: transformers.modeling_rope_utils.RopeParameters | None = {'rope_type': 'llama3', 'rope_theta': 12000000.0, 'factor': 8.0, 'original_max_position_embeddings': 8192, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0}, attention_bias: bool | None = False, attention_dropout: float | None = 0.0, **kwargs)

Parameters:
vocab_size (int, optional, defaults to 131072): Vocabulary size of the Apertus model. Defines the number of different tokens that can be represented by the input_ids passed when calling ApertusModel.
hidden_size (int, optional, defaults to 4096): Dimension of the hidden representations.
intermediate_size (int, optional, defaults to 14336): Dimension of the MLP representations.
num_hidden_layers (int, optional, defaults to 32): Number of hidden layers in the Transformer decoder.
num_attention_heads (int, optional, defaults to 32): Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (int, optional): The number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out this paper. If it is not specified, it defaults to num_attention_heads.
hidden_act (str or function, optional, defaults to "xielu"): The non-linear activation function (function or string) in the decoder.
max_position_embeddings (int, optional, defaults to 65536): The maximum sequence length that this model might ever be used with. Apertus supports up to 65536 tokens.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (float, optional, defaults to 1e-05): The epsilon used by the RMS normalization layers.
use_cache (bool, optional, defaults to True): Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
pad_token_id (int, optional, defaults to 3): Padding token id.
bos_token_id (int, optional, defaults to 1): Beginning of stream token id.
eos_token_id (int, optional, defaults to 2): End of stream token id.
tie_word_embeddings (bool, optional, defaults to False): Whether to tie weight embeddings.
rope_parameters (RopeParameters, optional): Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
attention_bias (bool, optional, defaults to False): Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.

This is the configuration class to store the configuration of an ApertusModel. It is used to instantiate an Apertus model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Apertus-8B, e.g. swiss-ai/Apertus-8B.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

    >>> from transformers import ApertusModel, ApertusConfig

    >>> # Initializing an Apertus-8B style configuration
    >>> configuration = ApertusConfig()

    >>> # Initializing a model from the Apertus-8B style configuration
    >>> model = ApertusModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config

ApertusModel

class transformers.ApertusModel(config: ApertusConfig)

Parameters:
config (ApertusConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Apertus Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
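
The relationship between num_attention_heads and num_key_value_heads described in the ApertusConfig parameters can be sketched in plain Python. This is an illustration only, not the model's code: the helper names are hypothetical, query heads are assumed to be grouped contiguously, and mean_pool_heads works on small lists of floats standing in for per-head weight vectors.

```python
from typing import List, Optional


def attention_kind(num_attention_heads: int, num_key_value_heads: Optional[int] = None) -> str:
    """Classify the attention variant implied by the two head counts."""
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads  # documented fallback
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def kv_head_for_query_head(q_head: int, num_attention_heads: int, num_key_value_heads: int) -> int:
    """Index of the key/value head shared by a query head (assumes contiguous groups)."""
    group_size = num_attention_heads // num_key_value_heads
    return q_head // group_size


def mean_pool_heads(heads: List[List[float]], num_key_value_heads: int) -> List[List[float]]:
    """Average each contiguous group of per-head vectors into one vector."""
    group_size = len(heads) // num_key_value_heads
    pooled = []
    for g in range(num_key_value_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


print(attention_kind(32))        # num_key_value_heads left unset
print(attention_kind(32, 8))     # fewer key/value heads than query heads
print(kv_head_for_query_head(5, num_attention_heads=32, num_key_value_heads=8))
print(mean_pool_heads([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], num_key_value_heads=2))
```

With 32 query heads and 8 key/value heads, each key/value head is shared by a group of 4 query heads.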
forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: transformers.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, cache_position: torch.LongTensor | None = None, use_cache: bool | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

Parameters:
input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (~cache_utils.Cache, optional): Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
cache_position (torch.LongTensor of shape (sequence_length), optional): Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Returns: transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ApertusConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True): It is a Cache instance. For more details, see our kv cache guide. Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally, if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The ApertusModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

ApertusForCausalLM

class transformers.ApertusForCausalLM(config)

Parameters:
config (ApertusConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The Apertus Model for causal language modeling.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: transformers.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, cache_position: torch.LongTensor | None = None, logits_to_keep: int | torch.Tensor = 0, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of input sequence tokens in the vocabulary. Padding is ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values are in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (~cache_utils.Cache, optional): Pre-computed hidden-states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input (see the KV cache guide). If no past_key_values is passed, a DynamicCache is initialized by default. The model outputs the same cache format that is fed as input. If past_key_values is used, pass only the unprocessed input_ids (those whose past key value states have not yet been given to this model), of shape (batch_size, unprocessed_length), instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Instead of passing input_ids, you can pass an embedded representation directly. This is useful if you want more control over how input_ids indices are converted into vectors than the model's internal embedding lookup matrix provides.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size - 1] or -100 (see the input_ids description). Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.vocab_size - 1].
use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
cache_position (torch.LongTensor of shape (sequence_length), optional): Indices depicting the position of the input sequence tokens in the sequence. Unlike position_ids, this tensor is not affected by padding. It is used to update the cache at the correct position and to infer the complete sequence length.
logits_to_keep (Union[int, torch.Tensor], optional, defaults to 0): If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only the last token's logits are needed for generation, and calculating them only for that token saves memory, which becomes significant for long sequences or large vocabulary sizes. If a torch.Tensor, it must be 1D and contain the indices to keep in the sequence length dimension. This is useful when using a packed tensor format (single dimension for batch and sequence length).

Returns

transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)

A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ApertusConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)): Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True): A Cache instance (see the KV cache guide). Contains pre-computed hidden-states (keys and values in the self-attention blocks) that can be used (see the past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The ApertusForCausalLM forward method overrides the __call__ special method. Although the recipe for the forward pass needs to be defined within this function, call the Module instance rather than forward() directly: the instance call runs the pre- and post-processing steps, while a direct forward() call silently skips them.

Example:

>>> from transformers import AutoTokenizer, ApertusForCausalLM

>>> model = ApertusForCausalLM.from_pretrained("swiss-ai/Apertus-8B")
>>> tokenizer = AutoTokenizer.from_pretrained("swiss-ai/Apertus-8B")

>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")

>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
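The logits_to_keep rule described for ApertusForCausalLM.forward can be sketched in plain Python. This is an illustrative model of the selection rule only, not the library's code: select_positions and logits_bytes are hypothetical helpers, and the vocabulary size and 4-byte value width used in the memory estimate are assumed numbers.

```python
def select_positions(sequence_length, logits_to_keep=0):
    """Return the sequence positions for which logits would be computed."""
    positions = list(range(sequence_length))
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return positions[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit 1D indices to keep.
    return [positions[i] for i in logits_to_keep]


def logits_bytes(batch_size, kept_positions, vocab_size, bytes_per_value=4):
    """Rough size of a logits tensor of shape (batch, kept, vocab)."""
    return batch_size * kept_positions * vocab_size * bytes_per_value


print(select_positions(6))                            # [0, 1, 2, 3, 4, 5]
print(select_positions(6, logits_to_keep=1))          # [5]
print(select_positions(6, logits_to_keep=[0, 3, 5]))  # [0, 3, 5]

# Assumed numbers: one 4096-token sequence and a 100,000-entry vocabulary.
full = logits_bytes(1, 4096, 100_000)
last_only = logits_bytes(1, 1, 100_000)
print(full // last_only)                              # 4096
```

Under these assumptions, keeping only the last position shrinks the logits tensor by a factor equal to the sequence length, which is why generation only needs logits_to_keep=1.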
ApertusForTokenClassification

class transformers.ApertusForTokenClassification(config)

forward(input_ids: torch.LongTensor | None = None, attention_mask: torch.Tensor | None = None, position_ids: torch.LongTensor | None = None, past_key_values: transformers.cache_utils.Cache | None = None, inputs_embeds: torch.FloatTensor | None = None, labels: torch.LongTensor | None = None, use_cache: bool | None = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]) → transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of input sequence tokens in the vocabulary. Padding is ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values are in [0, 1]: 1 for tokens that are not masked, 0 for tokens that are masked.
position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of the position of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
past_key_values (~cache_utils.Cache, optional): Pre-computed hidden-states (keys and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input (see the KV cache guide). If no past_key_values is passed, a DynamicCache is initialized by default. The model outputs the same cache format that is fed as input. If past_key_values is used, pass only the unprocessed input_ids (those whose past key value states have not yet been given to this model), of shape (batch_size, unprocessed_length), instead of all input_ids of shape (batch_size, sequence_length).
inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional): Instead of passing input_ids, you can pass an embedded representation directly. This is useful if you want more control over how input_ids indices are converted into vectors than the model's internal embedding lookup matrix provides.
labels (torch.LongTensor of shape (batch_size, sequence_length), optional): Labels for computing the token classification loss. Indices should either be in [0, ..., config.num_labels - 1] or -100. Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.num_labels - 1].
use_cache (bool, optional): If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

Returns

transformers.modeling_outputs.TokenClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.TokenClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)): Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The GenericForTokenClassification forward method overrides the __call__ special method. Although the recipe for the forward pass needs to be defined within this function, call the Module instance rather than forward() directly: the instance call runs the pre- and post-processing steps, while a direct forward() call silently skips them.
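How -100 labels drop tokens from the classification loss can be sketched without any framework. The function below is a hypothetical helper, not the library's code: it assumes the loss is a standard mean cross-entropy over per-token logits that skips every token labeled -100, and it returns 0.0 when every token is skipped, which is an arbitrary choice for this sketch.

```python
import math

IGNORE_INDEX = -100  # label value that excludes a token from the loss


def token_classification_loss(logits, labels):
    """Mean cross-entropy over tokens whose label is not IGNORE_INDEX.

    logits: one list of num_labels raw scores per token (before softmax).
    labels: one class index per token, or IGNORE_INDEX.
    """
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # padding or otherwise unlabeled token
        # Numerically stable log-sum-exp of the scores.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]  # negative log-probability
        counted += 1
    # Arbitrary choice for this sketch: 0.0 when nothing is counted.
    return total / counted if counted else 0.0


# Three tokens, three classes; the middle token is ignored.
logits = [[2.0, 0.0, 0.0], [9.0, -9.0, 4.0], [0.0, 3.0, 0.0]]
labels = [0, IGNORE_INDEX, 1]
loss = token_classification_loss(logits, labels)
print(round(loss, 4))
```

Changing the scores of the ignored token leaves the loss unchanged, which is the behavior the labels description relies on for padded positions.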

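import torch
from transformers import pipeline

# Build a text-generation pipeline for the Apertus base model.
# A distinct variable name avoids shadowing the imported pipeline() factory.
generator = pipeline(
    task="text-generation",
    model="swiss-ai/Apertus-8B",
    dtype=torch.bfloat16,  # half-precision weights to reduce memory use
    device=0,  # first accelerator device, e.g. cuda:0
)
generator("Plants create energy through a process known as")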

Reference Files

This skill includes comprehensive documentation in references/:

transformers.md - Transformers documentation

Use view to read specific reference files when detailed information is needed.

Working with This Skill

For Beginners

Start with the getting_started or tutorials reference files for foundational concepts.

For Specific Features

Use the appropriate category reference file (api, guides, etc.) for detailed information.

For Code Examples

The quick reference section above contains common patterns extracted from the official docs.

Resources

references/

Organized documentation extracted from official sources. These files contain:

Detailed explanations
Code examples with language annotations
Links to original documentation
Table of contents for quick navigation

scripts/

Add helper scripts here for common automation tasks.

assets/

Add templates, boilerplate, or example projects here.

Notes

This skill was automatically generated from official documentation
Reference files preserve the structure and examples from source docs
Code examples include language detection for better syntax highlighting
Quick reference patterns are extracted from common usage examples in the docs

Updating

To refresh this skill with updated documentation:

Re-run the scraper with the same configuration
The skill will be rebuilt with the latest information

Adoption

omosb1-sys/huggingface
