The library that turned model definitions into infrastructure

huggingface/transformers began life as a practical way to load pretrained NLP models. It is now closer to shared infrastructure for model definitions. The README calls it the model-definition framework for text, vision, audio, video, and multimodal models, covering both inference and training. That wording matters. Transformers is not only a pile of model wrappers; it is the contract many tools use to agree on how a model is configured, loaded, tokenized, and exposed.

The important shift is ecosystem position. If a model definition lands in Transformers, it can often move into training stacks such as DeepSpeed and FSDP, inference engines such as vLLM and SGLang, and adjacent runtimes such as llama.cpp or MLX. The library is now a hub-facing model compatibility layer as much as an end-user API.

As of 2026-06, the repository had 161,488 stars, 33,468 forks, and 2,411 open issues. The latest release observed during writing was v5.11.0, published on 2026-06-10. That release added new model support such as DiffusionGemma and DeepSeek-V3.2, plus kernel, parallelization, and continuous batching work. Treat these numbers and release names as writing snapshots; the data card on this page is the refreshable layer.

Install

The current README says Transformers works with Python 3.10+ and PyTorch 2.4+. The documented torch install path is:

pip install "transformers[torch]"

The README also documents uv:

uv pip install "transformers[torch]"

For source installs, clone the repository and install the torch extra from the checkout:

git clone https://github.com/huggingface/transformers.git
cd transformers
pip install '.[torch]'

Do not read that as a generic “pip install and production is done” recipe. It installs the library. It does not choose your serving engine, GPU memory policy, batching strategy, quantization format, tokenizer behavior, or model license.
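A quick way to confirm that an environment meets the minimums quoted above is to compare installed versions against them. The following is a minimal sketch using only the standard library. The helper names (version_tuple, meets_minimum, check_environment) are hypothetical and are not part of Transformers, and the simple parser handles only the leading major.minor numbers of a version string.

```python
import sys
from importlib import metadata


def version_tuple(version: str, parts: int = 2) -> tuple:
    """Parse the leading numeric components, e.g. '2.4.1+cu121' -> (2, 4)."""
    numbers = []
    for piece in version.split(".")[:parts]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < parts:
        numbers.append(0)
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True when the installed major.minor is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> dict:
    """Compare this interpreter and its packages with the README minimums."""
    torch_version = installed_version("torch")
    return {
        "python": sys.version_info[:2] >= (3, 10),
        "torch": torch_version is not None and meets_minimum(torch_version, "2.4"),
        "transformers": installed_version("transformers") is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

A False entry only says that a minimum is not met or a package is missing. It says nothing about GPU drivers, CUDA builds, or whether a particular model will fit in memory.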

First use: Pipeline is the front door

The README still makes pipeline the fastest entry point. It handles preprocessing, model loading, and postprocessing for a task:

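from transformers import pipeline

# Downloads the model from the Hub on first use, then reuses the local cache.
generator = pipeline(task="text-generation", model="Qwen/Qwen2.5-1.5B")
# Returns a list of dicts; the completion is under the "generated_text" key.
print(generator("the secret to baking a really good cake is "))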

That is exactly what Transformers is good at: taking a model name from the Hub, hiding the first layer of loading details, and producing a working inference path for text, audio, vision, or multimodal input.

The value is not that every production system should call pipeline forever. The value is that the same model definition can be learned, tested, adapted, and then moved into a more specialized path. pipeline is a front door. It is not a throughput strategy.
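The three stages that pipeline wraps (preprocessing, the model call, and postprocessing) can be pictured with a toy stand-in. This is an illustrative sketch only, not how Transformers implements pipeline. ToyPipeline and the three helper functions are hypothetical, and the "model" here just appends a period to a list of byte values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyPipeline:
    """Chains three swappable stages, mirroring the shape of a task pipeline."""
    preprocess: Callable[[str], List[int]]
    forward: Callable[[List[int]], List[int]]
    postprocess: Callable[[List[int]], str]

    def __call__(self, text: str) -> str:
        ids = self.preprocess(text)        # text -> model inputs
        output_ids = self.forward(ids)     # model inputs -> model outputs
        return self.postprocess(output_ids)  # model outputs -> text


def encode(text: str) -> List[int]:
    """Toy tokenizer: one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def append_period(ids: List[int]) -> List[int]:
    """Toy model: 'generates' a single extra token."""
    return ids + [ord(".")]


def decode(ids: List[int]) -> str:
    """Toy detokenizer: bytes back to text."""
    return bytes(ids).decode("utf-8")


toy = ToyPipeline(encode, append_period, decode)
print(toy("good cake"))  # good cake.
```

Moving off the front door usually means taking over those stages one at a time. In the real library the explicit equivalents are typically AutoTokenizer.from_pretrained, AutoModelForCausalLM.from_pretrained, model.generate, and tokenizer.decode, each of which exposes the choices that pipeline makes on your behalf.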

What Transformers is, and what it is not

Transformers is strongest when you need model coverage, model loading, tokenizer behavior, examples, and a common interface over many architectures. It is the place you go when a new model family appears and you want a Pythonic way to inspect it, run it, fine-tune it, or understand its inputs.

It is not a generic neural-network toolbox. The README says the model files are intentionally not refactored into abstract building blocks, because researchers need to iterate on individual model code. It is also not the best generic training loop. The README points people toward Accelerate for broader machine-learning loops. And it is not automatically the best serving layer for high-throughput LLM deployment.

That boundary is the main thing README skimmers miss. Transformers gives you compatibility and definitions. For raw training primitives, you still need a framework underneath, and the documented install path assumes that framework is PyTorch. For production LLM serving, you often end up with vLLM, SGLang, Text Generation Inference, llama.cpp, or a managed endpoint.

Recent issues show the cost of model coverage

The issue tracker reads like a map of the library’s scope. Recent open issues mention different SAM3 video performance from the original implementation (#46493), CPU and CUDA output differences for PPDocLayoutV3 (#46506), CJK stop strings missed by byte-level tokenizers (#46519), dtype being silently ignored for composite multimodal checkpoints (#46459), CodeLlama tokenizer round-trip regressions (#46491), DeepSeek-Coder tokenizer output problems in v5 (#46489), and cache behavior around chunked prefill (#46421).

These are not random bugs in a small package. They are the natural failure modes of a library that tries to support new architectures, tokenizers, processors, quantization paths, multimodal inputs, and inference optimizations at the same time.
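The byte-level stop-string item is a good example of how such edge cases arise. The sketch below shows one general mechanism, assuming a tokenizer whose tokens are raw byte sequences: a multi-byte character can be split across two tokens, so decoding token by token never produces the stop string. It is an illustration with hypothetical function names, not a diagnosis of issue #46519 and not Transformers code.

```python
from typing import List


def naive_hits_stop(token_bytes: List[bytes], stop: str) -> bool:
    """Decode each token on its own, then search the joined text."""
    text = "".join(tok.decode("utf-8", errors="replace") for tok in token_bytes)
    return stop in text


def buffered_hits_stop(token_bytes: List[bytes], stop: str) -> bool:
    """Join the raw bytes first, decode once, then search."""
    text = b"".join(token_bytes).decode("utf-8", errors="replace")
    return stop in text


stop = "。"                      # U+3002, three bytes in UTF-8
raw = "好。".encode("utf-8")     # six bytes in total
tokens = [raw[:4], raw[4:]]      # hypothetical split in the middle of "。"

print(naive_hits_stop(tokens, stop))     # False
print(buffered_hits_stop(tokens, stop))  # True
```

The per-token path turns the split bytes into replacement characters, so the match is lost even though the full byte stream contains the stop string. ASCII stop strings never hit this case because each character is a single byte.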

The release notes point in the same direction. Version 5.11.0 includes new model families, fp8 and fp4 Triton kernel work, fixes in Qwen vision-language model parallel beam search, continuous batching documentation, vLLM smoke tests in CI, and many compatibility fixes. That is a maintenance profile for a library sitting between research code and production engines.

Compared with PyTorch, TensorFlow, vLLM, and llama.cpp

huggingface/transformers had 161,488 stars as of 2026-06. It is written in Python, licensed under Apache-2.0, and best understood as the model-definition and model-loading layer.

pytorch/pytorch had 100,653 stars as of 2026-06. It is the tensor and autograd framework under much of modern deep learning. You use it to build and train models at a lower level than Transformers.

tensorflow/tensorflow had 195,618 stars as of 2026-06. It remains a broad ML framework with a different ecosystem and deployment history. If you are choosing primitives and training infrastructure, compare TensorFlow with PyTorch, not with Transformers.

vllm-project/vllm had 82,512 stars as of 2026-06. It is a high-throughput LLM serving engine. It is the right comparison when the question is tokens per second, batching, memory use, and serving APIs.

ggml-org/llama.cpp had 116,025 stars as of 2026-06. It is a C and C++ inference runtime that shines in local and edge-oriented LLM inference, often with GGUF models and quantized weights.

The clean mental model: Transformers defines and loads many models; PyTorch and TensorFlow provide ML primitives; vLLM and llama.cpp serve specific inference needs.

Star curve reading

The star-history data is sampled and sparse after the repository became very large, so it should not be used for fine-grained launch stories. The reliable shape is long-term adoption: from late 2018 to 2026, Transformers became one of the most-starred ML repositories on GitHub and a default bridge between the Hugging Face Hub and the rest of the ML stack.

For lower-level ML framework context, see tensorflow/tensorflow. For local model serving and developer workflows, see ollama/ollama. For the broader movement in model tooling, follow LLM tooling and the trending repositories hub.

FAQ

What is Hugging Face Transformers? It is a Python library and model-definition framework for pretrained text, vision, audio, video, and multimodal models, connected closely to the Hugging Face Hub.

How do I install Transformers? The README documents Python 3.10+ and PyTorch 2.4+, with pip install "transformers[torch]" as the main torch install path.

Is Transformers the same as PyTorch? No. PyTorch is the tensor and training framework. Transformers provides model definitions, tokenizers, processors, and higher-level APIs around many pretrained architectures.

Should I use Transformers or vLLM for serving? Use Transformers to load, inspect, test, and adapt models. Use vLLM or another serving engine when throughput, batching, and production serving behavior dominate.

Why do Transformers releases mention so many model-specific fixes? The library supports a very wide model surface. New model families, tokenizers, multimodal processors, cache implementations, and kernels all create edge cases that a narrower runtime would not carry.