Introduction

The Hugging Face Inference API makes it easy to send prompts to large language models (LLMs) hosted on the Hugging Face Hub. By combining this with FastAPI—a modern Python web framework—you can build scalable, production-ready APIs that serve LLM-powered responses to your applications.

In this tutorial, you’ll learn how to integrate the Hugging Face Inference API into a FastAPI app, securely store your API key, and create an endpoint to handle user input and return AI-generated text.

🔑 Why use FastAPI?

FastAPI is a modern, high-performance web framework for building APIs with Python. It’s known for:

✅ Easy-to-use syntax: FastAPI uses Python type hints for automatic data validation and documentation.
✅ Asynchronous support: Great for building responsive, scalable apps.
✅ Built-in OpenAPI support: Automatically generates interactive API documentation.
✅ Integration with Uvicorn: Easy to run locally and in production.

💡 Combining FastAPI with the Hugging Face Inference API allows you to build a robust backend that serves LLM responses to frontend apps, chatbots, or other services.
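
As a quick illustration, here’s a minimal app with a single endpoint (the route and names are just an example). The str type hint on the path parameter gives you validation and documentation for free:

from fastapi import FastAPI

app = FastAPI()

@app.get("/hello/{name}")
def hello(name: str) -> dict:
    # FastAPI validates `name` as a string and documents it automatically
    return {"message": f"Hello, {name}!"}

With the app running, interactive documentation is available at /docs without any extra configuration.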

✅ Prerequisites

Before you start, make sure you have:

✅ Python 3.8 or later installed
✅ A Hugging Face account and an access token (create one under Settings → Access Tokens on the Hub)
✅ Basic familiarity with Python and REST APIs

🛠️ 1️⃣ Install Libraries

First, install the Python packages huggingface_hub, python-dotenv, fastapi, and uvicorn:

pip install huggingface_hub python-dotenv fastapi uvicorn

This also pulls in the dependencies needed to interact with the Inference API; uvicorn is the ASGI server you’ll use to run the FastAPI app.

📦 2️⃣ Import Packages

Create a file named main.py and import the required packages:

import os
from dotenv import load_dotenv
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from huggingface_hub import InferenceClient

🔑 3️⃣ Store Your API Key

To use the Hugging Face Inference API, you need to pass your access token as the API key. It’s important to keep the access token safe and out of version control. A recommended way to manage it locally is by using a .env file and the python-dotenv package.

Create a file named .env in the root of your project and add your token:

HF_API_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx

In your Python code, load the key:

load_dotenv()
api_key = os.getenv("HF_API_TOKEN")
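
If you want the app to fail fast when the variable is missing, you can add a small guard right after loading it (the error message here is just a suggestion):

if not api_key:
    # Stop at startup with a clear message instead of failing later on an API call
    raise RuntimeError("HF_API_TOKEN is not set; add it to your .env file")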

This approach makes your credentials easy to manage across different environments and keeps them out of your codebase. Remember to add .env to your .gitignore so the token is never committed.

🚀 4️⃣ Initialize the Inference Client

Next, initialize the InferenceClient with your API key and select the desired inference provider:

client = InferenceClient(
    provider="sambanova",
    api_key=api_key,
)
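
Before wiring the client into an endpoint, you can sanity-check it with a one-off chat completion. The model name below is only an example; use any chat model your chosen provider supports:

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(completion.choices[0].message.content)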

💬 5️⃣ Create the FastAPI Endpoint

Now, let’s define a FastAPI app that includes an endpoint to receive a user prompt and return a response from the LLM:
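
The following is a minimal sketch of such an endpoint, continuing in main.py from the imports and client above. The route name, request shape, and model are assumptions for illustration, not fixed choices:

class PromptRequest(BaseModel):
    prompt: str

app = FastAPI()

@app.post("/generate")
def generate(request: PromptRequest) -> dict:
    try:
        completion = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",  # example model
            messages=[{"role": "user", "content": request.prompt}],
        )
    except Exception as exc:
        # Surface provider/API failures as a 502 instead of a generic 500
        raise HTTPException(status_code=502, detail=str(exc))
    return {"response": completion.choices[0].message.content}

Run the app locally with uvicorn main:app --reload and try the endpoint from the interactive docs at http://127.0.0.1:8000/docs.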
