📘 Introduction

Many LLM applications need to work with content that already exists outside the prompt. That content might live in a text file, a PDF document, a help page, a product manual, or a company knowledge base. Before an AI app can summarize, search, or answer questions from that content, it needs a clean way to load it into Python.

In this tutorial, you will learn how LangChain document loaders help bring external content into an LLM workflow. We will load a text file, a PDF, and a web page into LangChain Document objects, inspect page_content and metadata, and prepare the documents for RAG, summarization, or search.

💡
Document loaders are often the first step in a RAG pipeline. They turn raw external content into a consistent structure that the rest of your AI app can process.

💡 What are LangChain document loaders?

A LangChain document loader is a component that reads data from a source and returns one or more LangChain Document objects. Each document has two main parts: page_content, which holds the text itself, and metadata, a dictionary that describes where the text came from.

| Field | Meaning | Example |
| --- | --- | --- |
| page_content | The text loaded from the source | A paragraph from a PDF page |
| metadata | Extra information about the source | File path, page number, or URL |
| Document | The standard LangChain container | One loaded web page or PDF page |
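
To make the shape in the table concrete, here is a minimal sketch that uses only the standard library. SimpleDocument and load_text are hypothetical names invented for this illustration. They are simplified stand-ins, not LangChain's own classes, and they only mirror the two fields described above.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Simplified stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str) -> list[SimpleDocument]:
    """Hypothetical loader: read one file and wrap it in a single document."""
    text = Path(path).read_text(encoding="utf-8")
    # Record where the text came from so later steps can cite the source.
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "sample.txt")
    Path(sample_path).write_text("Hello from a file.", encoding="utf-8")
    docs = load_text(sample_path)

print(docs[0].page_content)
print(docs[0].metadata)
```

A real loader does more work, such as handling encodings or splitting a PDF into pages, but what it returns has this same outline: a list of objects with text and a metadata dictionary.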

🧠 Why document loaders matter

Document loaders make different data sources feel consistent. Instead of writing custom parsing code for every file type, you can load content into the same Document format and then pass those documents into the next part of your application.

This is useful for RAG systems, document summarizers, search tools, support assistants, and internal knowledge-base apps. The loader does not answer questions by itself. It prepares your content so another step can split, embed, summarize, or retrieve it.
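
As a rough picture of what a later "split" step might do with loaded text, the sketch below cuts a string into overlapping chunks and copies the source into each chunk's metadata. The function name split_into_chunks, the chunk_size and overlap values, and the dictionary layout are all assumptions made for this illustration. LangChain ships its own text splitters, which this does not reproduce.

```python
def split_into_chunks(text: str, source: str,
                      chunk_size: int = 40, overlap: int = 10) -> list[dict]:
    """Hypothetical splitter: fixed-size character chunks with overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    step = chunk_size - overlap  # how far the window moves each time
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({
            "page_content": piece,
            # Keep the source so every chunk can be traced back to its file.
            "metadata": {"source": source, "start": start},
        })
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "LangChain document loaders help bring external content into LLM apps."
chunks = split_into_chunks(text, source="data/notes.txt")
for chunk in chunks:
    print(chunk["metadata"]["start"], repr(chunk["page_content"]))
```

The point is the division of labor: the loader only produces text plus metadata, and a separate step decides how to cut that text up for embedding or retrieval.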

🔍
A loader is not the same as a retriever. The loader reads documents first. A retriever helps find the most relevant chunks later.

✅ Prerequisites

Before getting started, make sure you have:

☑️ Python installed
☑️ Basic Python knowledge
☑️ A terminal or command prompt
☑️ Internet access for the web page loading example

⚙️1️⃣ Create a project folder

Create a new local project folder for this tutorial:

mkdir langchain-document-loaders
cd langchain-document-loaders

🐍2️⃣ Create a virtual environment

Create a virtual environment and activate it. On macOS or Linux:

python -m venv .venv
source .venv/bin/activate

On Windows, activate it with:

.venv\Scripts\activate
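
If you are unsure whether the activation worked, a quick check from Python can help. This is a small sketch that relies on how environments created with python -m venv behave: inside one, sys.prefix differs from sys.base_prefix. The helper name in_virtual_environment is made up for this example.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python install.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtual_environment())
print("Interpreter:", sys.executable)
```

When the environment is active, the interpreter path should sit inside the .venv folder of your project.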

📦3️⃣ Install libraries

Install the packages we need for text files, PDFs, and web pages:

pip install langchain-community pypdf beautifulsoup4 reportlab
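
To confirm that the install landed in the active environment, you can ask Python which versions it sees. The sketch below uses the standard importlib.metadata module. The helper name installed_versions is an assumption for this example, and the package list simply mirrors the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as they were passed to pip.
PACKAGES = ["langchain-community", "pypdf", "beautifulsoup4", "reportlab"]


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, check that the virtual environment is active and run the pip command again.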

📝4️⃣ Load a text file

First, create a small text file. This gives us a simple source before we move to PDFs and web pages.

mkdir data
echo "LangChain document loaders help bring external content into LLM apps." > data/notes.txt
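
The echo command behaves differently across shells. For example, the Windows Command Prompt writes the surrounding quotes into the file as well. If you prefer a version that behaves the same everywhere, one option is to create the file from Python. This sketch uses the same folder and file name as the commands above and writes the text as UTF-8.

```python
from pathlib import Path

# Create the data folder if it does not exist yet.
data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# Write the sample sentence, ending with a newline like echo would.
notes_path = data_dir / "notes.txt"
notes_path.write_text(
    "LangChain document loaders help bring external content into LLM apps.\n",
    encoding="utf-8",
)

print(f"Wrote {notes_path}")
```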

Now create a file named load_text_file.py and load the text file into a LangChain document.

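from langchain_community.document_loaders import TextLoader

# Read the file as UTF-8 so the result does not depend on the platform default.
loader = TextLoader("data/notes.txt", encoding="utf-8")

# load() returns a list of Document objects; a single text file yields one.
documents = loader.load()

print(f"Number of documents: {len(documents)}")
print("Content:")
print(documents[0].page_content)
print("Metadata:")
print(documents[0].metadata)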

Run the script:

python load_text_file.py

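You should see one document, the text from the file, and metadata that includes the source path. The blank line after the content appears because echo ends the file with a newline and print adds another.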

Number of documents: 1
Content:
LangChain document loaders help bring external content into LLM apps.

Metadata:
{'source': 'data/notes.txt'}
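
Once a loader returns more than one document, printing each field by hand gets tedious. Below is a small sketch of a helper that prints one summary line per document. The name preview_documents and the output format are assumptions for this example, and the demo uses SimpleNamespace objects as stand-ins so the block runs without LangChain. The helper only relies on the page_content and metadata attributes.

```python
from types import SimpleNamespace


def preview_documents(documents, max_chars=60):
    """Build one summary line per document: source, length, and a snippet."""
    lines = []
    for index, doc in enumerate(documents):
        # strip() drops the trailing newline so the snippet stays on one line.
        text = doc.page_content.strip()
        snippet = text[:max_chars] + ("..." if len(text) > max_chars else "")
        source = doc.metadata.get("source", "unknown")
        lines.append(
            f"[{index}] source={source} chars={len(doc.page_content)} text={snippet!r}"
        )
    return lines


# Stand-in object with the same two attributes as a loaded document.
sample_docs = [
    SimpleNamespace(
        page_content="LangChain document loaders help bring external content into LLM apps.\n",
        metadata={"source": "data/notes.txt"},
    )
]

for line in preview_documents(sample_docs):
    print(line)
```

In load_text_file.py you could pass the documents list returned by loader.load() to this helper instead of the stand-in objects.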
