Documents Loader


LangChain can load many document formats (.txt, .pdf, .docx, .csv, .xlsx, .json) and pass their contents to an LLM. Document loaders also cover unstructured sources; for example, YouTube audio can be parsed and loaded.

Once loaded into LangChain, a document can be pre-processed in whatever way the LLM application requires.

There are several kinds of loaders.

TextLoader#

This is the simplest kind of document loader: it loads a text document from a file path.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("../README.md")
# Returns a list of loaded documents. In this case, the list has one element.
document = loader.load()

# Uncomment and run
# document[0].page_content

document[0].metadata

{'source': '../README.md'}
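To make the return value concrete, the sketch below mimics what a text loader produces using only the standard library. `SimpleDocument` and `load_text` are hypothetical stand-ins for illustration, not LangChain's implementation; they only assume, as in the output above, that a loaded document exposes `page_content` and a `metadata` dictionary with a `source` key.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not LangChain's class)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read one text file and wrap it in a single-element list."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": str(path)})]


# Demo on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text("Hello, loader!", encoding="utf-8")
    demo_docs = load_text(str(sample))

print(len(demo_docs))  # one document per file
print(demo_docs[0].page_content)
print(demo_docs[0].metadata)
```

The list-of-documents shape is the point here: even a single file comes back wrapped in a list, which is why the example above indexes `document[0]`.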

CSVLoader#

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path="/your/file/path.csv")
data = loader.load()
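A minimal sketch of the row-per-document idea, assuming that each CSV row becomes its own document whose text lists `column: value` pairs. The `rows_to_documents` helper and the sample data are hypothetical and use only the standard library; the real loader's output may differ in detail.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list:
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per field in the row.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Hypothetical sample data, kept inline so no file is needed.
sample_csv = "name,team\nAda,Analytics\nLinus,Kernel\n"
csv_docs = rows_to_documents(sample_csv, source="sample.csv")

for doc in csv_docs:
    print(doc["metadata"], repr(doc["page_content"]))
```

Keeping one document per row means each record can later be retrieved or summarized on its own rather than as part of one large blob of text.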

ArxivLoader#

LangChain also provides an interface to load documents from arXiv using the paper's arXiv identifier (for example, 2405.10195).

## TODO: Add to environment file
## Package Requirements
# %pip install arxiv pymupdf --quiet


from langchain_community.document_loaders import ArxivLoader

arxiv_id = "2405.10195"  # an arXiv identifier, not a DOI
docs = ArxivLoader(query=arxiv_id).load()

print(docs[0].metadata["Title"])

Formation pathways of the compact stellar systems

# Uncomment and run
# docs[0].page_content

PyMuPDFLoader#

This LangChain loader loads PDF documents from your local file system.

The example below walks a directory, passes the path of each PDF to the loader, and collects the results in a single list. Each loaded document exposes its metadata as well as its page_content.

import os
from langchain_community.document_loaders import PyMuPDFLoader

pdf_folder_path = "/your/folder/path/"  # update path to point to the relevant directory
documents = []
for file in os.listdir(pdf_folder_path):
    if file.endswith(".pdf"):
        pdf_path = os.path.join(pdf_folder_path, file)
        loader = PyMuPDFLoader(pdf_path)
        documents.extend(loader.load())

for each in documents:
    # print(each.page_content) # Uncomment this line to see the individual page_content
    print(each.metadata)
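Because the loop above collects the output of every PDF into one flat list, it can be useful to check how many entries came from each file. The sketch below assumes each document's metadata contains a `source` key (as in the TextLoader output earlier); the `pages_per_source` helper and the stand-in documents are hypothetical, so the snippet runs without any PDFs.

```python
from collections import Counter
from types import SimpleNamespace


def pages_per_source(documents) -> Counter:
    """Count documents by metadata['source']; missing keys are grouped as 'unknown'."""
    return Counter(doc.metadata.get("source", "unknown") for doc in documents)


# Stand-in documents that mimic the shape of loaded pages (hypothetical data).
fake_documents = [
    SimpleNamespace(page_content="page 1", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="page 2", metadata={"source": "a.pdf"}),
    SimpleNamespace(page_content="page 1", metadata={"source": "b.pdf"}),
]

counts = pages_per_source(fake_documents)
for source, n_pages in sorted(counts.items()):
    print(f"{source}: {n_pages} page(s)")
```

With real loader output, `pages_per_source(documents)` could be called directly on the `documents` list built in the loop.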

Use Cases for Document Loaders#

If you have a collection of research papers that you want to use in your LLM application, ‘ArxivLoader’ or ‘PyMuPDFLoader’ can load them for you.

Once the papers are loaded, you can ask the LLM to summarize them or to answer questions based on them.
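One way to hand loaded documents to a model is to concatenate their text into a prompt. The sketch below is a minimal illustration under stated assumptions: `build_summary_prompt`, the prompt wording, and the `max_chars` budget are all hypothetical, and the actual call to an LLM is left out.

```python
from types import SimpleNamespace


def build_summary_prompt(documents, max_chars: int = 2000) -> str:
    """Join document texts (truncated to max_chars) under a summarization instruction."""
    combined = "\n\n".join(doc.page_content for doc in documents)
    excerpt = combined[:max_chars]  # crude character budget, not a token count
    return f"Summarize the following documents:\n\n{excerpt}"


# Stand-in documents (hypothetical content) so the sketch runs on its own.
papers = [
    SimpleNamespace(page_content="First paper text.", metadata={}),
    SimpleNamespace(page_content="Second paper text.", metadata={}),
]

prompt = build_summary_prompt(papers, max_chars=30)
print(prompt)
```

Truncating by characters is only a rough guard against overly long prompts; a real application would more likely split the documents into chunks and budget by tokens.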