How to load all the core file types from a directory using LangChain
import os
import sys
from dotenv import load_dotenv
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import UnstructuredXMLLoader
from langchain_community.document_loaders import UnstructuredExcelLoader
from langchain_community.document_loaders import UnstructuredPowerPointLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
import uuid
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
import chromadb
from langchain_community.document_loaders import DirectoryLoader
os.environ["OPENAI_API_KEY"] = "sk-............................................"
def process_documents(directory):
    """Load every supported file under `directory` and split it into chunks."""
    documents = []
    # Walk the whole directory tree, including subdirectories
    for root, dirs, files in os.walk(directory):
        for file in files:
            file_path = os.path.join(root, file)
            # Pick a loader based on the file extension (the match is case-sensitive)
            if file.endswith(".pdf"):
                loader = PyPDFLoader(str(file_path))
                documents.extend(loader.load())
            elif file.endswith('.docx') or file.endswith('.doc'):
                # Note: Docx2txtLoader relies on docx2txt, which reads .docx;
                # legacy .doc files may fail to load.
                loader = Docx2txtLoader(str(file_path))
                documents.extend(loader.load())
            elif file.endswith('.txt'):
                loader = TextLoader(str(file_path), encoding="utf-8")
                documents.extend(loader.load())
            elif file.endswith('.csv'):
                # CSV files are read as plain text, not parsed into rows
                loader = TextLoader(str(file_path), encoding="utf-8")
                documents.extend(loader.load())
            elif file.endswith('.xml'):
                loader = UnstructuredXMLLoader(str(file_path), encoding="utf-8")
                documents.extend(loader.load())
            elif file.endswith('.xlsx') or file.endswith('.xls'):
                loader = UnstructuredExcelLoader(str(file_path))
                documents.extend(loader.load())
            elif file.endswith('.pptx'):
                loader = UnstructuredPowerPointLoader(str(file_path))
                documents.extend(loader.load())
    # Split the documents into smaller chunks
    document_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100)
    documents = document_splitter.split_documents(documents)
    return documents
# Usage example:
directory = "./docs"
processed_documents = process_documents(directory)
This code snippet walks a specified directory and loads several types of documents using the LangChain library and its community loaders. Here’s a breakdown of what it does and which file types it can load:
- Traversing Directory: The code walks through the directory tree starting from “./docs”.
- Loading Documents: Depending on the file extension, it loads different types of documents:
  - PDF Files: It uses PyPDFLoader to load PDF documents.
  - Microsoft Word Files (.docx, .doc): It uses Docx2txtLoader to load Word documents.
  - Text Files (.txt): It uses TextLoader to load plain text files.
  - CSV Files (.csv): It also uses TextLoader, so CSV files are read as plain text.
  - XML Files (.xml): It uses UnstructuredXMLLoader to load XML files.
  - Microsoft Excel Files (.xlsx, .xls): It uses UnstructuredExcelLoader to load Excel files.
  - PowerPoint Files (.pptx): It uses UnstructuredPowerPointLoader to load PowerPoint files.
- Splitting Documents: After loading, it uses RecursiveCharacterTextSplitter to split the documents into chunks of at most 3000 characters, with a 100-character overlap between neighbouring chunks. Chunking keeps each piece small enough for further processing, such as embedding.
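To make the chunk_size and chunk_overlap settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraph and line breaks), and the function name chunk_text is hypothetical. It only illustrates how consecutive chunks share characters.

```python
def chunk_text(text: str, chunk_size: int = 3000, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: the real RecursiveCharacterTextSplitter
    tries to break on separators before falling back to raw characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 7000-character text with the same settings as the snippet above
sample = "x" * 7000
chunks = chunk_text(sample)
print(len(chunks), [len(c) for c in chunks])  # 3 [3000, 3000, 1200]
```

Each window starts 2900 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next. That repetition is what keeps a sentence that straddles a boundary visible in both chunks.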
Overall, this code snippet loads seven families of document formats from a directory tree and returns them as chunks, making it suitable for applications that need to process mixed sets of textual data.
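One possible variation, sketched here with only the standard library, replaces the if/elif chain with a lookup table from extension to loader name and matches extensions case-insensitively, so that a file named REPORT.PDF is not skipped. The names LOADER_BY_EXTENSION and plan_loading are hypothetical, and the sketch only reports which loader each file would get; it does not call LangChain itself.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical lookup table mirroring the if/elif chain in process_documents
LOADER_BY_EXTENSION = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
    ".txt": "TextLoader",
    ".csv": "TextLoader",
    ".xml": "UnstructuredXMLLoader",
    ".xlsx": "UnstructuredExcelLoader",
    ".xls": "UnstructuredExcelLoader",
    ".pptx": "UnstructuredPowerPointLoader",
}


def plan_loading(directory):
    """Return ({loader name: [file paths]}, [skipped file paths])."""
    plan = defaultdict(list)
    skipped = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Lower-case the extension so ".PDF" and ".pdf" are treated alike
            ext = os.path.splitext(name)[1].lower()
            loader_name = LOADER_BY_EXTENSION.get(ext)
            if loader_name is None:
                skipped.append(path)  # unsupported type: report it, do not load it
            else:
                plan[loader_name].append(path)
    return dict(plan), skipped


# Dry run against a temporary directory with three empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("REPORT.PDF", "notes.txt", "image.png"):
        open(os.path.join(tmp, name), "w").close()
    plan, skipped = plan_loading(tmp)
    loaders_used = sorted(plan)
    skipped_names = [os.path.basename(p) for p in skipped]

print(loaders_used)
print(skipped_names)
```

Listing the skipped files makes it easy to see which formats in a folder are silently ignored. Supporting a new format then only needs one more entry in the table.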

Author: Bogdan Kuhar
Full Stack Developer/coach
https://www.youtube.com/@imimir_com
info@imimir.com