Absolutely! Below is a **complete, self-contained Jupyter Lab notebook** combining your original LangChain/Ollama RAG agent with a clear, beginner-friendly LangExtract integration using your local `gpt-oss:20b` LLM. You’ll find extra commentary and learning points throughout to help any new user (including those new to Python or LLMs) understand the steps, experiment confidently, and keep learning. Each step is labelled and explained, and the code includes guidance for troubleshooting and further exploration.

---

# 🧑‍💻 LangChain + Ollama Document Q&A and LangExtract Lab

## Introduction: What Will You Learn and Why?

Welcome! In this lab, you’ll learn how to build a **local, private AI assistant** that reads your documents (TXT or PDF), answers questions about them, **and** extracts key details in a structured format. You’ll combine two powerful frameworks:

- **LangChain:** For loading, chunking, searching, and Q&A with LLMs
- **LangExtract:** For **structured extraction** of facts from text using LLMs

You’ll use **Ollama** to run the `gpt-oss:20b` model directly on your Mac. This means all your data stays private.

> 📝 **Learning Points:**
> - What “retrieval-augmented generation” (RAG) really is
> - Why chunking and embeddings are the backbone of document search
> - How LLMs can create **structured data** from *unstructured* text
> - Experimentation tips: improving quality, extending the workflow, and learning more

---

## 🛠️ Step 1: Environment Setup (Install Required Libraries)

Let’s make sure your Python environment has all the tools needed. If you get an error below (about `pip` or anything else), ask for help or consult your team documentation.

```python
!pip install langchain langchain-community langchain-core ollama pypdf ipywidgets faiss-cpu langextract
```

---

## 🚚 Step 2: Import Modules (With Explanations!)

We’ll import only what we need, with short explanations along the way.

```python
# LangChain core modules
from langchain_community.llms import Ollama                 # Interface to LLMs running with Ollama on your Mac
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter   # For dividing docs into chunks
from langchain_community.vectorstores import FAISS          # Fast, local, open-source vector database
from langchain_community.embeddings import OllamaEmbeddings
from langchain.chains import RetrievalQA

# Jupyter and file tools
import os
from IPython.display import display
import ipywidgets as widgets

# LangExtract for structured extraction
import langextract as lx

# Just for optional data formatting
import json
```
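Before going any further, it’s worth a quick smoke test to confirm Ollama is reachable and the model responds. This is a minimal sketch that assumes Ollama is serving on its default local endpoint; if it hangs or errors, start the model in a terminal with `ollama run gpt-oss:20b` first, since every later step depends on it.

```python
# Quick sanity check: can we reach the local Ollama server and get a reply?
# Assumes Ollama's default endpoint (http://localhost:11434) and that
# gpt-oss:20b has already been pulled.
test_llm = Ollama(model="gpt-oss:20b")
print(test_llm.invoke("Reply with the single word: ready"))
```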
---

## 📁 Step 3: Upload or Select Your Document (TXT or PDF)

You can upload directly in Jupyter or specify a pre-existing file.

```python
# Widget for uploading your own document (supports .txt or .pdf)
uploader = widgets.FileUpload(accept='.pdf,.txt', multiple=False)
display(uploader)
```

```python
# Save the uploaded file, or point to an existing one in your directory
filename = None
if uploader.value:
    # ipywidgets 8 returns a tuple of dicts; ipywidgets 7 returns a dict keyed by filename
    if isinstance(uploader.value, dict):  # ipywidgets 7
        uploads = [{'name': n, 'content': info['content']} for n, info in uploader.value.items()]
    else:                                  # ipywidgets 8
        uploads = list(uploader.value)
    fileinfo = uploads[0]
    filename = fileinfo['name']
    with open(filename, "wb") as f:
        f.write(fileinfo['content'])
else:
    # If you don't upload, enter a filename here (e.g., "yourfile.txt" or "document.pdf")
    filename = "yourfile.txt"  # <-- Update this line if you're using an existing local file

print("Selected file:", filename)
```

> *Learning point: PDFs are read one page at a time, while `.txt` files are read as a single block of text.*

---

## ✂️ Step 4: Load and Split the Document into Chunks

LLMs cannot process very large documents at once; splitting keeps context and relevance.

```python
if filename.lower().endswith('.pdf'):
    loader = PyPDFLoader(filename)
elif filename.lower().endswith('.txt'):
    loader = TextLoader(filename, encoding='utf-8')
else:
    raise ValueError("Unsupported file type. Please upload a PDF or TXT.")

# Load the document content
docs = loader.load()
print(f"Loaded {len(docs)} document(s) from {filename}.")

# Chunking: overlap helps keep context between chunks
print("Splitting document: each chunk ~500 characters, with 50-character overlap for context.")
splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=500,
    chunk_overlap=50
)
split_docs = splitter.split_documents(docs)
print(f"Total text chunks created: {len(split_docs)}")
```

> *Learning point: more, smaller chunks = more precise search, but make them too small and you lose context. Try tweaking these numbers later!*

---

## 🧠 Step 5: Build Embeddings and the Vector Database (for RAG)

This turns each chunk into a “meaning vector” for smart search.

```python
# Ensure your Ollama gpt-oss:20b model is running (check in Terminal with `ollama run gpt-oss:20b`)
llm = Ollama(model="gpt-oss:20b")

# Note: reusing a chat model for embeddings works, but a dedicated embedding model
# (e.g. `ollama pull nomic-embed-text`, then model="nomic-embed-text") usually
# gives noticeably better retrieval quality.
embeddings = OllamaEmbeddings(model="gpt-oss:20b")

# FAISS keeps all vectors local and fast!
db = FAISS.from_documents(split_docs, embeddings)
print("Created vector database for semantic search of your document.")
```

---

## 🔗 Step 6: Build the Retrieval QA Chain (Classic RAG)

Now you have “ask-any-question” power over your document.

```python
# Build the chain: retrieval + answer generation in one object
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",            # "stuff" simply puts the retrieved chunks into the LLM prompt
    retriever=db.as_retriever(),
    return_source_documents=True   # Lets you see where each answer came from
)
print("RetrievalQA chain ready! Now you can ask questions about your document.")
```

### Example Queries

```python
# 1. General summary
response = qa_chain.invoke({"query": "Summarize the main ideas from this document."})
print("Summary:", response["result"])

# 2. Action items or tasks
query = "List the action items or tasks mentioned in this document."
response = qa_chain.invoke({"query": query})
print("Action Items:\n", response["result"])

# To ask your own question:
# user_query = input("Ask your question: ")
# response = qa_chain.invoke({"query": user_query})
# print("Answer:", response["result"])
```

> *Learning point: This pipeline is called RAG (“retrieval-augmented generation”):*
> *1. It finds relevant context using vector search.*
> *2. It answers using your document, not just the model’s “memory”.*
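Because we set `return_source_documents=True`, every response also carries the chunks the retriever selected. Inspecting them is the quickest way to confirm an answer is actually grounded in your document. A small sketch reusing the chain built above:

```python
# Show which chunks supported the answer: useful for spotting hallucinations
response = qa_chain.invoke({"query": "Summarize the main ideas from this document."})
for i, doc in enumerate(response["source_documents"], start=1):
    print(f"--- Source chunk {i} ---")
    print(doc.page_content[:200])  # First 200 characters of each supporting chunk
```

If the sources look irrelevant, revisit the chunk size and overlap in Step 4 before blaming the model.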
---

## 🏷️ Step 7: Structured Extraction with LangExtract

**Going beyond Q&A:** pull out specific fields (like names, dates, or lists) as structured data for further automation or reporting.

---

### 7.1 What is Structured Extraction, and Why Do It?

- Q&A is great for *human* readers.
- Structured extraction is key for *automation*: importing into spreadsheets, or building databases from messy docs.
- **LangExtract** guides your LLM to pull out facts in a structured form you define through examples, which you can then export as JSON, tables, or simple lists.

---

### 7.2 Write Extraction Instructions and Example(s)

Clear instructions and example pairs (input → output) dramatically improve extraction accuracy. LangExtract expects a prompt description plus `ExampleData` objects, where each example pairs a snippet of text with the extractions it should produce.

```python
# INSTRUCTIONS: change these to fit what you need!
prompt_description = """
Extract the following fields from the document content:
- project_name
- main_contact
- due_date
Use the exact text from the document for each extraction.
"""

# At least one example helps a lot! Add more examples for higher accuracy.
examples = [
    lx.data.ExampleData(
        text="The 'NextGen Cloud Upgrade' will be overseen by Maria Lopez. Completion is due by August 30th, 2024.",
        extractions=[
            lx.data.Extraction(extraction_class="project_name", extraction_text="NextGen Cloud Upgrade"),
            lx.data.Extraction(extraction_class="main_contact", extraction_text="Maria Lopez"),
            lx.data.Extraction(extraction_class="due_date", extraction_text="August 30th, 2024"),
        ],
    ),
    # Add more real examples as you go!
]
```

> 📝 *Learning point: you are “programming” the LLM by example: clear, concrete examples teach it what a correct answer looks like. Try to cover the variety in your docs.*

---

### 7.3 Choose Text to Extract Information From

- On smaller docs: use the whole thing.
- On larger docs: try one chunk, a summary, or several chunks combined.

```python
# Try the first chunk to start. For exhaustive extraction, loop over all chunks!
text_for_extraction = split_docs[0].page_content
print("First text chunk:\n", text_for_extraction)
```

> *If the info isn’t found, try another chunk, or combine several for richer context.*

---

### 7.4 Run the Extraction!

The call below follows LangExtract’s documented `lx.extract` interface for local Ollama models; the `model_url`, `fenced_output`, and `use_schema_constraints` settings mirror its published local-model example, so check `help(lx.extract)` if your installed version differs.

```python
# The core extraction call
result = lx.extract(
    text_or_documents=text_for_extraction,
    prompt_description=prompt_description,
    examples=examples,
    model_id="gpt-oss:20b",             # Routed to the local Ollama provider
    model_url="http://localhost:11434",
    fenced_output=False,
    use_schema_constraints=False,
)

# Print each extraction for easy review
for e in result.extractions:
    print(f"{e.extraction_class}: {e.extraction_text}")

# If you want to run on all chunks and collect results, try a loop like this:
all_extractions = []
for d in split_docs:
    chunk_result = lx.extract(
        text_or_documents=d.page_content,
        prompt_description=prompt_description,
        examples=examples,
        model_id="gpt-oss:20b",
        model_url="http://localhost:11434",
        fenced_output=False,
        use_schema_constraints=False,
    )
    all_extractions.extend(chunk_result.extractions)  # Chunks with no matches simply add nothing

print("All structured results from all chunks:")
for e in all_extractions:
    print(f"{e.extraction_class}: {e.extraction_text}")
```

---

## 🧑‍🎓 Step 8: Experiment and Reflect

### 🧪 What To Try Next

- Change the extraction prompt to pull out other fields!
- Add more examples for better accuracy, especially if your docs have different formats.
- Modify the chunk size and overlap above and observe how retrieval Q&A vs. extraction performance changes.
- Try running the extractor on the whole document at once if it’s not too long: `text_for_extraction = '\n'.join(d.page_content for d in split_docs)`

### 💡 Learning Points

- **Chunk size and example quality are KEY:** the more your examples match the variety in your data, the better the model performs.
- **Check responses carefully:** LLMs can hallucinate, especially if examples are ambiguous or the info isn’t in the text. Always validate before using in production.
- **All code here runs *locally*:** none of your document data or questions leave your Mac if you use Ollama and these open-source libraries.
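That “always validate” advice is easy to act on: because LangExtract extractions are spans of the source text, you can mechanically check that each extracted value really appears in the text it came from. A minimal sketch, assuming the `result` and `text_for_extraction` objects from Step 7:

```python
# Grounding check: every extracted value should occur verbatim in the source.
# Anything flagged NOT FOUND is a candidate hallucination worth reviewing by hand.
def check_grounding(extractions, source_text):
    for e in extractions:
        status = "OK" if e.extraction_text in source_text else "NOT FOUND"
        print(f"[{status}] {e.extraction_class}: {e.extraction_text!r}")

check_grounding(result.extractions, text_for_extraction)
```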
---

## ❓ Troubleshooting and Going Further

- **Not getting the results you want?**
  - Add another, more specific example.
  - Tweak your extraction prompt: the more concrete, the better!
  - Make sure you are extracting from the right chunk.
- **Performance slow, or errors?**
  - Ensure `gpt-oss:20b` is running in Ollama.
  - Try a smaller LLM (such as `phi3`) if hardware is limited.
- **More learning resources:**
  - [LangChain documentation](https://python.langchain.com/docs/)
  - [Ollama documentation](https://ollama.com/)
  - [LangExtract documentation](https://github.com/google/langextract)
  - Experiment and share results with peers: crowdsourcing examples and improvements is powerful!

---

## 🏁 What Did You Build?

You now have:

- A local, private Q&A agent over your documents (RAG).
- A framework for extracting **structured data** from natural-language docs.
- All the building blocks to automate and transform how you work with text!

> **Print this notebook, save it as a PDF, or duplicate and tweak it to build more powerful agents! And if you run into issues or want to improve the flow, reach out to your team’s #help channels.**

---

**Gentle reminder: Oracle Code Assist offers advanced AI-powered coding support. You may also visit the [#help-oracle-genai-chat Slack channel](https://oracle.enterprise.slack.com/archives/C08S2U7HDPU) for help on Generative Chat.**

---

*You did it! 🚀 Now start customizing, expanding, and sharing your learnings with others.*