Understanding Ingestion
Ingestion is the process of preparing your data for search, chat, and analysis in Amplifi. It breaks large files into smaller chunks, generates vector embeddings, and indexes the data — making it faster and more efficient to retrieve relevant information.
Why Ingestion Matters
Ingesting data is essential to ensure that large, unstructured documents become easily searchable and accessible for AI-driven queries. Key benefits include:
- Faster Search Results: Smaller chunks enable quicker lookups.
- Improved Accuracy: Overlapping chunks ensure context is preserved across boundaries.
- Scalability: Handles large datasets by processing them into manageable units.
- Semantic Understanding: Vector embeddings allow the system to understand meaning, not just keywords.
The Ingestion Process in Amplifi
When you upload files into a dataset, Amplifi intelligently processes them to handle large files efficiently:
Parallel File Division: If a file contains 50,000 tokens but the processing limit is 10,000 tokens per file, Amplifi automatically divides it into 5 separate files (50,000 ÷ 10,000 = 5 files) for parallel ingestion.
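For illustration, this division step amounts to a ceiling division over the token count. The sketch below shows that arithmetic; the helper name and the 10,000-token limit are taken from the example above and are not Amplifi's internal code.

```python
import math

def split_count(total_tokens: int, limit_per_file: int = 10_000) -> int:
    """Number of parallel parts a file is divided into (illustrative sketch only)."""
    return math.ceil(total_tokens / limit_per_file)

print(split_count(50_000))  # 50,000 / 10,000 -> 5 parts ingested in parallel
print(split_count(52_500))  # uneven totals round up -> 6 parts
```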
Smart Chunking Strategy: Each divided file is further broken into smaller chunks with:
- Chunk Size Control: Configurable chunk sizes (typically 500-2000 tokens) based on content complexity
- Chunk Overlap: 10-50% overlap between adjacent chunks to maintain context across boundaries
- Context Preservation: Ensures no information is lost during the splitting process
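The sliding-window sketch below illustrates how chunking with overlap works in general; the function, the chunk sizes, and the whitespace "tokenizer" are illustrative assumptions rather than Amplifi's actual implementation.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=100):
    """Split a token list into overlapping chunks (illustrative sketch).

    Adjacent chunks share `overlap` tokens so context at the boundary
    is not lost when chunks are retrieved independently.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Toy example using whitespace splitting in place of a real tokenizer
tokens = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```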
Vector Embedding Generation: Each chunk is converted into vector embeddings using advanced AI models (like OpenAI's embedding models) that capture semantic meaning and context.
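For context, embedding a single chunk with OpenAI's Python client looks roughly like this; the model name is an illustrative choice, and Amplifi performs this step for you during ingestion.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunk = "Amplifi splits documents into overlapping chunks before indexing."
response = client.embeddings.create(
    model="text-embedding-3-small",  # illustrative model choice
    input=chunk,
)
vector = response.data[0].embedding  # list of floats capturing the chunk's meaning
print(len(vector))
```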
Unified Vector Database Storage: All chunks from the original document are stored together in the vector database with:
- Document Identity Linking: Each chunk maintains metadata linking it back to the original document
- Unified Search Capability: Enables seamless retrieval across all chunks of the same document
- Optimized Query Performance: Parallel processing and intelligent indexing for fast, accurate search results
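The record sketched below shows the kind of chunk-plus-metadata entry this step produces; the field names and the in-memory index are assumptions for illustration, not Amplifi's actual storage schema.

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    """One entry in the vector store (illustrative schema)."""
    document_id: str        # links the chunk back to the original document
    chunk_index: int        # position of the chunk within that document
    text: str
    embedding: list[float]

# A toy "vector database": all chunks of a document live in one index,
# so a query can retrieve any chunk and trace it back to its source file.
vector_index: list[ChunkRecord] = []
vector_index.append(
    ChunkRecord(document_id="report-2024.pdf", chunk_index=0,
                text="Executive summary ...", embedding=[0.12, -0.07, 0.31])
)
print([record.document_id for record in vector_index])
```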
Canceling an Ingestion Process
While ingestion typically runs in the background, you can cancel an ongoing ingestion process if needed:
How to Cancel Ingestion
- Navigate to Dataset Management: Go to your workspace and select the Datasets tab
- Find the Active Ingestion: Look for datasets showing "In Progress" or "Processing" status
- Access Dataset Options: Click on the dataset name or use the options menu (⋯)
- Cancel Ingestion: Select Cancel Ingestion from the available actions
- Confirmation: Confirm the cancellation when prompted
What Happens When You Cancel
- Immediate Termination: The ingestion process stops immediately
- Partial Data Retention: Any data that has already been processed and stored will remain available
- No Data Loss: Previously ingested content for the dataset is preserved
- Re-ingestion Required: You'll need to restart ingestion to process any remaining unprocessed files
Important Notes
- Large Files: Canceling ingestion of very large files may take a few moments to fully terminate
- Partial Results: You can still use any data that was successfully ingested before cancellation
- Cost Considerations: Canceling doesn't refund any compute costs for the partial processing that occurred
Chunking: Breaking Data into Meaningful Units
Chunking is the process of dividing large files into smaller, meaningful segments. This helps manage data size, maintain context, and improve retrieval accuracy.
Chunk Size and Overlap
- Chunk Size: Determines the size of each chunk. Larger chunks retain more context, while smaller chunks speed up retrieval. As a rule of thumb, a page contains roughly 1,000 tokens, so the default setting of 2,500 tokens covers about 2.5 pages of a PDF. For most reports (user manuals, 10-K filings, market research reports, invoices, etc.), this chunk size preserves enough context for an LLM to retrieve a chunk and reason over it to answer the input query accurately.
- Chunk Overlap: Ensures chunks share some content with their neighbors, preserving context across boundaries. A 10% overlap relative to the chunk size is the recommended starting point; if the document contains complex concepts that span multiple pages, increasing the overlap to as much as 50% can make sense. A worked example follows below.
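A quick calculation with those defaults, using the roughly 1,000-tokens-per-page rule of thumb (the numbers are illustrative only):

```python
chunk_size = 2500          # default chunk size in tokens
overlap_ratio = 0.10       # recommended starting point

overlap_tokens = int(chunk_size * overlap_ratio)   # 250 tokens shared with each neighbor
pages_per_chunk = chunk_size / 1000                # ~2.5 pages at ~1,000 tokens per page

print(overlap_tokens, pages_per_chunk)  # 250 2.5
```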
Vector Embeddings: Unlocking Semantic Search
Vector embeddings transform text into numerical representations, enabling semantic search by capturing meaning rather than just keywords. Amplifi uses:
- OpenAI Embeddings: High-accuracy embeddings that understand context and meaning.
These embeddings empower Amplifi to:
- Retrieve the most relevant chunks.
- Improve search accuracy by understanding context.
- Enable natural language queries for deeper insights.
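As a rough illustration of meaning-based retrieval, the sketch below scores chunks against a query using cosine similarity between embedding vectors; the toy vectors are invented for the example, while real embeddings come from the models described above.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors: values near 1.0 mean similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: the query is closer in meaning to chunk_a than to chunk_b.
query   = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # e.g. "quarterly revenue grew 12%"
chunk_b = [0.1, 0.1, 0.9]   # e.g. "the office cafeteria menu"

print(cosine_similarity(query, chunk_a))  # higher score -> more relevant
print(cosine_similarity(query, chunk_b))
```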