Understanding Ingestion
Ingestion is the process of preparing your data for search, chat, and analysis in Amplifi. It breaks large files into smaller chunks, generates vector embeddings, and indexes the data — making it faster and more efficient to retrieve relevant information.
Why Ingestion Matters
Ingesting data is essential to ensure that large, unstructured documents become easily searchable and accessible for AI-driven queries. Key benefits include:
- Faster Search Results: Smaller chunks enable quicker lookups.
- Improved Accuracy: Overlapping chunks ensure context is preserved across boundaries.
- Scalability: Handles large datasets by processing them into manageable units.
- Semantic Understanding: Vector embeddings allow the system to understand meaning, not just keywords.
The Ingestion Process in Amplifi
When you upload files into a dataset, Amplifi intelligently processes them to handle large files efficiently:
Parallel File Division: If a file contains 50,000 tokens but the processing limit is 10,000 tokens per file, Amplifi automatically divides it into 5 separate files (50,000 ÷ 10,000 = 5 files) for parallel ingestion.
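For illustration, this division step amounts to a ceiling division over the token count. The sketch below shows that arithmetic; the helper name and the 10,000-token limit are taken from the example above and are not Amplifi's internal code.

```python
import math

def split_count(total_tokens: int, limit_per_file: int = 10_000) -> int:
    """Number of parallel parts a file is divided into (illustrative sketch only)."""
    return math.ceil(total_tokens / limit_per_file)

print(split_count(50_000))  # 50,000 / 10,000 -> 5 parts ingested in parallel
print(split_count(52_500))  # uneven totals round up -> 6 parts
```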
Smart Chunking Strategy: Each divided file is further broken into smaller chunks with:
- Chunk Size Control: Configurable chunk sizes (typically 500-2000 tokens) based on content complexity
- Chunk Overlap: 10-50% overlap between adjacent chunks to maintain context across boundaries
- Context Preservation: Ensures no information is lost during the splitting process
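The sliding-window sketch below illustrates how chunking with overlap works in general; the function, the chunk sizes, and the whitespace "tokenizer" are illustrative assumptions rather than Amplifi's actual implementation.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=100):
    """Split a token list into overlapping chunks (illustrative sketch).

    Adjacent chunks share `overlap` tokens so context at the boundary
    is not lost when chunks are retrieved independently.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Toy example using whitespace splitting in place of a real tokenizer
tokens = "the quick brown fox jumps over the lazy dog".split()
for chunk in chunk_tokens(tokens, chunk_size=4, overlap=1):
    print(chunk)
```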
Vector Embedding Generation: Each chunk is converted into vector embeddings using advanced AI models (like OpenAI's embedding models) that capture semantic meaning and context.
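For context, embedding a single chunk with OpenAI's Python client looks roughly like this; the model name is an illustrative choice, and Amplifi performs this step for you during ingestion.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunk = "Amplifi splits documents into overlapping chunks before indexing."
response = client.embeddings.create(
    model="text-embedding-3-small",  # illustrative model choice
    input=chunk,
)
vector = response.data[0].embedding  # list of floats capturing the chunk's meaning
print(len(vector))
```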
Unified Vector Database Storage: All chunks from the original document are stored together in the vector database with:
- Document Identity Linking: Each chunk maintains metadata linking it back to the original document
- Unified Search Capability: Enables seamless retrieval across all chunks of the same document
- Optimized Query Performance: Parallel processing and intelligent indexing for fast, accurate search results
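The record sketched below shows the kind of chunk-plus-metadata entry this step produces; the field names and the in-memory index are assumptions for illustration, not Amplifi's actual storage schema.

```python
from dataclasses import dataclass

@dataclass
class ChunkRecord:
    """One entry in the vector store (illustrative schema)."""
    document_id: str        # links the chunk back to the original document
    chunk_index: int        # position of the chunk within that document
    text: str
    embedding: list[float]

# A toy "vector database": all chunks of a document live in one index,
# so a query can retrieve any chunk and trace it back to its source file.
vector_index: list[ChunkRecord] = []
vector_index.append(
    ChunkRecord(document_id="report-2024.pdf", chunk_index=0,
                text="Executive summary ...", embedding=[0.12, -0.07, 0.31])
)
print([record.document_id for record in vector_index])
```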
Canceling an Ingestion Process
While ingestion typically runs in the background, you can cancel an ongoing ingestion process if needed:
How to Cancel Ingestion
- Navigate to Dataset Management: Go to your workspace and select the Datasets tab
- Find the Active Ingestion: Look for datasets showing "In Progress" or "Processing" status
- Access Dataset Options: Click on the dataset name or use the options menu (⋯)
- Cancel Ingestion: Select Cancel Ingestion from the available actions
- Confirmation: Confirm the cancellation when prompted
What Happens When You Cancel
- Immediate Termination: The ingestion process stops immediately
- Partial Data Retention: Any data that has already been processed and stored will remain available
- No Data Loss: Previously ingested content for the dataset is preserved
- Re-ingestion Required: You'll need to restart ingestion to process any remaining unprocessed files
Important Notes
- Large Files: Canceling ingestion of very large files may take a few moments to fully terminate
- Partial Results: You can still use any data that was successfully ingested before cancellation
- Cost Considerations: Canceling doesn't refund any compute costs for the partial processing that occurred
Chunking: Breaking Data into Meaningful Units
Chunking is the process of dividing large files into smaller, meaningful segments. This helps manage data size, maintain context, and improve retrieval accuracy.
Chunk Size and Overlap
- Chunk Size: Determines the size of each chunk. Larger chunks retain more context, while smaller chunks speed up retrieval. As a rule of thumb, a page contains roughly 1,000 tokens, so the default setting of 2,500 tokens covers about 2.5 pages of a PDF. For most reports (user manuals, 10-K filings, market research reports, invoices, etc.), this chunk size preserves enough context for an LLM to retrieve a chunk and reason over it to answer the input query accurately.
- Chunk Overlap: Ensures chunks share some content with their neighbors, preserving context across boundaries. A 10% overlap relative to the chunk size is the recommended starting point; if the document contains complex concepts that span multiple pages, increasing the overlap to as much as 50% can make sense. A worked example follows below.
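A quick calculation with those defaults, using the roughly 1,000-tokens-per-page rule of thumb (the numbers are illustrative only):

```python
chunk_size = 2500          # default chunk size in tokens
overlap_ratio = 0.10       # recommended starting point

overlap_tokens = int(chunk_size * overlap_ratio)   # 250 tokens shared with each neighbor
pages_per_chunk = chunk_size / 1000                # ~2.5 pages at ~1,000 tokens per page

print(overlap_tokens, pages_per_chunk)  # 250 2.5
```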
Vector Embeddings: Unlocking Semantic Search
Vector embeddings transform text into numerical representations, enabling semantic search by capturing meaning rather than just keywords. Amplifi uses:
- OpenAI Embeddings: High-accuracy embeddings that understand context and meaning.
These embeddings empower Amplifi to:
- Retrieve the most relevant chunks.
- Improve search accuracy by understanding context.
- Enable natural language queries for deeper insights.
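As a rough illustration of meaning-based retrieval, the sketch below scores chunks against a query using cosine similarity between embedding vectors; the toy vectors are invented for the example, while real embeddings come from the models described above.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors: values near 1.0 mean similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: the query is closer in meaning to chunk_a than to chunk_b.
query   = [0.9, 0.1, 0.0]
chunk_a = [0.8, 0.2, 0.1]   # e.g. "quarterly revenue grew 12%"
chunk_b = [0.1, 0.1, 0.9]   # e.g. "the office cafeteria menu"

print(cosine_similarity(query, chunk_a))  # higher score -> more relevant
print(cosine_similarity(query, chunk_b))
```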