Streaming Parser
Process datasets larger than memory with Tauq's iterator-based streaming parser.
Why Streaming?
❌ Block Parsing
Traditional parsers load the entire file into memory:
- 10 GB file = 10 GB+ RAM required
- Cannot process files larger than available memory
- Slow startup (wait for full parse)
- Wastes memory if you only need the first N records
✅ Streaming
StreamingParser yields records one at a time:
- Constant memory - only the current record in RAM
- Process files of any size
- Instant first record (no wait for a full parse)
- Early exit possible (stop when a condition is met)
Perfect For:
- 📊 Large datasets - Log files, database dumps, analytics data
- 🔄 ETL pipelines - Transform records without loading everything
- 🔍 Search/filter - Find specific records and exit early
- 🌊 Real-time processing - Process data as it arrives
Rust API
Basic Usage
```rust
use tauq::StreamingParser;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large_data.tqn")?;
    let mut parser = StreamingParser::new(file);

    // Iterate through records
    while let Some(result) = parser.next_record() {
        match result {
            Ok(record) => {
                // Process this record
                println!("{:?}", record);
            }
            Err(e) => {
                eprintln!("Parse error: {}", e);
                break;
            }
        }
    }

    Ok(())
}
```
Example: Filter Large Dataset
Find all users with role "admin" in a 10 GB file:
```rust
use tauq::StreamingParser;
use std::fs::File;

fn find_admins() -> Result<Vec<serde_json::Value>, Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?; // 10 GB file
    let mut parser = StreamingParser::new(file);
    let mut admins = Vec::new();

    while let Some(result) = parser.next_record() {
        let record = result?;
        if let Some(role) = record.get("role") {
            if role == "admin" {
                admins.push(record);
            }
        }
    }

    Ok(admins)
}

// Memory usage: ~constant (only the current record + result Vec)
// Without streaming: would need 10+ GB RAM
```
Example: Early Exit
Find the first record matching a condition and stop:
```rust
use tauq::StreamingParser;
use std::fs::File;

fn find_user_by_id(target_id: i64) -> Result<Option<serde_json::Value>, Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?;
    let mut parser = StreamingParser::new(file);

    while let Some(result) = parser.next_record() {
        let record = result?;
        if let Some(id) = record.get("id").and_then(|v| v.as_i64()) {
            if id == target_id {
                return Ok(Some(record)); // Found it! Exit early
            }
        }
    }

    Ok(None) // Not found
}

// Only parses records until the target is found
// Could save hours on multi-GB files
```
Example: Transform & Write
ETL pipeline - read large file, transform records, write output:
```rust
use tauq::StreamingParser;
use std::fs::File;
use std::io::Write;

fn anonymize_users() -> Result<(), Box<dyn std::error::Error>> {
    let input = File::open("users.tqn")?;
    let mut parser = StreamingParser::new(input);
    let mut output = File::create("users_anonymized.json")?;

    output.write_all(b"[")?;
    let mut first = true;

    while let Some(result) = parser.next_record() {
        let mut record = result?;

        // Anonymize email
        if let Some(email) = record.get_mut("email") {
            *email = serde_json::Value::String("***@***.com".to_string());
        }

        // Write transformed record
        if !first {
            output.write_all(b",")?;
        }
        first = false;
        serde_json::to_writer(&mut output, &record)?;
    }

    output.write_all(b"]")?;
    Ok(())
}

// Processes a 10 GB file with ~constant memory usage
```
Python API
Basic Usage
```python
import tauq

with open('large_data.tqn') as f:
    for record in tauq.stream(f):
        # Process each record
        print(record)

        # Early exit example
        if record['id'] > 1000:
            break
```
Example: Data Processing Pipeline
```python
import json
from datetime import datetime

import tauq

def process_logs(input_file, output_file):
    with open(input_file) as infile, open(output_file, 'w') as outfile:
        for record in tauq.stream(infile):
            # Filter: only errors
            if record.get('level') == 'ERROR':
                # Transform: add timestamp
                record['processed_at'] = datetime.now().isoformat()

                # Load: write to output
                json.dump(record, outfile)
                outfile.write('\n')

# Constant memory usage regardless of input file size
```
Performance Characteristics
Memory Usage
| File Size | Block Parse | Streaming Parse |
|---|---|---|
| 10 MB file | ~10 MB RAM | ~4 KB RAM |
| 1 GB file | ~1 GB RAM | ~4 KB RAM |
| 100 GB file | ❌ OOM | ~4 KB RAM |
Time to First Record
| File Size | Block Parse | Streaming Parse |
|---|---|---|
| 10 MB | ~50 ms | ~1 ms |
| 1 GB | ~5 sec | ~1 ms |
| 100 GB | ~8 min | ~1 ms |
Streaming starts processing immediately without waiting for the entire file.
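If you want to sanity-check the time-to-first-record behavior on your own data, timing the first call to next_record is enough. Below is a minimal sketch using only the StreamingParser API shown above; the file name is a placeholder.

```rust
use std::fs::File;
use std::time::Instant;

use tauq::StreamingParser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let start = Instant::now();

    // No up-front full parse: open the file and pull a single record
    let file = File::open("large_data.tqn")?; // placeholder path
    let mut parser = StreamingParser::new(file);
    let first = parser.next_record();

    println!(
        "first record ready in {:?} (got one: {})",
        start.elapsed(),
        first.is_some()
    );
    Ok(())
}
```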
vs Other Formats
Why Tauq Beats JSON/TOON for Streaming
1. No Count Required
TOON requires knowing array length upfront (users[1000]). Tauq doesn't.
```
users[???]{id,name}:    (TOON: array length must be known up front)
!def User id name       (Tauq: schema only, no count)
```
2. Iterator API
Tauq provides StreamingParser with a true iterator interface; JSON/TOON parsers typically parse the entire structure before returning anything. (A sketch of adapting it to a standard Rust Iterator follows this list.)
3. Schema Stability
Once you parse !def User id name, all subsequent records follow that schema. No need to re-validate structure per-record.
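As a sketch of what the iterator interface buys you, the next_record loop shown above can be adapted into a standard Rust Iterator with std::iter::from_fn, so the usual combinators (filter, map, take) apply. This assumes nothing beyond the next_record method already documented.

```rust
use std::fs::File;

use tauq::StreamingParser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?;
    let mut parser = StreamingParser::new(file);

    // Adapt next_record() into a standard Iterator so combinators compose
    let records = std::iter::from_fn(move || parser.next_record());

    // Example: keep only the first 10 records that parse successfully
    for record in records.filter_map(Result::ok).take(10) {
        println!("{:?}", record);
    }
    Ok(())
}
```

Because the closure only calls next_record on demand, the early-exit behavior from the examples above is preserved: take(10) stops reading the file after ten good records.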
Best Practices
✅ Use Streaming When:
- File size > 100 MB
- File size exceeds available RAM
- You need early exit (search/filter)
- Processing one record at a time (ETL)
- Memory constraints (embedded systems, serverless)
❌ Use Block Parsing When:
- File is small (< 10 MB)
- You need random access to records
- You need the full dataset in memory anyway
- Doing complex queries across all records
💡 Pro Tips
- Buffer writes: When writing output, use buffered IO to avoid per-record syscall overhead (see the combined sketch after this list)
- Error handling: Decide whether to skip bad records or fail fast
- Progress tracking: Use file position to show progress (bytes read / total bytes)
- Parallel processing: Split large files and stream in parallel workers
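One way the first three tips could fit together is sketched below. It is not a canonical recipe: it assumes StreamingParser::new accepts any std::io::Read (the examples above only show File), that records serialize with serde_json as in the ETL example, and it introduces a hypothetical CountingReader helper for byte-based progress.

```rust
use std::fs::File;
use std::io::{BufWriter, Read, Write};
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

use tauq::StreamingParser;

/// Hypothetical helper: wraps a reader and counts bytes as they are consumed,
/// so progress can be reported as bytes read / total bytes.
struct CountingReader<R> {
    inner: R,
    bytes_read: Arc<AtomicU64>,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read.fetch_add(n as u64, Ordering::Relaxed);
        Ok(n)
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?;
    let total_bytes = file.metadata()?.len();

    let bytes_read = Arc::new(AtomicU64::new(0));
    let input = CountingReader { inner: file, bytes_read: Arc::clone(&bytes_read) };
    // Assumption: StreamingParser::new accepts any std::io::Read, not just File
    let mut parser = StreamingParser::new(input);

    // Buffer writes: batch small writes instead of one syscall per record
    let mut output = BufWriter::new(File::create("out.jsonl")?);

    let (mut count, mut skipped) = (0u64, 0u64);
    while let Some(result) = parser.next_record() {
        // Error handling: here we skip bad records; use `?` instead to fail fast
        let record = match result {
            Ok(r) => r,
            Err(e) => {
                eprintln!("skipping bad record: {}", e);
                skipped += 1;
                continue;
            }
        };

        serde_json::to_writer(&mut output, &record)?;
        output.write_all(b"\n")?;

        // Progress tracking: report roughly every 100,000 records
        count += 1;
        if count % 100_000 == 0 {
            let done = bytes_read.load(Ordering::Relaxed);
            eprintln!("progress: {:.1}%", 100.0 * done as f64 / total_bytes as f64);
        }
    }

    output.flush()?;
    eprintln!("done: {} records written, {} skipped", count, skipped);
    Ok(())
}
```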