Streaming Parser

Process datasets larger than memory with Tauq's iterator-based streaming parser.

Why Streaming?

❌ Block Parsing

Traditional parsers load the entire file into memory:

  • 10 GB file = 10 GB+ RAM required
  • Cannot process files larger than available memory
  • Slow startup (wait for full parse)
  • Wastes memory if you only need the first N records

✅ Streaming

StreamingParser yields records one at a time:

  • Constant memory - only current record in RAM
  • Process files of any size
  • Instant first record (no wait for full parse)
  • Early exit possible (stop when condition met)

Perfect For:

  • 📊 Large datasets - Log files, database dumps, analytics data
  • 🔄 ETL pipelines - Transform records without loading everything
  • 🔍 Search/filter - Find specific records and exit early
  • 🌊 Real-time processing - Process data as it arrives

Rust API

Basic Usage

use tauq::StreamingParser;
use std::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large_data.tqn")?;
    let mut parser = StreamingParser::new(file);

    // Iterate through records
    while let Some(result) = parser.next_record() {
        match result {
            Ok(record) => {
                // Process this record
                println!("{:?}", record);
            }
            Err(e) => {
                eprintln!("Parse error: {}", e);
                break;
            }
        }
    }

    Ok(())
}

Example: Filter Large Dataset

Find all users with role "admin" in a 10 GB file:

use tauq::StreamingParser;
use std::fs::File;

fn find_admins() -> Result<Vec<serde_json::Value>, Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?; // 10 GB file
    let mut parser = StreamingParser::new(file);
    let mut admins = Vec::new();

    while let Some(result) = parser.next_record() {
        let record = result?;

        if let Some(role) = record.get("role") {
            if role == "admin" {
                admins.push(record);
            }
        }
    }

    Ok(admins)
}

// Memory usage: ~constant (only current record + result Vec)
// Without streaming: Would need 10+ GB RAM

Example: Early Exit

Find the first record matching a condition and stop:

use tauq::StreamingParser;
use std::fs::File;

fn find_user_by_id(target_id: i64) -> Result<Option<serde_json::Value>, Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?;
    let mut parser = StreamingParser::new(file);

    while let Some(result) = parser.next_record() {
        let record = result?;

        if let Some(id) = record.get("id").and_then(|v| v.as_i64()) {
            if id == target_id {
                return Ok(Some(record));  // Found it! Exit early
            }
        }
    }

    Ok(None)  // Not found
}

// Only parses records until target is found
// Could save hours on multi-GB files

Example: Transform & Write

ETL pipeline - read large file, transform records, write output:

use tauq::StreamingParser;
use std::fs::File;
use std::io::{BufWriter, Write};

fn anonymize_users() -> Result<(), Box<dyn std::error::Error>> {
    let input = File::open("users.tqn")?;
    let mut parser = StreamingParser::new(input);
    // Buffered writer: one syscall per buffer flush instead of one per record
    let mut output = BufWriter::new(File::create("users_anonymized.json")?);

    output.write_all(b"[")?;
    let mut first = true;

    while let Some(result) = parser.next_record() {
        let mut record = result?;

        // Anonymize email
        if let Some(email) = record.get_mut("email") {
            *email = serde_json::Value::String("***@***.com".to_string());
        }

        // Write transformed record
        if !first {
            output.write_all(b",")?;
        }
        first = false;
        serde_json::to_writer(&mut output, &record)?;
    }

    output.write_all(b"]")?;
    Ok(())
}

// Processes 10 GB file with ~constant memory usage

Python API

Basic Usage

import tauq

with open('large_data.tqn') as f:
    for record in tauq.stream(f):
        # Process each record
        print(record)

        # Early exit example
        if record['id'] > 1000:
            break

Example: Data Processing Pipeline

import json
from datetime import datetime

import tauq

def process_logs(input_file, output_file):
    with open(input_file) as infile, open(output_file, 'w') as outfile:
        for record in tauq.stream(infile):
            # Filter: only errors
            if record.get('level') == 'ERROR':
                # Transform: add timestamp
                record['processed_at'] = datetime.now().isoformat()

                # Load: write to output
                json.dump(record, outfile)
                outfile.write('\n')

# Constant memory usage regardless of input file size

Performance Characteristics

Memory Usage

Operation      Block Parse   Streaming Parse
10 MB file     ~10 MB RAM    ~4 KB RAM
1 GB file      ~1 GB RAM     ~4 KB RAM
100 GB file    ❌ OOM        ~4 KB RAM

Time to First Record

File Size   Block Parse   Streaming Parse
10 MB       ~50 ms        ~1 ms
1 GB        ~5 sec        ~1 ms
100 GB      ~8 min        ~1 ms

Streaming starts processing immediately without waiting for the entire file.
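
You can verify this directly. Below is a minimal sketch built on the StreamingParser API shown above; the timings themselves are illustrative and vary by machine and file:

use tauq::StreamingParser;
use std::fs::File;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large_data.tqn")?;
    let mut parser = StreamingParser::new(file);

    let start = Instant::now();
    if let Some(result) = parser.next_record() {
        // Prints almost immediately even on a huge file: only the bytes
        // of the first record have been read and parsed at this point.
        println!("First record after {:?}: {:?}", start.elapsed(), result?);
    }
    Ok(())
}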

vs Other Formats

Why Tauq Beats JSON/TOON for Streaming

1. No Count Required

TOON requires knowing array length upfront (users[1000]). Tauq doesn't.

TOON - can't stream without knowing the count up front:

users[???]{id,name}:

Tauq - stream friendly:

!def User id name

2. Iterator API

Tauq provides StreamingParser with a true iterator interface. JSON/TOON parsers typically parse the entire structure before returning.
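
If you prefer Rust's standard iterator adapters over a manual while let loop, next_record() can be wrapped with std::iter::from_fn. This wrapper is an illustration built on the API above, not a documented part of tauq:

use tauq::StreamingParser;
use std::fs::File;

fn first_ten_admins() -> Result<Vec<serde_json::Value>, Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?;
    let mut parser = StreamingParser::new(file);

    // Adapt the pull-based next_record() into a standard Iterator
    let records = std::iter::from_fn(move || parser.next_record());

    // Standard adapters now apply, and take(10) keeps the early-exit property
    let admins = records
        .filter_map(|r| r.ok()) // this variant silently skips malformed records
        .filter(|rec| rec.get("role").map_or(false, |role| role == "admin"))
        .take(10)
        .collect();
    Ok(admins)
}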

3. Schema Stability

Once you parse !def User id name, all subsequent records follow that schema. No need to re-validate structure per-record.
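
One way to exploit this, sketched with the record type used in the examples above: validate structure on the first record only, then take the fast path for every record after it:

use tauq::StreamingParser;
use std::fs::File;

fn sum_ids() -> Result<i64, Box<dyn std::error::Error>> {
    let file = File::open("users.tqn")?;
    let mut parser = StreamingParser::new(file);
    let mut validated = false;
    let mut total = 0;

    while let Some(result) = parser.next_record() {
        let record = result?;
        if !validated {
            // The !def header fixes the schema, so one structural check suffices
            if record.get("id").and_then(|v| v.as_i64()).is_none() {
                return Err("expected a numeric 'id' field".into());
            }
            validated = true;
        }
        // Fast path: every subsequent record follows the same schema
        total += record.get("id").and_then(|v| v.as_i64()).unwrap_or(0);
    }
    Ok(total)
}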

Best Practices

✅ Use Streaming When:

  • File size > 100 MB
  • File size exceeds available RAM
  • You need early exit (search/filter)
  • Processing one record at a time (ETL)
  • Memory constraints (embedded systems, serverless)

❌ Use Block Parsing When:

  • File is small (< 10 MB)
  • You need random access to records
  • You need the full dataset in memory anyway
  • Doing complex queries across all records

💡 Pro Tips

  • Buffer writes: When writing output, use buffered IO to avoid syscall overhead (see the sketch after this list)
  • Error handling: Decide whether to skip bad records or fail fast
  • Progress tracking: Use file position to show progress (bytes read / total bytes)
  • Parallel processing: Split large files and stream in parallel workers
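
The sketch below combines the first three tips: buffered output, an explicit fail-fast vs skip decision point, and progress computed from bytes read. It assumes StreamingParser::new accepts any std::io::Read; the examples above only show it taking a File, so treat that as an assumption:

use std::cell::Cell;
use std::fs::File;
use std::io::{BufWriter, Read, Write};
use std::rc::Rc;
use tauq::StreamingParser;

// Counts bytes as the parser consumes them, so progress can be reported
// without any cooperation from the parser itself.
struct CountingReader<R> {
    inner: R,
    read: Rc<Cell<u64>>,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        let n = self.inner.read(buf)?;
        self.read.set(self.read.get() + n as u64);
        Ok(n)
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("large_data.tqn")?;
    let total_bytes = file.metadata()?.len();
    let bytes_read = Rc::new(Cell::new(0u64));
    let mut parser = StreamingParser::new(CountingReader {
        inner: file,
        read: Rc::clone(&bytes_read),
    });

    // Buffered output: one syscall per buffer flush instead of one per record
    let mut out = BufWriter::new(File::create("out.jsonl")?);

    let mut count: u64 = 0;
    while let Some(result) = parser.next_record() {
        let record = result?; // fail fast on parse errors; match and `continue` to skip instead
        serde_json::to_writer(&mut out, &record)?;
        out.write_all(b"\n")?;

        count += 1;
        if count % 100_000 == 0 {
            let pct = 100.0 * bytes_read.get() as f64 / total_bytes as f64;
            eprintln!("progress: {:.1}%", pct);
        }
    }
    out.flush()?;
    Ok(())
}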

Language Bindings

See streaming APIs for your language.

View Bindings →

Syntax Guide

Learn the Tauq notation syntax.

Read Guide →

CLI Reference

Command-line tools for parsing and formatting.

View CLI →