Schema-Based Encoding
Type hints and optimization techniques for achieving maximum compression and performance.
Why Schema Encoding Matters
Generic Encoding (CLI)
$ tauq build data.tqn --format tbf Compression: ~44-56% of JSON
Setup: None
Use when: Quick conversions
Schema-Aware (Rust API)
#[derive(TableEncode)]
struct Data { /* ... */ } Compression: ~17% of JSON (84% reduction)
Setup: Add derive macro
Use when: Maximum compression needed
The TableEncode Trait
The TableEncode derive macro generates optimized encoding based on field annotations:
use tauq::tbf::TableEncode;
use serde::{Serialize, Deserialize};
#[derive(Serialize, Deserialize, TableEncode)]
struct Employee {
#[tauq(encoding = "u16")]
id: u32,
#[tauq(encoding = "dictionary")]
department: String,
#[tauq(encoding = "u8", offset = 18)]
age: u32,
#[tauq(encoding = "f32")]
rating: f64,
} Available Encodings
Integer Encodings
#[tauq(encoding = "u8")] // 0-255
#[tauq(encoding = "u16")] // 0-65,535
#[tauq(encoding = "u32")] // 0-4,294,967,295
#[tauq(encoding = "u64")] // Full range
#[tauq(encoding = "i8")] // -128 to 127
#[tauq(encoding = "i16")] // -32,768 to 32,767
#[tauq(encoding = "i32")] // Full i32 range
#[tauq(encoding = "i64")] // Full i64 range
#[tauq(encoding = "varint")] // Adaptive (default) Offset Encodings
Store values relative to a base offset to fit larger logical ranges into smaller types:
// Employee IDs 10000-10255 stored as 0-255
#[tauq(encoding = "u8", offset = 10000)]
employee_id: u32,
// Years 1900-2155 stored as 0-255
#[tauq(encoding = "u8", offset = 1900)]
birth_year: u32,
// Temperatures -40 to 215 stored as 0-255
#[tauq(encoding = "u8", offset = -40)]
temperature: i32, String Encodings
#[tauq(encoding = "dictionary")] // Best for repeated values
#[tauq(encoding = "inline")] // Best for unique values
#[tauq(encoding = "auto")] // Analyze and pick best (default) Float Encodings
#[tauq(encoding = "f32")] // 4 bytes, ~7 significant digits
#[tauq(encoding = "f64")] // 8 bytes, ~15 significant digits (default) Manual Schema Building
For dynamic schemas or non-derive use cases:
use tauq::tbf::{TableSchemaBuilder, FieldEncoding};
let schema = TableSchemaBuilder::new()
.column("id", FieldEncoding::U16)
.column("name", FieldEncoding::Dictionary)
.column("department", FieldEncoding::Dictionary)
.column("age", FieldEncoding::U8Offset { offset: 18 })
.column("salary", FieldEncoding::Float64)
.column("active", FieldEncoding::Bool)
.build(); Adaptive Encoders
For cases where you don't know the data distribution upfront:
AdaptiveIntEncoder
Collects all values first, then picks optimal encoding:
use tauq::tbf::AdaptiveIntEncoder;
let mut encoder = AdaptiveIntEncoder::new(FieldEncoding::Auto, 1000);
for value in values {
encoder.push(value);
}
encoder.encode_to(&mut buffer); AdaptiveStringEncoder
Builds dictionary if beneficial, falls back to inline:
use tauq::tbf::AdaptiveStringEncoder;
let mut encoder = AdaptiveStringEncoder::new(FieldEncoding::Auto, 1000);
for s in strings {
encoder.push(&s);
}
encoder.encode_to(&mut buffer); Best Practices
1. Use Offset Encoding for Constrained Ranges
// Bad: wastes 3 bytes per value
#[tauq(encoding = "u32")]
age: u32, // Always 0-150
// Good: uses 1 byte per value
#[tauq(encoding = "u8", offset = 0)]
age: u32, // Clamped to 0-255 2. Use Dictionary for Low-Cardinality Strings
// Department names repeat across thousands of employees
#[tauq(encoding = "dictionary")]
department: String, // "Engineering", "Sales", "HR", etc. 3. Use f32 When Precision Allows
// Rating doesn't need 15 digits of precision
#[tauq(encoding = "f32")]
rating: f64, // 4.5, 3.8, etc. 4. Consider Data Distribution
// If most IDs are small but some are large, use varint
#[tauq(encoding = "varint")]
user_id: u64,
// If all IDs are uniformly distributed, use fixed-width
#[tauq(encoding = "u64")]
random_id: u64, Size Impact
Example with 10,000 employees:
| Configuration | Size | vs Default |
|---|---|---|
| All defaults (Auto) | 180 KB | baseline |
| Optimized types | 95 KB | -47% |
| With offset encoding | 82 KB | -54% |
| Full optimization | 68 KB | -62% |
Encoding Selection Guide
| Data Characteristic | Recommended Encoding |
|---|---|
| Small positive integers (0-255) | u8 |
| Bounded range with known min | u8/u16 + offset |
| Mostly small, occasionally large | varint |
| Uniformly distributed | Fixed-width (u32, u64) |
| Few unique strings, many repeats | dictionary |
| Unique strings (UUIDs, etc.) | inline |
| Unknown string distribution | auto |