Apache Iceberg Integration
Write TBF data to Apache Iceberg tables with full columnar optimization and distributed processing.
Overview
Apache Iceberg is a high-performance table format for huge analytic datasets. The tbf_iceberg module allows you to:
- 🔄 Convert Arrow RecordBatches to TBF format
- ✍️ Write TBF data compatible with Iceberg workflows
- 📊 Map Iceberg/Arrow schemas to TBF encoding
Enabling the Feature
Add the iceberg feature to your Cargo.toml:
[dependencies]
tauq = { version = "0.2", features = ["iceberg"] }
Basic Usage
Converting Arrow RecordBatch to TBF
use tauq::tbf_iceberg::{arrow_schema_to_tbf, ArrowToTbf};
use arrow_array::RecordBatch;
// Convert Arrow schema to TBF schema
let tbf_schema = arrow_schema_to_tbf(&arrow_schema);
// Encode RecordBatch to TBF bytes
let tbf_bytes = record_batch.encode_to_tbf(&tbf_schema);
Using the TbfFileWriter
use tauq::tbf_iceberg::{TbfFileWriterBuilder, TbfFileWriter};
use iceberg::spec::Schema as IcebergSchema;
// Create writer from Iceberg schema
let mut writer = TbfFileWriterBuilder::new()
    .with_iceberg_schema(&iceberg_schema)
    .build();
// Write batches
writer.write(&batch1);
writer.write(&batch2);
// Finish and get the TBF data
let tbf_data = writer.finish();
// Read the row count first, assuming into_bytes() takes tbf_data by value
// (as into_* methods conventionally do) and so ends its lifetime
let row_count = tbf_data.row_count;
// Access the bytes
let bytes: Vec<u8> = tbf_data.into_bytes();
println!("Wrote {} rows, {} bytes", row_count, bytes.len());
Schema Conversion
Arrow to TBF Type Mapping
| Arrow Type | TBF Encoding |
|---|---|
| Int8 | I8 |
| Int16 | I16 |
| Int32, Date32 | I32 |
| Int64, Date64, Timestamp | I64 |
| UInt8 | U8 |
| UInt16 | U16 |
| UInt32 | U32 |
| UInt64 | U64 |
| Float32 | Float32 |
| Float64 | Float64 |
| Utf8, LargeUtf8 | Dictionary |
| Boolean | Bool |
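The table above can be read as a single `match` over the Arrow type. The sketch below uses locally defined stand-in enums (`ArrowType`, `TbfEncoding`) so that it compiles without dependencies. Both enums and the function name `arrow_to_tbf` are illustrative assumptions, not the actual `arrow_schema::DataType` or `tauq` definitions, and the real `arrow_schema_to_tbf` may treat types not listed in the table differently.

```rust
/// Stand-in for the Arrow types listed in the table (hypothetical,
/// not arrow_schema::DataType).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowType {
    Int8, Int16, Int32, Int64,
    UInt8, UInt16, UInt32, UInt64,
    Float32, Float64,
    Date32, Date64, Timestamp,
    Utf8, LargeUtf8,
    Boolean,
}

/// Stand-in for the TBF encodings named in the table (hypothetical).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum TbfEncoding {
    I8, I16, I32, I64,
    U8, U16, U32, U64,
    Float32, Float64,
    Dictionary,
    Bool,
}

/// Mirrors the Arrow-to-TBF table row by row.
fn arrow_to_tbf(ty: ArrowType) -> TbfEncoding {
    match ty {
        ArrowType::Int8 => TbfEncoding::I8,
        ArrowType::Int16 => TbfEncoding::I16,
        ArrowType::Int32 | ArrowType::Date32 => TbfEncoding::I32,
        ArrowType::Int64 | ArrowType::Date64 | ArrowType::Timestamp => TbfEncoding::I64,
        ArrowType::UInt8 => TbfEncoding::U8,
        ArrowType::UInt16 => TbfEncoding::U16,
        ArrowType::UInt32 => TbfEncoding::U32,
        ArrowType::UInt64 => TbfEncoding::U64,
        ArrowType::Float32 => TbfEncoding::Float32,
        ArrowType::Float64 => TbfEncoding::Float64,
        ArrowType::Utf8 | ArrowType::LargeUtf8 => TbfEncoding::Dictionary,
        ArrowType::Boolean => TbfEncoding::Bool,
    }
}

fn main() {
    // Print the mapping for a few types that share an encoding with others.
    for ty in [ArrowType::Date32, ArrowType::Timestamp, ArrowType::LargeUtf8] {
        println!("{:?} -> {:?}", ty, arrow_to_tbf(ty));
    }
}
```

Because the `match` is exhaustive, adding a variant to the stand-in enum without a mapping fails to compile, which is one way to keep a table like this and the code in step.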
Iceberg to TBF Type Mapping
| Iceberg Type | TBF Encoding |
|---|---|
| Boolean | Bool |
| Int | I32 |
| Long | I64 |
| Float | Float32 |
| Double | Float64 |
| String | Dictionary |
| Binary, Fixed | Inline |
| Date | I32 |
| Time | I64 |
| Timestamp, Timestamptz | I64 |
| Decimal | VarInt |
| Uuid | Inline |
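The integer widths in the temporal rows follow from how the Iceberg spec represents these values: `date` is days since 1970-01-01, `time` is microseconds since midnight, and `timestamp`/`timestamptz` are microseconds since the epoch. The dependency-free sketch below shows why a date fits in 32 bits while a time of day does not. The helper names are illustrative and not part of `tauq` or `iceberg`.

```rust
/// Microseconds in one day; the largest `time` value is one less than this.
const MICROS_PER_DAY: i64 = 24 * 60 * 60 * 1_000_000;

/// Returns true if `value` can be stored in an i32 without loss.
fn fits_i32(value: i64) -> bool {
    i32::try_from(value).is_ok()
}

/// Converts microseconds since the epoch to whole days since 1970-01-01.
/// `div_euclid` rounds toward negative infinity, so pre-1970 instants
/// land on the correct (negative) day.
fn micros_to_days(ts_micros: i64) -> i64 {
    ts_micros.div_euclid(MICROS_PER_DAY)
}

fn main() {
    // A timestamp (microseconds) needs 64 bits, but its date part does not.
    let ts_micros: i64 = 1_700_000_000_000_000;
    let days = micros_to_days(ts_micros);
    println!("timestamp fits i32: {}", fits_i32(ts_micros));
    println!("date ({} days) fits i32: {}", days, fits_i32(days));

    // The last microsecond of a day already exceeds i32::MAX.
    let max_time = MICROS_PER_DAY - 1;
    println!("max time of day fits i32: {}", fits_i32(max_time));
}
```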
Complete Example
use std::sync::Arc;
use arrow_array::{Int32Array, StringArray, RecordBatch};
use arrow_schema::{DataType, Field, Schema as ArrowSchema};
use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};
use tauq::tbf_iceberg::{TbfFileWriterBuilder, ArrowToTbf, arrow_schema_to_tbf};
fn main() {
    // Define Iceberg schema
    let iceberg_schema = Schema::builder()
        .with_fields(vec![
            Arc::new(NestedField::required(1, "id", Type::Primitive(PrimitiveType::Int))),
            Arc::new(NestedField::required(2, "name", Type::Primitive(PrimitiveType::String))),
        ])
        .build()
        .unwrap();
    // Create writer
    let mut writer = TbfFileWriterBuilder::new()
        .with_iceberg_schema(&iceberg_schema)
        .build();
    // Create Arrow data
    let arrow_schema = Arc::new(ArrowSchema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        arrow_schema,
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["Alice", "Bob", "Carol"])),
        ],
    ).unwrap();
    // Write and finish
    writer.write(&batch);
    let data = writer.finish();
    println!("TBF output: {} bytes for {} rows", data.file_size(), data.row_count);
}
Custom TBF Schema
For fine-grained control over encoding:
use tauq::tbf::{TableSchemaBuilder, FieldEncoding};
use tauq::tbf_iceberg::TbfFileWriterBuilder;
// Build custom schema with specific encodings
let tbf_schema = TableSchemaBuilder::new()
    .column("id", FieldEncoding::U16) // Override: use U16 instead of I32
    .column("name", FieldEncoding::Dictionary)
    .column("age", FieldEncoding::U8Offset { offset: 18 })
    .build();
let writer = TbfFileWriterBuilder::new()
    .with_tbf_schema(tbf_schema)
    .build();
Integration Patterns
Pattern 1: Batch Processing
Process large datasets in batches:
for batch in record_batch_reader {
    // Assumes the reader yields Result<RecordBatch, ArrowError>,
    // as Arrow's RecordBatchReader does
    let batch = batch?;
    writer.write(&batch);
}
let data = writer.finish();
write_to_object_store(&data.into_bytes()).await?;
Pattern 2: Direct Conversion
One-shot conversion for smaller datasets:
let tbf_schema = arrow_schema_to_tbf(&schema);
let bytes = batch.encode_to_tbf(&tbf_schema);
Pattern 3: Streaming with Custom Schema
Stream with optimized schema:
let optimized_schema = analyze_data_and_build_schema(&sample_batch);
let mut writer = TbfFileWriterBuilder::new()
    .with_tbf_schema(optimized_schema)
    .build();
while let Some(batch) = stream.next().await {
    writer.write(&batch);
}
Performance Considerations
Optimization Strategies
- ✅ Schema Optimization: Pre-analyze data to choose encodings
- ✅ Batch Size: Larger batches amortize overhead
- ✅ Dictionary Encoding: Best for columns where distinct values make up less than 10% of the rows
- ✅ Column Order: Put frequently-accessed columns first
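The schema-optimization and dictionary-encoding strategies can be sketched as a small pre-analysis pass over a sample of each column. Everything below is an assumption made for illustration: the `Suggested` enum is a local stand-in rather than `tauq`'s `FieldEncoding`, `Inline` is used as a placeholder for "no dictionary", and `U8Offset` is assumed to store `value - offset` in a single byte. The 10% threshold is the one quoted in the list above.

```rust
use std::collections::HashSet;

/// Illustrative encoding suggestions (hypothetical, not tauq's FieldEncoding).
#[derive(Debug, PartialEq)]
enum Suggested {
    Dictionary,
    Inline,
    U8Offset { offset: i64 },
    I32,
    I64,
}

/// Fraction of rows that hold a distinct value. An empty sample is
/// reported as 1.0 so that it never qualifies for a dictionary.
fn cardinality_ratio(values: &[&str]) -> f64 {
    if values.is_empty() {
        return 1.0;
    }
    let distinct: HashSet<&str> = values.iter().copied().collect();
    distinct.len() as f64 / values.len() as f64
}

/// Suggests a dictionary when fewer than 10% of the rows are distinct.
fn suggest_string_encoding(values: &[&str]) -> Suggested {
    if cardinality_ratio(values) < 0.10 {
        Suggested::Dictionary
    } else {
        Suggested::Inline
    }
}

/// Suggests the narrowest integer encoding that covers the sample.
fn suggest_int_encoding(values: &[i64]) -> Suggested {
    let (min, max) = match (values.iter().min(), values.iter().max()) {
        (Some(&lo), Some(&hi)) => (lo, hi),
        _ => return Suggested::I64, // empty sample: stay conservative
    };
    // checked_sub guards against overflow for extreme min/max pairs.
    let fits_in_byte = max
        .checked_sub(min)
        .map_or(false, |span| span <= i64::from(u8::MAX));
    if fits_in_byte {
        Suggested::U8Offset { offset: min }
    } else if i32::try_from(min).is_ok() && i32::try_from(max).is_ok() {
        Suggested::I32
    } else {
        Suggested::I64
    }
}

fn main() {
    // 100 rows, 2 distinct values: 2% cardinality.
    let mut status = vec!["ok"; 95];
    status.extend(vec!["error"; 5]);
    println!("status: {:?}", suggest_string_encoding(&status));

    // Ages span 18..=97, so value - 18 fits in one byte.
    let ages = [18_i64, 42, 97, 65];
    println!("age: {:?}", suggest_int_encoding(&ages));
}
```

A sample-based choice like this is only as good as the sample: a later value outside the observed range would not fit a `U8Offset` column, so a production pass would need a fallback or a validation step.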
TBF vs Parquet
TBF: Fast writes, streaming
Parquet: Analytics, archival
TBF excels when write speed matters more than maximum compression
Comparison with Parquet
| Aspect | TBF | Parquet |
|---|---|---|
| Encoding speed | Faster | Slower |
| Compression | Good | Better |
| Random access | Limited | Full |
| Best for | Streaming | Analytics |
Ideal for Data Lakes
TBF with Iceberg is well suited to:
- ✅ High-throughput streaming ingestion
- ✅ Columnar storage optimization
- ✅ Time-series data processing
- ✅ Real-time analytics pipelines