Big Data Technologies

Introduction

When data scale exceeds single-machine processing capacity, distributed computing frameworks are needed. This article covers the Hadoop ecosystem, Apache Spark, stream processing, data lake/data warehouse architectures, and cloud-based big data services.


1. Big Data Architecture Overview

graph TB
    subgraph Data Sources
        A1[Databases]
        A2[Logs]
        A3[IoT]
        A4[APIs]
    end

    subgraph Data Ingestion
        B1[Kafka]
        B2[Flume]
        B3[Sqoop]
    end

    subgraph Storage
        C1[HDFS]
        C2[S3 / Object Storage]
        C3[Data Lake Delta/Iceberg]
    end

    subgraph Compute
        D1[Spark Batch Processing]
        D2[Flink Stream Processing]
        D3[Spark SQL]
    end

    subgraph Serving
        E1[Data Warehouse Hive/BQ]
        E2[OLAP ClickHouse]
        E3[ML Pipeline]
        E4[BI Dashboards]
    end

    A1 & A2 & A3 & A4 --> B1 & B2 & B3
    B1 & B2 & B3 --> C1 & C2 & C3
    C1 & C2 & C3 --> D1 & D2 & D3
    D1 & D2 & D3 --> E1 & E2 & E3 & E4

2. Hadoop Ecosystem

2.1 HDFS (Hadoop Distributed File System)

A distributed file system optimized for large files:

NameNode (master node)
  - Manages file system metadata (file → block mapping)
  - Does not store actual data

DataNode (data nodes)
  - Stores actual data blocks
  - Default block size: 128 MB
  - Default replication factor: 3

File "logs.csv" (512MB):
  Block 1 (128MB) → DataNode 1, 3, 5
  Block 2 (128MB) → DataNode 2, 4, 6
  Block 3 (128MB) → DataNode 1, 4, 5
  Block 4 (128MB) → DataNode 2, 3, 6
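
The arithmetic behind this layout is simple. The toy sketch below (plain Python, not real HDFS client code) only illustrates how a file is cut into 128 MB blocks and how each block gets 3 replicas; real HDFS placement is rack-aware, so the round-robin choice here is purely illustrative.

import math

BLOCK_SIZE_MB = 128
REPLICATION = 3
datanodes = [f"DataNode {i}" for i in range(1, 7)]

def place_blocks(file_size_mb):
    """Split a file into fixed-size blocks and pick 3 distinct nodes per block."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for b in range(num_blocks):
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        placement[f"Block {b + 1}"] = replicas
    return placement

for block, replicas in place_blocks(512).items():
    print(block, "→", replicas)   # 4 blocks, each stored on 3 of the 6 DataNodes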

2.2 MapReduce

The classic distributed computing model:

Input Data
  │
  ▼ Split
[Split 1] [Split 2] [Split 3]
  │          │          │
  ▼ Map      ▼ Map      ▼ Map
(k1,v1)   (k2,v2)   (k1,v3)
  │          │          │
  ▼ Shuffle & Sort
[k1: v1,v3]  [k2: v2]
  │              │
  ▼ Reduce      ▼ Reduce
(k1, result1) (k2, result2)

WordCount Example:

# Map: document → (word, 1)
def mapper(document):
    for word in document.split():
        yield (word, 1)

# Reduce: (word, [1,1,1,...]) → (word, count)
def reducer(word, counts):
    yield (word, sum(counts))
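
To make the Shuffle & Sort step concrete, the small local driver below (plain Python, no Hadoop required) runs the mapper and reducer defined above over two illustrative documents:

from collections import defaultdict

documents = ["big data big compute", "data lake data warehouse"]

# Map phase: apply the mapper to every input split
mapped = [pair for doc in documents for pair in mapper(doc)]

# Shuffle & Sort: group all intermediate values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate the grouped values for each key
for word in sorted(groups):
    for key, total in reducer(word, groups[word]):
        print(key, total)   # big 2, compute 1, data 3, lake 1, warehouse 1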

2.3 YARN (Yet Another Resource Negotiator)

Hadoop's resource manager:

Component           Responsibility
ResourceManager     Cluster resource allocation
NodeManager         Single-node resource management
ApplicationMaster   Resource coordination for a single application
Container           Resource allocation unit (CPU + memory)
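
As a rough sketch of how these components appear in practice: when a Spark application is submitted to YARN, each executor request below becomes a Container (memory + cores) that the ResourceManager allocates and a NodeManager launches, coordinated by the application's ApplicationMaster. All values are illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("yarn-example")
    .master("yarn")                               # submit to a YARN cluster
    .config("spark.executor.instances", "10")     # ask for 10 containers...
    .config("spark.executor.memory", "4g")        # ...of roughly 4 GB memory each
    .config("spark.executor.cores", "2")          # ...and 2 cores each
    .getOrCreate())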

3. Apache Spark

Spark is the de facto successor to MapReduce, achieving roughly 10-100x speedups by keeping intermediate data in memory rather than writing it to disk between stages.

3.1 Core Concepts

Concept          Description
RDD              Resilient Distributed Dataset, foundational abstraction
DataFrame        Distributed table with schema (recommended)
Dataset          Type-safe DataFrame (Scala/Java)
Transformation   Lazy operations (map, filter, join)
Action           Triggers computation (collect, count, save)
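
A minimal example of the transformation/action distinction, assuming a running SparkSession; nothing below executes until the final count():

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

numbers = spark.range(1_000_000)                        # DataFrame with one "id" column

# Transformations: only build up an execution plan, no job runs yet
evens = numbers.filter(F.col("id") % 2 == 0)
squared = evens.withColumn("square", F.col("id") * F.col("id"))

# Action: triggers the actual distributed computation
print(squared.count())                                  # 500000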

3.2 DataFrame Operations

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("DataAnalysis") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/data/")

# Transformations
result = (df
    .filter(F.col("age") >= 18)
    .withColumn("income_level", 
        F.when(F.col("income") > 100000, "high")
         .when(F.col("income") > 50000, "medium")
         .otherwise("low"))
    .groupBy("city", "income_level")
    .agg(
        F.count("*").alias("count"),
        F.avg("income").alias("avg_income"),
        F.percentile_approx("income", 0.5).alias("median_income")
    )
    .orderBy(F.desc("count"))
)

result.show()
result.write.parquet("s3://bucket/output/")
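
Because the transformations above are lazy, nothing runs until show() or write() is called; before triggering them you can inspect the plan Spark has built (including the shuffle whose partition count was configured earlier):

result.explain()   # prints the physical plan for the filter/groupBy/aggregate pipeline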

3.3 Spark SQL

df.createOrReplaceTempView("users")

result = spark.sql("""
    SELECT 
        city,
        COUNT(*) as user_count,
        AVG(income) as avg_income,
        PERCENTILE(income, 0.5) as median_income
    FROM users
    WHERE age >= 18
    GROUP BY city
    HAVING COUNT(*) > 100
    ORDER BY avg_income DESC
""")

3.4 Spark MLlib

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Feature processing pipeline
# feature_cols: list of numeric feature column names in df (assumed defined earlier)
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

pipeline = Pipeline(stages=[assembler, scaler, rf])

# Training
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)

# Evaluation
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc:.4f}")

4. Stream Processing

4.1 Kafka Streams

A lightweight stream processing library (no separate cluster required):

StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> events = builder.stream("user-events");

KTable<String, Long> eventCounts = events
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .count();

// windowedSerde: serde for the Windowed<String> keys produced by windowedBy(),
// e.g. WindowedSerdes.timeWindowedSerdeFrom(String.class)
eventCounts.toStream()
    .to("event-counts", Produced.with(windowedSerde, Serdes.Long()));

4.2 Apache Flink

A powerful distributed stream processing engine:

from pyflink.table import EnvironmentSettings, TableEnvironment

env = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env)

# Define Kafka source
t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        event_type STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Windowed aggregation
result = t_env.sql_query("""
    SELECT 
        user_id,
        TUMBLE_START(event_time, INTERVAL '1' HOUR) as window_start,
        COUNT(*) as event_count
    FROM events
    GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)
""")

4.3 Batch vs Stream Processing

Dimension    Batch Processing              Stream Processing
Latency      Minutes to hours              Milliseconds to seconds
Data         Bounded datasets              Unbounded data streams
Frameworks   MapReduce, Spark              Flink, Kafka Streams
Use cases    Reports, ETL, model training  Real-time monitoring, alerts, recommendations

5. Data Lake vs Data Warehouse

Dimension         Data Lake                    Data Warehouse
Data format       Raw (JSON, CSV, Parquet)     Structured (Schema-on-Write)
Processing        Schema-on-Read               Schema-on-Write
Cost              Low (object storage)         High (dedicated engine)
Users             Data scientists, engineers   Analysts, business users
Representatives   S3 + Delta Lake              Snowflake, BigQuery
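
A small PySpark illustration of the Schema-on-Read idea (paths and columns are illustrative): raw files in the lake can be read with an inferred schema, or with an explicit schema imposed at query time, without rewriting the data the way a Schema-on-Write warehouse load would.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Schema-on-Read, variant 1: let Spark infer the structure from raw JSON
raw = spark.read.json("s3://bucket/raw/events/")
raw.printSchema()

# Schema-on-Read, variant 2: impose an explicit schema at read time
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])
typed = spark.read.schema(schema).json("s3://bucket/raw/events/")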

5.1 Lakehouse Architecture

Combining the flexibility of data lakes with the management capabilities of data warehouses:

Object Storage (S3/ADLS)
    +
Table Format (Delta Lake / Apache Iceberg)
    - ACID transactions
    - Schema evolution
    - Time travel
    - Efficient update/delete
    +
Query Engine (Spark / Trino / Presto)
    +
BI / ML Tools

# Delta Lake example
df.write.format("delta").save("s3://bucket/delta_table")

# Time travel
spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("s3://bucket/delta_table")

# UPSERT (MERGE)
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://bucket/delta_table")
delta_table.alias("target") \
    .merge(new_data.alias("source"), "target.id = source.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
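
Every write and MERGE above is recorded as a commit in the Delta transaction log, which is what makes time travel possible. Two quick ways to work with that log (version 3 below is just an example):

# Inspect the table's commit history: operation, timestamp, and metrics per version
delta_table.history().show(truncate=False)

# Time travel by version number instead of timestamp
spark.read.format("delta").option("versionAsOf", 3).load("s3://bucket/delta_table")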

6. Cloud Big Data Services

Service      Cloud Provider   Type
EMR          AWS              Managed Hadoop/Spark
Databricks   Multi-cloud      Lakehouse platform
BigQuery     GCP              Serverless data warehouse
Snowflake    Multi-cloud      Cloud data warehouse
Synapse      Azure            Unified analytics platform
Glue         AWS              Serverless ETL
Dataflow     GCP              Managed Apache Beam

7. Performance Optimization

Optimization             Method                                Effect
Partitioning             Partition storage by date/region      Reduces data scanned
Columnar storage         Parquet / ORC                         Only read needed columns
Compression              Snappy / Zstd                         Reduces I/O
Broadcast join           Broadcast small table to all nodes    Avoids shuffle
Bucketing                Pre-bucket by key                     Speeds up joins
Caching                  df.cache()                            Avoids recomputation
Proper partition count   Tune spark.sql.shuffle.partitions     Balance parallelism and overhead
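
Hedged PySpark sketches of several techniques from the table above; paths, column names, bucket counts, and the DataFrames large_df / small_dim_df are illustrative assumptions.

from pyspark.sql import functions as F

# Partitioning: lay out output by date so date-filtered queries scan less data
df.write.partitionBy("event_date").parquet("s3://bucket/events_partitioned/")

# Columnar storage + compression: Parquet files compressed with Zstd
df.write.option("compression", "zstd").parquet("s3://bucket/events_zstd/")

# Broadcast join: ship the small dimension table to every executor, avoiding a shuffle
joined = large_df.join(F.broadcast(small_dim_df), "id")

# Bucketing: pre-hash rows by the join key into a fixed number of buckets
df.write.bucketBy(32, "user_id").sortBy("user_id").saveAsTable("events_bucketed")

# Caching: keep a frequently reused DataFrame in memory across actions
df.cache()

# Partition count: tune shuffle parallelism to the data volume
spark.conf.set("spark.sql.shuffle.partitions", "400")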

References

  • "Learning Spark" - Damji et al.
  • "Designing Data-Intensive Applications" - Martin Kleppmann
  • Apache Spark Official Documentation: https://spark.apache.org
  • Delta Lake Official Documentation: https://delta.io
