Big Data Technologies

Introduction

When data scale exceeds single-machine processing capacity, distributed computing frameworks are needed. This article covers the Hadoop ecosystem, Apache Spark, stream processing, data lake/data warehouse architectures, and cloud-based big data services.


1. Big Data Architecture Overview

graph TB
    subgraph Data Sources
        A1[Databases]
        A2[Logs]
        A3[IoT]
        A4[APIs]
    end

    subgraph Data Ingestion
        B1[Kafka]
        B2[Flume]
        B3[Sqoop]
    end

    subgraph Storage
        C1[HDFS]
        C2[S3 / Object Storage]
        C3[Data Lake Delta/Iceberg]
    end

    subgraph Compute
        D1[Spark Batch Processing]
        D2[Flink Stream Processing]
        D3[Spark SQL]
    end

    subgraph Serving
        E1[Data Warehouse Hive/BQ]
        E2[OLAP ClickHouse]
        E3[ML Pipeline]
        E4[BI Dashboards]
    end

    A1 & A2 & A3 & A4 --> B1 & B2 & B3
    B1 & B2 & B3 --> C1 & C2 & C3
    C1 & C2 & C3 --> D1 & D2 & D3
    D1 & D2 & D3 --> E1 & E2 & E3 & E4

2. Hadoop Ecosystem

2.1 HDFS (Hadoop Distributed File System)

A distributed file system optimized for large files:

NameNode (master node)
  - Manages file system metadata (file → block mapping)
  - Does not store actual data

DataNode (data nodes)
  - Stores actual data blocks
  - Default block size: 128 MB
  - Default replication factor: 3

File "logs.csv" (512MB):
  Block 1 (128MB) → DataNode 1, 3, 5
  Block 2 (128MB) → DataNode 2, 4, 6
  Block 3 (128MB) → DataNode 1, 4, 5
  Block 4 (128MB) → DataNode 2, 3, 6
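
The arithmetic behind this layout is simple. The toy sketch below (plain Python, not real HDFS client code) only illustrates how a file is cut into 128 MB blocks and how each block gets 3 replicas; real HDFS placement is rack-aware, so the round-robin choice here is purely illustrative.

import math

BLOCK_SIZE_MB = 128
REPLICATION = 3
datanodes = [f"DataNode {i}" for i in range(1, 7)]

def place_blocks(file_size_mb):
    """Split a file into fixed-size blocks and pick 3 distinct nodes per block."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    placement = {}
    for b in range(num_blocks):
        replicas = [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        placement[f"Block {b + 1}"] = replicas
    return placement

for block, replicas in place_blocks(512).items():
    print(block, "→", replicas)   # 4 blocks, each stored on 3 of the 6 DataNodes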

2.2 MapReduce

The classic distributed computing model:

Input Data
  │
  ▼ Split
[Split 1] [Split 2] [Split 3]
  │          │          │
  ▼ Map      ▼ Map      ▼ Map
(k1,v1)   (k2,v2)   (k1,v3)
  │          │          │
  ▼ Shuffle & Sort
[k1: v1,v3]  [k2: v2]
  │              │
  ▼ Reduce      ▼ Reduce
(k1, result1) (k2, result2)

WordCount Example:

# Map: document → (word, 1)
def mapper(document):
    for word in document.split():
        yield (word, 1)

# Reduce: (word, [1,1,1,...]) → (word, count)
def reducer(word, counts):
    yield (word, sum(counts))
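
To make the Shuffle & Sort step concrete, the small local driver below (plain Python, no Hadoop required) runs the mapper and reducer defined above over two illustrative documents:

from collections import defaultdict

documents = ["big data big compute", "data lake data warehouse"]

# Map phase: apply the mapper to every input split
mapped = [pair for doc in documents for pair in mapper(doc)]

# Shuffle & Sort: group all intermediate values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate the grouped values for each key
for word in sorted(groups):
    for key, total in reducer(word, groups[word]):
        print(key, total)   # big 2, compute 1, data 3, lake 1, warehouse 1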

2.3 YARN (Yet Another Resource Negotiator)

Hadoop's resource manager:

Component           Responsibility
ResourceManager     Cluster resource allocation
NodeManager         Single-node resource management
ApplicationMaster   Resource coordination for a single application
Container           Resource allocation unit (CPU + memory)
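
As a rough sketch of how these components appear in practice: when a Spark application is submitted to YARN, each executor request below becomes a Container (memory + cores) that the ResourceManager allocates and a NodeManager launches, coordinated by the application's ApplicationMaster. All values are illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("yarn-example")
    .master("yarn")                               # submit to a YARN cluster
    .config("spark.executor.instances", "10")     # ask for 10 containers...
    .config("spark.executor.memory", "4g")        # ...of roughly 4 GB memory each
    .config("spark.executor.cores", "2")          # ...and 2 cores each
    .getOrCreate())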

3. Apache Spark

Spark is the de facto successor to MapReduce, achieving roughly 10-100x speedups by keeping intermediate data in memory rather than writing it to disk between stages.

3.1 Core Concepts

Concept          Description
RDD              Resilient Distributed Dataset, foundational abstraction
DataFrame        Distributed table with schema (recommended)
Dataset          Type-safe DataFrame (Scala/Java)
Transformation   Lazy operations (map, filter, join)
Action           Triggers computation (collect, count, save)
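
A minimal example of the transformation/action distinction, assuming a running SparkSession; nothing below executes until the final count():

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

numbers = spark.range(1_000_000)                        # DataFrame with one "id" column

# Transformations: only build up an execution plan, no job runs yet
evens = numbers.filter(F.col("id") % 2 == 0)
squared = evens.withColumn("square", F.col("id") * F.col("id"))

# Action: triggers the actual distributed computation
print(squared.count())                                  # 500000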

3.2 DataFrame Operations

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("DataAnalysis") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()

# Read data
df = spark.read.parquet("s3://bucket/data/")

# Transformations
result = (df
    .filter(F.col("age") >= 18)
    .withColumn("income_level", 
        F.when(F.col("income") > 100000, "high")
         .when(F.col("income") > 50000, "medium")
         .otherwise("low"))
    .groupBy("city", "income_level")
    .agg(
        F.count("*").alias("count"),
        F.avg("income").alias("avg_income"),
        F.percentile_approx("income", 0.5).alias("median_income")
    )
    .orderBy(F.desc("count"))
)

result.show()
result.write.parquet("s3://bucket/output/")
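
Because the transformations above are lazy, nothing runs until show() or write() is called; before triggering them you can inspect the plan Spark has built (including the shuffle whose partition count was configured earlier):

result.explain()   # prints the physical plan for the filter/groupBy/aggregate pipeline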

3.3 Spark SQL

df.createOrReplaceTempView("users")

result = spark.sql("""
    SELECT 
        city,
        COUNT(*) as user_count,
        AVG(income) as avg_income,
        PERCENTILE(income, 0.5) as median_income
    FROM users
    WHERE age >= 18
    GROUP BY city
    HAVING COUNT(*) > 100
    ORDER BY avg_income DESC
""")

3.4 Spark MLlib

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Feature processing pipeline
# feature_cols: list of numeric feature column names in df (assumed defined earlier)
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)

pipeline = Pipeline(stages=[assembler, scaler, rf])

# Training
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)

# Evaluation
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc:.4f}")

4. Stream Processing

4.1 Kafka Streams

A lightweight stream processing library (no separate cluster required):

StreamsBuilder builder = new StreamsBuilder();

KStream<String, String> events = builder.stream("user-events");

KTable<String, Long> eventCounts = events
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .count();

// windowedSerde: serde for the Windowed<String> keys produced by windowedBy(),
// e.g. WindowedSerdes.timeWindowedSerdeFrom(String.class)
eventCounts.toStream()
    .to("event-counts", Produced.with(windowedSerde, Serdes.Long()));

4.2 Apache Flink

A powerful distributed stream processing engine:

from pyflink.table import EnvironmentSettings, TableEnvironment

env = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env)

# Define Kafka source
t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        event_type STRING,
        event_time TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Windowed aggregation
result = t_env.sql_query("""
    SELECT 
        user_id,
        TUMBLE_START(event_time, INTERVAL '1' HOUR) as window_start,
        COUNT(*) as event_count
    FROM events
    GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)
""")

4.3 Batch vs Stream Processing

Dimension    Batch Processing              Stream Processing
Latency      Minutes to hours              Milliseconds to seconds
Data         Bounded datasets              Unbounded data streams
Frameworks   MapReduce, Spark              Flink, Kafka Streams
Use cases    Reports, ETL, model training  Real-time monitoring, alerts, recommendations

5. Data Lake vs Data Warehouse

Dimension         Data Lake                    Data Warehouse
Data format       Raw (JSON, CSV, Parquet)     Structured (Schema-on-Write)
Processing        Schema-on-Read               Schema-on-Write
Cost              Low (object storage)         High (dedicated engine)
Users             Data scientists, engineers   Analysts, business users
Representatives   S3 + Delta Lake              Snowflake, BigQuery
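
A small PySpark illustration of the Schema-on-Read idea (paths and columns are illustrative): raw files in the lake can be read with an inferred schema, or with an explicit schema imposed at query time, without rewriting the data the way a Schema-on-Write warehouse load would.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Schema-on-Read, variant 1: let Spark infer the structure from raw JSON
raw = spark.read.json("s3://bucket/raw/events/")
raw.printSchema()

# Schema-on-Read, variant 2: impose an explicit schema at read time
schema = StructType([
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])
typed = spark.read.schema(schema).json("s3://bucket/raw/events/")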

5.1 Lakehouse Architecture

Combining the flexibility of data lakes with the management capabilities of data warehouses:

Object Storage (S3/ADLS)
    +
Table Format (Delta Lake / Apache Iceberg)
    - ACID transactions
    - Schema evolution
    - Time travel
    - Efficient update/delete
    +
Query Engine (Spark / Trino / Presto)
    +
BI / ML Tools

# Delta Lake example
df.write.format("delta").save("s3://bucket/delta_table")

# Time travel
spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("s3://bucket/delta_table")

# UPSERT (MERGE)
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://bucket/delta_table")
delta_table.alias("target") \
    .merge(new_data.alias("source"), "target.id = source.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()
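
Every write and MERGE above is recorded as a commit in the Delta transaction log, which is what makes time travel possible. Two quick ways to work with that log (version 3 below is just an example):

# Inspect the table's commit history: operation, timestamp, and metrics per version
delta_table.history().show(truncate=False)

# Time travel by version number instead of timestamp
spark.read.format("delta").option("versionAsOf", 3).load("s3://bucket/delta_table")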

6. Cloud Big Data Services

Service      Cloud Provider   Type
EMR          AWS              Managed Hadoop/Spark
Databricks   Multi-cloud      Lakehouse platform
BigQuery     GCP              Serverless data warehouse
Snowflake    Multi-cloud      Cloud data warehouse
Synapse      Azure            Unified analytics platform
Glue         AWS              Serverless ETL
Dataflow     GCP              Managed Apache Beam

7. Performance Optimization

Optimization             Method                                Effect
Partitioning             Partition storage by date/region      Reduces data scanned
Columnar storage         Parquet / ORC                         Only read needed columns
Compression              Snappy / Zstd                         Reduces I/O
Broadcast join           Broadcast small table to all nodes    Avoids shuffle
Bucketing                Pre-bucket by key                     Speeds up joins
Caching                  df.cache()                            Avoids recomputation
Proper partition count   Tune spark.sql.shuffle.partitions     Balance parallelism and overhead
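
Hedged PySpark sketches of several techniques from the table above; paths, column names, bucket counts, and the DataFrames large_df / small_dim_df are illustrative assumptions.

from pyspark.sql import functions as F

# Partitioning: lay out output by date so date-filtered queries scan less data
df.write.partitionBy("event_date").parquet("s3://bucket/events_partitioned/")

# Columnar storage + compression: Parquet files compressed with Zstd
df.write.option("compression", "zstd").parquet("s3://bucket/events_zstd/")

# Broadcast join: ship the small dimension table to every executor, avoiding a shuffle
joined = large_df.join(F.broadcast(small_dim_df), "id")

# Bucketing: pre-hash rows by the join key into a fixed number of buckets
df.write.bucketBy(32, "user_id").sortBy("user_id").saveAsTable("events_bucketed")

# Caching: keep a frequently reused DataFrame in memory across actions
df.cache()

# Partition count: tune shuffle parallelism to the data volume
spark.conf.set("spark.sql.shuffle.partitions", "400")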

References

  • "Learning Spark" - Damji et al.
  • "Designing Data-Intensive Applications" - Martin Kleppmann
  • Apache Spark Official Documentation: https://spark.apache.org
  • Delta Lake Official Documentation: https://delta.io
