Big Data Technologies
Introduction
When data scale exceeds single-machine processing capacity, distributed computing frameworks are needed. This article covers the Hadoop ecosystem, Apache Spark, stream processing, data lake/data warehouse architectures, and cloud-based big data services.
1. Big Data Architecture Overview
graph TB
subgraph Data Sources
A1[Databases]
A2[Logs]
A3[IoT]
A4[APIs]
end
subgraph Data Ingestion
B1[Kafka]
B2[Flume]
B3[Sqoop]
end
subgraph Storage
C1[HDFS]
C2[S3 / Object Storage]
C3[Data Lake Delta/Iceberg]
end
subgraph Compute
D1[Spark Batch Processing]
D2[Flink Stream Processing]
D3[Spark SQL]
end
subgraph Serving
E1[Data Warehouse Hive/BQ]
E2[OLAP ClickHouse]
E3[ML Pipeline]
E4[BI Dashboards]
end
A1 & A2 & A3 & A4 --> B1 & B2 & B3
B1 & B2 & B3 --> C1 & C2 & C3
C1 & C2 & C3 --> D1 & D2 & D3
D1 & D2 & D3 --> E1 & E2 & E3 & E4
2. Hadoop Ecosystem
2.1 HDFS (Hadoop Distributed File System)
A distributed file system optimized for large files:
NameNode (master node)
- Manages file system metadata (file → block mapping)
- Does not store actual data
DataNode (worker nodes)
- Stores the actual data blocks
- Default block size: 128 MB
- Default replication factor: 3
File "logs.csv" (512MB):
Block 1 (128MB) → DataNode 1, 3, 5
Block 2 (128MB) → DataNode 2, 4, 6
Block 3 (128MB) → DataNode 1, 4, 5
Block 4 (128MB) → DataNode 2, 3, 6
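As a quick sanity check of the block math above, here is a tiny Python sketch using the defaults just listed (the helper name and constants are illustrative, not part of any HDFS API):

import math

BLOCK_SIZE_MB = 128   # default HDFS block size
REPLICATION = 3       # default replication factor

def hdfs_footprint(file_size_mb):
    """Return (number of blocks, total stored block replicas) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION

print(hdfs_footprint(512))  # (4, 12): 4 blocks, each stored on 3 DataNodes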
2.2 MapReduce
The classic distributed computing model:
Input Data
│
▼ Split
[Split 1] [Split 2] [Split 3]
│ │ │
▼ Map ▼ Map ▼ Map
(k1,v1) (k2,v2) (k1,v3)
│ │ │
▼ Shuffle & Sort
[k1: v1,v3] [k2: v2]
│ │
▼ Reduce ▼ Reduce
(k1, result1) (k2, result2)
WordCount Example:
# Map: document → (word, 1) pairs
def mapper(document):
    for word in document.split():
        yield (word, 1)

# Reduce: (word, [1, 1, 1, ...]) → (word, count)
def reducer(word, counts):
    yield (word, sum(counts))
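The same two functions can be wired together in a single-process simulation to see the map → shuffle → reduce flow end to end (a toy illustration only; Hadoop distributes these steps across the cluster):

from collections import defaultdict

def run_wordcount(documents):
    # Map phase: apply the mapper to every input split (here, every document)
    mapped = [pair for doc in documents for pair in mapper(doc)]
    # Shuffle & sort: group all emitted values under their key
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)
    # Reduce phase: one reducer call per key
    return dict(pair for word, counts in grouped.items()
                for pair in reducer(word, counts))

print(run_wordcount(["big data big compute", "big plans"]))
# {'big': 3, 'data': 1, 'compute': 1, 'plans': 1}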
2.3 YARN (Yet Another Resource Negotiator)
Hadoop's resource manager:
| Component | Responsibility |
|---|---|
| ResourceManager | Cluster resource allocation |
| NodeManager | Single-node resource management |
| ApplicationMaster | Resource coordination for a single application |
| Container | Resource allocation unit (CPU + memory) |
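To make the container concept concrete, here is a hedged sketch of how a Spark application requests resources from YARN; the executor count and sizes are arbitrary, and it assumes an already configured YARN cluster:

from pyspark.sql import SparkSession

# YARN's ResourceManager grants one container per executor,
# sized by the memory/cores requested below.
spark = (SparkSession.builder
    .appName("yarn-resource-demo")              # illustrative application name
    .master("yarn")
    .config("spark.executor.instances", "10")
    .config("spark.executor.memory", "4g")      # memory per executor container
    .config("spark.executor.cores", "2")        # vCores per executor container
    .getOrCreate())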
3. Apache Spark
Spark is MapReduce's successor; by keeping intermediate results in memory instead of writing them to disk between stages, it can run 10-100x faster on many workloads.
3.1 Core Concepts
| Concept | Description |
|---|---|
| RDD | Resilient Distributed Dataset, foundational abstraction |
| DataFrame | Distributed table with schema (recommended) |
| Dataset | Type-safe DataFrame (Scala/Java) |
| Transformation | Lazy operations (map, filter, join) |
| Action | Triggers computation (collect, count, save) |
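A small sketch of the transformation/action distinction from the table above (the dataset path and column names are hypothetical): nothing is executed until the final action runs.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("s3://bucket/data/")   # only metadata/schema is read here
adults = df.filter(df.age >= 18)               # transformation: lazy
by_city = adults.groupBy("city").count()       # transformation: still lazy
by_city.explain()                              # inspect the planned execution
n = by_city.count()                            # action: the job actually executes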
3.2 DataFrame Operations
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
.appName("DataAnalysis") \
.config("spark.sql.shuffle.partitions", 200) \
.getOrCreate()
# Read data
df = spark.read.parquet("s3://bucket/data/")
# Transformations
result = (df
.filter(F.col("age") >= 18)
.withColumn("income_level",
F.when(F.col("income") > 100000, "high")
.when(F.col("income") > 50000, "medium")
.otherwise("low"))
.groupBy("city", "income_level")
.agg(
F.count("*").alias("count"),
F.avg("income").alias("avg_income"),
F.percentile_approx("income", 0.5).alias("median_income")
)
.orderBy(F.desc("count"))
)
result.show()
result.write.parquet("s3://bucket/output/")
3.3 Spark SQL
df.createOrReplaceTempView("users")
result = spark.sql("""
SELECT
city,
COUNT(*) as user_count,
AVG(income) as avg_income,
PERCENTILE(income, 0.5) as median_income
FROM users
WHERE age >= 18
GROUP BY city
HAVING COUNT(*) > 100
ORDER BY avg_income DESC
""")
3.4 Spark MLlib
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Feature processing pipeline
# feature_cols: list of numeric feature column names (assumed to be defined earlier)
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100)
pipeline = Pipeline(stages=[assembler, scaler, rf])
# Training
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)
# Evaluation
predictions = model.transform(test_df)
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc:.4f}")
4. Stream Processing
4.1 Kafka Streams
A lightweight stream processing library (no separate cluster required):
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> events = builder.stream("user-events");
// A windowed count yields a table keyed by Windowed<String>, not plain String
KTable<Windowed<String>, Long> eventCounts = events
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .count();

// windowedSerde: a Serde for the windowed key,
// e.g. WindowedSerdes.timeWindowedSerdeFrom(String.class)
eventCounts.toStream()
    .to("event-counts", Produced.with(windowedSerde, Serdes.Long()));
4.2 Apache Flink
A distributed stream processing engine with event-time semantics, stateful operators, and exactly-once guarantees:
from pyflink.table import EnvironmentSettings, TableEnvironment
env = EnvironmentSettings.in_streaming_mode()
t_env = TableEnvironment.create(env)
# Define Kafka source
t_env.execute_sql("""
CREATE TABLE events (
user_id STRING,
event_type STRING,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'events',
'properties.bootstrap.servers' = 'localhost:9092',
'format' = 'json'
)
""")
# Windowed aggregation
result = t_env.sql_query("""
SELECT
user_id,
TUMBLE_START(event_time, INTERVAL '1' HOUR) as window_start,
COUNT(*) as event_count
FROM events
GROUP BY user_id, TUMBLE(event_time, INTERVAL '1' HOUR)
""")
4.3 Batch vs Stream Processing
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data | Bounded datasets | Unbounded data streams |
| Frameworks | MapReduce, Spark | Flink, Kafka Streams |
| Use cases | Reports, ETL, model training | Real-time monitoring, alerts, recommendations |
5. Data Lake vs Data Warehouse
| Dimension | Data Lake | Data Warehouse |
|---|---|---|
| Data format | Raw (JSON, CSV, Parquet) | Structured (Schema-on-Write) |
| Processing | Schema-on-Read | Schema-on-Write |
| Cost | Low (object storage) | High (dedicated engine) |
| Users | Data scientists, engineers | Analysts, business users |
| Representatives | S3 + Delta Lake | Snowflake, BigQuery |
5.1 Lakehouse Architecture
Combining the flexibility of data lakes with the management capabilities of data warehouses:
Object Storage (S3/ADLS)
+
Table Format (Delta Lake / Apache Iceberg)
- ACID transactions
- Schema evolution
- Time travel
- Efficient update/delete
+
Query Engine (Spark / Trino / Presto)
+
BI / ML Tools
# Delta Lake example
df.write.format("delta").save("s3://bucket/delta_table")
# Time travel
spark.read.format("delta") \
.option("timestampAsOf", "2024-01-01") \
.load("s3://bucket/delta_table")
# UPSERT (MERGE)
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "s3://bucket/delta_table")
delta_table.alias("target") \
.merge(new_data.alias("source"), "target.id = source.id") \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
6. Cloud Big Data Services
| Service | Cloud Provider | Type |
|---|---|---|
| EMR | AWS | Managed Hadoop/Spark |
| Databricks | Multi-cloud | Lakehouse platform |
| BigQuery | GCP | Serverless data warehouse |
| Snowflake | Multi-cloud | Cloud data warehouse |
| Synapse | Azure | Unified analytics platform |
| Glue | AWS | Serverless ETL |
| Dataflow | GCP | Managed Apache Beam |
7. Performance Optimization
| Optimization | Method | Effect |
|---|---|---|
| Partitioning | Partition storage by date/region | Reduces data scanned |
| Columnar storage | Parquet / ORC | Only read needed columns |
| Compression | Snappy / Zstd | Reduces I/O |
| Broadcast join | Broadcast small table to all nodes | Avoids shuffle |
| Bucketing | Pre-bucket by key | Speeds up joins |
| Caching | df.cache() | Avoids recomputation |
| Proper partition count | Tune spark.sql.shuffle.partitions | Balances parallelism and overhead |
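A few of these knobs as they look in PySpark (paths, column names, and the partition count are illustrative; assumes the spark session and df DataFrame from section 3.2):

from pyspark.sql import functions as F

# Partition output by date so later queries prune directories instead of scanning everything
df.write.partitionBy("event_date").parquet("s3://bucket/events_partitioned/")

# Broadcast join: ship the small dimension table to every executor and skip the shuffle
small_dim = spark.read.parquet("s3://bucket/dim_city/")
joined = df.join(F.broadcast(small_dim), on="city")

# Cache a DataFrame that several downstream actions reuse
joined.cache()
joined.count()   # first action materializes the cache

# Tune shuffle parallelism to the data volume at hand
spark.conf.set("spark.sql.shuffle.partitions", 400)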
References
- "Learning Spark" - Damji et al.
- "Designing Data-Intensive Applications" - Martin Kleppmann
- Apache Spark Official Documentation: https://spark.apache.org
- Delta Lake Official Documentation: https://delta.io