Readings on systems and performance

  • ★ indicates the required reading for class presentation.
    • indicates optional/background reading.

Performance

Predictive performance db

  • ★ On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems, VLDB12 [pdf]: predict to optimize, distributed transactions, stored procedure, workload trace, Markov modeling
  • ★ Performance and Resource Modeling in Highly-Concurrent OLTP Workloads, SIGMOD13 [pdf], [src]: log, statistical analysis, TPC-C/OLTP, predict bottleneck, predict resource consumption, predict throughput
  • Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads, VLDB13 [pdf]
  • Automated analysis of multithreaded programs for performance modeling, SIGMETRICS14 [pdf] (using SU network)
  • Uncertainty Aware Query Execution Time Prediction, VLDB14 [pdf]

Performance isolation

  • ★ Performance Isolation and Fairness for Multi-Tenant Cloud Storage, OSDI12 [link]: multi-tenant kv store, system-wide fairness, utilization, dominant resource, local weight (re)allocation
  • ★ CPI2: CPU performance isolation for shared compute clusters, EuroSys13 [pdf]: per-task CPI sampling, perf. counter, anomaly detection, high CPI indicates contention
  • End-to-end Performance Isolation Through Virtual Datacenters, OSDI14 [link]: network isolation, isolated appliance
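
A minimal sketch of the "per-task CPI sampling, anomaly detection" keywords above, assuming CPI (cycles per instruction) is computed from two hardware counters and compared against a baseline for the same job. The function name, the baseline list, and the two-standard-deviation threshold are illustrative assumptions, not the exact procedure from the CPI2 paper.

```python
from statistics import mean, stdev


def flag_high_cpi(baseline_cpis, current_counters, num_sigmas=2.0):
    """Return the ids of tasks whose CPI is unusually high.

    baseline_cpis: historical CPI samples for this job (assumed input).
    current_counters: dict of task_id -> (cycles, instructions).
    num_sigmas: illustrative threshold, an assumption of this sketch.
    """
    threshold = mean(baseline_cpis) + num_sigmas * stdev(baseline_cpis)
    flagged = []
    for task_id, (cycles, instructions) in current_counters.items():
        if instructions == 0:
            continue  # no work retired in this sample, nothing to judge
        if cycles / instructions > threshold:
            flagged.append(task_id)
    return sorted(flagged)
```

A flagged task is only a candidate victim of contention; deciding which co-located task causes the interference is a separate step.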

Request trace analysis

  • The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services, OSDI14 [pdf]: logging with request id, traces by request id, modelling causal relations, happens-before, critical path/slack, aggregate for probabilistic model, predict based on history
  • ★ IntroPerf: Transparent Context-Sensitive Multi-Layer Performance Inference using System Stack Traces, SIGMETRICS14 [pdf]: system stack trace, function call graph/tree, call-path latency profiling, call-path based debugging/patching
  • Towards General-Purpose Resource Management in Shared Cloud Services, HotDep14 [pdf, slides], NSDI15: performance versus resource sharing, tracing and resource profiling, application/src-level resource modeling, control points
  • Non-intrusive, Out-of-band and Out-of-the-box Systems Monitoring in the Cloud, SIGMETRICS14 [pdf]: VM monitoring
    • So, you want to trace your distributed system? Key design insights from years of practical experience, CMU-TR 2014 [pdf]: request tracing (survey)
    • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google technical report, 2010: tracing
  • Detecting large-scale system problems by mining console logs, ICML10 [pdf], SOSP09 [pdf]: console log, stitching logs by static analysis
    • lprof: A Non-intrusive Request Flow Profiler for Distributed Systems, OSDI14: stitching logs by static analysis, non-intrusive tracing, perf. bug verified by patch

Tail latency

  • ★ Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency, SOCC14 [pdf]: latency variance, causes of tails, sharing, scheduler: pin threads to cores, single-queue reduces latency, no remote access in NUMA, interrupt isolation by dedicated core
  • ★ Bobtail: Avoiding Long Tails in the Cloud, NSDI13: vm scheduling, vm responsiveness by sleep-and-wake
  • ★ The Tail at Scale, CACM13 [pdf] (using SU network): tail tolerance, duplicated requests, first to finish wins, cancel the others, try at small scale first and move on to large-scale execution only if it works
  • C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection, NSDI15 [pdf]
  • CosTLO: Cost-Effective Redundancy for Lower Latency Variance on Cloud Storage Services, NSDI15
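
A minimal sketch of the "duplicated requests, first to finish wins, cancel the others" idea noted for The Tail at Scale, assuming an asyncio client. The names `hedged_request`, `fetch`, and `hedge_delay` are hypothetical and chosen for illustration; this is not code from any of the papers listed here.

```python
import asyncio


async def hedged_request(replicas, fetch, hedge_delay):
    """Query replicas one at a time, adding a duplicate request whenever
    the outstanding ones have not answered within hedge_delay seconds.
    Returns the first result and cancels every other request."""
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.create_task(fetch(replica)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay,
                return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # Every replica has been asked; wait for whichever answers first.
        done, _ = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()  # has no effect on tasks that already finished
```

Delaying the duplicate by `hedge_delay` keeps the extra load small, since most requests finish before the second copy is sent. The sketch does not retry on failure: if the first request to finish raised an exception, that exception propagates to the caller.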

Distributed storage system (NoSQL)

Not-Only-SQL

  • ★ Spanner: Google's Globally-Distributed Database [cockroach]
  • ★ F1: A Distributed SQL Database That Scales
    • New SQL: An Alternative to NoSQL and Old SQL for New OLTP Apps

Onto multi-cloud

  • ★ A Self-Configurable Geo-Replicated Cloud Storage System, OSDI14 [pdf]: multi-datacenter
  • ★ Customizable and Extensible Deployment for Mobile/Cloud Applications, OSDI14 [pdf]: mobile/cloud
  • ★ SPANStore: cost-effective geo-replicated storage spanning multiple cloud services, SOSP13 [pdf]: multi-cloud
  • Simba: Tunable End-to-End Data Consistency for Mobile Apps, EuroSys15 [pdf]: mobile/cloud, NoSQL
  • Stronger Semantics for Low-Latency Geo-Replicated Storage, NSDI13
  • The Hybrex model for confidentiality and privacy in cloud computing, HotCloud11: hybrid cloud

Kernel optimization for IO at bare-metal speed

  • ★ Arrakis: The Operating System is the Control Plane, OSDI14: control plane in kernel, data plane out of kernel, LibOS, capacity, Intel VT, disk/network IO
  • ★ IX: A Protected Dataplane Operating System for High Throughput and Low Latency, OSDI14

Cold-data storage

  • ★ Pelican: A Building Block for Exascale Cold Data Storage, OSDI14 [pdf]: rack-scale computing, energy/cooling, cold data, disk spin up/down, low throughput
  • ★ f4: Facebook's Warm BLOB Storage System, OSDI14 [pdf]: RAID at rack-scale, error-correcting code
  • Characterizing Storage Workloads with Counter Stacks, OSDI14 [pdf]

Elasticity

  • ★ E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems, VLDB14 [pdf]: hot data finely partitioned, cold data coarsely partitioned
  • ★ Benchmarking Scalability and Elasticity of Distributed Database Systems, VLDB14 [pdf]
  • MET: Workload aware elasticity for NoSQL, EuroSys13 [pdf]
    • On Scale Independence for Querying Big Data [PODS14]

Consistency

  • ★ Caelus: Verifying the Consistency of Cloud Services with Battery-Powered Devices, SP15 [pdf]
  • ★ Consistency-based service level agreements for cloud storage, SOSP13
  • ★ Probabilistically Bounded Staleness for Practical Partial Quorums, VLDB12
  • ★ Salt: Combining ACID and BASE in a Distributed Database, OSDI14
  • Scalable Atomic Visibility with RAMP Transactions, SIGMOD14
    • CAP theorem: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services
    • CAP theorem in plain English [link]

Fault tolerance

  • ★ SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems, OSDI14
  • ★ Torturing Databases for Fun and Profit, OSDI14
  • ★ Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems, SOCC13