The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services, OSDI14 [pdf]: logging with request id, traces by request id, modelling causal relations, happens-before, critical path/slack, aggregate for probabilistic model, predict based on history
★ IntroPerf: Transparent Context-Sensitive Multi-Layer Performance Inference using System Stack Traces, SIGMETRICS14 [pdf]: system stack trace, function call graph/tree, call-path latency profiling, call-path based debugging/patching
Towards General-Purpose Resource Management in Shared Cloud Services, HotDep14 [pdf, slides], NSDI15: performance versus resource sharing, tracing and resource profiling, application/src-level resource modeling, control points
Non-intrusive, Out-of-band and Out-of-the-box Systems Monitoring in the Cloud, SIGMETRICS14 [pdf]: VM monitoring
So, you want to trace your distributed system? Key design insights from years of practical experience, CMU-TR 2014 [pdf]: request tracing (survey)
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google technical report, 2010: tracing
Detecting large-scale system problems by mining console logs, ICML10 [pdf], SOSP09 [pdf]: console log, stitching logs by static analysis
lprof: A Non-intrusive Request Flow Profiler for Distributed Systems, OSDI14: stitching logs by static analysis, non-intrusive tracing, perf. bug verified by patch
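The "critical path/slack" note on The Mystery Machine can be illustrated with a small sketch. It assumes a single request whose segments and happens-before edges are already known; the segment names and durations are hypothetical, and this is a textbook forward/backward pass over a DAG, not the paper's inference pipeline.

```python
from collections import defaultdict

def critical_path(durations, edges):
    """durations: {segment: ms}; edges: (before, after) happens-before pairs.
    Returns (end-to-end latency, slack per segment, zero-slack segments)."""
    succs, preds = defaultdict(list), defaultdict(list)
    indeg = {s: 0 for s in durations}
    for a, b in edges:
        succs[a].append(b)
        preds[b].append(a)
        indeg[b] += 1

    # Topological order (Kahn's algorithm).
    order, ready = [], [s for s in durations if indeg[s] == 0]
    while ready:
        s = ready.pop()
        order.append(s)
        for t in succs[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)
    if len(order) != len(durations):
        raise ValueError("happens-before graph contains a cycle")

    # Forward pass: earliest finish time of each segment.
    earliest = {}
    for s in order:
        start = max((earliest[p] for p in preds[s]), default=0)
        earliest[s] = start + durations[s]
    total = max(earliest.values())

    # Backward pass: latest finish that does not delay the request.
    latest = {}
    for s in reversed(order):
        latest[s] = min((latest[t] - durations[t] for t in succs[s]), default=total)

    slack = {s: latest[s] - earliest[s] for s in durations}
    critical = [s for s in order if slack[s] == 0]
    return total, slack, critical

# Hypothetical page-load request (integer milliseconds).
durations = {"server": 100, "network": 50, "js": 30, "css": 80, "render": 40}
edges = [("server", "network"), ("network", "js"), ("network", "css"),
         ("js", "render"), ("css", "render")]
total, slack, critical = critical_path(durations, edges)
print(total, slack["js"], critical)
```

Here "js" could take 50 ms longer without affecting end-to-end latency, while any delay on a zero-slack segment delays the whole request.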
Tail latency
★ Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency, SOCC14 [pdf]: latency variance, causes of tails, sharing, scheduler:pin-core, single-queue reduces latency, no remote access in NUMA, interrupt isolation by dedicated core
★ Bobtail: Avoiding Long Tails in the Cloud, NSDI13: vm scheduling, vm responsiveness by sleep-and-wake
★ The Tail at Scale, CACM13 [pdf] (using SU network): tail tolerance, duplicated requests (first to finish wins, cancel the others), if it works at small scale, move on to large-scale execution
C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection, NSDI15 [pdf]
CosTLO: Cost-Effective Redundancy for Lower Latency Variance on Cloud Storage Services, NSDI15
Simba: Tunable End-to-End Data Consistency for Mobile Apps, EuroSys15 [pdf]: mobile/cloud, nosql
Stronger Semantics for Low-Latency Geo-Replicated Storage, NSDI13
The Hybrex model for confidentiality and privacy in cloud computing, HotCloud11: hybrid-cloud
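The duplicated-request idea noted under The Tail at Scale can be sketched as follows: send to one replica, send a duplicate to the next only if no reply arrives within a short delay, take the first reply, and cancel the rest. The `hedged` and `fake_request` functions, the replica names, and the delays are all hypothetical; this is a minimal asyncio sketch, not code from any of the listed papers.

```python
import asyncio

async def hedged(make_request, replicas, hedge_delay):
    """Try replicas in order, adding one duplicate every hedge_delay seconds
    until some request completes; cancel whatever is still outstanding."""
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.create_task(make_request(replica)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay, return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # All replicas have been asked; wait for whichever answers first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished

async def fake_request(replica):
    # Hypothetical backend: one slow replica, one fast replica.
    delays = {"slow": 1.0, "fast": 0.01}
    await asyncio.sleep(delays[replica])
    return f"reply from {replica}"

result = asyncio.run(hedged(fake_request, ["slow", "fast"], hedge_delay=0.05))
print(result)
```

Delaying the duplicate, rather than sending both requests at once, keeps the extra load small: only requests that are already slower than `hedge_delay` are duplicated.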
Kernel optimization for IO at bare-metal speed
★ Arrakis: The Operating System is the Control Plane, OSDI14: control plane in kernel, data plane out of kernel, LibOS, capacity, Intel VT, disk/network IO
★ IX: A Protected Dataplane Operating System for High Throughput and Low Latency, OSDI14
Cold-data storage
★ Pelican: A Building Block for Exascale Cold Data Storage, OSDI14 [pdf]: rack-scale computing, energy/cooling, cold data, disk spin up/down, low throughput
Characterizing Storage Workloads with Counter Stacks, OSDI14 [pdf]
Elasticity
★ E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems, VLDB14 [pdf]: hot data fine partitioned, cold data coarsely partitioned
★ Benchmarking Scalability and Elasticity of Distributed Database Systems, VLDB14 [pdf]
MET: Workload aware elasticity for NoSQL, EuroSys13 [pdf]
On Scale Independence for Querying Big Data, PODS14
Consistency
★ Caelus: Verifying the Consistency of Cloud Services with Battery-Powered Devices, SP15 [pdf]
★ Consistency-based service level agreements for cloud storage, SOSP13
★ Probabilistically Bounded Staleness for Practical Partial Quorums, VLDB12
★ Salt: Combining ACID and BASE in a Distributed Database, OSDI14
Scalable Atomic Visibility with RAMP Transactions, SIGMOD14
CAP theorem: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services
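As a rough illustration of the partial-quorum setting in the Probabilistically Bounded Staleness entry above: if the latest write has reached exactly W of N replicas and a read contacts R replicas chosen uniformly at random, the read misses the latest version with probability C(N-W, R) / C(N, R). The sketch below computes that quantity. It assumes no anti-entropy or write propagation after the quorum is reached, so it gives a conservative bound; the function name is hypothetical.

```python
from math import comb

def stale_read_probability(n, r, w):
    """Probability that a random read quorum of size r contains none of the
    w replicas holding the latest write, out of n replicas in total."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("need 1 <= r <= n and 1 <= w <= n")
    # comb(n - w, r) is 0 when r > n - w, i.e. when R + W > N (strict quorum).
    return comb(n - w, r) / comb(n, r)

p_partial = stale_read_probability(3, 1, 1)  # N=3, R=W=1
p_strict = stale_read_probability(3, 2, 2)   # N=3, R=W=2, quorums overlap
print(p_partial, p_strict)
```

With N=3 and R=W=1 the bound is 2/3, while any configuration with R + W > N gives 0 because read and write quorums must intersect.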