Readings on systems and performance

Performance

★ On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems, VLDB12 [pdf]: predict to optimize, distributed transactions, stored procedure, workload trace, Markov modeling
★ Performance and Resource Modeling in Highly-Concurrent OLTP Workloads, SIGMOD13 [pdf], [src]: log, statistical analysis, TPC-C/OLTP, predict bottleneck, predict resource consumption, predict throughput
Towards Predicting Query Execution Time for Concurrent and Dynamic Database Workloads, VLDB13 [pdf]
Automated analysis of multithreaded programs for performance modeling, SIGMETRICS14 [pdf] (using SU network)
Uncertainty Aware Query Execution Time Prediction, VLDB14 [pdf]

★ Performance Isolation and Fairness for Multi-Tenant Cloud Storage, OSDI12 [link]: multi-tenant kv store, system-wide fairness, utilization, dominant resource, local weight (re)allocation
★ CPI2: CPU performance isolation for shared compute clusters, EuroSys13 [pdf]: per-task CPI sampling, performance counters, anomaly detection, high CPI indicates contention
End-to-end Performance Isolation Through Virtual Datacenters, OSDI14 [link]: network isolation, isolated appliance
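
The CPI2 keywords above (per-task CPI sampling, anomaly detection, high CPI as a sign of contention) can be illustrated with a minimal sketch. It flags samples whose cycles-per-instruction value exceeds the baseline mean by a chosen number of standard deviations. The function name, the two-sigma threshold, and the sample values are assumptions made for illustration, not the paper's actual pipeline.

```python
from statistics import mean, stdev

def flag_cpi_outliers(baseline, samples, num_sigmas=2.0):
    """Return the samples whose CPI exceeds mean + num_sigmas * stdev of the baseline.

    `baseline` is a list of CPI values observed when the task behaved normally;
    `samples` are recent CPI measurements to check. Both are hypothetical inputs.
    """
    threshold = mean(baseline) + num_sigmas * stdev(baseline)
    return [s for s in samples if s > threshold]

# Hypothetical data: a task that normally runs at about 1.0 CPI.
baseline_cpi = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]
recent_cpi = [1.0, 1.02, 2.4, 0.98]
outliers = flag_cpi_outliers(baseline_cpi, recent_cpi)
print(outliers)
```

A flagged sample only suggests possible contention; deciding which co-located task to throttle needs further correlation that this sketch leaves out.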

The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services, OSDI14 [pdf]: logging with request id, traces by request id, modelling causal relations, happens-before, critical path/slack, aggregate for probabilistic model, predict based on history
★ IntroPerf: Transparent Context-Sensitive Multi-Layer Performance Inference using System Stack Traces, SIGMETRICS14 [pdf]: system stack trace, function call graph/tree, call-path latency profiling, call-path based debugging/patching
Towards General-Purpose Resource Management in Shared Cloud Services, HotDep14 [pdf, slides], NSDI15: performance versus resource sharing, tracing and resource profiling, application/src-level resource modeling, control points
Non-intrusive, Out-of-band and Out-of-the-box Systems Monitoring in the Cloud, SIGMETRICS14 [pdf]: VM monitoring
- So, you want to trace your distributed system? Key design insights from years of practical experience, CMU-TR 2014 [pdf]: request tracing (survey)
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Google technical report, 2010: tracing
Detecting large-scale system problems by mining console logs, ICML10 [pdf], SOSP09 [pdf]: console log, stitching logs by static analysis
- lprof: A Non-intrusive Request Flow Profiler for Distributed Systems, OSDI14: stitching logs by static analysis, non-intrusive tracing, perf. bug verified by patch

★ Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency, SoCC14 [pdf]: latency variance, causes of tails, sharing, scheduler: pin threads to cores, single queue reduces latency, no remote access in NUMA, interrupt isolation by dedicated core
★ Bobtail: Avoiding Long Tails in the Cloud, NSDI13: vm scheduling, vm responsiveness by sleep-and-wake
★ The Tail at Scale, CACM13 [pdf] (using SU network): tail tolerance, duplicated requests, cancel the others once one finishes, try at small scale first and move on to large-scale execution only if it works
C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection, NSDI15 [pdf]
CosTLO: Cost-Effective Redundancy for Lower Latency Variance on Cloud Storage Services, NSDI15
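
The "duplicated requests, cancel the others" idea noted for The Tail at Scale can be sketched as a hedged request: send to one replica, and if no reply arrives within a short delay, send a duplicate to the next replica and take whichever answer comes first. This is a minimal sketch under stated assumptions; `hedged_call`, `fake_request`, the replica names, and the delays are all hypothetical, and a real client would also need error handling for failed replicas.

```python
import asyncio

async def hedged_call(make_request, replicas, hedge_delay):
    """Return the first reply, issuing a duplicate request to the next replica
    each time `hedge_delay` seconds pass without any reply."""
    tasks = []
    try:
        for replica in replicas:
            tasks.append(asyncio.ensure_future(make_request(replica)))
            done, _ = await asyncio.wait(
                tasks, timeout=hedge_delay, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                return done.pop().result()
        # All replicas have been tried; wait for whichever replies first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        # Cancel outstanding duplicates (a no-op for tasks already finished).
        for task in tasks:
            task.cancel()

async def fake_request(replica):
    # Hypothetical replicas: one is slow, one is fast.
    delays = {"slow": 0.5, "fast": 0.01}
    await asyncio.sleep(delays[replica])
    return replica

result = asyncio.run(hedged_call(fake_request, ["slow", "fast"], hedge_delay=0.05))
print(result)
```

Choosing `hedge_delay` near a high percentile of normal latency keeps the extra load small, since only the slowest requests trigger a duplicate.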

★ A Self-Configurable Geo-Replicated Cloud Storage System, OSDI14 [pdf]: multi-datacenter
★ Customizable and Extensible Deployment for Mobile/Cloud Applications, OSDI14 [pdf]: mobile/cloud
★ SPANStore: cost-effective geo-replicated storage spanning multiple cloud services, SOSP13 [pdf]: multi-cloud
Simba: Tunable End-to-End Data Consistency for Mobile Apps, EuroSys15 [pdf]: mobile/cloud, nosql
Stronger Semantics for Low-Latency Geo-Replicated Storage, NSDI13
The Hybrex model for confidentiality and privacy in cloud computing, HotCloud11: hybrid-cloud

★ Arrakis: The Operating System is the Control Plane, OSDI14: control plane in kernel, data plane out of kernel, LibOS, capacity, Intel VT, disk/network IO
★ IX: A Protected Dataplane Operating System for High Throughput and Low Latency, OSDI14

★ Pelican: A Building Block for Exascale Cold Data Storage, OSDI14 [pdf]: rack-scale computing, energy/cooling, cold data, disk spin up/down, low throughput
★ f4: Facebook's Warm BLOB Storage System, OSDI14 [pdf]: RAID at rack-scale, error-correcting code
Characterizing Storage Workloads with Counter Stacks, OSDI14 [pdf]

★ E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems, VLDB14 [pdf]: hot data fine partitioned, cold data coarsely partitioned
★ Benchmarking Scalability and Elasticity of Distributed Database Systems, VLDB14 [pdf]
MET: Workload aware elasticity for NoSQL, EuroSys13 [pdf]
- On Scale Independence for Querying Big Data, PODS14

★ Caelus: Verifying the Consistency of Cloud Services with Battery-Powered Devices, SP15 [pdf]
★ Consistency-based service level agreements for cloud storage, SOSP13
★ Probabilistically Bounded Staleness for Practical Partial Quorums, VLDB12
★ Salt: Combining ACID and BASE in a Distributed Database, OSDI14
Scalable Atomic Visibility with RAMP Transactions, SIGMOD14
- CAP theorem: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services
- CAP theorem in plain English [link]

★ SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems, OSDI14
★ Torturing Databases for Fun and Profit, OSDI14
★ Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems, SoCC13