Robust Distributed System Nucleus (rDSN) is a framework for quickly building robust distributed systems. It has a microkernel for pluggable components, including applications, distributed frameworks, devops tools, and local runtime/resource providers, enabling their independent development and seamless integration. The project was originally developed for Microsoft Bing and has since been adopted in production both inside and outside Microsoft.
- What are the existing modules I can immediately use?
- What scenarios are enabled by combining these modules differently?
- How does rDSN build robustness?
- Related papers
- [Case] RocksDB made replicated using rDSN!
- [Tutorial] A one-box cluster demo to understand how rDSN helps with service registration, deployment, monitoring, etc.
- [Tutorial] Build a counter service with built-in tools (e.g., codegen, auto-test, fault injection, bug replay, tracing)
- [Tutorial] Build a scalable and reliable counter service with built-in replication support
- API Reference
The core of rDSN is a service kernel with which we can develop (via the Service API and Tool API) and plug in many different application, framework, tool, and local runtime modules, so that they can seamlessly benefit each other. Here is an incomplete list of the pluggable modules.
| Module | Description | Status |
|--------|-------------|--------|
| dsn.core | rDSN service kernel | todo |
| dsn.dist.service.stateless | scale-out and fail-over for stateless services (e.g., micro services) | todo |
| dsn.dist.service.stateful.type1 | scale-out, replicate, and fail-over for stateful services (e.g., storage) | todo |
| dsn.dist.service.meta_server | membership, load balance, and machine pool management for the above service frameworks | todo |
| dsn.dist.uri.resolver | a client-side helper module that resolves a service URL to the target machine | todo |
| dsn.dist.traffic.router | fine-grained RPC request routing/splitting/forking to multiple services (e.g., A/B test) | todo |
| dsn.tools.common | deployment runtime (e.g., network, aio, lock, timer, perf counters, loggers) for both Windows and Linux; simple toollets, such as tracer, profiler, and fault-injector | todo |
| dsn.tools.nfs | an implementation of remote file copy based on rpc and aio | todo |
| dsn.tools.emulator | an emulation runtime for whole distributed system emulation with auto-test, replay, global state checking, etc. | todo |
| dsn.tools.hpc | high performance counterparts for the modules as implemented in tools.common | todo |
| dsn.tools.explorer | extracts task-level dependencies automatically | todo |
| dsn.tools.log.monitor | collects critical logs (e.g., log-level >= WARNING) in the cluster | todo |
| dsn.app.simple_kv | an example application module | todo |
rDSN provides flexible configuration so that developers can combine and configure the modules differently to enable different scenarios. All modules are loaded by dsn.svchost, a common process runner in rDSN, with the given configuration file. The following table lists some examples (note that dsn.core is always required and is therefore omitted from the table).
| Scenario | Modules | Configuration | Status |
|----------|---------|---------------|--------|
| logic correctness development | dsn.app.simple_kv + dsn.tools.emulator + dsn.tools.common | config | todo |
| logic correctness with failure | dsn.app.simple_kv + dsn.tools.emulator + dsn.tools.common | config | todo |
| performance tuning | dsn.app.simple_kv + dsn.tools.common | config | todo |
| progressive performance tuning | dsn.app.simple_kv + dsn.tools.common + dsn.tools.emulator | config | todo |
| Paxos enabled stateful service | dsn.app.simple_kv + dsn.tools.common + dsn.tools.emulator + dsn.dist.uri.resolver + dsn.dist.service.meta_server + dsn.dist.service.stateful.type1 | config | todo |
There are many more possibilities. rDSN provides a web portal that enables quick deployment of these scenarios in a cluster and allows easy operation through simple clicks as well as rich visualization. Deployment scenarios are defined here, and developers can add more on demand.
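
To make the idea of scenario-driven module composition concrete, here is a minimal Python sketch. It is an illustration only and does not use rDSN's configuration format or API: `KNOWN_MODULES`, `SCENARIOS`, and `resolve_modules` are hypothetical names, and only the module and scenario names are taken from the tables above.

```python
from typing import Dict, List

# Hypothetical illustration only: this is NOT rDSN's configuration format or API.
# Module names are copied from the module table above.
KNOWN_MODULES = {
    "dsn.core",
    "dsn.dist.service.stateless",
    "dsn.dist.service.stateful.type1",
    "dsn.dist.service.meta_server",
    "dsn.dist.uri.resolver",
    "dsn.dist.traffic.router",
    "dsn.tools.common",
    "dsn.tools.nfs",
    "dsn.tools.emulator",
    "dsn.tools.hpc",
    "dsn.tools.explorer",
    "dsn.tools.log.monitor",
    "dsn.app.simple_kv",
}

# Scenario -> modules, as listed in the scenario table (dsn.core omitted there).
SCENARIOS: Dict[str, List[str]] = {
    "performance tuning": ["dsn.app.simple_kv", "dsn.tools.common"],
    "progressive performance tuning": [
        "dsn.app.simple_kv",
        "dsn.tools.common",
        "dsn.tools.emulator",
    ],
}


def resolve_modules(scenario: str) -> List[str]:
    """Return the modules to load for a scenario, with dsn.core always first."""
    modules = SCENARIOS[scenario]
    unknown = [m for m in modules if m not in KNOWN_MODULES]
    if unknown:
        raise ValueError(f"unknown modules: {unknown}")
    return ["dsn.core"] + modules


if __name__ == "__main__":
    print(resolve_modules("performance tuning"))
```

In this sketch, moving from one scenario to another changes only the module list, not the application module itself.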
- reduced system complexity via microkernel architecture: applications, frameworks (e.g., replication, scale-out, fail-over), local runtime libraries (e.g., network libraries, locks), and tools are all pluggable modules into a microkernel to enable independent development and seamless integration (therefore modules are reusable and transparently benefit each other)
- flexible configuration with global deploy-time view: tailor the module instances and their connections on demand with configurable system complexity and resource allocation (e.g., run all nodes in one simulator for testing, allocate CPU resources appropriately for avoiding resource contention, debug with progressively added system complexity)
- transparent tooling support: dedicated tool API for tool development; built-in plugged tools for understanding, testing, debugging, and monitoring the upper applications and frameworks
- auto-handled distributed system challenges: built-in frameworks to achieve scalability, reliability, availability, and consistency etc. for the applications
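
The "transparent tooling" point above can be illustrated with a small Python sketch. It assumes a toy kernel that runs every task through hook points, so that tools such as a tracer or fault injector can be plugged in without touching application code. All names here (`MiniKernel`, `install_tracer`, `install_fault_injector`, `InjectedFault`) are hypothetical and are not part of rDSN's Service API or Tool API.

```python
from typing import Any, Callable, List, Set


class InjectedFault(Exception):
    """Raised by the hypothetical fault injector instead of running a task."""


class MiniKernel:
    """Toy kernel: every task passes through begin/end hook points."""

    def __init__(self) -> None:
        self._begin_hooks: List[Callable[[str], None]] = []
        self._end_hooks: List[Callable[[str], None]] = []

    def on_task_begin(self, hook: Callable[[str], None]) -> None:
        self._begin_hooks.append(hook)

    def on_task_end(self, hook: Callable[[str], None]) -> None:
        self._end_hooks.append(hook)

    def run_task(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        # Tools observe (or interrupt) the task before the application code runs.
        for hook in self._begin_hooks:
            hook(name)
        result = fn(*args)
        for hook in self._end_hooks:
            hook(name)
        return result


def install_tracer(kernel: MiniKernel, log: List[str]) -> None:
    """Tool: record task begin/end events into a caller-supplied list."""
    kernel.on_task_begin(lambda name: log.append(f"begin {name}"))
    kernel.on_task_end(lambda name: log.append(f"end {name}"))


def install_fault_injector(kernel: MiniKernel, failing_tasks: Set[str]) -> None:
    """Tool: fail the named tasks before they execute."""

    def maybe_fail(name: str) -> None:
        if name in failing_tasks:
            raise InjectedFault(name)

    kernel.on_task_begin(maybe_fail)


def increment(counter: dict, key: str) -> int:
    """Application code: unaware of any installed tools."""
    counter[key] = counter.get(key, 0) + 1
    return counter[key]
```

A possible usage: create a `MiniKernel`, call `install_tracer(kernel, log)`, then run `kernel.run_task("counter.add", increment, counter, "x")`. The application function stays unchanged whether zero, one, or several tools are installed, which is the property the list above describes.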
rDSN borrows ideas from many research works, both our own and others', and tries to bring them to production in a coherent way; we greatly appreciate the researchers who did this work.
- Failure Recovery: When the Cure Is Worse Than the Disease, HotOS'13
- SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, SOSP'01
- PacificA: replication in log-based distributed storage systems, MSR Tech Report
- Rex: Replication at the Speed of Multi-core, EuroSys'14
- Arming Cloud Services with Task Aspects, MSR Tech Report
- D3S: Debugging Deployed Distributed Systems, NSDI'08
- MoDIST: Transparent Model Checking of Unmodified Distributed Systems, NSDI'09
- G2: A Graph Processing System for Diagnosing Distributed Systems, USENIX ATC'11
- R2: An Application-Level Kernel for Record and Replay, OSDI'08
- WiDS: an Integrated Toolkit for Distributed System Development, HotOS'05
License and Support
rDSN is provided on Windows and Linux under the MIT open source license. You can use the "issues" tab in GitHub to report bugs.