A Refined Venn Diagram of Data Science
The classic Venn diagram of data science—comprising “hacking skills,” “math & stats,” and “domain expertise”—has become a meme. While catchy, it lacks intellectual rigor. It reflects neither the disciplinary foundations of data science nor the epistemic processes by which knowledge is developed. If we’re serious about defining data science as a discipline, we need to do better.
We propose a more precise model grounded in academic traditions and the scientific method. Our refined Venn diagram of data science includes three intersecting domains:
- Statistical Modeling — the science of validation and inference,
- Scientific Computing — the engine of optimization and simulation,
- Systems Research — the generator of applied intuition and understanding.
Each brings its own principles, methods, and goals. The heart of data science lies in the interplay among them.
Statistical modeling provides the theoretical machinery to evaluate hypotheses, estimate uncertainty, and draw justified inferences from data. This includes not only classical frequentist tools but also Bayesian frameworks, experimental design, and causal inference. It is the domain of epistemic humility: accepting that claims must be backed by uncertainty-aware models, not confident guesswork.
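To make this concrete, here is a minimal sketch of uncertainty-aware inference: a Beta-Binomial model for a conversion rate that reports a credible interval rather than a point guess. The scenario and the counts are hypothetical, chosen only for illustration.

```python
# Minimal Beta-Binomial sketch: report a posterior interval, not a point guess.
# The scenario and counts below are hypothetical, chosen only for illustration.
from scipy import stats

successes, trials = 42, 580        # e.g. observed conversions out of visits
prior_a, prior_b = 1.0, 1.0        # uniform Beta(1, 1) prior

posterior = stats.beta(prior_a + successes, prior_b + trials - successes)

low, high = posterior.interval(0.95)   # central 95% credible interval
print(f"posterior mean = {posterior.mean():.3f}, "
      f"95% credible interval = [{low:.3f}, {high:.3f}]")
```

Whether to keep the uniform prior, substitute an informative one, or report the full posterior is exactly the kind of choice this domain trains practitioners to justify.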
Scientific computing brings numerical optimization, algorithm design, and simulation-based techniques. From gradient descent to MCMC, it allows models to be trained, tuned, and scaled. It also includes symbolic computation, differentiable programming, and high-performance computing—powering the machinery of modern ML.
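As a small, self-contained example of the optimization side, the sketch below runs batch gradient descent on an ordinary least-squares objective. The synthetic data, step size, and iteration count are arbitrary illustrative choices, not a recipe.

```python
# Minimal batch gradient descent on a least-squares objective.
# Data, step size, and iteration count are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # synthetic features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)   # noisy targets

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)         # gradient of the mean squared error / 2
    w -= lr * grad

print("estimated weights:", w.round(3))
```

Production systems replace this loop with stochastic variants, tuned schedules, and hardware-aware implementations, but the role is the same: optimization turns a model specification into fitted parameters.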
Systems research includes software engineering, human-computer interaction, distributed systems, and sociotechnical modeling. It equips data scientists to work with real-world systems and real-world constraints. This is where questions are framed, where telemetry is designed, and where failure modes are diagnosed. It is the most neglected domain, yet it supplies the practical intuition for what data mean in operational contexts.
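To give one hypothetical flavor of what designing telemetry can look like, the sketch below defines a small, explicit event schema rather than free-form log strings. The event type and its fields are invented for this example.

```python
# Hypothetical telemetry event schema: explicit fields instead of free-form logs.
# The event type (SearchEvent) and its fields are invented for illustration.
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class SearchEvent:
    user_id: str
    query: str
    num_results: int
    latency_ms: float
    timestamp: float

def emit(event: SearchEvent) -> None:
    # In a real system this would feed a logging pipeline; here we just print JSON.
    print(json.dumps(asdict(event)))

emit(SearchEvent(user_id="u123", query="venn diagram", num_results=12,
                 latency_ms=38.5, timestamp=time.time()))
```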
The intersections of these domains form key subfields:
- Algorithms emerge where statistical modeling meets scientific computing. Here we find regularization, optimization theory, kernel methods, and the mathematically grounded side of ML.
- Experimentation arises at the intersection of systems research and statistical modeling: the design and interpretation of tests and interventions in live environments (a minimal sketch follows this list).
- Data Analysis emerges between systems research and scientific computing: understanding how to extract interpretable signals from messy, large-scale, often poorly instrumented systems.
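As a minimal sketch of the experimentation intersection, the snippet below runs a two-proportion z-test on hypothetical A/B counts. The numbers, and the choice of a two-sided z-test, are illustrative rather than a prescription.

```python
# Two-proportion z-test on hypothetical A/B counts (all numbers are illustrative).
import math
from scipy import stats

conv_a, n_a = 310, 5000   # control: conversions, users
conv_b, n_b = 370, 5000   # treatment: conversions, users

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * stats.norm.sf(abs(z))    # two-sided p-value

print(f"lift = {p_b - p_a:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

The statistics here are elementary; the hard part, which belongs to systems research, is ensuring that the counts were generated by a sound randomization and logged faithfully.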
Where all three meet—where validation, optimization, and intuition converge—is the space we should call data science proper.
This refined model has practical consequences. It can guide hiring and team design. A mature data science team should not just be a row of ML engineers. It should include researchers trained in experimental design, experts in large-scale systems instrumentation, and individuals who understand both the math and the meaning of the models they deploy.
Pedagogically, this triadic structure supports curriculum development. Courses can be mapped to ensure coverage of statistical methods, computational infrastructure, and systems reasoning. Degree programs can avoid overindexing on narrow optimization skills or purely theoretical training.
Epistemically, this model encourages a more honest engagement with how knowledge is constructed in practice. It acknowledges that neither statistical modeling nor ML algorithms are sufficient on their own. Insight arises not just from mathematical correctness or computational speed, but from embedding those tools within systems that generate meaningful observations and allow structured interventions.
To treat data science as a science is to acknowledge its multi-rooted nature. Not as a fusion of buzzwords, but as a structured intersection of disciplines, each with its own rigor, history, and methods. That intersection—carefully defined—is where data science lives.