Roman's data science

Another Data Science book?

Why I decided to write and publish a new Data Science book.

When I started working on the book "Roman's Data Science: How to Monetize Your Data," I wanted to bridge the gap between the interests of business and Data Science. We data scientists are under the pressure of high expectations. Many executives expect that implementing ML models will solve almost all of their business problems. We, in turn, are also victims of our own expectations. On entering the profession, many are surprised to realize that creating helpful ML models takes only 10% of their time. Fixing problems and proving that these models benefit the company in production takes the other 90%.

So in this book, I decided to show how things really are. It will be helpful both for managers who assign tasks to data scientists and for the data scientists who perform these tasks. I hope that through this book, we will be able to get more value out of data.

Book Style

The book is addressed to a wide range of readers. The use of mathematics is kept to a minimum (it is almost absent, except for two chapters), and programming code is omitted entirely. This lets readers with no background in mathematics or programming become closely acquainted with data analytics and understand what we do.

I wrote the book as if I were writing it for myself at the beginning of my career, mentioning mistakes and failed experiments. If there had been a time machine, I would certainly have sent it to myself, like the anti-hero Biff in "Back to the Future Part II", who sent Grays Sports Almanac to himself in the past to make money on sports betting. Would I have made fewer mistakes then? I think just as many, only different and more interesting ones :).

Summary Table of Contents

The book consists of 13 chapters, and I wrote it in a general-to-specific format.
The first chapter focuses on the most crucial goal of Data Science - data-driven decision-making. Decisions can be made either by a human who is given data in a report, dashboard, or interactive system, or automatically by ML models trained on past decisions. I don't have a complete decision-making theory, but I set out the principles that seem particularly important.

Chapter 2 covers the main areas of data analytics: business intelligence, machine learning, and data engineering. It also raises some questions: who really should analyze data, what's wrong with a metrics sheet on a dashboard, how little we value data engineering, and the conflict of interest between the researcher and the business.

The third chapter will be beneficial for startups - those who are building analytics from scratch. Here I have tried to outline the basics, from technology selection to management, including task management and my experiment with task randomization, conducted over several years.

The fourth chapter is about analytical tasks and how to set, validate, and test them; the second part is about the basics of statistics, descriptive analysis, and basic graphs. Additionally, I wrote about my experience with paired data analysis, which I practiced in my startup, and technical debt in large ML projects.

Chapter five is all about data. It is a collection of recipes about data - how to collect it, how to control its quality without spending a lot of effort, and how to deal with some technical issues.

Chapter six covers data warehouses: why we need them, how data gets there, archiving, and some thoughts on Hadoop and Spark, the open-source technologies that make storing and processing large amounts of data cheap.

Chapter seven is about data analysis tools. Even though I've been programming a lot since I was a kid, I consider spreadsheets (Excel, Google Sheets) and pivot tables the best inventions for data analysis. Here I also discuss notebook services, statistical analysis packages, and my criteria for a sound reporting system, which saves us analysts a lot of headaches.

The eighth chapter, which covers basic machine learning algorithms and everything related to them, overlaps heavily with Andrew Ng's course on machine learning. I only wrote about the algorithms that I use a lot.

The ninth chapter is machine learning in practice: how to learn ML algorithms effectively, how to make the best use of Kaggle competitions, how to use Mechanical Turk for ML, and the observation that Recency and Frequency are the most potent predictors of user behavior.

The tenth chapter is devoted to testing algorithms in production. I wrote about three approaches to analyzing A/B tests: Fisher's statistics (p-value), Bayesian statistics, and the bootstrap.
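
For readers who do want a taste of code, here is a minimal sketch of the third approach, the bootstrap, applied to an A/B test. It is not taken from the book, which contains no code. The function name `bootstrap_diff_ci`, its parameters, and the conversion data are all hypothetical, and the percentile interval used here is only one of several bootstrap variants.

```python
import random


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean(treatment) - mean(control).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original group sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled differences.
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Made-up data: 1 = user converted, 0 = user did not convert.
control = [1] * 100 + [0] * 900      # 10% conversion
treatment = [1] * 150 + [0] * 850    # 15% conversion

low, high = bootstrap_diff_ci(control, treatment)
print(f"95% CI for the uplift: [{low:.3f}, {high:.3f}]")
```

If the whole interval lies above zero, the data are consistent with the new variant converting better; if it straddles zero, the test has not shown a difference.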

The eleventh chapter, data ethics, discusses the good and bad uses of data. Admittedly, I have encountered temptations to misuse data, so it is crucial to understand the difference between good and evil.

The twelfth chapter is about challenges and startups. It lists some interesting problems that I have worked on during my career and offers advice to those going from data scientist to startup founder.

The thirteenth chapter - building a career - is all about creating a career in data science, based purely on personal experience and observations.

Illustrations for the book

Each chapter is preceded by an illustration drawn by an artist at my request. The meaning of each one reflects the content of the chapter. The main character is my cat, who slept near me while I was writing the book.

Additional information

The book itself has many references to sources, over 100. With readers in mind, I provided the Kindle and paperback versions with QR codes. Just point your smartphone camera, click, and the link opens. A unique JavaScript redirection mechanism allows broken links to be corrected without reprinting the book itself.

The book was released in Kindle and Paperback formats through Kindle Direct Publishing on Amazon on August 16, 2021.