Roman's data science

12 Tips: From Data Analyst to Startup Co-Founder

Early in my career, working as a data analyst, I, like many people, dreamed of doing something significant and valuable for people. I wanted to be creative. I wanted to feel the results of my work, not just study the data. I worked at several startup companies for ten years before co-founding an e-commerce recommendation services company in 2012. In 2020, during the COVID pandemic, I went on sabbatical for a year — wrote a book, released it on Amazon, and left the successful company (positive cash flow, 150+ employees in Russia, Europe, and South America) entirely in the summer of 2021. Now I am developing my next project from scratch.

I decided to share some tips that came in handy when I first launched my startup. This post complements chapter 12 (“Challenges and Startups”) of my book [3].

Create a product idea for one client. First, there was the idea, then the product. If you have an idea, the best way to test it is to implement it in your current job for the company's benefit. For example, my project was born many years before my partners and I started our company. I was developing recommendation systems as a side project while I was still employed. So I knew right away what to do and how to do it, especially at the start. I could create a minimum viable product (MVP) with my eyes closed.

Find a way to scale the product cheaply to thousands of customers (understanding engineering). It is essential to be able to scale a product with minimal effort. To do this, you need to learn how to put models into production. In addition to Python and SQL, it is a good idea to know a compiled programming language (Java, Scala, C#, C++). Understanding how to use different types of databases and Hadoop is helpful. I found it very helpful to visit the Netflix office, where I received some advice about open source software and Hadoop. It took me about six months to learn and implement Hadoop. I also heard about Spark in a video from an O'Reilly Strata conference. We were among the first to implement Spark in production.

Do less long-term research, more business. In a startup, you have to act very quickly and still get guaranteed results. Suppose you have a choice between two ML algorithms: the first is difficult to develop and theoretically gives better results; the second is simple, but its metrics are average. I will always choose the simpler one as step number one. Yes, you can put the first algorithm in the backlog of hypotheses, but you’ll probably change it in a few months.
I spent two years on one short-term recommendation algorithm [4]. We did many complicated things. Ultimately, the simplest version of the algorithm won all A/B tests. Trying to find the perfect algorithm was a waste of time in my case.
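The post does not say how those A/B tests were evaluated. As a minimal sketch, assuming the compared metric is visitor-to-buyer conversion, a two-proportion z-test is one common way to check whether a "complicated" variant really beats the simple one. The function name and the visitor counts below are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    conv_*: number of buyers, n_*: number of visitors in each variant.
    Returns (z, two_sided_p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B convert equally".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("conversion rate is 0 or 1 in both groups; test undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: simple algorithm (A) vs. complicated algorithm (B).
z, p_value = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

If the complicated variant does not clear such a test, the simpler one stays in production and the complicated one goes back to the backlog.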

The theory is not always the same as practice. I’ve spent the last ten years developing recommendation algorithms for e-commerce. Scientific articles on the subject use standard metrics (Precision, Recall, Novelty, Diversity [2]). It would seem that you can just take a scientific paper and implement it. This approach works well in computer vision, for example, but not in my field. A purchase is a deferred event. The buyer can make it in a few hours or even days. So there is no direct correlation between accuracy metrics [1] and visitor conversion to buyers.
Another point: the recommendation algorithm changes user behavior. Offline testing on users’ past actions in logs does not take this into account. This factor also contributes to the error when offline testing is used to predict online performance in A/B tests.
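For reference, here is a minimal sketch of how two of the standard offline metrics mentioned above, precision and recall at the top-k recommendations, are typically computed from logged purchases. The function name and the sample data are hypothetical; the point is that these numbers are computed on past behavior only, which is exactly why they can diverge from online conversion.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Offline precision@k and recall@k for a single user.

    recommended: ranked list of item ids produced by the algorithm.
    relevant: set of item ids the user actually bought (from logs).
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical log data for one user.
recommended = ["a", "b", "c", "d", "e"]
purchased = {"b", "e", "x"}

precision, recall = precision_recall_at_k(recommended, purchased, k=3)
print(f"precision@3 = {precision:.2f}, recall@3 = {recall:.2f}")
```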

Getting the metrics right is key to success. It would seem that you can pick a standard metric for your ML model and everything will work out. Unfortunately, it doesn’t. Have you heard about the problems of classification metrics in healthcare? For example, you have two different COVID PCR tests:
  • test A yields more false positives
  • test B yields more false negatives
Test A will send more healthy people to quarantine, which will cause economic harm. Test B will miss more sick people, and they will spread the disease. The choice of the test is a complicated question; it will depend on the circumstances. You will face the same choice after A/B testing two different algorithms. For example, I’ve encountered situations in recommendation systems where the algorithm improved the commercial metric, but the recommendations themselves visually looked worse. Such an “improvement” is hard to sell to customers.
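The trade-off between the two tests can be made concrete with a small cost calculation. This is only a sketch: the error counts and the per-error costs below are made-up assumptions, and in practice estimating those costs is the hard part.

```python
from dataclasses import dataclass


@dataclass
class ErrorCounts:
    false_positives: int  # healthy people sent to quarantine
    false_negatives: int  # sick people who were missed


def total_cost(errors, cost_fp, cost_fn):
    """Total cost of a test given the assumed cost of each error type."""
    return errors.false_positives * cost_fp + errors.false_negatives * cost_fn


# Hypothetical error counts per 10,000 people tested.
test_a = ErrorCounts(false_positives=50, false_negatives=10)
test_b = ErrorCounts(false_positives=10, false_negatives=50)


def cheaper_test(cost_fp, cost_fn):
    """Return "A" if test A has the strictly lower total cost, otherwise "B"."""
    cost_a = total_cost(test_a, cost_fp, cost_fn)
    cost_b = total_cost(test_b, cost_fp, cost_fn)
    return "A" if cost_a < cost_b else "B"


# When a missed case is much more expensive than an unnecessary quarantine:
print(cheaper_test(cost_fp=1, cost_fn=10))
# When an unnecessary quarantine is the more expensive error:
print(cheaper_test(cost_fp=10, cost_fn=1))
```

The same test is the better or the worse choice depending only on the assumed costs, which is the situation you face when an A/B test improves one metric and hurts another.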

A full-time focus is better than doing a startup in the evenings. I used to program in the evenings and on weekends myself. I even bought a second laptop, which I carried with me to my day job so that I could sometimes program there as well. In the evenings after work, I was tired, so there were a lot of mistakes in the code I was writing. I had to fix them all the next day. I even found some of them several years later. So try to find an opportunity to devote yourself fully to your project. It will save you a lot of effort and time in the future.

The importance of professional experience. With experience, it is easier to succeed without attracting significant investments. Without it, you will have to compensate with money. That is why it is good to work in fast-growing companies, where you can learn a lot. My former boss once gave me this advice. And if you have also worked on the side of your potential client, it will be much easier for you to understand their needs. It's common knowledge that a customer won't tell you everything during a product interview. Many issues are difficult to understand, even for the client. My experience as an analyst in e-commerce has helped me a lot. I created my first startup based on the knowledge gained there.

A B2B startup is more likely to take off than a B2C startup. B2B requires less investment and fewer employees, and the average transaction value is much higher. You need far fewer customers to break even in B2B than in B2C.
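The break-even argument is simple arithmetic. The sketch below uses made-up numbers and, for simplicity, assumes the same fixed costs for both business models; it is an illustration, not data from my company.

```python
import math


def customers_to_break_even(monthly_fixed_costs, monthly_margin_per_customer):
    """Smallest number of paying customers whose margin covers the fixed costs."""
    if monthly_margin_per_customer <= 0:
        raise ValueError("margin per customer must be positive")
    return math.ceil(monthly_fixed_costs / monthly_margin_per_customer)


# Hypothetical figures: the same fixed costs, very different margin per customer.
b2b_customers = customers_to_break_even(50_000, 2_000)
b2c_customers = customers_to_break_even(50_000, 5)

print(f"B2B: {b2b_customers} customers, B2C: {b2c_customers} customers")
```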

No cargo cult. It's easy to copy what others are doing, especially FAANG companies. But your company and internal culture are unique. Simply copying templates of standards for development, ML model deployment, or product will not work. Question every one of them; you're more likely to succeed with your own rules, based on common sense and suited to your company.

Clouds are better for getting started. The cloud is now like a Lego set: you don't have to think about hardware and software issues, and it's relatively easy to scale. I did my first project on rented hardware. I deployed Hadoop there and set up all the computation algorithms. I started the second project in the cloud right away and spent much less time on it.

Clients want to increase their sales rather than get data analytics. An analytical product is much harder to sell than a system that raises a client's sales. And if this effect is also easy to check, for example, in Google Analytics, your product will sell like hotcakes. That's why I did not create an analytics project but went straight to where the data increases sales directly, for example, to recommender systems.

Self-service analytics. Try to instill a culture of self-service analytics in your company. No one likes to be bombarded with a deluge of tasks that can be done on a "calculator." These "monkey tasks" are elementary tasks that employees can do themselves with two or three mouse clicks. You need three conditions to get rid of them:
  • A user-friendly interactive analytics system (OLAP, Tableau, Metabase).
  • Minimal level of data quality.
  • Trained users.

Put effort into these three areas. It will pay off well, even in a startup environment where development happens very quickly.


[1] Marco Rossetti, Fabio Stella, Markus Zanker, RecSys 2016: Paper Session 1 - Contrasting Offline and Online Results when Evaluating Recommendation (2016)
[2] Paolo Cremonesi, Yehuda Koren, Roberto Turrin, Performance of recommender algorithms on top-N recommendation tasks (2010)
[3] Roman Zykov, Roman's Data Science: How to monetize your data (2021)
[4] Maxim Borisyak, Roman Zykov, Artem Noskov, Application of Kullback-Leibler divergence for short-term user interest detection (2015)