Videos

Julien Simon's Podcast

The interview I gave as part of Julien Simon's podcast

Video Transcript (Slightly edited for clarity)

Julien: Hi everybody! This is Julien from AWS and welcome to episode 6 of my podcast. Don't forget to subscribe to be notified of future episodes.

In this podcast I'm talking to my friend Cosmin from Denmark. Cosmin is a data scientist, a blogger, and he also runs the Apache MXNet meetup in Copenhagen. We talk about getting started with ML, running your machine learning projects, best practices and a whole bunch of different things, so I'm sure you will enjoy that conversation and you will learn a few things. Let's not wait and let's just listen to Cosmin.

Cosmin, thank you very much for taking the time to speak to me today. I guess we need to start with an introduction, so tell us a little bit about yourself, what you're doing today, and how you got started with machine learning.

Cosmin: I started working with data in general about 10 years ago. I was very interested in doing reporting and fiddling with database management, so then I moved into doing more and more data engineering. At the same time I was lucky enough to work for companies that spearheaded machine learning efforts in Denmark, so I naturally became interested in the data science domain.

Julien: OK, so can you tell us a little bit about your company and the kind of projects you work on day to day?

Cosmin: Just to give an overview before I dive into details, AudienceProject helps brands, agencies and publishers plan, optimize and validate digital campaigns. At the same time we're also helping our customers grow audience segments that are of high value. To achieve that, we use data science and machine learning at several levels in our organization, from the actual business projects to the operations. An example is our Audience Hub solution, which helps our customers, such as publishers, to grow audience segments from their deterministic data. We use extrapolation that is driven by machine learning models to grow these audience segments. Another example is our True Frequency Graph, which we use to understand how many times, on average, a person has been exposed to an online campaign. For this, we don't necessarily use machine learning, but we use graph algorithms. And one last example that you might relate to is where we have used machine learning to understand which availability zones from Amazon are best to bid in for EC2 spot instances, so that we can have stability over time and also a low price. These are examples of how we have used machine learning and data science at different levels in our organization to deliver value.

Julien: OK, that's pretty cool. So tell us about the typical project, how do you get started? You know, people tend to focus on algorithms and the technicalities.

Cosmin: Of course it depends very much on the problem at hand. The way I usually approach a project is that I try to use my previous knowledge or experience, and then I do research online and I essentially trust the community and the crowd wisdom. I try to find example projects that are similar to what I'm trying to do and I try to fit them into a solution for my problem. The point of all this is also to become familiar with the problem so that I'm confident enough to discuss it. Then I would probably move towards doing some exploratory data analysis: I would understand the data, do cleaning, and also approach my colleagues and ask for opinions and validation. I'm fortunate enough to be surrounded by smart people, so that's really helpful. There's always good feedback, and then I would go towards implementation. I try to productionize or to have a working prototype as soon as possible. That allows me to have a framework for doing multiple iterations towards better results.

Julien: Yeah, I think that's a very important point because again, one of my beliefs is that machine learning is software engineering, and you need tooling, and you need agile techniques, and you need iterations and, you know, sometimes I meet people who tell me: "Well, you know, I've just spent six months researching the thing, and then, you know, I'll tell you in six months if I can build a model or not". So if you're doing pure research, that's OK. But you're working for a private company.

Cosmin: Exactly. I have business constraints. I need to be pragmatic. We need to live in between the constraints of doing something that is very good and doing it in a fixed amount of time and within a certain budget. So we need to be pragmatic. I also wanted to add one more thing that I believe is important. When we start building a solution in my company, we build and architect towards change. So it's important to assume that things will change, especially for an ML project. The model might change, the data might change, assumptions that you had at first might not be realistic in a production scenario, so assume change and engineer towards that.

Julien: So, agility, and validating assumptions all the time. Any advice you would give to a young ML engineer to get started? What should they focus on in the early steps of their projects and careers?

Cosmin: One area of focus should be on seeing the pros and cons of different approaches. You might say that neural networks are very powerful, and you can solve some problems with them, but how about explainability? That's a trade-off, and it might work in some cases, and it might not work in other cases. You need to understand your context very well, and not act like you have a hammer and are looking for nails, like when you know a library very well and then you try to use it for everything. This goes hand-in-hand with the pros and cons advice. If something doesn't work, maybe it's time to look at something else and not try to force your problem into a certain box. Create a project for yourself. That's what I like doing, and then experiment with tools. I create artificial datasets, or I derive datasets from existing ones, and then I try to solve a problem. This is what I do, for example, on my blog. I create datasets, and then I artificially create a problem and I try to solve it. So this is one way to gain experience, and I think experience is the most important trait to have. Experience gives you intuition, and intuition is extremely important in data science. It allows you to choose one model over the other. Ideally, you're able to scientifically explore all reasonable paths, but in practice it might not be feasible to do that. So at that point you have to use your experience to narrow down where you need to look, to a limited set of actually possible solutions to your problem. The experience of the team is also important. The same project might be handled in a different way by a different team with different skill sets, and they might arrive at the same product with the same quality. In my company, we have a certain experience, and we work with the tools that we find most comfortable to work with. Another team might find a different combination of tools to be more appropriate for the same task.
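
The practice Cosmin describes here, inventing a dataset and a problem to solve on it, can be sketched in a few lines of Python. This is only an illustration and not code from the interview or from Cosmin's blog: the function names `make_dataset` and `rule_accuracy`, the labeling rule, and the noise level are all assumptions chosen for the example.

```python
import random


def make_dataset(n_rows, noise=0.1, seed=0):
    """Generate a toy binary-classification dataset with a known rule.

    The label is 1 when x1 + x2 > 1. A fraction `noise` of the labels is
    flipped so that the problem is not trivially separable.
    (Hypothetical helper, invented for this illustration.)
    """
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    rows = []
    for _ in range(n_rows):
        x1, x2 = rng.random(), rng.random()
        label = int(x1 + x2 > 1.0)
        if rng.random() < noise:
            label = 1 - label  # inject label noise
        rows.append((x1, x2, label))
    return rows


def rule_accuracy(rows):
    """Score the known generating rule against the (noisy) labels."""
    hits = sum(int(x1 + x2 > 1.0) == label for x1, x2, label in rows)
    return hits / len(rows)


if __name__ == "__main__":
    data = make_dataset(1000)
    print(f"rows: {len(data)}, rule accuracy: {rule_accuracy(data):.2f}")
```

Because the generating rule is known, the accuracy of that rule gives a rough ceiling against which any model trained on the same data can be compared.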

Julien: So I guess the moral is: "Use what you know and use the best tool for the job at any given point".

Cosmin: I think there's a combination between using the right tool for the job and using what you know. If you only use what you know, you might be missing out on possibilities. For example, in the past XGBoost was very popular, and it is one of my go-to tools, but if I try to use XGBoost for everything, I will not get appropriate results. There are certain types of problems for which I would apply XGBoost, and perhaps some other solutions would be a good fit as well. If the problem fits and my experience fits I will use XGBoost, otherwise I would possibly have to look for a different library.

Julien: Yeah, so curiosity is important, right? Trying out the new algos. I agree. So you mentioned you were using SageMaker, can you tell us a little bit about that? What you like about it and what you think the really strong areas are in SageMaker.

Cosmin: So I think the most important thing for us is that it gives us resources that are already provisioned with the libraries that we need, and we don't need to maintain that. We're using MXNet, I'm very happy to say that, I'm a big fan of Apache MXNet, and that comes directly provisioned in SageMaker notebooks. That makes our life very easy. We start the notebook and then in a few minutes we're ready to dive into the fun stuff with data "sciencing". Another thing that is great, and I don't think I've heard it mentioned many times, but I think it's important, is that SageMaker gracefully encourages best practices. It's like a framework for doing data science and machine learning that doesn't force you into a certain way, but it certainly encourages best practices. So you can easily provision machines dedicated to training and to validation. You can provision endpoints, so you have some separation in that sense. The documentation is also very well targeted towards best practices. The last point would be the efficiency. In the past I would maybe launch a GPU cluster in EMR just to have MXNet running on it, and that would work just fine, but there would be a lot of wasted dollars. In SageMaker I can do my data engineering on a cheaper machine that works just fine, and then I can launch my training on a very powerful GPU machine for just a few minutes. Every dollar spent would be spent towards actual training. There are some recent developments for SageMaker, like the Experiments library, so I was very interested in that. I was so interested in storing trials in a consistent manner that, before I knew SageMaker Experiments would appear, I made a pull request adding Apache MXNet's Gluon to the MLflow project, which originated at Databricks. Now we have SageMaker Experiments, so I'm going to see which one fits us best.

Julien: Well, you know, just try out both and let us know if we're missing anything. So, Cosmin, we're almost done. Any last words?

Cosmin: Just thank you for inviting me. It's always a pleasure to talk to you, Julien.

Julien: Well, thank you very much, thanks for your time, and thanks for sharing the knowledge. I'm sure this will be much appreciated by the listeners.

That's it for this episode. I hope you enjoyed it. Don't forget to subscribe to my channel and I'll see you soon with more conversations and more content. Until then, keep rocking!