Ok, so you want to be a Data Scientist. The best prep would be to have interned as a Data Scientist. I hadn’t done that. Here are some things I have learned about interviewing for this job.
As I mentioned last post, you will probably have some of the following:
this is the good old quick-programming-puzzle interview, as in software engineer interviews. Usually whiteboard, but sometimes they let you use a computer, which is nice. You might get this for a phone screen.
“Here’s a database structure (on a whiteboard), how would you write these queries?” A couple places let me do this on a computer, that was nice. You might get this for a phone screen too.
“We want to change our UI from this old UI to this new UI, how would we do it?” and then talk about metrics to measure, how to evaluate success, how to sample users, how many users and how long to run the study (use a power calculation!), what your conclusions would be if you got certain kinds of answers.
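If the power calculation comes up, here is a rough sketch of one way to do it, using the standard normal-approximation formula for comparing two proportions. The function name, the 7% vs 8% click rates, and the default alpha and power values are all assumptions made up for illustration, not something any interviewer gave me.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed in EACH arm of a two-sided A/B test.

    Normal-approximation formula for two proportions; a sketch, not a
    replacement for a proper stats package.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_control * (1 - p_control)
                         + p_treatment * (1 - p_treatment))
    ) ** 2
    return ceil(numerator / (p_control - p_treatment) ** 2)


# Hypothetical example: old UI clicks at 7%, we hope the new UI gets 8%.
n = sample_size_per_group(0.07, 0.08)
print(f"about {n} users per group")
```

The thing to say out loud in the interview: the smaller the lift you want to detect, the more users you need, and it grows fast (halving the lift roughly quadruples the sample).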
“We want to expand to selling cars too, how would we do it?” - and then we talk about high-level what kind of metrics we’d measure, how we’d evaluate success, how we’d trade off risk, etc. This was rarer, but happened a couple times. I think I wasn’t necessarily supposed to know how to do this, so I could wing it a bit. This interview was different from the experimental design one because it was a little higher-level - your 6-12-month vision instead of the single Next Experiment you’re running.
“Here’s a big CSV of our hypothetical users' behavior; what leads to them buying our product?” This might be a homework problem or an in-person interview.
I know I just said “modeling.” A couple “modeling” interviews, though, were something different - more like “here’s how this part of our business works - how would you design the database for it?” Which tables would you have, which fields on each, etc.
Collaborating with other people/teams, stories of projects you’ve done
This is vaguer, talky. I could usually come up with these on the spot, but it doesn’t hurt to have a few in your pocket.
These are usually “off the record” - use them to refuel, and try to absorb stuff about the company or your future coworkers here.
Things that are good to know
How to do quick programming puzzles fluently. HackerRank’s Python, Algorithms, and Data Structures tracks are probably pretty good. You don’t have to get to the “Hard” level - if you can do the “Medium” ones, you’re probably good.
SQL. If you haven’t used SQL, or haven’t used any actually difficult SQL, in a while, take an online tutorial all the way through. PostgreSQL Exercises is the best I think; Mode Analytics’s one is good too. Particularly learn:
how to do a GROUP BY and an aggregate (like “tell me the total sales in each state”)
when to use WHERE vs HAVING (WHERE filters individual rows before the GROUP BY; HAVING filters the groups after the aggregate)
how to do JOINs, including the difference between types of joins
how to do subqueries, and when you would
Some of this is just a feel thing, which is why I say work through a whole SQL course. I’m getting more fluent in SQL, even if sometimes I can’t quite articulate, for example, when you would use a subquery.
how to work with dates is a nice bonus
window functions would be good. Here’s one example. This is kinda in the “bonus points” - when a question came up where it’d be appropriate, I always would say “uhh I guess I’d use window functions but I don’t know how to,” and I still got jobs.
oh, one more tip: when I’m trying to do complicated things with joins or subqueries, I’d often draw out what the end table is that I’m SELECTing from. So like, if I’m joining A to B, just write down what the “A JOIN B” table looks like, even though of course it’s not actually done like that.
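Here’s a small sketch of the SQL pieces above that you can run yourself, using Python’s built-in sqlite3 module so there’s nothing to install. The customers/orders tables and their contents are made up for illustration. The window function part assumes your SQLite is version 3.25 or newer, so it’s guarded with a version check.

```python
import sqlite3

# Hypothetical toy schema, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES
        (1, 'Ann', 'CA'), (2, 'Bob', 'CA'), (3, 'Cy', 'NY'), (4, 'Di', 'WA');
    INSERT INTO orders VALUES
        (1, 1, 50), (2, 1, 30), (3, 2, 20), (4, 3, 10), (5, 3, -5);
""")

# GROUP BY + aggregate. WHERE drops rows before grouping (the -5 refund);
# HAVING drops whole groups after the SUM is computed (NY, total 10).
big_states = conn.execute("""
    SELECT c.state, SUM(o.amount) AS total_sales
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.amount > 0
    GROUP BY c.state
    HAVING SUM(o.amount) > 50
    ORDER BY c.state
""").fetchall()

# LEFT JOIN keeps customers with no orders (Di); an inner JOIN would drop her.
orders_per_customer = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

# Subquery: compare each row against a value computed over the whole table.
above_average = conn.execute("""
    SELECT id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()

# Window function: rank each customer's orders without collapsing the rows.
ranked = []
if sqlite3.sqlite_version_info >= (3, 25, 0):
    ranked = conn.execute("""
        SELECT customer_id, amount,
               RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk
        FROM orders
        ORDER BY customer_id, rnk
    """).fetchall()

print(big_states)           # [('CA', 100.0)]
print(orders_per_customer)  # [('Ann', 2), ('Bob', 1), ('Cy', 2), ('Di', 0)]
print(above_average)        # [(1,), (2,)]
print(ranked)
```

The window function is the one that differs most from GROUP BY: you still get one output row per order, just with an extra column computed over that customer’s orders.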
The formula to calculate a binomial confidence interval: p +/- z*sqrt(p(1-p)/n). I don’t know many stats formulas, but I had remembered this one, and it came in handy so many times. (Useful in A/B tests - if you test it on 1000 people, and 7% of them click, what’s your 95% CI for the real click-through rate? 0.07 +/- 1.96 * sqrt(0.07*0.93/1000) = 0.07 +/- 0.016, so roughly 5.4% to 8.6%.)
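In case it helps to see that arithmetic as code, here’s a minimal sketch of the same normal-approximation interval using only the standard library. The function name and arguments are mine, not from any particular package, and this approximation gets shaky when n is small or p is very close to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist


def binomial_ci(successes, n, confidence=0.95):
    """Normal-approximation CI for a proportion: p +/- z*sqrt(p(1-p)/n)."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# The example from the text: 1000 people, 7% of them click.
low, high = binomial_ci(70, 1000)
print(f"{low:.3f} to {high:.3f}")  # 0.054 to 0.086
```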
How you pick which model to use - tradeoffs of logistic regression, decision trees/random forests, SVMs, neural networks, etc. Which ones are good/bad if your classes are unbalanced, or your data’s very sparse, or whatever. And how to pick stuff around this - like how do you pick training/test set, how do you normalize your data, etc.
How Ridge and Lasso regression work, and more generally what regularization is. I missed this a lot :-P
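Since I kept missing this one, here’s a toy sketch of what the two penalties do, in the simplest case I could think of: one feature, no intercept. In that case both have closed-form answers. Real libraries solve the many-feature version with different algorithms, so treat this as an illustration of the idea and not how any package implements it. The function names, the data, and the lambda values are made up.

```python
from math import copysign


def _sums(x, y):
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy, sxx


def ols_slope(x, y):
    # Minimizes sum((y - b*x)^2): no penalty at all.
    sxy, sxx = _sums(x, y)
    return sxy / sxx


def ridge_slope(x, y, lam):
    # Minimizes sum((y - b*x)^2) + lam * b^2.
    # The squared penalty shrinks b toward zero but never exactly to zero.
    sxy, sxx = _sums(x, y)
    return sxy / (sxx + lam)


def lasso_slope(x, y, lam):
    # Minimizes 0.5 * sum((y - b*x)^2) + lam * |b|.
    # The absolute-value penalty "soft-thresholds": a big enough lam
    # sets b to exactly zero, which is why Lasso does feature selection.
    sxy, sxx = _sums(x, y)
    shrunk = max(abs(sxy) - lam, 0.0)
    return copysign(shrunk, sxy) / sxx


# Made-up data with a true slope of roughly 2.
x = [-2, -1, 0, 1, 2]
y = [-4.1, -1.9, 0.2, 2.1, 3.9]

print("OLS:  ", round(ols_slope(x, y), 3))
print("Ridge:", round(ridge_slope(x, y, lam=10), 3))
print("Lasso:", round(lasso_slope(x, y, lam=5), 3))
print("Lasso, heavy penalty:", round(lasso_slope(x, y, lam=25), 3))
```

The short version for an interview: both add a penalty on coefficient size to fight overfitting; Ridge shrinks everything a bit, Lasso can zero coefficients out entirely.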
How to quickly load in a data set and make a simple classification/regression model and/or charts, in Python/Pandas or R. Then use whichever of those you feel more comfortable with in the interview.
A story of a project where you used machine learning.
A story of how you communicated some finding to some other people who weren’t as data-nerdy as you.
A story of a project where you had to change your plans, maybe. Other kinds of soft-skills stories are nice.
It might be that you don’t think you have the practical skills yet. If that is the case, you might do a boot camp - a couple-month program. Insight Data Science is probably the best boot camp, because it’s aimed at exactly you. A lot of my soon-to-be-coworkers did this, coming from a diverse set of PhD backgrounds.
- If you can, interview first with companies you’re less excited about. I learned a lot about this process through doing it - my first few interviews ended at the phone screen or homework stage, and as I did more of them, I ended up getting farther through the process.
- If you can be local, that probably helps. If you know you want to move to SF, say, then plan a couple week trip out here and tell them you’ll be in town on these certain days. That way they don’t have to worry about flying you out. Most good companies probably don’t care, but I dunno, maybe they do.
- I had one company ask me for one or two references. Like, your advisor would be fine, or someone you interned with. This was after they gave me a verbal offer, so it probably wouldn’t make or break it, unless you’re secretly a serial killer (or, realistically, completely unsuited for the job).
- No suits. This is nice.
- I’m sure there are more things I’m forgetting. Ask me some questions.