And why you don’t need another ‘How to become a data scientist in 2021’ listicle
How much should one know in SQL for data science
Do data scientists need to be experts at SQL? Not quite. I’d say that an intermediate level of SQL knowledge is good enough and I’ll explain what this “intermediate” level is in more detail below.
For our typical “small” datasets, ie, data that fits into your computer’s memory, I’ve found
dplyr to be much more efficient and practical for data manipulation. If you’re working on Python,
pandas does a pretty good job at data manipulation there as well. So for all kinds of filters, joins or aggregations, SQL is usually the 3rd best alternative.
import pandas as better_sql— Chris Albon (@chrisalbon) October 22, 2020
When we talk about “big data” applications,
Apache Spark has really made things pretty easy now. You can run queries on distributed systems in a more expressive manner using
pyspark depending on your programming language. You can even code natively in
Scala for Spark related applications for better flexibility and performance.
So the only area where SQL really seems to be irreplaceable is data extraction (and ofcourse, data definition), ie, working directly with the databases. And most data science interview questions at Tech firms usually involve atleast 1 SQL based problem in their initial coding round of interview to setup a minimum baseline. Yes, you can always have better tools to write your data manipulation queries but you should still be able to work with good old SQL to pull data from a real world database.
In addition to the introductory SQL clauses (WHERE, GROUP BY, ORDER BY, HAVING, etc.), I’ve seen that if you understand JOINs, UNIONs and analytic functions in SQL, you should be good enough to solve most problems thrown at you during interviews or in your day-to-day job. Out of these topics, JOINs and analytic functions seem to be the most popular interview topics. Unlike aggregate functions like SUM or MAX which return a single value, analytic functions operate OVER sets of rows & can return sets of rows using PARTITION BY, ORDER BY or window frame clauses.
The best resource I’ve found online that covers these topics well is this course on Kaggle - Advanced SQL. If you feel a bit rusty with SQL syntax, you can try their introductory SQL coure as well. They’re fairly short, hands on and most of all, free!
Again, this is based on my personal experience so far. If you feel I’m missing some important topics in the above list or have a different opinion, let me know in the comments below. I’ll be talking about another common topic featuring in data science interviews at big tech firms, data structures and algorithms, soon.
An intuitive way to look at matrix vector multiplication, with applications in image processing
Most tech firm interviews include SQL problems for DS roles
Implementing basic matrix algebra operations in Scheme using a Jupyter notebook
Building a gender classifier model based on the dialogues of characters in Hollywood movies
Simple EDA of my reading activity using tidyverse on R Markdown
My experience using productivity tools for personal projects
Comparing Tree Recursion & Tail Recursion in Scheme & Python
My notes halfway through the book Learn You A Haskell
My topsy turvy ride to data science
Books, MOOCs and other resources that I would highly recommend
The magic of SICP