How much should one know in SQL for data science

Do data scientists need to be experts at SQL? Not quite. I’d say that an intermediate level of SQL knowledge is good enough and I’ll explain what this “intermediate” level is in more detail below.

For our typical “small” datasets, ie, data that fits into your computer’s memory, I’ve found dplyr to be much more efficient and practical for data manipulation. If you’re working on Python, pandas does a pretty good job at data manipulation there as well. So for all kinds of filters, joins or aggregations, SQL is usually the 3rd best alternative.

import pandas as better_sql
— Chris Albon (@chrisalbon) October 22, 2020

When we talk about “big data” applications, Apache Spark has really made things pretty easy now. You can run queries on distributed systems in a more expressive manner using sparklyr or pyspark depending on your programming language. You can even code natively in Scala for Spark related applications for better flexibility and performance.

So the only area where SQL really seems to be irreplaceable is data extraction (and ofcourse, data definition), ie, working directly with the databases. And most data science interview questions at Tech firms usually involve atleast 1 SQL based problem in their initial coding round of interview to setup a minimum baseline. Yes, you can always have better tools to write your data manipulation queries but you should still be able to work with good old SQL to pull data from a real world database.

In addition to the introductory SQL clauses (WHERE, GROUP BY, ORDER BY, HAVING, etc.), I’ve seen that if you understand JOINs, UNIONs and analytic functions in SQL, you should be good enough to solve most problems thrown at you during interviews or in your day-to-day job. Out of these topics, JOINs and analytic functions seem to be the most popular interview topics. Unlike aggregate functions like SUM or MAX which return a single value, analytic functions operate OVER sets of rows & can return sets of rows using PARTITION BY, ORDER BY or window frame clauses.

The best resource I’ve found online that covers these topics well is this course on Kaggle - Advanced SQL. If you feel a bit rusty with SQL syntax, you can try their introductory SQL coure as well. They’re fairly short, hands on and most of all, free!

Again, this is based on my personal experience so far. If you feel I’m missing some important topics in the above list or have a different opinion, let me know in the comments below. I’ll be talking about another common topic featuring in data science interviews at big tech firms, data structures and algorithms, soon.

2022 2
2021 7
2020 9

2022

AI, consciousness & Vedanta

2 minute read

Book recommendations from my Covid reading list

2 minute read

5 books to offer a different perspctive on life

Pritesh Shrivastava

How much should one know in SQL for data science

2022

AI, consciousness & Vedanta

Book recommendations from my Covid reading list

2021

Is Data Structures & Algorithms important for a data scientist

Structural and Generative Recursion

You are never going to be an expert in data science

Tips and Tricks that helped me find a new job

A lookback at what I was studying as a beginner data scientist

How matrices transform space

How much should one know in SQL for data science

2020

Matrix algebra operations using recursion

Analyzing gender bias in movie dialogues

Analyzing my Goodreads data

Laziness vs The Trap of Productivity & Ambition

Tree vs Tail Recursion & Memoization

Fun with Haskell

How I ended up with a data science job

Favourite learning resources for ML

My journey with SICP so far