Warning: I wrote this blog in 2011. That is a long time ago, especially on the internet. My opinions may have changed since then. Technological progress may have made this information completely obsolete. Proceed with caution.
Here is a tip for writing better SQL queries, specifically the FROM and WHERE clauses.[1] I only work with SQL Server and MySQL, so the syntax may differ in other engines.[2] I’m writing this because I often see very smart and experienced developers write some downright ugly SQL. I don’t mean to call anyone out; I just want everyone to know there is a better way of joining.
Rule of thumb: If you have a comma in your FROM clause, you’re doing it wrong.
Let’s take a query I just made up as an example:

    SELECT AVG(Grades.score),
           Departments.Name,
           Students.Gender
    FROM Students, Grades, Departments, Teachers, Courses, Classrooms, Buildings
    WHERE Students.id = Grades.Student_id
      AND Courses.Period = 3
      AND Courses.id = Grades.Course_id
      AND Buildings.Building_id = 11
      AND Classrooms.id = Courses.Classroom_id
      AND Buildings.id = Classrooms.Building_id
      AND Grades.Grade > 0
      AND Teachers.id = Courses.Teacher_id
      AND Departments.id = Teachers.Department_id
    GROUP BY Departments.Name, Students.Gender
So what’s so bad about this? First of all, it’s tricky to see what exactly we are trying to do here. The WHERE clause is huge, and it’s full of stuff that isn’t really relevant to what we are trying to return. If you were told that this query was returning the wrong values, you would have to stare at it for quite a while to figure out what’s going on here. You know what tables the data is coming from, but you don’t have any simple way to tell how those tables relate to each other. You have to reverse-engineer that information from the monster WHERE clause.
And here is where developers are most likely to mess up: if you forget just one of those AND predicates in the WHERE clause, you end up with a Cartesian product. Every row of one table gets paired with every row of the other, so if one table has n rows and another has m rows, you will get n×m rows. This is pretty much never what you actually want. If the developer is working on a small database with very little data,[3] he might not even realize this is happening. He checks in his code and goes on to the next problem. But then the system goes into production, the tables get a few hundred rows each, and now the query runs for twenty minutes and then the database server crashes! Ouch.
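To make the n×m blow-up concrete, here is a minimal sketch using Python’s built-in sqlite3 module. SQLite stands in for SQL Server or MySQL purely so the example runs anywhere, and the two-table schema and its rows are made up for illustration; only the table and column names are borrowed from the query above.

```python
import sqlite3

# Hypothetical two-table schema, loosely modeled on the example query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (id INTEGER PRIMARY KEY, Gender TEXT);
    CREATE TABLE Grades (id INTEGER PRIMARY KEY, Student_id INTEGER, score REAL);
""")
conn.executemany(
    "INSERT INTO Students (id, Gender) VALUES (?, ?)",
    [(1, "F"), (2, "M"), (3, "F")],
)
conn.executemany(
    "INSERT INTO Grades (Student_id, score) VALUES (?, ?)",
    [(1, 90.0), (1, 85.0), (2, 70.0), (3, 60.0)],
)

# Comma join with the linking predicate in place: one row per grade.
with_predicate = conn.execute(
    "SELECT COUNT(*) FROM Students, Grades "
    "WHERE Students.id = Grades.Student_id"
).fetchone()[0]

# Same query with the predicate forgotten: every student is paired
# with every grade, giving 3 x 4 rows.
without_predicate = conn.execute(
    "SELECT COUNT(*) FROM Students, Grades"
).fetchone()[0]

print(with_predicate)     # 4
print(without_predicate)  # 12
```

With three students and four grades the difference is harmless. With seven tables of a few hundred rows each, the same mistake multiplies out very quickly.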
So, what’s the solution? Like I said, if you see a comma in your FROM clause, you’re doing it wrong. JOIN to the rescue!
    SELECT AVG(Grades.score),
           Departments.Name,
           Students.Gender
    FROM Grades
    JOIN Students ON Students.id = Grades.Student_id
    JOIN Courses ON Courses.id = Grades.Course_id
    JOIN Teachers ON Teachers.id = Courses.Teacher_id
    JOIN Classrooms ON Classrooms.id = Courses.Classroom_id
    JOIN Departments ON Departments.id = Teachers.Department_id
    JOIN Buildings ON Buildings.id = Classrooms.Building_id
    WHERE Buildings.Building_id = 11
      AND Courses.Period = 3
      AND Grades.Grade > 0
    GROUP BY Departments.Name, Students.Gender
In this version of the query, we JOIN each of the tables together in the FROM clause. This has several advantages, but the biggest by far is that it forces the developer to specify how the tables are joined together. This makes it extremely difficult to accidentally create a Cartesian product. It also cleans up the WHERE clause, stripping it down to the conditions you are actually filtering on. In this case, we are just looking at third-period courses in one particular building, counting only grades above zero (that is, students who are at least attending the class).

Whenever I make a change to an existing query that doesn’t use joins, I usually rewrite it with joins. I’ve done this for dozens of queries, and the WHERE clause is reduced to a single condition probably 80-90 percent of the time. This makes the query much easier for the next developer to understand at a glance, especially since the join conditions aren’t likely to change over time.

As far as performance goes, don’t expect much of a change. For inner joins, the database was already figuring out how the tables relate on its own, and optimizers generally treat the comma form and the JOIN form as equivalent. But the JOIN version certainly doesn’t perform any worse.
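If you are nervous about rewriting an existing query, it is easy to check that both styles return the same rows. The sketch below does that with Python’s built-in sqlite3 module on a cut-down, made-up version of the schema (three tables instead of seven, with invented data). SQLite is only a stand-in here; the same comparison should work against SQL Server or MySQL with their own drivers.

```python
import sqlite3

# Hypothetical, cut-down schema: three of the seven tables, invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (id INTEGER PRIMARY KEY, Gender TEXT);
    CREATE TABLE Courses (id INTEGER PRIMARY KEY, Period INTEGER);
    CREATE TABLE Grades (Student_id INTEGER, Course_id INTEGER, score REAL);
    INSERT INTO Students VALUES (1, 'F'), (2, 'M'), (3, 'F');
    INSERT INTO Courses VALUES (10, 3), (20, 5);
    INSERT INTO Grades VALUES (1, 10, 90), (2, 10, 70), (3, 10, 80), (1, 20, 50);
""")

# Old style: tables listed with commas, join conditions mixed into WHERE.
comma_sql = """
    SELECT Students.Gender, AVG(Grades.score)
    FROM Students, Grades, Courses
    WHERE Students.id = Grades.Student_id
      AND Courses.id = Grades.Course_id
      AND Courses.Period = 3
    GROUP BY Students.Gender
    ORDER BY Students.Gender
"""

# New style: join conditions in ON, only the real filter left in WHERE.
join_sql = """
    SELECT Students.Gender, AVG(Grades.score)
    FROM Grades
    JOIN Students ON Students.id = Grades.Student_id
    JOIN Courses ON Courses.id = Grades.Course_id
    WHERE Courses.Period = 3
    GROUP BY Students.Gender
    ORDER BY Students.Gender
"""

comma_rows = conn.execute(comma_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()

print(comma_rows == join_rows)  # True
print(join_rows)                # [('F', 85.0), ('M', 70.0)]
```

Notice that the WHERE clause in the JOIN version is down to the single condition that actually describes what we are looking for.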
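Update 2011-04-18: I forgot to mention one other benefit of this approach. There is a type of error that comes up every once in a while and looks something like this: “This query doesn’t return online courses because Courses.Classroom_id is 0/NULL for online courses.” This is easy to solve with joins. Just change the “JOIN Classrooms” clause to “LEFT JOIN Classrooms”. Keep in mind that anything hanging off Classrooms has to follow suit: in the example query, the Buildings join would also need to become a LEFT JOIN, and the building filter would need to move out of the WHERE clause (or allow NULL), otherwise the online courses get filtered right back out. I’ve seen some ugly code that tries to accomplish the same thing with UNION. In my experience UNION performs worse than a left join here, and left join is specifically designed for this purpose.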
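Here is a small sketch of the online-course problem, again using Python’s built-in sqlite3 module as a stand-in for SQL Server or MySQL. The Name column and the rows are made up for illustration; the point is only the difference between JOIN and LEFT JOIN when Classroom_id is NULL.

```python
import sqlite3

# Hypothetical data: two classroom courses and one online course
# whose Classroom_id is NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Classrooms (id INTEGER PRIMARY KEY, Building_id INTEGER);
    CREATE TABLE Courses (id INTEGER PRIMARY KEY, Name TEXT, Classroom_id INTEGER);
    INSERT INTO Classrooms VALUES (100, 11), (101, 11);
    INSERT INTO Courses VALUES
        (1, 'Algebra', 100),
        (2, 'Biology', 101),
        (3, 'Online Writing', NULL);
""")

# Plain (inner) JOIN: the online course has no matching classroom row,
# so it silently disappears from the result.
inner_rows = conn.execute("""
    SELECT Courses.Name, Classrooms.id
    FROM Courses
    JOIN Classrooms ON Classrooms.id = Courses.Classroom_id
    ORDER BY Courses.Name
""").fetchall()

# LEFT JOIN: every course is kept; the classroom columns come back
# as NULL (None in Python) where there is no match.
left_rows = conn.execute("""
    SELECT Courses.Name, Classrooms.id
    FROM Courses
    LEFT JOIN Classrooms ON Classrooms.id = Courses.Classroom_id
    ORDER BY Courses.Name
""").fetchall()

print(inner_rows)  # [('Algebra', 100), ('Biology', 101)]
print(left_rows)   # [('Algebra', 100), ('Biology', 101), ('Online Writing', None)]
```

A one-word change gets the missing rows back, with no UNION and no second copy of the query to keep in sync.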