I’ve had a devil of a time trying to get Google to tell me how to write a Mysql query that allows us to somehow perform a MySql query that 1) filters rows on a distinct column 2) returns other columns in the query besides the distinct column and 3) allows us to order by a column. In our case, we (and you, if you’re running Savage Beast!) have a list of most recent forum posts on the site — currently, if you list all recent posts, the search will just find all posts and order by date of creation, but this makes for some dumb-looking output since you often end up with a list where 10 of the 20 posts are all from the same forum topic. All the user really wants to know is what topics have a new post in them, and to get a brief glimpse as to what that new post might be.
Thus, we want to create a query that returns the new posts, ordered by date of creation, that have a distinct topic_id.
Here’s the SQL that can make it happen:
Post.find_by_sql(”select posts.* from posts LEFT JOIN posts t2 on posts.topic_id=t2.topic_id AND posts.created_at < t2.created_at WHERE t2.topic_id IS NULL ORDER BY posts.created_at DESC”)
Hope that Google sees fit to lead other people here instead of struggling to get GROUP BY to order results beforehand (GROUP BY ‘posts.topic_id’ works, but it returns the first post in each distinct topic, rather than the last post as we desire), or get SELECT DISTINCT to return more than one column, as many forum posters unhelpfully suggested in all the results I was getting.
Update 11/26/08 - A Word of Caution
I finally got around to setting up some profiling for our site yesterday and was surprised to discover that the above query was taking longer per execution than almost anything else on our entire site. The SQL Explain was not too helpful to explain why, but it showed three joins, the join on the topics table involving every row of the table (which is presently almost 10,000).
Takeaway: for this query to work, it seems to consider every distinct topic in the table, rather than being smart and stopping when it hits the per-page paginated limit. Since I already determined that “group” and “distinct” were non-starters for being able to pick the newest post in each topic, I ended up revising the way the logic was done to an easier to manage and far-more-DB-efficient way:
We now track in each topic the newest post_id within that topic. While this adds a bit of overhead to keeping the topic updated when new posts are made to it, it allows us to do a far simpler query where we just select the more recent topics, joining to the most recent post in each topic, and then ordered by the age of those posts.
If you have the ability to create an analogous situation to solve your problem, your database will thank you for it. The above query starts getting extremely slow with more than a few thousand rows. Yet, I defy you to find an alternative to it that works at all using “group” or “distinct.”
Bill Uncategorized