A faster way to use sets in MySQL
I have a MySQL 5.1 InnoDB table (customers) with the following structure:
int record_id (PRIMARY KEY)
int user_id (ALLOW NULL)
varchar(11) postcode (ALLOW NULL)
varchar(30) region (ALLOW NULL)
..
..
..
The table contains about 7 million rows. The table is currently being queried like this:
SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...
In the current request there are more than 560 user_ids in the IN clause. With several million records in the table, this query is slow!
The table has secondary indexes; the first is on user_id, which I thought would help.
I know SELECT * is a Bad Thing, and this will be expanded to the full list of required fields. However, the fields not listed above are mostly ints and doubles. There are another 50 or so of them being returned, but they are needed for the report.
I imagine there is a much better way to fetch the data for these user_ids, but I can't think what it is. My initial reaction is to drop ALLOW NULL on the user_id field, as I understand NULL handling slows down queries? I would be very grateful to be pointed in a more efficient direction than the IN ( ) method.
EDIT: Ran EXPLAIN, which said:
select_type = SIMPLE
table = customers
type = range
possible_keys = userid_idx
key = userid_idx
key_len = 5
ref = (NULL)
rows = 637640
Extra = Using where
Does that help?
First, check that there is an index on user_id and make sure it is being used. You can verify this with EXPLAIN.
Second, create a temporary table of the ids and use it in a JOIN:
CREATE TABLE temptable (user_id INT NOT NULL);

-- fill temptable with the 560 ids, e.g. INSERT INTO temptable VALUES (32343), (45676), ...;

SELECT c.*
FROM temptable t
JOIN customers c
    ON c.user_id = t.user_id;
Third, how many rows does your query return? If it returns almost all the rows, it is simply going to be slow, since all those millions of rows have to be pushed over the wire first.
NULL won't slow down your query: an IN condition only matches actual values and never matches NULL.
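A quick way to see this (the id values here are placeholders, not the asker's real list):

```sql
-- NULL never compares equal to anything, so IN () cannot match it:
SELECT NULL IN (1, 2, 3);   -- yields NULL, not 1 (TRUE)

-- rows where user_id IS NULL are simply skipped by the index range scan:
SELECT * FROM customers WHERE user_id IN (32343, 45676);
```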
Update:
The index is used and the plan is fine, except that it returns over half a million rows.
Do you really need to put all 638,000 of these rows into the report? Hope it's not printed: bad for the rainforests, global warming, etc.
Seriously, it sounds like you need either aggregation or pagination in your query.
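Keyset (seek-based) pagination is one sketch of what that could look like; the id list, page size, and @last_record_id variable are all placeholders:

```sql
-- fetch the report one page at a time, keyed on the primary key;
-- @last_record_id is the highest record_id seen on the previous page (0 to start)
SELECT *
FROM customers
WHERE user_id IN (32343, 45676, 12345)   -- placeholder ids
  AND record_id > @last_record_id
ORDER BY record_id
LIMIT 1000;
```

Seeking on the primary key avoids the cost of large OFFSET values, which make MySQL scan and discard all the skipped rows on every page.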
"SELECT *" isn't as bad as some people think; row-based databases fetch the whole row when they fetch it at all, so in situations where you are not using a covering index, "SELECT *" is essentially no slower than "SELECT a, b, c". (NB: there are occasional exceptions, e.g. when you have large BLOBs, but that is an edge case.)
First of all: does your database fit in RAM? If not, get more RAM. No, seriously. Now, supposing your database is too big to reasonably fit in RAM (say, > 32 GB), you should try to reduce the number of random I/Os, as they are probably what is holding you back.
I will assume here that you are using suitable server-level hardware with a RAID controller in RAID1 (or RAID10, etc.) and at least two spindles. If not, go and get it.
You should definitely consider using a clustered index. In MySQL InnoDB, the primary key is the clustered index, which means that if something else is currently the primary key, you will have to change it. Composite primary keys are fine, and if you are doing a lot of queries on the same criterion (user_id in your case), making it the first part of the primary key is a definite benefit (you will need to add something else to keep it unique).
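A minimal sketch of that change, assuming user_id can first be made NOT NULL (a primary-key column cannot allow NULL) and that the table can be rebuilt offline (rewriting 7 million rows will take a while):

```sql
-- make user_id the leading column of InnoDB's clustered (primary) key;
-- record_id stays in the key to keep it unique
ALTER TABLE customers
    MODIFY user_id INT NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id),
    ADD UNIQUE KEY (record_id);   -- preserve record_id's own uniqueness
```

After this, all rows for a given user_id are physically adjacent, so the 560-id IN () query turns into a handful of sequential range reads instead of hundreds of thousands of random key lookups.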
Alternatively, you can make your query use a covering index, in which case user_id does not need to be the primary key (and indeed shouldn't be). This only works if all of the columns you need are in an index that starts with user_id.
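For illustration, a covering index for a report that only needed the two columns shown in the question might look like this; with 50-odd int/double columns in the real report, the index would have to carry all of them, which is rarely practical:

```sql
-- the index covers the query: MySQL can answer it from the index alone
CREATE INDEX idx_user_covering ON customers (user_id, postcode, region);

SELECT user_id, postcode, region
FROM customers
WHERE user_id IN (32343, 45676);   -- EXPLAIN should show "Using index"
```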
When it comes to query efficiency, WHERE user_id IN (...) is almost certainly the most efficient way to express this in SQL.
BUT my biggest tips are:
- Have a goal in mind, figure out what it is, and when you reach it, stop.
- Don't take anyone's word for it - try it and see.
- Make sure your performance-testing system has the same hardware spec as production.
- Make sure your performance-testing system has the same data size and types as production (the same schema is not good enough!).
- Use synthetic data if production data cannot be used (copying production data can be logistically difficult (remember your database is > 32 GB) and may also violate security policies).
- If your query is already optimal (as it probably is), try tweaking the schema, then the database itself.
Is this your most important query? Is this a transactional table?
If so, try creating a clustered index on user_id. Your query might be slow because it still has to do random disk reads to fetch the other columns (key lookups), even after it has found the matching records (an index seek on the user_id index).
If you cannot change the clustered index, then you might want to consider an ETL process (the simplest being a trigger that inserts into another table with better indexing). This should give faster results.
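A sketch of that trigger approach; the customers_report table name and the abbreviated column list are hypothetical, and UPDATE/DELETE triggers plus a one-off backfill would also be needed to keep the copy in sync:

```sql
-- reporting copy clustered on user_id for fast range reads
CREATE TABLE customers_report LIKE customers;
ALTER TABLE customers_report
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id);

DELIMITER //
CREATE TRIGGER customers_ai AFTER INSERT ON customers
FOR EACH ROW
BEGIN
    -- column list abbreviated; the real trigger would copy all report columns
    INSERT INTO customers_report (record_id, user_id, postcode, region)
    VALUES (NEW.record_id, NEW.user_id, NEW.postcode, NEW.region);
END//
DELIMITER ;
```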
Also note that queries this large can take a while to parse, so help the server out by putting the requested ids into a temporary table if possible.