A faster way to use sets in MySQL

I have a MySQL 5.1 InnoDB table (customers) with the following structure:

int         record_id (PRIMARY KEY)
int         user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..


The table contains about 7 million rows and is currently being queried like this:

SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...


In the current request there are more than 560 user_ids in the IN clause. With several million records in the table, this query is slow!

The table has secondary indexes, the first of which is on user_id, which I thought would help.

I know SELECT * is a Bad Thing, and this will be expanded to the full list of fields required. However, the fields not listed above are largely ints and doubles. There are another 50 or so of those coming back, but they are all needed for the report.

I guess there is a much better way to fetch the data for these user_ids, but I can't think of one. My initial reaction is to remove ALLOW NULL from the user_id field, as I understand that NULL handling slows queries down?
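
Concretely, the change I have in mind would be something like this (a sketch; it assumes no existing rows actually store NULL in user_id, since those would have to be cleaned up first):

ALTER TABLE customers MODIFY user_id INT NOT NULL;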

I would be very grateful if you could point me in a more efficient direction than the IN () method.

EDIT: Ran EXPLAIN, which said:

select_type = SIMPLE 
table = customers 
type = range 
possible_keys = userid_idx 
key = userid_idx 
key_len = 5 
ref = (NULL) 
rows = 637640 
Extra = Using where 


Does it help?

+1




5 answers


First, check that the index on user_id exists and make sure it is actually being used. You can do this with EXPLAIN.
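
For example (reusing the userid_idx index name from the EXPLAIN output in the question):

SHOW INDEX FROM customers;  -- confirm userid_idx exists and covers user_id
EXPLAIN SELECT * FROM customers WHERE user_id IN (32343, 45676);  -- 'key' should show userid_idx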

Second, create a temporary table holding the ids, put an index on it, and use it in a JOIN:

CREATE TEMPORARY TABLE temptable (
    user_id INT NOT NULL,
    PRIMARY KEY (user_id)
);

INSERT INTO temptable (user_id) VALUES (32343), (45676), (12345);  -- ... and the rest of the 560 ids

SELECT  *
FROM    temptable t
JOIN    customers c
ON      c.user_id = t.user_id;

Third, how many rows can your query return?

If it returns almost all of the rows, it is going to be slow no matter what, since first you need to pump all those millions of rows over the communication channel.



NULL won't slow down your query, since the IN condition only ever matches non-NULL indexed values.
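
A quick illustration of that behaviour (a standalone example, not from the original answer):

SELECT 32343 IN (32343, 45676);  -- 1: a match
SELECT NULL IN (32343, 45676);   -- NULL: never TRUE, so NULL rows are simply filtered out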

Update:

The index is used and the plan is good, except that it returns over half a million rows.

Do you really need to put all 638,000 rows into the report?

Hope it doesn't get printed: bad for the rainforests, global warming, etc.

Seriously, it sounds like you want either aggregation or pagination on your query.
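
A sketch of both options (the exact columns, page size, and abbreviated id lists are placeholders):

-- Pagination: fetch the report in pages instead of 638,000 rows at once
SELECT *
FROM customers
WHERE user_id IN (32343, 45676, 12345)  -- abbreviated id list
ORDER BY record_id
LIMIT 1000 OFFSET 0;  -- next page: OFFSET 1000, and so on

-- Aggregation: if the report only needs summaries, compute them server-side
SELECT user_id, COUNT(*) AS record_count
FROM customers
WHERE user_id IN (32343, 45676, 12345)  -- abbreviated id list
GROUP BY user_id;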

+3




"Select *" isn't as bad as some people think; row-based databases will fetch the entire row if they fetch it, so in situations where you are not using a coverage index, "SELECT *" is essentially no slower than "SELECT a, b, c" (NB: There is sometimes except when you have large BLOBS, but this is an edge case).

First of all: does your database fit in RAM? If it doesn't, get more RAM. No, seriously. Now, supposing your database is too big to reasonably fit into RAM (say, > 32 GB), you should try to reduce the number of random I/Os, as they are probably what is holding things up.
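
A quick way to compare the two, if it helps (standard MySQL commands, not from the original answer):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- how much RAM InnoDB is allowed to use
SHOW TABLE STATUS LIKE 'customers';             -- Data_length + Index_length ~ size on disk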

I will assume here that you have proper server-class hardware with a RAID controller, running RAID 1 (or RAID 10, etc.) and at least two spindles. If not, go and get that.

You should definitely consider using a clustered index. In MySQL InnoDB the primary key is the clustered index, which means that if something else is currently the primary key, you will have to change it. Composite primary keys are fine, and if you are doing lots of queries on the same criterion (such as user_id), it is a definite benefit to make it the first part of the primary key (you will need to add something else to it to make it unique).
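
A sketch of what that change might look like here (primary-key columns must be NOT NULL, and rebuilding a 7-million-row table will take a while):

ALTER TABLE customers MODIFY user_id INT NOT NULL;  -- PK columns cannot be nullable

ALTER TABLE customers
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id);  -- record_id keeps the key unique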



Alternatively, you can make your query use a covering index, in which case you don't need user_id to be the primary key (in fact it must not be). This only works if all the columns you want are in an index that starts with user_id.
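
For instance (the column list is illustrative; with the 50-odd int and double report columns, a truly covering index is probably impractical):

ALTER TABLE customers ADD INDEX ix_user_covering (user_id, postcode, region);

-- A query touching only indexed columns is answered from the index alone:
SELECT user_id, postcode, region
FROM customers
WHERE user_id IN (32343, 45676, 12345);  -- abbreviated id list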

When it comes to the efficiency of the query itself, WHERE user_id IN (...) is almost certainly the most efficient way to do it from SQL.

BUT my biggest tips are:

  • Have a goal in mind, figure out what it is, and when you reach it, stop.
  • Don't take anyone's word for any of this: try it and see.
  • Make sure your performance-testing system has the same hardware spec as production.
  • Make sure your performance-testing system has the same data size and data types as production (the same schema is not good enough!).
  • Use synthetic data if production data cannot be used (copying production data can be logistically difficult (remember, your database is > 32 GB) and may also violate security policies).
  • If your query is already optimal (as it probably is), try tuning the schema, and then the database itself.
+2




Are they the same ~560 IDs every time, or different sets of ~560 IDs on different runs of the query?

You could just insert your 560 user IDs into a separate table (or even a temporary table), put an index on that table, and inner join it to the original table.

+1




Is this your most important query? Is this a transactional table?

If so, try creating a clustered index on user_id. Your query might be slow because it still has to do random disk reads to fetch the columns (key lookups), even after finding the records that match (an index seek on the user_id index).

If you cannot change the clustered index, you might want to consider an ETL process (the simplest being a trigger that inserts into another table with better indexing). This should give faster results; a sketch of the idea follows.
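
A minimal sketch of that trigger idea, assuming a hypothetical reporting copy called customers_report clustered on user_id, and only the columns named in the question (it also assumes user_id is never NULL, since it is part of the primary key here):

CREATE TABLE customers_report (
    user_id   INT NOT NULL,
    record_id INT NOT NULL,
    postcode  VARCHAR(11),
    region    VARCHAR(30),
    PRIMARY KEY (user_id, record_id)  -- InnoDB clusters on the primary key
) ENGINE=InnoDB;

CREATE TRIGGER customers_report_ai
AFTER INSERT ON customers
FOR EACH ROW
    INSERT INTO customers_report (user_id, record_id, postcode, region)
    VALUES (NEW.user_id, NEW.record_id, NEW.postcode, NEW.region);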

Also note that such large queries may take some time to parse, so help things along by putting the queried IDs into a temp table if possible.

+1




You could try inserting the IDs you need to query into a temp table and inner joining the two tables. I don't know whether it will help.

0



