MySQL query puzzle - finding the most recent earlier date

I've looked through everything and still haven't found a sane way to handle this, although I'm sure it's possible:

One historical data table has quarterly information:

CREATE TABLE Quarterly (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
PRIMARY KEY (unique_ID));

      

Another historical data table (which is very large) contains daily information:

CREATE TABLE Daily (
unique_ID INT UNSIGNED NOT NULL,
date_posted DATE NOT NULL,
datasource TINYINT UNSIGNED NOT NULL,
data FLOAT NOT NULL,
qtr_ID INT UNSIGNED,
PRIMARY KEY (unique_ID));

      

The qtr_ID field is not part of the daily data stream that populates the database. Instead, I need to retroactively populate the qtr_ID field in the Daily table with the row ID Quarterly.unique_ID of the most recent quarterly record posted before Daily.date_posted for the same data source.

For example, if the quarterly data is

101 2009-03-31 1 4.5
102 2009-06-30 1 4.4
103 2009-03-31 2 7.6
104 2009-06-30 2 7.7
 105 2009-09-30 1 4.7

and the daily data is

1001 2009-07-14 1 3.5 ??
1002 2009-07-15 1 3.4 &&
1003 2009-07-14 2 2.3 ^^

then we would like the ?? qtr_ID field to be assigned "102", the last quarter for that data source before that date; && will also be "102", and ^^ will be "104".
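
For anyone who wants to reproduce the example, the sample rows above could be loaded roughly like this. It is only a sketch: it assumes the two CREATE TABLE statements above have already been run, and it leaves qtr_ID as NULL where the example shows the ??, && and ^^ placeholders.

```sql
-- Quarterly sample rows: unique_ID, date_posted, datasource, data
INSERT INTO Quarterly (unique_ID, date_posted, datasource, data) VALUES
  (101, '2009-03-31', 1, 4.5),
  (102, '2009-06-30', 1, 4.4),
  (103, '2009-03-31', 2, 7.6),
  (104, '2009-06-30', 2, 7.7),
  (105, '2009-09-30', 1, 4.7);

-- Daily sample rows; qtr_ID starts out NULL and is what we want to fill in
INSERT INTO Daily (unique_ID, date_posted, datasource, data, qtr_ID) VALUES
  (1001, '2009-07-14', 1, 3.5, NULL),
  (1002, '2009-07-15', 1, 3.4, NULL),
  (1003, '2009-07-14', 2, 2.3, NULL);
```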

Problems include the fact that both tables (in particular the daily table) are actually very large, they cannot be normalized to get rid of duplicate dates or otherwise optimized, and there is no prior quarterly record for certain daily records.

I've tried various joins using DATEDIFF (where the problem is finding the minimum DATEDIFF value greater than zero) and other approaches, but nothing works for me, usually because my syntax breaks somewhere. Any ideas are appreciated. I will follow up on any ideas or concepts and report back.

+2




3 answers


Just use a subquery for the quarter ID, something like:

(
 SELECT unique_ID 
 FROM Quarterly 
 WHERE 
     datasource = ?          -- the daily row's datasource
     AND date_posted < ?     -- the daily row's date_posted
 ORDER BY
     unique_ID DESC          -- latest matching quarter first
 LIMIT 1
)

      



Of course, this probably won't give you the best performance, and it assumes the dates are added to Quarterly sequentially (otherwise order by date_posted). However, it should solve your problem.

You would use this subquery in your INSERT or UPDATE statements as the value of the qtr_ID field for your Daily table.
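
As a sketch of how that might look in an UPDATE, the subquery can be correlated with the Daily row being updated instead of using ? placeholders. This version orders by date_posted rather than unique_ID, so it does not rely on rows having been inserted in date order. It uses a strict < to pick the last quarter before the daily date, which is an assumption based on the question's example; use <= if a quarter posted on the same day should count.

```sql
UPDATE Daily
SET qtr_ID = (
  SELECT Quarterly.unique_ID
  FROM Quarterly
  WHERE Quarterly.datasource = Daily.datasource       -- same data source
    AND Quarterly.date_posted < Daily.date_posted     -- strictly earlier quarter
  ORDER BY Quarterly.date_posted DESC                 -- most recent first
  LIMIT 1
);
```

Daily rows with no earlier quarterly record for their data source end up with qtr_ID set to NULL, since the subquery returns no row for them.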

+1




This appears to work exactly as intended, but it is definitely ugly (three calls to the same DATEDIFF!). Perhaps, having seen a working query, someone can shorten or improve it:



UPDATE Daily
SET qtr_ID = (
  SELECT unique_ID
  FROM Quarterly
  WHERE Quarterly.datasource = Daily.datasource
    AND DATEDIFF(Daily.date_posted, Quarterly.date_posted) = (
      -- smallest positive gap = most recent earlier quarter
      SELECT MIN(DATEDIFF(Daily.date_posted, Quarterly.date_posted))
      FROM Quarterly
      WHERE Quarterly.datasource = Daily.datasource
        AND DATEDIFF(Daily.date_posted, Quarterly.date_posted) > 0));

      

0




After further work on this query, I ended up with huge performance improvements over the original concept. The most important improvement was creating indexes on both the Daily and Quarterly tables: in Daily I created indexes on (datasource, date_posted) and (date_posted, datasource) USING BTREE and on (datasource) USING HASH, and I did the same in Quarterly. This is overkill, but it ensures the query engine has an index it can use. This reduced the query time to less than 1% of what it was. (!!)
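
The composite indexes described above might be created roughly as follows. The index names are made up for illustration, and only the (datasource, date_posted) pair is shown. Note that InnoDB and MyISAM accept USING HASH but build a B-tree index anyway, so a separate hash index on datasource probably adds little on those engines.

```sql
-- Hypothetical index names; column order matches an equality test on
-- datasource followed by a range test on date_posted
CREATE INDEX idx_daily_source_date ON Daily (datasource, date_posted) USING BTREE;
CREATE INDEX idx_quarterly_source_date ON Quarterly (datasource, date_posted) USING BTREE;

-- Check which index a single lookup would actually use
EXPLAIN
SELECT MAX(unique_ID)
FROM Quarterly
WHERE datasource = 1
  AND date_posted < '2009-07-14';
```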

Then I found out that, given my specific circumstances, I could use MAX() instead of ORDER BY and LIMIT, so I use a MAX() call to get the corresponding unique_ID. This reduced the query time by about 20%.

Finally, I found out that with the InnoDB storage engine I could restrict each query to a segment of the Daily table, which allowed me to run the queries in multiple threads with a little elbow grease and scripting. Parallel processing worked well, and each additional thread reduced the total time linearly.

So the basic query, which is literally 1000 times better than my first attempt, is:

UPDATE Daily
SET qtr_ID =
(
  SELECT MAX(unique_ID)
  FROM Quarterly
  WHERE Daily.datasource = Quarterly.datasource AND
        Daily.date_posted > Quarterly.date_posted
)
WHERE unique_ID > ScriptVarLowerBound AND
      unique_ID <= ScriptVarHigherBound
;
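
To illustrate the chunking, a script might first look up the ID range and then issue one UPDATE per slice, each from its own connection. This is only a sketch: the bounds below (0, 500000, 1000000) are invented for the example and stand in for the values the script would generate.

```sql
-- Find the overall range of IDs to split into chunks
SELECT MIN(unique_ID), MAX(unique_ID) FROM Daily;

-- Connection 1: first slice (example bounds)
UPDATE Daily
SET qtr_ID = (
  SELECT MAX(Quarterly.unique_ID)
  FROM Quarterly
  WHERE Quarterly.datasource = Daily.datasource
    AND Quarterly.date_posted < Daily.date_posted
)
WHERE unique_ID > 0 AND unique_ID <= 500000;

-- Connection 2: second slice, run concurrently with the first
UPDATE Daily
SET qtr_ID = (
  SELECT MAX(Quarterly.unique_ID)
  FROM Quarterly
  WHERE Quarterly.datasource = Daily.datasource
    AND Quarterly.date_posted < Daily.date_posted
)
WHERE unique_ID > 500000 AND unique_ID <= 1000000;

-- Afterwards: count daily rows that had no earlier quarterly record
SELECT COUNT(*) FROM Daily WHERE qtr_ID IS NULL;
```

Because the slices cover disjoint ranges of the primary key, the concurrent updates should not touch the same Daily rows.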

      

0

