Skip to Navigation

Reply to comment

Printer-friendly version

Speed Up Your MySQL Queries Using MD5 and Indexing

So I created a poorly constructed 'SELECT' query in a stored procedure and I was up late last night trying to figure out how I could best optimize it. After sitting in my 94 degree basement for a few hours, I thought of using hashed values via CONCAT and MD5 functions to create an index of multiple columns to improve 'SELECT' queries using 'JOINS.' The performance increase was instant after I made the necessary adjustments. To save others the time and headache, I'm going to share with you my solution and the resulting performance improvements.

So, I am dealing with two tables called "meetup_events" and "location" containing 1,000+ and 290,000+ records, respectively. Both tables contain information for:

  • location name
  • location address
  • location city
  • location state
  • location zip
  • location latitude
  • location longitude
  • location phone

meetup_events table

Field Type Null Key Default Extra
id int(11) NO PRI NULL  
name varchar(100) NO   NULL  
description text NO   NULL  
time timestamp NO   CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP
group_id varchar(100) NO   NULL  
group_name varchar(100) NO   NULL  
group_photo_url varchar(200) NO   NULL  
fee varchar(100) NO   NULL  
how_to_find text NO   NULL  
organizer_name varchar(100) NO   NULL  
nid int(11) NO   NULL  
status int(11) NO   0  
venue_name varchar(100) NO   NULL  
venue_address varchar(100) NO   NULL  
venue_city varchar(100) NO   NULL  
venue_state char(5) NO   NULL  
venue_zip char(5) NO   NULL  
venue_phone varchar(15) NO   NULL  
venue_lon decimal(10,6) NO   NULL  
venue_lat decimal(10,6) NO   NULL  

 

location table

Field Type Null Key Default Extra
lid int(10) unsigned NO PRI NULL auto_increment
name varchar(255) NO      
street varchar(255) NO      
additional varchar(255) NO      
city varchar(255) NO      
province varchar(16) NO      
postal_code varchar(16) NO      
country char(2) NO      
latitude decimal(10,6) NO   0.000000  
longitude decimal(10,6) NO   0.000000  
source tinyint(4) NO   0  
is_primary tinyint(4) NO   0  

 

The data fields that I need to extract include locations from the meetup_events table that are not in the location table. The query that I was using originally is:

SELECT DISTINCT
`me`.`venue_name`,`me`.`venue_address`, `me`.`venue_city`,`me`.`venue_state`,
`me`.`venue_zip`,`me`.`venue_phone`, `me`.`venue_lon`,`me`.`venue_lat`
FROM `meetup_events` `me` LEFT JOIN `location` `l`
ON( `me`.`venue_address` = `l`.`street`
    AND `me`.`venue_state` = `l`.`province`
    AND `me`.`venue_city` = `l`.`city`)
WHERE `me`.`venue_lat` <> 0 AND isnull(`l`.`name`)

Unfortunately, there is no immediate way of identifying a relationship between the location associated with each meetup_event and a record in the location table that uses an index because the meetup_events table contains data that is imported from a third-party source. This is where I took a wrong turn and JOIN'ed the tables using the address, state, and city fields.

...
ON( `me`.`venue_address` = `l`.`street`
    AND `me`.`venue_state` = `l`.`province`
    AND `me`.`venue_city` = `l`.`city`)
...

This didn't pose an issue when the number of records in the location table was smaller, but as the table grew it failed to scale and resulted in ERROR: query execution was interrupted. Not fun!


(click to see full image)

So after thinking about the minimal data that I would need to constuct a unique identifer for a given location, I decided to turn my attention to the location/venue name, latitude, and longitude data. This will take care of potential duplicates such as locations with mutliple venues like malls, multi-storied buildings, etc.

Now that I have figured out that aspect, I must store the data in a format that is optimial for the MySQL to quickly read and index. Without thought, I turned to the CONCAT() function to provide a single field to store the data and return a scalar value to store. 

CONCAT(name, latitude, longitude);

To ensure that the data remains relatively short and unique, the MD5() function comes into play which will result in a hashed value of the concatenated value returned by CONCAT(name, latitude, longitude).

MD5(CONCAT(name, latitude, longitude));

Example: SELECT MD5(CONCAT('The Park at 14th', 38.879456, -76.985059));

MD5(CONCAT('The Park at 14th',38.879456,-76.985059))

e3dd1caa1253f2f5def9caadc3e474b7

 

Using this means of generating a unique value, both tables (meetup_events and location) are given a new field called, hashed_index of type varchar(100) with an INDEX.

meetup_events table (with new field)

Field Type Null Key Default Extra
id int(11) NO PRI NULL  
name varchar(100) NO   NULL  
description text NO   NULL  
time timestamp NO   CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP
group_id varchar(100) NO   NULL  
group_name varchar(100) NO   NULL  
group_photo_url varchar(200) NO   NULL  
fee varchar(100) NO   NULL  
how_to_find text NO   NULL  
organizer_name varchar(100) NO   NULL  
nid int(11) NO   NULL  
status int(11) NO   0  
venue_name varchar(100) NO   NULL  
venue_address varchar(100) NO   NULL  
venue_city varchar(100) NO   NULL  
venue_state char(5) NO   NULL  
venue_zip char(5) NO   NULL  
venue_phone varchar(15) NO   NULL  
venue_lon decimal(10,6) NO   NULL  
venue_lat decimal(10,6) NO   NULL  
hashed_index varchar(50) NO MUL NULL  

 

location table (with new field)

Field Type Null Key Default Extra
lid int(10) unsigned NO PRI NULL auto_increment
name varchar(255) NO      
street varchar(255) NO      
additional varchar(255) NO      
city varchar(255) NO      
province varchar(16) NO      
postal_code varchar(16) NO      
country char(2) NO      
latitude decimal(10,6) NO   0.000000  
longitude decimal(10,6) NO   0.000000  
source tinyint(4) NO   0  
is_primary tinyint(4) NO   0  
hashed_index varchar(50) NO MUL NULL  

 

Here is the query rewritten and optimized:

SELECT DISTINCT
`me`.`venue_name`,`me`.`venue_address`,
`me`.`venue_city`,`me`.`venue_state`,
`me`.`venue_zip`,`me`.`venue_phone`,
`me`.`venue_lon`,`me`.`venue_lat`
FROM `meetup_events` `me` LEFT JOIN `location` `l`
USING( hashed_index)
WHERE `me`.`venue_lat` <> 0 AND isnull(`l`.`name`);


(click to see full image)

If you have any multiple column indexes within any of your tables, try indexing a hashed version of those fields concatenated, and then test to see if this results in a performance increase. Good luck!

Reply

The content of this field is kept private and will not be shown publicly.
  • Allowed HTML tags: <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.

More information about formatting options

Mollom CAPTCHA (play audio CAPTCHA)
Type the characters you see in the picture above; if you can't read them, submit the form and a new image will be generated. Not case sensitive.