Speed Up Your MySQL Queries Using MD5 and Indexing
So I created a poorly constructed 'SELECT' query in a stored procedure and I was up late last night trying to figure out how I could best optimize it. After sitting in my 94 degree basement for a few hours, I thought of using hashed values via CONCAT and MD5 functions to create an index of multiple columns to improve 'SELECT' queries using 'JOINS.' The performance increase was instant after I made the necessary adjustments. To save others the time and headache, I'm going to share with you my solution and the resulting performance improvements.
So, I am dealing with two tables called "meetup_events" and "location" containing 1,000+ and 290,000+ records, respectively. Both tables contain information for:
- location name
- location address
- location city
- location state
- location zip
- location latitude
- location longitude
- location phone
meetup_events table
| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| id | int(11) | NO | PRI | NULL | |
| name | varchar(100) | NO | NULL | ||
| description | text | NO | NULL | ||
| time | timestamp | NO | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP | |
| group_id | varchar(100) | NO | NULL | ||
| group_name | varchar(100) | NO | NULL | ||
| group_photo_url | varchar(200) | NO | NULL | ||
| fee | varchar(100) | NO | NULL | ||
| how_to_find | text | NO | NULL | ||
| organizer_name | varchar(100) | NO | NULL | ||
| nid | int(11) | NO | NULL | ||
| status | int(11) | NO | 0 | ||
| venue_name | varchar(100) | NO | NULL | ||
| venue_address | varchar(100) | NO | NULL | ||
| venue_city | varchar(100) | NO | NULL | ||
| venue_state | char(5) | NO | NULL | ||
| venue_zip | char(5) | NO | NULL | ||
| venue_phone | varchar(15) | NO | NULL | ||
| venue_lon | decimal(10,6) | NO | NULL | ||
| venue_lat | decimal(10,6) | NO | NULL |
location table
| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| lid | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | |||
| street | varchar(255) | NO | |||
| additional | varchar(255) | NO | |||
| city | varchar(255) | NO | |||
| province | varchar(16) | NO | |||
| postal_code | varchar(16) | NO | |||
| country | char(2) | NO | |||
| latitude | decimal(10,6) | NO | 0.000000 | ||
| longitude | decimal(10,6) | NO | 0.000000 | ||
| source | tinyint(4) | NO | 0 | ||
| is_primary | tinyint(4) | NO | 0 |
The data fields that I need to extract include locations from the meetup_events table that are not in the location table. The query that I was using originally is:
SELECT DISTINCT
`me`.`venue_name`,`me`.`venue_address`, `me`.`venue_city`,`me`.`venue_state`,
`me`.`venue_zip`,`me`.`venue_phone`, `me`.`venue_lon`,`me`.`venue_lat`
FROM `meetup_events` `me` LEFT JOIN `location` `l`
ON( `me`.`venue_address` = `l`.`street`
AND `me`.`venue_state` = `l`.`province`
AND `me`.`venue_city` = `l`.`city`)
WHERE `me`.`venue_lat` <> 0 AND isnull(`l`.`name`)
Unfortunately, there is no immediate way of identifying a relationship between the location associated with each meetup_event and a record in the location table that uses an index because the meetup_events table contains data that is imported from a third-party source. This is where I took a wrong turn and JOIN'ed the tables using the address, state, and city fields.
...
ON( `me`.`venue_address` = `l`.`street`
AND `me`.`venue_state` = `l`.`province`
AND `me`.`venue_city` = `l`.`city`)
...
This didn't pose an issue when the number of records in the location table was smaller, but as the table grew it failed to scale and resulted in ERROR: query execution was interrupted. Not fun!
So after thinking about the minimal data that I would need to constuct a unique identifer for a given location, I decided to turn my attention to the location/venue name, latitude, and longitude data. This will take care of potential duplicates such as locations with mutliple venues like malls, multi-storied buildings, etc.
Now that I have figured out that aspect, I must store the data in a format that is optimial for the MySQL to quickly read and index. Without thought, I turned to the CONCAT() function to provide a single field to store the data and return a scalar value to store.
CONCAT(name, latitude, longitude);
To ensure that the data remains relatively short and unique, the MD5() function comes into play which will result in a hashed value of the concatenated value returned by CONCAT(name, latitude, longitude).
MD5(CONCAT(name, latitude, longitude));
Example: SELECT MD5(CONCAT('The Park at 14th', 38.879456, -76.985059));
| MD5(CONCAT('The Park at 14th',38.879456,-76.985059)) |
|---|
|
e3dd1caa1253f2f5def9caadc3e474b7 |
Using this means of generating a unique value, both tables (meetup_events and location) are given a new field called, hashed_index of type varchar(100) with an INDEX.
meetup_events table (with new field)
| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| id | int(11) | NO | PRI | NULL | |
| name | varchar(100) | NO | NULL | ||
| description | text | NO | NULL | ||
| time | timestamp | NO | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP | |
| group_id | varchar(100) | NO | NULL | ||
| group_name | varchar(100) | NO | NULL | ||
| group_photo_url | varchar(200) | NO | NULL | ||
| fee | varchar(100) | NO | NULL | ||
| how_to_find | text | NO | NULL | ||
| organizer_name | varchar(100) | NO | NULL | ||
| nid | int(11) | NO | NULL | ||
| status | int(11) | NO | 0 | ||
| venue_name | varchar(100) | NO | NULL | ||
| venue_address | varchar(100) | NO | NULL | ||
| venue_city | varchar(100) | NO | NULL | ||
| venue_state | char(5) | NO | NULL | ||
| venue_zip | char(5) | NO | NULL | ||
| venue_phone | varchar(15) | NO | NULL | ||
| venue_lon | decimal(10,6) | NO | NULL | ||
| venue_lat | decimal(10,6) | NO | NULL | ||
| hashed_index | varchar(50) | NO | MUL | NULL |
location table (with new field)
| Field | Type | Null | Key | Default | Extra |
|---|---|---|---|---|---|
| lid | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(255) | NO | |||
| street | varchar(255) | NO | |||
| additional | varchar(255) | NO | |||
| city | varchar(255) | NO | |||
| province | varchar(16) | NO | |||
| postal_code | varchar(16) | NO | |||
| country | char(2) | NO | |||
| latitude | decimal(10,6) | NO | 0.000000 | ||
| longitude | decimal(10,6) | NO | 0.000000 | ||
| source | tinyint(4) | NO | 0 | ||
| is_primary | tinyint(4) | NO | 0 | ||
| hashed_index | varchar(50) | NO | MUL | NULL |
Here is the query rewritten and optimized:
SELECT DISTINCT
`me`.`venue_name`,`me`.`venue_address`,
`me`.`venue_city`,`me`.`venue_state`,
`me`.`venue_zip`,`me`.`venue_phone`,
`me`.`venue_lon`,`me`.`venue_lat`
FROM `meetup_events` `me` LEFT JOIN `location` `l`
USING( hashed_index)
WHERE `me`.`venue_lat` <> 0 AND isnull(`l`.`name`);
If you have any multiple column indexes within any of your tables, try indexing a hashed version of those fields concatenated, and then test to see if this results in a performance increase. Good luck!






