Loading Bulk CSV Tables

March 23, 2010

In my job I use many data tables that are transient.  New weather data is received all the time and old data is purged.  Most of these tables arrive as CSV files.  The data is then loaded into MySQL tables and indexed to be used with geographic queries.

Most of these tables never see an insert or update.  It would be nice if you could make these CSV tables read-only and build byte-pointer indexes for each row.  (Maybe some day I'll code this into MySQL.)

Most people load large data tables at night with the LOCK & LOAD method: LOCK TABLES…; LOAD DATA INFILE…; UNLOCK TABLES.  In other words, nobody can read data or generate reports while this is running.
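For reference, that approach looks roughly like this (the file path and field format here are just placeholders):

LOCK TABLES forecast WRITE;
LOAD DATA INFILE '/home/dbaops/NEWforecast.CSV'
  INTO TABLE forecast
  FIELDS TERMINATED BY ',';
UNLOCK TABLES;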

With the script I developed I have been able to load 33,000,000 records from a CSV file into a MySQL table, with indexes, in 22m 36.282s, without creating long lock times affecting the users.

Here is what I'm doing.  This is a proof-of-concept script written in Bash.

In the ‘test’ database there are two tables.

forecast = MyISAM table with index
NEWforecast = CSV table
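The exact columns don't matter for this write-up; roughly, the two definitions look like this (the weather columns are placeholders, and note that the CSV engine requires every column to be NOT NULL and allows no indexes):

CREATE TABLE forecast (
  station_id INT NOT NULL,
  date_time  DATETIME NOT NULL,
  latitude   DECIMAL(8,5) NOT NULL,
  longitude  DECIMAL(8,5) NOT NULL,
  temp_f     DECIMAL(5,2) NOT NULL,
  KEY idx_lat_lon (latitude, longitude)
) ENGINE=MyISAM;

CREATE TABLE NEWforecast (
  station_id INT NOT NULL,
  date_time  DATETIME NOT NULL,
  latitude   DECIMAL(8,5) NOT NULL,
  longitude  DECIMAL(8,5) NOT NULL,
  temp_f     DECIMAL(5,2) NOT NULL
) ENGINE=CSV;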

#!/bin/bash
echo "Truncate forecast file"
mysql test -Bse "truncate table forecast;"
count=`mysql test -Bse "select TABLE_ROWS from information_schema.tables where table_name = 'forecast';"`
echo "Count is now $count"

echo "Check Slave is truncated"
count=`mysql -h slave_ip -u dbaops -pP@ssw0rd -Bse "select TABLE_ROWS from information_schema.tables where table_name = 'forecast';"`
echo "Count is now $count"

# The size of the split files determines how long the MyISAM table will be locked.
echo "splitting NEWforecast.CSV file into 100,000-record chunks"
split -l 100000 NEWforecast.CSV data_
# Time the for loop
time for x in data_*
do
# copy new data to MySQL CSV file
    echo "cp /home/dbaops/$x /data/mysql/test/NEWforecast.CSV"
    cp /home/dbaops/$x /data/mysql/test/NEWforecast.CSV
# copy same data to the SLAVE server (assumes key-based SSH auth for the dbaops user)
    scp /home/dbaops/$x dbaops@slave_ip:/data/mysql/test/NEWforecast.CSV
# Flush tables to load new data
    mysql test -Bse "flush tables;"
    mysql -h slave_ip -u dbaops -pP@ssw0rd -Bse "flush tables;"
# Insert from CSV to MyISAM with index - this statement gets replicated.
# Concurrent inserts (concurrent_insert) let readers keep querying while it runs.
    mysql test -Bse "insert ignore into forecast select * from NEWforecast;"
    count=`mysql test -Bse "select TABLE_ROWS from information_schema.tables where table_name = 'forecast';"`
    echo "Count for this load is $count"
done

rm data_*

sleep 5
count=`mysql -h slave_ip -u dbaops -pP@ssw0rd test -Bse "select TABLE_ROWS from information_schema.tables where table_name = 'forecast';"`

echo "Count on SLAVE is now $count"

Concurrent inserts are what let the MyISAM table keep serving reads while the INSERT ... SELECT runs; if needed, use "SET GLOBAL concurrent_insert=2".  (The CONCURRENT keyword itself only applies to LOAD DATA INFILE.)
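To check the current setting and raise it:

SHOW GLOBAL VARIABLES LIKE 'concurrent_insert';
SET GLOBAL concurrent_insert = 2;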

Deleting the old records can be tricky too.  In the above example I just empty the table using the TRUNCATE TABLE command.  Having no data for the application to query may return strange results to the user.

Bulk deletes can also lock the table for a long time.  A stored procedure can be used to loop through the data and remove the old records in batches until there are none left.

DROP PROCEDURE IF EXISTS delete_incrementally;
DELIMITER //
CREATE PROCEDURE delete_incrementally(IN tbl VARCHAR(64), IN days INTEGER)
MODIFIES SQL DATA
BEGIN
  DECLARE rows_deleted INT DEFAULT 1;
  SET @sql_text = CONCAT('DELETE FROM ', tbl, ' WHERE date_time < NOW() - INTERVAL ', days, ' DAY LIMIT 100');
  PREPARE stmt FROM @sql_text;
  REPEAT
    EXECUTE stmt;
    -- Capture the count before anything else resets ROW_COUNT().
    SET rows_deleted = ROW_COUNT();
    -- Pause briefly so other sessions can get at the table between batches.
    DO SLEEP(0.1);
  UNTIL rows_deleted <= 0 END REPEAT;
  DEALLOCATE PREPARE stmt;
END //
DELIMITER ;

You can then call this procedure with:

CALL delete_incrementally('TheTableName', 10);

For InnoDB:

It is a big advantage if the data in the CSV files is already ordered by primary key, because the InnoDB primary key is a clustered index, so the table will be organized physically in primary key order anyway.

For the bulk insert, you should consider turning off foreign key checking and unique index checking (and turning them back on when the load is done):

SET unique_checks = 0;
SET foreign_key_checks = 0;

With the InnoDB plugin, you can speed things up by inserting the data into a table without secondary indexes (define only the primary key, of course), and then creating the indexes separately with ALTER TABLE.  (On an existing table you can also consider dropping the existing indexes first; whether that pays off depends on the case.)
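A rough sketch of that pattern, reusing the placeholder columns from above, might look like this:

CREATE TABLE forecast_innodb (
  station_id INT NOT NULL,
  date_time  DATETIME NOT NULL,
  latitude   DECIMAL(8,5) NOT NULL,
  longitude  DECIMAL(8,5) NOT NULL,
  temp_f     DECIMAL(5,2) NOT NULL,
  PRIMARY KEY (station_id, date_time)
) ENGINE=InnoDB;

SET unique_checks = 0;
SET foreign_key_checks = 0;

LOAD DATA INFILE '/home/dbaops/NEWforecast.CSV'
  INTO TABLE forecast_innodb
  FIELDS TERMINATED BY ',';

SET unique_checks = 1;
SET foreign_key_checks = 1;

-- With the InnoDB plugin this uses fast index creation instead of rebuilding the table.
ALTER TABLE forecast_innodb ADD INDEX idx_lat_lon (latitude, longitude);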

CSV Files

http://www.shinguz.ch/MySQL/CSV_tables.pdf

http://blogs.sun.com/carriergrademysql/entry/tips_for_bulk_loading

http://www.mysqlperformanceblog.com/2008/07/03/how-to-load-large-files-safely-into-innodb-with-load-data-infile/

InnoDB

http://www.innodb.com/doc/innodb_plugin-1.0/innodb-create-index.html

http://www.mysqlperformanceblog.com/2008/04/23/testing-innodb-barracuda-format-with-compression/
