Indexed CSV

July 26, 2012

Why is there no index for CSV files?  Indexes are very simple.

If the first column of your CSV file is in sorted order you can do a binary search to find your data.  But what if you need to find data in the second or third column?

If you have a separate index file pointing to the first byte of each line you could seek to that position in the CSV file and get your data.  Given a file with only your needed column and the byte offset and length of the line you can search the index to find the pointer to the position in the CSV file.

SAMPLE DATA
“record”,”title”,”data”
1,”test1”,”data1”
2,”test3”,”data3”
3,”test2”,”data2”

SAMPLE INDEX
“test1”,24,17
“test2”,58,17
“test3”,41,17
“title”,0,23

Here is a simple perl program to create just such an index.

#!/usr/bin/perl
# csvindex.pl

use strict;
use warnings;

use Text::CSV_XS;

my $file = $ARGV[0];

my $csv = Text::CSV_XS->new ({
binary    => 1,
auto_diag => 1,
sep_char  => ','    # not really needed as this is the default
});

open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file'\n";

my $head = 0;
my $tail = 0 ;

while (my $fields = $csv->getline( $data )) {
$tail = tell $data ;
printf("\"%s\",%d,%d\n",$fields->[$ARGV[1]], $head, $tail-$head);
$head = $tail;
}

This command create an index of the second column (number start with 0).

$ perl csvindex.pl 2009.csv 1 | sort > 2009.cidx

A binary search can be done with the linux ‘look’ command.

$ look \"OK\" 2009.cidx
"OK",2134,59

Here is a simple program to seek into a file and print ‘n’ number of characters.

#!/usr/bin/perl
# csvseek.pl
use strict;
use warnings;

use Text::CSV_XS;

my $file = $ARGV[0];

open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file'\n";

seek $data, $ARGV[1], 0 ;

my $line = "" ;

read $data, $line, $ARGV[2];
chomp($line);

print $line . "\n" ;

Here is how it is used.

$ perl csvseek.pl  2009.csv 2134 59
OKLAHOMA,OK,40,"4,765","1,251","2,160","1,024",6,"9,206",

I’d like to see these tools added to CSVKit.

What I’d really like to see is this index/s added to MySQL’s CSV engine.  Would anyone like to help me get this done?

posted in Code, Commentary by mark

Follow comments via the RSS Feed | Leave a comment | Trackback URL

2 Comments to "Indexed CSV"

  1. Log Buffer #279, A Carnival of the Vanities for DBAs | The Pythian Blog wrote:

    [...] is there no index for CSV files?  Indexes are very simple. Mark Grennan has [...]

  2. Price wrote:

    There is certainly a lot to find out about this issue.

    I really like all of the points you have made.

Leave Your Comment

You must be logged in to post a comment.

 



Powered by Wordpress and MySQL. Theme by Shlomi Noach, openark.org
Creative Commons License
MySQL Fan Boy by Mark Grennan is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.