Where is the Language Data?

December 5, 2014

Imagine, if you will, a geographic database of locations and the language(s) spoken there, with keys into a dictionary of common words that maps each word to the same word in all other languages.   And imagine this database was like a wiki, so it could be crowd sourced.  You could put in your own home and “English spoken here”.

If there were an API for this database, you could do simple application language translations.  Programs would not need to hard-code tables of translations.  The “PUSH” or “OK” button would just have a key to that word.  A web app could do an IP-to-geo lookup and speak your language.  A mobile app could alert you: “French spoken here”.
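As a sketch of what that API might feel like, here is a toy lookup. Every place name, word key and data value below is invented for illustration:

```python
# Crowd-sourced geographic language data, as the post imagines it.
LOCATIONS = {
    "okc":   {"languages": ["en"]},        # "English spoken here"
    "paris": {"languages": ["fr", "en"]},
}

# One key per concept, mapped to that word in every language.
WORDS = {
    "push": {"en": "Push", "fr": "Poussez"},
    "ok":   {"en": "OK",   "fr": "D'accord"},
}

def translate(word_key, place):
    """Label a UI element in the primary language spoken at a place."""
    lang = LOCATIONS[place]["languages"][0]
    return WORDS[word_key][lang]

print(translate("push", "paris"))  # Poussez
```

An application would only ship word keys; the button text would come from the lookup.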

Oh, and this would need to be free, as in beer.

Does this exist?  Why not?





Oklahoma City MySQL Group Forming

July 22, 2014

I’m excited to learn, just one day before the event, that Oklahoma City has formed a MySQL User Group: OKCMySQL.org.

July 23, 2014 will be the inaugural meet-up for the OKC MySQL user group. As a special guest speaker, Peter Zaitsev (CEO of Percona and co-author of High Performance MySQL) will be giving a presentation on “Best Indexing Practices“, followed by a Q/A session.

Time is very short.  RSVP on Meetup.

I hope to see you at The Paramount – 701 West Sheridan, Oklahoma City, OK (map).

MySQL Switch project

October 24, 2013

As an old firewall developer, I’m dreaming of a MySQL Switch (MySwitch).

You can modify an application to change the database destination, but often the application has to be restarted, creating down time. Most servers have many applications connected to them, so to switch to a Slave server everything has to happen together. Some applications start up MUCH slower than others, creating more down time.

The purpose of this project is to take control of applications’ connections to MySQL servers. I propose using Network Address Translation (NAT) as a “MySQL Switch” (MySwitch) to move applications between servers. Simply stated, the application would not connect directly to the database server. It would instead connect to the MySwitch, which would then change the destination address inside the IP packet to direct it to the correct server. Because the source address of the packet does not change, the responding packet would return to the source application. Of course, nothing is this simple.
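As a rough sketch, Linux netfilter can already do this kind of rewrite today. All addresses below are placeholders, not part of the original design:

```shell
# On the MySwitch host: rewrite MySQL packets addressed to the switch
# (10.0.0.10) so they land on the current master (10.0.0.21). Replies
# flow back through the switch because it sits on the routing path.
iptables -t nat -A PREROUTING -p tcp -d 10.0.0.10 --dport 3306 \
         -j DNAT --to-destination 10.0.0.21:3306
```

Switching masters would then be one rule change instead of an application restart.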

Some exceptions: you can’t switch servers in the middle of a conversation. The Slave would have to be made Read/Write before you switch. The slave replication must not be delayed. Sleeping connections should be closed gracefully on the Master. How do you hold off new connections while the switch is made, or can it be made fast enough? On the plus side, no DNS changes have to be made and no IPs have to be moved between servers.

I’m thinking of a Linux project written in C, maybe with a NodeJS web interface. It would document and control the Master / Slave relationships. Perhaps you could point the service at a Master or a leaf Slave and it would build the configuration. At the least, you could document the cluster configurations in a JSON file. Maybe the Slave could be configured as the Master, and the current state at the time of the switch could be documented, so turning the old Master into a slave could be eased.

Here is a simple outline of what needs to happen.

  • A client of a client cluster (…) connects to MySwitch (…).
  • MySwitch changes the destination IP (…) to the correct MySQL server (…).
    (MySQL requests are made and returned to the client.)
  • The admin makes a request to switch to MySQL server (…).
  • MySwitch checks that the slave (…) is less than ? seconds delayed.
  • MySwitch starts queuing new connection requests and continues processing all other packet types.
  • MySwitch checks for connections from the client cluster on the Master (…).
  • MySwitch …
  • MySwitch closes sleeping connections opened by client cluster members.
  • MySwitch double-checks the slave delay.
  • MySwitch changes the slave to read/write and resets the slave connection.
  • MySwitch processes all the queued connections to the new MySQL server (…).
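The outline above can be sketched as straight-line control logic. Here `db` and every method on it are hypothetical stand-ins for the packet-rewriting and MySQL calls MySwitch would make:

```python
def perform_switch(db, max_delay=5):
    """Sketch of the switch-over sequence outlined above.
    `db` is a hypothetical control object; all method names are assumptions."""
    if db.slave_delay() > max_delay:
        raise RuntimeError("slave too far behind; aborting switch")
    db.queue_new_connections()      # hold new connects; pass all other packets
    db.kill_sleeping_connections()  # gracefully close idle sessions on the Master
    if db.slave_delay() > 0:        # double-check the delay before promoting
        raise RuntimeError("slave fell behind during the drain")
    db.promote_slave()              # make the slave read/write, stop replication
    db.redirect_to_slave()          # rewrite future packets to the new master
    db.release_queued_connections()
```

The ordering matters: new connections are held before idle ones are killed, and the delay is checked twice.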

MySQL commands needed


On the slave, run SHOW SLAVE STATUS\G and check that:
Master_Host: (Master = correct master)
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
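A small parser for the \G vertical output can automate this check. The field names are the real SHOW SLAVE STATUS fields; the sample text and values are made up:

```python
def parse_slave_status(text):
    """Turn `SHOW SLAVE STATUS\\G` vertical output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status

def slave_ready(status, master_host):
    """True when the slave replicates the right master with no delay."""
    return (status.get("Master_Host") == master_host
            and status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes"
            and status.get("Seconds_Behind_Master") == "0")

# Made-up sample output for illustration.
SAMPLE = """\
               Master_Host: 10.0.0.21
          Slave_IO_Running: Yes
         Slave_SQL_Running: Yes
     Seconds_Behind_Master: 0"""

print(slave_ready(parse_slave_status(SAMPLE), "10.0.0.21"))  # True
```

MySwitch could run this check just before and just after draining connections.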

SELECT ID FROM information_schema.processlist WHERE host LIKE '67.67%';

This is used to check connections from the “Application Cluster” (67.67) on the Master.

SELECT concat('KILL ',id,';') FROM information_schema.processlist WHERE host LIKE '67.67%' AND COMMAND = 'Sleep' INTO OUTFILE '/tmp/a.txt';
SOURCE /tmp/a.txt;

This is used to kill all the sleeping connections from the “Application Cluster” on the Master.


These are used to turn off replication on the Slave;
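The actual commands were lost from this post; the standard sequence on the Slave (MySQL 5.x syntax, my reconstruction rather than the original) would be along these lines:

```sql
STOP SLAVE;                  -- stop the replication threads
RESET SLAVE ALL;             -- forget the old master entirely
SET GLOBAL read_only = OFF;  -- open the promoted slave for writes
```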






Related references


F5 – BigIP
http://support.f5.com/kb/en-us/products/big-ip_ltm/manuals/product/ltm_configuration_guide_10_1/ltm_appendixa_monitor_types.html (Search for MySQL)

IP Queuing


You upgraded to > 5.1 but did you turn on performance_schema?

July 26, 2013

If you’ve upgraded from MySQL 5.1 to a newer version (5.5?  5.6? 5.7?), Mark Leith shows you why you should turn on performance_schema.  Roland Bouman gave us Common_Schema to help improve the SQL user experience.  Mark now lets us look inward at the IO latency to improve the system.

Mark has developed a great set of views and procedures to work with performance_schema that can give you things like:

  • A list of all statements whose average runtime, in microseconds, is in the top 5% slowest
  • Statements that have done full table scans
  • Indexes that have had no events against them (and hence, no usage)
  • The top disk IO consumers by bytes used, by file

You’ll need to start MySQL with the --performance-schema command line option or add performance_schema=on to your my.cnf and restart the server.  Then download the latest 5.5 or 5.6 code from Mark’s page and import it.

 mysql < ps_helper_56.sql

Now you’re ready to learn things like:

mysql> SELECT * FROM statements_with_errors_or_warnings;
| query | exec_count | errors | error_pct | warnings | warning_pct | digest |
| CREATE PROCEDURE currently_ena ... w_instruments BOOLEAN DEFAULT | 2 | 2 | 100.0000 | 0 | 0.0000 | ad6024cfc2db562ae268b25e65ef27c0 |
| CREATE PROCEDURE currently_ena ... ents WHERE enabled = ? ; END | 2 | 1 | 50.0000 | 0 | 0.0000 | 4aac3ab9521a432ff03313a69cfcc58f |
| CREATE PROCEDURE currently_enabled ( BOOLEAN show_instruments | 1 | 1 | 100.0000 | 0 | 0.0000 | c6df6711da3d1a26bc136dc8b354f6eb |
| CREATE PROCEDURE disable_backg ... d = ? WHERE TYPE = ? ; END IF | 1 | 1 | 100.0000 | 0 | 0.0000 | 12e0392402780424c736c9555bcc9703 |
| DROP PROCEDURE IF EXISTS currently_enabled | 12 | 0 | 0.0000 | 6 | 50.0000 | 44cc7e655d08f430e0dd8f3110ed816c |
| DROP PROCEDURE IF EXISTS disable_background_threads | 3 | 0 | 0.0000 | 2 | 66.6667 | 0153b7158dae80672bda6181c73f172c |
| CREATE SCHEMA IF NOT EXISTS ps_helper | 2 | 0 | 0.0000 | 1 | 50.0000 | a12cabd32d1507c758c71478075f5290 |









Percona Live – MySQL Conference

April 25, 2013

This is my takeaway from this year’s MySQL conference in San Francisco.

The community believes Oracle has been a good steward of MySQL and delivered some great code in 5.6. Yet studies from 451 Research show MySQL is on the decline, and a lot of this has to do with the ownership by Oracle and the growth of NoSQL.

Big data, Cloud, Hadoop and MongoDB are still on the lips of the “JSON Generation”.  Marty Weiner and Yashh Nelapati (Pinterest) will tell you to start in the Cloud (Amazon) and grow until you can’t stand it. Chris Schneider (Ning.com) will tell you there are lots of tools to make Hadoop easy to integrate.

As a DBA and System Administrator, I see that computer systems are shrinking and CPU densities have reached the point that racking and installing is dead.  Linus Torvalds said, “If we are still using the same operating systems and using computers in the same way in 50 or 100 years from now then we have done something wrong“.  In only five years I think we are starting to see this.  We still have the LAMP stack, but it’s installed for us by scripts on computers we’ve never touched.

There was about 15% growth in attendance this year and maybe a 10% decrease in vendor participation.   This week TokuDB v7 went open source. Still, I feel the community is suffering. Some faces you always expect to see were not there, like Sheeri, who was gone “… because I opted to stay home after a crazy year of travel…” and may not be back “until there is an acceptable Code of Conduct“.

Everyone had fun at the SkySQL booth having their likeness drawn by Doug Shannon of EventToons, and Julian Cash returned this year to take everyone’s photo.

I learned a lot and had fun.  I’d like to thank ScaleArc for sponsoring my trip this year.  I’ll be posting more about ScaleArc soon.


Ten years of MySQL developer conference

April 23, 2013


This is the tenth year for the MySQL developer conference and my third.

I’ll be posting some news about iDB From ScaleArc. Oracle is back this year. Everyone is fighting with developers wanting to move away from MySQL. And there is lots of talk about 5.6 and working in virtual environments.



Making a MySQL slave with ZERO downtime

April 10, 2013

It’s about time for this year’s MySQL (Percona) conference in San Francisco.  I haven’t written anything in some time.  So, I thought I’d write something we can talk about should you meet me.

You need to make a slave of your MySQL server.

What if your bonus was tied to database up time, or maybe the systems group will not grant you any down time?  Your ONE Master DB server is so big, copying tables takes hours. Even an ‘scp’ sync takes too long.  99.999% uptime is hard to achieve.

Percona and their tools are your friend… mostly. Xtrabackup will create a ready to use copy of all your databases, document your my.cnf and your current Master Position…  unless it doesn’t.

Xtrabackup uses its own XtraDB engine to copy the InnoDB tables so it doesn’t block master transactions. (Very cool.) But that just copies the InnoDB tables.  All the other files/tables are just copied (cp-ed).

What if, your local disk space is more than 50% full?  You will need to copy it to a different system.  Maybe that space is on a NFS server and maybe that service is mounted with automount.

Now you learn ‘cp’ is not reliable.  (Timeouts? SAN problems? I’m a DBA, not a sysadmin, and have you read the Bastard Operator From Hell?)

You learn Xtrabackup can use ‘scp’.  Now all will go…  better.  Now it only fails about one out of six times.  You learn a week later, when the developers tell you they are getting odd results half the time.

Odd thing, if you start a slave with all the Innodb data intact and all database (directories) but not all the tables, the replication process will not complain.  It will just ignore the missing tables like they were “black holes”.

So you add a backup check to your backup script (see below, the part under “# Check for missing files”). You use your “fixed” backup script to create a slave. It only takes six hours for the slave to catch up with the master.  All goes well…  for about a week.


export HOST="db01"
export BKDIR="/mnt/archive/xtrabackup"

if [ ! -d $BKDIR/$HOST ]; then
    /bin/mkdir $BKDIR/$HOST
fi

# Archive old backup
export DT=`/bin/date +20%y-%m-%d_%H-%M-%S`
cd $BKDIR/$HOST/`/bin/date +20%y`*
tar zcf $BKDIR/$HOST/$HOST-$DT.tgz ./*

# Remove old backups
/bin/find $BKDIR/$HOST/$HOST-* -mtime +1 -exec rm {} -Rf \;

# Remove last backup
rm -Rf $BKDIR/$HOST/`/bin/date +20%y`*

# Create backup
export DT=`/bin/date +20%y-%m-%d_%H-%M-%S`
/usr/bin/innobackupex --rsync $BKDIR/$HOST

# Check for missing files
export d=`ls $BKDIR/$HOST | grep ^20`
cd /data/mysql
for f in `find . -regex ".*\.\(frm\|MYD\|MYI\|MAD\|MAI\|MRG\|TRG\|TRN\|ARM\|ARZ\|CSM\|CSV\|opt\|par\)" -printf "%P\n"`; do
    if [ ! -f $BKDIR/$HOST/$d/$f ]; then
        echo copying $f
        rsync $f $BKDIR/$HOST/$d/$f
    fi
done
/usr/bin/xtrabackup --prepare --target-dir=$BKDIR/$HOST/$d
echo DONE!


You check and…  all the InnoDB tables are there.  Only a few “less” critical MyISAM tables are missing.  You take care of the missing tables with mysqldump. The app guys check again for missing data, and NO JOY.   You forgot to restart the MySQL slave after you dumped/copied the tables, so the replication engine doesn’t know about them.


Again Percona to the rescue. You use pt-table-checksum to verify all the tables are in sync with the master. The DIFFS column shows all 0 (zeros).  All is good… or not.  Again you find out from the developers that “some data is missing”.  (They don’t tell you the database or table, of course.)

Time for some “old school” testing.

$ for x in `mysql -h db01 -e "show databases"`; do mysql -h db01 $x -e "show tables"; done > bar
$ for x in `mysql -e "show databases"`; do mysql $x -e "show tables"; done > foo
$ diff foo bar
> forecast_20130318_1705
> forecast_20130318_2002
> forecast_gfs_new


Sure enough, you turn up missing and added tables, because your copy wasn’t quite a “point in time”.  The developers like to do table renames in their “load data from file” processes.  The MyISAM data was changing between the time you started your backup and the time you created your slave.

So a few mysqldumps, table drops and restarts, and you’ve got it.

Do you have weeks like this?



In place replacement / upgrade – Redhat and Centos

November 13, 2012

For many years I have recommended startup companies install and work with CentOS until they need, and can afford, RedHat Enterprise.  The idea: because the two are binary compatible, the applications the company uses and develops will migrate between the two without changes.  I have even said you can do an in-place migration between them.

I’ve been challenged to prove an in-place migration is possible.  These are the results.

There are a couple of packages that define a CentOS/Redhat system.

  • The centos/redhat-release package

This package holds the release files, product GPG keys and repository config files.

  • redhat-logos

This package holds the proprietary artwork owned by the vendor.

You might have packages installed that were not provided by your vendor.  You can list the vendor names of all installed packages with this:

rpm -qa --qf "%{VENDOR} \n" | sort | uniq

Dell Inc

In this example, “Dell Inc” packages are also installed. You can list the packages installed from a single vendor with this command:

rpm -qa --qf "%{NAME} %{VENDOR} \n" | grep "Dell Inc" | cut -d ' ' -f 1 | sort


Migrating CentOS to Redhat

You’re starting with an existing CentOS system.  I’m using a CentOS 6.2 x86_64 minimum install not updated.  I’m migrating to Redhat Enterprise Server 6.3 x86_64.

Why a minimum install?  Because this is what I recommend for application servers like MySQL.

This process is simplest if you are moving to the next minor release. A migration to the same release (6.2 CentOS to 6.2 RHEL) is possible but will require you to manually replace more packages.  I cannot recommend a migration to the next major release.  Although a migration from 5.8 to 6.0 might work, configuration files and sometimes whole subsystems are replaced.

To do an update you need a repository to update from.  You can use a local Redhat repository if you have one, or you can use a Redhat CD.

mkdir -p /mnt/cdrom
mount /dev/cdrom /mnt/cdrom

mv /etc/yum.repos.d/CentOS* /root
vi /etc/yum.repos.d/rhel.repo

Type this into the file.

name=redhat enterprise linux
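Only the name= line of the repo file survived in this post. A complete definition for a DVD mounted at /mnt/cdrom would look something like this (the section name, baseurl and gpgcheck setting are my assumptions, not recovered from the original):

```ini
[rhel-cd]
name=redhat enterprise linux
baseurl=file:///mnt/cdrom/
enabled=1
gpgcheck=0
```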

You need to remove just a few files if they exist.

yum clean all
rpm -e --nodeps centos-release
rpm -e --nodeps centos-release-cr
rpm -e --nodeps centos-indexhtml
rpm -e --nodeps redhat-logos
rpm -e --nodeps yum-plugin-fastestmirror
rpm -e --nodeps yu

Now you can install them back from the CD repository.

yum -y install redhat-release redhat-logos yum-rhn-plugin

Then update to the latest release.

yum update

Now you need to register/entitle  the server with Redhat.

rm /etc/yum.repos.d/rhel.repo


To purge your system of the last of the CentOS packages (now that you are registered) do:

yum reinstall $(rpm -qa --qf "%{NAME} %{VENDOR} \n" | grep "CentOS" | cut -d ' ' -f 1 | sort)

You may still have a few packages left.  The kernel doesn’t get replaced unless you force it, and some packages are specific to CentOS.  The boot loader, for example.  (Do you care?)  In my test I still had:


Now you have upgraded to the next release, but you still need to pull bug fixes from the Redhat repository.

yum update

With all updates, you should go through your system and look for *.rpm.save and *.rpm.new files.  These are created because of changes made to configuration files on your system.  If the OS thinks it can’t safely replace your .conf file with its own, you get a .rpm.new.  If the OS knows it has to replace your .conf file for the application to work, you’ll get a .rpm.save.  Either way, you should review the differences and correct what’s needed.

find / -name '*.rpm.save'
find / -name '*.rpm.new'


Full install Test

To really test this process I did a FULL install with 1,858 packages.  I installed CentOS 6.2 and selected every package I could. I didn’t select virtualization because I was running the test in VMware. I also didn’t select the mainframe option. (Anyone want to lend me a mainframe?)

The full install went almost as simply as the minimum install, except I had to uninstall firefox before I did the ‘yum update’.

rpm -e --nodeps firefox

Of the 1,858 packages (not including dependencies), 1,077 were updated by the ‘yum update’ command. This left 1,325 packages that were the same release as CentOS.  You could leave them in place.  I found the system worked fine.  I tested many of the stock applications in X-windows without issues.

The purge of CentOS files with “yum reinstall” also went well.  It found another 1,229 packages, leaving 97 packages still manufactured by CentOS.  Most of the remaining packages had Redhat equivalents, but the package names didn’t match.  For example:

CentOS Redhat
at-spi.x86_64 at-spi-1.28.1-2.el6.x86_64
at-spi-python.x86_64 at-spi-python-1.28.1.-2.el6.x86_64
redhat-bookmarks redhat-bookmarks-6.1.el6
redhat-lsb-graphics redhat-lsb-graphics-4.0-3.el6
system-config-date system-config-date-1.9.60-1.el6

You can go through by hand and replace these if you just have to get rid of CentOS 100%, or you’re crazy.  Again, the system works just fine without replacing these files.

Some packages do not have a replacement on the CD. Most are like OpenOffice: if you need such a thing, you might have installed the CentOS version anyway.  Maybe these are available in the EPEL repo?

autocorr-en ?
ekiga ?
farsight2 ?
gssdp ?
gupng ?
hyphen ?
hyphen-en ?
iwl6000g2b-firmware ?
libnice ?
libpurple ?
libtextcat ?
lpsolve ?
mythes-en ?
opal ?
openchange ?
openoffice ?
pidgin ?
ptlib ?
samba4-libs ?


Maybe for my next trick I’ll build a list of files from a CentOS to Redhat install and a straight from CD Redhat install and compare them to see if there are any crumbs left over.


In place upgrade – Redhat to Centos

These instructions are from the CentOS Wiki – http://wiki.centos.org/HowTos/MigrationGuide

To migrate from RHEL6 to CentOS 6 fetch the latest versions of the following packages for your architecture and put them in a temporary directory. Rather than fetching packages with wget, as shown below, the CentOS-6 installation media can be mounted and packages installed from it.
For CentOS 6.2 x86_64

yum remove rhnlib abrt-plugin-bugzilla redhat-release-notes*
rpm -e --nodeps redhat-release redhat-indexhtml
cd /tmp
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/centos-release-6-2.el6.centos.7.x86_64.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/centos-indexhtml-6-1.el6.centos.noarch.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/yum-3.2.29-22.el6.centos.noarch.rpm
wget http://mirror.centos.org/centos/6/os/x86_64/Packages/yum-plugin-fastestmirror-1.1.30-10.el6.noarch.rpm
rpm -Uvh *.rpm
cd ..
yum clean all
yum upgrade

One might want to download and add a few more look-and-feel items, or load them from the CentOS DVD:

rpm -Uvh redhat-bookmarks-6-1.el6.centos.noarch.rpm redhat-logos-60.0.14-10.el6.centos.noarch.rpm \
redhat-menus-14.0.0-3.el6.noarch.rpm redhat-rpm-config-9.0.3-34.el6.noarch.rpm

With all updates, you should go through your system and look for *.rpm.save and *.rpm.new files.  These are created because of changes made to configuration files on your system.  If the OS thinks it can’t safely replace your .conf file with its own, you get a .rpm.new.  If the OS knows it has to replace your .conf file for the application to work, you’ll get a .rpm.save.  Either way, you should review the differences and correct what’s needed.

find / -name '*.rpm.save'
find / -name '*.rpm.new'







ScaleArc’s iDB Reviewed

August 28, 2012

Stop coding and start scaling!

At the April 2012 MySQL conference in San Francisco, I discovered iDB by ScaleArc and was excited by what it can do. After three months of testing and working with ScaleArc, I purchased it.

What I do

The company I work for (WDT Inc.) supplies weather data. From a database point of view, we take in data feeds from many sources with “Ingest” applications. All are processed (forecast applications) and provided to many “Out flow” applications. Current conditions are processed into forecast data; RADAR images, lightning and alerts are mapped into images to be used by governments and individual customers (iPhones). So, we have many applications working with many databases, written in many languages.

Coding applications to split reads and writes for replication and sharding has been the answer for scaling database access for years. Many companies are working on ways to improve replication. The problem with these approaches is they all require code changes. Having been a firewall developer (ipchains), I’ve always thought a proxy was the answer. Think F5-BigIP for databases.

Applications written to work with one MySQL server can’t scale without re-coding.  If read/write splitting was all iDB did, I’d be happy, but iDB also monitors the transactions, creates metrics and caches queries.


ScaleArc provided me with a VM version and two 30-day licenses. I installed iDB on two VMware servers along with several types of web applications (some staging for production). The primary application was an in-house Java application to process weather alerts. Other applications included WordPress, MediaWiki, Cacti and ThinkUp.

I also ran sysbench and hammerora against iDB. I used two and three MySQL servers, with master / master / slave replication between them.

Most test “servers” were VMware virtual machines running CentOS 6.0 x64. Some tests were run on production-class MySQL servers while iDB ran in a VM.


iDB is designed as an appliance. You can purchase it as hardware, software or a VM. You can have ScaleArc pre-configure iDB for your network. I skipped this, so I needed to use a live Linux CD to pop the root password and change the network configuration.

iDB uses a web interface. The setup process is very intuitive. My installation went well until I tried to install the license key: iDB would not take the key. I turned the problem over to ScaleArc support to see how they would respond. I also began to dig into the code, and in about two hours solved the problem. I found the startup script was calling the wrong program name.

Within three days, ScaleArc sent a system engineer, in house, to help with my problem and complete the installation. (I already had it working, but I let ScaleArc do it their way to test their support.) The engineer upgraded to a newer version of the iDB server and installed the license key.


As a part of digging into ScaleArc’s code, I checked iDB’s security. iDB has two methods to authenticate a user: pass-through and off-load. The default, off-load, authenticates the client/application/user itself, meaning iDB connects to the database server with a password you give it for that database user. This allows iDB to answer requests from cached queries without connecting to MySQL.

Pass-through, as the name implies, just passes the authentication protocol from the client to the server. In this mode iDB cannot cache the queries.

Load Balancing

iDB lets you setup “Clusters” of servers. You can designate each database server as either Read/Write or Read only.
You can define a cluster as either Round Robin or Dynamic load balancing.


iDB analyzes the traffic passing through it and summarizes it to create analytic data. You can then create expressions to cache queries, for example:

SELECT type FROM `db`\.`settings` WHERE name\=.*

iDB plots the summarized queries and makes it easy to set up caching rules.


To load balance, you need to supply iDB with the ID and password of every database user. I’m sure you keep these safe somewhere? What about that one, very old, production program that guy (who left two years ago) wrote in Fortran?

MySQL stores a sha1 of each user’s password. After some study I found a way for a proxy to recover the sha1 of the password used by the client. I wrote this up and presented it to ScaleArc. I call it “Transparent Authentication”. ScaleArc has agreed to build this into iDB. I hope to test this soon.

The console doesn’t always report the current settings. I turned on the Access Control List for the cluster and set access to the local LAN only. This creates iptables settings in the underlying Linux server. After I changed the access, the console didn’t show the iptables settings that were left behind. I had to turn off the ACL altogether to remove them.

Security Issues

I found a number of security problems with ScaleArc. Not all the web pages were protected by iDB security code. I found many test and support scripts in the GUI application that leaked information. I also found a rather scary program that leaked system files. I reported all of them to ScaleArc.  These bugs have been fixed.

Some Application Issues

I found some applications, like WordPress, report “database issues” if they can’t write to the database right away. It seems the first thing these applications do is log the connection. They do not continue to show the page even though the database can be read.

Hammerora Test

All OLTP tests ran without a flaw. What more can I say?

SysBench test

This test was to measure the latency added by iDB. SysBench, iDB and MySQL all ran on different VMs, all on the same hardware. The first test was run direct to the MySQL server and the second through iDB. The MySQL server was not load balanced.

Transactions per second

Average transaction time in milliseconds


ScaleArc’s iDB works! Unlike MySQL Proxy, I could not make iDB fail. Applications had no idea they were going through a proxy. Queries were directed to read or write servers as needed. Replication was not affected.

Although the heart of iDB is solid, I did have some trouble with the user interface. I believe iDB may have started life as a command line program. The web UI is good but needs work.

ScaleArc support is 100%: they are very helpful and willing to listen to their customers. I’ve found their support has continued to be great after purchase.

When ScaleArc completes the “Transparent Authentication”, iDB will truly be a drop-in scaling solution.



Indexed CSV

July 26, 2012

Why is there no index for CSV files?  Indexes are very simple.

If the first column of your CSV file is in sorted order you can do a binary search to find your data.  But what if you need to find data in the second or third column?

If you have a separate index file pointing to the first byte of each line, you could seek to that position in the CSV file and get your data.  Given an index file containing only your needed column plus the byte offset and length of each line, you can binary search the index to find the pointer to the position in the CSV file.



Here is a simple perl program to create just such an index.

# csvindex.pl

use strict;
use warnings;

use Text::CSV_XS;

my $file = $ARGV[0];

my $csv = Text::CSV_XS->new ({
    binary    => 1,
    auto_diag => 1,
    sep_char  => ','    # not really needed as this is the default
});

open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file'\n";

my $head = 0;
my $tail = 0;

while (my $fields = $csv->getline( $data )) {
    $tail = tell $data;
    printf("\"%s\",%d,%d\n", $fields->[$ARGV[1]], $head, $tail - $head);
    $head = $tail;
}
This command creates an index of the second column (numbering starts at 0).

$ perl csvindex.pl 2009.csv 1 | sort > 2009.cidx

A binary search can be done with the Linux ‘look’ command.

$ look \"OK\" 2009.cidx

Here is a simple program to seek into a file and print ‘n’ number of characters.

# csvseek.pl
use strict;
use warnings;

use Text::CSV_XS;

my $file = $ARGV[0];

open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file'\n";

seek $data, $ARGV[1], 0 ;

my $line = "" ;

read $data, $line, $ARGV[2];

print $line . "\n" ;

Here is how it is used.

$ perl csvseek.pl  2009.csv 2134 59
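The whole index-then-seek round trip is easy to prototype. Here is a minimal Python sketch of the same idea; the data is made up, and a real implementation would need byte offsets rather than the character offsets used here:

```python
import bisect
import csv
import io

# Toy CSV data standing in for 2009.csv (contents invented).
data = "OK,Tornado,2009\nTX,Hail,2009\nKS,Wind,2009\n"

# Build an index on column 0: (key, offset, line length), then sort it.
index, offset = [], 0
for line in io.StringIO(data):
    key = next(csv.reader([line]))[0]
    index.append((key, offset, len(line)))
    offset += len(line)
index.sort()

# Binary search the index (what `look` does), then seek straight to the row.
keys = [entry[0] for entry in index]
pos = bisect.bisect_left(keys, "OK")
_, off, length = index[pos]
f = io.StringIO(data)
f.seek(off)
print(f.read(length).rstrip())  # OK,Tornado,2009
```

The CSV file itself never needs to be sorted; only the small index does.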

I’d like to see these tools added to CSVKit.

What I’d really like to see is this kind of index added to MySQL’s CSV engine.  Would anyone like to help me get this done?


Powered by Wordpress and MySQL. Theme by Shlomi Noach, openark.org
Creative Commons License
MySQL Fan Boy by Mark Grennan is licensed under a Creative Commons Attribution-Share Alike 3.0 United States License.